Information Theory for Deep Learning: Entropy, Cross-Entropy, and KL Divergence

How entropy, cross-entropy, and KL divergence work, and why cross-entropy loss is the mathematically correct choice for classification.

Cross-entropy loss is used to train the vast majority of classification models in deep learning, yet it is rarely explained why it is the right choice rather than just one option in a framework’s list of loss functions. The answer comes from information theory, a field originally developed to measure how much information a message carries. Understanding it is what turns “cross-entropy is the standard loss for classification” from a memorized fact into something you can reason about.


Entropy: How Much Uncertainty a Distribution Contains

Entropy measures the average amount of “surprise” or uncertainty in a probability distribution. A distribution where one outcome is nearly certain has low entropy; a distribution where every outcome is equally likely has high entropy.

import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p + 1e-10))  # small epsilon avoids log(0)

certain = np.array([0.99, 0.01])
uncertain = np.array([0.5, 0.5])
print(entropy(certain))    # ~0.056 -- low entropy, low uncertainty
print(entropy(uncertain))  # ~0.693 -- higher entropy, maximum uncertainty for 2 outcomes

A perfectly confident, always-correct model would output distributions with entropy near zero. A model that’s essentially guessing produces high-entropy outputs — entropy is a direct, quantifiable measure of how uncertain a model’s predictions actually are, independent of whether those predictions are correct.
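The two extremes described above can be checked numerically. The sketch below assumes natural logarithms, as in the earlier snippet, and uses a hypothetical helper named `entropy_nats`. It illustrates that a uniform distribution over n outcomes has entropy log(n), that a fully certain distribution has zero entropy, and that dividing by log(2) converts nats to bits.

```python
import numpy as np

def entropy_nats(p):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]  # drop zeros instead of adding an epsilon
    return float(-np.sum(nonzero * np.log(nonzero)))

uniform_4 = np.full(4, 0.25)                 # four equally likely outcomes
one_hot = np.array([1.0, 0.0, 0.0, 0.0])     # one outcome is certain

h_uniform = entropy_nats(uniform_4)
h_one_hot = entropy_nats(one_hot)

print(h_uniform, np.log(4))     # both are log(4), the maximum for 4 outcomes
print(h_one_hot)                # zero entropy: a certain outcome carries no surprise
print(h_uniform / np.log(2))    # the same entropy in bits, about 2
```

Dropping zero-probability entries is one alternative to the epsilon used elsewhere in this article; both avoid evaluating log(0).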


Cross-Entropy: Comparing a Predicted Distribution to the True One

Cross-entropy measures how well a predicted probability distribution matches a true (target) distribution. Specifically, it is the average number of bits (or nats, when natural logarithms are used, as in the code here) needed to encode outcomes drawn from the true distribution using a code optimized for the predicted distribution.

def cross_entropy(true_dist, predicted_dist):
    return -np.sum(true_dist * np.log(predicted_dist + 1e-10))

true_label = np.array([1, 0, 0])       # one-hot: the correct class is index 0
predicted = np.array([0.7, 0.2, 0.1])  # model's predicted probabilities
loss = cross_entropy(true_label, predicted)
print(loss)  # -log(0.7) ≈ 0.357

Because the true distribution is one-hot (a single class has probability 1, everything else 0), the formula simplifies dramatically: cross-entropy loss for classification reduces to just -log(predicted probability of the correct class). This is the quantity frameworks compute when you call CrossEntropyLoss or categorical_crossentropy, although the expected input differs: PyTorch’s CrossEntropyLoss takes raw logits and applies log-softmax internally, while Keras’s categorical_crossentropy takes probabilities unless from_logits=True is set. It is also why a confident wrong prediction is penalized so much more heavily than an unconfident wrong one: -log(0.01) ≈ 4.61 is a far larger loss than -log(0.4) ≈ 0.92.
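To make the penalty asymmetry concrete, and to show how the same loss can be computed directly from logits, here is a minimal NumPy sketch. The helper `cross_entropy_from_logits` is a hypothetical name for illustration, not any framework's API; it uses the log-sum-exp shift, a common way to avoid overflow when exponentiating logits.

```python
import numpy as np

def cross_entropy_from_logits(logits, correct_index):
    """-log(softmax(logits)[correct_index]), computed via a shifted log-sum-exp."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits)  # shifting by a constant leaves softmax unchanged
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[correct_index])

# Loss as a function of the probability assigned to the correct class
for p_correct in (0.9, 0.4, 0.01):
    print(p_correct, -np.log(p_correct))  # loss grows sharply as p_correct nears 0

# Logits chosen so that softmax gives [0.7, 0.2, 0.1], matching the earlier example
logits = np.log(np.array([0.7, 0.2, 0.1]))
loss = cross_entropy_from_logits(logits, correct_index=0)
print(loss)  # agrees with -log(0.7)
```

Working in log space means the probability vector is never materialized, which is the usual reason frameworks offer a logits-based variant of this loss.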


Why Cross-Entropy Is the Mathematically Correct Loss for Classification

Cross-entropy isn’t an arbitrary choice — minimizing cross-entropy loss is mathematically equivalent to maximizing the likelihood of the true labels under the model’s predicted distribution (maximum likelihood estimation). This connects the loss function directly back to the probability theory covered in Probability Fundamentals — training a classifier with cross-entropy loss is, formally, finding the weights that make the observed training labels as probable as possible under the model.
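The equivalence can be written out in a few lines. The notation is an assumption made for illustration: $N$ independent training examples $(x_n, y_n)$, $K$ classes, model probabilities $q_\theta(k \mid x_n)$, and one-hot targets $p_n$ with $p_n(k) = 1$ if $k = y_n$ and $0$ otherwise.

```latex
\begin{aligned}
\hat{\theta}_{\mathrm{MLE}}
  &= \arg\max_{\theta} \prod_{n=1}^{N} q_\theta(y_n \mid x_n) \\
  &= \arg\max_{\theta} \sum_{n=1}^{N} \log q_\theta(y_n \mid x_n) \\
  &= \arg\min_{\theta} \sum_{n=1}^{N}
     \Bigl( -\sum_{k=1}^{K} p_n(k) \, \log q_\theta(k \mid x_n) \Bigr)
\end{aligned}
```

The second line follows because the logarithm is monotonic, and the third because the one-hot target picks out only the $k = y_n$ term. The bracketed quantity is the cross-entropy between $p_n$ and the model's prediction for example $n$, so the weights that minimize total cross-entropy are the maximum likelihood weights.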


KL Divergence: Measuring the Gap Between Two Distributions

Kullback-Leibler (KL) divergence measures how different one probability distribution is from another — specifically, how much extra information is needed to describe the true distribution using a code built for the predicted distribution, beyond the true distribution’s own entropy.

def kl_divergence(true_dist, predicted_dist):
    return np.sum(true_dist * np.log((true_dist + 1e-10) / (predicted_dist + 1e-10)))

The relationship between the three quantities is direct: Cross-Entropy = Entropy(true distribution) + KL Divergence(true || predicted). Since the entropy of the true distribution is a fixed constant (it doesn’t depend on the model), minimizing cross-entropy and minimizing KL divergence are effectively the same optimization — which is why you’ll see both terms used somewhat interchangeably in different papers and contexts.
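The identity is easy to verify numerically. The sketch below redefines the three functions so it runs on its own, and uses a hypothetical soft target and prediction chosen only for illustration.

```python
import numpy as np

EPS = 1e-10  # the same small epsilon used in the snippets above

def entropy(p):
    return -np.sum(p * np.log(p + EPS))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q + EPS))

def kl_divergence(p, q):
    return np.sum(p * np.log((p + EPS) / (q + EPS)))

# Hypothetical soft target and prediction (illustrative values)
true_dist = np.array([0.6, 0.3, 0.1])
predicted = np.array([0.5, 0.25, 0.25])

lhs = cross_entropy(true_dist, predicted)
rhs = entropy(true_dist) + kl_divergence(true_dist, predicted)
print(lhs, rhs)  # the two values agree up to floating-point error

# With a one-hot target the entropy term vanishes, so cross-entropy equals KL divergence
one_hot = np.array([1.0, 0.0, 0.0])
ce_one_hot = cross_entropy(one_hot, predicted)
kl_one_hot = kl_divergence(one_hot, predicted)
print(ce_one_hot, kl_one_hot)
```

The one-hot case is the usual classification setting: the target's entropy is zero, so the cross-entropy loss is itself the KL divergence from the label distribution to the prediction.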


Where KL Divergence Shows Up Beyond Basic Classification

KL divergence is central to training Variational Autoencoders, covered later in Generative Models, where it’s used to measure how far a learned latent distribution is from a desired prior distribution (typically a standard Gaussian). It also underlies knowledge distillation — training a smaller “student” model to match a larger “teacher” model’s output distribution, where KL divergence between student and teacher predictions is directly minimized as part of the training objective.

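# Conceptual knowledge distillation loss (logits and temperature are illustrative values)
def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()
student_logits, teacher_logits, temperature = np.array([2.0, 1.0, 0.1]), np.array([3.0, 1.0, 0.2]), 2.0
student_probs = softmax(student_logits / temperature)
teacher_probs = softmax(teacher_logits / temperature)
distillation_loss = kl_divergence(teacher_probs, student_probs)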
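For the VAE case mentioned above, the KL term has a well-known closed form when the learned latent distribution is a diagonal Gaussian and the prior is a standard Gaussian. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension; the function name, the argument names `mu` and `log_var`, and the sample values are all illustrative.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# A latent distribution that already equals the prior has zero divergence
kl_at_prior = gaussian_kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0])

# Moving the mean or shrinking the variance increases the divergence
kl_shifted = gaussian_kl_to_standard_normal(mu=[1.0, -0.5], log_var=[-1.0, 0.0])

print(kl_at_prior)  # 0.0
print(kl_shifted)   # strictly positive
```

Parameterizing by log-variance rather than variance is a common choice because it keeps the variance positive without constraining the encoder's raw output.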

Perplexity: Entropy’s Practical Cousin in Language Modeling

A closely related metric worth knowing specifically for language models, covered in Large Language Models, is perplexity — defined as the exponential of the cross-entropy loss, and interpretable as roughly “how many equally-likely choices was the model effectively choosing between at each step.” A perplexity of 1 means the model was always completely certain (and correct); a perplexity equal to the vocabulary size means the model was effectively guessing uniformly at random.

import numpy as np
cross_entropy_loss = 2.3 # nats, from a language modeling task
perplexity = np.exp(cross_entropy_loss)
print(perplexity) # ~9.97 -- roughly as uncertain as choosing between 10 equally likely options

Perplexity is widely reported when comparing language models specifically because it’s more intuitively interpretable than a raw cross-entropy loss value, even though the two measure exactly the same underlying quantity.
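The two boundary cases described above can be reproduced from per-token probabilities. In this sketch, `perplexity` is a hypothetical helper that takes the probability the model assigned to each observed token, and the vocabulary size is an arbitrary illustrative number.

```python
import numpy as np

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

vocab_size = 50_000  # hypothetical vocabulary size

# A model guessing uniformly gives every observed token probability 1/vocab_size
uniform_ppl = perplexity(np.full(8, 1.0 / vocab_size))

# A model that assigns probability 1 to every observed token
certain_ppl = perplexity(np.ones(8))

print(uniform_ppl)  # approximately vocab_size
print(certain_ppl)  # 1.0
```

Because the loss is averaged over tokens before exponentiating, perplexity is comparable across sequences of different lengths, provided the tokenization is the same.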

Watching perplexity trend downward across training checkpoints is often the very first signal that a language model is actually learning something useful, well before its generated text starts looking obviously coherent to a human reader.

Summary

Concept       | Measures
Entropy       | Inherent uncertainty in a single distribution
Cross-entropy | How well a predicted distribution matches the true one; the standard classification loss
KL divergence | The “extra cost” of using the wrong distribution, relative to entropy

Cross-entropy loss isn’t an arbitrary framework default — it’s the direct, information-theoretically justified consequence of wanting a model’s predicted probabilities to genuinely match reality, not just produce the right argmax.