Probability Fundamentals for Deep Learning: Random Variables and Bayes’ Theorem
A classifier that outputs “87% cat, 13% dog” isn’t reporting a fact — it’s reporting a probability distribution over possible answers. Nearly every output a deep learning model produces is fundamentally probabilistic: classification scores, language model next-token predictions, and even regression outputs under certain loss functions all rest on probability theory. Understanding the basics isn’t optional background reading — it’s what makes sense of what a model’s output actually means.
Random Variables: Quantifying Uncertain Outcomes
A random variable is a quantity whose value is uncertain, but whose possible outcomes and their likelihoods can be described. The result of a coin flip is a random variable with two outcomes; a neural network’s prediction for “which of 10 digit classes is this image” is a random variable with 10 possible outcomes.
```python
import numpy as np

# A random variable representing a 10-class prediction
class_probabilities = np.array([0.02, 0.01, 0.85, 0.03, 0.01, 0.02, 0.01, 0.02, 0.02, 0.01])
predicted_class = np.argmax(class_probabilities)  # 2, with 85% confidence
```
Every classification model’s final layer produces exactly this: a random variable’s probability distribution over the possible classes, not a certain answer.
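To make the random-variable framing concrete, here is a minimal sketch that draws many outcomes from the same 10-class distribution and compares how often each class appears with its stated probability. The sample size and the seed are arbitrary choices made for illustration.

```python
import numpy as np

# Same illustrative 10-class distribution as in the example above
class_probabilities = np.array([0.02, 0.01, 0.85, 0.03, 0.01, 0.02, 0.01, 0.02, 0.02, 0.01])

# Seed picked arbitrarily so the run is reproducible (an assumption, not a requirement)
rng = np.random.default_rng(seed=0)

# Draw 10,000 outcomes of the random variable
samples = rng.choice(len(class_probabilities), size=10_000, p=class_probabilities)

# Fraction of draws that landed on each class
empirical = np.bincount(samples, minlength=len(class_probabilities)) / samples.size

print(empirical.round(3))  # close to class_probabilities, but not identical
```

Any single draw can land on an unlikely class; only over many draws do the frequencies settle near the stated probabilities.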
Probability Distributions: Describing All Possible Outcomes
A probability distribution assigns a likelihood to every possible outcome of a random variable, and those likelihoods must sum to 1. The softmax function, which turns a neural network’s raw output scores into class probabilities, exists specifically to produce a valid probability distribution.
```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_scores / np.sum(exp_scores)

raw_scores = np.array([2.0, 1.0, 5.5])
probabilities = softmax(raw_scores)
print(probabilities)          # approximately [0.029 0.011 0.960]
print(np.sum(probabilities))  # 1.0, up to floating-point rounding
```
This is the mechanism behind a typical multi-class classifier’s output layer; more named distributions are covered in Probability Distributions.
Conditional Probability: Probability Given Evidence
Conditional probability, written P(A|B), is the probability of event A occurring given that B has already happened. This is the exact structure of supervised learning’s core question: “what’s the probability of this label, given this input?”
P(label = "cat" | image pixels) = 0.87
A trained classifier is, mathematically, an approximation of this conditional probability distribution: it has learned, from training data, to estimate P(label | input) for any input it’s shown, even ones it’s never seen before.
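One way to see where such a conditional probability comes from is to compute it from counts. The sketch below assumes a made-up table of labeled examples (the feature “pointed ears visible” and every count are hypothetical) and applies the definition P(label | feature) = P(feature, label) / P(feature).

```python
import numpy as np

# Hypothetical counts of labeled images.
# Rows: feature "pointed ears visible" (0 = no, 1 = yes); columns: label (cat, dog)
counts = np.array([
    [5, 40],   # ears not pointed
    [45, 10],  # ears pointed
])

joint = counts / counts.sum()                  # P(feature, label)
p_feature = joint.sum(axis=1, keepdims=True)   # P(feature), summed over labels

# Definition of conditional probability, applied row by row
p_label_given_feature = joint / p_feature

print(p_label_given_feature[1])  # [P(cat | pointed ears), P(dog | pointed ears)]
```

Each row of `p_label_given_feature` sums to 1, because once the feature value is fixed the labels form a complete probability distribution. A classifier does the same kind of estimation, but over inputs far too rich to tabulate.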
Bayes’ Theorem: Updating Beliefs With New Evidence
Bayes’ theorem relates a conditional probability to its reverse — expressing P(A|B) in terms of P(B|A), P(A), and P(B):
P(A|B) = P(B|A) * P(A) / P(B)
This might look abstract, but it’s the exact mathematical structure behind naive Bayes classifiers, behind how you’d correctly interpret a medical test result (a positive result doesn’t mean you have the disease; it depends on the disease’s base rate, P(A)), and behind the probabilistic reasoning underlying generative models covered later in this series.
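The medical test case can be worked through numerically. This is a sketch with hypothetical numbers (a 1% base rate, 95% sensitivity, and a 5% false-positive rate, none of which come from a real test), where A is “has the disease” and B is “tests positive”.

```python
# Hypothetical numbers, chosen only for illustration
p_disease = 0.01             # P(A): base rate, 1% of people have the disease
p_pos_given_disease = 0.95   # P(B|A): sensitivity
p_pos_given_healthy = 0.05   # P(B|not A): false-positive rate

# P(B): total probability of a positive test, over sick and healthy people
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # 0.161
```

Under these assumed numbers, a positive result leaves only about a 16% chance of disease, because the few true positives are outnumbered by false positives from the much larger healthy group.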
```python
# Bayes' theorem example: spam classification
p_spam = 0.4                 # prior: 40% of emails are spam
p_word_given_spam = 0.9      # 90% of spam emails contain "free"
p_word_given_not_spam = 0.1  # 10% of legit emails contain "free"
p_not_spam = 1 - p_spam

p_word = (p_word_given_spam * p_spam) + (p_word_given_not_spam * p_not_spam)
p_spam_given_word = (p_word_given_spam * p_spam) / p_word

print(p_spam_given_word)  # ~0.857 -- seeing "free" raises spam probability from 40% to ~86%
```
Why This Matters for Loss Functions
Cross-entropy loss — the most widely used loss function for classification, covered in Loss Functions — is derived directly from probability theory. It measures how far a model’s predicted probability distribution is from the true distribution (where the correct class has probability 1 and everything else has probability 0). Understanding conditional probability is what makes it clear why cross-entropy is the natural loss for classification, rather than an arbitrary formula to memorize.
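As a rough illustration, when the true distribution puts probability 1 on the correct class, cross-entropy reduces to the negative log of the probability the model assigned to that class. The sketch below assumes three made-up predictions for a 3-class problem whose correct class is index 1; the helper name `cross_entropy` and the small `eps` guard are illustrative choices, not a particular library’s API.

```python
import numpy as np

def cross_entropy(predicted, true_class, eps=1e-12):
    """Cross-entropy against a one-hot target: -log P(true class)."""
    # eps keeps log() finite if the model assigns probability 0
    return -np.log(predicted[true_class] + eps)

# Hypothetical predicted distributions; the correct class is index 1
confident_right = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.30, 0.40, 0.30])
confident_wrong = np.array([0.90, 0.05, 0.05])

for name, p in [("confident_right", confident_right),
                ("unsure", unsure),
                ("confident_wrong", confident_wrong)]:
    print(name, round(float(cross_entropy(p, true_class=1)), 3))
# confident_right 0.105
# unsure 0.916
# confident_wrong 2.996
```

The loss grows sharply as the probability on the correct label shrinks, so a confidently wrong prediction is penalized far more than an unsure one.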
Probability and Model Confidence
A well-calibrated model’s probability outputs should reflect real-world frequency: if a model reports 80% confidence on each of a thousand predictions, roughly 800 of them should be correct. Many modern deep networks are known to be poorly calibrated (typically overconfident) by default. Techniques like temperature scaling address this by rescaling a model’s raw scores before the softmax, so that the resulting probabilities track observed accuracy more closely. This is a direct, practical consequence of taking the probabilistic interpretation of model outputs seriously rather than treating a “confidence score” as just a UI number.
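A minimal sketch of temperature scaling: the raw scores are divided by a temperature T before the softmax. The value T = 2.0 below is an arbitrary illustrative choice; in practice T is usually fitted on held-out validation data, which this sketch does not do.

```python
import numpy as np

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores / temperature; temperature=1.0 is the ordinary softmax."""
    scaled = scores / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp_scaled / np.sum(exp_scaled)

raw_scores = np.array([2.0, 1.0, 5.5])

original = softmax_with_temperature(raw_scores, temperature=1.0)
softened = softmax_with_temperature(raw_scores, temperature=2.0)  # illustrative T only

print(original.max(), softened.max())  # the softened top probability is lower
```

A temperature above 1 flattens the distribution and lowers the top probability without changing which class is predicted, which is why it can reduce overconfidence without affecting accuracy.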
Independence: A Simplifying Assumption Worth Recognizing
A related concept worth knowing: two events are independent if knowing one tells you nothing about the other; formally, P(A|B) = P(A), or equivalently P(A and B) = P(A) * P(B). Many machine learning models make a simplifying independence assumption. Naive Bayes classifiers do so most explicitly, assuming every feature is independent of the others given the class label. That assumption is rarely exactly true in real data, but it often works reasonably well in practice: the resulting model is simple and fast, and the violation frequently affects the final prediction less than its theoretical inaccuracy might suggest. Recognizing when a model relies on an independence assumption, and being appropriately skeptical of its outputs when that assumption is obviously violated in your data, is a useful habit when evaluating any probabilistic model.
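Independence can be checked directly on a small joint distribution. The condition P(A|B) = P(A) is equivalent to the joint probability factoring as P(A and B) = P(A) * P(B), and that product form is what the sketch below tests. Both joint tables and the helper name `is_independent` are hypothetical, made up for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """True if every cell of the joint table equals P(A=i) * P(B=j)."""
    p_a = joint.sum(axis=1)  # marginal distribution of A (rows)
    p_b = joint.sum(axis=0)  # marginal distribution of B (columns)
    return bool(np.allclose(joint, np.outer(p_a, p_b), atol=tol))

# Built as a product of marginals, so independent by construction
independent_joint = np.outer([0.3, 0.7], [0.6, 0.4])

# Mass concentrated on the diagonal: knowing A says a lot about B
dependent_joint = np.array([
    [0.35, 0.05],
    [0.05, 0.55],
])

print(is_independent(independent_joint))  # True
print(is_independent(dependent_joint))    # False
```

Running the same comparison on pairs of features within a class is a quick, if rough, way to see how strongly a dataset violates the naive Bayes assumption.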
Summary
| Concept | Deep Learning Application |
|---|---|
| Random variable | What a model’s prediction actually represents |
| Probability distribution | The shape of a classifier’s output (via softmax) |
| Conditional probability | P(label \| input): literally what supervised learning estimates |
| Bayes’ theorem | Foundation of naive Bayes, base-rate reasoning, and generative modeling |
Every “confidence score” you’ve seen a model output is a claim rooted in this exact theory. Taking it literally — checking whether a model’s stated confidence matches its actual accuracy — is one of the most practically useful habits this foundation enables.