Supervised Learning Explained: Classification, Regression, and Real Examples
Supervised learning is the workhorse of applied machine learning — it’s the category behind image classifiers, spam filters, price predictors, and the vast majority of deep learning models deployed in production today. The name comes from the training process: the model is “supervised” by being shown the correct answer for every training example, and it learns by minimizing the gap between its predictions and those correct answers.
The Core Requirement: Labeled Data
Supervised learning requires a dataset where every input has a known, correct output attached — the “label.” Without labels, there’s nothing to supervise the learning process with.
```python
# Labeled data: each email has a known correct answer
training_data = [
    ("Win a free prize now!!!", "spam"),
    ("Meeting rescheduled to 3pm", "not_spam"),
    ("Claim your inheritance today", "spam"),
    ("Can you review my PR?", "not_spam"),
]
```
Collecting this labeled data, often the most expensive and time-consuming part of a real supervised learning project, is what Dataset Preparation covers in depth.
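Most training code needs labels as integers rather than strings. The following is a minimal sketch of that conversion, assuming the same four emails as above; the names `label_to_index`, `texts`, and `targets` are illustrative and not taken from any library.

```python
# Same labeled pairs as above: (input text, correct label).
training_data = [
    ("Win a free prize now!!!", "spam"),
    ("Meeting rescheduled to 3pm", "not_spam"),
    ("Claim your inheritance today", "spam"),
    ("Can you review my PR?", "not_spam"),
]

# Give each distinct label an integer id (sorted so the mapping is reproducible).
distinct_labels = sorted({label for _, label in training_data})
label_to_index = {label: i for i, label in enumerate(distinct_labels)}

# Separate inputs from targets; the targets are what "supervise" training.
texts = [text for text, _ in training_data]
targets = [label_to_index[label] for _, label in training_data]

print(label_to_index)
print(targets)
```

Keeping the mapping in a dictionary makes it easy to translate a predicted index back into a readable label later.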
Classification: Predicting a Category
Classification tasks predict a discrete category from a fixed set of possibilities — spam or not spam, which digit (0–9), which of a thousand object classes.
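```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # 10 output classes
    nn.Softmax(dim=1),   # converts raw scores to a probability distribution
)
```
The number of output neurons in a classifier equals the number of possible classes, and softmax (covered in Probability Distributions) converts the raw output scores into a valid probability distribution over those classes. One caveat when training in PyTorch: `nn.CrossEntropyLoss` expects raw scores (logits) and applies log-softmax internally, so the final `nn.Softmax` layer is normally left out of the model during training and applied only when probabilities are needed at inference.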
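To make the softmax step concrete, here is a small plain-Python sketch of the arithmetic. It is an illustration rather than how a framework implements it, and the three-class `logits` values are made up for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a 3-class problem.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(predicted_class)
```

The probabilities come out at roughly 0.66, 0.24, and 0.10, so the ordering of the raw scores is preserved while the values become comparable as probabilities.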
Regression: Predicting a Continuous Number
Regression tasks predict a continuous numerical value rather than a category — a house price, a temperature, tomorrow’s stock value.
```python
regressor = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # single continuous output, no activation needed
)
```
The key architectural difference from classification: a regression model’s output layer typically has no final activation function (or occasionally a linear one), since the goal is an unbounded numeric prediction, not a bounded probability. Mean squared error, rather than cross-entropy, is the standard loss function here, covered in Loss Functions.
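As a rough illustration of the loss mentioned above, mean squared error can be computed by hand in a few lines. The prices below are invented for the example, and in practice a framework's built-in loss (such as PyTorch's `nn.MSELoss`) would be used instead.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical house prices, in thousands of dollars.
predicted = [250.0, 310.0, 180.0]
actual = [240.0, 330.0, 180.0]

mse = mean_squared_error(predicted, actual)
print(mse)
```

Because each error is squared, the prediction that is off by 20 contributes four times as much to the loss as the one that is off by 10.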
Classification vs. Regression: A Quick Comparison
| Aspect | Classification | Regression |
|---|---|---|
| Output type | Discrete category | Continuous number |
| Example task | Is this email spam? | What will this house sell for? |
| Typical output layer | Softmax (multi-class) or sigmoid (binary) | Linear (no activation) |
| Typical loss function | Cross-entropy | Mean squared error |
| Evaluation metric | Accuracy, precision, recall, F1 | Mean absolute error, R² |
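
The evaluation metrics in the last row of the table can be computed directly from predictions and true values. The sketch below uses plain Python and small made-up arrays; the function names are illustrative, and libraries such as scikit-learn provide tested equivalents.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification: hypothetical true and predicted labels (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

# Regression: hypothetical true and predicted prices, in thousands of dollars.
mae = mean_absolute_error([240.0, 330.0, 180.0], [250.0, 310.0, 180.0])

print(accuracy, precision, recall, f1, mae)
```
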
Some problems can genuinely be framed either way — predicting a house’s exact price is regression, but predicting whether it falls into a “budget,” “mid-range,” or “luxury” bucket is classification. Choosing between them depends on what decision the prediction is actually meant to support downstream.
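Reframing a regression target as a classification target usually comes down to choosing cut-off values. A minimal sketch, assuming two arbitrary thresholds that are purely hypothetical and would in practice come from the downstream decision being supported:

```python
def price_to_bucket(price, budget_max=200_000, mid_range_max=600_000):
    """Map a continuous price to one of three discrete classes.

    The threshold defaults are illustrative assumptions, not market data.
    """
    if price < budget_max:
        return "budget"
    if price < mid_range_max:
        return "mid-range"
    return "luxury"

# The same continuous targets, re-expressed as class labels.
prices = [150_000, 420_000, 1_250_000]
buckets = [price_to_bucket(p) for p in prices]
print(buckets)
```

Once the targets are buckets, the model needs a three-way classification output and a classification loss rather than a single numeric output.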
The Supervised Learning Workflow, End to End
1. Collect labeled data → (input, correct_output) pairs
2. Split into train/val/test → covered in Dataset Preparation
3. Choose a model architecture → MLP, CNN, transformer, etc.
4. Choose a loss function → matched to the task type (see table above)
5. Train via gradient descent → minimize the loss on training data
6. Evaluate on held-out data → covered in Evaluation Metrics
7. Deploy and monitor → covered in Deep Learning Deployment

Every deep learning architecture covered later in this series (CNNs for images, transformers for text) is typically trained within this same supervised learning workflow, differing mainly in step 3 (architecture) and the specific data used in step 1.
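Step 2 of the workflow is simple enough to sketch in plain Python. The function name, the 80/10/10 proportions, and the toy dataset are assumptions made for illustration; real projects often need stratified or time-based splits instead of a plain shuffle.

```python
import random

def train_val_test_split(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a copy of the data and cut it into three disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy labeled dataset: (input, correct_output) pairs.
examples = [(x, 2 * x) for x in range(100)]
train, val, test = train_val_test_split(examples)
print(len(train), len(val), len(test))
```

The validation set guides choices made during training, while the test set is held back until the end so the final evaluation is not influenced by those choices.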
When Supervised Learning Isn’t the Right Fit
Supervised learning’s biggest practical constraint is the need for labeled data, which is often expensive, slow, or simply unavailable at the scale needed. This is precisely the gap that Unsupervised Learning and, increasingly, self-supervised pretraining (covered in Large Language Models) exist to address: both extract useful structure from data without requiring a human to label every example first.
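One way to see how self-supervision avoids manual labeling is next-word prediction, where the "label" for each example is simply the word that follows in the raw text. The sketch below is a toy illustration with a hypothetical helper name and whitespace tokenization, not how production language models prepare data.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, next_word) training pairs from raw text, with no human labels."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))  # the target comes from the text itself
    return pairs

pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

The resulting pairs have the same (input, correct_output) shape as a supervised dataset, which is why the rest of the training workflow carries over largely unchanged.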
Multi-Label Classification: A Third Pattern Worth Distinguishing
Beyond standard classification (exactly one correct class) and regression, a third common pattern is multi-label classification, where an example can genuinely belong to multiple categories simultaneously — a news article might be tagged as both “politics” and “economy” at once, not one or the other.
```python
import torch.nn as nn

# Multi-label output: independent sigmoid per label, not a single softmax
multi_label_output = nn.Sequential(
    nn.Linear(256, 10),  # 10 possible labels
    nn.Sigmoid(),        # each output is an independent 0-1 probability, not a joint distribution
)
```
The key architectural difference: multi-label problems use an independent sigmoid activation per output label (each is its own binary yes/no decision) rather than a single softmax across all labels (which divides one unit of probability mass across them and so assumes exactly one label is correct). Recognizing which of these three patterns (single-label classification, regression, or multi-label classification) actually matches your problem is a foundational step that shapes several downstream architectural decisions.
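The effect of independent sigmoids is easy to check numerically. In this plain-Python sketch the label names, raw scores, and the 0.5 threshold are all illustrative assumptions:

```python
import math

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores for one news article, one score per label.
labels = ["politics", "economy", "sports"]
logits = [2.0, 1.5, -3.0]

# Each label gets its own probability; they are not required to sum to 1.
probs = [sigmoid(z) for z in logits]

# Every label above the threshold is predicted, so several can be active at once.
threshold = 0.5
predicted = [label for label, p in zip(labels, probs) if p >= threshold]

print([round(p, 3) for p in probs])
print(predicted)
```

Here both "politics" and "economy" clear the threshold, and the probabilities add up to more than 1, which a softmax output could never produce.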
Identifying the right pattern early avoids a common downstream mistake: forcing a genuinely multi-label problem through a single softmax output, which structurally cannot represent an example belonging to more than one category at once.
Summary
| Concept | Meaning |
|---|---|
| Labeled data | Input/output pairs where the correct answer is known |
| Classification | Predicting a discrete category |
| Regression | Predicting a continuous value |
| Supervised workflow | Data → model → loss → training → evaluation → deployment |
Supervised learning is the default starting point for the majority of real-world deep learning problems — understanding whether your task is fundamentally a classification or regression problem is usually the very first architectural decision you’ll make, before choosing anything about the network itself.