Semi-Supervised Learning
Labeling data is expensive. A human expert reviewing medical images, legal documents, or fraud cases might label a few hundred examples per day — at significant cost. Yet your database has a million unlabeled records sitting idle.
Semi-supervised learning is the discipline of using both: the small set of expensive labeled examples to anchor the model, and the large set of cheap unlabeled examples to improve it.
Why Unlabeled Data Helps
At first glance, unlabeled data seems useless — there’s no target to train against. But unlabeled data reveals the structure of the input space: where clusters form, what the data distribution looks like, which regions are dense.
A good classifier should:
- Separate classes where labeled examples tell it to
- Place its decision boundary through low-density regions (not through clusters)
Unlabeled data helps with the second point, even though the class of each unlabeled example is unknown.
[Diagram: unlabeled points from two groups (● and ○), with a "bad boundary" marked between them.]
A semi-supervised model places the boundary in the gap between the clusters, even though the points around that gap have no labels.
Key Approaches
1. Self-Training (Pseudo-Labeling)
The simplest approach, widely used in practice:
- Train a model on the labeled set
- Use it to label the unlabeled set (pseudo-labels)
- Retrain on labeled + high-confidence pseudo-labeled data
- Repeat

    import numpy as np
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    # X_labeled, y_labeled (200 labeled examples) and X_unlabeled (1800 examples)
    # are assumed to be existing NumPy arrays.
    # scikit-learn marks unlabeled examples with the label -1.
    y_mixed = np.concatenate([y_labeled, np.full(len(X_unlabeled), -1)])
    X_all = np.vstack([X_labeled, X_unlabeled])

    # Self-training wrapper around any base classifier that provides predict_proba
    base = SVC(probability=True, kernel='rbf')
    model = SelfTrainingClassifier(base, threshold=0.9)
    model.fit(X_all, y_mixed)

Risk: if the initial model assigns wrong pseudo-labels with high confidence, the errors compound in later rounds. Self-training works well when the initial labeled data is representative of the full distribution.
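To make the four steps of self-training concrete, the loop that a wrapper like `SelfTrainingClassifier` automates can be written out by hand. The following is a minimal sketch, not scikit-learn's implementation. The synthetic dataset from `make_classification`, the `LogisticRegression` base model, the 60 labeled examples, and the cap of 10 rounds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration only).
X, y_true = make_classification(n_samples=600, n_features=10,
                                n_informative=5, random_state=0)

# Keep 60 true labels; mark everything else as unlabeled with -1.
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=60, replace=False)
y = np.full(len(X), -1)
y[labeled_idx] = y_true[labeled_idx]

threshold = 0.9  # minimum predicted probability to accept a pseudo-label
for _ in range(10):
    is_labeled = y != -1
    # Step 1: train on everything that currently has a label.
    model = LogisticRegression(max_iter=1000).fit(X[is_labeled], y[is_labeled])

    unlabeled_idx = np.flatnonzero(~is_labeled)
    if unlabeled_idx.size == 0:
        break
    # Step 2: predict class probabilities for the unlabeled examples.
    proba = model.predict_proba(X[unlabeled_idx])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break  # nothing new to add, so further rounds would change nothing
    # Step 3: accept only the high-confidence predictions as pseudo-labels.
    y[unlabeled_idx[confident]] = model.classes_[proba.argmax(axis=1)[confident]]
    # Step 4: the next iteration retrains on the enlarged labeled set.

# Final fit on the true labels plus all accepted pseudo-labels.
final_model = LogisticRegression(max_iter=1000).fit(X[y != -1], y[y != -1])
n_pseudo = int((y != -1).sum()) - len(labeled_idx)
print(f"accepted pseudo-labels: {n_pseudo}")
```

Pseudo-labels are never revisited once accepted in this sketch, which is exactly why a confidently wrong early prediction can propagate into later rounds.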
2. Label Propagation
Spread labels through a graph built on feature similarity. Unlabeled points absorb class information from their labeled neighbors.

    from sklearn.semi_supervised import LabelPropagation

    lp = LabelPropagation(kernel='rbf', gamma=20)
    lp.fit(X_all, y_mixed)
    predicted_labels = lp.transduction_  # Labels for all points

Works well when the data has clear cluster structure and labeled points are spread across clusters.
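To show what "spreading labels through a graph" means mechanically, here is a simplified NumPy sketch of the propagation step. It is an illustration of the general idea and not scikit-learn's implementation. The two toy blobs, the single labeled point per blob, `gamma = 1.0`, and the fixed 200 iterations are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 2-D blobs (hypothetical toy data).
X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(20, 2)),
               rng.normal(loc=+2.0, scale=0.5, size=(20, 2))])
y_true = np.array([0] * 20 + [1] * 20)  # used only to check the result

# One labeled point per blob; -1 marks unlabeled points.
y = np.full(40, -1)
y[0], y[20] = 0, 1
labeled_idx = np.flatnonzero(y != -1)

# RBF similarity graph, row-normalized so each row sums to 1.
gamma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-gamma * sq_dists)
T = W / W.sum(axis=1, keepdims=True)

# F[i, c] is the current belief that point i belongs to class c.
F = np.zeros((40, 2))
F[labeled_idx, y[labeled_idx]] = 1.0
clamp = F[labeled_idx].copy()

for _ in range(200):
    F = T @ F                # each point averages its neighbors' beliefs
    F[labeled_idx] = clamp   # labeled points keep their known labels

predicted = F.argmax(axis=1)
print("accuracy:", (predicted == y_true).mean())
```

Because similarity between the two blobs is close to zero, almost no class information leaks across the gap, which is the cluster structure the method relies on.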
3. Consistency Regularization
Train the model to produce the same predictions on an unlabeled example and on an augmented or perturbed version of it. This is the core idea behind methods such as MixMatch and FixMatch for image classification.
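The objective can be sketched numerically with NumPy. This is a minimal illustration under stated assumptions, not the MixMatch or FixMatch algorithm: the model is a hypothetical linear softmax classifier with random weights, Gaussian noise stands in for an image augmentation, and the weight `lam` is set to 1.0 arbitrarily. No training step is shown; the code only computes the loss terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict(W, X):
    # Hypothetical linear model: class probabilities for each row of X.
    return softmax(X @ W)

def kl_divergence(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def cross_entropy(p, y, eps=1e-12):
    # Mean negative log-probability of the true class.
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + eps)))

W = rng.normal(size=(5, 3))            # 5 features, 3 classes
X_lab = rng.normal(size=(8, 5))        # small labeled batch
y_lab = rng.integers(0, 3, size=8)
X_unl = rng.normal(size=(32, 5))       # larger unlabeled batch

# Two independently perturbed views of the same unlabeled examples.
view_1 = X_unl + rng.normal(scale=0.1, size=X_unl.shape)
view_2 = X_unl + rng.normal(scale=0.1, size=X_unl.shape)

consistency_loss = kl_divergence(predict(W, view_1), predict(W, view_2))
supervised_loss = cross_entropy(predict(W, X_lab), y_lab)

lam = 1.0  # weight of the unlabeled term (an assumed value)
total_loss = supervised_loss + lam * consistency_loss
print(f"supervised={supervised_loss:.4f}  consistency={consistency_loss:.4f}")
```

The consistency term needs no labels at all, which is what lets the unlabeled batch contribute to training.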

    For an unlabeled image x:
        augmented_1 = random_crop(x)
        augmented_2 = color_jitter(x)
        consistency_loss = KL_divergence(model(augmented_1), model(augmented_2))

    Training loss = supervised_loss(labeled) + λ × consistency_loss(unlabeled)

4. Generative Models (VAE/GAN-based)
Learn the data distribution from unlabeled examples, then use it to improve the classifier. More complex but powerful when data is high-dimensional (images, audio).
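A VAE or GAN is too large for a short example, so the sketch below swaps in a much simpler generative model, a Gaussian mixture, to illustrate the same two-stage idea: fit the data distribution on all points without labels, then use a handful of labels to name the components. The toy blobs, the six labeled indices, and the majority-vote mapping are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs, 400 points in total.
X = np.vstack([rng.normal(-3.0, 1.0, size=(200, 2)),
               rng.normal(+3.0, 1.0, size=(200, 2))])
y_true = np.array([0] * 200 + [1] * 200)

# Only six points are treated as labeled; the rest of y_true is used
# solely to evaluate the result at the end.
labeled_idx = np.array([0, 1, 2, 200, 201, 202])

# Stage 1: learn the data distribution from all points, ignoring labels.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
components = gmm.predict(X)

# Stage 2: map each mixture component to the majority class
# among the labeled points assigned to it.
component_to_class = {}
for k in range(gmm.n_components):
    members = labeled_idx[components[labeled_idx] == k]
    if members.size:  # skip components that contain no labeled point
        component_to_class[k] = int(np.bincount(y_true[members]).argmax())

# Points in an unmapped component get -1 (no prediction).
predicted = np.array([component_to_class.get(k, -1) for k in components])
accuracy = float((predicted == y_true).mean())
print(f"accuracy with 6 labels: {accuracy:.3f}")
```

This only works when mixture components line up with classes; on data where that assumption fails, the labeled points cannot repair the mismatch.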
When Semi-Supervised Learning Makes Sense
| Scenario | Suitable? |
|---|---|
| Medical diagnosis with few expert-labeled scans | ✓ Strong fit |
| Legal document classification with 50 labeled documents | ✓ Strong fit |
| Customer churn with 10K labeled examples | Maybe — try supervised first |
| Tabular data with good label coverage | ✗ Unlikely to help much |
| Images where augmentation is meaningful | ✓ Good fit for consistency regularization |
The benefit is largest when labeled data is scarce (< 1% of total) and the data has meaningful cluster structure.
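One way to act on "try supervised first" is to score a purely supervised baseline and a semi-supervised model on the same held-out set. The sketch below assumes a synthetic dataset from `make_classification`, 30 retained labels, and a `LogisticRegression` base model. These are illustrative choices, and the printed numbers say nothing about how the comparison would turn out on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Hide all but 30 training labels; -1 marks an unlabeled example.
rng = np.random.default_rng(0)
keep = rng.choice(len(y_train), size=30, replace=False)
y_partial = np.full(len(y_train), -1)
y_partial[keep] = y_train[keep]

# Baseline: supervised model trained on the 30 labeled examples only.
baseline = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

# Semi-supervised: same base model, but it also sees the unlabeled examples.
semi = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
semi.fit(X_train, y_partial)

baseline_acc = baseline.score(X_test, y_test)
semi_acc = semi.score(X_test, y_test)
print(f"supervised only: {baseline_acc:.3f}")
print(f"self-training:   {semi_acc:.3f}")
```

If the two scores are close, the extra complexity of the semi-supervised pipeline is probably not paying for itself on that dataset.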
The Modern Picture: Foundation Models Changed Everything
In 2025–2026, the most impactful form of semi-supervised learning is pre-training + fine-tuning:
- Pre-train a large model on massive unlabeled data (self-supervised)
- Fine-tune on a small labeled dataset
This is what makes BERT, GPT, and vision transformers so powerful on small labeled datasets. The pre-training phase is technically self-supervised (the model creates its own labels from the data), but the end result — strong performance from few human labels — is exactly what semi-supervised learning promises.
For practitioners: if you have a text or image problem with limited labels, starting from a pre-trained foundation model (BERT, CLIP, ViT, LLaMA) and fine-tuning on your labeled data will typically outperform any classical semi-supervised approach.