Semi-Supervised Learning

Labeling data is expensive. A human expert reviewing medical images, legal documents, or fraud cases might label a few hundred examples per day — at significant cost. Yet your database has a million unlabeled records sitting idle.

Semi-supervised learning is the discipline of using both: the small set of expensive labeled examples to anchor the model, and the large set of cheap unlabeled examples to improve it.

Why Unlabeled Data Helps

At first glance, unlabeled data seems useless — there’s no target to train against. But unlabeled data reveals the structure of the input space: where clusters form, what the data distribution looks like, which regions are dense.

A good classifier should:

1. Separate classes where labeled examples tell it to
2. Place its decision boundary through low-density regions (not through clusters)

Unlabeled data helps with the second point: it shows where the low-density regions are, even though no class labels are attached.

                [unlabeled]
    ●●●                                 ○○○
  ●●  ●●    ←── bad boundary ──→   ○○  ○○
    ●●●                                 ○○○
                [unlabeled]

A semi-supervised model places the boundary in the gap between clusters,
even though the points around that gap have no labels.
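To make "low-density region" concrete, here is a minimal sketch that estimates the input density from unlabeled points alone. The two synthetic clusters, the bandwidth, and the probe locations are assumptions chosen for illustration, not part of any particular method.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Two hypothetical clusters of unlabeled points, centered at x = -3 and x = +3.
left = rng.normal(loc=[-3.0, 0.0], scale=0.7, size=(500, 2))
right = rng.normal(loc=[3.0, 0.0], scale=0.7, size=(500, 2))
X_unlabeled = np.vstack([left, right])

# Estimate the input density from unlabeled data only (no labels involved).
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_unlabeled)

# Probe the two cluster centers and the gap halfway between them.
probes = np.array([[-3.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
log_density = kde.score_samples(probes)

for point, ld in zip(probes, log_density):
    print(f"x = {point[0]:+.1f}   log-density = {ld:.2f}")
```

The log-density at the midpoint comes out far lower than at either center, which is the signal a semi-supervised method can use to keep the boundary out of the clusters.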

Key Approaches

1. Self-Training (Pseudo-Labeling)

The simplest approach, widely used in practice:

1. Train a model on the labeled set
2. Use it to label the unlabeled set (pseudo-labels)
3. Retrain on labeled + high-confidence pseudo-labeled data
4. Repeat

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Assumes X_labeled, y_labeled (e.g. 200 examples) and X_unlabeled
# (e.g. 1800 examples) are already defined as NumPy arrays.
X_all = np.vstack([X_labeled, X_unlabeled])

# -1 indicates unlabeled examples
y_mixed = np.concatenate([y_labeled, np.full(len(X_unlabeled), -1)])

# Self-training wrapper around any base classifier that provides predict_proba
base = SVC(probability=True, kernel='rbf')
# Keep only pseudo-labels predicted with probability above 0.9
model = SelfTrainingClassifier(base, threshold=0.9)
model.fit(X_all, y_mixed)

Risk: if the initial model assigns wrong pseudo-labels with high confidence, those errors are fed back into training and compound over rounds. Self-training works best when the initial labeled set is representative of the full data.
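The four self-training steps can also be written out by hand, which makes the confidence threshold and the stopping conditions explicit. This is a minimal sketch on synthetic data, not a reproduction of scikit-learn's internals; the logistic regression base model, `threshold`, and `max_rounds` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption): 200 labeled, 1800 unlabeled examples.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_lab, y_lab = X[:200], y[:200]
X_unl = X[200:]  # the true labels of these rows are never used

threshold = 0.9  # minimum predicted probability to accept a pseudo-label
max_rounds = 5   # hypothetical cap on the number of self-training rounds

# Step 1: train on the labeled set
model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

for round_idx in range(max_rounds):
    if len(X_unl) == 0:
        break
    # Step 2: pseudo-label the remaining unlabeled examples
    proba = model.predict_proba(X_unl)
    pseudo = model.classes_[proba.argmax(axis=1)]
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break  # nothing passes the threshold, so stop early

    # Step 3: move confident examples into the training set and retrain
    X_lab = np.vstack([X_lab, X_unl[confident]])
    y_lab = np.concatenate([y_lab, pseudo[confident]])
    X_unl = X_unl[~confident]
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

    # Step 4: repeat
    print(f"round {round_idx + 1}: {confident.sum()} added, {len(X_unl)} unlabeled left")
```

Because accepted pseudo-labels are never revisited in this sketch, a confident mistake in an early round stays in the training set, which is exactly the compounding risk noted above.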

2. Label Propagation

Spread labels through a graph built on feature similarity. Unlabeled points absorb class information from their labeled neighbors.

from sklearn.semi_supervised import LabelPropagation

lp = LabelPropagation(kernel='rbf', gamma=20)
lp.fit(X_all, y_mixed)
predicted_labels = lp.transduction_  # Labels for all points

Works well when the data has clear cluster structure and labeled points are spread across clusters.
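A self-contained way to try this is on a toy dataset with obvious cluster structure. The sketch below assumes the two-moons dataset with 10 labels kept per class; the dataset, the noise level, and `max_iter` are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Toy data with two clearly separated, curved clusters.
X, y_true = make_moons(n_samples=400, noise=0.08, random_state=0)

# Hide all but 10 labels per class; -1 marks an unlabeled point.
rng = np.random.default_rng(0)
y_mixed = np.full(len(y_true), -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == cls), size=10, replace=False)
    y_mixed[keep] = cls

lp = LabelPropagation(kernel="rbf", gamma=20, max_iter=1000)
lp.fit(X, y_mixed)

# Score only the points whose labels were hidden from the model.
unlabeled = y_mixed == -1
accuracy = (lp.transduction_[unlabeled] == y_true[unlabeled]).mean()
print(f"accuracy on originally unlabeled points: {accuracy:.3f}")
```

The `gamma` value controls how quickly similarity falls off with distance, so it usually needs tuning to the scale of the features.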

3. Consistency Regularization

Train the model to produce the same predictions on an unlabeled example and on an augmented or perturbed version of it. This idea is central to methods such as MixMatch and FixMatch for image classification.

For each unlabeled image x:
  augmented_1 = random_crop(x)
  augmented_2 = color_jitter(x)

  consistency_loss = KL_divergence(model(augmented_1), model(augmented_2))

training_loss = supervised_loss(labeled) + λ × consistency_loss(unlabeled)
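The combined loss can be computed in a few lines of NumPy. This is a sketch of the loss computation only (no gradient step), under several assumptions: a hypothetical linear-softmax `model`, Gaussian noise in `augment` as a stand-in for image augmentations such as crops or color jitter, and random data.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over a batch of probability rows."""
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

def cross_entropy(p, y, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.log(p[np.arange(len(y)), y] + eps))

# Hypothetical linear model: 8 input features -> 3 classes.
W = rng.normal(size=(8, 3))

def model(x):
    return softmax(x @ W)

def augment(x, scale=0.1):
    # Stand-in for image augmentation: additive Gaussian noise.
    return x + rng.normal(scale=scale, size=x.shape)

X_labeled = rng.normal(size=(16, 8))
y_labeled = rng.integers(0, 3, size=16)
X_unlabeled = rng.normal(size=(64, 8))

lam = 1.0  # weight λ on the consistency term
supervised_loss = cross_entropy(model(X_labeled), y_labeled)
# Two independently perturbed views of the same unlabeled batch.
consistency_loss = kl_divergence(model(augment(X_unlabeled)),
                                 model(augment(X_unlabeled)))
total_loss = supervised_loss + lam * consistency_loss
print(f"supervised={supervised_loss:.3f}  consistency={consistency_loss:.4f}")
```

Note that the consistency term never touches a label: it only penalizes disagreement between the two views. Published methods differ in the details, for example which view is treated as the fixed target and which distance is used.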

4. Generative Models (VAE/GAN-based)

Learn the data distribution from unlabeled examples, then use it to improve the classifier. More complex but powerful when data is high-dimensional (images, audio).
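The same principle can be shown with a much simpler generative model than a VAE or GAN. The sketch below swaps in a Gaussian mixture: it models the data distribution from all points without labels, then uses a handful of labels to name each mixture component. The blob dataset and the 5 labels per class are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated classes.
X, y_true = make_blobs(n_samples=600, centers=[[-4, 0], [0, 4], [4, 0]],
                       cluster_std=0.8, random_state=0)

# Pretend only 5 examples per class carry a label.
rng = np.random.default_rng(0)
labeled_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_true == c), size=5, replace=False)
    for c in range(3)
])

# Step 1: model the data distribution using every point, without labels.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
component = gmm.predict(X)

# Step 2: name each component by majority vote of the labeled points in it.
component_to_class = {}
for k in range(3):
    labels_in_k = y_true[labeled_idx][component[labeled_idx] == k]
    if len(labels_in_k) > 0:
        component_to_class[k] = int(np.bincount(labels_in_k).argmax())

# Components with no labeled point are marked -1 (unknown).
y_pred = np.array([component_to_class.get(k, -1) for k in component])
accuracy = (y_pred == y_true).mean()
print(f"accuracy with 15 labels: {accuracy:.3f}")
```

Deep generative models follow the same logic but learn a far more flexible density, which is what makes them applicable to images and audio where a Gaussian mixture on raw inputs would not fit well.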

When Semi-Supervised Learning Makes Sense

Scenario	Suitable?
Medical diagnosis with few expert-labeled scans	✓ Strong fit
Legal document classification with 50 labeled documents	✓ Strong fit
Customer churn with 10K labeled examples	Maybe — try supervised first
Tabular data with good label coverage	✗ Unlikely to help much
Images where augmentation is meaningful	✓ Consistency regularization

The benefit is largest when labeled data is scarce (< 1% of total) and the data has meaningful cluster structure.

The Modern Picture: Foundation Models Changed Everything

In 2025–2026, the most impactful form of semi-supervised learning is pre-training + fine-tuning:

1. Pre-train a large model on massive unlabeled data (self-supervised)
2. Fine-tune on a small labeled dataset

This is what makes BERT, GPT, and vision transformers so powerful on small labeled datasets. The pre-training phase is technically self-supervised (the model creates its own labels from the data), but the end result — strong performance from few human labels — is exactly what semi-supervised learning promises.

For practitioners: if you have a text or image problem with limited labels, starting from a pre-trained foundation model (BERT, CLIP, ViT, LLaMA) and fine-tuning on your labeled data will typically outperform any classical semi-supervised approach.
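The two-stage pattern can be illustrated at toy scale without downloading a foundation model. In the sketch below, PCA is a deliberately simple stand-in for a pre-trained encoder: it learns a representation from unlabeled data, and a small classifier is then fit on 50 labeled examples in that representation. This corresponds to training a classifier on frozen features rather than full fine-tuning, and the dataset, component count, and label budget are all assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Pretend only 50 examples in the pool have labels.
X_small, _, y_small, _ = train_test_split(
    X_pool, y_pool, train_size=50, random_state=0, stratify=y_pool)

# Stage 1 ("pre-training" stand-in): learn a representation from the
# whole pool without using any labels.
encoder = PCA(n_components=16, random_state=0).fit(X_pool)

# Stage 2: fit a small classifier on the 50 labeled examples,
# using the frozen representation from stage 1.
clf = LogisticRegression(max_iter=2000)
clf.fit(encoder.transform(X_small), y_small)

accuracy = clf.score(encoder.transform(X_test), y_test)
print(f"test accuracy with 50 labels: {accuracy:.3f}")
```

With a real foundation model the structure is the same: the encoder comes from a published checkpoint instead of being fit locally, and its weights are often updated during fine-tuning rather than kept frozen.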