Online Learning
Most machine learning workflows follow a static pattern: collect data, train a model, deploy it, and retrain periodically when it degrades. Online learning breaks that pattern. Instead of batch training, the model updates incrementally — processing one example (or a small batch) at a time, continuously incorporating new information.
Why Online Learning Exists
The world changes. A fraud detection model trained on last year’s patterns misses new attack vectors. A recommendation model trained on summer behavior is wrong in December. A price prediction model trained before a market shock is immediately stale.
Retraining from scratch on all historical data is:
- Computationally expensive (hours or days)
- Memory intensive (can’t hold all data)
- Slow to respond to rapid changes
Online learning addresses this by updating the model incrementally as each new observation arrives.
The Core Mechanism
Batch learning: All data → Train once → Deploy → Wait for degradation → Retrain
Online learning: New example arrives →
1. Make prediction
2. Observe true label / feedback
3. Update model weights immediately
4. Repeat for next example
The model is always current: it has seen every data point up to this moment.
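The four steps above can be sketched in plain Python. This is a minimal illustration rather than a production design: the perceptron update rule, the `run_online` helper, and the toy stream are assumptions chosen for brevity, not something the text prescribes.

```python
def run_online(stream, n_features, lr=0.1):
    """Test-then-train loop over (features, label) pairs with labels in {0, 1}."""
    weights = [0.0] * n_features
    bias = 0.0
    mistakes = 0
    for x, y_true in stream:
        # 1. Make a prediction with the current weights
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        y_pred = 1 if score > 0 else 0

        # 2. Observe the true label and compare
        error = y_true - y_pred  # -1, 0, or +1
        mistakes += int(error != 0)

        # 3. Update immediately (perceptron rule; a no-op when the prediction was right)
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
        bias += lr * error
        # 4. The loop then moves on to the next example
    return weights, bias, mistakes


# Hypothetical toy stream: the label is 1 when the first feature exceeds the second
stream = [((1.0, 0.0), 1), ((0.0, 1.0), 0), ((2.0, 1.0), 1), ((1.0, 3.0), 0)] * 5
weights, bias, mistakes = run_online(stream, n_features=2)
```

Because the prediction is made before the update, the running mistake count doubles as an honest estimate of performance on unseen data.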
Stochastic Gradient Descent (Online SGD)
The mathematical backbone of online learning. Instead of computing gradients over the full dataset (batch gradient descent), update weights after each example:

    # Mini-batch SGD, the de facto standard.
    # Assumes the PyTorch objects model, criterion, optimizer, and dataloader,
    # plus n_epochs, are defined elsewhere.
    for epoch in range(n_epochs):
        for X_batch, y_batch in dataloader:  # process one batch
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()   # compute gradients
            optimizer.step()  # update weights

With batch_size=1 and a single pass over the data as it arrives, this is true online learning. With small batches, it’s a practical compromise between the speed of online updates and the stability of batch training.
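To make the batch_size=1 case concrete, here is a minimal sketch of SGD applied to one example at a time, written in plain Python rather than PyTorch. The linear model, the squared loss, the learning rate, and the synthetic noiseless target are all illustrative assumptions.

```python
import random


def sgd_step(weights, bias, x, y, lr):
    """One SGD update for the squared loss 0.5 * (pred - y)**2 on a single example."""
    pred = sum(w * xi for w, xi in zip(weights, x)) + bias
    grad = pred - y  # derivative of the loss with respect to the prediction
    new_weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * grad
    return new_weights, new_bias


rng = random.Random(0)
weights, bias = [0.0, 0.0], 0.0
for _ in range(2000):
    # Each example is seen exactly once, then discarded
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 3.0 * x[0] - 2.0 * x[1] + 0.5  # hypothetical target to recover
    weights, bias = sgd_step(weights, bias, x, y, lr=0.1)
```

Nothing is stored between iterations except the weights, which is what keeps the memory footprint constant regardless of how long the stream runs.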
Concept Drift: The Main Challenge
The biggest enemy of deployed models is concept drift — the statistical properties of the target variable changing over time.
Types of drift:
- Sudden drift: Distribution changes abruptly (market crash, new fraud method)
- Gradual drift: Slow shift over time (seasonal patterns, demographic change)
- Recurring drift: Cyclical changes (day/night, weekday/weekend)
- Incremental drift: Continuous small changes
Detection methods:
- ADWIN (Adaptive Windowing): Detects distribution changes in a stream
- Page-Hinkley test: Statistical test for abrupt changes
- DDM/EDDM: Monitor error rate; trigger retraining when error rises significantly
- Population Stability Index (PSI): Compare feature distributions between windows
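As an illustration of how one of these detectors works, the following is a from-scratch sketch of a common textbook form of the Page-Hinkley test for an upward shift in the mean. The class name, the default `delta` and `threshold` values, and the synthetic error stream are assumptions for this example; libraries such as River ship their own implementations with different defaults.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm level (often written as lambda)
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of deviations from the mean
        self.minimum = 0.0          # smallest cumulative value seen so far

    def update(self, value):
        """Consume one observation; return True when a change is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


# Hypothetical stream: the monitored value jumps from 0.1 to 0.9 at index 100
detector = PageHinkley()
values = [0.1] * 100 + [0.9] * 100
alarms = [i for i, v in enumerate(values) if detector.update(v)]
first_alarm = alarms[0] if alarms else None
```

A larger `threshold` gives fewer false alarms at the cost of slower detection, which is the main tuning trade-off for this kind of test.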
Practical Online Learning with River
River is a widely used Python library for online/streaming ML:

    from river import linear_model, preprocessing, metrics, drift

    # Build an incremental pipeline
    scaler = preprocessing.StandardScaler()
    model = linear_model.LogisticRegression()
    metric = metrics.Accuracy()
    drift_detector = drift.ADWIN()

    # data_stream is assumed to yield (feature dict, label) pairs
    for x, y_true in data_stream:
        # Scale features incrementally (online)
        x_scaled = scaler.transform_one(x)
        scaler.learn_one(x)

        # Predict before learning
        y_pred = model.predict_one(x_scaled)
        metric.update(y_true, y_pred)

        # Learn from the new example
        model.learn_one(x_scaled, y_true)

        # Detect concept drift
        drift_detector.update(int(y_pred != y_true))
        if drift_detector.drift_detected:
            print(f"Drift detected! Accuracy: {metric.get():.3f}")
            metric = metrics.Accuracy()  # reset metric

Online Learning in Production Systems
Real-world examples where online learning is essential:
Recommendation systems: User preferences evolve. Large platforms such as Netflix, YouTube, and Spotify are commonly cited as using some form of incremental updating to keep user representations close to real time.
Fraud detection: Fraudsters adapt. Models that only update weekly fall behind attack patterns that change daily.
Ad click prediction: Billions of ad impressions per day; feedback arrives immediately. Batch retraining can’t keep up with fast-changing click rates.
Financial trading: Market microstructure changes; models must adapt continuously.
Online vs. Batch: When to Use Each
| Factor | Online Learning | Batch Learning |
|---|---|---|
| Data velocity | High (streaming) | Low-medium |
| Concept drift | High risk | Manageable |
| Memory constraint | Tight | Flexible |
| Training stability | Lower | Higher |
| Debugging complexity | High | Lower |
| Industry adoption | Niche | Dominant |
Recommendation: Start with batch learning and periodic retraining (e.g., daily). Move to online learning only when retraining latency or concept drift creates a measurable business impact. The operational complexity of online systems is significant — stream processing infrastructure, drift monitoring, rollback mechanisms.
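One of the operational pieces mentioned above, rollback, can be sketched roughly as follows. The `RollbackGuard` class, its window size, and its error threshold are hypothetical names and values for illustration only; a real system would more likely keep snapshots in a model registry and compare against a held-out baseline.

```python
import copy
from collections import deque


class RollbackGuard:
    """Keeps a known-good model state and restores it when recent errors pile up."""

    def __init__(self, window=50, max_error_rate=0.4):
        self.errors = deque(maxlen=window)  # 1 = wrong prediction, 0 = correct
        self.max_error_rate = max_error_rate
        self.snapshot = None

    def checkpoint(self, state):
        """Store a deep copy so later in-place updates cannot corrupt it."""
        self.snapshot = copy.deepcopy(state)

    def record(self, was_wrong):
        self.errors.append(int(was_wrong))

    def should_rollback(self):
        # Only judge once the window is full, to avoid reacting to a few early errors
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.max_error_rate

    def restore(self):
        """Return a copy of the last checkpoint and reset the error window."""
        self.errors.clear()
        return copy.deepcopy(self.snapshot)


# Hypothetical usage: a bad update is followed by a run of wrong predictions
guard = RollbackGuard()
state = {"weights": [0.1, -0.1]}
guard.checkpoint(state)
state["weights"][0] = 99.0  # simulated harmful online update
for _ in range(50):
    guard.record(was_wrong=True)
rolled_back = guard.should_rollback()
if rolled_back:
    state = guard.restore()
```

Even a guard this small shows where the operational cost comes from: someone has to decide when a state counts as known-good, how long the window should be, and what error rate justifies discarding recent learning.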
2025–2026: Continual Learning
Academic research increasingly focuses on continual learning — neural networks that learn new tasks without forgetting old ones (the “catastrophic forgetting” problem). Methods like EWC (Elastic Weight Consolidation), progressive networks, and memory replay are making large models adaptable without full retraining. This will likely find its way into production LLM fine-tuning pipelines in the coming years.
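To give a sense of how EWC discourages forgetting, here is a minimal sketch of its quadratic penalty, (λ/2) · Σ F_i (θ_i − θ*_i)², where θ* are the weights learned on the old task and F_i is an importance estimate (the diagonal of the Fisher information). The function names and the numeric values are illustrative assumptions; in practice the penalty is added to the new task's loss inside a deep learning framework.

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    """EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (t - t_star) ** 2 for t, t_star, f in zip(theta, theta_star, fisher)
    )


def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty with respect to each parameter."""
    return [lam * f * (t - t_star) for t, t_star, f in zip(theta, theta_star, fisher)]


# Hypothetical values: the first weight mattered a lot for the old task, the second barely
theta_star = [1.0, -2.0]   # weights after training on the old task
fisher = [10.0, 0.1]       # per-parameter importance estimates
theta = [1.5, -1.5]        # current weights while training on the new task

penalty = ewc_penalty(theta, theta_star, fisher, lam=1.0)
grad = ewc_penalty_grad(theta, theta_star, fisher, lam=1.0)
```

Both weights have moved by the same amount, but the gradient pulls the important one back far more strongly, which is how the method protects old knowledge while leaving unimportant weights free to adapt.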