Regularization Techniques: L1, L2, and Early Stopping Compared
Overfitting, covered in Overfitting and Underfitting, has several distinct, complementary countermeasures beyond dropout — L1 and L2 regularization directly penalize large weights, while early stopping addresses the problem from an entirely different angle by controlling training duration itself. Understanding how each one actually works, and when to combine them, turns “add some regularization” from a vague instruction into a specific, deliberate set of choices.
L2 Regularization (Weight Decay): Smoothly Discouraging Large Weights
L2 regularization adds a penalty proportional to the sum of squared weights directly to the loss function, encouraging the optimizer to keep weights small unless there’s a genuinely strong reason (a real reduction in data loss) for them to grow.
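```python
import numpy as np

def loss_with_l2(data_loss, weights, lambda_reg=0.01):
    l2_penalty = lambda_reg * np.sum(weights ** 2)
    return data_loss + l2_penalty
```

```python
import torch.optim as optim

# In PyTorch, L2 regularization is applied directly via the optimizer's weight_decay parameter
# (assumes `model` is an nn.Module defined elsewhere)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)
```

L2 regularization shrinks weights smoothly toward (but rarely exactly to) zero: every weight is nudged slightly smaller on every update, in proportion to its current magnitude. This connects directly to the L2 norm covered in Norms and Distance Metrics. One caveat: with `optim.Adam`, `weight_decay` adds the penalty's gradient before the adaptive scaling, so the shrinkage is only approximately proportional; `optim.AdamW` applies the decay directly to the weights instead.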
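To make the "nudged smaller in proportion to its magnitude" claim concrete, here is a minimal sketch in plain Python, assuming vanilla gradient descent with no momentum or adaptive scaling. The function names and the gradient values are illustrative, not taken from any library. Because the gradient of `lambda_reg * w ** 2` is `2 * lambda_reg * w`, adding the penalty to the loss is the same as multiplying each weight by `1 - 2 * lr * lambda_reg` before applying the data gradient.

```python
import math

def sgd_step_with_l2_penalty(weights, data_grads, lr=0.1, lambda_reg=0.01):
    """One plain gradient step on data_loss + lambda_reg * sum(w ** 2)."""
    return [w - lr * (g + 2 * lambda_reg * w) for w, g in zip(weights, data_grads)]

def sgd_step_with_decay(weights, data_grads, lr=0.1, lambda_reg=0.01):
    """The same step written as: shrink every weight, then apply the data gradient."""
    shrink = 1 - 2 * lr * lambda_reg
    return [shrink * w - lr * g for w, g in zip(weights, data_grads)]

weights = [3.0, -0.5, 0.02]
data_grads = [0.4, -0.1, 0.0]  # made-up gradients of the data loss

penalized = sgd_step_with_l2_penalty(weights, data_grads)
decayed = sgd_step_with_decay(weights, data_grads)

# The two formulations agree up to floating-point rounding
assert all(math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
           for a, b in zip(penalized, decayed))

# With a zero data gradient, the third weight shrinks by a fixed factor but stays nonzero
print(decayed[2])
```

This equivalence is why the penalty is often called "weight decay"; it holds exactly only for plain gradient descent, which is the assumption made here.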
L1 Regularization: Encouraging Sparsity
L1 regularization adds a penalty proportional to the sum of absolute weight values, and unlike L2, it tends to push many weights all the way to exactly zero rather than just shrinking them smaller.
```python
import numpy as np

def loss_with_l1(data_loss, weights, lambda_reg=0.01):
    l1_penalty = lambda_reg * np.sum(np.abs(weights))
    return data_loss + l1_penalty
```

This sparsity-inducing property makes L1 particularly useful when you want a form of automatic feature selection: weights connected to genuinely unimportant inputs are more likely to be driven to exactly zero, removing that input’s influence entirely rather than just reducing it.
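The contrast with L2 shows up clearly on a single weight that receives no data gradient. The sketch below is plain Python and assumes a proximal (soft-thresholding) update for the L1 penalty, which is one common way to handle it rather than the only one; `l1_prox_step` and `l2_shrink_step` are illustrative names. L1 subtracts a fixed amount each step and clips at zero, while L2 multiplies by a factor slightly below one.

```python
def l1_prox_step(w, lr=0.1, lambda_reg=0.01):
    """Soft-thresholding: move w toward zero by lr * lambda_reg, stopping at zero."""
    threshold = lr * lambda_reg
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0

def l2_shrink_step(w, lr=0.1, lambda_reg=0.01):
    """Gradient step on lambda_reg * w ** 2: multiplicative shrinkage."""
    return w * (1 - 2 * lr * lambda_reg)

w_l1 = w_l2 = 0.005  # a small weight that receives no data gradient
for _ in range(10):
    w_l1 = l1_prox_step(w_l1)
    w_l2 = l2_shrink_step(w_l2)

print(w_l1)  # exactly 0.0 once the weight falls inside the threshold
print(w_l2)  # smaller than 0.005, but still positive
```

The fixed-size step is what produces exact zeros: a small weight is removed in a finite number of updates, whereas multiplicative shrinkage only approaches zero.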
Why L2 Is More Common in Deep Learning Specifically
While both are valid, L2 (often implemented as “weight decay” directly in the optimizer) is far more common in deep learning, for a few practical reasons. It produces smoother, more stable optimization behavior. It doesn’t create the non-differentiable kink at zero that L1’s absolute value function has, which requires special handling in gradient-based optimization. And sparsity is less critical in deep networks that already have strong implicit regularization from other sources (batch normalization, dropout, large datasets).
| | L1 Regularization | L2 Regularization (Weight Decay) |
|---|---|---|
| Effect on weights | Pushes many weights to exactly zero | Shrinks weights smoothly, rarely exactly zero |
| Produces sparsity | Yes | No |
| Optimization behavior | Non-smooth at zero, needs care | Smooth, well-behaved gradients |
| Common in deep learning | Less common | Very common (standard optimizer parameter) |
Early Stopping: Regularizing Through Training Duration
Early stopping takes an entirely different approach — instead of modifying the loss function, it monitors validation performance during training and stops before the model has trained long enough to start overfitting.
```python
import torch

# Assumes model, train_loader, val_loader, max_epochs, train_one_epoch, and evaluate
# are defined elsewhere
best_val_loss = float('inf')
patience = 5
patience_counter = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model.pt")  # save the best version seen so far
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Stopping early at epoch {epoch}: no improvement for {patience} epochs")
            break

model.load_state_dict(torch.load("best_model.pt"))  # restore the best checkpoint, not the last one
```

The critical detail here: you restore the best checkpoint (lowest validation loss), not whatever the model looked like at the moment training stopped. The training loop continues up to `patience` epochs past the best point, waiting to confirm the trend, before it actually halts.
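The patience logic can be checked in isolation by replaying a list of validation losses. This is a dependency-free sketch; `early_stopping_replay` and the loss values are hypothetical, made up purely for illustration.

```python
def early_stopping_replay(val_losses, patience=5):
    """Replay per-epoch validation losses; return (best_epoch, stop_epoch).

    stop_epoch is None if the patience limit is never reached.
    """
    best_val_loss = float("inf")
    best_epoch = None
    patience_counter = 0

    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch  # this is where a checkpoint would be saved
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up curve: improves until epoch 3, then drifts upward
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.63, 0.65]
best_epoch, stop_epoch = early_stopping_replay(losses, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```

Note that epoch 6 (0.57) is lower than epoch 5 (0.58) but still counts as "no improvement", because the comparison is always against the best loss seen so far, not the previous epoch.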
Combining Multiple Regularization Techniques
These techniques aren’t mutually exclusive — a typical, well-regularized training setup combines several simultaneously, each addressing a slightly different aspect of overfitting.
```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # dropout regularization
    nn.Linear(256, 10),
)

optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)  # L2 regularization

# Plus early stopping monitoring validation loss, as shown above
```

There’s no single “correct” combination. The right mix depends on how much overfitting you actually observe on your specific problem, following the diagnostic approach covered in Overfitting and Underfitting: add regularization in proportion to the gap between training and validation performance, rather than applying every technique maximally by default.
Elastic Net: Combining L1 and L2
For cases where both sparsity and smooth weight shrinkage are desirable, Elastic Net combines both penalties in a single loss term, each with its own coefficient (often expressed instead as one overall strength plus a mixing parameter).
```python
import numpy as np

def loss_with_elastic_net(data_loss, weights, lambda_l1=0.01, lambda_l2=0.01):
    l1_penalty = lambda_l1 * np.sum(np.abs(weights))
    l2_penalty = lambda_l2 * np.sum(weights ** 2)
    return data_loss + l1_penalty + l2_penalty
```

This is more common in classical machine learning (linear and logistic regression with Elastic Net regularization) than in deep learning, where L2/weight decay alone is typically sufficient given the regularization already provided by dropout, batch normalization, and large training datasets. It remains a useful option when sparsity is specifically desired alongside general weight shrinkage in a deep learning context.
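The two-coefficient form can also be written with a single overall strength and a mixing parameter. The sketch below uses one such convention in plain Python; the names `elastic_net_penalty`, `strength`, and `l1_ratio` are assumptions for illustration, and libraries differ in scaling details (some, for example, put a factor of 1/2 on the squared term).

```python
def elastic_net_penalty(weights, strength=0.01, l1_ratio=0.5):
    """strength * (l1_ratio * sum|w| + (1 - l1_ratio) * sum w^2)."""
    if not 0.0 <= l1_ratio <= 1.0:
        raise ValueError("l1_ratio must be between 0 and 1")
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return strength * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)

weights = [1.0, -2.0, 0.0]
pure_l1 = elastic_net_penalty(weights, strength=1.0, l1_ratio=1.0)  # 3.0: sum of |w|
pure_l2 = elastic_net_penalty(weights, strength=1.0, l1_ratio=0.0)  # 5.0: sum of w^2
mixed = elastic_net_penalty(weights, strength=1.0, l1_ratio=0.5)    # 4.0: halfway between
```

Under this convention, `lambda_l1` corresponds to `strength * l1_ratio` and `lambda_l2` to `strength * (1 - l1_ratio)`, so tuning one knob trades sparsity against smooth shrinkage while the other sets the overall amount of regularization.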
Summary
| Technique | Mechanism | Best for |
|---|---|---|
| L1 regularization | Penalizes absolute weight values, induces sparsity | Feature selection, sparse models |
| L2 / weight decay | Penalizes squared weight values, smooth shrinkage | Standard, default choice for most deep networks |
| Early stopping | Halts training before overfitting sets in | Nearly always worth using, regardless of other techniques |
None of these techniques are mutually exclusive with dropout, covered separately in Dropout — a well-regularized modern training setup typically combines weight decay, dropout, and early stopping together, each addressing overfitting from a genuinely different angle.