Regularization Techniques: L1, L2, and Early Stopping Compared

Overfitting, covered in Overfitting and Underfitting, has several complementary countermeasures beyond dropout. L1 and L2 regularization penalize large weights directly, while early stopping attacks the problem from a different angle by limiting how long training runs. Understanding how each one works, and when to combine them, turns “add some regularization” from a vague instruction into a deliberate set of choices.

L2 Regularization (Weight Decay): Smoothly Discouraging Large Weights

L2 regularization adds a penalty proportional to the sum of squared weights directly to the loss function, encouraging the optimizer to keep weights small unless there’s a genuinely strong reason (a real reduction in data loss) for them to grow.

import numpy as np

def loss_with_l2(data_loss, weights, lambda_reg=0.01):
    l2_penalty = lambda_reg * np.sum(weights ** 2)
    return data_loss + l2_penalty

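import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 10)  # placeholder model so the snippet runs on its own

# In PyTorch, weight_decay adds weight_decay * w to each parameter's gradient,
# which corresponds to an L2 penalty of (weight_decay / 2) * sum(w ** 2).
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)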

L2 regularization shrinks weights smoothly toward (but rarely exactly to) zero. Under plain gradient descent, every weight is nudged slightly smaller on every update, in proportion to its current magnitude. The penalty itself is the squared L2 norm covered in Norms and Distance Metrics.
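To see where the proportional shrinkage comes from, here is a minimal NumPy sketch of one plain gradient-descent step on the penalty term alone, with the data loss ignored. The function name l2_only_step and the learning rate are illustrative assumptions, not part of any library.

```python
import numpy as np

def l2_only_step(weights, lr=0.1, lambda_reg=0.01):
    """One plain gradient-descent step on the L2 penalty alone (data loss ignored)."""
    # d/dw [lambda * sum(w ** 2)] = 2 * lambda * w
    penalty_grad = 2.0 * lambda_reg * weights
    return weights - lr * penalty_grad

weights = np.array([2.0, -0.5, 0.01])
updated = l2_only_step(weights)

# Each weight is multiplied by the same factor, 1 - 2 * lr * lambda_reg,
# so large weights move more in absolute terms but none lands exactly on zero.
shrink_factor = 1.0 - 2.0 * 0.1 * 0.01
print(updated / weights)   # every entry equals shrink_factor
```

With an adaptive optimizer such as Adam the per-step effect is rescaled by the optimizer's own statistics, so this exact factor applies only to plain gradient descent.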

L1 Regularization: Encouraging Sparsity

L1 regularization adds a penalty proportional to the sum of absolute weight values. Unlike L2, it tends to push many weights all the way to exactly zero rather than merely shrinking them.

def loss_with_l1(data_loss, weights, lambda_reg=0.01):
    l1_penalty = lambda_reg * np.sum(np.abs(weights))
    return data_loss + l1_penalty

This sparsity-inducing property makes L1 particularly useful when you want a form of automatic feature selection — weights connected to genuinely unimportant inputs are more likely to be driven to exactly zero, effectively removing that input’s influence entirely, rather than just reducing it.
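One common way to obtain exact zeros in practice is a proximal (soft-thresholding) update rather than a plain gradient step on the absolute value. The sketch below is an illustration and not taken from the original text: soft_threshold and l2_shrink are hypothetical helper names, and the threshold and shrink factor are arbitrary values chosen to make the contrast visible.

```python
import numpy as np

def soft_threshold(weights, threshold):
    """Proximal update for an L1 penalty: move each weight toward zero by
    `threshold`, and clamp to exactly zero anything that would cross it."""
    return np.sign(weights) * np.maximum(np.abs(weights) - threshold, 0.0)

def l2_shrink(weights, factor):
    """Multiplicative shrinkage typical of an L2 penalty, for comparison."""
    return weights * factor

weights = np.array([1.5, -0.8, 0.03, -0.02, 0.0])

l1_result = soft_threshold(weights, threshold=0.05)
l2_result = l2_shrink(weights, factor=0.95)

# Count exact zeros: L1 zeroes the two small weights, L2 only shrinks them.
l1_zeros = int(np.sum(l1_result == 0.0))
l2_zeros = int(np.sum(l2_result == 0.0))
print("zeros after L1 update:", l1_zeros)
print("zeros after L2 update:", l2_zeros)
```

The large weights lose the same fixed amount under the L1 update regardless of their size, while the small ones are removed entirely, which is the feature-selection behavior described above.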

Why L2 Is More Common in Deep Learning Specifically

While both are valid, L2 is far more common in deep learning, for a few practical reasons. It produces smoother, more stable optimization behavior. It avoids the non-differentiable kink at zero in L1’s absolute value function: plain gradient steps tend to oscillate around zero rather than land on it, so exact sparsity usually needs special handling such as a proximal update. And sparsity matters less in deep networks that already have strong regularization from other sources (batch normalization, dropout, large datasets). L2 is often implemented as “weight decay” directly in the optimizer. The two are equivalent for plain SGD up to a rescaling of the coefficient, but they differ for adaptive optimizers such as Adam, which is why PyTorch also provides AdamW with decoupled weight decay.

Property	L1 Regularization	L2 Regularization (Weight Decay)
Effect on weights	Pushes many weights to exactly zero	Shrinks weights smoothly, rarely exactly zero
Produces sparsity	Yes	No
Optimization behavior	Non-smooth at zero, needs care	Smooth, well-behaved gradients
Common in deep learning	Less common	Very common (standard optimizer parameter)

Early Stopping: Regularizing Through Training Duration

Early stopping takes an entirely different approach. Instead of modifying the loss function, it monitors validation performance during training and halts once that performance stops improving, before further training can overfit.

import torch

# Assumes model, train_loader, val_loader, max_epochs, train_one_epoch,
# and evaluate are defined elsewhere in your training script.
best_val_loss = float('inf')
patience = 5            # epochs to wait without improvement before stopping
patience_counter = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), "best_model.pt")   # save the best weights seen so far
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Stopping early at epoch {epoch}: no improvement for {patience} epochs")
            break

model.load_state_dict(torch.load("best_model.pt"))   # restore the best checkpoint, not the last one

The critical detail here: you restore the best checkpoint (lowest validation loss), not whatever the model looked like at the moment training stopped. When early stopping triggers, the loop has already run several epochs past the best point, because the patience counter waits to confirm the trend before halting.
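The gap between the best epoch and the stopping epoch is easy to see with simulated numbers. The following is a minimal, framework-free sketch: the EarlyStopping class, its min_delta parameter, and the validation losses are all made up for illustration and do not come from any library.

```python
class EarlyStopping:
    """Tracks the best validation loss and signals when to stop."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.counter = 0

    def step(self, epoch, val_loss):
        """Record one epoch. Returns (improved, should_stop)."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.counter = 0
            return True, False          # caller should save a checkpoint here
        self.counter += 1
        return False, self.counter >= self.patience

# Simulated validation losses: improve until epoch 3, then drift upward.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.62, 0.65]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    improved, should_stop = stopper.step(epoch, val_loss)
    if should_stop:
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)   # prints "3 6"
```

Training halts at epoch 6, three epochs after the best validation loss at epoch 3, so the weights in memory at that point are not the ones you want to keep.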

Combining Multiple Regularization Techniques

These techniques aren’t mutually exclusive — a typical, well-regularized training setup combines several simultaneously, each addressing a slightly different aspect of overfitting.

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.3),        # dropout regularization
    nn.Linear(256, 10)
)

optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)  # L2 regularization

# Plus early stopping monitoring validation loss, as shown above

There’s no single “correct” combination. The right mix depends on how much overfitting you actually observe on your problem. Following the diagnostic approach covered in Overfitting and Underfitting, add regularization in proportion to the gap between training and validation performance, rather than applying every technique at full strength by default.

Elastic Net: Combining L1 and L2

For cases where both sparsity and smooth weight shrinkage are desirable, Elastic Net adds both penalties to the loss. The two terms can be weighted by separate coefficients, as in the function below, or by a single overall strength plus a mixing ratio.

def loss_with_elastic_net(data_loss, weights, lambda_l1=0.01, lambda_l2=0.01):
    l1_penalty = lambda_l1 * np.sum(np.abs(weights))
    l2_penalty = lambda_l2 * np.sum(weights ** 2)
    return data_loss + l1_penalty + l2_penalty

Elastic Net is more common in classical machine learning (linear and logistic regression) than in deep learning, where L2/weight decay alone is typically sufficient given the regularization already provided by dropout, batch normalization, and large training datasets. It remains a useful option when you want sparsity alongside general weight shrinkage in a deep network.
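The mixing-ratio form of the Elastic Net penalty can be sketched as follows. This assumes the convention used by, for example, scikit-learn's ElasticNet, where alpha sets the overall strength, l1_ratio sets the mix, and the L2 term carries a factor of one half. The function names here are hypothetical.

```python
import numpy as np

def elastic_net_penalty(weights, alpha=0.02, l1_ratio=0.5):
    """Elastic Net penalty with one overall strength (alpha) and one
    mixing ratio (l1_ratio): 1.0 is pure L1, 0.0 is pure L2."""
    l1_term = l1_ratio * np.sum(np.abs(weights))
    l2_term = 0.5 * (1.0 - l1_ratio) * np.sum(weights ** 2)
    return alpha * (l1_term + l2_term)

def to_separate_lambdas(alpha, l1_ratio):
    """Convert (alpha, l1_ratio) to the two-coefficient form (lambda_l1, lambda_l2)."""
    return alpha * l1_ratio, 0.5 * alpha * (1.0 - l1_ratio)

weights = np.array([0.5, -1.0, 2.0])

mixed_form = elastic_net_penalty(weights, alpha=0.02, l1_ratio=0.5)

# The same penalty written with two separate coefficients.
lambda_l1, lambda_l2 = to_separate_lambdas(alpha=0.02, l1_ratio=0.5)
two_lambda_form = (lambda_l1 * np.sum(np.abs(weights))
                   + lambda_l2 * np.sum(weights ** 2))

print(mixed_form, two_lambda_form)   # the two forms agree
```

The two parameterizations describe the same family of penalties. The mixing-ratio form is convenient when you want to tune overall strength and sparsity preference independently.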

Summary

Technique	Mechanism	Best for
L1 regularization	Penalizes absolute weight values, induces sparsity	Feature selection, sparse models
L2 / weight decay	Penalizes squared weight values, smooth shrinkage	Standard, default choice for most deep networks
Early stopping	Halts training before overfitting sets in	Nearly always worth using, regardless of other techniques

None of these techniques is mutually exclusive with dropout, covered separately in Dropout. A well-regularized modern training setup typically combines weight decay, dropout, and early stopping, each addressing overfitting from a different angle.

Written by NPBlue Engineering Team: practitioners who write every guide from hands-on production experience, not paraphrased documentation.

Reviewed for technical accuracy.