Regularization Techniques: Controlling Model Complexity in Machine Learning

Learn regularization in ML — L1 Lasso, L2 Ridge, ElasticNet, early stopping, weight decay, when to use each technique, and how regularization prevents overfitting.

Regularization

Regularization adds a penalty to the model’s objective function to discourage excessive complexity. Without it, models learn every quirk of the training data — including noise. With it, models focus on patterns that generalize to new data.


The Core Idea

Standard loss: L(θ) = Error(predictions, targets)
Regularized loss: L(θ) = Error(predictions, targets) + λ × Complexity(θ)
λ (lambda): regularization strength
λ = 0: no regularization
λ → ∞: all penalized weights shrink to zero, leaving a constant prediction (maximum simplification)
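
To make the role of λ concrete, here is a minimal sketch using the closed-form ridge solution on synthetic data. The helper name ridge_weights, the data, and the λ values are illustrative assumptions, and the intercept is omitted for brevity.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# The norm of the fitted weights shrinks as lambda grows.
lambdas = [0.0, 1.0, 100.0, 1e6]
norms = [float(np.linalg.norm(ridge_weights(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:>9}: ||w|| = {norm:.4f}")
```

With λ = 0 this is ordinary least squares; as λ grows, the penalty dominates and the weights collapse toward zero.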

L2 Regularization (Ridge)

Adds the sum of squared weights to the loss. Shrinks all weights toward zero, but never exactly to zero:

L(θ) = MSE + λ Σ wᵢ²
Effect: Large weights are heavily penalized.
All features remain in the model.
Best when many features contribute moderately.
from sklearn.linear_model import Ridge, RidgeCV
# Fixed alpha
ridge = Ridge(alpha=1.0) # alpha = λ
ridge.fit(X_train, y_train)
# Cross-validated alpha selection
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")

L1 Regularization (Lasso)

Adds the sum of absolute weights to the loss. Drives some weights exactly to zero — performs automatic feature selection:

L(θ) = MSE + λ Σ |wᵢ|
Effect: Sparse solutions — some features are completely excluded.
Better when only a few features are truly important.
import pandas as pd
from sklearn.linear_model import Lasso, LassoCV
lasso = Lasso(alpha=0.1, max_iter=5000)
lasso.fit(X_train, y_train)
# Check which features were selected (non-zero coefficients)
# feature_names: column names of X_train, assumed to be defined elsewhere
coef = pd.Series(lasso.coef_, index=feature_names)
selected = coef[coef != 0]
print(f"Selected {len(selected)} of {len(feature_names)} features")
# Cross-validated
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=5000)
lasso_cv.fit(X_train, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.6f}")
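
The difference between the two penalties is easiest to see side by side. The sketch below fits Ridge and Lasso on a synthetic problem and counts coefficients that are exactly zero; the dataset and the alpha values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: 20 features, only 5 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=5.0, max_iter=5000).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty.
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
print(f"Ridge: {ridge_zeros} of {X_demo.shape[1]} coefficients are exactly zero")
print(f"Lasso: {lasso_zeros} of {X_demo.shape[1]} coefficients are exactly zero")
```

Ridge keeps every feature with a shrunken weight, while Lasso removes some features entirely.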

ElasticNet: Best of Both

Combines L1 and L2. The l1_ratio parameter controls the mix:

from sklearn.linear_model import ElasticNet, ElasticNetCV
# l1_ratio=1 → pure Lasso, l1_ratio=0 → pure Ridge
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
elastic.fit(X_train, y_train)
# Cross-validated
elastic_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0], cv=5)
elastic_cv.fit(X_train, y_train)
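
To show how alpha and l1_ratio combine, the sketch below computes the penalty term in the form scikit-learn documents for ElasticNet: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The function name and the example weights are hypothetical.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term of the ElasticNet objective (data-fit term not included)."""
    w = np.asarray(w, dtype=float)
    l1_term = alpha * l1_ratio * np.sum(np.abs(w))
    l2_term = 0.5 * alpha * (1.0 - l1_ratio) * np.sum(w ** 2)
    return float(l1_term + l2_term)

w_example = [3.0, -4.0, 0.0]  # made-up weight vector
pure_l1 = elastic_net_penalty(w_example, alpha=0.1, l1_ratio=1.0)  # Lasso-style penalty
pure_l2 = elastic_net_penalty(w_example, alpha=0.1, l1_ratio=0.0)  # Ridge-style penalty
mixed = elastic_net_penalty(w_example, alpha=0.1, l1_ratio=0.5)    # even mix
print(pure_l1, pure_l2, mixed)
```

Note the factor 0.5 on the L2 part: an ElasticNet alpha is therefore not directly interchangeable with the alpha passed to Ridge.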

Weight Decay in Neural Networks (L2 via Optimizer)

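For neural networks, L2 regularization is usually applied through the optimizer's weight_decay argument. In torch.optim.Adam, weight_decay adds λ·w to each gradient (an L2 penalty); torch.optim.AdamW instead applies decoupled weight decay, which behaves differently under adaptive learning rates: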

import torch.optim as optim
# weight_decay is the λ parameter
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Different weight decay for different parameter groups
optimizer = optim.Adam([
    {'params': model.feature_layers.parameters(), 'weight_decay': 1e-4},
    {'params': model.classifier.parameters(), 'weight_decay': 1e-3}  # Stronger on final layers
], lr=1e-3)
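
The name "weight decay" comes from plain SGD, where adding the L2 gradient is the same as shrinking the weights by a constant factor each step. The NumPy sketch below checks that identity for a single update; the weights, gradient, learning rate, and λ are made-up values.

```python
import numpy as np

lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -0.1, 0.2])  # made-up gradient of the data loss

# (a) L2 penalty: add lam * w to the gradient, then take a plain SGD step.
w_l2 = w - lr * (grad + lam * w)

# (b) Weight decay: shrink the weights, then step on the data gradient only.
w_decay = (1.0 - lr * lam) * w - lr * grad

same_update = bool(np.allclose(w_l2, w_decay))
print(same_update)
```

The identity holds only because SGD applies the gradient unscaled. Optimizers that rescale gradients per parameter break it.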

Other Regularization Techniques

Technique             Mechanism                        Best For
Dropout               Random neuron deactivation       Dense layers in neural networks
Early stopping        Stop at minimum validation loss  Neural networks
Data augmentation     Expand training variety          Images, text
Batch normalization   Normalize layer inputs           Deep networks
Max-norm constraint   Clip weight vectors              Neural networks
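
As an illustration of one row in the table, here is a minimal, framework-free sketch of early stopping with a patience counter. The function name, the patience value, and the loss sequence are hypothetical; a real training loop would also checkpoint and restore the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; best_epoch is the epoch with the lowest loss so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Ran out of epochs before patience was exhausted.
    return best_epoch, len(val_losses) - 1

# Made-up validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=3)
print(f"Best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```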

Choosing Regularization Strength

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
alphas = np.logspace(-4, 4, 100) # 10^-4 to 10^4
cv_scores = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Negate for MSE
plt.semilogx(alphas, cv_scores)
plt.xlabel('Alpha (log scale)')
plt.ylabel('CV MSE')
plt.title('Ridge: Cross-Validated MSE vs. Alpha')
plt.axvline(alphas[np.argmin(cv_scores)], color='r', linestyle='--',
            label=f'Best alpha: {alphas[np.argmin(cv_scores)]:.4f}')
plt.legend()
plt.show()
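
Because L1 and L2 penalties act on coefficient magnitudes, the best alpha depends on the scale of the features. One common way to handle this, sketched here under the assumption of synthetic data and an arbitrary alpha grid, is to standardize inside a pipeline and search alpha over the whole pipeline, so the scaler is re-fit on each training fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, with one feature deliberately put on a much larger scale.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_demo[:, 0] *= 1000.0

# Scaling happens inside the pipeline, so each CV fold is standardized
# using statistics from its own training portion only.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(-4, 4, 20)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha:.4f}, CV MSE: {-search.best_score_:.2f}")
```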

L1 vs L2 Decision Guide

Use L1 (Lasso) when:
→ You believe only a few features are truly relevant
→ Interpretability matters (sparse coefficients)
→ Feature count >> sample count
Use L2 (Ridge) when:
→ Many features each contribute a small, comparable effect
→ Features are correlated (Lasso tends to keep one and drop the rest; Ridge spreads weight across them)
→ You need stable coefficient estimates
Use ElasticNet when:
→ Both sparsity and stability are needed
→ Correlated features, but you still want some to be excluded
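
The correlated-features point in the guide can be checked on an extreme case: two identical columns. This is a sketch on made-up data with arbitrary alpha values. Ridge has a unique solution that splits the weight evenly, whereas for Lasso any split of the same total has the same penalty, so the result depends on the solver.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_dup = np.column_stack([x, x])  # two identical (perfectly correlated) columns
y_dup = 3.0 * x + rng.normal(scale=0.1, size=200)

ridge_dup = Ridge(alpha=1.0).fit(X_dup, y_dup)
lasso_dup = Lasso(alpha=0.1, max_iter=5000).fit(X_dup, y_dup)

# Ridge shares the effect between the duplicates; compare with Lasso's split.
print("Ridge coefficients:", ridge_dup.coef_)
print("Lasso coefficients:", lasso_dup.coef_)
```

With real data the columns are rarely identical, but the same tendency shows up as unstable Lasso selections among highly correlated features.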

Regularization is not optional for most real-world ML models. Even tree-based models are regularized, not through a penalty term but through structural limits such as maximum depth and minimum samples per leaf. The only question is which form and how much.