Regularization

Regularization adds a penalty to the model’s objective function to discourage excessive complexity. Without it, a flexible model can fit every quirk of the training data, including noise. With it, the model is pushed toward patterns that generalize to new data.

The Core Idea

Standard loss:     L(θ) = Error(predictions, targets)
Regularized loss:  L(θ) = Error(predictions, targets) + λ × Complexity(θ)

λ (lambda): regularization strength
  λ = 0: no regularization
  λ → ∞: weights shrink toward zero, so the model approaches a constant prediction (maximum simplification)
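
As a rough illustration of the regularized loss above, the following sketch uses MSE as the error term and the sum of squared weights as the complexity term. The function name regularized_loss and the toy data are invented for this example.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """MSE plus lam times the sum of squared weights (one possible complexity term)."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    complexity = np.sum(w ** 2)
    return mse + lam * complexity

# Toy data, invented for illustration only
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])  # fits the toy data exactly, so the MSE term is 0

for lam in (0.0, 0.1, 1.0):
    print(lam, regularized_loss(w, X, y, lam))
```

Because these weights fit the toy data exactly, the whole loss here comes from the penalty, and it grows in proportion to λ.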

L2 Regularization (Ridge)

Adds the sum of squared weights to the loss. Shrinks all weights toward zero, but generally not exactly to zero:

L(θ) = MSE + λ Σ wᵢ²

Effect: Large weights are heavily penalized.
        All features remain in the model.
        Best when many features contribute moderately.

from sklearn.linear_model import Ridge, RidgeCV

# Fixed alpha
ridge = Ridge(alpha=1.0)  # alpha plays the role of λ (up to how the error term is scaled)
ridge.fit(X_train, y_train)

# Cross-validated alpha selection
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
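
To see the shrinkage directly, here is a minimal NumPy sketch of the ridge solution in closed form. It assumes no intercept and a sum-of-squared-errors data term, so λ here is not on the same scale as in the MSE formula above. The helper ridge_closed_form and the synthetic data are made up for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min ||y - Xw||^2 + lam * ||w||^2 (no intercept) via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

norms = []
for lam in (0.0, 1.0, 10.0, 100.0):
    w = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lam={lam:>6}: ||w|| = {norms[-1]:.3f}, exact zeros = {np.sum(w == 0)}")
```

The weight norm should fall as lam grows, while the count of exactly-zero weights is expected to stay at zero.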

L1 Regularization (Lasso)

Adds the sum of the absolute values of the weights to the loss. Drives some weights exactly to zero, which acts as automatic feature selection:

L(θ) = MSE + λ Σ |wᵢ|

Effect: Sparse solutions — some features are completely excluded.
        Better when only a few features are truly important.

from sklearn.linear_model import Lasso, LassoCV

lasso = Lasso(alpha=0.1, max_iter=5000)
lasso.fit(X_train, y_train)

# Check which features were selected (non-zero coefficients)
import pandas as pd
coef = pd.Series(lasso.coef_, index=feature_names)
selected = coef[coef != 0]
print(f"Selected {len(selected)} of {len(feature_names)} features")

# Cross-validated
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=5000)
lasso_cv.fit(X_train, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.6f}")
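
One way to see why L1 produces exact zeros while L2 does not: under the simplifying assumption of orthonormal features (and a suitably scaled objective), the Lasso solution is the least-squares solution passed through a soft-threshold, whereas the Ridge solution is a uniform rescaling. The sketch below illustrates that special case only. The function names and the example coefficients are invented.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries with |z| <= t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, lam):
    """Ridge counterpart under the same orthonormal assumption: uniform scaling."""
    return z / (1.0 + lam)

# Hypothetical unregularized coefficients
ols = np.array([3.0, -0.4, 0.2, -1.5])

lasso_like = soft_threshold(ols, 0.5)
ridge_like = ridge_shrink(ols, 0.5)
print("L1:", lasso_like, "non-zero:", np.count_nonzero(lasso_like))
print("L2:", ridge_like, "non-zero:", np.count_nonzero(ridge_like))
```

Small coefficients fall inside the threshold and are removed entirely under L1. Under L2 every coefficient is scaled down but stays non-zero.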

ElasticNet: Best of Both

Combines L1 and L2. The l1_ratio parameter controls the mix:

from sklearn.linear_model import ElasticNet, ElasticNetCV

# l1_ratio=1 → pure Lasso, l1_ratio=0 → pure Ridge
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
elastic.fit(X_train, y_train)

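# Cross-validated over both alpha and l1_ratio
elastic_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0], cv=5, max_iter=5000)
elastic_cv.fit(X_train, y_train)
print(f"Best alpha: {elastic_cv.alpha_:.6f}, best l1_ratio: {elastic_cv.l1_ratio_}")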
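
To make the mix concrete, this sketch computes the penalty term in the form scikit-learn documents for ElasticNet: alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2. The helper elastic_net_penalty is written for illustration and is not part of scikit-learn.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term only (the data-fit term is omitted)."""
    l1 = np.sum(np.abs(w))
    l2_squared = np.sum(w ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared

w = np.array([1.0, -2.0, 0.0])  # example weights, chosen arbitrarily

for ratio in (0.0, 0.5, 1.0):
    print(f"l1_ratio={ratio}: penalty = {elastic_net_penalty(w, alpha=0.1, l1_ratio=ratio):.3f}")
```

At l1_ratio=1 only the L1 part remains, and at l1_ratio=0 only the squared L2 part remains.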

Weight Decay in Neural Networks (L2 via Optimizer)

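For neural networks, L2 regularization is usually applied through the optimizer's weight_decay argument. In PyTorch, Adam's weight_decay adds an L2 term to the gradient, while AdamW applies decoupled weight decay, which behaves differently under adaptive learning rates: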

import torch.optim as optim

# weight_decay is the λ parameter
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Different weight decay for different parameter groups
optimizer = optim.Adam([
    {'params': model.feature_layers.parameters(), 'weight_decay': 1e-4},
    {'params': model.classifier.parameters(), 'weight_decay': 1e-3}  # Stronger on final layers
], lr=1e-3)
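
The link between an L2 penalty and weight decay can be checked by hand for a single plain SGD step. The sketch below uses NumPy rather than PyTorch, and the weights, gradient, learning rate, and decay values are all made up. The equivalence shown holds for plain SGD and should not be assumed for adaptive optimizers such as Adam.

```python
import numpy as np

lr, wd = 0.1, 0.01                     # learning rate and weight decay (arbitrary)
w = np.array([1.0, -2.0, 0.5])         # current weights (arbitrary)
grad = np.array([0.2, -0.1, 0.4])      # gradient of the data loss (made-up values)

# L2 view: add the gradient of (wd / 2) * ||w||^2, which is wd * w
w_l2 = w - lr * (grad + wd * w)

# Weight-decay view: shrink weights by (1 - lr * wd), then take the usual step
w_decay = (1.0 - lr * wd) * w - lr * grad

print(np.allclose(w_l2, w_decay))
```

Both updates give the same weights, which is why the two terms are often used interchangeably for SGD.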

Other Regularization Techniques

Technique	Mechanism	Best For
Dropout	Random neuron deactivation	Dense layers in neural networks
Early stopping	Stop at minimum validation loss	Neural networks
Data augmentation	Expand training variety	Images, text
Batch normalization	Normalize layer inputs (regularizes as a side effect)	Deep networks
Max-norm constraint	Cap the norm of each weight vector	Neural networks
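
As an example of one row in the table, early stopping can be sketched as a patience rule over per-epoch validation losses. The function early_stopping_epoch, the patience value, and the loss sequence are hypothetical. A real training loop would also save the model weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1  # ran out of epochs without triggering a stop

# Made-up validation losses: improvement stalls after epoch 3
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
print(early_stopping_epoch(losses, patience=3))
```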

Choosing Regularization Strength

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

alphas = np.logspace(-4, 4, 100)  # 10^-4 to 10^4
cv_scores = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Negate for MSE

best_alpha = alphas[np.argmin(cv_scores)]  # alpha with the lowest CV MSE

plt.semilogx(alphas, cv_scores)
plt.xlabel('Alpha (log scale)')
plt.ylabel('CV MSE')
plt.title('Validation Curve: Ridge')
plt.axvline(best_alpha, color='r', linestyle='--',
            label=f'Best alpha: {best_alpha:.4f}')
plt.legend()
plt.show()

L1 vs L2 Decision Guide

Use L1 (Lasso) when:
  → You believe only a few features are truly relevant
  → Interpretability matters (sparse coefficients)
  → Feature count >> sample count (though Lasso selects at most as many features as there are samples)

Use L2 (Ridge) when:
  → Many features each contribute a small or moderate effect
  → Features are correlated (Lasso picks one arbitrarily, Ridge spreads)
  → You need stable coefficient estimates

Use ElasticNet when:
  → Both sparsity and stability are needed
  → Correlated features, but you still want some to be excluded
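
The "Ridge spreads" behavior on correlated features can be illustrated with an extreme case: two identical columns. This is a minimal sketch with synthetic data, a no-intercept closed-form ridge solve, and an arbitrary λ of 1.0. Lasso is not run here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x])                      # two identical (perfectly correlated) features
y = 2.0 * x + rng.normal(scale=0.1, size=100)    # true combined effect is 2.0

lam = 1.0
# Ridge solution for min ||y - Xw||^2 + lam * ||w||^2
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(w, w.sum())
```

Ridge splits the combined effect evenly between the two copies. An L1 penalty has no such preference, so which copy it keeps can depend on the solver.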

Regularization is not optional for most real-world ML models. Even tree-based models are regularized, through depth limits and minimum sample requirements. The only question is which form and how much.