Regularization
Regularization adds a penalty to the model's objective function to discourage excessive complexity. Without it, a flexible model can fit every quirk of the training data, including noise. With it, the model is pushed toward patterns that generalize to new data.
The Core Idea
Standard loss: L(θ) = Error(predictions, targets)
Regularized loss: L(θ) = Error(predictions, targets) + λ × Complexity(θ)

λ (lambda): regularization strength
λ = 0: no regularization
λ → ∞: model approaches zero/constant (maximum simplification)

L2 Regularization (Ridge)
Adds the sum of squared weights to the loss. This shrinks all weights toward zero but, in general, does not set any of them exactly to zero:
L(θ) = MSE + λ Σ wᵢ²
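To make the shrinkage concrete, here is a minimal sketch in plain Python. It assumes a single-feature linear model with no intercept, for which the L2-penalized loss above has a closed-form minimizer. The function name `ridge_weight_1d` and the toy data are illustrative only.

```python
def ridge_weight_1d(xs, ys, lam):
    """Minimizer of mean((y - w*x)^2) + lam * w^2 for a single weight w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n  # mean of x*y
    sxx = sum(x * x for x in xs) / n              # mean of x^2
    # Setting the derivative to zero gives w = sxy / (sxx + lam).
    return sxy / (sxx + lam)


# Toy data that follows y = 2x exactly (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:>6}: w = {ridge_weight_1d(xs, ys, lam):.4f}")
```

With λ = 0 the unpenalized weight of 2 is recovered. As λ grows, the weight moves toward zero but stays positive, because the penalty only enlarges the denominator.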
Effect: Large weights are heavily penalized. All features remain in the model. Best when many features contribute moderately.

    from sklearn.linear_model import Ridge, RidgeCV

    # Fixed alpha
    ridge = Ridge(alpha=1.0)  # alpha plays the role of λ
    ridge.fit(X_train, y_train)

    # Cross-validated alpha selection
    ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
    ridge_cv.fit(X_train, y_train)
    print(f"Best alpha: {ridge_cv.alpha_}")

L1 Regularization (Lasso)
Adds the sum of absolute weights to the loss. This drives some weights exactly to zero, which amounts to automatic feature selection:
L(θ) = MSE + λ Σ |wᵢ|
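A small sketch can show why the absolute-value penalty produces exact zeros. It assumes a single-feature linear model with no intercept, where the minimizer of the loss above is given by the soft-thresholding operator. The names `soft_threshold` and `lasso_weight_1d` and the toy data are illustrative only.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_weight_1d(xs, ys, lam):
    """Minimizer of mean((y - w*x)^2) + lam * |w| for a single weight w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n  # mean of x*y
    sxx = sum(x * x for x in xs) / n              # mean of x^2
    # Subgradient condition: 2*sxx*w - 2*sxy + lam*sign(w) = 0
    return soft_threshold(sxy, lam / 2.0) / sxx


# Toy data that follows y = 2x exactly (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for lam in [0.0, 10.0, 30.0, 40.0]:
    print(f"lambda={lam:>5}: w = {lasso_weight_1d(xs, ys, lam):.4f}")
```

Once λ/2 reaches the size of the data term (here mean(x·y) = 15), the weight is exactly zero and stays there for any larger λ.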
Effect: Sparse solutions, in which some features are completely excluded. Better when only a few features are truly important.

    import pandas as pd
    from sklearn.linear_model import Lasso, LassoCV

    lasso = Lasso(alpha=0.1, max_iter=5000)
    lasso.fit(X_train, y_train)

    # Check which features were selected (non-zero coefficients)
    coef = pd.Series(lasso.coef_, index=feature_names)
    selected = coef[coef != 0]
    print(f"Selected {len(selected)} of {len(feature_names)} features")

    # Cross-validated
    lasso_cv = LassoCV(cv=5, random_state=42, max_iter=5000)
    lasso_cv.fit(X_train, y_train)
    print(f"Best alpha: {lasso_cv.alpha_:.6f}")

ElasticNet: Best of Both
ElasticNet combines the L1 and L2 penalties. The l1_ratio parameter controls the mix, and alpha sets the overall strength.
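As a rough illustration of the mix, the sketch below computes only the penalty term, in the form given in the scikit-learn documentation for ElasticNet: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper name `elastic_net_penalty` and the example weights are made up for this sketch.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty term only: weighted mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


# Example weights (illustrative only)
weights = [0.5, -2.0, 0.0, 1.5]

for l1_ratio in [0.0, 0.5, 1.0]:
    penalty = elastic_net_penalty(weights, alpha=0.1, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: penalty = {penalty:.4f}")
```

At `l1_ratio=1.0` only the L1 term remains, and at `l1_ratio=0.0` only the squared L2 term remains. Because of the 0.5 factor, the same `alpha` value does not give exactly the same penalty strength as it would in `Ridge`.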

    from sklearn.linear_model import ElasticNet, ElasticNetCV

    # l1_ratio=1 → pure Lasso, l1_ratio=0 → pure Ridge
    elastic = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
    elastic.fit(X_train, y_train)

    # Cross-validated
    elastic_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0], cv=5)
    elastic_cv.fit(X_train, y_train)

Weight Decay in Neural Networks (L2 via Optimizer)
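For neural networks, L2 regularization is usually applied as weight decay through the optimizer. With plain SGD the two are equivalent up to a constant factor. With adaptive optimizers such as Adam, the weight_decay argument adds an L2 term to the gradient, whereas AdamW applies decoupled weight decay.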
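The link between an L2 penalty and weight decay can be checked by hand for plain SGD. The sketch below is written in plain Python rather than PyTorch and assumes the penalty is written as (λ/2) Σ wᵢ², so its gradient is λ·wᵢ. With the λ Σ wᵢ² convention the decay factor would use 2λ instead. Both function names and the numbers are illustrative.

```python
def sgd_step_l2_penalty(w, grad, lr, lam):
    """One SGD step on loss + (lam / 2) * sum(w_i^2): the penalty adds lam * w_i to each gradient."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]


def sgd_step_weight_decay(w, grad, lr, lam):
    """One SGD step with weight decay: scale each weight by (1 - lr * lam), then apply the data gradient."""
    return [wi * (1.0 - lr * lam) - lr * gi for wi, gi in zip(w, grad)]


# Example weights and data-loss gradients (illustrative only)
w = [0.8, -1.2, 0.3]
grad = [0.1, -0.4, 0.05]

penalized = sgd_step_l2_penalty(w, grad, lr=0.1, lam=0.01)
decayed = sgd_step_weight_decay(w, grad, lr=0.1, lam=0.01)
print(penalized)
print(decayed)
```

The two updates agree up to floating-point rounding. This equivalence holds for plain SGD only; once the gradient is rescaled per parameter, as in Adam, the penalty term is rescaled too and the two formulations diverge.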

    import torch.optim as optim

    # weight_decay is the λ parameter
    optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

    # Different weight decay for different parameter groups
    optimizer = optim.Adam([
        {'params': model.feature_layers.parameters(), 'weight_decay': 1e-4},
        {'params': model.classifier.parameters(), 'weight_decay': 1e-3},  # Stronger on final layers
    ], lr=1e-3)

Other Regularization Techniques
| Technique | Mechanism | Best For |
|---|---|---|
| Dropout | Random neuron deactivation | Dense layers in neural networks |
| Early stopping | Stop at minimum validation loss | Neural networks |
| Data augmentation | Expand training variety | Images, text |
| Batch normalization | Normalize layer inputs | Deep networks |
| Max-norm constraint | Rescale weight vectors whose norm exceeds a cap | Neural networks |
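
Early stopping from the table above can be sketched without any framework. The version below assumes validation loss is recorded once per epoch and uses a patience counter, one common variant. The function name, the `patience` and `min_delta` parameters, and the loss values are illustrative, not taken from a specific library.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Patience never ran out: training used every recorded epoch.
    return best_epoch, len(val_losses) - 1


# Made-up validation losses that bottom out at epoch 3
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"Best epoch: {best_epoch}, stopped at epoch: {stop_epoch}")
```

In a real training loop the weights saved at `best_epoch` would be restored after stopping, since the epochs that follow only served to confirm that the loss had stopped improving.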
Choosing Regularization Strength

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    alphas = np.logspace(-4, 4, 100)  # 10^-4 to 10^4
    cv_scores = []

    for alpha in alphas:
        ridge = Ridge(alpha=alpha)
        scores = cross_val_score(ridge, X_train, y_train, cv=5,
                                 scoring='neg_mean_squared_error')
        cv_scores.append(-scores.mean())  # Negate to get MSE

    # Alpha with the lowest cross-validated MSE
    best_alpha = alphas[np.argmin(cv_scores)]

    plt.semilogx(alphas, cv_scores)
    plt.xlabel('Alpha (log scale)')
    plt.ylabel('CV MSE')
    plt.title('Ridge: CV Error vs. Regularization Strength')
    plt.axvline(best_alpha, color='r', linestyle='--',
                label=f'Best alpha: {best_alpha:.4f}')
    plt.legend()
    plt.show()

L1 vs L2 Decision Guide
Use L1 (Lasso) when:
→ You believe only a few features are truly relevant
→ Interpretability matters (sparse coefficients)
→ Feature count >> sample count

Use L2 (Ridge) when:
→ Many features contribute small effects of comparable size
→ Features are correlated (Lasso picks one arbitrarily, Ridge spreads the weight across them)
→ You need stable coefficient estimates

Use ElasticNet when:
→ Both sparsity and stability are needed
→ Features are correlated, but you still want some to be excluded

Regularization is not optional for most real-world ML models. Even tree-based models are regularized through depth limits and minimum sample requirements. The only question is which form and how much.