Linear Regression
Linear regression is the starting point of predictive modeling — not because it’s always the best algorithm, but because it’s the one that makes every concept clear. Loss functions, gradient descent, regularization, model evaluation — all of these are easiest to understand through the lens of linear regression before applying them to neural networks and complex ensembles.
The Model
Linear regression assumes the target y is a linear function of the inputs x plus some noise:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε

Where:
- β₀ = intercept (bias)
- β₁...βₚ = coefficients (weights), one per feature; these are what we learn
- ε = irreducible noise

In matrix form: y = Xβ + ε (with a leading column of ones in X so that β₀ is part of β)
The goal: find β that minimizes prediction error on training data, while generalizing to new data.
Fitting: Ordinary Least Squares
The standard approach minimizes the Mean Squared Error (MSE) — the average squared difference between predictions and true values:
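MSE = (1/n) Σ (yᵢ - ŷᵢ)²

There’s a closed-form solution: β = (XᵀX)⁻¹Xᵀy

This works well for small-to-medium datasets as long as XᵀX is invertible. It becomes expensive as the number of features grows: for n samples and p features, forming XᵀX costs O(np²) and solving the resulting system costs O(p³).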
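To make the closed-form solution concrete, here is a minimal sketch that solves the normal equations with NumPy and compares the result against scikit-learn. The synthetic data, the seed, and the variable names are assumptions chosen for illustration; they are not from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): y = 2 + 3*x1 - 1.5*x2 + noise
n_samples = 200
X = rng.normal(size=(n_samples, 2))
true_beta = np.array([2.0, 3.0, -1.5])  # intercept, then two slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so the intercept is learned as beta_0
X_design = np.column_stack([np.ones(n_samples), X])

# Normal equations: solve (X^T X) beta = X^T y rather than forming the inverse
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Least-squares solver: same answer, better behaved when X^T X is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# scikit-learn for comparison (it handles the intercept itself)
model = LinearRegression().fit(X, y)
beta_sklearn = np.concatenate([[model.intercept_], model.coef_])

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
print("scikit-learn:    ", beta_sklearn)
```

Solving the linear system directly is usually preferable to computing the inverse explicitly; all three approaches should agree up to floating-point error on well-conditioned data.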
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Assumes X_train, X_test, y_train, y_test already exist (e.g. from a train/test split)

# Fit
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
```

Interpreting Coefficients
This is linear regression’s key advantage over black-box models:
Fitted model: price = -50000 + 120 × sqft + 3000 × bedrooms + 8000 × bathrooms
Interpretation:
- Each additional sqft adds $120 to price (holding other features constant)
- Each bedroom adds $3,000
- Each bathroom adds $8,000

This assumes linear, additive effects; verify with domain knowledge.

Coefficient sign: Positive = feature increases prediction; Negative = decreases it

Coefficient magnitude: Comparable across features only after standardizing them (otherwise it reflects each feature’s scale, not its importance)
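The point about magnitude can be sketched in code. The example below generates synthetic house data loosely based on the fitted model above (the ranges, noise level, and seed are assumptions) and compares raw coefficients with coefficients fitted on standardized features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic features (illustrative assumption, not real housing data)
sqft = rng.uniform(500, 4000, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)  # 1 to 5 bedrooms
X = np.column_stack([sqft, bedrooms])
price = -50000 + 120 * sqft + 3000 * bedrooms + rng.normal(scale=5000, size=n)

# Raw fit: coefficients are in dollars per unit of each feature
raw = LinearRegression().fit(X, price)

# Standardized fit: coefficients are in dollars per standard deviation of each feature
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, price)
std_coefs = scaled.named_steps["linearregression"].coef_

print("raw coefficients (sqft, bedrooms):         ", raw.coef_)
print("standardized coefficients (sqft, bedrooms):", std_coefs)
```

On the raw scale the bedroom coefficient is far larger than the sqft coefficient, yet after standardizing, sqft dominates because it varies over thousands of units while bedrooms vary over a handful.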
Assumptions of Linear Regression
Linear regression works best when these hold:
- Linearity: The relationship between features and target is linear
- Independence: Observations are independent of each other
- Homoscedasticity: Error variance is constant across all values of predictors
- Normality of errors: Residuals follow a normal distribution (for inference, not prediction)
- No multicollinearity: Features aren’t highly correlated with each other
Violations don’t always break the model, but they can mislead coefficient interpretation or inflate standard errors.
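One common way to check the multicollinearity item is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R²). The helper below is a hypothetical sketch on synthetic data, not a library function; thresholds such as 5 or 10 are rules of thumb rather than hard limits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R²_j), where R²_j is from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # unrelated to a and b

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print("VIF for a, b, c:", vifs)
```

The two nearly duplicate columns get large VIFs, while the independent column stays close to 1.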
Regularized Variants
When features are correlated or you have many features, regularization prevents overfitting:
Ridge Regression (L2)
Adds penalty proportional to the sum of squared coefficients. Shrinks all coefficients toward zero, never sets them exactly to zero.
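A small sketch of the shrinkage behavior, on synthetic data with assumed coefficients and alpha values: as alpha grows, the coefficient vector gets shorter, but no entry lands exactly on zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): two of the five features are pure noise
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
    n_zero = int(np.sum(coef == 0.0))
    print(f"alpha={alpha:>8}: ||coef|| = {norms[-1]:.4f}, exact zeros = {n_zero}")
```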
```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # Higher alpha = more regularization
ridge.fit(X_train, y_train)
```

Lasso Regression (L1)
Adds penalty proportional to the sum of the absolute values of the coefficients. Drives some coefficients exactly to zero, which amounts to automatic feature selection.
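To illustrate the feature-selection effect, the sketch below fits Lasso on synthetic data where only the first three of six features matter. The data, the seed, and the choice of alpha=0.3 are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): the last three features have no effect
X = rng.normal(size=(200, 6))
true_coef = np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.3).fit(X, y)

# Indices of features whose coefficient survived the L1 penalty
selected = np.flatnonzero(lasso.coef_)

print("coefficients:     ", np.round(lasso.coef_, 3))
print("selected features:", selected)
```

The surviving coefficients are also shrunk toward zero relative to the true values, which is the price paid for the sparsity.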
```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Some coef_ will be exactly 0.0; those features are excluded
```

ElasticNet
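Combines L1 and L2 penalties, with l1_ratio setting the mix (1.0 is a pure L1 penalty, 0.0 a pure L2 penalty). It can still zero out coefficients like Lasso while behaving more stably when features are correlated.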
```python
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)
```

Evaluating Regression Models
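| Metric | Formula | Interpretation |
|---|---|---|
| MAE | mean(\|y - ŷ\|) | Average absolute error; same units as y; less sensitive to outliers than RMSE |
| RMSE | √mean((y - ŷ)²) | Penalizes large errors more; same units as y |
| R² | 1 - SS_res/SS_tot | 0 = no better than predicting the mean, 1 = perfect; can be negative |
| MAPE | mean(\|y - ŷ\| / \|y\|) × 100 | Average percentage error; scale-free; undefined when y = 0 |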
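As a worked example, the metrics in the table can be computed by hand with NumPy and checked against scikit-learn. The four values below are made up purely for the arithmetic.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Made-up values for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 280.0])

errors = y_true - y_pred                          # [-10, 10, 10, -30]

mae = np.mean(np.abs(errors))                     # (10 + 10 + 10 + 30) / 4 = 15
rmse = np.sqrt(np.mean(errors ** 2))              # sqrt(1200 / 4) = sqrt(300)
ss_res = np.sum(errors ** 2)                      # 1200
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 12500
r2 = 1 - ss_res / ss_tot                          # 1 - 0.096 = 0.904
mape = np.mean(np.abs(errors) / np.abs(y_true))   # as a fraction; multiply by 100 for percent

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
print(f"MAPE: {100 * mape:.2f}%")
```

Note that RMSE exceeds MAE here because the single 30-unit miss is weighted more heavily once squared, and that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage.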
When Linear Regression Is the Right Tool
Linear regression wins when:
- The relationship is genuinely linear
- Interpretability of coefficients is required (medical, financial, legal)
- Training data is limited (complex models will overfit)
- You need fast inference
- You need prediction intervals (confidence in each prediction)
It loses when:
- The true relationship is highly non-linear
- Feature interactions are complex
- Features are highly correlated (multicollinearity inflates variance)
Even when you plan to use a complex model, fitting a linear baseline first is valuable — it sets a performance floor and reveals how much complexity is actually worth the cost.
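A rough sketch of that baseline workflow, assuming synthetic data with a genuinely linear signal: score a predict-the-mean baseline, a linear model, and a more complex model with the same cross-validation, then compare. The model choices and settings here are illustrative, not a recommendation.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): the true relationship is linear
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=300)

models = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean R² across 5 folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:>18}: mean CV R² = {score:.3f}")
```

When the data really is linear, the complex model has little room to improve on the linear fit; on real data the gap between the two is the number that tells you whether the extra complexity pays for itself.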