Linear Regression

Linear regression is the starting point of predictive modeling — not because it’s always the best algorithm, but because it’s the one that makes every concept clear. Loss functions, gradient descent, regularization, model evaluation — all of these are easiest to understand through the lens of linear regression before applying them to neural networks and complex ensembles.

The Model

Linear regression assumes the target y is a linear function of the inputs x plus some noise:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Where:
  β₀ = intercept (bias)
  β₁...βₙ = coefficients (weights) — what we learn
  ε = irreducible noise

In matrix form: y = Xβ + ε

The goal: find β that minimizes prediction error on training data, while generalizing to new data.
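
To make the notation concrete, here is a minimal sketch that generates data from the model above and checks that the scalar and matrix forms agree. The sample size, coefficient values, and noise scale are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 200, 3
true_beta0 = 2.0                          # intercept (hypothetical value)
true_beta = np.array([1.5, -0.7, 3.0])    # coefficients (hypothetical values)

X = rng.normal(size=(n_samples, n_features))
eps = rng.normal(scale=0.5, size=n_samples)   # noise term

# Scalar form: add up each feature's contribution
y_scalar = true_beta0 + sum(true_beta[j] * X[:, j] for j in range(n_features)) + eps

# Matrix form: prepend a column of ones so the intercept becomes part of beta
X_design = np.column_stack([np.ones(n_samples), X])
beta_full = np.concatenate([[true_beta0], true_beta])
y_matrix = X_design @ beta_full + eps

print(np.allclose(y_scalar, y_matrix))
```

The column of ones is what lets the intercept ride along inside β in the matrix form.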

Fitting: Ordinary Least Squares

The standard approach minimizes the Mean Squared Error (MSE) — the average squared difference between predictions and true values:

MSE = (1/m) Σ (yᵢ - ŷᵢ)², where m is the number of training samples and ŷᵢ is the prediction for sample i

There’s a closed-form solution: β = (XᵀX)⁻¹Xᵀy

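This works well for small-to-medium problems. The cost is driven mainly by the number of features: forming XᵀX takes time proportional to the number of samples times the square of the number of features, and inverting it scales as the cube of the number of features. The closed form also breaks down when XᵀX is singular, which happens when features are perfectly collinear.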
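
The closed-form solution can be sketched directly in NumPy. This is an illustration on synthetic data (the coefficients 4.0, 2.0, -1.0 are made up), not how scikit-learn is implemented internally. It solves the normal equations with np.linalg.solve rather than forming the inverse explicitly, then compares against np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 500
X = rng.normal(size=(n_samples, 2))
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Column of ones for the intercept
X_design = np.column_stack([np.ones(n_samples), X])

# Normal equations: solve (X^T X) beta = X^T y without computing the inverse
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# General least-squares solver; more robust when columns are nearly collinear
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_normal)
print(np.allclose(beta_normal, beta_lstsq))
```

Solving the linear system is generally preferred over calling an explicit inverse, since it is cheaper and numerically better behaved.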

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Fit (assumes X_train, X_test, y_train, y_test already exist, e.g. from train_test_split)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Interpreting Coefficients

This is linear regression’s key advantage over black-box models:

Fitted model: price = -50000 + 120 × sqft + 3000 × bedrooms + 8000 × bathrooms

Interpretation:
  Each additional sqft adds $120 to price (holding other features constant)
  Each bedroom adds $3,000
  Each bathroom adds $8,000

  This assumes linear, additive effects — verify with domain knowledge

Coefficient sign: Positive = feature increases prediction; Negative = decreases it
Coefficient magnitude: Only meaningful after standardizing features (otherwise reflects scale, not importance)
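
The point about magnitude can be illustrated with a small sketch. The housing data below is synthetic and loosely modeled on the example above; the helper fit_ols is a hypothetical name introduced here. On raw features the bedrooms coefficient looks far larger than the sqft coefficient, but after standardizing, sqft turns out to move the prediction much more.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
sqft = rng.uniform(500, 3500, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)
price = -50000 + 120 * sqft + 3000 * bedrooms + rng.normal(scale=10000, size=n)

def fit_ols(features, target):
    """Return [intercept, coef_1, ...] from a least-squares fit."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta

X_raw = np.column_stack([sqft, bedrooms])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # zero mean, unit variance

raw_coefs = fit_ols(X_raw, price)[1:]
std_coefs = fit_ols(X_std, price)[1:]

print("raw:", raw_coefs)            # dollars per sqft, dollars per bedroom
print("standardized:", std_coefs)   # dollars per one standard deviation of each feature
```

Each standardized coefficient equals the raw coefficient multiplied by that feature's standard deviation, which is why the ranking can flip.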

Assumptions of Linear Regression

Linear regression works best when these hold:

Linearity: The relationship between features and target is linear
Independence: Observations are independent of each other
Homoscedasticity: Error variance is constant across all values of predictors
Normality of errors: Residuals follow a normal distribution (for inference, not prediction)
No multicollinearity: Features aren’t highly correlated with each other

Violations don’t always break the model, but they can mislead coefficient interpretation or inflate standard errors.
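
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R²). The sketch below is a plain NumPy version on synthetic data; the function name vif is introduced here for illustration. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress feature j on all other features plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The first two columns should show large VIFs because each is almost fully explained by the other, while the third stays close to 1.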

Regularized Variants

When features are correlated or you have many features relative to the number of samples, regularization reduces overfitting by penalizing large coefficients:

Ridge Regression (L2)

Adds a penalty proportional to the sum of squared coefficients. Shrinks all coefficients toward zero but does not set them exactly to zero.

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # Higher alpha = more regularization
ridge.fit(X_train, y_train)
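
To see the shrinkage behavior, here is a sketch of the ridge closed form in NumPy, under the assumption that the intercept is left unpenalized (handled by centering the data). The function ridge_fit and the synthetic coefficients are hypothetical, introduced only for this illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Ridge coefficients with an unpenalized intercept (data centered first)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    p = X.shape[1]
    # Closed form: (Xc^T Xc + alpha * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return intercept, coef

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    _, coef = ridge_fit(X, y, alpha)
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>7}: ||coef|| = {norms[-1]:.3f}, min |coef| = {np.abs(coef).min():.4f}")
```

The coefficient norm falls as alpha grows, while the smallest coefficient gets small without reaching exactly zero.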

Lasso Regression (L1)

Adds a penalty proportional to the sum of the absolute values of the coefficients. Drives some coefficients exactly to zero, which acts as automatic feature selection.

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Some coef_ will be exactly 0.0 — those features are excluded
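
The contrast with Ridge can be checked on synthetic data where only a few features matter. This is a sketch under made-up settings (10 features, 3 of them informative, the alpha values from the snippets above); how many coefficients Lasso zeroes out depends on alpha and on the data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [5.0, -3.0, 2.0]          # only the first three features matter
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))

print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_))
```

Ridge keeps every feature with a small weight, whereas Lasso is expected to drop most of the uninformative ones.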

ElasticNet

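Combines L1 and L2 penalties. The l1_ratio parameter sets the mix: 1.0 is pure Lasso, 0.0 is pure Ridge. It is useful when you want sparsity but have groups of correlated features, where Lasso alone tends to pick one feature from a group somewhat arbitrarily.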

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio=0.5: equal mix of L1 and L2
enet.fit(X_train, y_train)

Evaluating Regression Models

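Metric	Formula	Interpretation
MAE	mean(|y - ŷ|)	Average absolute error; same units as y; less sensitive to outliers than RMSE
RMSE	√mean((y - ŷ)²)	Penalizes large errors more; same units as y
R²	1 - SS_res/SS_tot	0 = no better than predicting the mean, 1 = perfect; can be negative
MAPE	mean(|y - ŷ| / |y|) × 100	Average percentage error; scale-free; undefined when y = 0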
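
The metrics in the table can be written out in a few lines of NumPy, which makes the definitions easy to check by hand. The function name regression_metrics and the four data points are hypothetical; scikit-learn provides equivalents such as mean_squared_error and r2_score, used earlier.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R2 and MAPE (in percent) from two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "R2": 1.0 - ss_res / ss_tot,
        "MAPE": np.mean(np.abs(err) / np.abs(y_true)) * 100,  # undefined if any y_true == 0
    }

y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 400.0]
metrics = regression_metrics(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the single 30-unit miss pulls RMSE above MAE, which is the "penalizes large errors more" behavior.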

When Linear Regression Is the Right Tool

Linear regression wins when:

The relationship is genuinely linear
Interpretability of coefficients is required (medical, financial, legal)
Training data is limited (complex models will overfit)
You need fast inference
You need prediction intervals (confidence in each prediction)

It loses when:

The true relationship is highly non-linear
Feature interactions are complex
Features are highly correlated (multicollinearity inflates variance)

Even when you plan to use a complex model, fitting a linear baseline first is valuable — it sets a performance floor and reveals how much complexity is actually worth the cost.
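
A baseline comparison might look like the sketch below. The data is synthetic with a mild quadratic term (an assumption made so the linear model is good but not perfect), and the train/test split is a simple slice. The two RMSE values give reference points that any more complex model would need to beat.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
X = rng.uniform(-2, 2, size=(n, 2))
# Mostly linear signal plus a small non-linear term and noise
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=n)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Baseline 0: always predict the training mean
rmse_mean = rmse(y_test, np.full_like(y_test, y_train.mean()))

# Baseline 1: ordinary least squares with an intercept
design_train = np.column_stack([np.ones(len(X_train)), X_train])
design_test = np.column_stack([np.ones(len(X_test)), X_test])
beta, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)
rmse_linear = rmse(y_test, design_test @ beta)

print(f"mean baseline RMSE:   {rmse_mean:.3f}")
print(f"linear baseline RMSE: {rmse_linear:.3f}")
```

If a more complex model only improves slightly on the linear RMSE, the extra complexity may not be worth its cost.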