Feature Scaling
Feature scaling brings all numerical features to a similar magnitude. Without it, algorithms that rely on distances or gradient magnitudes are dominated by features with large values — a feature measured in thousands will overwhelm one measured in fractions, even if they’re equally informative.
Why Scaling Matters
Consider three features on very different scales:

- Feature 1: income (range: 20,000 – 200,000)
- Feature 2: age (range: 18 – 90)
- Feature 3: credit_score (range: 300 – 850)

Without scaling:

- KNN: distances are dominated by income (the largest scale)
- Gradient descent: the learning rate must be tiny to keep updates along the income dimension from diverging
- PCA: the first component captures mostly income variance
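The distance effect can be checked numerically. The sketch below uses five made-up customers (the values are illustrative assumptions, not data from this article) and measures how much of the squared Euclidean distance between two of them comes from each feature, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [income, age, credit_score]
X = np.array([
    [50_000.0, 25.0, 700.0],
    [52_000.0, 60.0, 400.0],   # similar income, very different age and score
    [90_000.0, 26.0, 710.0],   # different income, similar age and score
    [30_000.0, 45.0, 550.0],
    [150_000.0, 70.0, 820.0],
])

def share_of_squared_distance(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Compare customer 0 and customer 1 on the raw features
raw_share = share_of_squared_distance(X[0], X[1])

# Same comparison after standardizing every column
X_scaled = StandardScaler().fit_transform(X)
scaled_share = share_of_squared_distance(X_scaled[0], X_scaled[1])

print("raw    [income, age, credit_score]:", raw_share.round(4))
print("scaled [income, age, credit_score]:", scaled_share.round(4))
```

On the raw data, the small income gap accounts for almost all of the distance, even though the two customers differ far more in age and credit score. After standardization, age and credit score carry the distance instead.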
With scaling:

- All features contribute proportionally
- Gradient descent converges faster

The Three Main Scalers
StandardScaler (Z-score normalization)
x_scaled = (x - mean) / std
Result: mean = 0, std = 1

Range: unbounded (typically -3 to 3 for normal distributions)

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit AND transform
X_test_scaled = scaler.transform(X_test)        # Transform ONLY (no fitting)

# Inspect what was learned
print(f"Feature means: {scaler.mean_[:5]}")
print(f"Feature stds: {scaler.scale_[:5]}")
```

Best for: SVM, KNN, PCA, linear models, neural networks
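To connect the formula to the API, here is a minimal sketch that recomputes the z-score by hand and compares it with `StandardScaler`. The small arrays are invented for illustration. Note that scikit-learn uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumed for illustration): columns are [income, age]
X_train = np.array([
    [20_000.0, 18.0],
    [60_000.0, 35.0],
    [120_000.0, 52.0],
    [200_000.0, 90.0],
])
X_test = np.array([[80_000.0, 40.0]])

scaler = StandardScaler().fit(X_train)

# Manual z-score using statistics from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, matching StandardScaler
manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std  # test data reuses the train mean and std

sk_train = scaler.transform(X_train)
sk_test = scaler.transform(X_test)

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
```

The test row is scaled with the training mean and standard deviation, which is exactly what calling `transform` without `fit` does.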
MinMaxScaler
x_scaled = (x - x_min) / (x_max - x_min)
Result: values in [0, 1] (or [feature_range] if specified)

```python
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = minmax.fit_transform(X_train)

# Custom range (e.g., for tanh activation: [-1, 1])
minmax_sym = MinMaxScaler(feature_range=(-1, 1))
```

Best for: Neural networks with sigmoid/tanh, image pixel values, features with known bounds
Weakness: Sensitive to outliers. A single extreme value sets the minimum or maximum, which compresses all other values into a narrow part of the range.
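The outlier sensitivity is easy to see on a tiny example. This sketch, using invented values, scales the same five numbers with and without one extreme value and compares how much of the [0, 1] range they occupy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Five ordinary values, plus a copy with one extreme value appended
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
with_outlier = np.vstack([values, [[1000.0]]])

scaled_clean = MinMaxScaler().fit_transform(values).ravel()
scaled_outlier = MinMaxScaler().fit_transform(with_outlier).ravel()

# Width of the interval occupied by the five ordinary values in each case
spread_clean = scaled_clean[:5].max() - scaled_clean[:5].min()
spread_outlier = scaled_outlier[:5].max() - scaled_outlier[:5].min()

print(f"spread without outlier: {spread_clean:.4f}")
print(f"spread with outlier:    {spread_outlier:.4f}")
```

Without the outlier the five values fill the whole range. With it, they are squeezed into less than 1% of the range, so differences between them become nearly invisible to the model.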
RobustScaler
x_scaled = (x - median) / IQR
Result: median = 0; the interquartile range (25th to 75th percentile) has width 1, roughly [-0.5, 0.5] for symmetric data

```python
from sklearn.preprocessing import RobustScaler

robust = RobustScaler(quantile_range=(25.0, 75.0))  # IQR by default
X_train_scaled = robust.fit_transform(X_train)
```

Best for: Data with significant outliers (financial data, sensor readings with spikes)
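As a check on the formula, the sketch below recomputes the median and IQR by hand on an invented column with one spike, then compares the result with `RobustScaler` and with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single large spike (illustrative values)
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 500.0]).reshape(-1, 1)

# Manual robust scaling: subtract the median, divide by the IQR
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

robust = RobustScaler().fit(x)
robust_scaled = robust.transform(x)
standard_scaled = StandardScaler().fit_transform(x)

# Spread of the six ordinary values under each scaler
inlier_spread_robust = robust_scaled[:6].max() - robust_scaled[:6].min()
inlier_spread_standard = standard_scaled[:6].max() - standard_scaled[:6].min()

print("center_:", robust.center_, "scale_:", robust.scale_)
print(f"inlier spread, robust:   {inlier_spread_robust:.3f}")
print(f"inlier spread, standard: {inlier_spread_standard:.3f}")
```

The spike barely moves the median and IQR, so the ordinary values keep a usable spread. The mean and standard deviation used by `StandardScaler` are pulled toward the spike, which collapses the ordinary values together.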
Power Transformations (for Skewed Features)
Some features have heavily skewed distributions. Power transforms reduce skewness before scaling:
```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson: works with positive and negative values
# (standardize=True by default, so the output is also zero-mean, unit-variance)
yj = PowerTransformer(method='yeo-johnson')

# Box-Cox: only works with strictly positive values
bc = PowerTransformer(method='box-cox')

X_train_transformed = yj.fit_transform(X_train)
```
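One way to confirm that the transform helps is to measure skewness directly. The sketch below assumes a synthetic log-normal feature (not data from this article) and uses `scipy.stats.skew` to compare skewness before and after Yeo-Johnson.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (log-normal), assumed for illustration
rng = np.random.default_rng(0)
X_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
X_transformed = pt.fit_transform(X_skewed)

skew_before = skew(X_skewed[:, 0])
skew_after = skew(X_transformed[:, 0])

print(f"fitted lambda:   {pt.lambdas_[0]:.3f}")
print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

A skewness near zero after the transform suggests the feature is now roughly symmetric. As with any scaler, the lambda should be fitted on training data only and reused on test data.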
```python
# Visualization: before and after
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(X_train[:, 0], bins=50)
axes[0].set_title('Before')
axes[1].hist(X_train_transformed[:, 0], bins=50)
axes[1].set_title('After Yeo-Johnson')
plt.tight_layout()
plt.show()
```

When Scaling Is Required vs. Optional
| Algorithm | Requires Scaling? | Why |
|---|---|---|
| KNN | Required | Distance-based |
| SVM | Required | Margin computation |
| Linear/Logistic Regression | Recommended | Coefficient comparison; also matters for regularization and gradient-based solvers |
| PCA | Required | Variance-based |
| Neural Networks | Required | Gradient magnitudes |
| Random Forest | Not needed | Split-based (scale-invariant) |
| Gradient Boosting | Not needed | Split-based |
| Decision Trees | Not needed | Split-based |
| Naive Bayes | Not needed | Probabilistic (per-feature likelihoods) |
| K-Means | Required | Distance-based |
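The split-based versus distance-based rows can be checked empirically. The following sketch uses a synthetic dataset with deliberately mismatched feature scales (an assumption for illustration) and compares test predictions with and without standardization for a decision tree and for KNN.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; multiply columns to put features on very different scales
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = X * np.array([1.0, 10.0, 100.0, 1000.0])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

def predict_on_test(make_model, X_tr, X_te):
    """Fit a fresh model on the training data and predict the test data."""
    return make_model().fit(X_tr, y_train).predict(X_te)

def make_tree():
    return DecisionTreeClassifier(random_state=0)

def make_knn():
    return KNeighborsClassifier(n_neighbors=5)

# Fraction of test points that get the same label with and without scaling
tree_agreement = np.mean(
    predict_on_test(make_tree, X_train, X_test)
    == predict_on_test(make_tree, X_train_s, X_test_s)
)
knn_agreement = np.mean(
    predict_on_test(make_knn, X_train, X_test)
    == predict_on_test(make_knn, X_train_s, X_test_s)
)

print(f"tree predictions unchanged by scaling: {tree_agreement:.2%}")
print(f"KNN predictions unchanged by scaling:  {knn_agreement:.2%}")
```

Standardization is a monotonic per-feature transform, so the tree should pick equivalent splits and its predictions should agree, up to floating-point edge cases. KNN has no such guarantee, because rescaling changes which neighbors are closest.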
Preventing Leakage in Scaling
```python
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# WRONG: Scaler sees test data during fitting
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X)  # leaks test statistics
X_train_scaled, X_test_scaled = train_test_split(X_all_scaled, ...)

# CORRECT: Scaler fitted only on train data
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# BEST: Use Pipeline for automatic leak prevention in cross-validation
pipeline = Pipeline([('scaler', StandardScaler()), ('model', SVC())])
cross_val_score(pipeline, X, y, cv=5)  # Scaler fitted on each fold's train set
```

Inverse Transforming Predictions
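For regression tasks, if you scaled the target variable, predictions come back in scaled units and must be mapped back to the original units:

```python
from sklearn.preprocessing import StandardScaler

# Scale target (scalers expect a 2D array, hence the reshape;
# this assumes y_train is a NumPy array)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

model.fit(X_train_scaled, y_train_scaled)

# Predictions come back in scaled space, so inverse transform them
y_pred_scaled = model.predict(X_test_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
```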
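scikit-learn also provides `TransformedTargetRegressor`, which wraps the scale, fit, predict, and inverse-transform steps in one estimator. The sketch below is one possible setup, assuming a synthetic regression dataset and a `Ridge` model chosen only for illustration, and checks that it matches the manual approach.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a target in large units (assumed for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
y = 50_000 + 1_000 * y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrapper: scales X inside the pipeline and y via the transformer argument
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # already back in original units

# Manual equivalent, for comparison
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
ridge = Ridge(alpha=1.0).fit(x_scaler.transform(X_train), y_train_scaled)
y_pred_scaled = ridge.predict(x_scaler.transform(X_test))
y_pred_manual = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

print(np.allclose(y_pred, y_pred_manual))
```

Because the wrapper is a regular estimator, it can be passed to `cross_val_score`, so the target scaler is refitted on each fold's training data along with the feature scaler.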