Dimensionality Reduction
Real-world datasets often have hundreds or thousands of features. A gene expression dataset might have 20,000 features per sample. An image is a flat array of thousands of pixel values. A bag-of-words text representation might have 50,000 dimensions.
Working in very high dimensions is problematic: under the curse of dimensionality, distances become less informative, models overfit more easily, and training slows. Dimensionality reduction addresses this by compressing data into fewer dimensions while preserving the structure that matters.
The Curse of Dimensionality
As dimensions grow, data points become increasingly sparse. In 1D, 10 evenly spaced points cover the unit interval well. In 10D, you’d need 10¹⁰ points to have the same density. In 100D, it’s astronomically worse.
Practical consequences:
- Distance metrics (cosine, Euclidean) lose discriminative power in high dimensions
- Models need exponentially more data to generalize
- Overfitting risk increases with feature count
- Computation scales poorly
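The first consequence can be checked numerically. The sketch below is illustrative only: it assumes points drawn uniformly from the unit hypercube, and the helper name relative_contrast is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest; when that ratio shrinks toward zero, "near" and "far" stop being useful distinctions.

```python
import numpy as np

def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (max - min) / min over Euclidean distances from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)                # a single random query point
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Same number of points, increasing dimensionality
contrasts = {dim: relative_contrast(1000, dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:>4}  relative contrast={contrast:.2f}")
```

With a fixed sample size, the contrast should drop steadily as the dimension grows, which is one way to see why nearest-neighbor style methods struggle on raw high-dimensional data.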
PCA: Principal Component Analysis
PCA is the most widely used linear dimensionality reduction technique. It finds the directions of maximum variance in the data and projects the data onto them.
High-D data → Find principal components (directions of max variance) → Project data onto top-k components → Lower-D representation that preserves as much variance as possible

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt

    # Scale first (PCA is variance-based; scale matters)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Find how many components explain 95% variance
    pca = PCA()
    pca.fit(X_scaled)
    cumvar = pca.explained_variance_ratio_.cumsum()
    n_components_95 = (cumvar < 0.95).sum() + 1
    print(f"Need {n_components_95} components for 95% variance (from {X.shape[1]})")

    # Apply reduction
    pca_95 = PCA(n_components=n_components_95)
    X_reduced = pca_95.fit_transform(X_scaled)

When to use PCA: Pre-processing before supervised learning, noise reduction, speed improvement. Not great for non-linear structure.
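To make "directions of maximum variance" concrete, here is a minimal NumPy sketch of the computation PCA performs, using a singular value decomposition of the centered data. The synthetic data and all variable names (X_demo, latent, mixing) are assumptions for illustration; this is not scikit-learn's implementation, and it centers the data without standardizing it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 200 samples, 5 features,
# generated from 2 latent directions plus a little noise
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# 1. Center each feature
X_centered = X_demo - X_demo.mean(axis=0)

# 2. SVD: rows of Vt are the principal directions, ordered by variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Variance captured along each direction
explained_variance = S**2 / (X_centered.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the top-k directions
k = 2
components = Vt[:k]
X_proj = X_centered @ components.T

print("Explained variance ratio:", explained_ratio.round(3))
print("Projected shape:", X_proj.shape)
```

Because the data was built from two latent directions, the first two ratios should account for nearly all of the variance, which is the situation where PCA compresses well.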
t-SNE: Visualizing Clusters
t-SNE (t-distributed Stochastic Neighbor Embedding) is designed specifically for visualization. It preserves local structure — nearby points in high dimensions stay nearby in 2D.

    from sklearn.manifold import TSNE

    # max_iter was named n_iter before scikit-learn 1.5
    tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000)
    X_2d = tsne.fit_transform(X_scaled)  # Produces a 2D representation

    import seaborn as sns
    sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=labels, palette='tab10', s=15)
    plt.title("t-SNE Visualization")
    plt.show()

t-SNE caveats:
- Distances between clusters in the 2D plot are not reliably meaningful (only local, within-cluster structure is)
- Non-deterministic (use random_state for reproducibility)
- Slow for large datasets (>10K points); consider UMAP instead
- The perplexity hyperparameter significantly affects output
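The perplexity caveat is easy to check on a small example. The sketch below assumes a synthetic three-cluster dataset from make_blobs (not the article's data) and the two perplexity values are arbitrary picks; it embeds the same points twice with a fixed seed so that any difference comes from perplexity alone.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic data (assumed for illustration): 150 points, 20 features, 3 clusters
X_blobs, y_blobs = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)

embeddings = {}
for perplexity in (5, 30):
    # Same seed and initialization for both runs; only perplexity changes
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_blobs)
    print(f"perplexity={perplexity}: embedding shape {embeddings[perplexity].shape}")
```

Plotting both embeddings side by side, colored by y_blobs, is a reasonable way to judge which setting suits a dataset. Perplexity must be smaller than the number of samples.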
UMAP: Faster and Better
UMAP (Uniform Manifold Approximation and Projection) is the modern alternative to t-SNE. It is typically faster, tends to preserve more global structure, and works as a general-purpose dimensionality reducer (not just for visualization).

    import umap

    # General dimensionality reduction
    reducer = umap.UMAP(n_components=50, random_state=42)  # compress to 50-D
    X_umap = reducer.fit_transform(X_scaled)

    # 2D visualization
    reducer_2d = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
    X_2d_umap = reducer_2d.fit_transform(X_scaled)

UMAP vs t-SNE:
| | t-SNE | UMAP |
|---|---|---|
| Speed | Slower (O(N log N) with the Barnes-Hut approximation) | Faster (close to linear scaling in practice) |
| Global structure | Poor | Good |
| High → medium dimensions | Not ideal | Works well |
| Deterministic | Only with a fixed random_state | Only with a fixed random_state |
| General reducer | No | Yes |
Autoencoders: Non-Linear Reduction
For complex data (images, text), linear methods like PCA miss non-linear structure. Autoencoders learn a compressed representation through a neural network.
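
    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, input_dim, latent_dim):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 256),
                nn.ReLU(),
                nn.Linear(256, latent_dim),  # Bottleneck
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 256),
                nn.ReLU(),
                nn.Linear(256, input_dim),
            )

        def forward(self, x):
            z = self.encoder(x)  # Compressed representation
            return self.decoder(z), z  # Reconstructed x, latent code

    # Train to minimize reconstruction loss
    autoencoder = Autoencoder(input_dim=784, latent_dim=32)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)  # assumed optimizer and learning rate

    # Illustrative loop: `loader` is assumed to yield float tensors of shape (batch, 784)
    for epoch in range(10):  # epoch count is arbitrary
        for x in loader:
            x_hat, _ = autoencoder(x)
            loss = criterion(x_hat, x)  # how far the reconstruction is from the input
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Choosing the Right Method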
- Purpose: Visualization (2D/3D) → UMAP or t-SNE
- Purpose: Pre-processing for supervised ML → PCA or UMAP
- Purpose: Noise reduction, speed → PCA
- Purpose: Non-linear structure in images/audio → Autoencoder
- Purpose: Very large datasets → UMAP (t-SNE too slow)
- Data: Linear relationships dominate → PCA
- Data: Complex manifold structure → UMAP / Autoencoder
Feature Selection vs. Dimensionality Reduction
These are related but different:
Feature selection: Choose a subset of original features. Interpretable — you know which original features you kept.
Dimensionality reduction: Create new features as combinations of originals. Better compression but less interpretable.
Use feature selection when interpretability matters. Use dimensionality reduction (PCA/UMAP) when you want to retain as much of the data's variance or structure as possible in few dimensions, or when many features are correlated with one another.
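The difference in interpretability shows up directly in code. The sketch below uses the Iris dataset and scikit-learn's SelectKBest with an ANOVA F-score; both are choices made here for illustration, not something the text above prescribes. Feature selection returns two of the original columns by name, while PCA returns two new axes that each mix all four inputs.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_std = StandardScaler().fit_transform(X_iris)

# Feature selection: keep the 2 original features most related to the label
selector = SelectKBest(score_func=f_classif, k=2).fit(X_std, y_iris)
kept = [iris.feature_names[i] for i in selector.get_support(indices=True)]
X_selected = selector.transform(X_std)

# Dimensionality reduction: build 2 new features from all 4 originals
pca = PCA(n_components=2).fit(X_std)
X_pca = pca.transform(X_std)

print("Selected features:", kept)
print("PCA loadings (rows = components, columns = original features):")
print(pca.components_.round(2))
```

Both outputs have two columns, but only the first can be described to a domain expert as "these two measurements". Note that this selector is supervised (it uses the labels), whereas PCA is not.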
A common pipeline: apply PCA or UMAP → feed into a downstream model (SVM, neural net, k-NN). This is especially effective when the original feature space is very high-dimensional or noisy.
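A minimal sketch of such a pipeline with scikit-learn, assuming the built-in digits dataset, a 95% variance threshold, and a k-NN classifier (all three are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)  # 64 pixel features per sample

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),            # a float in (0, 1) keeps enough components for that variance share
    KNeighborsClassifier(n_neighbors=5),
)

# Scaling and PCA are re-fit inside each training fold, so no test data leaks into them
scores = cross_val_score(pipeline, X_digits, y_digits, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")

pipeline.fit(X_digits, y_digits)
n_kept = pipeline.named_steps["pca"].n_components_
print(f"Components kept: {n_kept} of {X_digits.shape[1]}")
```

Wrapping the reducer in the pipeline, rather than transforming the whole dataset up front, keeps the evaluation honest: the projection is learned only from the data the model is allowed to see.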