Dimensionality Reduction

Real-world datasets often have hundreds or thousands of features. A gene expression dataset might have 20,000 features per sample. An image is a flat array of thousands of pixel values. A bag-of-words text representation might have 50,000 dimensions.

Working in very high dimensions is problematic: under the curse of dimensionality, distances become less informative, models overfit more easily, and training slows. Dimensionality reduction addresses this by compressing data into fewer dimensions while preserving the structure that matters.

The Curse of Dimensionality

As dimensions grow, data points become increasingly sparse. In 1D, 10 evenly spaced points cover the unit interval well. In 10D, you’d need 10¹⁰ points to have the same density. In 100D, it’s astronomically worse.

Practical consequences:

Distance metrics (cosine, Euclidean) lose discriminative power in high dimensions
Models need exponentially more data to generalize
Overfitting risk increases with feature count
Computation scales poorly
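
The first consequence can be checked numerically. The sketch below is an illustration under assumptions not in the text above: points are drawn uniformly from the unit hypercube, and the helper name relative_contrast is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance. As the dimension grows, that gap shrinks, which is what "losing discriminative power" means in practice.

```python
import numpy as np

def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (max - min) / min over Euclidean distances from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)                # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

for dim in (2, 10, 100, 1000):
    contrast = relative_contrast(n_points=1000, dim=dim)
    print(f"dim={dim:>4}  relative contrast={contrast:.3f}")
```

A large contrast means the nearest neighbor is clearly closer than everything else. A contrast near zero means all points sit at almost the same distance, so nearest-neighbor queries carry little information.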

PCA: Principal Component Analysis

PCA is the most widely used linear dimensionality reduction technique. It finds the orthogonal directions of maximum variance in the data (the principal components) and projects the data onto the top few of them.

High-D data → Find principal components (directions of max variance)
            → Project data onto top-k components
            → Lower-D representation that preserves as much variance as possible

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

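# X is assumed to be an (n_samples, n_features) NumPy array loaded elsewhere
# Scale first (PCA is variance-based, so feature scale matters)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find how many components explain 95% of the variance
pca = PCA()
pca.fit(X_scaled)
cumvar = pca.explained_variance_ratio_.cumsum()
# Count the components still below 95%, then add the one that crosses it
n_components_95 = (cumvar < 0.95).sum() + 1
print(f"Need {n_components_95} components for 95% variance (from {X.shape[1]})")

# Apply reduction (passing n_components=0.95 to PCA selects this count in one step)
pca_95 = PCA(n_components=n_components_95)
X_reduced = pca_95.fit_transform(X_scaled)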

When to use PCA: Pre-processing before supervised learning, noise reduction, speed improvement. Not great for non-linear structure.
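
To make "directions of maximum variance" concrete, PCA can be sketched directly with NumPy's singular value decomposition. This is a minimal illustration, not scikit-learn's implementation. The function name pca_svd and the toy data (three correlated features driven by one hidden factor) are assumptions made for this example.

```python
import numpy as np

def pca_svd(X: np.ndarray, k: int):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)              # PCA works on zero-mean features
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                          # shape (k, n_features)
    explained_variance = S**2 / (len(X) - 1)     # variance along each direction
    ratio = explained_variance / explained_variance.sum()
    projected = X_centered @ components.T        # shape (n_samples, k)
    return projected, components, ratio[:k]

rng = np.random.default_rng(0)
# Toy data: one hidden factor spread across 3 features, plus a little noise
factor = rng.normal(size=(200, 1))
X_toy = factor @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

Z, components, ratio = pca_svd(X_toy, k=2)
print("Projected shape:", Z.shape)
print("Explained variance ratio:", ratio)
```

Because the toy data is driven by a single factor, the first component should capture nearly all of the variance, which is the situation where PCA compresses well.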

t-SNE: Visualizing Clusters

t-SNE (t-distributed Stochastic Neighbor Embedding) is designed specifically for visualization. It preserves local structure — nearby points in high dimensions stay nearby in 2D.

from sklearn.manifold import TSNE

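# The iteration count is left at its default of 1000. The argument is named
# n_iter in older scikit-learn releases and max_iter from version 1.5 onward.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_scaled)  # Produces a 2D representation

import seaborn as sns

# labels is assumed to hold one class label per row of X
sns.scatterplot(x=X_2d[:,0], y=X_2d[:,1], hue=labels, palette='tab10', s=15)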
plt.title("t-SNE Visualization")
plt.show()

t-SNE caveats:

Distances between clusters, and the apparent sizes of clusters, are not meaningful in the 2D plot (only local neighborhood structure is)
Non-deterministic (use random_state for reproducibility)
Slow for large datasets (>10K points); consider UMAP instead
The perplexity hyperparameter significantly affects the output
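
The sensitivity to perplexity is easy to probe. The sketch below runs t-SNE at three perplexity values and scores each embedding with scikit-learn's trustworthiness function, which measures how well local neighborhoods are preserved on a scale from 0 to 1. The synthetic blob data and the specific perplexity values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness

# Synthetic stand-in data: 300 points in 20-D drawn from 4 Gaussian blobs
X_demo, y_demo = make_blobs(n_samples=300, n_features=20, centers=4, random_state=0)

scores = {}
for perplexity in (5, 30, 100):
    # perplexity must stay below the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embedding = tsne.fit_transform(X_demo)
    # Fraction-like score: 1.0 means the 10 nearest neighbors are fully preserved
    scores[perplexity] = trustworthiness(X_demo, embedding, n_neighbors=10)

for perplexity, score in scores.items():
    print(f"perplexity={perplexity:>3}  trustworthiness={score:.3f}")
```

Trustworthiness only checks local neighborhoods, which matches the caveat above: a high score says nothing about whether the spacing between clusters is faithful. Plotting the embeddings side by side is still the most direct way to see how perplexity changes the picture.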

UMAP: Faster and Better

UMAP (Uniform Manifold Approximation and Projection) is a more recent alternative to t-SNE. It is typically faster, tends to preserve more global structure, and also works as a general-purpose dimensionality reducer (not just for visualization).

import umap

# General dimensionality reduction
reducer = umap.UMAP(n_components=50, random_state=42)  # compress to 50-D
X_umap = reducer.fit_transform(X_scaled)

# 2D visualization
reducer_2d = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
X_2d_umap = reducer_2d.fit_transform(X_scaled)

UMAP vs t-SNE:

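	t-SNE	UMAP
Speed	Slow (O(N log N) with Barnes-Hut, O(N²) exact)	Fast (empirically near-linear in N)
Global structure	Poor	Better
High → medium dimensions	Not ideal	Works well
Reproducible	Only with fixed random_state	Only with fixed random_state
General reducer	No	Yes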

Autoencoders: Non-Linear Reduction

For complex data (images, text), linear methods like PCA miss non-linear structure. Autoencoders learn a compressed representation through a neural network.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim)  # Bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)          # Compressed representation
        return self.decoder(z), z    # Reconstructed x, latent code

# Train to minimize reconstruction loss
autoencoder = Autoencoder(input_dim=784, latent_dim=32)
criterion = nn.MSELoss()
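
The snippet above stops before training. A minimal training loop might look like the sketch below. Several details are assumptions made for illustration: the random tensor standing in for 512 flattened 28x28 images, the Adam optimizer, the learning rate, the batch size, and the epoch count. The class is repeated in compact form so the block runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Placeholder data: random values standing in for flattened 28x28 images
X_train = torch.rand(512, 784)

autoencoder = Autoencoder(input_dim=784, latent_dim=32)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

losses = []
for epoch in range(10):
    epoch_loss = 0.0
    for start in range(0, len(X_train), 64):         # mini-batches of 64
        batch = X_train[start:start + 64]
        reconstruction, _ = autoencoder(batch)
        loss = criterion(reconstruction, batch)       # the target is the input itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * len(batch)
    losses.append(epoch_loss / len(X_train))          # mean loss per sample

# After training, the encoder alone yields the reduced representation
with torch.no_grad():
    X_latent = autoencoder.encoder(X_train)
print(tuple(X_latent.shape), f"first epoch={losses[0]:.4f}", f"last epoch={losses[-1]:.4f}")
```

With real data, a held-out validation set would be needed to tell useful compression from memorization. Random noise like the placeholder here has no structure to learn beyond its mean.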

Choosing the Right Method

Purpose: Visualization (2D/3D) → UMAP or t-SNE
Purpose: Pre-processing for supervised ML → PCA or UMAP
Purpose: Noise reduction, speed → PCA
Purpose: Non-linear structure in images/audio → Autoencoder
Purpose: Very large datasets → UMAP (t-SNE too slow)
Data: Linear relationships dominate → PCA
Data: Complex manifold structure → UMAP / Autoencoder

Feature Selection vs. Dimensionality Reduction

These are related but different:

Feature selection: Choose a subset of original features. Interpretable — you know which original features you kept.

Dimensionality reduction: Create new features as combinations of originals. Better compression but less interpretable.

Use feature selection when interpretability matters. Use dimensionality reduction (PCA/UMAP) when a compact representation matters more than interpretability, or when many features are correlated and carry redundant information.
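
The difference shows up clearly in code. In the sketch below, the Iris dataset, the ANOVA F-score ranking (f_classif), and the choice of two output dimensions are all assumptions picked for illustration. Feature selection returns names of original columns, while PCA returns weights that mix every column.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_iris = StandardScaler().fit_transform(iris.data)
y_iris = iris.target

# Feature selection: keep 2 of the 4 original columns, ranked by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2).fit(X_iris, y_iris)
kept = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected features:", kept)

# Dimensionality reduction: 2 new axes, each a weighted mix of all 4 columns
pca = PCA(n_components=2).fit(X_iris)
for i, weights in enumerate(pca.components_):
    mix = dict(zip(iris.feature_names, weights.round(2)))
    print(f"PC{i + 1} weights:", mix)
```

The selected features can be reported to a domain expert as they are. The principal components need their weights inspected before anyone can say what they represent.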

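A common pipeline: apply PCA or UMAP → feed into a downstream model (SVM, neural net, k-NN). This is especially effective when the original feature space is very high-dimensional or noisy. Fit the reducer on the training data only, then apply the same fitted transform to validation and test data.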
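
One way to sketch such a pipeline is with scikit-learn's make_pipeline, which refits the scaler and the reducer inside each cross-validation fold so no held-out data influences them. The digits dataset, the 20-component setting, and the k-NN classifier are assumptions chosen to keep the example small and self-contained.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 pixel features
X_digits, y_digits = load_digits(return_X_y=True)

# Each step is fit on the training fold only, then applied to the held-out fold
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    KNeighborsClassifier(n_neighbors=5),
)
scores = cross_val_score(model, X_digits, y_digits, cv=5)
print(f"Mean CV accuracy with 20 components: {scores.mean():.3f}")
```

Treating n_components as a hyperparameter and tuning it alongside the downstream model (for example with GridSearchCV) is usually more reliable than fixing a variance threshold in advance.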