Unsupervised Learning

What if you have a mountain of data but no labels? No one has gone through and tagged each record. No ground truth. No right answers. That’s where unsupervised learning operates — and it turns out, some of the most powerful insights in data science come from finding structure without anyone telling the model what to look for.

The Key Difference from Supervised Learning

Supervised:    Input → Model → Predicted label (compared to true label)
Unsupervised:  Input → Model → Structure / Pattern / Compressed representation

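There’s no teacher and no “right” output. The model still optimizes an objective, such as within-cluster variance or reconstruction error, but that objective is defined by the data itself rather than by correct answers. What comes back describes the data’s structure: which points are similar, what a compressed representation looks like, what’s unusually different.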
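The difference shows up directly in how a model is fitted. The sketch below uses scikit-learn's bundled iris data purely as an illustration (the dataset and the choice of three clusters are assumptions, not part of the discussion above): the supervised model receives labels, the unsupervised one does not.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y define the loss the model minimizes.
clf = LogisticRegression(max_iter=1000).fit(X, y)
predicted_labels = clf.predict(X)

# Unsupervised: only X is passed. The returned cluster IDs are arbitrary
# integers and carry no class meaning on their own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(predicted_labels[:5], cluster_ids[:5])
```

Cluster 0 from K-Means has no built-in relationship to class 0 in the labels; any mapping between them has to be worked out afterwards.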

Three Core Tasks

1. Clustering

Group similar data points together. Points within a cluster are more similar to each other than to points in other clusters.

Use cases: Customer segmentation, document grouping, genomics, image compression.

Key algorithms:

K-Means: Assign points to K centroids, minimize within-cluster variance
DBSCAN: Density-based, finds arbitrary shapes, handles noise
Hierarchical: Builds a tree of clusters (dendrogram), good for exploring at multiple scales
Gaussian Mixture Models: Soft cluster assignment with probabilities
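To see why the choice of algorithm matters, here is a minimal sketch comparing K-Means and DBSCAN on scikit-learn's synthetic two-moons data. The dataset, `eps=0.2`, and `min_samples=5` are hand-picked assumptions for this toy case, not general recommendations. The true labels are used only to score the result.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved, non-convex clusters.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means partitions space around centroids, so it tends to cut each moon in half.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows dense regions; points it cannot assign get the label -1 (noise).
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.2f}, DBSCAN ARI: {dbscan_ari:.2f}")
```

On data with roughly spherical, similarly sized groups the ranking can easily flip, so this is an illustration of shape sensitivity rather than a verdict on either algorithm.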

2. Dimensionality Reduction

Represent high-dimensional data in fewer dimensions, preserving as much structure as possible.

Use cases: Visualization, noise reduction, pre-processing before supervised learning.

Key algorithms:

PCA (Principal Component Analysis): Linear, variance-maximizing projection
t-SNE: Nonlinear, preserves local neighborhoods, excellent for visualization (distances between far-apart clusters in the plot are not reliable)
UMAP: Typically faster than t-SNE, often preserves more global structure, scales to large datasets
Autoencoders: Neural network-based compression and reconstruction
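As a rough sketch of the nonlinear option, the snippet below embeds a subset of scikit-learn's digits dataset with t-SNE. The 500-sample subset and `perplexity=30` are assumptions chosen to keep the run short; the result is meant for plotting, not as input features for another model.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data[:500]          # subset keeps the run time to a few seconds
y = digits.target[:500]        # used only to color a plot, never for fitting

# perplexity roughly controls how many neighbors each point "pays attention to".
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)
```

Unlike PCA, scikit-learn's `TSNE` has no `transform` method for new points, so the embedding has to be recomputed when the data changes.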

3. Anomaly Detection

Identify data points that don’t fit the normal pattern.

Use cases: Fraud detection, network intrusion, equipment failure prediction, quality control.

Key approaches:

Isolation Forest: Anomalies are easier to isolate in random trees
One-Class SVM: Learn the boundary of normal data
Autoencoder reconstruction error: Normal data reconstructs well; anomalies don’t
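A minimal Isolation Forest sketch on synthetic data follows. The data, the ring of injected outliers, and the `contamination` value are all assumptions made for illustration; on real data the expected anomaly rate is usually unknown and has to be estimated or tuned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 "normal" points around the origin plus 10 injected outliers on a ring of radius 8.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
angles = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
outliers = 8.0 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([normal, outliers])

# contamination is the assumed fraction of anomalies; it sets the decision threshold.
model = IsolationForest(n_estimators=100, contamination=10 / 510, random_state=0)
labels = model.fit_predict(X)      # +1 = inlier, -1 = anomaly
scores = model.score_samples(X)    # lower score = more anomalous

flagged = np.where(labels == -1)[0]
print(f"Flagged {len(flagged)} points; indices >= 500 are the injected outliers")
```

Ranking by `scores` is often more useful than the hard labels, since it lets an analyst review the most suspicious records first.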

K-Means in Practice

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# Customer RFM data (Recency, Frequency, Monetary)
df = pd.read_csv("customers.csv")
X = df[["recency_days", "frequency", "total_spend"]]

# Scale features (K-means is distance-based, so scale matters)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

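# Find a candidate K using the elbow method
k_values = list(range(2, 11))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Plot inertia against K and look for the point where the curve flattens
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()

# Fit with the chosen K (4 is assumed here for illustration)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)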
df["segment"] = kmeans.fit_predict(X_scaled)

print(df.groupby("segment")[["recency_days", "frequency", "total_spend"]].mean())

PCA: Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# 64-dimensional digit images → 2D for visualization
digits = load_digits()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)

print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar()
plt.title("Digits dataset compressed to 2D via PCA")
plt.show()
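Two components are convenient for plotting but usually discard most of the variance. When PCA is used as pre-processing instead, a common approach is to keep enough components to reach a target share of variance. The sketch below assumes a 95% target, which is a convention rather than a rule.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Fit all components and inspect the cumulative explained variance.
pca_full = PCA().fit(digits.data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float in (0, 1) asks scikit-learn to pick the component count itself.
pca_95 = PCA(n_components=0.95).fit(digits.data)
X_reduced = pca_95.transform(digits.data)

print(f"Components for 95% variance: {n_95} (scikit-learn picked {pca_95.n_components_})")
print(f"Shape after reduction: {X_reduced.shape}")
```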

The Evaluation Problem

Unsupervised learning has no ground truth to evaluate against. How do you know if your clusters are good?

Internal metrics (no labels needed):

Silhouette score: Measures how similar a point is to its own cluster vs. others. Range -1 to 1; higher is better.
Inertia (K-Means): Within-cluster sum of squares. It always decreases as K grows, so look for the elbow where the improvement levels off rather than the minimum.
Davies-Bouldin index: Lower means more separated, compact clusters.
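These metrics can be computed side by side while sweeping K. The sketch below uses synthetic blobs with four well-separated, hand-placed centers (an assumption that makes the right answer obvious); real data rarely gives such a clean signal.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Four well-separated synthetic clusters; the true labels are discarded.
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

for k, m in results.items():
    print(f"K={k}: silhouette={m['silhouette']:.3f}, DB={m['davies_bouldin']:.3f}")

best_k = max(results, key=lambda k: results[k]["silhouette"])
print(f"Best K by silhouette: {best_k}")
```

When the metrics disagree with each other, that is usually a sign the data has no single natural number of clusters.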

External metrics (when you have labels for validation):

Adjusted Rand Index: Measures agreement with true labels
Normalized Mutual Information: Measures information overlap
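Both external metrics compare groupings rather than label values, so they are unaffected by how cluster IDs happen to be numbered. A small sketch with made-up label lists (all three lists are illustrative assumptions):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels    = [0, 0, 0, 1, 1, 1, 2, 2, 2]
same_partition = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # identical grouping, renamed IDs
imperfect      = [0, 0, 1, 1, 1, 1, 2, 2, 0]  # two points placed in the wrong group

ari_same = adjusted_rand_score(true_labels, same_partition)
nmi_same = normalized_mutual_info_score(true_labels, same_partition)

ari_imperfect = adjusted_rand_score(true_labels, imperfect)
nmi_imperfect = normalized_mutual_info_score(true_labels, imperfect)

print(f"Renamed IDs:  ARI={ari_same:.2f}, NMI={nmi_same:.2f}")
print(f"Two mistakes: ARI={ari_imperfect:.2f}, NMI={nmi_imperfect:.2f}")
```

Plain accuracy would score the renamed partition as completely wrong, which is why it is not used for clustering.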

The honest approach: Evaluate downstream. Good clustering should produce segments with different behavior, different churn rates, different conversion rates. If your customer segments don’t lead to different marketing outcomes, the clustering isn’t useful regardless of its Silhouette score.

2025–2026: Self-Supervised Learning

The boundary between unsupervised and supervised learning has blurred. Self-supervised learning, where a model derives its own training signal from the structure of the data (for example, predicting the next token in a text or a masked patch of an image), now dominates pre-training for LLMs and vision models.

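Representation-learning methods such as SimCLR, CLIP, and DINO learn powerful representations without task-specific human annotation. SimCLR contrasts augmented views of the same image against other images. CLIP contrasts matching image-caption pairs against mismatched ones, so its supervision comes from naturally occurring text rather than curated class labels. DINO is closely related but relies on self-distillation instead of explicit negative examples. These representations transfer remarkably well to downstream tasks.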
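The core of a contrastive objective fits in a few lines. The NumPy sketch below follows the general shape of the NT-Xent loss used by SimCLR, written from its published description rather than taken from any reference implementation. The function name, the temperature of 0.5, and the random "embeddings" are all assumptions for illustration.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive loss for paired embeddings z_a[i] <-> z_b[i], each of shape (N, D)."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit length -> cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # a sample is never compared with itself

    # Row i's positive is the other view of the same example.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned_views = anchors + 0.01 * rng.normal(size=(8, 16))  # near-identical "augmentations"
unrelated = rng.normal(size=(8, 16))                       # no relationship to the anchors

loss_aligned = nt_xent_loss(anchors, aligned_views)
loss_unrelated = nt_xent_loss(anchors, unrelated)
print(f"aligned pairs: {loss_aligned:.3f}, unrelated pairs: {loss_unrelated:.3f}")
```

In a real training loop the embeddings would come from an encoder applied to two augmentations of each input, and this loss would be minimized by gradient descent in a framework with automatic differentiation.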

In practice: for tabular business data, classic unsupervised methods (K-Means, PCA, Isolation Forest) remain the standard. For unstructured data (text, images), self-supervised pre-training followed by fine-tuning is the dominant paradigm.