Unsupervised Learning
What if you have a mountain of data but no labels? No one has gone through and tagged each record. No ground truth. No right answers. That’s where unsupervised learning operates — and it turns out, some of the most powerful insights in data science come from finding structure without anyone telling the model what to look for.
The Key Difference from Supervised Learning
Supervised: Input → Model → Predicted label (compared to true label)
Unsupervised: Input → Model → Structure / Pattern / Compressed representation
There’s no teacher, no loss based on correct answers, no “right” output. The model explores the data and returns something about its structure: which points are similar, what a compressed representation looks like, what’s unusually different.
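In scikit-learn terms, the contrast shows up in what you pass to `fit`. Here is a minimal sketch on synthetic data; the dataset, the choice of `LogisticRegression`, and all parameter values are illustrative assumptions, not part of the original walkthrough.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 300 points drawn around 3 centers (illustrative only).
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: the model is fit on the inputs AND the true labels.
clf = LogisticRegression(max_iter=1000).fit(X, y)
predicted_labels = clf.predict(X)

# Unsupervised: the model only ever sees the inputs.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Cluster ids are arbitrary: cluster 0 need not correspond to class 0.
print("Predicted labels:", predicted_labels[:10])
print("Cluster ids:     ", cluster_ids[:10])
```

Note that the cluster ids carry no meaning on their own. Deciding what each cluster represents is still up to you.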
Three Core Tasks
1. Clustering
Group similar data points together. Points within a cluster are more similar to each other than to points in other clusters.
Use cases: Customer segmentation, document grouping, genomics, image compression.
Key algorithms:
- K-Means: Assign points to K centroids, minimize within-cluster variance
- DBSCAN: Density-based, finds arbitrary shapes, handles noise
- Hierarchical: Builds a tree of clusters (dendrogram), good for exploring at multiple scales
- Gaussian Mixture Models: Soft cluster assignment with probabilities
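To make the differences concrete, here is a rough sketch that runs three of these algorithms on the same toy dataset (two interleaved half-moons). The dataset and the values for `eps`, `min_samples`, and the number of clusters are hand-picked assumptions for this example, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Two interleaved half-moons: a non-convex shape that centroid methods struggle with.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means: hard assignment of every point to the nearest of K centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: groups dense regions; points in sparse regions get the label -1 (noise).
# eps and min_samples are tuned by hand for this toy data.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Gaussian Mixture: each point gets a probability of belonging to each component.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)

print("K-Means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN labels found:", sorted(set(dbscan_labels.tolist())))
print("GMM memberships for the first point:", soft[0].round(3))
```

Plotting the three label sets side by side is a quick way to see which algorithm follows the shape of the data and which one cuts across it.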
2. Dimensionality Reduction
Represent high-dimensional data in fewer dimensions, preserving as much structure as possible.
Use cases: Visualization, noise reduction, pre-processing before supervised learning.
Key algorithms:
- PCA (Principal Component Analysis): Linear, variance-maximizing projection
- t-SNE: Nonlinear, preserves local structure, excellent for visualization
- UMAP: Typically faster than t-SNE, often preserves more of the global structure, scales to large datasets
- Autoencoders: Neural network-based compression and reconstruction
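A common pattern combines two of these: PCA first to strip out low-variance directions, then t-SNE on the reduced data for a 2D view. The sketch below assumes the scikit-learn digits dataset, a 500-sample subset, a 95% variance threshold, and a perplexity of 30, all chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data[:500]  # a subset keeps t-SNE fast; the size is arbitrary

# A float n_components keeps the fewest components that explain
# at least that share of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X)

# t-SNE on the PCA output gives a 2D embedding intended for plotting only.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_pca)

print(X.shape, "->", X_pca.shape, "->", X_tsne.shape)
```

The PCA output is reusable as features for a downstream model. The t-SNE output is best treated as a picture, since distances between far-apart groups in the embedding are not reliable.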
3. Anomaly Detection
Identify data points that don’t fit the normal pattern.
Use cases: Fraud detection, network intrusion, equipment failure prediction, quality control.
Key approaches:
- Isolation Forest: Anomalies are easier to isolate in random trees
- One-Class SVM: Learn the boundary of normal data
- Autoencoder reconstruction error: Normal data reconstructs well; anomalies don’t
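As a small illustration of the first approach, the sketch below plants ten obvious outliers in synthetic 2D data and asks an Isolation Forest to find them. The data, the `contamination` value, and the number of trees are assumptions made for the example; on real data the share of anomalies is usually a guess.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 500 "normal" points around the origin plus 10 planted outliers far away.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([normal, outliers])

# contamination is the assumed share of anomalies in the data.
iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
labels = iso.fit_predict(X)    # +1 = normal, -1 = anomaly
scores = iso.score_samples(X)  # lower score = more anomalous

print("Points flagged as anomalies:", int((labels == -1).sum()))
print("Planted outliers flagged:", int((labels[-10:] == -1).sum()))
```

In practice the scores are often more useful than the hard labels, because they let you rank records for review instead of committing to a single cutoff.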
K-Means in Practice
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# Customer RFM data (Recency, Frequency, Monetary)
df = pd.read_csv("customers.csv")
X = df[["recency_days", "frequency", "total_spend"]]

# Scale features (K-means is distance-based, so scale matters)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal K using elbow method
k_values = range(2, 11)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Plot inertia against K and look for the point where the curve flattens
plt.plot(k_values, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.show()

# Fit with chosen K
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["segment"] = kmeans.fit_predict(X_scaled)

# Average RFM values per segment
print(df.groupby("segment")[["recency_days", "frequency", "total_spend"]].mean())
PCA: Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# 64-dimensional digit images → 2D for visualization
digits = load_digits()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)

print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar()
plt.title("Digits dataset compressed to 2D via PCA")
plt.show()
The Evaluation Problem
Unsupervised learning has no ground truth to evaluate against. How do you know if your clusters are good?
Internal metrics (no labels needed):
- Silhouette score: Measures how similar a point is to its own cluster vs. others. Range -1 to 1; higher is better.
- Inertia (K-Means): Within-cluster sum of squares. It keeps shrinking as K grows, so look for the elbow where the improvement levels off rather than for the minimum.
- Davies-Bouldin index: Lower means more separated, compact clusters.
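These internal metrics are all available in scikit-learn. The sketch below scores K-Means for several values of K on synthetic blobs; the dataset and the range of K are assumptions chosen so the example runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with 4 generating centers (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)      # higher is better, range -1 to 1
    dbi = davies_bouldin_score(X, labels)  # lower is better, minimum 0
    results[k] = (sil, dbi, km.inertia_)
    print(f"K={k}  silhouette={sil:.3f}  davies_bouldin={dbi:.3f}  inertia={km.inertia_:.1f}")

# One simple rule: pick the K with the highest silhouette score.
best_k = max(results, key=lambda k: results[k][0])
print("K with the highest silhouette score:", best_k)
```

The metrics do not always agree with each other, so it helps to look at them together rather than trusting any single number.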
External metrics (when you have labels for validation):
- Adjusted Rand Index: Measures agreement with true labels
- Normalized Mutual Information: Measures information overlap
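When labels do exist for a validation set, both external metrics are one call each. The sketch below clusters the scikit-learn digits dataset and compares the result with the true digit classes; the dataset and the choice of K=10 are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

digits = load_digits()

# Cluster without looking at the labels.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Compare the clustering with the true digit classes afterwards.
ari = adjusted_rand_score(digits.target, labels)
nmi = normalized_mutual_info_score(digits.target, labels)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Information: {nmi:.3f}")

# Both metrics ignore the numbering of the clusters: renaming every
# cluster id leaves the scores unchanged.
relabeled = (labels + 1) % 10
ari_relabeled = adjusted_rand_score(digits.target, relabeled)
print(f"ARI after renaming clusters: {ari_relabeled:.3f}")
```

That invariance is the reason to use these metrics instead of plain accuracy, which would require matching cluster ids to class ids first.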
The honest approach: Evaluate downstream. Good clustering should produce segments with different behavior, different churn rates, different conversion rates. If your customer segments don’t lead to different marketing outcomes, the clustering isn’t useful regardless of its silhouette score.
2025–2026: Self-Supervised Learning
The boundary between unsupervised and supervised learning has blurred significantly. Self-supervised learning — where models create their own supervision signals from data structure — now dominates pre-training for LLMs and vision models.
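Contrastive methods such as SimCLR and CLIP learn powerful representations by pulling matching pairs together and pushing mismatched pairs apart, without manually annotated class labels (CLIP relies on image-caption pairs instead). Related methods such as DINO get comparable representations through self-distillation rather than explicit negative pairs. These representations transfer well to downstream tasks.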
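The core idea behind a contrastive objective fits in a few lines of NumPy. What follows is a simplified, one-directional InfoNCE-style loss written for illustration. It is not the exact loss used by SimCLR or CLIP, and the function name, the temperature, and the random "embeddings" are all assumptions.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Simplified InfoNCE-style loss: row i of z_a should match row i of z_b."""
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Similarity of every item in z_a to every item in z_b, shape (n, n).
    logits = z_a @ z_b.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching pair sits on the diagonal; every other column is a negative.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
positives = anchors + 0.05 * rng.normal(size=(8, 16))  # slightly perturbed "views"
unrelated = rng.normal(size=(8, 16))                   # no relationship to anchors

loss_aligned = info_nce_loss(anchors, positives)
loss_random = info_nce_loss(anchors, unrelated)
print(f"Loss with matching pairs:  {loss_aligned:.3f}")
print(f"Loss with unrelated pairs: {loss_random:.3f}")
```

The loss is low when each item is most similar to its own partner and high when the pairing carries no signal. Training an encoder to minimize it is what produces the supervision signal without human labels.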
In practice: for tabular business data, classic unsupervised methods (K-Means, PCA, Isolation Forest) remain the standard. For unstructured data (text, images), self-supervised pre-training followed by fine-tuning is the dominant paradigm.