K-Means Clustering: Partition-Based Unsupervised Learning

Understand K-means clustering — centroid initialization, the elbow method, silhouette analysis, limitations with non-convex clusters, and practical implementation.

K-Means Clustering

K-means is the most widely used clustering algorithm. It partitions N data points into K clusters, where each point belongs to the cluster whose centroid (mean) is closest. Simple, fast, and scalable — but with well-defined assumptions about cluster shape that matter when the assumptions break down.


The Algorithm

1. Initialize K centroids (randomly or via K-means++)
2. Assign: Each point → nearest centroid
3. Update: Each centroid → mean of its assigned points
4. Repeat steps 2–3 until centroids stop moving (convergence)
Initial state: After iteration 1: Converged:
○ × ○ × ○ ○ × ○ × ○ [○ ○ ○] [× × ×]
× ○ × ○ × → × ○ × ○ × → [× × ×] [○ ○ ○]
○ × ○ × ○ ○ × ○ × ○
C₁ C₂ C₁' C₂' C₁'' C₂''

Basic Implementation

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
# Fit K-means
kmeans = KMeans(
n_clusters=4,
init='k-means++', # Smart initialization (much better than random)
n_init=10, # Run 10 times with different initializations, keep best
max_iter=300,
random_state=42
)
kmeans.fit(X)
labels = kmeans.labels_ # Cluster assignment per point
centers = kmeans.cluster_centers_ # K centroid coordinates
inertia = kmeans.inertia_ # Sum of squared distances to nearest centroid
# Predict cluster for new points
new_labels = kmeans.predict(X_new)

Choosing K: The Elbow Method

inertias = []
k_range = range(1, 15)
for k in k_range:
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
km.fit(X)
inertias.append(km.inertia_)
plt.plot(k_range, inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster SS)')
plt.title('Elbow Method for Optimal K')

The “elbow” is where adding more clusters gives diminishing inertia reduction — visually a bend in the curve. It’s a heuristic, not a precise rule.


Choosing K: Silhouette Score

More principled than the elbow method:

from sklearn.metrics import silhouette_score, silhouette_samples
silhouette_scores = []
for k in range(2, 15):
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X)
score = silhouette_score(X, labels)
silhouette_scores.append(score)
print(f"K={k}: silhouette = {score:.3f}")
# Silhouette ranges from -1 (wrong cluster) to 1 (perfect cluster)
# Higher is better; K with highest score is typically optimal
best_k = range(2, 15)[np.argmax(silhouette_scores)]

K-Means++ Initialization

Random initialization can lead to poor local optima. K-means++ picks initial centroids that are spread out:

1. Pick first centroid randomly
2. For each subsequent centroid:
- Compute distance D(x) from each point x to nearest chosen centroid
- Pick next centroid with probability proportional to D(x)²
- Far-away points get more likely picked
3. This gives a better starting spread, leading to better final clusters

Always use init='k-means++' (sklearn default since 0.24).


Mini-Batch K-Means for Large Datasets

Full K-means is O(N × K × iterations). For large datasets:

from sklearn.cluster import MiniBatchKMeans
mb_kmeans = MiniBatchKMeans(
n_clusters=10,
batch_size=1024,
n_init=10,
random_state=42
)
mb_kmeans.fit(X_large) # Much faster for N > 100k

Slightly worse cluster quality than full K-means but usually acceptable.


Preprocessing Requirements

K-means is distance-based — scale matters:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
kmeans_pipeline = Pipeline([
('scaler', StandardScaler()), # Always scale before K-means
('kmeans', KMeans(n_clusters=4, init='k-means++', random_state=42))
])
kmeans_pipeline.fit(X)
labels = kmeans_pipeline.named_steps['kmeans'].labels_

When K-Means Fails

K-means assumes spherical, equally-sized clusters. It will produce poor results for:

Non-convex shapes (moons, rings): → Use DBSCAN or Spectral Clustering
Highly unequal cluster sizes: → Use DBSCAN or Gaussian Mixture Models
Very different cluster densities: → Use DBSCAN
High-dimensional sparse data: → Apply PCA/UMAP first, then K-means
Presence of outliers: → Use DBSCAN (ignores noise points)
# For non-convex clusters: DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
# For probabilistic cluster membership: Gaussian Mixture Model
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X)

K-means remains the first clustering algorithm to try — fast, interpretable, and often sufficient. Its limitations are well-understood and DBSCAN or GMM are clean alternatives for the cases where K-means breaks down.