K-Means Clustering

K-means is the most widely used clustering algorithm. It partitions N data points into K clusters, where each point belongs to the cluster whose centroid (mean) is closest. Simple, fast, and scalable — but with well-defined assumptions about cluster shape that matter when the assumptions break down.

The Algorithm

1. Initialize K centroids (randomly or via K-means++)
2. Assign: Each point → nearest centroid
3. Update: Each centroid → mean of its assigned points
4. Repeat steps 2–3 until centroids stop moving (convergence)

Initial state:        After iteration 1:    Converged:
○ × ○ × ○            ○ × ○ × ○             [○ ○ ○] [× × ×]
× ○ × ○ ×     →      × ○ × ○ ×     →      [× × ×] [○ ○ ○]
○ × ○ × ○            ○ × ○ × ○
C₁   C₂              C₁'  C₂'              C₁''  C₂''

Basic Implementation

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Fit K-means
kmeans = KMeans(
    n_clusters=4,
    init='k-means++',   # Smart initialization (much better than random)
    n_init=10,          # Run 10 times with different initializations, keep best
    max_iter=300,
    random_state=42
)

kmeans.fit(X)

labels = kmeans.labels_           # Cluster assignment per point
centers = kmeans.cluster_centers_ # K centroid coordinates
inertia = kmeans.inertia_         # Sum of squared distances to nearest centroid

# Predict cluster for new points
new_labels = kmeans.predict(X_new)

Choosing K: The Elbow Method

inertias = []
k_range = range(1, 15)

for k in k_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_range, inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster SS)')
plt.title('Elbow Method for Optimal K')

The “elbow” is where adding more clusters gives diminishing inertia reduction — visually a bend in the curve. It’s a heuristic, not a precise rule.

Choosing K: Silhouette Score

More principled than the elbow method:

from sklearn.metrics import silhouette_score, silhouette_samples

silhouette_scores = []

for k in range(2, 15):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)
    print(f"K={k}: silhouette = {score:.3f}")

# Silhouette ranges from -1 (wrong cluster) to 1 (perfect cluster)
# Higher is better; K with highest score is typically optimal
best_k = range(2, 15)[np.argmax(silhouette_scores)]

K-Means++ Initialization

Random initialization can lead to poor local optima. K-means++ picks initial centroids that are spread out:

1. Pick first centroid randomly
2. For each subsequent centroid:
   - Compute distance D(x) from each point x to nearest chosen centroid
   - Pick next centroid with probability proportional to D(x)²
   - Far-away points get more likely picked
3. This gives a better starting spread, leading to better final clusters

Always use init='k-means++' (sklearn default since 0.24).

Mini-Batch K-Means for Large Datasets

Full K-means is O(N × K × iterations). For large datasets:

from sklearn.cluster import MiniBatchKMeans

mb_kmeans = MiniBatchKMeans(
    n_clusters=10,
    batch_size=1024,
    n_init=10,
    random_state=42
)

mb_kmeans.fit(X_large)  # Much faster for N > 100k

Slightly worse cluster quality than full K-means but usually acceptable.

Preprocessing Requirements

K-means is distance-based — scale matters:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

kmeans_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Always scale before K-means
    ('kmeans', KMeans(n_clusters=4, init='k-means++', random_state=42))
])

kmeans_pipeline.fit(X)
labels = kmeans_pipeline.named_steps['kmeans'].labels_

When K-Means Fails

K-means assumes spherical, equally-sized clusters. It will produce poor results for:

Non-convex shapes (moons, rings): → Use DBSCAN or Spectral Clustering
Highly unequal cluster sizes:     → Use DBSCAN or Gaussian Mixture Models
Very different cluster densities:  → Use DBSCAN
High-dimensional sparse data:     → Apply PCA/UMAP first, then K-means
Presence of outliers:              → Use DBSCAN (ignores noise points)

# For non-convex clusters: DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# For probabilistic cluster membership: Gaussian Mixture Model
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, random_state=42)
labels = gmm.fit_predict(X)
probabilities = gmm.predict_proba(X)

K-means remains the first clustering algorithm to try — fast, interpretable, and often sufficient. Its limitations are well-understood and DBSCAN or GMM are clean alternatives for the cases where K-means breaks down.