K-Means Clustering
K-means is the most widely used clustering algorithm. It partitions N data points into K clusters, where each point belongs to the cluster whose centroid (mean) is closest. Simple, fast, and scalable — but with well-defined assumptions about cluster shape that matter when the assumptions break down.
The Algorithm
1. Initialize K centroids (randomly or via K-means++)2. Assign: Each point → nearest centroid3. Update: Each centroid → mean of its assigned points4. Repeat steps 2–3 until centroids stop moving (convergence)Initial state: After iteration 1: Converged:○ × ○ × ○ ○ × ○ × ○ [○ ○ ○] [× × ×]× ○ × ○ × → × ○ × ○ × → [× × ×] [○ ○ ○]○ × ○ × ○ ○ × ○ × ○C₁ C₂ C₁' C₂' C₁'' C₂''Basic Implementation
from sklearn.cluster import KMeansimport matplotlib.pyplot as pltimport numpy as np
# Fit K-meanskmeans = KMeans( n_clusters=4, init='k-means++', # Smart initialization (much better than random) n_init=10, # Run 10 times with different initializations, keep best max_iter=300, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_ # Cluster assignment per pointcenters = kmeans.cluster_centers_ # K centroid coordinatesinertia = kmeans.inertia_ # Sum of squared distances to nearest centroid
# Predict cluster for new pointsnew_labels = kmeans.predict(X_new)Choosing K: The Elbow Method
inertias = []k_range = range(1, 15)
for k in k_range: km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42) km.fit(X) inertias.append(km.inertia_)
plt.plot(k_range, inertias, marker='o')plt.xlabel('Number of Clusters (K)')plt.ylabel('Inertia (Within-Cluster SS)')plt.title('Elbow Method for Optimal K')The “elbow” is where adding more clusters gives diminishing inertia reduction — visually a bend in the curve. It’s a heuristic, not a precise rule.
Choosing K: Silhouette Score
More principled than the elbow method:
from sklearn.metrics import silhouette_score, silhouette_samples
silhouette_scores = []
for k in range(2, 15): km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42) labels = km.fit_predict(X) score = silhouette_score(X, labels) silhouette_scores.append(score) print(f"K={k}: silhouette = {score:.3f}")
# Silhouette ranges from -1 (wrong cluster) to 1 (perfect cluster)# Higher is better; K with highest score is typically optimalbest_k = range(2, 15)[np.argmax(silhouette_scores)]K-Means++ Initialization
Random initialization can lead to poor local optima. K-means++ picks initial centroids that are spread out:
1. Pick first centroid randomly2. For each subsequent centroid: - Compute distance D(x) from each point x to nearest chosen centroid - Pick next centroid with probability proportional to D(x)² - Far-away points get more likely picked3. This gives a better starting spread, leading to better final clustersAlways use init='k-means++' (sklearn default since 0.24).
Mini-Batch K-Means for Large Datasets
Full K-means is O(N × K × iterations). For large datasets:
from sklearn.cluster import MiniBatchKMeans
mb_kmeans = MiniBatchKMeans( n_clusters=10, batch_size=1024, n_init=10, random_state=42)
mb_kmeans.fit(X_large) # Much faster for N > 100kSlightly worse cluster quality than full K-means but usually acceptable.
Preprocessing Requirements
K-means is distance-based — scale matters:
from sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipeline
kmeans_pipeline = Pipeline([ ('scaler', StandardScaler()), # Always scale before K-means ('kmeans', KMeans(n_clusters=4, init='k-means++', random_state=42))])
kmeans_pipeline.fit(X)labels = kmeans_pipeline.named_steps['kmeans'].labels_When K-Means Fails
K-means assumes spherical, equally-sized clusters. It will produce poor results for:
Non-convex shapes (moons, rings): → Use DBSCAN or Spectral ClusteringHighly unequal cluster sizes: → Use DBSCAN or Gaussian Mixture ModelsVery different cluster densities: → Use DBSCANHigh-dimensional sparse data: → Apply PCA/UMAP first, then K-meansPresence of outliers: → Use DBSCAN (ignores noise points)# For non-convex clusters: DBSCANfrom sklearn.cluster import DBSCANdbscan = DBSCAN(eps=0.5, min_samples=5)labels = dbscan.fit_predict(X)
# For probabilistic cluster membership: Gaussian Mixture Modelfrom sklearn.mixture import GaussianMixturegmm = GaussianMixture(n_components=4, random_state=42)labels = gmm.fit_predict(X)probabilities = gmm.predict_proba(X)K-means remains the first clustering algorithm to try — fast, interpretable, and often sufficient. Its limitations are well-understood and DBSCAN or GMM are clean alternatives for the cases where K-means breaks down.