K-Nearest Neighbors
K-Nearest Neighbors (KNN) is the simplest machine learning algorithm that actually works. There’s no training phase — the algorithm stores the entire dataset and at prediction time, finds the K most similar training examples and takes a vote. It’s lazy learning at its purest.
How It Works
Training: Just store all (X, y) pairs — nothing else.
Prediction for a new point q: 1. Compute distance from q to every training point 2. Select the K closest points (the "neighbors") 3. Classification: majority vote among K neighbors Regression: average of K neighbors' target valuesWith K=3, the predicted class is whichever label appears most often among the 3 nearest training points.
Implementing KNN
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressorfrom sklearn.preprocessing import StandardScalerfrom sklearn.pipeline import Pipeline
# CRITICAL: Scale features — KNN uses distances, scale matters enormouslyknn_pipeline = Pipeline([ ('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)) # p=2 = Euclidean])
knn_pipeline.fit(X_train, y_train)y_pred = knn_pipeline.predict(X_test)y_prob = knn_pipeline.predict_proba(X_test)Choosing K
K is the most critical hyperparameter:
K=1: Most flexible. Fits every training point exactly. Training accuracy = 100%. High variance. Overfits noise.
K=N: Predicts the same class for everything (majority class). Extreme underfitting. High bias.
Best K: Usually between 3 and 20. Found via cross-validation.from sklearn.model_selection import cross_val_scoreimport matplotlib.pyplot as plt
k_range = range(1, 31)cv_scores = []
for k in k_range: knn = Pipeline([ ('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k)) ]) scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy') cv_scores.append(scores.mean())
plt.plot(k_range, cv_scores)plt.xlabel('K')plt.ylabel('CV Accuracy')plt.title('KNN: Cross-Validation Accuracy vs K')Distance Metrics
# Euclidean distance (default, p=2 in Minkowski)knn_euclidean = KNeighborsClassifier(metric='euclidean')
# Manhattan distance (p=1 in Minkowski) — less sensitive to outliersknn_manhattan = KNeighborsClassifier(metric='manhattan')
# Minkowski (generalization): p=1 → Manhattan, p=2 → Euclideanknn_minkowski = KNeighborsClassifier(metric='minkowski', p=3)
# Cosine similarity — great for text/sparse featuresknn_cosine = KNeighborsClassifier(metric='cosine')
# Hamming distance — for categorical featuresknn_hamming = KNeighborsClassifier(metric='hamming')Curse of Dimensionality
KNN degrades significantly in high dimensions. As dimensions increase, all points become roughly equidistant from each other — the concept of “nearest neighbor” breaks down.
In 2D: nearest neighbors are truly closeIn 100D: nearest neighbors may be nearly as far as random pointsIn 10,000D: distance differences become noise — KNN failsSolutions:
- Dimensionality reduction (PCA, UMAP) before KNN
- Feature selection to remove irrelevant features
- Switch to a parametric model for high-dimensional data
Efficient Nearest Neighbor Search
Brute-force search over N training points is O(N × D) per query — slow for large datasets. Sklearn provides two faster options:
# KD-Tree: great for D < ~20 dimensionsknn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
# Ball Tree: works better for high D or non-Euclidean metricsknn_balltree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='haversine')
# Auto: sklearn picks the best algorithmknn_auto = KNeighborsClassifier(n_neighbors=5, algorithm='auto')For very large datasets (>1M samples), use approximate nearest neighbor libraries like Faiss, Annoy, or HNSW.
Weighted KNN
Default KNN gives equal weight to all K neighbors. Weighted KNN gives closer neighbors more influence:
# Weight by inverse distance — closer = more influenceknn_weighted = KNeighborsClassifier(n_neighbors=10, weights='distance')
# Custom weight functiondef custom_weights(distances): return 1 / (distances ** 2 + 1e-5)
knn_custom = KNeighborsClassifier(n_neighbors=10, weights=custom_weights)When to Use KNN
KNN shines when:
- Dataset is small (< ~50k samples)
- Decision boundaries are complex and irregular
- Features are well-scaled and few in number
- You need a non-parametric model with no assumptions about data distribution
- You need a simple baseline that’s hard to beat on local-structure data
Avoid KNN when:
- Dataset is large (prediction time scales with N)
- Features are high-dimensional (>20–30 features without reduction)
- Features have very different scales and you can’t normalize them meaningfully
- Real-time prediction with strict latency requirements