K-Nearest Neighbors

K-Nearest Neighbors (KNN) is the simplest machine learning algorithm that actually works. There’s no training phase — the algorithm stores the entire dataset and at prediction time, finds the K most similar training examples and takes a vote. It’s lazy learning at its purest.

How It Works

Training: Just store all (X, y) pairs — nothing else.

Prediction for a new point q:
  1. Compute distance from q to every training point
  2. Select the K closest points (the "neighbors")
  3. Classification: majority vote among K neighbors
     Regression: average of K neighbors' target values

With K=3, the predicted class is whichever label appears most often among the 3 nearest training points.

Implementing KNN

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# CRITICAL: Scale features — KNN uses distances, scale matters enormously
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2))  # p=2 = Euclidean
])

knn_pipeline.fit(X_train, y_train)
y_pred = knn_pipeline.predict(X_test)
y_prob = knn_pipeline.predict_proba(X_test)

Choosing K

K is the most critical hyperparameter:

K=1: Most flexible. Fits every training point exactly.
     Training accuracy = 100%. High variance. Overfits noise.

K=N: Predicts the same class for everything (majority class).
     Extreme underfitting. High bias.

Best K: Usually between 3 and 20. Found via cross-validation.

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

k_range = range(1, 31)
cv_scores = []

for k in k_range:
    knn = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('CV Accuracy')
plt.title('KNN: Cross-Validation Accuracy vs K')

Distance Metrics

# Euclidean distance (default, p=2 in Minkowski)
knn_euclidean = KNeighborsClassifier(metric='euclidean')

# Manhattan distance (p=1 in Minkowski) — less sensitive to outliers
knn_manhattan = KNeighborsClassifier(metric='manhattan')

# Minkowski (generalization): p=1 → Manhattan, p=2 → Euclidean
knn_minkowski = KNeighborsClassifier(metric='minkowski', p=3)

# Cosine similarity — great for text/sparse features
knn_cosine = KNeighborsClassifier(metric='cosine')

# Hamming distance — for categorical features
knn_hamming = KNeighborsClassifier(metric='hamming')

Curse of Dimensionality

KNN degrades significantly in high dimensions. As dimensions increase, all points become roughly equidistant from each other — the concept of “nearest neighbor” breaks down.

In 2D: nearest neighbors are truly close
In 100D: nearest neighbors may be nearly as far as random points
In 10,000D: distance differences become noise — KNN fails

Solutions:

Dimensionality reduction (PCA, UMAP) before KNN
Feature selection to remove irrelevant features
Switch to a parametric model for high-dimensional data

Efficient Nearest Neighbor Search

Brute-force search over N training points is O(N × D) per query — slow for large datasets. Sklearn provides two faster options:

# KD-Tree: great for D < ~20 dimensions
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')

# Ball Tree: works better for high D or non-Euclidean metrics
knn_balltree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='haversine')

# Auto: sklearn picks the best algorithm
knn_auto = KNeighborsClassifier(n_neighbors=5, algorithm='auto')

For very large datasets (>1M samples), use approximate nearest neighbor libraries like Faiss, Annoy, or HNSW.

Weighted KNN

Default KNN gives equal weight to all K neighbors. Weighted KNN gives closer neighbors more influence:

# Weight by inverse distance — closer = more influence
knn_weighted = KNeighborsClassifier(n_neighbors=10, weights='distance')

# Custom weight function
def custom_weights(distances):
    return 1 / (distances ** 2 + 1e-5)

knn_custom = KNeighborsClassifier(n_neighbors=10, weights=custom_weights)

When to Use KNN

KNN shines when:

Dataset is small (< ~50k samples)
Decision boundaries are complex and irregular
Features are well-scaled and few in number
You need a non-parametric model with no assumptions about data distribution
You need a simple baseline that’s hard to beat on local-structure data

Avoid KNN when:

Dataset is large (prediction time scales with N)
Features are high-dimensional (>20–30 features without reduction)
Features have very different scales and you can’t normalize them meaningfully
Real-time prediction with strict latency requirements