K-Nearest Neighbors: Instance-Based Classification and Regression

Understand K-Nearest Neighbors — distance metrics, K selection, curse of dimensionality, KD-trees, and when lazy learning outperforms parametric models.

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is the simplest machine learning algorithm that actually works. There’s no training phase — the algorithm stores the entire dataset and at prediction time, finds the K most similar training examples and takes a vote. It’s lazy learning at its purest.


How It Works

Training: Just store all (X, y) pairs — nothing else.
Prediction for a new point q:
1. Compute distance from q to every training point
2. Select the K closest points (the "neighbors")
3. Classification: majority vote among K neighbors
Regression: average of K neighbors' target values

With K=3, the predicted class is whichever label appears most often among the 3 nearest training points.


Implementing KNN

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# CRITICAL: Scale features — KNN uses distances, scale matters enormously
knn_pipeline = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)) # p=2 = Euclidean
])
knn_pipeline.fit(X_train, y_train)
y_pred = knn_pipeline.predict(X_test)
y_prob = knn_pipeline.predict_proba(X_test)

Choosing K

K is the most critical hyperparameter:

K=1: Most flexible. Fits every training point exactly.
Training accuracy = 100%. High variance. Overfits noise.
K=N: Predicts the same class for everything (majority class).
Extreme underfitting. High bias.
Best K: Usually between 3 and 20. Found via cross-validation.
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
k_range = range(1, 31)
cv_scores = []
for k in k_range:
knn = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=k))
])
scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
cv_scores.append(scores.mean())
plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('CV Accuracy')
plt.title('KNN: Cross-Validation Accuracy vs K')

Distance Metrics

# Euclidean distance (default, p=2 in Minkowski)
knn_euclidean = KNeighborsClassifier(metric='euclidean')
# Manhattan distance (p=1 in Minkowski) — less sensitive to outliers
knn_manhattan = KNeighborsClassifier(metric='manhattan')
# Minkowski (generalization): p=1 → Manhattan, p=2 → Euclidean
knn_minkowski = KNeighborsClassifier(metric='minkowski', p=3)
# Cosine similarity — great for text/sparse features
knn_cosine = KNeighborsClassifier(metric='cosine')
# Hamming distance — for categorical features
knn_hamming = KNeighborsClassifier(metric='hamming')

Curse of Dimensionality

KNN degrades significantly in high dimensions. As dimensions increase, all points become roughly equidistant from each other — the concept of “nearest neighbor” breaks down.

In 2D: nearest neighbors are truly close
In 100D: nearest neighbors may be nearly as far as random points
In 10,000D: distance differences become noise — KNN fails

Solutions:

  1. Dimensionality reduction (PCA, UMAP) before KNN
  2. Feature selection to remove irrelevant features
  3. Switch to a parametric model for high-dimensional data

Brute-force search over N training points is O(N × D) per query — slow for large datasets. Sklearn provides two faster options:

# KD-Tree: great for D < ~20 dimensions
knn_kdtree = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
# Ball Tree: works better for high D or non-Euclidean metrics
knn_balltree = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='haversine')
# Auto: sklearn picks the best algorithm
knn_auto = KNeighborsClassifier(n_neighbors=5, algorithm='auto')

For very large datasets (>1M samples), use approximate nearest neighbor libraries like Faiss, Annoy, or HNSW.


Weighted KNN

Default KNN gives equal weight to all K neighbors. Weighted KNN gives closer neighbors more influence:

# Weight by inverse distance — closer = more influence
knn_weighted = KNeighborsClassifier(n_neighbors=10, weights='distance')
# Custom weight function
def custom_weights(distances):
return 1 / (distances ** 2 + 1e-5)
knn_custom = KNeighborsClassifier(n_neighbors=10, weights=custom_weights)

When to Use KNN

KNN shines when:

  • Dataset is small (< ~50k samples)
  • Decision boundaries are complex and irregular
  • Features are well-scaled and few in number
  • You need a non-parametric model with no assumptions about data distribution
  • You need a simple baseline that’s hard to beat on local-structure data

Avoid KNN when:

  • Dataset is large (prediction time scales with N)
  • Features are high-dimensional (>20–30 features without reduction)
  • Features have very different scales and you can’t normalize them meaningfully
  • Real-time prediction with strict latency requirements