The Story That Explains KNN
You do not read economic theory. You do not build a complex model. You simply ask: "What are the five nearest flats paying right now?" If four of the five nearest flats pay £2,000 per month and one pays £1,200, you estimate £1,840 (the average), and you are probably right.

Now imagine a doctor trying to decide if a new patient has diabetes. She pulls up the five most similar patients in the hospital database: same age, same BMI, similar glucose levels. Four of those five have diabetes. Her prediction: diabetes. Majority vote. No formula needed.

That is K-Nearest Neighbors in its entirety. Find the K most similar examples you have already seen. Let them vote (classification) or average (regression). Done.
Most algorithms learn a model during training and discard the raw data. KNN does the opposite: it memorises the entire training set and does all the work at prediction time. There is no training phase at all. This makes fitting instantaneous but prediction slow on large datasets, since every new query must compute distances to all stored points.
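A minimal sketch makes the asymmetry visible (synthetic data, illustrative sizes): fit barely does anything, while predict pays the full cost of the distance search.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset: 50,000 points, 20 features (sizes are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = (X[:, 0] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X, y)                 # "training" mostly just stores/indexes the data
print(f"fit:     {time.perf_counter() - t0:.4f}s")

t0 = time.perf_counter()
knn.predict(X[:1_000])        # the distance search happens here
print(f"predict: {time.perf_counter() - t0:.4f}s")
```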
How KNN Works: Four Steps

1. Compute the distance from the new point to every stored training point.
2. Sort the distances and take the K smallest.
3. Collect the labels (or target values) of those K nearest neighbours.
4. Return the majority vote (classification) or the average (regression).
Distance Metrics: How "Closeness" Is Measured
The entire logic of KNN depends on a reliable measure of similarity. Different distance metrics lead to different neighbours, and therefore to different predictions. Choosing the right metric matters.
| Metric | Best For | Sensitive to Outliers | sklearn Parameter |
|---|---|---|---|
| Euclidean | Continuous, low-dimensional, normally distributed | Yes (squares differences) | metric='euclidean' |
| Manhattan | Continuous, high-dimensional, noisy data | Less sensitive | metric='manhattan' |
| Minkowski | General: tune p as a hyperparameter | Depends on p | metric='minkowski', p=2 |
| Hamming | Categorical, binary, text features | No | metric='hamming' |
| Cosine | Text vectors, where direction matters more than magnitude | No | metric='cosine' |
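To make the table concrete, here is a small sketch computing a few of these metrics on toy vectors with scipy (cityblock is scipy's name for Manhattan distance):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, hamming

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(f"Euclidean: {euclidean(a, b):.3f}")  # sqrt(3² + 2² + 0²) ≈ 3.606
print(f"Manhattan: {cityblock(a, b):.3f}")  # |3| + |2| + |0| = 5.000
print(f"Cosine:    {cosine(a, b):.3f}")     # 1 − cos(angle between a and b)

# Hamming suits categorical/binary features: fraction of positions that differ
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 1, 0])
print(f"Hamming:   {hamming(u, v):.3f}")    # 2 of 4 positions differ → 0.500
```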
Step-by-Step Manual Calculation
Let us classify one new patient by hand using K=3 and Euclidean distance. We have five existing patients, each with two features (Age and Glucose Level) and a known diabetes label.
| Patient | Age | Glucose | Diabetes? |
|---|---|---|---|
| P1 | 25 | 80 | No |
| P2 | 35 | 150 | Yes |
| P3 | 45 | 180 | Yes |
| P4 | 20 | 70 | No |
| P5 | 50 | 200 | Yes |
| NEW | 40 | 160 | ? |
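Computing the Euclidean distance from NEW (Age 40, Glucose 160) to each patient on the raw, unscaled features:

| Patient | Calculation | Distance | Label |
|---|---|---|---|
| P1 | √(15² + 80²) = √6625 | ≈ 81.4 | No |
| P2 | √(5² + 10²) = √125 | ≈ 11.2 | Yes |
| P3 | √(5² + 20²) = √425 | ≈ 20.6 | Yes |
| P4 | √(20² + 90²) = √8500 | ≈ 92.2 | No |
| P5 | √(10² + 40²) = √1700 | ≈ 41.2 | Yes |

The K=3 nearest neighbours are P2, P3 and P5, all labelled Yes, so the new patient is classified as diabetic (3 votes to 0).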
Notice that Glucose ranges from 70–200 while Age ranges from 20–50. Glucose differences dominate the distance calculation by a factor of roughly four, so Age barely influences which neighbours are selected. In a real implementation you must scale both features to the same range before computing distances, or Glucose silently overrules everything else.
Classification vs Regression

For classification, the K nearest neighbours vote and the majority class wins. A K=5 spam filter, for example:
| Neighbour | Label | Distance |
|---|---|---|
| Neighbour 1 | Spam | 0.12 |
| Neighbour 2 | Spam | 0.18 |
| Neighbour 3 | Ham | 0.25 |
| Neighbour 4 | Spam | 0.31 |
| Neighbour 5 | Ham | 0.40 |
| Prediction | Spam (3 vs 2 votes) | |
For regression, the prediction is the mean (or distance-weighted mean) of the neighbours' target values. Predicting a house price with K=5:

| Neighbour | Price (£k) | Distance |
|---|---|---|
| Neighbour 1 | 295 | 0.08 |
| Neighbour 2 | 310 | 0.15 |
| Neighbour 3 | 280 | 0.22 |
| Neighbour 4 | 325 | 0.30 |
| Neighbour 5 | 300 | 0.38 |
| Prediction | £302,000 (mean) | |
Choosing K: The Most Critical Decision

K=1 (a single neighbour): every prediction copies the closest training point, so each noisy or mislabelled point carves out its own region. This is maximum overfitting.

K=N (all training points): every query returns the majority class of the entire dataset, completely ignoring the location of the new point. This is maximum underfitting.
The right K sits in between: small enough to respect local structure, large enough to smooth out noise. The answer is always found with cross-validation.
| K Value | Decision Boundary | Bias | Variance | Risk |
|---|---|---|---|---|
| K = 1 | Extremely jagged, erratic | Very Low | Very High | Overfitting: memorises noise |
| K = 3–5 | Jagged but smoothing | Low | Moderate-High | Good for small clean datasets |
| K = 11–21 | Smooth, well-generalised | Balanced ✓ | Moderate | Best general range for most data |
| K = 51+ | Very smooth, global | High | Low | Underfitting: too much smoothing |
| K = N | Constant: one class everywhere | Maximum | Zero | Predicts majority class always |
Feature Scaling: Mandatory, Never Optional
KNN's core operation is measuring distance. Any feature with a larger numerical range will dominate all distance calculations, effectively making all other features invisible to the model.
Before scaling, in raw units:

| Feature | Patient A | Patient B | Diff² |
|---|---|---|---|
| Age (years) | 30 | 31 | 1 |
| Income (£) | 30,000 | 70,000 | 1,600,000,000 |
| BMI | 22.5 | 35.0 | 156.25 |
| Euclidean d | ≈ 40,000.00 (entirely Income) | | |
After scaling to z-scores:

| Feature | Patient A | Patient B | Diff² |
|---|---|---|---|
| Age (z-score) | −0.85 | −0.71 | 0.0196 |
| Income (z-score) | −0.92 | 1.15 | 4.2849 |
| BMI (z-score) | −1.20 | 0.95 | 4.6225 |
| Euclidean d | ≈ 2.99 (all features contribute) | | |
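The same effect in code, as a sketch with made-up patient rows (the exact z-scores depend on the sample the scaler is fit on, so they will not match the table above exactly):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative patients: [age (years), income (£), BMI]; values are invented
X = np.array([
    [30, 30_000, 22.5],
    [31, 70_000, 35.0],
    [45, 52_000, 27.0],
    [52, 41_000, 31.5],
])

def dist(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distance between patients A and B is dominated by income
print(f"Raw distance:    {dist(X[0], X[1]):>12,.2f}")

# After z-scoring each column, every feature contributes comparably
X_sc = StandardScaler().fit_transform(X)
print(f"Scaled distance: {dist(X_sc[0], X_sc[1]):>12.2f}")
```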
The Curse of Dimensionality
In 100 dimensions, your "nearest" neighbours are almost as far away as your farthest neighbours. The concept of "near" loses all meaning: every point becomes approximately equidistant from every other point. When all distances are similar, KNN has no signal to work with and is effectively guessing randomly. This is the curse of dimensionality, and it is the primary reason KNN fails on high-dimensional data. The simulation after the table below makes the effect concrete.
| Dimensions | Points Needed (1% coverage) | Nearest vs Farthest Ratio | KNN Effectiveness |
|---|---|---|---|
| 1D | 10 points | Very different | Excellent |
| 2D | 100 points | Clearly different | Very good |
| 10D | 10¹⁰ points | Getting similar | Declining |
| 100D | 10¹⁰⁰ points | Nearly identical | Breaks down |
| 1000D+ | Impossible | All points equidistant | Random guessing |
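A quick simulation shows the concentration directly (a sketch; 1,000 uniform random points per dimensionality, sizes arbitrary). The nearest-to-farthest distance ratio climbs toward 1.0 as dimensions grow:

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1_000

print(f"{'dims':>6} {'nearest/farthest':>18}")
for d in [1, 2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(X - query, axis=1)
    # Ratio near 0: clear neighbours. Ratio near 1: all points equidistant.
    print(f"{d:>6} {dists.min() / dists.max():>18.3f}")
```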
When features exceed ~20–30, apply dimensionality reduction before KNN. PCA retains the directions of maximum variance and is the fastest option. UMAP or t-SNE preserve local neighbourhood structure better, which makes them well suited to KNN preprocessing. As a rule of thumb, reduce to √(n_features) dimensions, or use cross-validation to find the optimal number of components.
Python Implementation
KNN Classification: Iris Dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# CRITICAL: always put StandardScaler inside the Pipeline;
# this prevents data leakage from the test set into the scaling step
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(
        n_neighbors=5,        # K: tune with cross-validation
metric='euclidean', # distance function
weights='uniform', # all neighbours vote equally
algorithm='auto', # auto-selects ball_tree/kd_tree/brute
n_jobs=-1 # use all CPU cores for search
))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f"Accuracy: {pipe.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Predict with class probabilities (fraction of K neighbours per class)
y_prob = pipe.predict_proba(X_test)
for i in range(3):
cls = iris.target_names[y_pred[i]]
conf = y_prob[i].max() * 100
print(f" Sample {i+1}: {cls:12s} ({conf:.0f}% of neighbours agree)")
Finding Optimal K with Cross-Validation
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
k_range = range(1, 32, 2) # odd values 1, 3, 5 ... 31
k_scores = []
for k in k_range:
pipe_k = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=k, n_jobs=-1))
])
scores = cross_val_score(
pipe_k, X_train, y_train,
cv=5, scoring='accuracy'
)
k_scores.append((k, scores.mean(), scores.std()))
# Print table of K vs CV accuracy
print(f"{'K':>4} {'CV Accuracy':>12} {'Std Dev':>10}")
print("-" * 30)
for k, mean, std in k_scores:
    marker = " ← BEST" if mean == max(s[1] for s in k_scores) else ""
    print(f"{k:>4} {mean:>12.4f}  ±{std:.4f}{marker}")
best_k = max(k_scores, key=lambda x: x[1])[0]
print(f"\nOptimal K = {best_k}")
KNN Regression: House Price Prediction
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Distance-weighted KNN regression: closer neighbours count more
knn_reg = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsRegressor(
n_neighbors=10,
weights='distance', # closer neighbours contribute more
metric='euclidean',
n_jobs=-1
))
])
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)
print(f"Rยฒ Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
Implementing KNN from Scratch
import numpy as np
from collections import Counter
def euclidean_distance(x1, x2):
"""Straight-line distance between two points."""
return np.sqrt(np.sum((x1 - x2) ** 2))
class KNNClassifier:
def __init__(self, k=5):
self.k = k
def fit(self, X, y):
"""No training โ just memorise the dataset."""
self.X_train = X
self.y_train = y
return self
def predict(self, X):
"""Predict class for each point in X."""
return np.array([self._predict_one(x) for x in X])
def _predict_one(self, x):
# Step 1: compute distance to every training point
distances = [euclidean_distance(x, x_tr)
for x_tr in self.X_train]
# Step 2: find indices of K smallest distances
k_idx = np.argsort(distances)[:self.k]
# Step 3: get labels of K nearest neighbours
k_labels = self.y_train[k_idx]
# Step 4: return majority vote
return Counter(k_labels).most_common(1)[0][0]
# --- Test against sklearn --------------------------------------------
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X, y = iris.data, iris.target
sc = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
knn_scratch = KNNClassifier(k=5)
knn_scratch.fit(X_train_sc, y_train)
y_pred = knn_scratch.predict(X_test_sc)
acc = np.mean(y_pred == y_test)
print(f"Scratch KNN accuracy: {acc:.4f}") # matches sklearn
Weighted KNN: Closer Neighbours Vote Louder
Standard KNN gives every neighbour an equal vote regardless of distance: a neighbour at distance 0.01 has the same influence as one at distance 50. Weighted KNN fixes this by weighting each neighbour's vote in inverse proportion to its distance, so closer means more influential.
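Re-running the earlier spam vote with inverse-distance weights makes the difference concrete: the three Spam neighbours contribute 1/0.12 + 1/0.18 + 1/0.31 ≈ 17.1, while the two Ham neighbours contribute 1/0.25 + 1/0.40 = 6.5, so Spam now wins 17.1 to 6.5 rather than 3 votes to 2. In sklearn this is just weights='distance'; the grid search below tunes it alongside K and the metric.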
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Full hyperparameter search: K, weights, and metric
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_jobs=-1))
])
param_grid = {
'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 21],
'knn__weights': ['uniform', 'distance'],
'knn__metric': ['euclidean', 'manhattan'],
}
grid = GridSearchCV(
pipe, param_grid,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=0
)
grid.fit(X_train, y_train)
print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.4f}")
Search Algorithms: How sklearn Finds Neighbours Fast

The default algorithm='auto' selects the best search structure based on dataset size, dimensionality, and metric: a KD-Tree for small d, a Ball-Tree for larger d, and brute force for custom metrics. Only override this if you have a specific reason, such as benchmarking or a known constraint.
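If you do want to compare them yourself, here is a rough benchmark sketch (synthetic data, arbitrary sizes):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 8))      # low-dimensional: trees should do well
y = (X.sum(axis=1) > 0).astype(int)
X_query = rng.normal(size=(2_000, 8))

for algo in ['brute', 'kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
    t0 = time.perf_counter()
    knn.predict(X_query)
    print(f"{algo:>10}: {time.perf_counter() - t0:.3f}s")
```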
Full Pipeline: Breast Cancer Diagnosis
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target # 30 features, binary label
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 30 features: apply PCA first to fight the curse of dimensionality
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10, random_state=42)),
('knn', KNeighborsClassifier(
n_neighbors=9,
weights='distance',
metric='euclidean',
n_jobs=-1
))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
# Cross-validation score
cv_scores = cross_val_score(
pipe, X_train, y_train, cv=5, scoring='f1'
)
print(f"5-Fold CV F1: {cv_scores.mean():.4f} ยฑ {cv_scores.std():.4f}")
print(f"Test Accuracy: {pipe.score(X_test, y_test):.4f}")
print(classification_report(
y_test, y_pred,
target_names=cancer.target_names
))
# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
y_test, y_pred,
display_labels=cancer.target_names
)
plt.tight_layout()
plt.show()
When to Use KNN
KNN vs Other Classifiers
| Property | KNN | Random Forest | SVM (RBF) | Logistic Reg. |
|---|---|---|---|---|
| Training time | Instant (none) | Moderate | Slow | Fast |
| Prediction time | Slow: O(n × d) | Moderate | Moderate | Fastest |
| Memory usage | Stores all data | Stores trees | Stores support vectors | Just coefficients |
| Feature scaling needed | Mandatory | Never | Mandatory | Recommended |
| Handles non-linearity | Naturally | Excellent | Via kernels | Not without feature engineering |
| High-dimensional data | Fails (curse of dimensionality) | Moderate | Excellent | Excellent |
| Interpretability | High: show the neighbours | Feature importance | Linear only | High: coefficients |
| Online learning | Natural: just add data | Full retrain | Full retrain | Possible (SGD) |
Strengths vs Weaknesses
Strengths:

- Zero training time: fit is instantaneous
- Trivially simple to understand and explain
- Naturally handles multi-class problems
- No assumptions about the data distribution
- Non-parametric: learns any decision boundary shape
- Online learning: just append new points, no retraining
- Predictions are explainable: show the actual neighbours
- Works for both classification and regression unchanged

Weaknesses:

- Slow prediction: O(n × d) per query
- Stores the entire training set: high memory usage
- Feature scaling is mandatory: forgetting it is fatal
- Fails in high dimensions: the curse of dimensionality
- Sensitive to irrelevant features: all features vote
- Sensitive to imbalanced datasets: the majority class dominates
- No feature importance: treats all features equally
- K selection requires cross-validation: not automatic
Real-World Applications
Recommendation systems:
- User-based collaborative filtering
- "Users like you also watched..."
- Item similarity matching
- Early Netflix and Amazon engines

Healthcare:
- Find similar patient records
- Anomaly detection in vitals
- Drug response prediction
- Explainable: show real cases

Computer vision:
- MNIST digit classification (95%+ accuracy)
- Reverse image search
- Face verification systems
- Works best with PCA preprocessing
Golden Rules

1. Always scale features. Wrap StandardScaler() inside a Pipeline to prevent leakage. This is the single most impactful step in any KNN workflow.

2. Prefer weights='distance' by default. Distance-weighted voting almost always beats uniform voting at negligible computational cost. Closer neighbours are genuinely more relevant, so let them vote louder. Start with distance weights and switch to uniform only if cross-validation says so.

3. On high-dimensional data, apply PCA first. Add it inside the Pipeline: scaler → PCA → KNN. Tune n_components with cross-validation alongside K. If the problem is irrelevant features rather than sheer dimensionality, use SelectKBest to remove dead weight first.

4. Handle class imbalance explicitly. KNeighborsClassifier has no class_weight parameter, so you need workarounds: use a custom voting scheme, or preferably oversample the minority class with SMOTE before fitting. Evaluate with F1 or AUC, not accuracy.
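As a sketch of that last rule, assuming the imbalanced-learn package is installed (its Pipeline, unlike sklearn's, applies SMOTE during fit only, never at prediction time):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline      # resampling-aware Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),      # oversample the minority class
    ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance')),
])

# Evaluate with F1, not accuracy, on imbalanced data
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.4f} ± {scores.std():.4f}")
```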