The Story That Explains KNN
You do not read economic theory. You do not build a complex model. You simply ask: "What are the five nearest flats paying right now?" If four of the five nearest flats pay £2,000 per month and one pays £1,200, you estimate £1,840 (the average), and you are probably right.

Now imagine a doctor trying to decide if a new patient has diabetes. She pulls up the five most similar patients in the hospital database: same age, same BMI, similar glucose levels. Four of those five have diabetes. Her prediction: diabetes. Majority vote. No formula needed.

That is K-Nearest Neighbors in its entirety. Find the K most similar examples you have already seen. Let them vote (classification) or average (regression). Done.
Most algorithms learn a model during training and discard the raw data. KNN does the opposite: it memorises the entire training set and does all the work at prediction time. There is no training phase at all. This makes fitting instantaneous but prediction slow on large datasets, since every new query must compute distances to all stored points.
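A minimal sketch makes the asymmetry visible (synthetic data, illustrative sizes): fit barely does anything, while predict pays the full cost of the distance search.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset: 50,000 points, 20 features (sizes are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = (X[:, 0] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X, y)                 # "training" mostly just stores/indexes the data
print(f"fit:     {time.perf_counter() - t0:.4f}s")

t0 = time.perf_counter()
knn.predict(X[:1_000])        # the distance search happens here
print(f"predict: {time.perf_counter() - t0:.4f}s")
```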
How KNN Works: Four Steps

1. Compute the distance from the new point to every stored training point.
2. Sort the distances and take the K smallest.
3. Collect the labels (or target values) of those K nearest neighbours.
4. Return the majority vote (classification) or the average (regression).
Distance Metrics: How "Closeness" Is Measured
The entire logic of KNN depends on a reliable measure of similarity. Different distance metrics lead to different neighbours, and therefore to different predictions. Choosing the right metric matters.
| Metric | Best For | Sensitive to Outliers | sklearn Parameter |
|---|---|---|---|
| Euclidean | Continuous, low-dimensional, normally distributed | Yes (squares differences) | metric='euclidean' |
| Manhattan | Continuous, high-dimensional, noisy data | Less sensitive | metric='manhattan' |
| Minkowski | General: tune p as a hyperparameter | Depends on p | metric='minkowski', p=2 |
| Hamming | Categorical, binary, text features | No | metric='hamming' |
| Cosine | Text vectors, where direction matters more than magnitude | No | metric='cosine' |
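To make the table concrete, here is a small sketch computing a few of these metrics on toy vectors with scipy (cityblock is scipy's name for Manhattan distance):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, hamming

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(f"Euclidean: {euclidean(a, b):.3f}")  # sqrt(3² + 2² + 0²) ≈ 3.606
print(f"Manhattan: {cityblock(a, b):.3f}")  # |3| + |2| + |0| = 5.000
print(f"Cosine:    {cosine(a, b):.3f}")     # 1 − cos(angle between a and b)

# Hamming suits categorical/binary features: fraction of positions that differ
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 1, 0])
print(f"Hamming:   {hamming(u, v):.3f}")    # 2 of 4 positions differ → 0.500
```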
Step-by-Step Manual Calculation
Let us classify one new patient by hand using K=3 and Euclidean distance. We have five existing patients, each with two features (Age and Glucose Level) and a known diabetes label.
| Patient | Age | Glucose | Diabetes? |
|---|---|---|---|
| P1 | 25 | 80 | No |
| P2 | 35 | 150 | Yes |
| P3 | 45 | 180 | Yes |
| P4 | 20 | 70 | No |
| P5 | 50 | 200 | Yes |
| NEW | 40 | 160 | ? |
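Computing the Euclidean distance from NEW (Age 40, Glucose 160) to each patient on the raw, unscaled features:

| Patient | Calculation | Distance | Label |
|---|---|---|---|
| P1 | √(15² + 80²) = √6625 | ≈ 81.4 | No |
| P2 | √(5² + 10²) = √125 | ≈ 11.2 | Yes |
| P3 | √(5² + 20²) = √425 | ≈ 20.6 | Yes |
| P4 | √(20² + 90²) = √8500 | ≈ 92.2 | No |
| P5 | √(10² + 40²) = √1700 | ≈ 41.2 | Yes |

The K=3 nearest neighbours are P2, P3 and P5, all labelled Yes, so the new patient is classified as diabetic (3 votes to 0).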
Notice that Glucose ranges from 70–200 while Age ranges from 20–50. Glucose differences dominate the distance calculation by a factor of roughly four, so Age barely influences which neighbours are selected. In a real implementation you must scale both features to the same range before computing distances, or Glucose silently overrules everything else.
Classification vs Regression

For classification, the K nearest neighbours vote and the majority class wins. A K=5 spam filter, for example:
| Neighbour | Label | Distance |
|---|---|---|
| Neighbour 1 | Spam | 0.12 |
| Neighbour 2 | Spam | 0.18 |
| Neighbour 3 | Ham | 0.25 |
| Neighbour 4 | Spam | 0.31 |
| Neighbour 5 | Ham | 0.40 |
| Prediction | Spam (3 vs 2 votes) | |
For regression, the prediction is the mean (or distance-weighted mean) of the neighbours' target values. Predicting a house price with K=5:

| Neighbour | Price (£k) | Distance |
|---|---|---|
| Neighbour 1 | 295 | 0.08 |
| Neighbour 2 | 310 | 0.15 |
| Neighbour 3 | 280 | 0.22 |
| Neighbour 4 | 325 | 0.30 |
| Neighbour 5 | 300 | 0.38 |
| Prediction | £302,000 (mean) | |
Choosing K: The Most Critical Decision

K=1 (a single neighbour): every prediction copies the closest training point, so each noisy or mislabelled point carves out its own region. This is maximum overfitting.

K=N (all training points): every query returns the majority class of the entire dataset, completely ignoring the location of the new point. This is maximum underfitting.
The right K sits in between: small enough to respect local structure, large enough to smooth out noise. The answer is always found with cross-validation.
| K Value | Decision Boundary | Bias | Variance | Risk |
|---|---|---|---|---|
| K = 1 | Extremely jagged, erratic | Very Low | Very High | Overfitting: memorises noise |
| K = 3–5 | Jagged but smoothing | Low | Moderate-High | Good for small clean datasets |
| K = 11–21 | Smooth, well-generalised | Balanced ✓ | Moderate | Best general range for most data |
| K = 51+ | Very smooth, global | High | Low | Underfitting: too much smoothing |
| K = N | Constant: one class everywhere | Maximum | Zero | Predicts majority class always |
Feature Scaling: Mandatory, Never Optional
KNN's core operation is measuring distance. Any feature with a larger numerical range will dominate all distance calculations, effectively making all other features invisible to the model.
Before scaling, in raw units:

| Feature | Patient A | Patient B | Diff² |
|---|---|---|---|
| Age (years) | 30 | 31 | 1 |
| Income (£) | 30,000 | 70,000 | 1,600,000,000 |
| BMI | 22.5 | 35.0 | 156.25 |
| Euclidean d | ≈ 40,000.00 (entirely Income) | | |
After scaling to z-scores:

| Feature | Patient A | Patient B | Diff² |
|---|---|---|---|
| Age (z-score) | −0.85 | −0.71 | 0.0196 |
| Income (z-score) | −0.92 | 1.15 | 4.2849 |
| BMI (z-score) | −1.20 | 0.95 | 4.6225 |
| Euclidean d | ≈ 2.99 (all features contribute) | | |
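The same effect in code, as a sketch with made-up patient rows (the exact z-scores depend on the sample the scaler is fit on, so they will not match the table above exactly):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative patients: [age (years), income (£), BMI]; values are invented
X = np.array([
    [30, 30_000, 22.5],
    [31, 70_000, 35.0],
    [45, 52_000, 27.0],
    [52, 41_000, 31.5],
])

def dist(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distance between patients A and B is dominated by income
print(f"Raw distance:    {dist(X[0], X[1]):>12,.2f}")

# After z-scoring each column, every feature contributes comparably
X_sc = StandardScaler().fit_transform(X)
print(f"Scaled distance: {dist(X_sc[0], X_sc[1]):>12.2f}")
```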
The Curse of Dimensionality
In 100 dimensions, your "nearest" neighbours are almost as far away as your farthest neighbours. The concept of "near" loses all meaning: every point becomes approximately equidistant from every other point. When all distances are similar, KNN has no signal to work with and is effectively guessing randomly. This is the curse of dimensionality, and it is the primary reason KNN fails on high-dimensional data. The simulation after the table below makes the effect concrete.
| Dimensions | Points Needed (1% coverage) | Nearest vs Farthest Ratio | KNN Effectiveness |
|---|---|---|---|
| 1D | 10 points | Very different | Excellent |
| 2D | 100 points | Clearly different | Very good |
| 10D | 10¹⁰ points | Getting similar | Declining |
| 100D | 10¹⁰⁰ points | Nearly identical | Breaks down |
| 1000D+ | Impossible | All points equidistant | Random guessing |
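A quick simulation shows the concentration directly (a sketch; 1,000 uniform random points per dimensionality, sizes arbitrary). The nearest-to-farthest distance ratio climbs toward 1.0 as dimensions grow:

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1_000

print(f"{'dims':>6} {'nearest/farthest':>18}")
for d in [1, 2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(X - query, axis=1)
    # Ratio near 0: clear neighbours. Ratio near 1: all points equidistant.
    print(f"{d:>6} {dists.min() / dists.max():>18.3f}")
```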
When features exceed ~20–30, apply dimensionality reduction before KNN. PCA retains the directions of maximum variance and is the fastest option. UMAP or t-SNE preserve local neighbourhood structure better, which makes them well suited to KNN preprocessing. As a rule of thumb, reduce to √(n_features) dimensions, or use cross-validation to find the optimal number of components.
Python Implementation
KNN Classification: Iris Dataset
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# CRITICAL: always put StandardScaler inside the Pipeline;
# this prevents data leakage from the test set into the scaling step
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(
        n_neighbors=5,        # K: tune with cross-validation
metric='euclidean', # distance function
weights='uniform', # all neighbours vote equally
algorithm='auto', # auto-selects ball_tree/kd_tree/brute
n_jobs=-1 # use all CPU cores for search
))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f"Accuracy: {pipe.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Predict with class probabilities (fraction of K neighbours per class)
y_prob = pipe.predict_proba(X_test)
for i in range(3):
cls = iris.target_names[y_pred[i]]
conf = y_prob[i].max() * 100
print(f" Sample {i+1}: {cls:12s} ({conf:.0f}% of neighbours agree)")
Finding Optimal K with Cross-Validation
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
k_range = range(1, 32, 2) # odd values 1, 3, 5 ... 31
k_scores = []
for k in k_range:
pipe_k = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=k, n_jobs=-1))
])
scores = cross_val_score(
pipe_k, X_train, y_train,
cv=5, scoring='accuracy'
)
k_scores.append((k, scores.mean(), scores.std()))
# Print table of K vs CV accuracy
print(f"{'K':>4} {'CV Accuracy':>12} {'Std Dev':>10}")
print("-" * 30)
for k, mean, std in k_scores:
    marker = " ← BEST" if mean == max(s[1] for s in k_scores) else ""
    print(f"{k:>4} {mean:>12.4f}  ±{std:.4f}{marker}")
best_k = max(k_scores, key=lambda x: x[1])[0]
print(f"\nOptimal K = {best_k}")
KNN Regression: House Price Prediction
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Distance-weighted KNN regression: closer neighbours count more
knn_reg = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsRegressor(
n_neighbors=10,
weights='distance', # closer neighbours contribute more
metric='euclidean',
n_jobs=-1
))
])
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)
print(f"Rยฒ Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE: ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
Implementing KNN from Scratch
import numpy as np
from collections import Counter
def euclidean_distance(x1, x2):
"""Straight-line distance between two points."""
return np.sqrt(np.sum((x1 - x2) ** 2))
class KNNClassifier:
def __init__(self, k=5):
self.k = k
def fit(self, X, y):
"""No training โ just memorise the dataset."""
self.X_train = X
self.y_train = y
return self
def predict(self, X):
"""Predict class for each point in X."""
return np.array([self._predict_one(x) for x in X])
def _predict_one(self, x):
# Step 1: compute distance to every training point
distances = [euclidean_distance(x, x_tr)
for x_tr in self.X_train]
# Step 2: find indices of K smallest distances
k_idx = np.argsort(distances)[:self.k]
# Step 3: get labels of K nearest neighbours
k_labels = self.y_train[k_idx]
# Step 4: return majority vote
return Counter(k_labels).most_common(1)[0][0]
# --- Test against sklearn --------------------------------------------
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X, y = iris.data, iris.target
sc = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
knn_scratch = KNNClassifier(k=5)
knn_scratch.fit(X_train_sc, y_train)
y_pred = knn_scratch.predict(X_test_sc)
acc = np.mean(y_pred == y_test)
print(f"Scratch KNN accuracy: {acc:.4f}") # matches sklearn
Weighted KNN: Closer Neighbours Vote Louder
Standard KNN gives every neighbour an equal vote regardless of distance: a neighbour at distance 0.01 has the same influence as one at distance 50. Weighted KNN fixes this by weighting each neighbour's vote in inverse proportion to its distance, so closer means more influential.
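Re-running the earlier spam vote with inverse-distance weights makes the difference concrete: the three Spam neighbours contribute 1/0.12 + 1/0.18 + 1/0.31 ≈ 17.1, while the two Ham neighbours contribute 1/0.25 + 1/0.40 = 6.5, so Spam now wins 17.1 to 6.5 rather than 3 votes to 2. In sklearn this is just weights='distance'; the grid search below tunes it alongside K and the metric.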
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Full hyperparameter search: K, weights, and metric
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_jobs=-1))
])
param_grid = {
'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 21],
'knn__weights': ['uniform', 'distance'],
'knn__metric': ['euclidean', 'manhattan'],
}
grid = GridSearchCV(
pipe, param_grid,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=0
)
grid.fit(X_train, y_train)
print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.4f}")
Search Algorithms: How sklearn Finds Neighbours Fast

The default algorithm='auto' selects the best search structure based on dataset size, dimensionality, and metric: a KD-Tree for small d, a Ball-Tree for larger d, and brute force for custom metrics. Only override this if you have a specific reason, such as benchmarking or a known constraint.
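If you do want to compare them yourself, here is a rough benchmark sketch (synthetic data, arbitrary sizes):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 8))      # low-dimensional: trees should do well
y = (X.sum(axis=1) > 0).astype(int)
X_query = rng.normal(size=(2_000, 8))

for algo in ['brute', 'kd_tree', 'ball_tree']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
    t0 = time.perf_counter()
    knn.predict(X_query)
    print(f"{algo:>10}: {time.perf_counter() - t0:.3f}s")
```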
Full Pipeline: Breast Cancer Diagnosis
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target # 30 features, binary label
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 30 features: apply PCA first to fight the curse of dimensionality
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10, random_state=42)),
('knn', KNeighborsClassifier(
n_neighbors=9,
weights='distance',
metric='euclidean',
n_jobs=-1
))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
# Cross-validation score
cv_scores = cross_val_score(
pipe, X_train, y_train, cv=5, scoring='f1'
)
print(f"5-Fold CV F1: {cv_scores.mean():.4f} ยฑ {cv_scores.std():.4f}")
print(f"Test Accuracy: {pipe.score(X_test, y_test):.4f}")
print(classification_report(
y_test, y_pred,
target_names=cancer.target_names
))
# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
y_test, y_pred,
display_labels=cancer.target_names
)
plt.tight_layout()
plt.show()
When to Use KNN
KNN vs Other Classifiers
| Property | KNN | Random Forest | SVM (RBF) | Logistic Reg. |
|---|---|---|---|---|
| Training time | Instant (none) | Moderate | Slow | Fast |
| Prediction time | Slow: O(n × d) | Moderate | Moderate | Fastest |
| Memory usage | Stores all data | Stores trees | Stores support vectors | Just coefficients |
| Feature scaling needed | Mandatory | Never | Mandatory | Recommended |
| Handles non-linearity | Naturally | Excellent | Via kernels | Not without feature engineering |
| High-dimensional data | Fails (curse of dimensionality) | Moderate | Excellent | Excellent |
| Interpretability | High: show the neighbours | Feature importance | Linear only | High: coefficients |
| Online learning | Natural: just add data | Full retrain | Full retrain | Possible (SGD) |
Strengths vs Weaknesses
Strengths:

- Zero training time: fit is instantaneous
- Trivially simple to understand and explain
- Naturally handles multi-class problems
- No assumptions about the data distribution
- Non-parametric: learns any decision boundary shape
- Online learning: just append new points, no retraining
- Predictions are explainable: show the actual neighbours
- Works for both classification and regression unchanged

Weaknesses:

- Slow prediction: O(n × d) per query
- Stores the entire training set: high memory usage
- Feature scaling is mandatory: forgetting it is fatal
- Fails in high dimensions: the curse of dimensionality
- Sensitive to irrelevant features: all features vote
- Sensitive to imbalanced datasets: the majority class dominates
- No feature importance: treats all features equally
- K selection requires cross-validation: not automatic
Real-World Applications
Recommendation systems:
- User-based collaborative filtering
- "Users like you also watched..."
- Item similarity matching
- Early Netflix and Amazon engines

Healthcare:
- Find similar patient records
- Anomaly detection in vitals
- Drug response prediction
- Explainable: show real cases

Computer vision:
- MNIST digit classification (95%+ accuracy)
- Reverse image search
- Face verification systems
- Works best with PCA preprocessing
Golden Rules

1. Always scale features. Wrap StandardScaler() inside a Pipeline to prevent leakage. This is the single most impactful step in any KNN workflow.

2. Prefer weights='distance' by default. Distance-weighted voting almost always beats uniform voting at negligible computational cost. Closer neighbours are genuinely more relevant, so let them vote louder. Start with distance weights and switch to uniform only if cross-validation says so.

3. On high-dimensional data, apply PCA first. Add it inside the Pipeline: scaler → PCA → KNN. Tune n_components with cross-validation alongside K. If the problem is irrelevant features rather than sheer dimensionality, use SelectKBest to remove dead weight first.

4. Handle class imbalance explicitly. KNeighborsClassifier has no class_weight parameter, so you need workarounds: use a custom voting scheme, or preferably oversample the minority class with SMOTE before fitting. Evaluate with F1 or AUC, not accuracy.
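As a sketch of that last rule, assuming the imbalanced-learn package is installed (its Pipeline, unlike sklearn's, applies SMOTE during fit only, never at prediction time):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline      # resampling-aware Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),      # oversample the minority class
    ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance')),
])

# Evaluate with F1, not accuracy, on imbalanced data
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.4f} ± {scores.std():.4f}")
```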