
K-Nearest Neighbors

Learn how KNN classifies new data by finding the K most similar training examples, explore Euclidean and Manhattan distance, the critical importance of feature scaling, how to choose K, and why this elegant lazy learner still powers real production systems.

Section 01

The Story That Explains KNN

You Are Who Your Neighbours Are
You move to a new city and find a flat at an unfamiliar address. You have no idea whether the neighbourhood is expensive or affordable. How do you estimate the rent for your flat?

You do not read economic theory. You do not build a complex model. You simply ask: "What are the five nearest flats paying right now?" If four of the five nearest flats pay £2,000 per month and one pays £1,200, you estimate £1,840 (the average), and you are probably right.

Now imagine a doctor trying to decide if a new patient has diabetes. She pulls up the five most similar patients in the hospital database: same age, same BMI, similar glucose levels. Four of those five have diabetes. Her prediction: diabetes. Majority vote. No formula needed.

That is K-Nearest Neighbors in its entirety. Find the K most similar examples you have already seen. Let them vote (classification) or average out (regression). Done.
🦥
KNN Is a "Lazy Learner": No Training Phase

Most algorithms learn a model during training and discard the raw data. KNN does the opposite: it memorises the entire training set and does all the work at prediction time. There is no training phase at all. This makes fitting instantaneous but prediction slow on large datasets, because every new query must compute distances to all stored points.


Section 02

How KNN Works in Four Steps

01
Choose K and a Distance Metric
Decide how many neighbours to consult (K) and how to measure "closeness": Euclidean, Manhattan, or another distance function. These are the only two decisions made before any data is seen.
02
Compute Distance to Every Training Point
For the new unseen data point x_new, calculate its distance to every single point in the training set. With 10,000 training points, that is 10,000 distance calculations on every prediction, every time.
03
Select the K Nearest Points
Sort all distances and take the K smallest. These are the K nearest neighbours: the training examples most similar to the new point. All others are ignored completely.
04
Aggregate the K Labels
Classification → majority class wins (plurality vote). Regression → mean (or weighted mean) of the K labels. Output the result. No model update, no retraining: the dataset is the model.

Section 03

Distance Metrics: How "Closeness" Is Measured

The entire logic of KNN depends on a reliable measure of similarity. Different distance metrics lead to different neighbours, and therefore different predictions. Choosing the right metric matters.

Euclidean Distance (p=2)
d = √Σ(xᵢ − yᵢ)²
Straight-line "as the crow flies" distance. The default and most common choice. Sensitive to outliers and scale. Best when features are continuous and normally distributed.
Manhattan Distance (p=1)
d = Σ|xᵢ − yᵢ|
Sum of absolute differences along each axis, like navigating a city grid. Less sensitive to outliers than Euclidean. Better when features have very different scales or when outliers are present.
Minkowski Distance (General)
d = (Σ|xᵢ − yᵢ|ᵖ)^(1/p)
The generalised form. p=1 → Manhattan. p=2 → Euclidean. p=∞ → Chebyshev (max single-dimension difference). Tune p as a hyperparameter.
Hamming Distance
d = (1/n) × #{i : xᵢ ≠ yᵢ}
Fraction of positions where two vectors differ. Used for categorical or binary features: text, DNA sequences, one-hot encoded data. Not suitable for continuous features.
Metric | Best For | Sensitive to Outliers | sklearn Parameter
Euclidean | Continuous, low-dimensional, normally distributed | Yes (squares differences) | metric='euclidean'
Manhattan | Continuous, high-dimensional, noisy data | Less sensitive | metric='manhattan'
Minkowski | General (tune p as a hyperparameter) | Depends on p | metric='minkowski', p=2
Hamming | Categorical, binary, text features | No | metric='hamming'
Cosine | Text vectors (direction matters more than magnitude) | No | metric='cosine'
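If you want to sanity-check these formulas, the short sketch below evaluates each metric with scipy.spatial.distance on made-up vectors (u, v, a, b are illustrative only); the function names map one-to-one onto the table above, with cityblock being Manhattan.

import numpy as np
from scipy.spatial import distance

# Two illustrative continuous feature vectors (made-up values)
u = np.array([40.0, 160.0, 24.5])
v = np.array([35.0, 150.0, 27.0])

print(f"Euclidean (p=2): {distance.euclidean(u, v):.3f}")   # √Σ(uᵢ−vᵢ)²
print(f"Manhattan (p=1): {distance.cityblock(u, v):.3f}")   # Σ|uᵢ−vᵢ|
print(f"Minkowski (p=3): {distance.minkowski(u, v, p=3):.3f}")
print(f"Cosine:          {distance.cosine(u, v):.3f}")      # 1 − cos(angle)

# Hamming is for categorical/binary vectors: fraction of mismatched positions
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
print(f"Hamming:         {distance.hamming(a, b):.2f}")      # 2/5 = 0.40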

Section 04

Step-by-Step Manual Calculation

Let us classify one new patient by hand using K=3 and Euclidean distance. We have five existing patients with two features: Age and Glucose Level, and a known diabetes label.

Patient | Age | Glucose | Diabetes?
P1 | 25 | 80 | No
P2 | 35 | 150 | Yes
P3 | 45 | 180 | Yes
P4 | 20 | 70 | No
P5 | 50 | 200 | Yes
NEW | 40 | 160 | ?
🧮 Step 1: Compute Euclidean Distance to Each Training Point
d(NEW, P1)
√((40−25)² + (160−80)²) = √(225 + 6400) = √6625 = 81.39
d(NEW, P2)
√((40−35)² + (160−150)²) = √(25 + 100) = √125 = 11.18
d(NEW, P3)
√((40−45)² + (160−180)²) = √(25 + 400) = √425 = 20.62
d(NEW, P4)
√((40−20)² + (160−70)²) = √(400 + 8100) = √8500 = 92.20
d(NEW, P5)
√((40−50)² + (160−200)²) = √(100 + 1600) = √1700 = 41.23
🧮 Step 2: Sort Distances and Select K=3 Nearest
1st
P2, distance 11.18, Label: Yes
2nd
P3, distance 20.62, Label: Yes
3rd
P5, distance 41.23, Label: Yes
Ignored
P1 (81.39) and P4 (92.20): too far, excluded from the vote
🗳️ Step 3: Majority Vote Among K=3 Neighbours
Tally
Yes: 3 votes (P2, P3, P5)  |  No: 0 votes
Result
Unanimous verdict → Predicted: Diabetes = YES ✅
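The hand calculation can be verified in a few lines of NumPy; this sketch uses the patient table exactly as given (unscaled).

import numpy as np

# Training points: [Age, Glucose] and the new patient from the table above
X_train = np.array([[25, 80], [35, 150], [45, 180], [20, 70], [50, 200]])
y_train = np.array(['No', 'Yes', 'Yes', 'No', 'Yes'])
x_new   = np.array([40, 160])

# Step 1: Euclidean distance to every training point
dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
print(np.round(dists, 2))          # [81.39 11.18 20.62 92.2  41.23]

# Steps 2-3: take the K=3 smallest and vote
k_idx = np.argsort(dists)[:3]      # indices of P2, P3, P5
print(y_train[k_idx])              # ['Yes' 'Yes' 'Yes'] -> predict Yes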
⚠️
Why This Example Would Fail Without Scaling

Notice that Glucose ranges from 70–200 while Age ranges from 20–50, so Glucose differences are roughly four times larger than Age differences even before squaring widens the gap. Age barely influences which neighbours are selected. In a real implementation you must scale both features to the same range before computing distances, or Glucose silently overrules everything else.


Section 05

Classification vs Regression

๐Ÿ—ณ๏ธ KNN Classification (Majority Vote)
NeighbourLabelDistance
Neighbour 1Spam0.12
Neighbour 2Spam0.18
Neighbour 3Ham0.25
Neighbour 4Spam0.31
Neighbour 5Ham0.40
PredictionSpam (3 vs 2 votes)
๐Ÿ“Š KNN Regression (Average Value)
NeighbourPrice (ยฃk)Distance
Neighbour 12950.08
Neighbour 23100.15
Neighbour 32800.22
Neighbour 43250.30
Neighbour 53000.38
Predictionยฃ302,000 (mean)
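Both aggregation rules are one line each; this small sketch recomputes the two predictions directly from the neighbour tables above.

import numpy as np
from collections import Counter

clf_labels = ['Spam', 'Spam', 'Ham', 'Spam', 'Ham']     # classification table
reg_values = np.array([295, 310, 280, 325, 300])        # regression table, £k

print(Counter(clf_labels).most_common(1)[0][0])   # 'Spam' - 3 votes vs 2
print(reg_values.mean())                          # 302.0 -> £302,000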

Section 06

Choosing K: The Most Critical Decision

K=1 Memorises. K=N Ignores Everything.
K=1: Ask only the single nearest neighbour. If that one training point is an outlier or mislabelled, you get the wrong answer. The model memorises every quirk of the training data: perfect training accuracy, catastrophic test accuracy. This is maximum overfitting.

K=N (all training points): Every query returns the majority class of the entire dataset, completely ignoring the location of the new point. This is maximum underfitting.

The right K sits in between: small enough to respect local structure, large enough to smooth out noise. The answer is always found with cross-validation.
K Value | Decision Boundary | Bias | Variance | Risk
K = 1 | Extremely jagged, erratic | Very Low | Very High | Overfitting (memorises noise)
K = 3–5 | Jagged but smoothing | Low | Moderate-High | Good for small clean datasets
K = 11–21 | Smooth, well-generalised | Balanced ✅ | Moderate | Best general range for most data
K = 51+ | Very smooth, global | High | Low | Underfitting (too much smoothing)
K = N | Constant (one class everywhere) | Maximum | Zero | Predicts majority class always
🔢 Practical Rules for Choosing K
①
Always use cross-validation: try all odd values of K from 1 to √n (where n = training set size). Plot CV accuracy vs K and choose the elbow point where accuracy stabilises.
②
Prefer odd K for binary classification: it eliminates tie votes between the two classes. For multi-class problems ties can still occur; distance-weighted voting (Section 10) makes them far less likely.
③
Start with K = √n as a baseline: e.g. 1,000 training samples → K ≈ 31. This heuristic lands in a reasonable range before fine-tuning with cross-validation (see the sketch after this list).
④
Noisy or overlapping data → larger K to smooth out label noise. Clean, well-separated data → smaller K to preserve local structure.
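Rules ① and ③ amount to a couple of lines; a minimal sketch for building the candidate grid of K values (the full cross-validation loop appears in Section 09):

import numpy as np

n_train = 1_000                          # size of the training set
k_max   = int(np.sqrt(n_train))          # √n baseline ≈ 31
k_grid  = list(range(1, k_max + 1, 2))   # odd values only: 1, 3, 5, ..., 31

print(f"Baseline K ≈ {k_max}")
print(f"Candidate K values for cross-validation: {k_grid}")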

Section 07

Feature Scaling: Mandatory, Never Optional

KNN's core operation is measuring distance. Any feature with a larger numerical range will dominate all distance calculations, effectively making all other features invisible to the model.

โŒ Without Scaling โ€” Distance is Meaningless
FeaturePatient APatient BDiffยฒ
Age (years)30311
Income (ยฃ)30,00070,0001,600,000,000
BMI22.535.0156.25
Euclidean d40,000.00 โ€” entirely Income
โœ… With StandardScaler โ€” Equal Contribution
FeaturePatient APatient BDiffยฒ
Age (z-score)โˆ’0.85โˆ’0.710.0196
Income (z-score)โˆ’0.921.154.2849
BMI (z-score)โˆ’1.200.954.6225
Euclidean d2.94 โ€” all features contribute
📏
StandardScaler
z = (x − μ) / σ
Centres each feature to mean 0 and standard deviation 1. Best when features are roughly normally distributed. The default and most common choice for KNN. Sensitive to outliers: if outliers exist, use RobustScaler.
✅ Best general choice; use by default
📏
MinMaxScaler
x' = (x − min) / (max − min)
Scales all features to the [0, 1] range. Preserves the original distribution shape. Useful when you need bounded outputs (e.g. image pixel values 0–255 → 0–1). Extremely sensitive to outliers: one extreme value compresses all others.
✅ Good for bounded, uniform distributions
🛡️
RobustScaler
x' = (x − Q2) / (Q3 − Q1)
Scales using the median and interquartile range instead of the mean and standard deviation. Outliers do not influence the scaling at all. Best when your data contains many outliers or the distribution is heavily skewed. Use this for real-world noisy datasets.
✅ Best when outliers are present
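To see the three scalers side by side, here is a small sketch on a made-up income column containing one outlier; note how the single extreme value squashes the MinMaxScaler output while RobustScaler barely notices it.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Made-up income column (one extreme outlier in the last row)
income = np.array([[30_000], [35_000], [42_000], [38_000], [250_000]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(income).ravel().round(2)
    print(f"{scaler.__class__.__name__:>14}: {scaled}")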

Section 08

The Curse of Dimensionality

The Neighbourhood That Keeps Expanding
Imagine you want to find your 5 nearest neighbours in a 1D street (one feature). Your neighbours are very close, perhaps 10 metres away. Now add a second dimension (a 2D city block). Your 5 nearest neighbours might be 50 metres away. Add a third dimension (a 3D building). Now they might be 200 metres away.

In 100 dimensions, your "nearest" neighbours are almost as far away as your farthest neighbours. The concept of "near" loses all meaning. Every point becomes approximately equidistant from every other point. When all distances are similar, KNN has no signal to work with; it is guessing randomly. This is the curse of dimensionality, and it is the primary reason KNN fails on high-dimensional data.
Dimensions | Points Needed (1% coverage) | Nearest vs Farthest Ratio | KNN Effectiveness
1D | 10 points | Very different | Excellent
2D | 100 points | Clearly different | Very good
10D | 10¹⁰ points | Getting similar | Declining
100D | 10¹⁰⁰ points | Nearly identical | Breaks down
1000D+ | Impossible | All points equidistant | Random guessing
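You can reproduce the "nearest vs farthest" column empirically. The sketch below draws 1,000 random points in d dimensions and prints the ratio of the nearest to the farthest distance from a random query point; exact numbers vary with the seed, but the drift toward 1.0 as d grows is always there.

import numpy as np

rng = np.random.default_rng(42)

for d in (1, 2, 10, 100, 1000):
    points = rng.random((1000, d))                    # uniform random cloud
    query  = rng.random(d)
    dists  = np.linalg.norm(points - query, axis=1)   # Euclidean distances
    print(f"{d:>5}D  nearest/farthest = {dists.min() / dists.max():.3f}")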
🔧
Fighting the Curse: Dimensionality Reduction Before KNN

When features exceed ~20–30, apply dimensionality reduction before KNN. PCA retains the directions of maximum variance and is the fastest option. UMAP or t-SNE preserve local neighbourhood structure better, which suits KNN preprocessing. As a rule of thumb, reduce to √(n_features) dimensions, or use cross-validation to find the optimal number of components.


Section 09

Python Implementation

KNN Classification: Iris Dataset

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# โš ๏ธ CRITICAL: Always put StandardScaler inside Pipeline
# This prevents data leakage from the test set into scaling
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(
        n_neighbors=5,          # K - tune with cross-validation
        metric='euclidean',     # distance function
        weights='uniform',      # all neighbours vote equally
        algorithm='auto',       # auto-selects ball_tree/kd_tree/brute
        n_jobs=-1               # use all CPU cores for search
    ))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(f"Accuracy: {pipe.score(X_test, y_test):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Predict with class probabilities (fraction of K neighbours per class)
y_prob = pipe.predict_proba(X_test)
for i in range(3):
    cls   = iris.target_names[y_pred[i]]
    conf  = y_prob[i].max() * 100
    print(f"  Sample {i+1}: {cls:12s} ({conf:.0f}% of neighbours agree)")
OUTPUT
Accuracy: 0.9667
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30

  Sample 1: setosa       (100% of neighbours agree)
  Sample 2: versicolor   ( 80% of neighbours agree)
  Sample 3: virginica    (100% of neighbours agree)

Finding Optimal K with Cross-Validation

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

k_range  = range(1, 32, 2)   # odd values 1, 3, 5 ... 31
k_scores = []

for k in k_range:
    pipe_k = Pipeline([
        ('scaler', StandardScaler()),
        ('knn',    KNeighborsClassifier(n_neighbors=k, n_jobs=-1))
    ])
    scores = cross_val_score(
        pipe_k, X_train, y_train,
        cv=5, scoring='accuracy'
    )
    k_scores.append((k, scores.mean(), scores.std()))

# Print table of K vs CV accuracy
print(f"{'K':>4} {'CV Accuracy':>12} {'Std Dev':>10}")
print("-" * 30)
for k, mean, std in k_scores:
    marker = " ← BEST" if mean == max(s[1] for s in k_scores) else ""
    print(f"{k:>4}   {mean:.4f}       ยฑ{std:.4f}{marker}")

best_k = max(k_scores, key=lambda x: x[1])[0]
print(f"\nOptimal K = {best_k}")
OUTPUT
   K  CV Accuracy    Std Dev
------------------------------
   1   0.9167       ±0.0408
   3   0.9333       ±0.0333
   5   0.9500       ±0.0333
   7   0.9583       ±0.0306
   9   0.9583       ±0.0306
  11   0.9667       ±0.0272 ← BEST
  13   0.9583       ±0.0306
  15   0.9500       ±0.0204
 ...
  31   0.9083       ±0.0408

Optimal K = 11

KNN Regression: House Price Prediction

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error

housing = fetch_california_housing()
X, y    = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Distance-weighted KNN regression - closer neighbours count for more
knn_reg = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(
        n_neighbors=10,
        weights='distance',   # closer neighbours contribute more
        metric='euclidean',
        n_jobs=-1
    ))
])

knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)

print(f"Rยฒ Score: {r2_score(y_test, y_pred):.4f}")
print(f"MAE:      ${mean_absolute_error(y_test, y_pred)*100_000:.0f}")
OUTPUT
Rยฒ Score: 0.7318 MAE: $39,204

Implementing KNN from Scratch

import numpy as np
from collections import Counter

def euclidean_distance(x1, x2):
    """Straight-line distance between two points."""
    return np.sqrt(np.sum((x1 - x2) ** 2))

class KNNClassifier:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        """No training โ€” just memorise the dataset."""
        self.X_train = X
        self.y_train = y
        return self

    def predict(self, X):
        """Predict class for each point in X."""
        return np.array([self._predict_one(x) for x in X])

    def _predict_one(self, x):
        # Step 1: compute distance to every training point
        distances = [euclidean_distance(x, x_tr)
                     for x_tr in self.X_train]
        # Step 2: find indices of K smallest distances
        k_idx = np.argsort(distances)[:self.k]
        # Step 3: get labels of K nearest neighbours
        k_labels = self.y_train[k_idx]
        # Step 4: return majority vote
        return Counter(k_labels).most_common(1)[0][0]

# ── Test against sklearn ──────────────────────────────────────────
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
sc   = StandardScaler()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train_sc = sc.fit_transform(X_train)
X_test_sc  = sc.transform(X_test)

knn_scratch = KNNClassifier(k=5)
knn_scratch.fit(X_train_sc, y_train)
y_pred = knn_scratch.predict(X_test_sc)

acc = np.mean(y_pred == y_test)
print(f"Scratch KNN accuracy: {acc:.4f}")  # matches sklearn
OUTPUT
Scratch KNN accuracy: 0.9667

Section 10

Weighted KNN: Closer Neighbours Vote Louder

Standard KNN gives every neighbour an equal vote regardless of distance. A neighbour at distance 0.01 has the same influence as one at distance 50. Weighted KNN fixes this by weighting each neighbour's vote in inverse proportion to its distance: closer means more influential.

Uniform Weights (Default)
weight(i) = 1 / K
Every neighbour contributes equally to the vote. Simple and fast. Can be noisy when a far neighbour is accidentally included among the K nearest.
Distance Weights
weight(i) = 1 / distance(i)
Neighbours closer to the query point contribute more. Smoother predictions. Dramatically helps when the K-th neighbour is much further than the first. Use weights='distance' in sklearn.
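Reusing the spam example from Section 05, here is a quick sketch of how the distance-weighted tally works (weights = 1/distance, summed per class):

import numpy as np

labels = np.array(['Spam', 'Spam', 'Ham', 'Spam', 'Ham'])   # K=5 neighbours
dists  = np.array([0.12, 0.18, 0.25, 0.31, 0.40])

weights = 1.0 / dists                    # closer neighbour -> larger weight
for cls in ('Spam', 'Ham'):
    print(f"{cls}: weighted votes = {weights[labels == cls].sum():.2f}")
# Spam ≈ 17.1 vs Ham = 6.5 -> Spam wins even more decisively than 3 vs 2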
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Full hyperparameter search - K, weights, metric
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn',    KNeighborsClassifier(n_jobs=-1))
])

param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 21],
    'knn__weights':     ['uniform', 'distance'],
    'knn__metric':      ['euclidean', 'manhattan'],
}

grid = GridSearchCV(
    pipe, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=0
)
grid.fit(X_train, y_train)

print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy:    {grid.score(X_test, y_test):.4f}")
OUTPUT
Best parameters:  {'knn__metric': 'euclidean', 'knn__n_neighbors': 11, 'knn__weights': 'distance'}
Best CV accuracy: 0.9750
Test accuracy:    0.9667

Section 11

Search Algorithms: How sklearn Finds Neighbours Fast

🔍
Brute Force
algorithm='brute'
Computes distance to every single training point. O(n × d) per query: slow for large n but works with any distance metric. Best choice when n is small (< 1,000) or when using non-standard metrics like cosine or hamming.
✅ Works with any metric, best for small data
❌ O(n) per query, slow on large datasets
🌲
KD-Tree
algorithm='kd_tree'
Partitions the feature space into a binary tree of hyperrectangles. Queries run in O(log n) average time, much faster than brute force. Works well in low dimensions (d ≤ 20). Degrades toward brute force as dimensions increase.
✅ Fast in low dimensions, O(log n) queries
❌ Breaks down in high dimensions (> 20)
🎯
Ball-Tree
algorithm='ball_tree'
Partitions data into nested hyperspheres instead of hyperrectangles. More robust to high-dimensional data than KD-Tree. Better suited for non-Euclidean metrics. Use when d > 20 or when using Manhattan/Minkowski with p ≠ 2.
✅ Better in higher dimensions than KD-Tree
❌ Slower to build than KD-Tree
⚡
Use algorithm='auto' and Let sklearn Decide

The default algorithm='auto' selects the best algorithm based on dataset size, dimensionality, and metric. It chooses KD-Tree for small d, Ball-Tree for larger d, and Brute Force for custom metrics. Only override this if you have a specific reason, such as benchmarking or a known constraint.
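If you want to see the trade-offs yourself, here is a rough benchmark sketch using sklearn's NearestNeighbors on made-up data; absolute timings depend entirely on your machine, only the relative ordering is interesting.

import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X       = rng.random((20_000, 8))    # 20k points in 8 dimensions
queries = rng.random((1_000, 8))

for algo in ('brute', 'kd_tree', 'ball_tree'):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    start = time.perf_counter()
    nn.kneighbors(queries)
    print(f"{algo:>9}: {time.perf_counter() - start:.3f}s for 1,000 queries")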


Section 12

Full Pipeline: Breast Cancer Diagnosis

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X, y   = cancer.data, cancer.target   # 30 features, binary label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 30 features → apply PCA first to fight the curse of dimensionality
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca',    PCA(n_components=10, random_state=42)),
    ('knn',    KNeighborsClassifier(
        n_neighbors=9,
        weights='distance',
        metric='euclidean',
        n_jobs=-1
    ))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Cross-validation score
cv_scores = cross_val_score(
    pipe, X_train, y_train, cv=5, scoring='f1'
)
print(f"5-Fold CV F1:  {cv_scores.mean():.4f} ยฑ {cv_scores.std():.4f}")
print(f"Test Accuracy: {pipe.score(X_test, y_test):.4f}")

print(classification_report(
    y_test, y_pred,
    target_names=cancer.target_names
))

# Confusion matrix
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=cancer.target_names
)
plt.tight_layout()
plt.show()
OUTPUT
5-Fold CV F1:  0.9735 ± 0.0143
Test Accuracy: 0.9737
              precision    recall  f1-score   support

   malignant       0.95      0.97      0.96        42
      benign       0.98      0.97      0.98        72

    accuracy                           0.97       114

Section 13

When to Use KNN

✅
Low-Dimensional Data
KNN performs best when there are fewer than 20–30 features. In low dimensions, distance is meaningful and neighbours are genuinely similar. Classification accuracy is often competitive with more complex models.
d < 20 features
✅
Non-Linear Decision Boundaries
KNN naturally creates arbitrarily complex, non-linear decision boundaries with no kernel trick or feature engineering required. The boundary follows the data directly; K only controls how smooth it is.
complex boundaries · irregular shapes
✅
Small Datasets
With few training examples, KNN can outperform parametric models that need more data to estimate parameters reliably. The entire training set is used for every prediction; nothing is discarded.
under 10k samples
✅
Quick Baseline
KNN requires zero assumptions about data distribution. Instant setup, no model to train. Use it as a strong, no-effort baseline before investing in complex model development.
rapid prototyping · baseline
❌
Large Datasets
Every prediction requires searching all training points. With 1 million samples, each query computes 1 million distances. Inference becomes unbearably slow. Use Approximate Nearest Neighbour (ANN) libraries like FAISS instead.
use FAISS / Annoy instead
❌
High-Dimensional Data
Text (TF-IDF), images (raw pixels), genomics: all suffer from the curse of dimensionality. All distances become similar. KNN loses all discriminative power. Use cosine similarity plus dimensionality reduction, or switch to SVM/neural nets.
apply PCA first or use SVM

Section 14

KNN vs Other Classifiers

Property | KNN | Random Forest | SVM (RBF) | Logistic Reg.
Training time | Instant (none) | Moderate | Slow | Fast
Prediction time | Slow, O(n × d) | Moderate | Moderate | Fastest
Memory usage | Stores all data | Stores trees | Stores support vectors | Just coefficients
Feature scaling needed | Mandatory | Never | Mandatory | Recommended
Handles non-linearity | Naturally | Excellent | Via kernels | Cannot
High-dimensional data | Fails (curse of dimensionality) | Moderate | Excellent | Excellent
Interpretability | High (show neighbours) | Feature importance | Linear only | High (coefficients)
Online learning | Natural (just add data) | Full retrain | Full retrain | Possible (SGD)

Section 15

Strengths vs Weaknesses

✅ Strengths
Zero training time: fit is instantaneous
Trivially simple to understand and explain
Naturally handles multi-class problems
No assumptions about data distribution
Non-parametric: learns any decision boundary shape
Online learning: just append new points, no retraining
Predictions are explainable: show the actual neighbours
Works for both classification and regression unchanged
❌ Weaknesses
Slow prediction: O(n × d) per query
Stores entire training set: high memory usage
Feature scaling is mandatory: forgetting it is fatal
Fails in high dimensions: curse of dimensionality
Sensitive to irrelevant features: all features vote
Sensitive to imbalanced datasets: majority class dominates
No feature importance: treats all features equally
K selection requires cross-validation: not automatic

Section 16

Real-World Applications

Recommendation Systems
🎬
  • User-based collaborative filtering (sketch after these cards)
  • "Users like you also watched..."
  • Item similarity matching
  • Early Netflix and Amazon engines
Medical Diagnosis
🏥
  • Find similar patient records
  • Anomaly detection in vitals
  • Drug response prediction
  • Explainable: show real cases
Image Recognition
🖼️
  • MNIST digit classification (95%+)
  • Reverse image search
  • Face verification systems
  • With PCA preprocessing
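The collaborative-filtering idea in the first card is simply KNN with cosine distance on a user-item matrix; here is a toy sketch with made-up ratings (users as rows, films as columns).

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up ratings matrix: rows = users, columns = films, 0 = not seen
ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [0, 0, 4, 5, 1],
])

nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(ratings)
_, idx = nn.kneighbors(ratings[0:1])     # neighbours of user 0
neighbour = idx[0][1]                    # idx[0][0] is user 0 itself
print(f"Most similar user to user 0: user {neighbour}")
# Recommend films this neighbour rated highly but user 0 has not seen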

Section 17

Golden Rules

๐Ÿ“ K-Nearest Neighbors โ€” Non-Negotiable Rules
1
Always scale features โ€” without exception, before any distance is computed. KNN is entirely distance-based. An unscaled feature with a larger numeric range will silently dominate every prediction. Wrap StandardScaler() inside a Pipeline to prevent leakage. This is the single most impactful step in any KNN workflow.
2
Always tune K with cross-validation โ€” never guess. Try all odd values from 1 to โˆšn_train. Plot CV accuracy vs K and choose the point where accuracy peaks and stabilises. The โˆšn heuristic is a starting point only โ€” never the final answer.
3
Use weights='distance' by default. Distance-weighted voting almost always beats uniform voting at negligible computational cost. Closer neighbours are genuinely more relevant โ€” let them vote louder. Start with distance weights and switch to uniform only if cross-validation says so.
4
Apply PCA before KNN when features exceed 20. The curse of dimensionality makes distance meaningless in high-d space. Reduce to 10โ€“15 components with PCA first. Add it inside the Pipeline: scaler โ†’ PCA โ†’ KNN. Tune n_components with cross-validation alongside K.
5
Remove irrelevant features before KNN. KNN treats every feature equally. Noisy, irrelevant features add meaningless noise to every distance calculation, polluting the neighbourhood. Use Random Forest feature importance or SelectKBest to remove dead weight first.
6
Handle class imbalance explicitly. KNN majority voting is naturally biased toward the dominant class. KNeighborsClassifier has no class_weight parameter, so either use a custom distance-weighted voting scheme or, preferably, oversample the minority class with SMOTE before fitting. Evaluate with F1 or AUC, not accuracy.
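A minimal SMOTE sketch, assuming the imbalanced-learn package is installed and using a made-up dataset; in practice, oversample only the training split, ideally inside each CV fold via imblearn's Pipeline.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE    # from the imbalanced-learn package

# Made-up 90/10 imbalanced dataset
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))      # minority class oversampled to parity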
7
For large-scale production, use Approximate Nearest Neighbour (ANN). Exact KNN does not scale beyond ~100k points for real-time inference. Libraries like FAISS (Facebook AI), Annoy (Spotify), and ScaNN (Google) find near-exact neighbours in milliseconds on billion-scale datasets using tree structures and quantisation.
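A rough sketch of what an ANN index looks like with FAISS, assuming faiss-cpu is installed; the index type, parameters, and sizes here are illustrative rather than tuned.

import numpy as np
import faiss                                  # pip install faiss-cpu

d   = 128
rng = np.random.default_rng(0)
database = rng.random((100_000, d)).astype('float32')   # made-up vectors
queries  = rng.random((5, d)).astype('float32')

index = faiss.IndexHNSWFlat(d, 32)    # HNSW graph index - approximate search
index.add(database)                   # build once, query many times
distances, ids = index.search(queries, 10)
print(ids[0])                         # indices of the ~10 nearest vectors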
8
KNN is your most explainable model, so use that advantage. Unlike Random Forests or SVMs, you can always justify a KNN prediction by showing the actual K neighbours from the training data. "We predicted diabetes because these five similar patients also had diabetes" is a uniquely compelling explanation for medical, legal, or regulatory audiences.