
Distance Metrics: Euclidean, Manhattan, Minkowski & Hamming

Master the four essential distance metrics powering machine learning: learn when to use straight-line Euclidean, city-grid Manhattan, generalised Minkowski, and bit-difference Hamming distances, with intuitive stories, step-by-step calculations, visual diagrams, and Python code.

Section 01

The Story That Explains Distance Metrics

Four Ways to Measure the Same Journey
Four explorers stand at the same starting point on a map, City A, and all want to reach City B. Each has a completely different philosophy about how distance should be measured.

Euclid pulls out a ruler and draws a straight line between the two cities. "This is the true distance," he says: the shortest path a bird could fly, regardless of roads or buildings. 5 kilometres.

Manny (from Manhattan) shakes his head. "That line goes through buildings and rivers. I can only walk on streets, horizontally and vertically. I have to go 3 blocks east and 4 blocks north." 7 kilometres.

Minka pulls out a formula with a dial on it. "The true distance depends on the terrain," she says. "In open fields I travel like Euclid. In strict grid cities I travel like Manny. I tune my dial, the parameter p, to match the environment." She represents the general case.

Hamm is not interested in geography at all. He is comparing two genetic sequences, two passwords, two binary codes. "My distance is not about space," he says. "It's about how many positions differ between two strings of equal length."

Same two points. Four valid measures. The right one depends entirely on what your data represents and what you need distance to mean.
📌
Why Distance Metrics Matter in Machine Learning

Distance is the backbone of algorithms like KNN, K-Means clustering, DBSCAN, hierarchical clustering, and similarity search. Every one of them asks the same fundamental question: "How far apart are these two points?" Choose the wrong metric and the algorithm finds the wrong neighbours, the wrong clusters, the wrong patterns, regardless of how well tuned everything else is.


Section 02

Euclidean Distance: The Straight Line

The Crow's Flight
A crow is flying from nest A to nest B. It does not follow roads, rivers, or city blocks. It flies in a perfectly straight line, directly from one point to the other. The Euclidean distance is this straight-line length: the shortest possible path between two points in space, ignoring all obstacles. It is the measure we intuitively think of when we say "how far apart are these two things?"
Euclidean Distance (2D)
d = √((x₂ − x₁)² + (y₂ − y₁)²)
The Pythagorean theorem: the hypotenuse of the right triangle formed by the horizontal and vertical differences between two points.
Euclidean Distance (n dimensions)
d = √Σᵢ (aᵢ − bᵢ)²
Sum the squared difference on every dimension, then take the square root. Also called the L2 norm or ℓ₂ distance.

Let us work through a complete example. Point A = (1, 2) and Point B = (4, 6).

๐Ÿ“ Euclidean Distance โ€” Visual Diagram and Calculation
Plot
A is at coordinates (1, 2). B is at coordinates (4, 6). On a grid, A is 3 units to the left and 4 units below B. These form the two legs of a right triangle.
Δx
Horizontal difference: x₂ − x₁ = 4 − 1 = 3 units
Δy
Vertical difference: y₂ − y₁ = 6 − 2 = 4 units
Square
Square each difference: 3² + 4² = 9 + 16 = 25
√ Root
Take the square root: √25 = 5.00, the length of the straight line from A to B
๐Ÿ“ EUCLIDEAN DISTANCE โ€” STRAIGHT LINE ON A COORDINATE GRID
Euclidean distance diagram Coordinate grid showing point A at (1,2) and point B at (4,6) connected by a straight diagonal line of length 5. A right triangle is drawn showing the horizontal leg ฮ”x=3 and vertical leg ฮ”y=4. 1 2 3 4 5 6 7 8 1 2 3 4 5 x y ฮ”x = 3 ฮ”y = 4 d = 5.00 A (1,2) B (4,6) Euclidean (hypotenuse) ฮ”x = horizontal leg ฮ”y = vertical leg d = โˆš(ฮ”xยฒ + ฮ”yยฒ) d = โˆš(9 + 16) = โˆš25 = 5
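A quick sanity check of this arithmetic with NumPy (a two-line sketch; the fuller from-scratch implementations appear in Section 07):

import numpy as np

A, B = np.array([1, 2]), np.array([4, 6])
print(np.linalg.norm(A - B))   # 5.0, the L2 norm of the difference vector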
⚠️
Euclidean Distance Is Extremely Sensitive to Scale

If one feature has values in the range [0, 10,000] (e.g. salary) and another in [0, 1] (e.g. a probability score), the salary feature will completely dominate the distance; the probability feature contributes almost nothing. Always apply StandardScaler or MinMaxScaler before any Euclidean-based algorithm (KNN, K-Means, PCA).
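To see the effect concretely, here is a minimal sketch with a hypothetical two-feature dataset (salary on [0, 10,000], probability on [0, 1]); the numbers are invented purely for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (salary, probability) points
X = np.array([[9000.0, 0.10],
              [9050.0, 0.95],    # almost the same salary, very different probability
              [2000.0, 0.10]])   # very different salary, identical probability

def euclid(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Raw space: salary swamps the probability feature entirely
print(euclid(X[0], X[1]))   # ~50.01, driven almost entirely by the 50-unit salary gap
print(euclid(X[0], X[2]))   # 7000.0

# After standardisation both features contribute on a comparable scale
Xs = StandardScaler().fit_transform(X)
print(euclid(Xs[0], Xs[1]), euclid(Xs[0], Xs[2]))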

Properties of Euclidean Distance

Property             | Value                       | What It Means
Non-negativity       | d(A,B) ≥ 0                  | Distance is always zero or positive, never negative
Identity             | d(A,A) = 0                  | Distance from a point to itself is exactly zero
Symmetry             | d(A,B) = d(B,A)             | Distance from A to B equals distance from B to A
Triangle inequality  | d(A,C) ≤ d(A,B) + d(B,C)    | The direct path is never longer than any detour through a third point
Outlier sensitivity  | High                        | Squaring differences amplifies the effect of extreme values

Section 03

Manhattan Distance: The City Grid

The Manhattan Taxi Driver
You are in a New York City taxi. You want to go from 1st Avenue & 42nd Street to 4th Avenue & 46th Street. The crow flying overhead travels in a straight diagonal line: 5 blocks as the crow flies. But your taxi cannot drive diagonally through buildings. It must go 3 blocks east along 42nd Street, then 4 blocks north up 4th Avenue. Total: 7 blocks.

That is Manhattan distance: the sum of the absolute differences along each axis, as if you are always constrained to travel along grid lines. Also called taxicab distance, city block distance, or the L1 norm.
Manhattan Distance (2D)
d = |x₂ − x₁| + |y₂ − y₁|
Sum of absolute differences on each axis. No squaring, so outliers have a less amplified effect than in Euclidean distance.
Manhattan Distance (n dimensions)
d = Σᵢ |aᵢ − bᵢ|
Also called the L1 norm or ℓ₁ distance. Used in LASSO regression, robust statistics, and high-dimensional data.
๐Ÿ™๏ธ Manhattan Distance โ€” Visual Diagram and Calculation
Grid
Same points: A = (1, 2) and B = (4, 6). Imagine a city grid: you cannot cut diagonally. Every move is either horizontal or vertical.
East →
Move horizontally from x = 1 to x = 4: |4 − 1| = 3 blocks east
North ↑
Move vertically from y = 2 to y = 6: |6 − 2| = 4 blocks north
Total
3 + 4 = 7.00 blocks, always greater than or equal to Euclidean (5.00)
Note
There are many shortest paths of length 7 between A and B on a grid (35 distinct block-by-block routes: right then up, up then right, alternating, and so on). All of them have the same Manhattan distance.
๐Ÿ™๏ธ MANHATTAN DISTANCE โ€” CITY GRID WITH HORIZONTAL + VERTICAL STEPS
Manhattan distance diagram City block grid showing three valid L-shaped taxi routes between point A (1,2) and point B (4,6), each totalling 7 blocks. The direct diagonal Euclidean line of 5 is shown for comparison. Buildings are shown as shaded rectangles within the blocks. 1 2 3 4 5 1 2 3 4 5 x (avenues) y (streets) |ฮ”x| = 3 |ฮ”y| = 4 A (1,2) B (4,6) Manhattan = |ฮ”x| + |ฮ”y| = 3 + 4 = 7.00 Route 1 โ€” right then up Route 2 โ€” up then right Route 3 โ€” alternating Euclidean d=5 (for comparison) All three routes have the same total distance = 7
🛡️
Manhattan Distance Is More Robust to Outliers

Because Manhattan uses absolute differences (not squared), extreme outliers have a linear rather than quadratic effect. This makes it more robust on noisy, real-world datasets. In LASSO regression the L1 penalty produces sparse solutions (many coefficients driven exactly to zero), a property the Euclidean (L2) penalty cannot achieve. This is a critical practical advantage. A small demonstration follows below.
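A minimal sketch of this contrast, with one feature corrupted by an invented +100 outlier:

import numpy as np

a       = np.array([1.0, 2.0, 3.0, 4.0])
b_clean = np.array([1.5, 2.5, 3.5, 4.5])     # every feature off by 0.5
b_noisy = np.array([1.5, 2.5, 3.5, 104.5])   # one feature corrupted by +100

for b in (b_clean, b_noisy):
    l2 = np.sqrt(np.sum((a - b) ** 2))   # Euclidean squares the outlier
    l1 = np.sum(np.abs(a - b))           # Manhattan adds it linearly
    print(f"Euclidean = {l2:8.2f}   Manhattan = {l1:8.2f}")

# Euclidean: 1.00 -> 100.50 (the squared outlier dominates the whole distance)
# Manhattan: 2.00 -> 102.00 (the outlier contributes its own gap, once, linearly)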

Manhattan vs Euclidean: Key Differences

Aspect                 | Euclidean (L2)                           | Manhattan (L1)
Formula                | √Σ(aᵢ−bᵢ)²                               | Σ|aᵢ−bᵢ|
Path type              | Straight diagonal line                   | Axis-aligned steps only
Result (A→B example)   | 5.00                                     | 7.00 (always ≥ Euclidean)
Outlier sensitivity    | High (squares differences)               | Lower (linear differences)
High-dimensional data  | Degrades faster                          | More stable
Best for               | Continuous, low-d, normally distributed  | High-d, noisy, sparse, grid-like data
Used in                | KNN, K-Means, PCA, SVM                   | LASSO, robust KNN, image processing, NLP

Section 04

Minkowski Distance: The Unifying Formula

Minkowski distance is not a new metric; it is a parameterised family of metrics that includes both Euclidean and Manhattan as special cases. By tuning a single parameter p, you slide between different geometries of distance.

Minkowski Distance
d = (Σᵢ |aᵢ − bᵢ|ᵖ)^(1/p)
The generalised Lp norm. Parameter p controls the "shape" of the distance measure. Any p ≥ 1 produces a true metric satisfying all the mathematical distance properties.
Chebyshev (Special Case p = ∞)
d = max |aᵢ − bᵢ|
As p → ∞, Minkowski converges to the maximum absolute difference across any single dimension: the Chebyshev or L∞ distance. Used in chess (king moves) and scheduling.
p value | Name           | Formula            | Result (A→B) | Unit-ball geometry
p = 1   | Manhattan / L1 | Σ|aᵢ−bᵢ|           | 7.00         | Diamond (rotated square)
p = 2   | Euclidean / L2 | √Σ(aᵢ−bᵢ)²         | 5.00         | Circle
p = 3   | Minkowski L3   | (Σ|aᵢ−bᵢ|³)^(1/3)  | 4.50         | Superellipse, bulging toward the square
p = 4   | Minkowski L4   | (Σ|aᵢ−bᵢ|⁴)^(1/4)  | 4.28         | Closer still to the square
p → ∞   | Chebyshev / L∞ | max|aᵢ−bᵢ|         | 4.00         | Square (largest dimension only)
🧮 Minkowski Calculation: A = (1, 2) to B = (4, 6) at Different p
Δ values
|4 − 1| = 3  |  |6 − 2| = 4 (these are the same for all p values below)
p = 1
(3¹ + 4¹)^(1/1) = 3 + 4 = 7.00 → Manhattan
p = 2
(3² + 4²)^(1/2) = (9 + 16)^(1/2) = 25^(1/2) = 5.00 → Euclidean
p = 3
(3³ + 4³)^(1/3) = (27 + 64)^(1/3) = 91^(1/3) ≈ 4.50
p = 4
(3⁴ + 4⁴)^(1/4) = (81 + 256)^(1/4) = 337^(1/4) ≈ 4.28
p → ∞
max(3, 4) = 4.00 → Chebyshev (the largest dimension wins)
⭕ MINKOWSKI UNIT BALLS: HOW THE SHAPE CHANGES AS p INCREASES
[Figure: five unit balls side by side. p = 1 gives a diamond (|x| + |y| ≤ 1), p = 2 a circle (√(x² + y²) ≤ 1), p = 4 a rounded square, p = 10 nearly a square, and p → ∞ the square max(|x|, |y|) ≤ 1. The A→B distance shrinks 7.00 → 5.00 → 4.50 → 4.28 → 4.00 as p increases.]
📉
Key Insight: As p Increases, the Distance Decreases

Manhattan (p=1) always gives the largest distance. As p increases, the result shrinks toward the Chebyshev (p=∞) limit, the maximum single-dimension gap. This is because higher p gives increasingly more weight to the dimension with the largest difference and ignores the smaller ones. The formula converges to max(|aᵢ−bᵢ|) as p → ∞.
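A short loop makes the convergence visible (a self-contained sketch; the same helper is written out more fully in Section 07):

import numpy as np

def minkowski(a, b, p):
    """Lp distance; p = inf yields the Chebyshev limit."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return d.max() if p == float('inf') else (d ** p).sum() ** (1 / p)

A, B = (1, 2), (4, 6)
for p in (1, 2, 3, 4, 10, float('inf')):
    print(f"p = {p:>4}: {minkowski(A, B, p):.4f}")
# 7.0000, 5.0000, 4.4979, 4.2846, 4.0220, 4.0000: shrinking toward max(3, 4) = 4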

The Unit Ball โ€” Visualising What Each p "Looks Like"

🔷
p = 1 (Manhattan)
Diamond / Rotated Square
All points exactly distance 1 from the origin form a diamond with its vertices on the axes. LASSO regression uses this geometry to zero out coefficients.
⭕
p = 2 (Euclidean)
Perfect Circle / Sphere
All equidistant points form a perfect circle in 2D and a sphere in 3D. Our natural geometric intuition: isotropic, treating all directions equally.
🔵
p = 3, 4, ...
Superellipse, Bulging Outward
As p grows above 2, the unit ball bulges outward beyond the circle, sitting between the circle and the square and approaching the square as p increases. Rarely used directly in practice.
🟦
p → ∞ (Chebyshev)
Square / Hypercube
All equidistant points form a square (a hypercube in higher dimensions). Distance is determined entirely by the single largest-gap dimension. Used in chess (the king moves one square in any direction).
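These shapes are easy to draw yourself. A short matplotlib sketch that traces the curve |x|ᵖ + |y|ᵖ = 1 for several p and overlays the p → ∞ square:

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
fig, ax = plt.subplots(figsize=(5, 5))

for p in (1, 2, 4, 10):
    # Signed-power parametrisation of |x|^p + |y|^p = 1
    x = np.sign(np.cos(theta)) * np.abs(np.cos(theta)) ** (2 / p)
    y = np.sign(np.sin(theta)) * np.abs(np.sin(theta)) ** (2 / p)
    ax.plot(x, y, label=f"p = {p}")

# Chebyshev limit: the unit square max(|x|, |y|) = 1
square = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1], [1, 1]])
ax.plot(square[:, 0], square[:, 1], "k--", label="p → ∞")

ax.set_aspect("equal")
ax.set_title("Minkowski unit balls")
ax.legend()
plt.show()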

Section 05

Hamming Distance: Comparing Sequences

The DNA Analyst and the Error Correction Engineer
A DNA analyst receives two genetic sequences and needs to know how similar they are, not in physical space but in character composition. She lines them up position by position and counts how many slots differ:

Sequence 1: GAATTCGACT
Sequence 2: GAATCCGAGT

Position 5: T vs C, different. Position 9: C vs G, different. Position 10: T vs T, same. Two mismatches in ten positions. Hamming distance = 2.

Meanwhile, a network engineer is checking whether a data packet was corrupted in transmission. He compares the sent binary code 10110101 with the received code 10010100. Two bits flipped (positions 3 and 8): Hamming distance = 2. Any Hamming distance greater than zero means at least one error occurred.

Hamming distance is purely about how many positions differ between two sequences of the same length. No geometry. No coordinates. Just character-by-character comparison.
Hamming Distance (Raw Count)
d = #{i : aᵢ ≠ bᵢ}
Count the number of positions where sequences a and b differ. Both sequences must be the same length. Returns an integer from 0 to n.
Normalised Hamming Distance
d = #{i : aᵢ ≠ bᵢ} / n
Divide by the sequence length n to get a value in [0, 1]: 0 means identical sequences, 1 means every position differs. This makes scores comparable across pairs of sequences of different lengths.
🧬 Hamming Distance: String Example, "KAROLIN" vs "KATHRIN"
Pos 1
K vs K → Same
Pos 2
A vs A → Same
Pos 3
R vs T → Different ✗ (count: 1)
Pos 4
O vs H → Different ✗ (count: 2)
Pos 5
L vs R → Different ✗ (count: 3)
Pos 6
I vs I → Same
Pos 7
N vs N → Same
Result
Hamming distance = 3  |  Normalised = 3/7 ≈ 0.429
🧬 HAMMING DISTANCE: CHARACTER-BY-CHARACTER COMPARISON
[Figure: three side-by-side comparisons. Example 1, strings: KAROLIN vs KATHRIN, 3 mismatches, normalised 3/7 ≈ 0.43. Example 2, binary transmission (8 bits): 2 bits flipped, Hamming = 2, transmission error detected. Example 3, DNA sequences (10 bases): 2 mutations, normalised 0.20.]
💾 Hamming Distance: Binary Example, Error Detection
Sent
1 0 1 1 1 0 1
Received
1 0 0 1 0 0 1
Compare
Bit 1: 1=1 ✓   Bit 2: 0=0 ✓   Bit 3: 1≠0 ✗   Bit 4: 1=1 ✓   Bit 5: 1≠0 ✗   Bit 6: 0=0 ✓   Bit 7: 1=1 ✓
Result
Hamming distance = 2; two bits were flipped during transmission
⚠️
Hamming Requires Equal-Length Sequences

Hamming distance is undefined for sequences of different lengths; there is no meaningful position-by-position comparison. For variable-length strings, use Levenshtein (edit) distance instead: it allows insertions, deletions, and substitutions (see the sketch below). For numeric feature vectors, Hamming applies only when features are binary or categorical, never for continuous values.
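For contrast, a minimal Levenshtein implementation (the standard dynamic-programming formulation, included here only to show what Hamming cannot do):

def levenshtein(s, t):
    """Minimum insertions, deletions, and substitutions to turn s into t.
    Unlike Hamming, the two strings may have different lengths."""
    prev = list(range(len(t) + 1))   # edit distances for the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete from s
                            curr[j - 1] + 1,             # insert into s
                            prev[j - 1] + (cs != ct)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("KAROLIN", "KATHRIN"))   # 3, matches Hamming for equal lengths here
print(levenshtein("KITTEN", "SITTING"))    # 3, even though the lengths differ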


Section 06

Side-by-Side: All Four Metrics on the Same Data

Let us apply all four distance metrics to the same two numeric points A = (1, 2, 3) and B = (4, 6, 3), a 3-dimensional example.

Metric            | Calculation                                                            | Result     | Norm name
Euclidean         | √((4−1)² + (6−2)² + (3−3)²) = √(9+16+0)                                | 5.00       | L2 / ℓ₂
Manhattan         | |4−1| + |6−2| + |3−3| = 3 + 4 + 0                                      | 7.00       | L1 / ℓ₁
Minkowski p=3     | (3³ + 4³ + 0³)^(1/3) = (27+64)^(1/3)                                   | 4.50       | L3 / ℓ₃
Chebyshev (p=∞)   | max(|4−1|, |6−2|, |3−3|) = max(3, 4, 0)                                | 4.00       | L∞ / ℓ∞
Hamming (binary)  | Positions where values differ: dim 1 (1≠4), dim 2 (2≠6), dim 3 (3=3)  | 2/3 ≈ 0.67 | Normalised count
📊
The Ordering Is Always: Chebyshev ≤ Euclidean ≤ Manhattan

For any two points in any dimension, the following inequality always holds: Chebyshev ≤ Euclidean ≤ Manhattan. This is a mathematical fact, not a coincidence. Chebyshev focuses on the single largest gap (smallest result). Manhattan sums all gaps without any compression (largest result). Euclidean compresses via the square root, sitting in between.
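The inequality is easy to confirm empirically; a quick sketch over random vectors:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    a, b = rng.normal(size=10), rng.normal(size=10)
    gaps = np.abs(a - b)
    cheb, eucl, manh = gaps.max(), np.sqrt((gaps ** 2).sum()), gaps.sum()
    assert cheb <= eucl <= manh   # holds for every pair of points
    print(f"{cheb:.3f} <= {eucl:.3f} <= {manh:.3f}")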


Section 07

Python Implementation

All Four Distances: From Scratch

import numpy as np

# ── Euclidean Distance ────────────────────────────────────────────────
def euclidean(a, b):
    """L2 norm โ€” straight-line distance between two vectors."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# ── Manhattan Distance ────────────────────────────────────────────────
def manhattan(a, b):
    """L1 norm โ€” sum of absolute differences (taxicab distance)."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return np.sum(np.abs(a - b))

# ── Minkowski Distance ────────────────────────────────────────────────
def minkowski(a, b, p=2):
    """Lp norm โ€” generalised distance. p=1โ†’Manhattan, p=2โ†’Euclidean."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    if p == float('inf'):
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

# ── Hamming Distance ──────────────────────────────────────────────────
def hamming(a, b, normalise=False):
    """Count of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Sequences must be the same length")
    count = sum(x != y for x, y in zip(a, b))
    return count / len(a) if normalise else count

A = [1, 2, 3]
B = [4, 6, 3]

print(f"Euclidean    : {euclidean(A, B):.4f}")
print(f"Manhattan    : {manhattan(A, B):.4f}")
print(f"Minkowski p=1: {minkowski(A, B, p=1):.4f}")
print(f"Minkowski p=2: {minkowski(A, B, p=2):.4f}")
print(f"Minkowski p=3: {minkowski(A, B, p=3):.4f}")
print(f"Chebyshev p=โˆž: {minkowski(A, B, p=float('inf')):.4f}")

s1, s2 = "KAROLIN", "KATHRIN"
print(f"\nHamming (raw):        {hamming(s1, s2)}")
print(f"Hamming (normalised): {hamming(s1, s2, normalise=True):.4f}")
OUTPUT
Euclidean    : 5.0000
Manhattan    : 7.0000
Minkowski p=1: 7.0000
Minkowski p=2: 5.0000
Minkowski p=3: 4.4979
Chebyshev p=∞: 4.0000

Hamming (raw):        3
Hamming (normalised): 0.4286

Using scipy: Production-Grade Implementations

from scipy.spatial.distance import (
    euclidean, cityblock, minkowski, hamming, chebyshev
)
import numpy as np

A = np.array([1, 2, 3], dtype=float)
B = np.array([4, 6, 3], dtype=float)

print(f"Euclidean  : {euclidean(A, B):.4f}")
print(f"Manhattan  : {cityblock(A, B):.4f}")
print(f"Minkowski 3: {minkowski(A, B, p=3):.4f}")
print(f"Chebyshev  : {chebyshev(A, B):.4f}")

b1 = np.array([1, 0, 1, 1, 1, 0, 1])
b2 = np.array([1, 0, 0, 1, 0, 0, 1])
print(f"Hamming    : {hamming(b1, b2):.4f}")
print(f"Hamming raw: {hamming(b1, b2) * len(b1):.0f} positions differ")
OUTPUT
Euclidean  : 5.0000
Manhattan  : 7.0000
Minkowski 3: 4.4979
Chebyshev  : 4.0000
Hamming    : 0.2857
Hamming raw: 2 positions differ

Distance Matrix: All Pairs at Once

import numpy as np
from scipy.spatial.distance import pdist, squareform
import pandas as pd

# 5 sample points, 3 continuous features each
X = np.array([
    [25, 80,  22.0],
    [35, 150, 28.5],
    [45, 180, 33.0],
    [28, 90,  23.5],
    [50, 200, 35.5],
])

for metric in ['euclidean', 'cityblock', 'chebyshev']:
    dist_matrix = squareform(pdist(X, metric=metric))
    df = pd.DataFrame(
        dist_matrix.round(1),
        index=[f"P{i+1}" for i in range(5)],
        columns=[f"P{i+1}" for i in range(5)]
    )
    print(f"\n{metric.upper()} distance matrix:")
    print(df.to_string())
OUTPUT
EUCLIDEAN distance matrix:
        P1     P2     P3     P4     P5
P1     0.0   71.0  102.6   10.5  123.3
P2    71.0    0.0   31.9   60.6   52.7
P3   102.6   31.9    0.0   92.1   20.8
P4    10.5   60.6   92.1    0.0  112.8
P5   123.3   52.7   20.8  112.8    0.0

(the CITYBLOCK and CHEBYSHEV matrices print in the same format)

Comparing Metrics in KNN: Effect on Classification

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target

metrics = {
    'Euclidean':    dict(metric='euclidean'),
    'Manhattan':    dict(metric='manhattan'),
    'Minkowski p3': dict(metric='minkowski', p=3),
    'Chebyshev':    dict(metric='chebyshev'),
}

print(f"{'Metric':15s} {'CV Accuracy':>12} {'Std Dev':>10}")
print("-" * 40)

for name, kwargs in metrics.items():
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn',    KNeighborsClassifier(n_neighbors=5, **kwargs))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    print(f"{name:15s}  {scores.mean():.4f}       ยฑ{scores.std():.4f}")
OUTPUT
Metric           CV Accuracy    Std Dev
----------------------------------------
Euclidean              0.9600  ±0.0249
Manhattan              0.9667  ±0.0211
Minkowski p3           0.9600  ±0.0249
Chebyshev              0.9600  ±0.0266

Section 08

Choosing the Right Distance Metric

📏
Use Euclidean When…
Features are continuous and roughly normally distributed. Dimensions are low (under 20). Data is already scaled. The standard choice for KNN, K-Means, and PCA.
KNN · K-Means · PCA · image features
🏙️
Use Manhattan When…
Data is high-dimensional and sparse. Outliers are present and you want less sensitivity to them. You are regularising a model (LASSO uses the L1 penalty for sparsity).
LASSO · sparse data · high-d KNN · images
🎛️
Use Minkowski When…
You want to tune the distance geometry as a hyperparameter. Treat p as a value to optimise via cross-validation and include it in GridSearchCV alongside K.
hyperparameter search · KNN tuning
🧬
Use Hamming When…
Features are binary (0/1) or categorical. Comparing DNA sequences, passwords, binary codes, or one-hot encoded vectors. Error detection in data transmission.
DNA · binary · categorical · error correction
♟️
Use Chebyshev When…
The worst-case difference across any single feature matters most. Warehouse logistics (crane movements), chess (king moves), scheduling bottleneck problems.
logistics · gaming · scheduling
📐
Use Cosine When…
Direction matters more than magnitude. Two documents with the same word proportions but different lengths should be considered similar. Standard for TF-IDF and word embeddings (see the sketch below).
NLP · text · word embeddings · recommendations
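The document-length effect is easy to demonstrate. A small sketch using scipy's cosine distance on two hypothetical term-count vectors, the second simply a doubled copy of the first:

import numpy as np
from scipy.spatial.distance import cosine, euclidean

doc_a = np.array([2.0, 1.0, 0.0, 3.0])   # term counts for a short document
doc_b = 2 * doc_a                        # same word proportions, twice as long

print(f"Cosine   : {cosine(doc_a, doc_b):.4f}")     # 0.0000, identical direction
print(f"Euclidean: {euclidean(doc_a, doc_b):.4f}")  # 3.7417, misled by length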

Section 09

Complete Reference Table: All Metrics

Metric     | Formula             | Norm | Scale-sensitive | Outlier-sensitive | Best feature type             | scipy function
Euclidean  | √Σ(aᵢ−bᵢ)²          | L2   | Yes             | High              | Continuous, normal            | euclidean()
Manhattan  | Σ|aᵢ−bᵢ|            | L1   | Yes             | Lower             | Continuous, counts            | cityblock()
Minkowski  | (Σ|aᵢ−bᵢ|ᵖ)^(1/p)   | Lp   | Yes             | Depends on p      | Continuous (tune p)           | minkowski()
Chebyshev  | max|aᵢ−bᵢ|          | L∞   | Yes             | Extreme           | Worst-case scenarios          | chebyshev()
Hamming    | #{i: aᵢ≠bᵢ} / n     | n/a  | No              | No                | Binary, categorical, strings  | hamming()
Cosine     | 1 − a·b/(‖a‖‖b‖)    | n/a  | No              | No                | Text vectors, embeddings      | cosine()

Section 10

Real-World Applications

Euclidean Distance
📏
  • KNN classification (default metric)
  • K-Means cluster centroid updates
  • PCA and dimensionality reduction
  • GPS route distance calculation
  • Face recognition feature matching
Manhattan Distance
🏙️
  • LASSO regularisation (L1 penalty)
  • Robust KNN on noisy datasets
  • Image reconstruction (L1 loss)
  • Urban planning and logistics
  • Sparse signal recovery
Hamming Distance
🧬
  • DNA and protein sequence comparison
  • Network error detection and correction
  • Spell checking and string matching
  • Cryptographic hash comparison
  • Binary feature similarity (KNN)

Section 11

Golden Rules

๐Ÿ“ Distance Metrics โ€” Non-Negotiable Rules
1. Always scale features before computing Euclidean or Manhattan distance. Both are sensitive to feature magnitude. A salary column in the thousands will completely dominate an age column in the tens. Apply StandardScaler or MinMaxScaler inside a Pipeline before any distance-based algorithm.
2. Use Manhattan over Euclidean when outliers are present. Euclidean squares differences, so one extreme outlier inflates distances quadratically. Manhattan uses absolute differences and responds linearly to outliers. On noisy real-world data, Manhattan distance often produces more stable KNN results.
3. Treat p in Minkowski as a tunable hyperparameter. Do not default to p=2 (Euclidean) blindly. Include p in your GridSearchCV: try p ∈ {1, 2, 3, 4} alongside K and weights (see the sketch after these rules). On some datasets, p=1.5 or p=3 outperforms both fixed alternatives.
4. Use Hamming only for equal-length binary or categorical sequences. Hamming on continuous features is meaningless: every value differs and you lose all magnitude information. For variable-length strings use Levenshtein distance. For continuous features use Euclidean or Manhattan.
5. In high dimensions (>20 features), prefer Manhattan over Euclidean. As dimensions increase, Euclidean distances concentrate and all points become nearly equidistant. Manhattan distance degrades more slowly and retains more discriminative power in high-dimensional spaces.
6. For text data, use cosine distance, not Euclidean or Manhattan. Two documents with identical word proportions but different lengths will appear far apart in Euclidean space despite being semantically identical. Cosine distance measures the angle between vectors, ignoring magnitude, which is exactly what text similarity requires.
7. Always verify your metric is valid (satisfies the metric axioms): non-negativity, identity, symmetry, and the triangle inequality. Minkowski with p < 1 violates the triangle inequality; it is not a true metric and will cause unpredictable behaviour in algorithms that assume metric properties.
8. Use scipy.spatial.distance for production, not manual loops. scipy's distance functions are implemented in C, vectorised, and numerically stable. The cdist and pdist functions compute entire distance matrices efficiently. For million-scale similarity search, use FAISS or ScaNN.
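As promised in rule 3, a sketch of tuning p with GridSearchCV (iris is used purely for illustration; the grid values are example choices, not recommendations):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),                       # rule 1: scale first
    ('knn',    KNeighborsClassifier(metric='minkowski')),
])

param_grid = {
    'knn__p':           [1, 1.5, 2, 3, 4],              # distance geometry
    'knn__n_neighbors': [3, 5, 7, 9],                   # K
    'knn__weights':     ['uniform', 'distance'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")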