The Story That Explains Distance Metrics
Euclid pulls out a ruler and draws a straight line between the two cities. "This is the true distance," he says: the shortest path a bird could fly, regardless of roads or buildings. 5 kilometres.
Manny (from Manhattan) shakes his head. "That line goes through buildings and rivers. I can only walk on streets, horizontally and vertically. I have to go 3 blocks east and 4 blocks north." 7 kilometres.
Minka pulls out a formula with a dial on it. "The true distance depends on the terrain," she says. "In open fields I travel like Euclid. In strict grid cities I travel like Manny. I tune my dial, the parameter p, to match the environment." She represents the general case.
Hamm is not interested in geography at all. He is comparing two genetic sequences, two passwords, two binary codes. "My distance is not about space," he says. "It's about how many positions differ between two strings of equal length."
Same two points. Four valid measures. The right one depends entirely on what your data represents and what you need distance to mean.
Distance is the backbone of algorithms like KNN, K-Means clustering, DBSCAN, hierarchical clustering, and similarity search. Every one of them asks the same fundamental question: "How far apart are these two points?" Choose the wrong metric and the algorithm finds the wrong neighbours, the wrong clusters, the wrong patterns, regardless of how well tuned everything else is.
Euclidean Distance – The Straight Line
Let us work through a complete example. Point A = (1, 2) and Point B = (4, 6). Euclidean distance is the square root of the sum of squared differences along each axis: d(A, B) = √((4−1)² + (6−2)²) = √(9 + 16) = √25 = 5.
If one feature has values in the range [0, 10,000] (e.g. salary) and another in [0, 1] (e.g. a probability score), the salary feature will completely dominate the distance; the probability feature contributes almost nothing. Always apply StandardScaler or MinMaxScaler before any Euclidean-based algorithm (KNN, K-Means, PCA).
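A minimal sketch of the problem (the salary and probability figures are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers described by (salary, probability score) -- hypothetical values
A = np.array([52_000.0, 0.90])
B = np.array([50_000.0, 0.10])

# Raw Euclidean distance: the salary gap swamps the probability gap
raw = np.sqrt(np.sum((A - B) ** 2))
print(f"Raw distance:    {raw:.4f}")      # ~2000.0 -- probability is invisible

# After standardising each feature, both contribute on equal footing
X = StandardScaler().fit_transform(np.vstack([A, B]))
scaled = np.sqrt(np.sum((X[0] - X[1]) ** 2))
print(f"Scaled distance: {scaled:.4f}")   # both features now count
```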
Properties of Euclidean Distance
| Property | Value | What It Means |
|---|---|---|
| Non-negativity | d(A,B) ≥ 0 | Distance is always zero or positive, never negative |
| Identity | d(A,A) = 0 | Distance from a point to itself is exactly zero |
| Symmetry | d(A,B) = d(B,A) | Distance from A to B equals distance from B to A |
| Triangle Inequality | d(A,C) ≤ d(A,B) + d(B,C) | The direct path is never longer than any detour through a third point |
| Sensitivity to outliers | High | Squaring differences amplifies the effect of extreme values |
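These properties are worth checking rather than taking on faith. A quick numerical spot-check of the four axioms on random points (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Check the four metric axioms on 1,000 random triples of 5-D points
for _ in range(1000):
    A, B, C = rng.normal(size=(3, 5))
    assert euclidean(A, B) >= 0                          # non-negativity
    assert np.isclose(euclidean(A, A), 0.0)              # identity
    assert np.isclose(euclidean(A, B), euclidean(B, A))  # symmetry
    assert euclidean(A, C) <= euclidean(A, B) + euclidean(B, C) + 1e-12  # triangle
print("All four axioms held on 1,000 random triples")
```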
Manhattan Distance – The City Grid
Manhattan distance is the sum of the absolute differences along each axis, as if you were always constrained to travel along grid lines. It is also called taxicab distance, city-block distance, or the L1 norm. For A = (1, 2) and B = (4, 6): d(A, B) = |4−1| + |6−2| = 3 + 4 = 7.
Because Manhattan uses absolute differences (not squared), extreme outliers have a linear rather than quadratic effect, which makes it more robust on noisy, real-world datasets. In LASSO regression the L1 penalty produces sparse solutions (many coefficients driven exactly to zero), a property the L2 (Euclidean) penalty cannot achieve. This is a critical practical advantage, as the sketch below illustrates.
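A sketch of that sparsity effect using sklearn's Lasso and Ridge on synthetic data (the dataset shape and alpha values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: only 5 of 50 features actually carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 zeroes out most irrelevant coefficients; L2 merely shrinks them
print(f"Lasso coefficients exactly zero: {np.sum(lasso.coef_ == 0)} / 50")
print(f"Ridge coefficients exactly zero: {np.sum(ridge.coef_ == 0)} / 50")
```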
Manhattan vs Euclidean – Key Differences
| Aspect | Euclidean (L2) | Manhattan (L1) |
|---|---|---|
| Formula | √Σ(aᵢ−bᵢ)² | Σ\|aᵢ−bᵢ\| |
| Path type | Straight diagonal line | Axis-aligned steps only |
| Result (A→B example) | 5.00 | 7.00 (always ≥ Euclidean) |
| Outlier sensitivity | High (squares differences) | Lower (linear differences) |
| High-dimensional data | Degrades faster | More stable |
| Best for | Continuous, low-dimensional, normally distributed | High-dimensional, noisy, sparse, grid-like data |
| Used in | KNN, K-Means, PCA, SVM | LASSO, robust KNN, image processing, NLP |
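The high-dimensional rows of this table reflect a known phenomenon called distance concentration: as dimensionality grows, the gap between the nearest and farthest neighbour shrinks. A small illustration on random uniform data with one query point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Relative contrast between nearest and farthest neighbour of a query point
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(500, d))
    q = rng.uniform(size=d)
    dists = np.sqrt(np.sum((X - q) ** 2, axis=1))  # Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}  relative contrast = {contrast:.3f}")
# Contrast collapses as d grows: every point starts to look equally far away
```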
Minkowski Distance – The Unifying Formula
Minkowski distance is not a new metric; it is a parameterised family of metrics, d(A, B) = (Σᵢ \|aᵢ−bᵢ\|ᵖ)^(1/p), that includes both Euclidean and Manhattan as special cases. By tuning a single parameter p, you slide between different geometries of distance.
| p Value | Name | Formula | Result (A→B) | Geometry |
|---|---|---|---|---|
| p = 1 | Manhattan / L1 | Σ\|aᵢ−bᵢ\| | 7.00 | Diamond-shaped unit ball |
| p = 2 | Euclidean / L2 | √Σ(aᵢ−bᵢ)² | 5.00 | Circular unit ball |
| p = 3 | Minkowski L3 | (Σ\|aᵢ−bᵢ\|³)^(1/3) | 4.50 | Rounder than circle |
| p = 4 | Minkowski L4 | (Σ\|aᵢ−bᵢ\|⁴)^(1/4) | 4.28 | Very round, approaching square |
| p → ∞ | Chebyshev / L∞ | max\|aᵢ−bᵢ\| | 4.00 | Square unit ball (max dimension only) |
Manhattan (p=1) always gives the largest distance. As p increases, the result shrinks toward the Chebyshev (p=∞) limit, the maximum single-dimension gap. This is because higher p gives increasingly more weight to the dimension with the largest difference and ignores smaller ones; the formula converges to max(|aᵢ−bᵢ|) as p→∞.
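This convergence is easy to see numerically. A quick sketch with the running A = (1, 2), B = (4, 6) example (per-axis gaps of 3 and 4):

```python
import numpy as np

gaps = np.abs(np.array([1.0, 2.0]) - np.array([4.0, 6.0]))  # [3, 4]

for p in [1, 2, 3, 4, 10, 100]:
    print(f"p = {p:>3}: {np.sum(gaps ** p) ** (1 / p):.4f}")
print(f"p = inf: {gaps.max():.4f}")  # the Chebyshev limit: max single gap
```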
The Unit Ball – Visualising What Each p "Looks Like"
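The unit ball of a metric is the set of all points at distance exactly 1 from the origin, and its shape makes each p tangible: a diamond at p=1, a circle at p=2, swelling toward a square as p→∞. A minimal matplotlib sketch that traces these curves:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
directions = np.column_stack([np.cos(theta), np.sin(theta)])

fig, ax = plt.subplots(figsize=(5, 5))
for p in [1, 2, 3, float('inf')]:
    # Rescale each direction vector so its Lp norm equals exactly 1
    if p == float('inf'):
        norms = np.max(np.abs(directions), axis=1)
        label = "p = inf (square)"
    else:
        norms = np.sum(np.abs(directions) ** p, axis=1) ** (1 / p)
        label = f"p = {p}"
    ball = directions / norms[:, None]
    ax.plot(ball[:, 0], ball[:, 1], label=label)

ax.set_aspect('equal')
ax.legend()
ax.set_title("Minkowski unit balls for increasing p")
plt.show()
```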
Hamming Distance – Comparing Sequences
Sequence 1: GAATTCGACT
Sequence 2: GAATCCGAGT
Position 5: T vs C → different. Position 9: C vs G → different. Position 10: T vs T → same. Two mismatches in ten positions. Hamming distance = 2.
Meanwhile, a network engineer is checking whether a data packet was corrupted in transmission. He compares the sent binary code
10110101 with the received code 10010100.
Two bits flipped (positions 3 and 8): Hamming distance = 2.
Any nonzero Hamming distance means at least one transmission error occurred. Hamming distance is purely about how many positions differ between two sequences of the same length. No geometry. No coordinates. Just character-by-character comparison.
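For bit patterns specifically, the same count falls out of a single XOR: differing bits become 1s, and counting them gives the distance. A minimal sketch:

```python
def hamming_bits(x: int, y: int) -> int:
    """Hamming distance between two equal-width bit patterns:
    XOR marks every flipped position, then count the set bits."""
    return bin(x ^ y).count("1")

sent, received = 0b10110101, 0b10010100
print(hamming_bits(sent, received))  # 2 -- bits 3 and 8 (from the left) flipped
```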
Hamming distance is undefined for sequences of different lengths; there is no meaningful position-by-position comparison. For variable-length strings, use Levenshtein (edit) distance instead, which allows insertions, deletions, and substitutions (a sketch follows). For numeric feature vectors, Hamming applies only when features are binary or categorical, never for continuous values.
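For contrast, a minimal dynamic-programming sketch of Levenshtein distance (the standard two-row formulation, not a library call):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s into t."""
    prev = list(range(len(t) + 1))        # row for the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                        # deleting i characters of s
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,              # delete cs
                curr[j - 1] + 1,          # insert ct
                prev[j - 1] + (cs != ct)  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("KAROLIN", "KATHRIN"))  # 3 -- matches Hamming here
print(levenshtein("kitten", "sitting"))   # 3 -- unequal lengths are fine
```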
Side-by-Side – All Four Metrics on the Same Data
Let us apply all four distance metrics to the same two numeric points A = (1, 2, 3) and B = (4, 6, 3), a three-dimensional example.
| Metric | Calculation | Result | Norm Name |
|---|---|---|---|
| Euclidean | √((4−1)² + (6−2)² + (3−3)²) = √(9+16+0) | 5.00 | L2 / ℓ₂ |
| Manhattan | \|4−1\| + \|6−2\| + \|3−3\| = 3 + 4 + 0 | 7.00 | L1 / ℓ₁ |
| Minkowski p=3 | (3³ + 4³ + 0³)^(1/3) = (27+64)^(1/3) | 4.50 | L3 / ℓ₃ |
| Chebyshev (p=∞) | max(\|3\|, \|4\|, \|0\|) | 4.00 | L∞ / ℓ∞ |
| Hamming (binary) | Positions where values differ: dim 1 (1≠4), dim 2 (2≠6), dim 3 (3=3) | 2/3 ≈ 0.67 | Normalised count |
For any two points in any dimension, the following inequality always holds: Chebyshev ≤ Euclidean ≤ Manhattan. This is a mathematical fact, not a coincidence. Chebyshev focuses on the single largest gap (smallest result). Manhattan sums all gaps without any compression (largest result). Euclidean compresses via the square root, sitting in between.
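A quick empirical confirmation on random vectors (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(7)

# Chebyshev <= Euclidean <= Manhattan on 10,000 random 8-D pairs
for _ in range(10_000):
    gaps = np.abs(rng.normal(size=8) - rng.normal(size=8))
    cheb, eucl, manh = gaps.max(), np.sqrt(np.sum(gaps ** 2)), gaps.sum()
    assert cheb <= eucl <= manh
print("Chebyshev <= Euclidean <= Manhattan held on every pair")
```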
Python Implementation
All Four Distances – From Scratch
import numpy as np
# ── Euclidean Distance ────────────────────────────────────────
def euclidean(a, b):
    """L2 norm: straight-line distance between two vectors."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# ── Manhattan Distance ────────────────────────────────────────
def manhattan(a, b):
    """L1 norm: sum of absolute differences (taxicab distance)."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return np.sum(np.abs(a - b))

# ── Minkowski Distance ────────────────────────────────────────
def minkowski(a, b, p=2):
    """Lp norm: generalised distance. p=1 gives Manhattan, p=2 Euclidean."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    if p == float('inf'):
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

# ── Hamming Distance ──────────────────────────────────────────
def hamming(a, b, normalise=False):
    """Count of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Sequences must be the same length")
    count = sum(x != y for x, y in zip(a, b))
    return count / len(a) if normalise else count
A = [1, 2, 3]
B = [4, 6, 3]
print(f"Euclidean : {euclidean(A, B):.4f}")
print(f"Manhattan : {manhattan(A, B):.4f}")
print(f"Minkowski p=1: {minkowski(A, B, p=1):.4f}")
print(f"Minkowski p=2: {minkowski(A, B, p=2):.4f}")
print(f"Minkowski p=3: {minkowski(A, B, p=3):.4f}")
print(f"Chebyshev p=โ: {minkowski(A, B, p=float('inf')):.4f}")
s1, s2 = "KAROLIN", "KATHRIN"
print(f"\nHamming (raw): {hamming(s1, s2)}")
print(f"Hamming (normalised): {hamming(s1, s2, normalise=True):.4f}")
Using scipy – Production-Grade Implementations
from scipy.spatial.distance import (
    euclidean, cityblock, minkowski, hamming, chebyshev
)
import numpy as np
A = np.array([1, 2, 3], dtype=float)
B = np.array([4, 6, 3], dtype=float)
print(f"Euclidean : {euclidean(A, B):.4f}")
print(f"Manhattan : {cityblock(A, B):.4f}")
print(f"Minkowski 3: {minkowski(A, B, p=3):.4f}")
print(f"Chebyshev : {chebyshev(A, B):.4f}")
b1 = np.array([1, 0, 1, 1, 1, 0, 1])
b2 = np.array([1, 0, 0, 1, 0, 0, 1])
print(f"Hamming : {hamming(b1, b2):.4f}")
print(f"Hamming raw: {hamming(b1, b2) * len(b1):.0f} positions differ")
Distance Matrix – All Pairs at Once
import numpy as np
from scipy.spatial.distance import pdist, squareform
import pandas as pd
X = np.array([   # five samples, three numeric features on different scales
[25, 80, 22.0],
[35, 150, 28.5],
[45, 180, 33.0],
[28, 90, 23.5],
[50, 200, 35.5],
])
for metric in ['euclidean', 'cityblock', 'chebyshev']:
dist_matrix = squareform(pdist(X, metric=metric))
df = pd.DataFrame(
dist_matrix.round(1),
index=[f"P{i+1}" for i in range(5)],
columns=[f"P{i+1}" for i in range(5)]
)
print(f"\n{metric.upper()} distance matrix:")
print(df.to_string())
Comparing Metrics in KNN – Effect on Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
iris = load_iris()
X, y = iris.data, iris.target
metrics = {
'Euclidean': dict(metric='euclidean'),
'Manhattan': dict(metric='manhattan'),
    'Minkowski p3': dict(metric='minkowski', p=3),
'Chebyshev': dict(metric='chebyshev'),
}
print(f"{'Metric':15s} {'CV Accuracy':>12} {'Std Dev':>10}")
print("-" * 40)
for name, kwargs in metrics.items():
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, **kwargs))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"{name:15s} {scores.mean():.4f} ยฑ{scores.std():.4f}")
Choosing the Right Distance Metric
Complete Reference Table – All Metrics
| Metric | Formula | Norm | Sensitive to Scale | Sensitive to Outliers | Best Feature Type | scipy Function |
|---|---|---|---|---|---|---|
| Euclidean | √Σ(aᵢ−bᵢ)² | L2 | Yes | High | Continuous, normal | euclidean() |
| Manhattan | Σ\|aᵢ−bᵢ\| | L1 | Yes | Lower | Continuous, counts | cityblock() |
| Minkowski | (Σ\|aᵢ−bᵢ\|ᵖ)^(1/p) | Lp | Yes | Depends on p | Continuous (tune p) | minkowski() |
| Chebyshev | max\|aᵢ−bᵢ\| | L∞ | Yes | Extreme | Worst-case scenarios | chebyshev() |
| Hamming | #{i : aᵢ≠bᵢ} / n | – | No | No | Binary, categorical, strings | hamming() |
| Cosine | 1 − (a·b / ‖a‖‖b‖) | – | No | No | Text vectors, embeddings | cosine() |
Real-World Applications
- KNN classification (default metric)
- K-Means cluster centroid updates
- PCA and dimensionality reduction
- GPS route distance calculation
- Face recognition feature matching
- LASSO regularisation (L1 penalty)
- Robust KNN on noisy datasets
- Image reconstruction (L1 loss)
- Urban planning and logistics
- Sparse signal recovery
- DNA and protein sequence comparison
- Network error detection and correction
- Spell checking and string matching
- Cryptographic hash comparison
- Binary feature similarity (KNN)
Golden Rules
- Always scale features: apply StandardScaler or MinMaxScaler inside a Pipeline before any distance-based algorithm.
- Treat p as a hyperparameter in GridSearchCV: try p ∈ {1, 2, 3, 4} alongside K and weights (see the sketch after this list). On some datasets, p=1.5 or p=3 outperforms both fixed alternatives.
- For pairwise distances, scipy's cdist and pdist functions compute entire distance matrices efficiently.
- For million-scale similarity search, use an approximate nearest-neighbour library such as FAISS or ScaNN.
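A sketch of the GridSearchCV tuning recommended above, on the Iris data used earlier (the grid values are illustrative, not prescriptive):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),                  # scale before measuring distance
    ('knn', KNeighborsClassifier(metric='minkowski')),
])
param_grid = {
    'knn__p': [1, 1.5, 2, 3, 4],                   # Manhattan through beyond-Euclidean
    'knn__n_neighbors': [3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(f"Best parameters : {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```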