The Story That Explains Distance Metrics
Euclid pulls out a ruler and draws a straight line between the two cities. "This is the true distance," he says: the shortest path a bird could fly, regardless of roads or buildings. 5 kilometres.
Manny (from Manhattan) shakes his head. "That line goes through buildings and rivers. I can only walk on streets, horizontally and vertically. I have to go 3 blocks east and 4 blocks north." 7 kilometres.
Minka pulls out a formula with a dial on it. "The true distance depends on the terrain," she says. "In open fields I travel like Euclid. In strict grid cities I travel like Manny. I tune my dial, the parameter p, to match the environment." She represents the general case.
Hamm is not interested in geography at all. He is comparing two genetic sequences, two passwords, two binary codes. "My distance is not about space," he says. "It's about how many positions differ between two strings of equal length."
Same two points. Four valid measures. The right one depends entirely on what your data represents and what you need distance to mean.
Distance is the backbone of algorithms like KNN, K-Means clustering, DBSCAN, hierarchical clustering, and similarity search. Every one of them asks the same fundamental question: "How far apart are these two points?" Choose the wrong metric and the algorithm finds the wrong neighbours, the wrong clusters, the wrong patterns, regardless of how well tuned everything else is.
Euclidean Distance – The Straight Line
Let us work through a complete example. Point A = (1, 2) and Point B = (4, 6). Euclidean distance is the square root of the sum of squared differences along each axis: d(A, B) = √((4−1)² + (6−2)²) = √(9 + 16) = √25 = 5.
If one feature has values in the range [0, 10,000] (e.g. salary) and another in [0, 1] (e.g. a probability score), the salary feature will completely dominate the distance; the probability feature contributes almost nothing. Always apply StandardScaler or MinMaxScaler before any Euclidean-based algorithm (KNN, K-Means, PCA).
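A minimal sketch of the problem (the salary and probability figures are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers described by (salary, probability score) -- hypothetical values
A = np.array([52_000.0, 0.90])
B = np.array([50_000.0, 0.10])

# Raw Euclidean distance: the salary gap swamps the probability gap
raw = np.sqrt(np.sum((A - B) ** 2))
print(f"Raw distance:    {raw:.4f}")      # ~2000.0 -- probability is invisible

# After standardising each feature, both contribute on equal footing
X = StandardScaler().fit_transform(np.vstack([A, B]))
scaled = np.sqrt(np.sum((X[0] - X[1]) ** 2))
print(f"Scaled distance: {scaled:.4f}")   # both features now count
```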
Properties of Euclidean Distance
| Property | Value | What It Means |
|---|---|---|
| Non-negativity | d(A,B) ≥ 0 | Distance is always zero or positive, never negative |
| Identity | d(A,A) = 0 | Distance from a point to itself is exactly zero |
| Symmetry | d(A,B) = d(B,A) | Distance from A to B equals distance from B to A |
| Triangle Inequality | d(A,C) ≤ d(A,B) + d(B,C) | The direct path is never longer than any detour through a third point |
| Sensitivity to outliers | High | Squaring differences amplifies the effect of extreme values |
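These properties are worth checking rather than taking on faith. A quick numerical spot-check of the four axioms on random points (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Check the four metric axioms on 1,000 random triples of 5-D points
for _ in range(1000):
    A, B, C = rng.normal(size=(3, 5))
    assert euclidean(A, B) >= 0                          # non-negativity
    assert np.isclose(euclidean(A, A), 0.0)              # identity
    assert np.isclose(euclidean(A, B), euclidean(B, A))  # symmetry
    assert euclidean(A, C) <= euclidean(A, B) + euclidean(B, C) + 1e-12  # triangle
print("All four axioms held on 1,000 random triples")
```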
Manhattan Distance – The City Grid
Manhattan distance is the sum of the absolute differences along each axis, as if you were always constrained to travel along grid lines. It is also called taxicab distance, city-block distance, or the L1 norm. For A = (1, 2) and B = (4, 6): d(A, B) = |4−1| + |6−2| = 3 + 4 = 7.
Because Manhattan uses absolute differences (not squared), extreme outliers have a linear rather than quadratic effect, which makes it more robust on noisy, real-world datasets. In LASSO regression the L1 penalty produces sparse solutions (many coefficients driven exactly to zero), a property the L2 (Euclidean) penalty cannot achieve. This is a critical practical advantage, as the sketch below illustrates.
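A sketch of that sparsity effect using sklearn's Lasso and Ridge on synthetic data (the dataset shape and alpha values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic problem: only 5 of 50 features actually carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 zeroes out most irrelevant coefficients; L2 merely shrinks them
print(f"Lasso coefficients exactly zero: {np.sum(lasso.coef_ == 0)} / 50")
print(f"Ridge coefficients exactly zero: {np.sum(ridge.coef_ == 0)} / 50")
```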
Manhattan vs Euclidean – Key Differences
| Aspect | Euclidean (L2) | Manhattan (L1) |
|---|---|---|
| Formula | √Σ(aᵢ−bᵢ)² | Σ\|aᵢ−bᵢ\| |
| Path type | Straight diagonal line | Axis-aligned steps only |
| Result (A→B example) | 5.00 | 7.00 (always ≥ Euclidean) |
| Outlier sensitivity | High (squares differences) | Lower (linear differences) |
| High-dimensional data | Degrades faster | More stable |
| Best for | Continuous, low-dimensional, normally distributed | High-dimensional, noisy, sparse, grid-like data |
| Used in | KNN, K-Means, PCA, SVM | LASSO, robust KNN, image processing, NLP |
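The high-dimensional rows of this table reflect a known phenomenon called distance concentration: as dimensionality grows, the gap between the nearest and farthest neighbour shrinks. A small illustration on random uniform data with one query point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Relative contrast between nearest and farthest neighbour of a query point
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(500, d))
    q = rng.uniform(size=d)
    dists = np.sqrt(np.sum((X - q) ** 2, axis=1))  # Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}  relative contrast = {contrast:.3f}")
# Contrast collapses as d grows: every point starts to look equally far away
```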
Minkowski Distance – The Unifying Formula
Minkowski distance is not a new metric; it is a parameterised family of metrics, d(A, B) = (Σᵢ \|aᵢ−bᵢ\|ᵖ)^(1/p), that includes both Euclidean and Manhattan as special cases. By tuning a single parameter p, you slide between different geometries of distance.
| p Value | Name | Formula | Result (A→B) | Geometry |
|---|---|---|---|---|
| p = 1 | Manhattan / L1 | Σ\|aᵢ−bᵢ\| | 7.00 | Diamond-shaped unit ball |
| p = 2 | Euclidean / L2 | √Σ(aᵢ−bᵢ)² | 5.00 | Circular unit ball |
| p = 3 | Minkowski L3 | (Σ\|aᵢ−bᵢ\|³)^(1/3) | 4.50 | Rounder than circle |
| p = 4 | Minkowski L4 | (Σ\|aᵢ−bᵢ\|⁴)^(1/4) | 4.28 | Very round, approaching square |
| p → ∞ | Chebyshev / L∞ | max\|aᵢ−bᵢ\| | 4.00 | Square unit ball (max dimension only) |
Manhattan (p=1) always gives the largest distance. As p increases, the result shrinks toward the Chebyshev (p=∞) limit, the maximum single-dimension gap. This is because higher p gives increasingly more weight to the dimension with the largest difference and ignores smaller ones; the formula converges to max(|aᵢ−bᵢ|) as p→∞.
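This convergence is easy to see numerically. A quick sketch with the running A = (1, 2), B = (4, 6) example (per-axis gaps of 3 and 4):

```python
import numpy as np

gaps = np.abs(np.array([1.0, 2.0]) - np.array([4.0, 6.0]))  # [3, 4]

for p in [1, 2, 3, 4, 10, 100]:
    print(f"p = {p:>3}: {np.sum(gaps ** p) ** (1 / p):.4f}")
print(f"p = inf: {gaps.max():.4f}")  # the Chebyshev limit: max single gap
```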
The Unit Ball – Visualising What Each p "Looks Like"
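The unit ball of a metric is the set of all points at distance exactly 1 from the origin, and its shape makes each p tangible: a diamond at p=1, a circle at p=2, swelling toward a square as p→∞. A minimal matplotlib sketch that traces these curves:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
directions = np.column_stack([np.cos(theta), np.sin(theta)])

fig, ax = plt.subplots(figsize=(5, 5))
for p in [1, 2, 3, float('inf')]:
    # Rescale each direction vector so its Lp norm equals exactly 1
    if p == float('inf'):
        norms = np.max(np.abs(directions), axis=1)
        label = "p = inf (square)"
    else:
        norms = np.sum(np.abs(directions) ** p, axis=1) ** (1 / p)
        label = f"p = {p}"
    ball = directions / norms[:, None]
    ax.plot(ball[:, 0], ball[:, 1], label=label)

ax.set_aspect('equal')
ax.legend()
ax.set_title("Minkowski unit balls for increasing p")
plt.show()
```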
Hamming Distance – Comparing Sequences
Sequence 1: GAATTCGACT
Sequence 2: GAATCCGAGT
Position 5: T vs C → different. Position 9: C vs G → different. Position 10: T vs T → same. Two mismatches in ten positions. Hamming distance = 2.
Meanwhile, a network engineer is checking whether a data packet was corrupted in transmission. He compares the sent binary code
10110101 with the received code 10010100.
Two bits flipped (positions 3 and 8): Hamming distance = 2.
Any nonzero Hamming distance means at least one transmission error occurred. Hamming distance is purely about how many positions differ between two sequences of the same length. No geometry. No coordinates. Just character-by-character comparison.
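For bit patterns specifically, the same count falls out of a single XOR: differing bits become 1s, and counting them gives the distance. A minimal sketch:

```python
def hamming_bits(x: int, y: int) -> int:
    """Hamming distance between two equal-width bit patterns:
    XOR marks every flipped position, then count the set bits."""
    return bin(x ^ y).count("1")

sent, received = 0b10110101, 0b10010100
print(hamming_bits(sent, received))  # 2 -- bits 3 and 8 (from the left) flipped
```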
Hamming distance is undefined for sequences of different lengths; there is no meaningful position-by-position comparison. For variable-length strings, use Levenshtein (edit) distance instead, which allows insertions, deletions, and substitutions (a sketch follows). For numeric feature vectors, Hamming applies only when features are binary or categorical, never for continuous values.
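For contrast, a minimal dynamic-programming sketch of Levenshtein distance (the standard two-row formulation, not a library call):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s into t."""
    prev = list(range(len(t) + 1))        # row for the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                        # deleting i characters of s
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,              # delete cs
                curr[j - 1] + 1,          # insert ct
                prev[j - 1] + (cs != ct)  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("KAROLIN", "KATHRIN"))  # 3 -- matches Hamming here
print(levenshtein("kitten", "sitting"))   # 3 -- unequal lengths are fine
```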
Side-by-Side – All Four Metrics on the Same Data
Let us apply all four distance metrics to the same two numeric points A = (1, 2, 3) and B = (4, 6, 3), a three-dimensional example.
| Metric | Calculation | Result | Norm Name |
|---|---|---|---|
| Euclidean | √((4−1)² + (6−2)² + (3−3)²) = √(9+16+0) | 5.00 | L2 / ℓ₂ |
| Manhattan | \|4−1\| + \|6−2\| + \|3−3\| = 3 + 4 + 0 | 7.00 | L1 / ℓ₁ |
| Minkowski p=3 | (3³ + 4³ + 0³)^(1/3) = (27+64)^(1/3) | 4.50 | L3 / ℓ₃ |
| Chebyshev (p=∞) | max(\|3\|, \|4\|, \|0\|) | 4.00 | L∞ / ℓ∞ |
| Hamming (binary) | Positions where values differ: dim 1 (1≠4), dim 2 (2≠6), dim 3 (3=3) | 2/3 ≈ 0.67 | Normalised count |
For any two points in any dimension, the following inequality always holds: Chebyshev ≤ Euclidean ≤ Manhattan. This is a mathematical fact, not a coincidence. Chebyshev focuses on the single largest gap (smallest result). Manhattan sums all gaps without any compression (largest result). Euclidean compresses via the square root, sitting in between.
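A quick empirical confirmation on random vectors (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(7)

# Chebyshev <= Euclidean <= Manhattan on 10,000 random 8-D pairs
for _ in range(10_000):
    gaps = np.abs(rng.normal(size=8) - rng.normal(size=8))
    cheb, eucl, manh = gaps.max(), np.sqrt(np.sum(gaps ** 2)), gaps.sum()
    assert cheb <= eucl <= manh
print("Chebyshev <= Euclidean <= Manhattan held on every pair")
```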
Python Implementation
All Four Distances – From Scratch
import numpy as np
# ── Euclidean Distance ────────────────────────────────────────
def euclidean(a, b):
    """L2 norm: straight-line distance between two vectors."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

# ── Manhattan Distance ────────────────────────────────────────
def manhattan(a, b):
    """L1 norm: sum of absolute differences (taxicab distance)."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return np.sum(np.abs(a - b))

# ── Minkowski Distance ────────────────────────────────────────
def minkowski(a, b, p=2):
    """Lp norm: generalised distance. p=1 gives Manhattan, p=2 Euclidean."""
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    if p == float('inf'):
        return np.max(np.abs(a - b))
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

# ── Hamming Distance ──────────────────────────────────────────
def hamming(a, b, normalise=False):
    """Count of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Sequences must be the same length")
    count = sum(x != y for x, y in zip(a, b))
    return count / len(a) if normalise else count
A = [1, 2, 3]
B = [4, 6, 3]
print(f"Euclidean : {euclidean(A, B):.4f}")
print(f"Manhattan : {manhattan(A, B):.4f}")
print(f"Minkowski p=1: {minkowski(A, B, p=1):.4f}")
print(f"Minkowski p=2: {minkowski(A, B, p=2):.4f}")
print(f"Minkowski p=3: {minkowski(A, B, p=3):.4f}")
print(f"Chebyshev p=โ: {minkowski(A, B, p=float('inf')):.4f}")
s1, s2 = "KAROLIN", "KATHRIN"
print(f"\nHamming (raw): {hamming(s1, s2)}")
print(f"Hamming (normalised): {hamming(s1, s2, normalise=True):.4f}")
Using scipy – Production-Grade Implementations
from scipy.spatial.distance import (
    euclidean, cityblock, minkowski, hamming, chebyshev
)
import numpy as np
A = np.array([1, 2, 3], dtype=float)
B = np.array([4, 6, 3], dtype=float)
print(f"Euclidean : {euclidean(A, B):.4f}")
print(f"Manhattan : {cityblock(A, B):.4f}")
print(f"Minkowski 3: {minkowski(A, B, p=3):.4f}")
print(f"Chebyshev : {chebyshev(A, B):.4f}")
b1 = np.array([1, 0, 1, 1, 1, 0, 1])
b2 = np.array([1, 0, 0, 1, 0, 0, 1])
print(f"Hamming : {hamming(b1, b2):.4f}")
print(f"Hamming raw: {hamming(b1, b2) * len(b1):.0f} positions differ")
Distance Matrix – All Pairs at Once
import numpy as np
from scipy.spatial.distance import pdist, squareform
import pandas as pd
X = np.array([   # five samples, three numeric features on different scales
[25, 80, 22.0],
[35, 150, 28.5],
[45, 180, 33.0],
[28, 90, 23.5],
[50, 200, 35.5],
])
for metric in ['euclidean', 'cityblock', 'chebyshev']:
dist_matrix = squareform(pdist(X, metric=metric))
df = pd.DataFrame(
dist_matrix.round(1),
index=[f"P{i+1}" for i in range(5)],
columns=[f"P{i+1}" for i in range(5)]
)
print(f"\n{metric.upper()} distance matrix:")
print(df.to_string())
Comparing Metrics in KNN – Effect on Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
iris = load_iris()
X, y = iris.data, iris.target
metrics = {
'Euclidean': dict(metric='euclidean'),
'Manhattan': dict(metric='manhattan'),
    'Minkowski p3': dict(metric='minkowski', p=3),
'Chebyshev': dict(metric='chebyshev'),
}
print(f"{'Metric':15s} {'CV Accuracy':>12} {'Std Dev':>10}")
print("-" * 40)
for name, kwargs in metrics.items():
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, **kwargs))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"{name:15s} {scores.mean():.4f} ยฑ{scores.std():.4f}")
Choosing the Right Distance Metric
Complete Reference Table – All Metrics
| Metric | Formula | Norm | Sensitive to Scale | Sensitive to Outliers | Best Feature Type | scipy Function |
|---|---|---|---|---|---|---|
| Euclidean | √Σ(aᵢ−bᵢ)² | L2 | Yes | High | Continuous, normal | euclidean() |
| Manhattan | Σ\|aᵢ−bᵢ\| | L1 | Yes | Lower | Continuous, counts | cityblock() |
| Minkowski | (Σ\|aᵢ−bᵢ\|ᵖ)^(1/p) | Lp | Yes | Depends on p | Continuous (tune p) | minkowski() |
| Chebyshev | max\|aᵢ−bᵢ\| | L∞ | Yes | Extreme | Worst-case scenarios | chebyshev() |
| Hamming | #{i : aᵢ≠bᵢ} / n | – | No | No | Binary, categorical, strings | hamming() |
| Cosine | 1 − (a·b / ‖a‖‖b‖) | – | No | No | Text vectors, embeddings | cosine() |
Real-World Applications
- KNN classification (default metric)
- K-Means cluster centroid updates
- PCA and dimensionality reduction
- GPS route distance calculation
- Face recognition feature matching
- LASSO regularisation (L1 penalty)
- Robust KNN on noisy datasets
- Image reconstruction (L1 loss)
- Urban planning and logistics
- Sparse signal recovery
- DNA and protein sequence comparison
- Network error detection and correction
- Spell checking and string matching
- Cryptographic hash comparison
- Binary feature similarity (KNN)
Golden Rules
- Always scale features: apply StandardScaler or MinMaxScaler inside a Pipeline before any distance-based algorithm.
- Treat p as a hyperparameter in GridSearchCV: try p ∈ {1, 2, 3, 4} alongside K and weights (see the sketch after this list). On some datasets, p=1.5 or p=3 outperforms both fixed alternatives.
- For pairwise distances, scipy's cdist and pdist functions compute entire distance matrices efficiently.
- For million-scale similarity search, use an approximate nearest-neighbour library such as FAISS or ScaNN.
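A sketch of the GridSearchCV tuning recommended above, on the Iris data used earlier (the grid values are illustrative, not prescriptive):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),                  # scale before measuring distance
    ('knn', KNeighborsClassifier(metric='minkowski')),
])
param_grid = {
    'knn__p': [1, 1.5, 2, 3, 4],                   # Manhattan through beyond-Euclidean
    'knn__n_neighbors': [3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print(f"Best parameters : {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```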