Pooling & Spatial Hierarchy in CNNs

Section 01

The Story — Why Pooling Exists

📖 Real-World Analogy

Reading a Newspaper From Across the Room

Imagine reading a newspaper. Up close, you see every pixel of every letter. Step back five metres and you can no longer read individual letters — but you can still instantly spot the headline, the photo, and the section borders. Step back twenty metres and you perceive only the rough layout — two columns, a big image on the left.

At no distance did you lose the meaning of the page. You lost resolution, but gained the ability to see structure at multiple scales simultaneously. Your brain pooled local detail into progressively coarser summaries — exactly what pooling layers do inside a CNN.

After a convolution produces a feature map, the network needs to reduce its spatial size for three reasons: (1) cut computation, (2) limit parameters in later layers, and (3) make the network robust to small shifts in position. Pooling achieves all three with zero learnable parameters — a fixed mathematical operation that simply summarises each small neighbourhood into one number.

💡

Pooling in One Sentence

Slide a small window across the feature map; replace every window with a single summary statistic (the maximum or the average). The result is a smaller map that retains the essence of what was detected without caring exactly where inside the window it appeared.

Section 02

Max Pooling vs Average Pooling

Both operations use the same sliding-window mechanics as convolution — a window size and a stride — but replace the dot product with a simpler rule.

▲ Max Pooling — Keep the Loudest Signal

Property	Detail
Rule	Output = maximum value in each window
Effect	Asks: "Did this feature appear anywhere here?"
Keeps	The strongest activation — presence detection
Best for	Detecting sharp features: edges, textures, corners
Used in	AlexNet, VGG, ResNet — virtually every classification CNN

≈ Average Pooling — Blend the Neighbourhood

Property	Detail
Rule	Output = mean value across the window
Effect	Asks: "How active is this region overall?"
Keeps	The distributed signal — background / diffuse features
Best for	Global context, final layer summarisation
Used in	GoogLeNet GAP, MobileNet, NLP token pooling

📊 Numerical Example — 4×4 Feature Map, 2×2 Pool, Stride 2

Input

Row 0: 1 3 2 4
Row 1: 5 6 1 2
Row 2: 3 2 4 7
Row 3: 1 0 6 3

Window [0:2, 0:2]

Values: 1, 3, 5, 6 → Max = 6 | Avg = 3.75

Window [0:2, 2:4]

Values: 2, 4, 1, 2 → Max = 4 | Avg = 2.25

Window [2:4, 0:2]

Values: 3, 2, 1, 0 → Max = 3 | Avg = 1.50

Window [2:4, 2:4]

Values: 4, 7, 6, 3 → Max = 7 | Avg = 5.00

Max Output

4×4 → 2×2 : [[6, 4], [3, 7]] — halved both dims, 75% fewer values

Avg Output

4×4 → 2×2 : [[3.75, 2.25], [1.50, 5.00]] — softer, all values influence output
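
The worked example above can be reproduced in a few lines. This is a minimal sketch that assumes non-overlapping windows (stride equal to the window size) and input dimensions divisible by the window size; the helper name `pool_blocks` is made up for illustration and is not part of any library.

```python
import numpy as np

def pool_blocks(x, k=2, mode="max"):
    """Non-overlapping k x k pooling of a 2-D array via reshape.

    Assumes stride == k and that both dimensions are divisible by k.
    """
    h, w = x.shape
    if h % k or w % k:
        raise ValueError("input dimensions must be divisible by k")
    # Split each axis into (block index, position inside the block).
    blocks = x.reshape(h // k, k, w // k, k)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode!r}")

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [3, 2, 4, 7],
                 [1, 0, 6, 3]], dtype=float)

max_out = pool_blocks(fmap, mode="max")
avg_out = pool_blocks(fmap, mode="avg")
print(max_out.tolist())  # [[6.0, 4.0], [3.0, 7.0]]
print(avg_out.tolist())  # [[3.75, 2.25], [1.5, 5.0]]
```

The reshape trick only works when windows do not overlap; for overlapping windows (stride smaller than the window) an explicit loop or a framework layer is the safer choice.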

Max Pooling — 2×2 window, stride 2

The window slides across the 4×4 input in four steps. At each position it keeps only the maximum value, producing one cell of the 2×2 output.

Average Pooling — 2×2 window, stride 2

Same sliding window, but each output cell is the mean of the four values in the window. Notice how the output values are smoother — no single extreme dominates.


Global Average Pooling (GAP) — the entire map becomes one number per channel

Instead of a sliding window, GAP averages every value in a channel's feature map into one scalar. A 7×7 feature map → 1 number per channel. Used right before the classifier in modern CNNs (GoogLeNet, MobileNet) in place of large fully-connected layers.
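
To make the GAP idea concrete, the sketch below averages each channel of a stand-in feature map and compares classifier weight counts with and without GAP. The 512×7×7 shape and the 1000-class head are illustrative assumptions, the feature values are arbitrary, and biases are ignored.

```python
import numpy as np

# Stand-in feature map in (channels, height, width) layout; values are arbitrary.
channels, height, width = 512, 7, 7
fmap = np.arange(channels * height * width, dtype=float).reshape(channels, height, width)

# Global average pooling: one mean per channel.
gap_vector = fmap.mean(axis=(1, 2))
print(gap_vector.shape)  # (512,)

# Weight count of a single linear classifier (biases ignored).
num_classes = 1000  # assumed for illustration
weights_flatten = channels * height * width * num_classes  # flatten, then linear
weights_gap = channels * num_classes                       # GAP, then linear
print(weights_flatten, weights_gap)  # 25088000 512000
```

Under these assumptions the GAP head needs 49 times fewer weights, because the 7×7 spatial grid is averaged away before the linear layer.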

🔑

The Key Difference in One Line

Max pooling answers "was this feature present?" — average pooling answers "how strongly present was this feature on average?" Max pooling is preferred for detection tasks. Average pooling (especially Global Average Pooling) is preferred at the end of a network where you want a holistic summary before classification.

Section 03

Translation Invariance — Why CNNs Don't Care Where You Are

📖 Story

The Dog Detector

You train a CNN to detect dogs. In training, every dog photo has the dog roughly centred. At inference, a dog appears in the top-left corner. Should the network fail?

Without pooling, very possibly. Convolution still detects the feature wherever it appears, but the response lands at a different position in the feature map, so any later layer tied to fixed positions sees a different pattern. Shift the dog two pixels right and a different set of neurons fires. With pooling, the story changes: the detector neuron fires strongly somewhere inside the 2×2 window, and max pooling says "I don't care which pixel inside this window triggered; the maximum tells me the feature is present in this region." As long as a small shift keeps the activation inside the same window, the maximum is unchanged. Local invariance emerges.

🔍

Local Invariance

From pooling

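A 2×2 max pool with stride 2 absorbs shifts of up to 1 pixel, provided the strongest activation stays inside the same window. Small jitter, noise, or minor misalignment then leaves the output unchanged; a shift that crosses a window boundary still moves the response to the neighbouring output cell.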

✓ Robust to minor positional shifts

🌎

Global Invariance

From stacked layers

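Stack three 2×2 pooling layers and each output cell summarises an 8×8 block of input positions (2×2×2 = 8 per axis), so shifts of several pixels are absorbed as long as the feature stays inside the same block. The deeper the network, the larger the region any single output neuron "ignores" in terms of exact position.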

✓ Robust to large positional changes

⚠️

Equivariance ≠ Invariance

Common confusion

Convolution is equivariant — shift the input, the feature map shifts identically. Pooling adds invariance — small shifts stop affecting the output. You need both: conv to detect, pool to ignore where.

✗ Full invariance needs data augmentation too
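
The difference between absorbing a shift and merely following it can be checked on a toy feature map. This is a small sketch with a single made-up activation; the helper names `max_pool_2x2` and `single_activation` are hypothetical and exist only for this demonstration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (both dimensions assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_activation(row, col, size=4):
    """Feature map that is zero everywhere except one strong response."""
    fmap = np.zeros((size, size))
    fmap[row, col] = 9.0
    return fmap

base = max_pool_2x2(single_activation(0, 0))
same_window = max_pool_2x2(single_activation(0, 1))  # shifted 1 px, same window
next_window = max_pool_2x2(single_activation(0, 2))  # shifted 2 px, next window

print(np.array_equal(base, same_window))  # True: the shift is absorbed
print(np.array_equal(base, next_window))  # False: the response moved one output cell
print(np.array_equal(np.roll(base, 1, axis=1), next_window))  # True: output shifted by one cell
```

Note that moving the activation from column 1 to column 2 is also a one-pixel shift, yet it crosses a window boundary and changes the output, which is why pooling gives only local, approximate invariance.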

Section 04

Spatial Hierarchy — The Pyramid of Meaning

Stacking convolution + pooling repeatedly creates a spatial hierarchy: each layer looks at a larger portion of the original image through fewer, richer neurons. Early layers see fine texture. Deep layers see semantic objects.

🏢 Spatial Hierarchy in a Typical CNN (224×224 RGB input)

Layer 1 — Conv

224×224×64 → each neuron sees a 3×3 patch of the original image. Detects: edges, colour blobs, simple gradients.

Pool 1

224×224 → 112×112. Spatial size halved. Receptive field of each neuron now covers 4×4 of the original.

Layer 2 — Conv

112×112×128 → neurons combine edge responses from Pool 1. Detects: corners, simple curves, local textures.

Pool 2

112×112 → 56×56. Assuming Layer 2 is also a 3×3 conv, each neuron "sees" 8×8 of the original after the conv and 10×10 after this pool. Hierarchy grows.

Layer 3–5 — Conv

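56×56 → 28×28 → 14×14 → 7×7×512. In a VGG-16-style stack (two or three 3×3 convs per stage) the theoretical receptive field grows from 24×24 at the first conv of stage 3 to 196×196 at the last conv of stage 5. Detects: object parts, faces, wheels, text blocks.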

GAP

7×7 → 1×1×512. Global Average Pool collapses all spatial information. One 512-D vector summarises the entire image — fed to the classifier.

🎯

The Hierarchy Insight

No single layer "understands" an object. Layer 1 fires on edges, Layer 3 fires on curve-combinations that look like an eye, Layer 5 fires on the full face structure. The CNN doesn't see a face — it sees that this pattern of edge activations typically co-occurs with face labels. Pooling makes this hierarchy stable by suppressing the exact pixel-level positions as information propagates upward.

Section 05

Receptive Field — How Much of the Image Does a Neuron See?

The receptive field of a neuron is the region of the original input that can influence its activation. Pooling dramatically grows the receptive field without adding parameters.

After one Conv (kernel F, stride 1)

RF = F

A 3×3 conv layer gives every output neuron a 3×3 receptive field in the input — it sees exactly 9 pixels.

After stacking L conv layers (all F=3, S=1)

RF = 2L + 1

Two 3×3 layers → RF=5. Five 3×3 layers → RF=11. This is why deep networks with small kernels can rival single large-kernel layers.

After one 2×2 Pool (stride 2)

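RF grows by 1; later growth doubles

A 2×2 pool adds only (2 − 1) = 1 pixel to the receptive field itself, but its stride of 2 means every subsequent layer steps through the input twice as fast. A pool after RF=5 → RF=6, and each following 3×3 conv then adds 4 instead of 2.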

General receptive field formula

RF_l = RF_(l−1) + (F_l − 1) × ∏_{i<l} S_i,  with RF_0 = 1

Where ∏_{i<l} S_i is the product of the strides of all layers before layer l. This is why pooling (stride 2) multiplies all subsequent RF growth: it is a stride that compounds.
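
The recurrence can be applied mechanically. Below is a minimal sketch, assuming each layer is described by a (kernel size, stride) pair and that the input pixel itself has a receptive field of 1; the function name `receptive_fields` and the example stack are illustrative choices.

```python
def receptive_fields(layers):
    """Return the theoretical receptive field after each layer.

    layers: list of (kernel_size, stride) tuples, input side first.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # product of the strides of all earlier layers
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the earlier strides
        jump *= stride             # this layer's stride affects later layers only
        history.append(rf)
    return history

# Assumed VGG-style stack: conv3, conv3, pool2, conv3, pool2.
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_fields(stack))  # [3, 5, 6, 10, 12]

# Five stacked 3x3, stride-1 convs: RF = 2L + 1 = 11.
print(receptive_fields([(3, 1)] * 5)[-1])  # 11
```

The pool itself adds only one pixel; its effect shows up in the next conv, which adds 4 instead of 2 because the accumulated stride is now 2.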

📈 Worked Receptive Field — VGG-style stack

Conv 3×3, S=1

RF = 3×3 — sees 9 pixels of the original image

Conv 3×3, S=1

RF = 5×5 — each extra 3×3 conv adds 2 to each side

MaxPool 2×2, S=2

RF = 6×6: the pool adds (2−1)×1 = 1, and its stride of 2 doubles all future additions

Conv 3×3, S=1

RF = 10×10: (3−1)×2 = 4 added, scaled by the prior stride of 2

MaxPool 2×2, S=2

RF = 12×12: (2−1)×2 = 2 added. One neuron now "sees" about 1/19 of a 224-wide image, and every later 3×3 conv adds 8.

⚠️

Theoretical vs Effective Receptive Field

The formula above gives the theoretical receptive field: the maximum area that could influence a neuron. In practice, central pixels contribute far more than border pixels (many more paths connect them to the output), so the effective receptive field is roughly Gaussian-shaped and much smaller than theory suggests. As a rough rule of thumb, doubling the theoretical RF by stacking more layers increases the effective RF by only about 40%.

Section 06

Output Dimension Formula

Pooling follows the exact same output-size formula as convolution — because it is the same sliding-window operation, just with a different reduction rule.

General output size (H or W)

O = ⌊(N − F) / S⌋ + 1

N = input size, F = pool window size, S = stride. Standard pooling layers apply no padding by default; with padding P on each side the numerator becomes N + 2P − F.

Common case — 2×2 pool, stride 2

O = ⌊N / 2⌋

This is why every 2×2, stride-2 max-pool halves the spatial dimensions (exactly for even N, rounded down for odd N). 224→112→56→28→14→7 in VGG.

3×3 pool, stride 2 (overlapping)

O = ⌊(N − 3) / 2⌋ + 1

AlexNet uses this. For N=27: O = ⌊24/2⌋+1 = 13. For odd N this matches the 2×2, stride-2 result; for even N it is one smaller.

Global Average Pooling

O = 1×1

Window equals the full spatial size. Output is always 1×1×C regardless of input size — the network becomes input-size agnostic.

Input	Pool	Stride	Output	Reduction
224×224	2×2	2	112×112	75% fewer values
112×112	2×2	2	56×56	75% fewer values
56×56	2×2	2	28×28	75% fewer values
28×28	2×2	2	14×14	75% fewer values
14×14	2×2	2	7×7	75% fewer values
7×7	7×7 GAP	—	1×1	98% fewer values
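
The table rows and the AlexNet case can be checked with a one-line helper. This is a sketch of the formula above; the function name `pool_output_size` is made up, and the optional `padding` argument is an extension beyond the unpadded formula shown in this section.

```python
def pool_output_size(n, window, stride, padding=0):
    """Output length along one axis: floor((n + 2*padding - window) / stride) + 1."""
    if n + 2 * padding < window:
        raise ValueError("window is larger than the padded input")
    return (n + 2 * padding - window) // stride + 1

# Chain of 2x2, stride-2 pools starting from a 224-pixel side.
sizes = [224]
for _ in range(5):
    sizes.append(pool_output_size(sizes[-1], window=2, stride=2))
print(sizes)  # [224, 112, 56, 28, 14, 7]

# Overlapping 3x3, stride-2 pool on a 27-pixel side.
alexnet_side = pool_output_size(27, window=3, stride=2)
print(alexnet_side)  # 13

# Global pooling: the window spans the whole side.
gap_side = pool_output_size(7, window=7, stride=1)
print(gap_side)  # 1
```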

Section 07

Python Implementation

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── 1. From scratch: Max and Average Pooling (2-D) ─────────────
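def pool2d_scratch(x, pool_size=2, stride=2, mode='max'):
    """Pool a 2-D array. x: (H, W) numpy array; mode: 'max' or 'avg'."""
    if mode not in ('max', 'avg'):
        raise ValueError(f"mode must be 'max' or 'avg', got {mode!r}")
    H, W = x.shape
    # Output size follows O = floor((N - F) / S) + 1 on each axis
    OH = (H - pool_size) // stride + 1
    OW = (W - pool_size) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            # Window whose top-left corner is at (i*stride, j*stride)
            patch = x[i*stride : i*stride+pool_size,
                      j*stride : j*stride+pool_size]
            out[i, j] = patch.max() if mode == 'max' else patch.mean()
    return out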

inp = np.array([[1,3,2,4],[5,6,1,2],[3,2,4,7],[1,0,6,3]], dtype=float)

print("Max pool:", pool2d_scratch(inp, mode='max'))
print("Avg pool:", pool2d_scratch(inp, mode='avg'))

# ── 2. PyTorch layers ───────────────────────────────────────────
x_t = torch.tensor(inp).unsqueeze(0).unsqueeze(0).float()  # (1,1,4,4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
gap      = nn.AdaptiveAvgPool2d((1, 1))     # Global Average Pool

print("MaxPool2d: ", max_pool(x_t).squeeze())
print("AvgPool2d: ", avg_pool(x_t).squeeze())
print("GAP:       ", gap(x_t).squeeze())

# ── 3. Building a mini-CNN with pooling layers ──────────────────
class MiniCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28×28×32
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                             # 14×14×32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), # 14×14×64
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                             # 7×7×64
        )
        self.gap  = nn.AdaptiveAvgPool2d((1, 1))           # 1×1×64
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)  # (B, 64)
        return self.head(x)

model = MiniCNN()
dummy = torch.randn(4, 1, 28, 28)  # batch of 4 MNIST images
print("Output shape:", model(dummy).shape)  # → (4, 10)

OUTPUT

Max pool: [[6. 4.]
 [3. 7.]]
Avg pool: [[3.75 2.25]
 [1.5  5.  ]]
MaxPool2d:  tensor([[6., 4.],
        [3., 7.]])
AvgPool2d:  tensor([[3.7500, 2.2500],
        [1.5000, 5.0000]])
GAP:        tensor(3.1250)
Output shape: torch.Size([4, 10])

📌

Why AdaptiveAvgPool2d((1,1)) is the Modern Standard

AdaptiveAvgPool2d((1, 1)) accepts any input spatial size and always outputs exactly 1×1 per channel. This means your CNN can train on 224×224 images and still run on a 320×480 image at test time without any code changes (accuracy at the new resolution is a separate question). This is why most modern architectures (ResNet, EfficientNet, MobileNet) end with global average pooling rather than a fixed-size pooling window followed by a flatten.
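
The input-size independence is easy to verify. The sketch below uses a deliberately tiny stand-in backbone; the channel counts, batch size, and image sizes are arbitrary choices for the demonstration, not taken from any of the architectures named above.

```python
import torch
import torch.nn as nn

# Tiny stand-in backbone: one conv layer followed by global average pooling.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),  # (B, 16, H, W) -> (B, 16, 1, 1) for any H, W
    nn.Flatten(),                  # (B, 16, 1, 1) -> (B, 16)
)

with torch.no_grad():
    out_small = backbone(torch.randn(2, 3, 224, 224))
    out_large = backbone(torch.randn(2, 3, 320, 480))

print(out_small.shape, out_large.shape)  # both torch.Size([2, 16])
```

Because the pooled vector always has one entry per channel, a linear head attached after it never needs to know the input resolution.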

Section 08

Quick Reference — Everything in One Table

Concept	Max Pool	Average Pool	Global Avg Pool
Operation	max(window)	mean(window)	mean(entire map)
Parameters	Zero	Zero	Zero
Preserves	Strongest activation (presence)	Overall energy (distribution)	Channel-level global average
Translation invariance	Strong	Moderate	Full spatial invariance
Typical use	After conv blocks in backbone	Intermediate or final layers	Replaces flatten + FC layers
Output formula	O = ⌊(N − F) / S⌋ + 1	O = ⌊(N − F) / S⌋ + 1	Always 1×1×C

⚡ Pooling — Five Things to Never Forget

Pooling has zero learnable parameters. It is a fixed mathematical reduction — there is nothing to train, nothing to overfit. The computation cost is negligible.

Max pooling dominates inside the backbone. It preserves the presence signal of whatever was detected and provides strong local invariance. Average pooling lives mostly at the end.

Every 2×2 pool with stride 2 halves spatial dimensions and doubles how fast the receptive field grows in every subsequent layer, which makes it a very cheap way to grow context.

Global Average Pooling replaces the flatten step and the large fully-connected layers in modern CNNs, usually leaving a single linear classifier. It is input-size agnostic, can remove millions of parameters, and acts as a regulariser against overfitting.

The spatial hierarchy is the core power of CNNs: Layer 1 = pixels → edges. Layer 3 = edges → textures. Layer 5 = textures → object parts. Layer 7 = parts → objects. Pooling makes each transition stable.