
Pooling & Spatial Hierarchy in CNNs

Pooling layers slide a window across a feature map and replace each neighbourhood with a single number: the maximum (max pooling) or the mean (average pooling). This shrinks spatial dimensions, introduces a degree of translation invariance, grows the receptive field, and builds a pyramid of meaning where each successive layer detects progressively larger, more abstract patterns. The tutorial covers the mechanics, animated diagrams, worked numerical examples, the receptive field formula, and a full PyTorch implementation.

Section 01

The Story: Why Pooling Exists

Reading a Newspaper From Across the Room
Imagine reading a newspaper. Up close, you see every pixel of every letter. Step back five metres and you can no longer read individual letters, but you can still instantly spot the headline, the photo, and the section borders. Step back twenty metres and you perceive only the rough layout: two columns, a big image on the left.

At no distance did you lose the meaning of the page. You lost resolution, but gained the ability to see structure at multiple scales. Your brain pooled local detail into progressively coarser summaries, which is exactly what pooling layers do inside a CNN.

After a convolution produces a feature map, the network needs to reduce its spatial size for three reasons: (1) cut computation, (2) limit parameters in later layers, and (3) make the network robust to small shifts in position. Pooling achieves all three with zero learnable parameters: it is a fixed mathematical operation that simply summarises each small neighbourhood into one number.

💡
Pooling in One Sentence

Slide a small window across the feature map; replace every window with a single summary statistic (the maximum or the average). The result is a smaller map that retains the essence of what was detected without caring exactly where inside the window it appeared.


Section 02

Max Pooling vs Average Pooling

Both operations use the same sliding-window mechanics as convolution (a window size and a stride) but replace the dot product with a simpler rule.

▲ Max Pooling: Keep the Loudest Signal
Property | Detail
Rule | Output = maximum value in each window
Effect | Asks: "Did this feature appear anywhere here?"
Keeps | The strongest activation (presence detection)
Best for | Detecting sharp features: edges, textures, corners
Used in | AlexNet, VGG, ResNet; virtually every classification CNN
≈ Average Pooling: Blend the Neighbourhood
Property | Detail
Rule | Output = mean value across the window
Effect | Asks: "How active is this region overall?"
Keeps | The distributed signal (background / diffuse features)
Best for | Global context, final layer summarisation
Used in | GoogLeNet GAP, MobileNet, NLP token pooling
📊 Numerical Example: 4×4 Feature Map, 2×2 Pool, Stride 2
Input
Row 0:  1  3  2  4
Row 1:  5  6  1  2
Row 2:  3  2  4  7
Row 3:  1  0  6  3
Window [0:2, 0:2]
Values: 1, 3, 5, 6  →  Max = 6  |  Avg = 3.75
Window [0:2, 2:4]
Values: 2, 4, 1, 2  →  Max = 4  |  Avg = 2.25
Window [2:4, 0:2]
Values: 3, 2, 1, 0  →  Max = 3  |  Avg = 1.50
Window [2:4, 2:4]
Values: 4, 7, 6, 3  →  Max = 7  |  Avg = 5.00
Max Output
4×4 → 2×2  :  [[6, 4], [3, 7]] (both dimensions halved, 75% fewer values)
Avg Output
4×4 → 2×2  :  [[3.75, 2.25], [1.50, 5.00]] (softer; every value influences the output)
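The worked example above can be reproduced in a few lines. This is a minimal NumPy sketch, assuming non-overlapping windows (window size equal to stride, and input dimensions divisible by the window size). The reshape trick and the helper name `pool_nonoverlap` are illustrative choices, not something the tutorial prescribes.

```python
import numpy as np

def pool_nonoverlap(x, k=2, mode="max"):
    """Pool a 2-D array with a k-by-k window and stride k (hypothetical helper).

    Assumes both dimensions of x are divisible by k.
    """
    h, w = x.shape
    # Split each axis into (window index, position inside the window).
    blocks = x.reshape(h // k, k, w // k, k)
    # Reduce over the two "position inside the window" axes.
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [3, 2, 4, 7],
                        [1, 0, 6, 3]], dtype=float)

max_out = pool_nonoverlap(feature_map, mode="max")   # matches the Max Output above
avg_out = pool_nonoverlap(feature_map, mode="mean")  # matches the Avg Output above
print(max_out)
print(avg_out)
```

The reshape only works when windows do not overlap; overlapping windows (for example 3×3 with stride 2) need an explicit loop or a strided view.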
[Animation: Max Pooling, 2×2 window, stride 2. The amber window slides across the 4×4 input; at each position it keeps only the maximum value, producing one green cell in the output.]
[Animation: Average Pooling, 2×2 window, stride 2. Same sliding window, but each output cell is the mean of the four values in the window. The output values are smoother because no single extreme dominates.]
Global Average Pooling (GAP): the entire map becomes one number per channel
Instead of a sliding window, GAP takes every value in a channel's feature map and averages them into one scalar. A 7×7 feature map → 1 number per channel. Used right before the classifier in modern CNNs (GoogLeNet, MobileNet) to replace the large fully-connected layers.
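To see why GAP lets a network drop its large fully-connected layers, compare the size of the classifier with and without it. The sketch below is plain arithmetic under assumed numbers: a 7×7×512 final feature map and 1000 classes, both chosen purely for illustration.

```python
# Assumed, illustrative sizes: a 7x7 feature map with 512 channels
# feeding a 1000-class linear classifier.
height, width, channels, num_classes = 7, 7, 512, 1000

# Option 1: flatten the whole map, then one linear layer (weights + biases).
flatten_features = height * width * channels
flatten_params = flatten_features * num_classes + num_classes

# Option 2: global average pooling first (one number per channel),
# then the same kind of linear layer.
gap_features = channels
gap_params = gap_features * num_classes + num_classes

print(f"flatten + linear: {flatten_params:,} parameters")
print(f"GAP + linear:     {gap_params:,} parameters")
print(f"ratio: {flatten_params / gap_params:.1f}x")
```

The saving scales with the spatial size of the final feature map, since GAP removes the height × width factor from the classifier's input.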
🔑
The Key Difference in One Line

Max pooling answers "was this feature present?" while average pooling answers "how strongly present was this feature on average?" Max pooling is preferred when the presence of a feature matters more than its extent. Average pooling (especially Global Average Pooling) is preferred at the end of a network where you want a holistic summary before classification.
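A tiny example makes the difference concrete. The two 2×2 windows below are hypothetical values picked so that their averages match; only max pooling can tell them apart.

```python
# Two hypothetical 2x2 windows with the same average activation.
spike = [0.0, 0.0, 0.0, 8.0]    # one strong, localised response
diffuse = [2.0, 2.0, 2.0, 2.0]  # weak response everywhere

def summarise(window):
    """Return (max, mean) of a flat list of window values."""
    return max(window), sum(window) / len(window)

spike_max, spike_avg = summarise(spike)
diffuse_max, diffuse_avg = summarise(diffuse)

# Max pooling separates the two windows; average pooling does not.
print("spike:  ", spike_max, spike_avg)
print("diffuse:", diffuse_max, diffuse_avg)
```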


Section 03

Translation Invariance: Why CNNs Don't Care Where You Are

The Dog Detector
You train a CNN to detect dogs. In training, every dog photo has the dog roughly centred. At inference, a dog appears in the top-left corner. Should the network fail?

Without pooling, almost yes. Each neuron is tied to a fixed spatial position. Shift the dog two pixels right and a different set of neurons fires. With pooling, the story changes: the detector neuron fires strongly somewhere inside the 2×2 window. Max pooling says "I don't care which pixel inside this window triggered; the maximum tells me the feature is present in this region." As long as a small shift keeps the strongest activation inside the same window, the maximum is unchanged. Invariance emerges.
🔍
Local Invariance
From pooling
A 2×2 max pool with stride 2 tolerates shifts of up to 1 pixel, provided the strongest activation stays inside the same window. Small jitter, noise, or minor misalignment then leaves the output unchanged.
✓ Robust to minor positional shifts
🌎
Global Invariance
From stacked layers
Stack three 2×2 pooling layers and each output cell summarises an 8×8 region of the input (2×2×2 = 8 per side), so the network tolerates shifts of several pixels. The deeper the network, the larger the region any single output neuron "ignores" in terms of exact position.
✓ Robust to large positional changes
⚠️
Equivariance ≠ Invariance
Common confusion
Convolution is equivariant: shift the input and the feature map shifts identically. Pooling adds invariance: small shifts stop affecting the output. You need both: conv to detect, pool to ignore where.
✗ Full invariance needs data augmentation too
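The local invariance described above, and its limits, can be checked on a toy signal. This is a minimal 1-D NumPy sketch; the helpers `max_pool_1d` and `one_hot` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def max_pool_1d(signal, k=2):
    """Non-overlapping 1-D max pool (window k, stride k); illustrative helper."""
    return signal.reshape(-1, k).max(axis=1)

def one_hot(position, length=8):
    """A toy 'feature map' with a single activation at the given index."""
    signal = np.zeros(length)
    signal[position] = 1.0
    return signal

base = max_pool_1d(one_hot(2))          # activation inside window [2, 3]
same_window = max_pool_1d(one_hot(3))   # shifted by 1, still inside window [2, 3]
next_window = max_pool_1d(one_hot(4))   # shifted by 2, now inside window [4, 5]

print(base, same_window, next_window)
print("shift by 1 unchanged:", np.array_equal(base, same_window))
print("shift by 2 unchanged:", np.array_equal(base, next_window))
```

Whether a shift is absorbed depends on where the activation starts: a one-pixel shift from index 3 to index 4 crosses a window boundary and does change the output. This is one reason pooling gives only approximate invariance.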

Section 04

Spatial Hierarchy: The Pyramid of Meaning

Stacking convolution + pooling repeatedly creates a spatial hierarchy: each layer looks at a larger portion of the original image through fewer, richer neurons. Early layers see fine texture. Deep layers see semantic objects.

🏢 Spatial Hierarchy in a Typical CNN (224×224 RGB input)
Layer 1: Conv
224×224×64  →  each neuron sees a 3×3 patch of the original image. Detects: edges, colour blobs, simple gradients.
Pool 1
224×224 → 112×112. Spatial size halved. Receptive field of each neuron now covers 4×4 of the original.
Layer 2: Conv
112×112×128  →  neurons combine edge responses from Pool 1. Detects: corners, simple curves, local textures.
Pool 2
112×112 → 56×56. With one 3×3 conv per stage, each neuron now "sees" 10×10 of the original. Hierarchy grows.
Layers 3 to 5: Conv
56×56 → 28×28 → 14×14 → 7×7×512. Receptive fields now span from tens of pixels to a large fraction of the original image (the exact size depends on how many convs each stage stacks). Detects: object parts, faces, wheels, text blocks.
GAP
7×7 → 1×1×512. Global Average Pool collapses all spatial information. One 512-D vector summarises the entire image and is fed to the classifier.
🎯
The Hierarchy Insight

No single layer "understands" an object. Layer 1 fires on edges, Layer 3 fires on curve-combinations that look like an eye, Layer 5 fires on the full face structure. The CNN doesn't see a face; it sees that this pattern of edge activations typically co-occurs with face labels. Pooling makes this hierarchy stable by suppressing the exact pixel-level positions as information propagates upward.


Section 05

Receptive Field: How Much of the Image Does a Neuron See?

The receptive field of a neuron is the region of the original input that can influence its activation. Pooling dramatically grows the receptive field without adding parameters.

After one Conv (kernel F, stride 1)
RF = F
A 3×3 conv layer gives every output neuron a 3×3 receptive field in the input: it sees exactly 9 pixels.
After stacking L conv layers (all F=3, S=1)
RF = 2L + 1
Two 3×3 layers → RF=5. Five 3×3 layers → RF=11. This is why deep networks with small kernels can rival single large-kernel layers.
After one 2×2 Pool (stride 2)
RF grows by 1; later growth doubles
The pool itself widens the receptive field only slightly (a pool after RF=5 gives RF=6). Its real effect is that it halves the spatial map, so each step on the pooled map spans two input pixels and every subsequent layer adds twice as much to the RF.
General receptive field formula
RF_l = RF_(l-1) + (F_l − 1) × ∏S
Where ∏S is the product of the strides of all prior layers (layers 1 to l−1). This is why pooling (stride 2) multiplies all subsequent RF growth: it is a stride that compounds.
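The general formula is easier to apply as a two-variable recurrence. The notation below is one common way of writing it; the symbol $j_l$ (the cumulative stride, sometimes called the jump) is introduced here for illustration and does not appear in the text above.

```latex
\begin{aligned}
RF_0 &= 1, \qquad j_0 = 1 \\
RF_l &= RF_{l-1} + (F_l - 1)\, j_{l-1} \\
j_l  &= j_{l-1}\, S_l = \prod_{i=1}^{l} S_i
\end{aligned}
```

Here $j_{l-1}$ plays the role of ∏S in the formula: it is the product of the strides of layers 1 to l−1, so a stride-2 pool at layer l only affects the growth contributed by layers after it.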
📈 Worked Receptive Field: VGG-style stack
Conv 3×3, S=1
RF = 3×3: sees 9 pixels of the original image
Conv 3×3, S=1
RF = 5×5: each extra 3×3 conv adds 2 to the width (1 pixel per side)
MaxPool 2×2, S=2
RF = 6×6: (2−1)×1 = 1 added; the stride of 2 compounds all future additions
Conv 3×3, S=1
RF = 10×10: (3−1)×2 = 4 added, scaled by the prior stride of 2
MaxPool 2×2, S=2
RF = 12×12: (2−1)×2 = 2 added; the cumulative stride is now 4. One neuron "sees" about 5% of the width of a 224-wide image.
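Applying the general formula mechanically is a useful check on hand-computed values. The sketch below assumes a plain sequential stack with no padding effects or dilation; the function name `receptive_field` and the layer tuples are hypothetical, chosen for this example.

```python
def receptive_field(layers):
    """Trace the theoretical receptive field through a stack of layers.

    layers: list of (name, kernel_size, stride) tuples, input to output.
    Returns a list of (name, receptive_field, cumulative_stride).
    """
    rf, jump = 1, 1  # a single input pixel sees itself; steps are 1 pixel wide
    trace = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth is scaled by all prior strides
        jump *= stride              # this layer's stride affects later layers only
        trace.append((name, rf, jump))
    return trace

# Hypothetical VGG-style stack: conv, conv, pool, conv, pool.
stack = [
    ("conv3x3", 3, 1),
    ("conv3x3", 3, 1),
    ("maxpool2x2", 2, 2),
    ("conv3x3", 3, 1),
    ("maxpool2x2", 2, 2),
]

rf_trace = receptive_field(stack)
for name, rf, jump in rf_trace:
    print(f"{name:<11} RF = {rf:>2} x {rf:<2}  cumulative stride = {jump}")
```

Swapping the tuples lets you test other designs, for example replacing the two 3×3 convs with a single 5×5 to confirm they reach the same receptive field.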
⚠️
Theoretical vs Effective Receptive Field

The formula above gives the theoretical receptive field: the maximum area that could influence a neuron. In practice, central pixels contribute far more than border pixels (because of how gradients accumulate), so the effective receptive field is roughly Gaussian-shaped and much smaller than theory suggests. As a rule of thumb, the effective RF grows roughly with the square root of depth, so doubling the theoretical RF by stacking more layers only increases the effective RF by about 40%.
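The "about 40%" figure can be illustrated with a deliberately simplified model. The sketch below assumes a 1-D linear network in which every layer is a 3-tap convolution with equal weights and no nonlinearity; real networks differ, so treat this as intuition rather than a measurement. The standard deviation of the influence profile is used as a stand-in for the effective RF.

```python
import numpy as np

def influence_profile(num_layers):
    """Influence of each input pixel on one output of a toy 1-D linear network.

    Assumption: every layer is a 3-tap convolution with equal weights and no
    nonlinearity, so the profile is the kernel convolved with itself.
    """
    kernel = np.ones(3) / 3.0
    profile = np.array([1.0])
    for _ in range(num_layers):
        profile = np.convolve(profile, kernel)
    return profile

def spread(profile):
    """Standard deviation of the profile, used here as a proxy for effective RF."""
    positions = np.arange(len(profile)) - (len(profile) - 1) / 2
    return float(np.sqrt(np.sum(profile * positions ** 2)))

shallow, deep = influence_profile(10), influence_profile(20)
growth = spread(deep) / spread(shallow)

print("theoretical RF:", len(shallow), "->", len(deep))
print("spread:", round(spread(shallow), 3), "->", round(spread(deep), 3))
print("spread growth:", round(growth, 3))
```

In this toy model the theoretical RF roughly doubles while the spread grows by a factor of √2, which is where the roughly 40% comes from.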


Section 06

Output Dimension Formula

Pooling follows the exact same output-size formula as convolution, because it is the same sliding-window operation, just with a different reduction rule.

General output size (H or W)
O = ⌊(N − F) / S⌋ + 1
N = input size, F = pool window size, S = stride. Standard pooling is usually applied without padding.
Common case: 2×2 pool, stride 2
O = ⌊N / 2⌋
This is why every max-pool halves the spatial dimensions. 224→112→56→28→14→7 in VGG.
3×3 pool, stride 2 (overlapping)
O = ⌊(N − 3) / 2⌋ + 1
AlexNet uses this. For N=27: O = ⌊24/2⌋+1 = 13, the same as a 2×2 pool would give here; for even N the result is one smaller than the 2×2 case.
Global Average Pooling
O = 1×1
Window equals the full spatial size. Output is always 1×1×C regardless of input size, so the layers after it no longer depend on the input size.
Input | Pool | Stride | Output | Reduction
224×224 | 2×2 | 2 | 112×112 | 75% fewer values
112×112 | 2×2 | 2 | 56×56 | 75% fewer values
56×56 | 2×2 | 2 | 28×28 | 75% fewer values
28×28 | 2×2 | 2 | 14×14 | 75% fewer values
14×14 | 2×2 | 2 | 7×7 | 75% fewer values
7×7 | 7×7 GAP | n/a | 1×1 | 98% fewer values
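The formulas in this section reduce to one small function. This is a minimal sketch assuming unpadded pooling with floor rounding; `pool_output_size` is a hypothetical helper name, not a library function.

```python
def pool_output_size(n, window, stride):
    """Output length along one axis for an unpadded pooling layer."""
    return (n - window) // stride + 1

# Five 2x2, stride-2 pools applied to a 224-wide input (the VGG-style chain).
sizes = [224]
for _ in range(5):
    sizes.append(pool_output_size(sizes[-1], window=2, stride=2))
print(" -> ".join(str(s) for s in sizes))

# Overlapping 3x3, stride-2 pool on a 27-wide map, as in the AlexNet example.
alexnet_out = pool_output_size(27, window=3, stride=2)
print("3x3/2 on 27:", alexnet_out)

# Global average pooling: the window covers the whole map.
gap_out = pool_output_size(7, window=7, stride=1)
print("7x7 GAP on 7:", gap_out)
```

Frameworks expose extra options (padding, ceiling rounding) that change the result, so check the layer's documentation before relying on this formula for a specific configuration.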

Section 07

Python Implementation

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# ── 1. From scratch: Max and Average Pooling (2-D) ──
def pool2d_scratch(x, pool_size=2, stride=2, mode='max'):
    """x: (H, W) numpy array"""
    H, W = x.shape
    OH = (H - pool_size) // stride + 1
    OW = (W - pool_size) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            patch = x[i*stride : i*stride+pool_size,
                      j*stride : j*stride+pool_size]
            out[i, j] = patch.max() if mode == 'max' else patch.mean()
    return out

inp = np.array([[1,3,2,4],[5,6,1,2],[3,2,4,7],[1,0,6,3]], dtype=float)

print("Max pool:", pool2d_scratch(inp, mode='max'))
print("Avg pool:", pool2d_scratch(inp, mode='avg'))

# ── 2. PyTorch layers ──
x_t = torch.tensor(inp).unsqueeze(0).unsqueeze(0).float()  # (1,1,4,4)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
gap      = nn.AdaptiveAvgPool2d((1, 1))     # Global Average Pool

print("MaxPool2d: ", max_pool(x_t).squeeze())
print("AvgPool2d: ", avg_pool(x_t).squeeze())
print("GAP:       ", gap(x_t).squeeze())

# ── 3. Building a mini-CNN with pooling layers ──
class MiniCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 28×28×32
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # 14×14×32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 14×14×64
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # 7×7×64
        )
        self.gap  = nn.AdaptiveAvgPool2d((1, 1))          # 1×1×64
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)  # (B, 64)
        return self.head(x)

model = MiniCNN()
dummy = torch.randn(4, 1, 28, 28)  # batch of 4 MNIST images
print("Output shape:", model(dummy).shape)  # torch.Size([4, 10])
OUTPUT
Max pool: [[6. 4.]
 [3. 7.]]
Avg pool: [[3.75 2.25]
 [1.5  5.  ]]
MaxPool2d:  tensor([[6., 4.],
        [3., 7.]])
AvgPool2d:  tensor([[3.7500, 2.2500],
        [1.5000, 5.0000]])
GAP:        tensor(3.1250)
Output shape: torch.Size([4, 10])
📌
Why AdaptiveAvgPool2d((1,1)) is the Modern Standard

AdaptiveAvgPool2d((1, 1)) accepts any input spatial size and always outputs exactly 1×1 per channel. This means a CNN trained on 224×224 images can also run on a 320×480 image at test time without any code changes (how well it performs at a different resolution is a separate question). This is why widely used implementations of modern architectures (ResNet, EfficientNet, MobileNet) use it instead of a fixed-size average pool at the end.
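The size-agnostic behaviour follows directly from averaging over the spatial axes. The sketch below is a NumPy stand-in for what a global average pool followed by a linear layer does, not the PyTorch implementation itself; the random weights, the channel count of 64, and the helper `gap_head` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_channels, num_classes = 64, 10
# Hypothetical classifier weights; their shape depends only on the channel count.
weights = rng.normal(size=(num_channels, num_classes))

def gap_head(feature_maps):
    """feature_maps: (batch, channels, height, width) -> (batch, num_classes)."""
    pooled = feature_maps.mean(axis=(2, 3))  # global average pool per channel
    return pooled @ weights

small = rng.normal(size=(4, num_channels, 7, 7))     # e.g. maps from small inputs
large = rng.normal(size=(4, num_channels, 10, 15))   # e.g. maps from larger inputs

print(gap_head(small).shape)
print(gap_head(large).shape)
```

A flatten-then-linear head would need a different weight matrix for each of these two inputs, because the flattened length changes with height and width.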


Section 08

Quick Reference: Everything in One Table

Concept | Max Pool | Average Pool | Global Avg Pool
Operation | max(window) | mean(window) | mean(entire map)
Parameters | Zero | Zero | Zero
Preserves | Strongest activation (presence) | Overall energy (distribution) | Channel-level global average
Translation invariance | Strong | Moderate | Full spatial invariance
Typical use | After conv blocks in backbone | Intermediate or final layers | Replaces flatten + FC layers
Output formula | O = ⌊(N − F) / S⌋ + 1 | O = ⌊(N − F) / S⌋ + 1 | Always 1×1×C
⚡ Pooling: Five Things to Never Forget
1
Pooling has zero learnable parameters. It is a fixed mathematical reduction: there is nothing to train and nothing in the layer itself to overfit. Its computation cost is small next to the convolutions around it.
2
Max pooling dominates inside the backbone. It preserves the presence signal of whatever was detected and provides strong local invariance. Average pooling lives mostly at the end.
3
Every 2×2 pool with stride 2 halves spatial dimensions and doubles how fast the receptive field grows in every subsequent layer, which makes it a cheap way to grow context.
4
Global Average Pooling replaces large fully-connected layers in modern CNNs. It is input-size agnostic, can cut the parameter count by millions, and acts as a regulariser against overfitting.
5
The spatial hierarchy is the core power of CNNs: Layer 1 = pixels → edges. Layer 3 = edges → textures. Layer 5 = textures → object parts. Layer 7 = parts → objects. Pooling makes each transition stable.