The Story: Why Pooling Exists
At no distance did you lose the meaning of the page. You lost resolution, but gained the ability to see structure at multiple scales simultaneously. Your brain pooled local detail into progressively coarser summaries, which is exactly what pooling layers do inside a CNN.
After a convolution produces a feature map, the network needs to reduce its spatial size for three reasons: (1) cut computation, (2) limit parameters in later layers, and (3) make the network robust to small shifts in position. Pooling achieves all three with zero learnable parameters: it is a fixed mathematical operation that simply summarises each small neighbourhood into one number.
Slide a small window across the feature map; replace every window with a single summary statistic (the maximum or the average). The result is a smaller map that retains the essence of what was detected without caring exactly where inside the window it appeared.
Max Pooling vs Average Pooling
Both operations use the same sliding-window mechanics as convolution (a window size and a stride) but replace the dot product with a simpler rule.
| Property | Max pooling |
|---|---|
| Rule | Output = maximum value in each window |
| Effect | Asks: "Did this feature appear anywhere here?" |
| Keeps | The strongest activation (presence detection) |
| Best for | Detecting sharp features: edges, textures, corners |
| Used in | AlexNet, VGG, ResNet, and most other classification CNNs |

| Property | Average pooling |
|---|---|
| Rule | Output = mean value across the window |
| Effect | Asks: "How active is this region overall?" |
| Keeps | The distributed signal (background / diffuse features) |
| Best for | Global context, final layer summarisation |
| Used in | GoogLeNet GAP, MobileNet, NLP token pooling |
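Worked example: the 4×4 feature map from the Python implementation, pooled with a 2×2 window and stride 2.
Row 0: 1 3 2 4
Row 1: 5 6 1 2
Row 2: 3 2 4 7
Row 3: 1 0 6 3
Max pooling output: [[6, 4], [3, 7]]. Average pooling output: [[3.75, 2.25], [1.5, 5.0]].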
Max pooling answers "was this feature present?", while average pooling answers "how strongly present was this feature on average?" Max pooling is preferred for detection tasks. Average pooling (especially Global Average Pooling) is preferred at the end of a network where you want a holistic summary before classification.
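To make the difference concrete, here is a minimal sketch in plain Python. The two 2×2 windows (`spike` and `diffuse`) are made-up values chosen for illustration, not data from any real network: they have the same mean but very different peaks.

```python
# Two hypothetical 2x2 windows with the same mean but different peaks.
spike = [[0.0, 0.0],
         [0.0, 4.0]]    # one strong, localized activation
diffuse = [[1.0, 1.0],
           [1.0, 1.0]]  # weak activation spread over the whole window


def window_max(window):
    """Max pooling rule: keep the largest value in the window."""
    return max(max(row) for row in window)


def window_mean(window):
    """Average pooling rule: keep the mean of all values in the window."""
    values = [v for row in window for v in row]
    return sum(values) / len(values)


for name, window in (("spike", spike), ("diffuse", diffuse)):
    print(name, "max =", window_max(window), "mean =", window_mean(window))
```

Average pooling cannot tell these two windows apart, while max pooling reports the spike four times more strongly. That is the sense in which max pooling favours presence detection and average pooling favours overall activity.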
Translation Invariance: Why CNNs Don't Care Where You Are
Without pooling, a CNN is almost entirely position-dependent: each neuron is tied to a fixed spatial position, so shifting the dog two pixels to the right makes a different set of neurons fire. With pooling, the story changes. The detector neuron fires strongly somewhere inside the 2×2 window, and max pooling says, "I don't care which pixel inside this window triggered; the maximum tells me the feature is present in this region." As long as a small shift keeps the strongest activation inside the same window, the maximum is unchanged. A shift that crosses a window boundary does change the output, so the invariance is local and approximate rather than absolute.
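The effect can be checked with a small sketch. For brevity it uses a 1-D row and a window of 2 instead of a 2×2 window; `max_pool_1d` and the three example rows are hypothetical names and values introduced only for this illustration.

```python
def max_pool_1d(row, size=2, stride=2):
    """Non-overlapping 1-D max pooling over a list of numbers."""
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]


original      = [0, 0, 9, 0, 0, 0, 0, 0]  # detector fires at index 2
shift_inside  = [0, 0, 0, 9, 0, 0, 0, 0]  # index 3: still in the window covering 2-3
shift_outside = [0, 0, 0, 0, 9, 0, 0, 0]  # index 4: now in the window covering 4-5

print(max_pool_1d(original))       # [0, 9, 0, 0]
print(max_pool_1d(shift_inside))   # [0, 9, 0, 0]  (unchanged)
print(max_pool_1d(shift_outside))  # [0, 0, 9, 0]  (moved by one pooled cell)
```

A one-pixel shift inside the window leaves the pooled output identical. A shift across the window boundary moves the response by one pooled cell, which is still a smaller change than the raw feature map would show.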
Spatial Hierarchy: The Pyramid of Meaning
Stacking convolution + pooling repeatedly creates a spatial hierarchy: each layer looks at a larger portion of the original image through fewer, richer neurons. Early layers see fine texture. Deep layers see semantic objects.
No single layer "understands" an object. Layer 1 fires on edges, Layer 3 fires on curve combinations that look like an eye, Layer 5 fires on the full face structure. The CNN doesn't see a face; it sees that this pattern of edge activations typically co-occurs with face labels. Pooling makes this hierarchy stable by suppressing the exact pixel-level positions as information propagates upward.
Receptive Field: How Much of the Image Does a Neuron See?
The receptive field of a neuron is the region of the original input that can influence its activation. Pooling dramatically grows the receptive field without adding parameters.
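A common way to compute the theoretical receptive field is the layer-by-layer recurrence r_out = r_in + (k − 1) · j_in and j_out = j_in · s, where k is the kernel (or pool) size, s the stride, and j the spacing between neighbouring neurons measured in input pixels. The sketch below assumes this standard recurrence and ignores padding and dilation; the function name and the two example stacks are illustrative choices.

```python
def receptive_field(layers):
    """Theoretical receptive field after a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Uses r_out = r_in + (k - 1) * jump and jump_out = jump * stride.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Three 3x3 convolutions with no pooling.
conv_only = [(3, 1), (3, 1), (3, 1)]
# The same three convolutions with a 2x2, stride-2 pool after the first two.
conv_pool = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]

print(receptive_field(conv_only))  # 7
print(receptive_field(conv_pool))  # 18
```

The two pooling layers add no parameters, yet the last convolution now sees an 18-pixel-wide region instead of 7, because every layer after a stride-2 pool steps across the input twice as fast.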
The theoretical receptive field is the maximum area that could influence a neuron. In practice, central pixels contribute far more than border pixels (because of how gradients accumulate), so the effective receptive field is roughly Gaussian-shaped and much smaller than theory suggests. As a rule of thumb, doubling the theoretical RF increases the effective RF by only about 40%.
Output Dimension Formula
Pooling follows the same output-size formula as unpadded convolution, O = ⌊(N − F) / S⌋ + 1 for input size N, window size F, and stride S, because it is the same sliding-window operation with a different reduction rule.
| Input | Pool | Stride | Output | Reduction |
|---|---|---|---|---|
| 224×224 | 2×2 | 2 | 112×112 | 75% fewer values |
| 112×112 | 2×2 | 2 | 56×56 | 75% fewer values |
| 56×56 | 2×2 | 2 | 28×28 | 75% fewer values |
| 28×28 | 2×2 | 2 | 14×14 | 75% fewer values |
| 14×14 | 2×2 | 2 | 7×7 | 75% fewer values |
| 7×7 | 7×7 GAP | n/a | 1×1 | 98% fewer values |
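The rows of the table can be reproduced with a few lines of Python. This is a minimal sketch assuming no padding; `pool_output_size` is a helper name introduced here, and global average pooling is modelled as a single 7×7 window.

```python
def pool_output_size(n, f, s):
    """O = floor((N - F) / S) + 1 for input size N, window F, stride S (no padding)."""
    return (n - f) // s + 1


# Five successive 2x2, stride-2 pools starting from a 224-pixel side.
sizes = [224]
for _ in range(5):
    sizes.append(pool_output_size(sizes[-1], f=2, s=2))
print(sizes)  # [224, 112, 56, 28, 14, 7]

# Fraction of values kept by one 2x2, stride-2 pool (both dimensions halve).
kept = (sizes[1] ** 2) / (sizes[0] ** 2)
print(f"values kept per pool: {kept:.0%}")  # 25%

# Global average pooling over the final 7x7 map behaves like one 7x7 window.
gap_size = pool_output_size(7, f=7, s=1)
print(gap_size)  # 1
```

Halving both height and width keeps one value in four, which is where the 75% reduction in every row comes from. The final step keeps 1 value out of 49, roughly a 98% reduction.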
Python Implementation
import numpy as np
import torch
import torch.nn as nn

# --- 1. From scratch: max and average pooling (2-D) ---
def pool2d_scratch(x, pool_size=2, stride=2, mode='max'):
    """x: (H, W) numpy array; mode='max' takes the maximum, anything else the mean."""
    H, W = x.shape
    OH = (H - pool_size) // stride + 1
    OW = (W - pool_size) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            # Slice out the current window and reduce it to one number.
            patch = x[i*stride : i*stride+pool_size,
                      j*stride : j*stride+pool_size]
            out[i, j] = patch.max() if mode == 'max' else patch.mean()
    return out

inp = np.array([[1,3,2,4],[5,6,1,2],[3,2,4,7],[1,0,6,3]], dtype=float)
print("Max pool:", pool2d_scratch(inp, mode='max'))
print("Avg pool:", pool2d_scratch(inp, mode='avg'))

# --- 2. PyTorch layers ---
x_t = torch.tensor(inp).unsqueeze(0).unsqueeze(0).float()  # (1,1,4,4)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
gap = nn.AdaptiveAvgPool2d((1, 1))  # Global Average Pool
print("MaxPool2d: ", max_pool(x_t).squeeze())
print("AvgPool2d: ", avg_pool(x_t).squeeze())
print("GAP: ", gap(x_t).squeeze())

# --- 3. Building a mini-CNN with pooling layers ---
class MiniCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 28×28×32
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # 14×14×32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 14×14×64
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # 7×7×64
        )
        self.gap = nn.AdaptiveAvgPool2d((1, 1))           # 1×1×64
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)  # (B, 64)
        return self.head(x)

model = MiniCNN()
dummy = torch.randn(4, 1, 28, 28)  # batch of 4 MNIST-sized images
print("Output shape:", model(dummy).shape)  # torch.Size([4, 10])
AdaptiveAvgPool2d((1,1)) is the Modern Standard
AdaptiveAvgPool2d((1, 1)) accepts any input spatial size and always outputs exactly 1×1 per channel. A CNN trained on 224×224 images can therefore also run on a 320×480 image at test time without any code changes, although accuracy at a very different resolution is not guaranteed. This is why many modern architectures (ResNet, EfficientNet, MobileNet) use it instead of a fixed-size max pool at the end.
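The size independence is easy to see in a plain-Python stand-in for global average pooling. This is a sketch of the idea rather than PyTorch's implementation: `global_avg_pool` and `random_maps` are hypothetical helpers, and the 7×7 and 10×15 maps assume a backbone that shrinks 224×224 and 320×480 inputs by a factor of 32.

```python
import random


def global_avg_pool(feature_maps):
    """feature_maps: list of C channels, each an H x W nested list.

    Returns one mean per channel, so the output length is always C.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def random_maps(channels, height, width, seed=0):
    """Build C random H x W feature maps (illustrative data only)."""
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]


small = global_avg_pool(random_maps(4, 7, 7))    # e.g. from a 224x224 input
large = global_avg_pool(random_maps(4, 10, 15))  # e.g. from a 320x480 input
print(len(small), len(large))  # 4 4
```

Because the output length depends only on the channel count, the linear head that follows never needs to change. A flatten-based head, by contrast, would expect exactly 7 × 7 × C inputs.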
Quick Reference: Everything in One Table
| Concept | Max Pool | Average Pool | Global Avg Pool |
|---|---|---|---|
| Operation | max(window) | mean(window) | mean(entire map) |
| Parameters | Zero | Zero | Zero |
| Preserves | Strongest activation (presence) | Overall energy (distribution) | Channel-level global average |
| Translation invariance | Strong | Moderate | Full spatial invariance |
| Typical use | After conv blocks in backbone | Intermediate or final layers | Replaces flatten + FC layers |
| Output formula | O = ⌊(N − F) / S⌋ + 1 | O = ⌊(N − F) / S⌋ + 1 | Always 1×1×C |