The Story Behind Convolution
After sliding across the entire painting, your list of numbers forms a new image — a feature map. You have detected one specific pattern everywhere it appears, regardless of where in the painting it lives.
That flashlight is the kernel. The painting is your input. The sliding is convolution. This is the fundamental mechanism of every CNN ever built.
Discrete convolution takes an input signal (a 1-D array, a 2-D image, a 3-D volume) and slides a small weight matrix — the kernel or filter — over it. At every position, it computes a dot product between the kernel weights and the local patch of input. The result at each position is one value in the output, called the feature map or activation map.
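To make the sliding dot product concrete, here is a minimal 1-D sketch in plain Python. The function name `slide_1d` and the sample numbers are made up for illustration; they are not taken from any library.

```python
def slide_1d(signal, kernel):
    """Slide `kernel` over `signal` (no flip, stride 1, no padding)."""
    n, f = len(signal), len(kernel)
    out = []
    for start in range(n - f + 1):
        patch = signal[start:start + f]  # local patch of the input
        out.append(sum(p * w for p, w in zip(patch, kernel)))  # dot product
    return out


signal = [0, 0, 1, 1, 1, 0, 0]   # a step up, then a step down
kernel = [-1, 0, 1]              # responds to a rising value

feature_map = slide_1d(signal, kernel)
print(feature_map)  # [1, 1, 0, -1, -1]
```

The output is positive where the signal rises, negative where it falls, and zero on the flat stretch: one kernel, one pattern, detected at every position.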
Two properties make convolution the foundation of visual AI: local connectivity (each output depends only on a small neighborhood, not the full input) and weight sharing (the same kernel is reused everywhere, so the network learns a pattern once and detects it anywhere). Together they cut the parameter count dramatically compared with a fully connected layer, which is a large part of why CNNs learn well from limited image data.
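The parameter saving from weight sharing is easy to quantify. The sketch below compares one convolutional layer with a fully connected layer that produces the same number of output values. The layer sizes (a 32×32 RGB input, 16 kernels of 3×3, same-size output) are assumed purely for illustration.

```python
def conv_params(f, c_in, k):
    """Weights plus one bias per kernel: K * (F*F*C_in) + K."""
    return k * (f * f * c_in) + k


def dense_params(n_in, n_out):
    """Every output connects to every input, plus one bias per output."""
    return n_in * n_out + n_out


h = w = 32            # assumed input height and width
c_in, k, f = 3, 16, 3  # assumed channels, kernel count, kernel size

conv = conv_params(f, c_in, k)
dense = dense_params(h * w * c_in, h * w * k)

print(conv)   # 448
print(dense)  # 50348032
```

Under these assumptions the fully connected layer needs roughly 100,000 times more parameters to produce an output of the same size.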
Convolution vs Cross-Correlation — The Honest Truth
Mathematically, a true convolution flips the kernel both horizontally and vertically before sliding it. Cross-correlation does not flip — it just slides and dots. In signal processing, the flip matters. In deep learning, it does not: the network learns whatever weights it needs, flipped or not.
True convolution

| Step | Action |
|---|---|
| 1 | Flip kernel 180° |
| 2 | Slide over input |
| 3 | Dot product at each position |
| Note | Commutative: f∗g = g∗f |

Cross-correlation

| Step | Action |
|---|---|
| 1 | No flip: use kernel as-is |
| 2 | Slide over input |
| 3 | Dot product at each position |
| Note | Not commutative, but equivalent for learning |
PyTorch's nn.Conv2d, TensorFlow's tf.keras.layers.Conv2D,
and virtually every other deep learning library implement cross-correlation
but call it convolution. Since the kernel weights are learned, the flip is irrelevant —
the network just learns the "pre-flipped" kernel if it needs to. You will see both terms
used interchangeably in the literature. Now you know the truth.
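A small 1-D check makes the flip visible. This sketch uses NumPy's `np.convolve` (which flips the kernel) and `np.correlate` (which does not); the signal and kernel values are arbitrary examples.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])   # asymmetric, so the flip is visible

conv = np.convolve(signal, kernel, mode="valid")    # flips the kernel
corr = np.correlate(signal, kernel, mode="valid")   # uses it as-is

print(conv)  # [2. 2.]
print(corr)  # [-2. -2.]

# Correlating with the reversed kernel reproduces the true convolution.
corr_flipped = np.correlate(signal, kernel[::-1], mode="valid")
print(np.array_equal(conv, corr_flipped))  # True
```

The two operations differ only by a reversal of the kernel, which is why a network with learnable weights loses nothing by skipping the flip.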
The Kernel as a Feature Detector
A kernel is a tiny grid of numbers, typically 3×3, 5×5, or 7×7. Because each output value is a dot product, a kernel responds most strongly when the input patch it covers resembles its own pattern. Different kernels detect different features.
In classical image processing, engineers hand-crafted kernels (Sobel, Laplacian, Gabor). In CNNs, the kernels are random at initialisation and learned via backpropagation. Without being told to, trained networks tend to develop edge detectors in their earliest layers and detectors for progressively more complex patterns, such as curves and textures, in deeper ones.
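As an illustration of a hand-crafted detector, the sketch below applies the horizontal Sobel kernel to a toy image with a single vertical edge. The image is invented for this example, and the loop is a bare-bones sliding dot product (no flip, stride 1, no padding).

```python
import numpy as np

# Sobel kernel that responds to left-to-right intensity changes.
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)

# Toy 5x6 image: dark left half (0), bright right half (1).
image = np.array([[0, 0, 0, 1, 1, 1]] * 5, dtype=float)

oh, ow = image.shape[0] - 2, image.shape[1] - 2   # 3x4 output
out = np.zeros((oh, ow))
for i in range(oh):
    for j in range(ow):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * sobel_x)

print(out)
# [[0. 4. 4. 0.]
#  [0. 4. 4. 0.]
#  [0. 4. 4. 0.]]
```

The response is zero over the flat regions and peaks where the 3×3 window straddles the edge. A learned kernel works the same way; only the source of the weights differs.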
Interactive Diagrams — Kernel, Stride & Padding
[Interactive diagrams not reproduced here. The five tabs cover: kernel sliding (a 3×3 kernel sweeping across a 5×5 input, one output value per stop), stride (raising it from 1 to 3 shrinks the output grid), padding (going from P=0 to P=2 adds a border of zeros and grows the output), and Numericals 1 and 2 (each dot product computed step by step).]
Output Dimension Formula
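Before you build a CNN, you must know what size feature map will come out. For an input of size N along one axis, kernel size F, padding P, and stride S, the output size O is:
O = ⌊(N − F + 2P) / S⌋ + 1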
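As a quick check of the output-size rule, ⌊(N − F + 2P)/S⌋ + 1, the helper below evaluates it for a few settings. The function name `conv_output_size` is hypothetical; the inputs are example values.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one axis: floor((N - F + 2P) / S) + 1."""
    return (n - f + 2 * p) // s + 1


print(conv_output_size(5, 3))               # 3   (no padding, stride 1)
print(conv_output_size(5, 3, p=1))          # 5   (padding 1 preserves the size)
print(conv_output_size(5, 3, s=2))          # 2   (stride 2 shrinks the output)
print(conv_output_size(224, 3, p=1, s=2))   # 112
```

Integer division (`//`) implements the floor, which matters whenever the stride does not divide N − F + 2P evenly.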
Python Implementation
Below is a clean implementation — first from scratch with NumPy to show exactly what happens, then the one-liner you'll actually use in production.
import numpy as np
from scipy.signal import correlate2d

# ── From scratch: 2-D cross-correlation ─────────────────────
def conv2d_scratch(x, k, stride=1, padding=0):
    """x: (H,W) input, k: (Fh,Fw) kernel"""
    if padding > 0:
        x = np.pad(x, padding, mode='constant')
    H, W = x.shape
    Fh, Fw = k.shape
    Oh = (H - Fh) // stride + 1
    Ow = (W - Fw) // stride + 1
    out = np.zeros((Oh, Ow))
    for i in range(Oh):
        for j in range(Ow):
            patch = x[i*stride : i*stride+Fh,
                      j*stride : j*stride+Fw]
            out[i, j] = np.sum(patch * k)  # dot product
    return out

# ── Test with Numerical Example 2 ───────────────────────────
image = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0]
], dtype=float)
kernel = np.array([
    [ 1, 0, -1],
    [ 0, 0,  0],
    [-1, 0,  1]
], dtype=float)
result = conv2d_scratch(image, kernel)
print("From scratch:", result)

# ── Production one-liner (SciPy) ────────────────────────────
result_scipy = correlate2d(image, kernel, mode='valid')
print("SciPy valid: ", result_scipy)

# ── PyTorch (GPU-accelerated, used in real CNNs) ─────────────
import torch
import torch.nn.functional as F

x_t = torch.tensor(image).unsqueeze(0).unsqueeze(0).float()   # (1,1,4,4)
k_t = torch.tensor(kernel).unsqueeze(0).unsqueeze(0).float()  # (1,1,3,3)
result_torch = F.conv2d(x_t, k_t, padding=0, stride=1)
print("PyTorch: ", result_torch.squeeze())
convolve2d vs correlate2d
SciPy's convolve2d flips the kernel (true mathematical convolution).
correlate2d does not flip (cross-correlation) — which is what PyTorch/TF use.
For kernels that are unchanged by a 180° rotation (a Gaussian blur, for example), both give identical results.
For asymmetric kernels, use correlate2d to match deep learning frameworks.
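The difference only shows up with a kernel that changes under a 180° rotation. The sketch below uses an arbitrary 2×2 kernel and a 4×4 ramp image, both invented for illustration, to compare SciPy's two functions.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.arange(16, dtype=float).reshape(4, 4)

# Asymmetric kernel: it changes under a 180-degree rotation.
kernel = np.array([
    [1, 2],
    [3, 4],
], dtype=float)

conv = convolve2d(image, kernel, mode="valid")    # flips the kernel
corr = correlate2d(image, kernel, mode="valid")   # uses it as-is

print(np.array_equal(conv, corr))  # False

# Rotating the kernel by 180 degrees turns one operation into the other.
flipped = kernel[::-1, ::-1]
print(np.array_equal(conv, correlate2d(image, flipped, mode="valid")))  # True
```

If you port weights between a signal-processing routine and a deep learning framework, this rotation is the one thing to check.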
Putting It All Together — The CNN Layer Picture
A single convolutional layer applies not one kernel but K kernels simultaneously, each detecting a different feature. The output is a 3-D volume with K channels — one feature map per kernel.
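A minimal NumPy sketch of one such layer follows. The sizes (3 input channels, an 8×8 input, 4 kernels of 3×3) and the random weights are assumptions for illustration, and the arrays use a channels-first layout (K, H, W) rather than the H×W×C order used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

c_in, h, w = 3, 8, 8   # assumed input: 3 channels, 8x8 pixels
k, f = 4, 3            # assumed layer: 4 kernels of size 3x3

x = rng.standard_normal((c_in, h, w))
kernels = rng.standard_normal((k, c_in, f, f))  # each kernel spans all input channels
bias = np.zeros(k)

oh, ow = h - f + 1, w - f + 1   # stride 1, no padding
out = np.zeros((k, oh, ow))
for n in range(k):               # one feature map per kernel
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i:i + f, j:j + f]   # shape (c_in, f, f)
            out[n, i, j] = np.sum(patch * kernels[n]) + bias[n]

print(out.shape)                  # (4, 6, 6)
print(kernels.size + bias.size)   # 112
```

Each kernel covers every input channel at once, so the layer holds K × (F×F×C_in) + K = 112 parameters here and emits K = 4 feature maps.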
As spatial size shrinks (via stride or pooling), the number of channels increases. A typical CNN: 224×224×3 → 112×112×64 → 56×56×128 → 28×28×256. You trade spatial resolution for representational depth — fewer positions, but richer descriptions of what lives at each position. This is the compression pipeline that makes CNNs powerful.
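The shape sequence above can be reproduced with the output-size formula. The text does not say how each stage downsamples, so this sketch assumes 3×3 kernels with padding 1 and stride 2 at every stage; pooling would be an equally valid way to halve the resolution.

```python
def out_size(n, f, p, s):
    """floor((N - F + 2P) / S) + 1 along one spatial axis."""
    return (n - f + 2 * p) // s + 1


size, channels = 224, 3
stages = [64, 128, 256]   # output channels per stage, as in the example above
f, p, s = 3, 1, 2         # assumed: 3x3 kernels, padding 1, stride 2

shapes = [(size, size, channels)]
for k in stages:
    size = out_size(size, f, p, s)
    channels = k
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)
# (224, 224, 3)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
```

Each stage halves the height and width while the channel count grows, which is the trade of spatial resolution for representational depth described above.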
Quick Reference — Everything in One Table
| Concept | Symbol / Formula | Typical Value | Effect When Increased |
|---|---|---|---|
| Kernel size | F × F | 3×3 | Larger receptive field, more parameters, slower |
| Stride | S | 1 | Output shrinks, faster computation, info lost |
| Padding | P | 0 or (F−1)/2 | Preserves spatial size, edges get attention |
| Output size | ⌊(N−F+2P)/S⌋+1 | Depends | — |
| Num. kernels | K | 32–512 | More features detected, more parameters |
| Parameters per layer | K × (F×F×C_in) + K | Varies | Weight sharing keeps this much smaller than FC |
| Convolution vs cross-correlation | Frameworks implement cross-correlation | Cross-correlation | The flip matters in signal-processing theory, not for learned kernels |