Discrete Convolution in CNNs

Section 01

The Story Behind Convolution

📖 Real World Analogy

The Flashlight Sliding Over a Painting

Imagine a dark room with a large painting on the wall. You have a small square flashlight — it only illuminates a tiny patch at a time. You slide it slowly across the painting, left to right, top to bottom. At every position, you take a photograph of what you see and write down a single number: how much of the painting looks like, say, a horizontal edge.

After sliding across the entire painting, your list of numbers forms a new image — a feature map. You have detected one specific pattern everywhere it appears, regardless of where in the painting it lives.

That flashlight is the kernel. The painting is your input. The sliding is convolution. This is the fundamental mechanism of every CNN ever built.

Discrete convolution takes an input signal (a 1-D array, a 2-D image, a 3-D volume) and slides a small weight matrix — the kernel or filter — over it. At every position, it computes a dot product between the kernel weights and the local patch of input. The result at each position is one value in the output, called the feature map or activation map.

💡

Why This Matters in Deep Learning

Two properties make convolution the foundation of visual AI: local connectivity (each output only depends on a small neighborhood, not the full input) and weight sharing (the same kernel is reused everywhere, so the network learns a pattern once and detects it anywhere). This slashes parameters and gives CNNs their legendary sample efficiency.
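To make the parameter saving concrete, the sketch below compares one 3×3 convolutional layer with a fully connected layer that maps the same input to an output of the same size. The 32×32×3 input and the 64 output channels are illustrative assumptions, not figures from a specific network.

```python
# Illustrative sizes (assumed for this sketch): 32x32 RGB input, 64 output channels.
H, W, C_in = 32, 32, 3
K, F = 64, 3

# Convolution: each of the K kernels has F*F*C_in weights plus one bias,
# and those same weights are reused at every spatial position.
conv_params = K * (F * F * C_in) + K

# Fully connected layer with the same input and output sizes:
# every output unit gets its own weight for every input value.
n_in = H * W * C_in
n_out = H * W * K
fc_params = n_in * n_out + n_out

print(f"conv layer:  {conv_params:,} parameters")
print(f"dense layer: {fc_params:,} parameters")
print(f"ratio: about {fc_params // conv_params:,}x")
```

The gap comes entirely from weight sharing and local connectivity: the convolutional layer's parameter count does not depend on H or W at all.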

Section 02

Convolution vs Cross-Correlation — The Honest Truth

Mathematically, a true convolution flips the kernel both horizontally and vertically before sliding it. Cross-correlation does not flip — it just slides and dots. In signal processing, the flip matters. In deep learning, it does not: the network learns whatever weights it needs, flipped or not.

✖ True Convolution (Signal Processing)

Step	Action
1	Flip kernel 180°
2	Slide over input
3	Dot product at each position
Note	Commutative: f∗g = g∗f

✓ Cross-Correlation (What CNNs Use)

Step	Action
1	No flip — use kernel as-is
2	Slide over input
3	Dot product at each position
Note	Not commutative, but equivalent for learning

🔑

The Convention Every Framework Uses

PyTorch's nn.Conv2d, TensorFlow's tf.keras.layers.Conv2D, and virtually every other deep learning library implement cross-correlation but call it convolution. Since the kernel weights are learned, the flip is irrelevant — the network just learns the "pre-flipped" kernel if it needs to. You will see both terms used interchangeably in the literature. Now you know the truth.
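The flip is easy to see in one dimension. This is a minimal sketch using NumPy's np.convolve (which flips the kernel) and np.correlate (which does not); the input and kernel values are chosen only for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])  # asymmetric kernel, so the flip is visible

# Cross-correlation: slide the kernel as-is (what CNN layers compute).
xcorr = np.correlate(x, k, mode="valid")

# True convolution: the kernel is flipped before sliding.
conv = np.convolve(x, k, mode="valid")

# Convolution equals cross-correlation with a pre-flipped kernel.
xcorr_flipped = np.correlate(x, k[::-1], mode="valid")

print("cross-correlation:", xcorr)
print("convolution:      ", conv)
print("xcorr, flipped k: ", xcorr_flipped)

# Convolution commutes; cross-correlation in general does not.
conv_commutes = np.array_equal(np.convolve(x, k), np.convolve(k, x))
xcorr_commutes = np.array_equal(np.correlate(x, k, "full"),
                                np.correlate(k, x, "full"))
print("convolution commutes:      ", conv_commutes)
print("cross-correlation commutes:", xcorr_commutes)
```

Because the two operations differ only by a fixed re-indexing of the kernel, a network trained with either one can represent exactly the same functions.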

Section 03

The Kernel as a Feature Detector

A kernel is a tiny grid of numbers, typically 3×3, 5×5, or 7×7. Its response (the dot product) is largest when the input patch it covers matches the kernel's own pattern, so different kernels detect different features.

↔

Horizontal Edge Detector

Prewitt-style (Sobel-Y variant)

A kernel with +1 +1 +1 on top, 0 0 0 in the middle, and −1 −1 −1 on the bottom fires strongly wherever pixel intensity changes from bright to dark going downward, which is a horizontal boundary.

↕

Vertical Edge Detector

Prewitt-style (Sobel-X variant)

A kernel with +1 0 −1 repeated across three rows detects left-to-right intensity changes. The dot product is large in magnitude when one side of the patch is bright and the other dark (positive when the bright side is on the left), which is a vertical boundary.

⸻

Blur / Averaging Kernel

Uniform weights

A 3×3 kernel where all 9 values equal 1/9 averages the neighborhood, smoothing noise. In CNNs, early layers often learn similar averaging patterns to suppress high-frequency noise before feature detection.
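The three detectors above can be tried on a toy image. This is a sketch, not a reference implementation: the 6×6 half-bright image and the helper name xcorr2d are assumptions made for the example, and sliding_window_view requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def xcorr2d(x, k):
    """Valid 2-D cross-correlation: one dot product per kernel position."""
    windows = sliding_window_view(x, k.shape)      # shape (Oh, Ow, Fh, Fw)
    return (windows * k).sum(axis=(-2, -1))

# Toy 6x6 image (assumed for illustration): bright top half, dark bottom half.
image = np.zeros((6, 6))
image[:3, :] = 1.0

horizontal = np.array([[ 1,  1,  1],
                       [ 0,  0,  0],
                       [-1, -1, -1]], dtype=float)
vertical = horizontal.T                  # +1 0 -1 on every row
blur = np.full((3, 3), 1 / 9)

print("horizontal edge response:\n", xcorr2d(image, horizontal))
print("vertical edge response:\n", xcorr2d(image, vertical))
print("blurred:\n", xcorr2d(image, blur).round(2))
```

The horizontal detector responds only in the rows where its window straddles the bright-to-dark boundary, the vertical detector stays at zero because nothing changes from left to right, and the blur kernel turns the hard boundary into a gradual ramp.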

💡

The CNN Insight — Kernels Are Learned, Not Designed

In classical image processing, engineers hand-crafted kernels (Sobel, Laplacian, Gabor). In CNNs, the kernels are random at initialisation and learned via backpropagation. Trained networks typically end up with edge-like detectors in the first layers and progressively more complex patterns (curves, textures, object parts) in deeper layers, because that hierarchy is what reduces the loss on the task you gave it.
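A toy sketch of that learning process, under heavy simplifying assumptions: a single 1-D kernel, a mean-squared-error loss, and a hand-derived gradient in place of a framework's autograd. The hidden target kernel [1, 0, −1] and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task: reproduce the output of a hidden edge-like kernel.
k_true = np.array([1.0, 0.0, -1.0])
x = rng.normal(size=64)
y = np.correlate(x, k_true, mode="valid")

k = rng.normal(scale=0.1, size=3)   # random initialisation
lr = 0.1

for step in range(300):
    pred = np.correlate(x, k, mode="valid")
    err = pred - y
    # d(mean squared error)/dk[j] = 2 * mean_i(err[i] * x[i + j]),
    # which is itself a cross-correlation of the input with the error.
    grad = 2 * np.correlate(x, err, mode="valid") / err.size
    k -= lr * grad

print("learned kernel:", k.round(3))
```

Nothing in the loop mentions edges; the kernel drifts toward the edge-like pattern only because that is what minimises the loss on this data.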

Section 04

Interactive Diagrams — Kernel, Stride & Padding

The five diagrams in this section show every concept in action: a 3×3 kernel sweeping across a 5×5 input, how stride and padding change the output dimensions, and two numericals worked through dot product by dot product.

Kernel sliding over a 5×5 input

A 3×3 kernel visits every valid position of the 5×5 input (no padding, stride 1). Each position computes one dot product → one output value, so the 9 positions fill a 3×3 output.

Stride — how far the kernel jumps each step

Input 6×6, Kernel 3×3, Padding 0. Increasing the stride shrinks the output: S=1 gives 4×4 (16 positions), while S=2 and S=3 both give 2×2 (4 positions).
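The stride diagram's settings (6×6 input, 3×3 kernel, no padding) can be checked with a few lines. This is a small sketch; the variable names are illustrative.

```python
N, F, P = 6, 3, 0   # 6x6 input, 3x3 kernel, no padding

outputs = {}
for S in (1, 2, 3):
    O = (N - F + 2 * P) // S + 1
    # Top-left corner of the kernel at each output position along one axis.
    starts = [i * S for i in range(O)]
    outputs[S] = O
    print(f"S={S}: output {O}x{O} ({O * O} positions), kernel starts at {starts}")
```

With S=2 the kernel starts at indices 0 and 2, so the last row and column of the input are never covered. That leftover is exactly what the floor in the output-size formula accounts for.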

Padding — adding zeros around the border

Input 5×5, Kernel 3×3, stride 1. Padding P adds P rows and columns of zeros on every side, which grows the output: P=0 gives 3×3, P=1 gives 5×5, P=2 gives 7×7.
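Zero padding is just np.pad with a constant border. The sketch below pads a 5×5 input and reports the resulting output size; it assumes stride 1, and the input values are arbitrary.

```python
import numpy as np

x = np.arange(1, 26, dtype=float).reshape(5, 5)   # any 5x5 input works
F, S = 3, 1                                       # 3x3 kernel, stride 1 assumed

sizes = {}
for P in (0, 1, 2):
    xp = np.pad(x, P, mode="constant")            # P rows/cols of zeros on every side
    O = (xp.shape[0] - F) // S + 1
    sizes[P] = (xp.shape[0], O)
    print(f"P={P}: padded input {xp.shape[0]}x{xp.shape[1]}, output {O}x{O}")
```

P=1 is the "same" setting for a 3×3 kernel: the output matches the 5×5 input.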

Numerical 1 — 1-D convolution, step by step

Input: [1, 2, 3, 4, 5] · Kernel: [1, 0, −1] · No padding, Stride 1 → Output size: 5−3+1 = 3
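Numerical 1 written out position by position, as a plain-Python sketch with no libraries (variable names are illustrative):

```python
x = [1, 2, 3, 4, 5]
k = [1, 0, -1]

out = []
for pos in range(len(x) - len(k) + 1):          # 5 - 3 + 1 = 3 positions
    patch = x[pos:pos + len(k)]
    terms = [w * v for w, v in zip(k, patch)]   # element-wise products
    out.append(sum(terms))
    print(f"position {pos}: patch {patch} -> products {terms} -> sum {sum(terms)}")

print("output:", out)
```

Every position gives −2, because the kernel subtracts the value two steps ahead from the current one and the input rises by 1 per step.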

Numerical 2 — 2-D convolution on a 4×4 image

Input: 4×4 binary image · Kernel: 3×3 diagonal detector · No padding → Output: 2×2 feature map.

Section 05

Output Dimension Formula

Before you build a CNN, you must know what size feature map will come out. The formula is:

1-D Output Size

O = ⌊(N − F + 2P) / S⌋ + 1

N = input length, F = kernel size, P = padding on each side, S = stride. The floor handles cases where the kernel doesn't divide evenly.

2-D Output Size (H and W independently)

O_H = ⌊(H − F_H + 2P_H) / S_H⌋ + 1

O_W = ⌊(W − F_W + 2P_W) / S_W⌋ + 1

Apply the same formula to height and width separately. Most layers use square kernels (F_H = F_W) and equal padding, so one formula covers both.

Common Case — "Same" Padding, Stride 1

O = N (output = input size)

Set P = (F − 1) / 2 for odd kernels (3×3 → P=1, 5×5 → P=2). Output spatial size exactly equals input spatial size.

Common Case — No Padding, Stride 1

O = N − F + 1

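With no padding, every convolution shrinks the map by F−1 pixels in total, i.e. (F−1)/2 on each side for an odd kernel. A 32×32 input with a 3×3 kernel gives a 30×30 output: the border pixels never sit at the centre of a kernel, so they are lost.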
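The formula and its two common cases fit in a small helper. This is a sketch; conv_output_size is a name chosen for this example, not a library function.

```python
def conv_output_size(n, f, p=0, s=1):
    """O = floor((N - F + 2P) / S) + 1, for one spatial axis."""
    if n + 2 * p < f:
        raise ValueError("kernel is larger than the padded input")
    return (n - f + 2 * p) // s + 1

# "Same" padding with stride 1 and an odd kernel: P = (F - 1) / 2 keeps the size.
for f in (3, 5, 7):
    assert conv_output_size(32, f, p=(f - 1) // 2) == 32

# No padding, stride 1: the map loses F - 1 pixels in total.
print(conv_output_size(32, 3))

# Height and width are handled independently, e.g. a 3x5 kernel on a 32x48 input.
print(conv_output_size(32, 3), conv_output_size(48, 5))
```

Integer floor division (//) plays the role of the ⌊ ⌋ brackets, so strides that do not divide evenly are handled without any special case.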

🔧 Quick Reference — Output Sizes for a 28×28 Input

3×3, P=0, S=1

O = (28 − 3 + 0) / 1 + 1 = 26×26 — shrinks by 2

3×3, P=1, S=1

O = (28 − 3 + 2) / 1 + 1 = 28×28 — preserved

3×3, P=0, S=2

O = ⌊(28 − 3 + 0) / 2⌋ + 1 = ⌊12.5⌋ + 1 = 13×13 (roughly halved)

5×5, P=2, S=1

O = (28 − 5 + 4) / 1 + 1 = 28×28 — preserved
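One way to double-check the quick-reference rows is to count kernel positions directly and compare with the formula. The sketch below does that with NumPy's sliding_window_view (NumPy 1.20 or newer); the helper names formula and measured are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def formula(n, f, p, s):
    return (n - f + 2 * p) // s + 1

def measured(n, f, p, s):
    """Count kernel positions by actually windowing a padded n x n array."""
    x = np.pad(np.zeros((n, n)), p)
    windows = sliding_window_view(x, (f, f))[::s, ::s]
    return windows.shape[0]

results = []
# (F, P, S) for each row of the 28x28 quick reference
for f, p, s in [(3, 0, 1), (3, 1, 1), (3, 0, 2), (5, 2, 1)]:
    o = formula(28, f, p, s)
    assert o == measured(28, f, p, s)
    results.append(o)
    print(f"{f}x{f}, P={p}, S={s}: {o}x{o}")
```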

Section 06

Python Implementation

Below is a clean implementation — first from scratch with NumPy to show exactly what happens, then the one-liner you'll actually use in production.

import numpy as np
from scipy.signal import correlate2d

# ── From scratch: 2-D cross-correlation ───────────────────── 
def conv2d_scratch(x, k, stride=1, padding=0):
    """x: (H,W) input, k: (Fh,Fw) kernel"""
    if padding > 0:
        x = np.pad(x, padding, mode='constant')
    H, W   = x.shape
    Fh, Fw = k.shape
    Oh = (H - Fh) // stride + 1
    Ow = (W - Fw) // stride + 1
    out = np.zeros((Oh, Ow))
    for i in range(Oh):
        for j in range(Ow):
            patch = x[i*stride : i*stride+Fh,
                      j*stride : j*stride+Fw]
            out[i, j] = np.sum(patch * k)  # dot product
    return out

# ── Test with Numerical Example 2 ─────────────────────────── 
image = np.array([
    [1,1,1,0],
    [1,1,0,0],
    [1,0,0,0],
    [0,0,0,0]
], dtype=float)

kernel = np.array([
    [1, 0,-1],
    [0, 0, 0],
    [-1,0, 1]
], dtype=float)

result = conv2d_scratch(image, kernel)
print("From scratch:", result)

# ── Production one-liner (SciPy) ──────────────────────────── 
result_scipy = correlate2d(image, kernel, mode='valid')
print("SciPy valid:  ", result_scipy)

# ── PyTorch (GPU-accelerated, used in real CNNs) ───────────── 
import torch
import torch.nn.functional as F

x_t = torch.tensor(image).unsqueeze(0).unsqueeze(0).float()  # (1,1,4,4)
k_t = torch.tensor(kernel).unsqueeze(0).unsqueeze(0).float() # (1,1,3,3)
result_torch = F.conv2d(x_t, k_t, padding=0, stride=1)
print("PyTorch:      ", result_torch.squeeze())

OUTPUT

From scratch: [[-1.  1.]
 [ 1.  1.]]
SciPy valid:   [[-1.  1.]
 [ 1.  1.]]
PyTorch:       tensor([[-1.,  1.],
        [ 1.,  1.]])

⚠️

SciPy's convolve2d vs correlate2d

SciPy's convolve2d flips the kernel (true mathematical convolution). correlate2d does not flip (cross-correlation), which is what PyTorch/TF use. For kernels that are unchanged by a 180° rotation (a Gaussian blur, or the diagonal kernel in the code above), both give identical results. For any other kernel, use correlate2d to match deep learning frameworks.

Section 07

Putting It All Together — The CNN Layer Picture

A single convolutional layer applies not one kernel but K kernels simultaneously, each detecting a different feature. The output is a 3-D volume with K channels — one feature map per kernel.

Input Volume

Shape: (H, W, C) — e.g. 32×32×3 for an RGB image. C = 3 input channels.

K Kernels

Each kernel shape: (F, F, C) — it spans all input channels. With K=64 kernels of size 3×3×3, that's 64×3×3×3 = 1,728 weights + 64 biases.

K Convolutions in Parallel

Each kernel slides over the input and produces one 2-D feature map. All K maps are stacked into a 3-D output volume.

Output Volume

Shape: (H_out, W_out, K). With same-padding: 32×32×64. Each "slice" along the depth axis is one feature map — one detector firing across the image.
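The whole layer picture can be sketched in NumPy. Assumptions in this example: channels-last layout (H, W, C) as in the text above (PyTorch itself uses channels-first), random numbers standing in for learned weights, stride 1, and same padding. It needs NumPy 1.20 or newer for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

rng = np.random.default_rng(0)
H, W, C, K, F = 32, 32, 3, 64, 3

x = rng.normal(size=(H, W, C))            # input volume
kernels = rng.normal(size=(K, F, F, C))   # K kernels, each spanning all C channels
bias = np.zeros(K)

# Same padding on the two spatial axes only; the channel axis is not padded.
p = (F - 1) // 2
xp = np.pad(x, ((p, p), (p, p), (0, 0)))

# Every F x F x C patch: shape (H, W, 1, F, F, C), then drop the singleton axis.
patches = sliding_window_view(xp, (F, F, C))[:, :, 0]

# One dot product per (position, kernel) pair; the K maps stack along depth.
out = np.einsum("hwijc,kijc->hwk", patches, kernels) + bias

print("output volume:", out.shape)
print("weights:", kernels.size, "biases:", bias.size)
```

Each out[:, :, k] is the feature map of kernel k, and a single output value is still one dot product, now taken over an F×F×C block instead of an F×F patch.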

🎯

Golden Rule — Every CNN Designer Uses This

As spatial size shrinks (via stride or pooling), the number of channels usually increases. A typical CNN: 224×224×3 → 112×112×64 → 56×56×128 → 28×28×256. You trade spatial resolution for representational depth: fewer positions, but richer descriptions of what lives at each position. This is the compression pipeline that makes CNNs powerful.
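The shape progression above can be reproduced with the output-size formula. The layer settings in this sketch (3×3 kernels, padding 1, stride 2 at every stage) are an assumption chosen to match the quoted sizes; real architectures often mix strided convolutions with pooling.

```python
def conv_output_size(n, f, p, s):
    return (n - f + 2 * p) // s + 1

# Assumed layer settings for this sketch: 3x3 kernels, padding 1, stride 2.
shape = (224, 224, 3)
shapes = [shape]
for channels in (64, 128, 256):
    h = conv_output_size(shape[0], 3, 1, 2)
    w = conv_output_size(shape[1], 3, 1, 2)
    shape = (h, w, channels)
    shapes.append(shape)

print(" -> ".join("x".join(map(str, dims)) for dims in shapes))
```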

Section 08

Quick Reference — Everything in One Table

Concept	Symbol / Formula	Typical Value	Effect When Increased
Kernel size	F × F	3×3	Larger receptive field, more parameters, slower
Stride	S	1	Output shrinks, faster computation, info lost
Padding	P	0 or (F−1)/2	Preserves spatial size, edges get attention
Output size	⌊(N−F+2P)/S⌋+1	Depends	—
Num. kernels	K	32–512	More features detected, more parameters
Parameters per layer	K × (F×F×C_in) + K	Varies	Weight sharing keeps this much smaller than FC
Convolution vs Corr.	Framework uses corr.	Cross-corr.	Flip matters in signal-processing theory (asymmetric kernels), not for learned kernels

⚡ Discrete Convolution — Five Things to Never Forget

The output at each position is a single scalar dot product between the kernel and a local patch — nothing more, nothing less.

Weight sharing is what makes CNNs so parameter-efficient. The same 9 weights (3×3 kernel) are reused at every position in a 1000×1000 image.

Deep learning frameworks implement cross-correlation and call it convolution. For learning, this distinction is irrelevant.

Always verify your output shape before building: O = ⌊(N−F+2P)/S⌋+1. A mismatch here tends to surface much later as a shape error, typically where the feature map is flattened into a fully connected layer.

Kernels are learned, not hand-crafted in CNNs. What looks like an edge detector after training emerged purely from gradient descent minimising your loss.