
CNN Fully Solved Numericals

This article solves the complete CNN building block (Convolution, ReLU, Max Pooling) twice in full. Numerical 1 uses a 5×5 input with a vertical edge detector kernel, computing all 9 dot products by hand, zeroing negatives with ReLU, then applying a 2×2 stride-1 pool. Numerical 2 uses a 4×4 input with a sharpening kernel, computing all 4 dot products, applying ReLU, then a 2×2 stride-2 (non-overlapping) pool that collapses the result to a single scalar. Every number is shown, every step is wor

Section 01

The Full Pipeline: Every CNN Block in Order

Raw Material → Inspection → Rejection → Packaging
Imagine a factory assembly line. Raw steel sheets come in (the input image). A stamping press shapes them with a mould (the convolution kernel): every region gets pressed, producing a new shaped sheet (the feature map). A quality inspector then rejects every warped or bent piece, replacing any value below zero with 0 (the ReLU). Finally, a packing machine groups every 2×2 batch of pieces and keeps only the best one (the max pool). What arrives at the warehouse is a compact, high-quality summary of the original sheet.

That is exactly what happens in one CNN block: in that order, every time.

Input (5×5 image) → Conv 2D (kernel 3×3) → Feature Map (3×3) → ReLU max(0, x) → Max Pool (2×2, S=1) → Output (2×2)
📋 What You Will Compute By Hand

Two complete numericals. Each one goes through Step 1, Convolution (every dot product, position by position); Step 2, ReLU (zero out every negative); and Step 3, Max Pooling (slide a 2×2 window and keep the maximum). Nothing skipped, nothing assumed.


Section 02

Numerical 1: Full Pipeline, Conv → ReLU → Max Pool

Given: A 5×5 input image and a 3×3 kernel. No padding. Stride 1 for both conv and pool (2×2 pool, stride 1).

🖼 Input Image (5×5)

1 2 3 0 1
4 5 6 1 2
7 8 9 0 3
2 1 0 4 5
6 3 2 1 0

⚙ Kernel (3×3)

1 0 -1
1 0 -1
1 0 -1
🔎 What This Kernel Does

This kernel has +1 in the left column, 0 in the middle, −1 in the right column. It subtracts the right side from the left side of every 3×3 patch, which makes it a classic vertical edge detector. Bright on the left, dark on the right → large positive output.
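
To make the left-minus-right behaviour concrete, here is a minimal sketch that applies the same kernel to three hand-made 3×3 patches. The patch values and the helper name `response` are illustrative assumptions, not part of the numerical.

```python
import numpy as np

# Vertical edge kernel from the text: +1 left column, 0 middle, -1 right column.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

def response(patch):
    """Element-wise multiply the patch with the kernel, then sum (one dot product)."""
    return int(np.sum(patch * kernel))

# Hypothetical patches, chosen only to show the sign of the response.
bright_left = np.array([[9, 5, 0]] * 3)    # bright on the left, dark on the right
bright_right = np.array([[0, 5, 9]] * 3)   # dark on the left, bright on the right
flat = np.array([[4, 4, 4]] * 3)           # no edge at all

print(response(bright_left))    # 27
print(response(bright_right))   # -27
print(response(flat))           # 0
```

A mirrored edge produces the same magnitude with the opposite sign, and a flat region produces 0, because the left and right columns cancel.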

Output Size after Convolution
Formula: O = ⌊(N − F + 2P) / S⌋ + 1 = ⌊(5 − 3 + 0) / 1⌋ + 1 = 3 → Feature map is 3×3
(N = input size, F = kernel or window size, P = padding, S = stride)
After Pool: O = ⌊(3 − 2) / 1⌋ + 1 = 2 → Final output is 2×2
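
The same formula can be wrapped in a small helper and chained through the block. This is a sketch; the function name `output_size` is an assumption made for illustration.

```python
def output_size(n, f, p=0, s=1):
    """Spatial size after a conv or pool layer: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

conv_out = output_size(5, 3, p=0, s=1)    # 5x5 input, 3x3 kernel, no padding
pool_out = output_size(conv_out, 2, s=1)  # 2x2 pool window, stride 1
print(conv_out, pool_out)  # 3 2
```

Feeding the output of one call into the next mirrors how the dimensions propagate from layer to layer.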

① Step 1: Convolution (9 dot products)

The 3×3 kernel slides across the 5×5 input with stride 1. Every position produces one value. Here are all 9:

Position [0,0]: top-left patch
1 2 3       1 0 -1
4 5 6   ⊙   1 0 -1
7 8 9       1 0 -1
(1×1)+(2×0)+(3×−1) + (4×1)+(5×0)+(6×−1) + (7×1)+(8×0)+(9×−1)
= (1+0−3) + (4+0−6) + (7+0−9)
= −2 + (−2) + (−2) = −6   →  FM[0,0] = −6

Position [0,1]: shift right by 1
2 3 0       1 0 -1
5 6 1   ⊙   1 0 -1
8 9 0       1 0 -1
(2×1)+(3×0)+(0×−1) + (5×1)+(6×0)+(1×−1) + (8×1)+(9×0)+(0×−1)
= (2+0+0) + (5+0−1) + (8+0+0)
= 2 + 4 + 8 = 14   →  FM[0,1] = 14

Position [0,2]: shift right again
3 0 1       1 0 -1
6 1 2   ⊙   1 0 -1
9 0 3       1 0 -1
(3×1)+(0×0)+(1×−1) + (6×1)+(1×0)+(2×−1) + (9×1)+(0×0)+(3×−1)
= (3+0−1) + (6+0−2) + (9+0−3)
= 2 + 4 + 6 = 12   →  FM[0,2] = 12

Position [1,0]: move down to row 1
4 5 6       1 0 -1
7 8 9   ⊙   1 0 -1
2 1 0       1 0 -1
(4×1)+(5×0)+(6×−1) + (7×1)+(8×0)+(9×−1) + (2×1)+(1×0)+(0×−1)
= (4+0−6) + (7+0−9) + (2+0+0)
= −2 + (−2) + 2 = −2   →  FM[1,0] = −2

Position [1,1]: centre of feature map
5 6 1       1 0 -1
8 9 0   ⊙   1 0 -1
1 0 4       1 0 -1
(5×1)+(6×0)+(1×−1) + (8×1)+(9×0)+(0×−1) + (1×1)+(0×0)+(4×−1)
= (5+0−1) + (8+0+0) + (1+0−4)
= 4 + 8 + (−3) = 9   →  FM[1,1] = 9

Position [1,2]
6 1 2       1 0 -1
9 0 3   ⊙   1 0 -1
0 4 5       1 0 -1
(6×1)+(1×0)+(2×−1) + (9×1)+(0×0)+(3×−1) + (0×1)+(4×0)+(5×−1)
= (6+0−2) + (9+0−3) + (0+0−5)
= 4 + 6 + (−5) = 5   →  FM[1,2] = 5

Position [2,0]: bottom row, left
7 8 9       1 0 -1
2 1 0   ⊙   1 0 -1
6 3 2       1 0 -1
(7×1)+(8×0)+(9×−1) + (2×1)+(1×0)+(0×−1) + (6×1)+(3×0)+(2×−1)
= (7+0−9) + (2+0+0) + (6+0−2)
= −2 + 2 + 4 = 4   →  FM[2,0] = 4

Position [2,1]
8 9 0       1 0 -1
1 0 4   ⊙   1 0 -1
3 2 1       1 0 -1
(8×1)+(9×0)+(0×−1) + (1×1)+(0×0)+(4×−1) + (3×1)+(2×0)+(1×−1)
= (8+0+0) + (1+0−4) + (3+0−1)
= 8 + (−3) + 2 = 7   →  FM[2,1] = 7

Position [2,2]: bottom-right patch
9 0 3       1 0 -1
0 4 5   ⊙   1 0 -1
2 1 0       1 0 -1
(9×1)+(0×0)+(3×−1) + (0×1)+(4×0)+(5×−1) + (2×1)+(1×0)+(0×−1)
= (9+0−3) + (0+0−5) + (2+0+0)
= 6 + (−5) + 2 = 3   →  FM[2,2] = 3

📈 Feature Map after Convolution (3×3)

−6 14 12
−2  9  5
 4  7  3
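
The nine hand computations can be cross-checked in one shot. The sketch below assumes NumPy 1.20 or newer (for `sliding_window_view`) and uses `einsum` to multiply every 3×3 patch by the kernel and sum; it is one possible vectorised formulation, not the only one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([[1, 2, 3, 0, 1],
                  [4, 5, 6, 1, 2],
                  [7, 8, 9, 0, 3],
                  [2, 1, 0, 4, 5],
                  [6, 3, 2, 1, 0]])
kernel = np.array([[1, 0, -1]] * 3)

# windows[i, j] is the 3x3 patch whose top-left corner sits at (i, j).
windows = sliding_window_view(image, (3, 3))   # shape (3, 3, 3, 3)

# Multiply each patch element-wise by the kernel and sum over the patch axes.
feature_map = np.einsum("ijkl,kl->ij", windows, kernel)
print(feature_map.tolist())  # [[-6, 14, 12], [-2, 9, 5], [4, 7, 3]]
```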

② Step 2: ReLU Activation, max(0, x)

Apply ReLU element-wise. Every negative value becomes 0. Every positive value stays unchanged.

Position | Conv Output | ReLU Rule | Result
[0,0] | −6 | → max(0, −6) | 0
[0,1] | 14 | → max(0, 14) | 14
[0,2] | 12 | → max(0, 12) | 12
[1,0] | −2 | → max(0, −2) | 0
[1,1] | 9 | → max(0, 9) | 9
[1,2] | 5 | → max(0, 5) | 5
[2,0] | 4 | → max(0, 4) | 4
[2,1] | 7 | → max(0, 7) | 7
[2,2] | 3 | → max(0, 3) | 3

⚡ After ReLU (3×3)

0 14 12
0  9  5
4  7  3
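
As a quick check of the table, the following sketch applies ReLU to the feature map and also lists which positions were zeroed. The variable names are illustrative.

```python
import numpy as np

feature_map = np.array([[-6, 14, 12],
                        [-2,  9,  5],
                        [ 4,  7,  3]])

activated = np.maximum(0, feature_map)   # element-wise max(0, x)
zeroed = np.argwhere(feature_map < 0)    # [row, col] of every value ReLU set to 0

print(activated.tolist())  # [[0, 14, 12], [0, 9, 5], [4, 7, 3]]
print(zeroed.tolist())     # [[0, 0], [1, 0]]
```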

③ Step 3: Max Pooling, 2×2 window, Stride 1

Output size: O = ⌊(3 − 2)/1⌋ + 1 = 2 → 2×2 output. Slide the 2×2 window over the ReLU map:

Window [0:2, 0:2] → Out[0,0]
0 14
0  9
max(0, 14, 0, 9) = 14

Window [0:2, 1:3] → Out[0,1]
14 12
 9  5
max(14, 12, 9, 5) = 14

Window [1:3, 0:2] → Out[1,0]
0 9
4 7
max(0, 9, 4, 7) = 9

Window [1:3, 1:3] → Out[1,1]
9 5
7 3
max(9, 5, 7, 3) = 9

🏆 Final Output after Max Pool (2×2)

14 14
 9  9
🎯 Numerical 1: Full Summary

Input 5×5 → Conv (3×3 kernel, S=1, P=0) → Feature Map 3×3 [−6,14,12 / −2,9,5 / 4,7,3] → ReLU [0,14,12 / 0,9,5 / 4,7,3] → MaxPool (2×2, S=1) → Final [[14,14],[9,9]]. The two negatives (−6 and −2) were zeroed by ReLU. Max pooling then pulled the strongest signal (14, the edge response) into both top cells.


Section 03

Numerical 2: Different Kernel, Stride 2 Pool

Given: A 4×4 input image and a 3×3 sharpening kernel. No padding. Stride 1 conv, then 2×2 MaxPool with stride 2 (non-overlapping).

Input (4×4) → Conv 3×3 (S=1, P=0) → Feature Map (2×2) → ReLU max(0, x) → MaxPool 2×2 (Stride 2) → Output (1×1)

🖼 Input Image (4×4)

2 4 1 3
5 8 2 6
1 3 7 4
0 2 5 9

⚙ Kernel (3×3), Sharpening

 0 -1  0
-1  5 -1
 0 -1  0

Output Sizes
After Conv: O = ⌊(4 − 3 + 0) / 1⌋ + 1 = 2 → Feature map is 2×2
After Pool: O = ⌊(2 − 2) / 2⌋ + 1 = 1 → Final output is 1×1 (a single number!)

① Step 1: Convolution (4 dot products on the 4×4 input)

Position [0,0]
2 4 1        0 -1  0
5 8 2   ⊙   -1  5 -1
1 3 7        0 -1  0
(2×0)+(4×−1)+(1×0) + (5×−1)+(8×5)+(2×−1) + (1×0)+(3×−1)+(7×0)
= (0−4+0) + (−5+40−2) + (0−3+0)
= −4 + 33 + (−3) = 26   →  FM[0,0] = 26

Position [0,1]
4 1 3        0 -1  0
8 2 6   ⊙   -1  5 -1
3 7 4        0 -1  0
(4×0)+(1×−1)+(3×0) + (8×−1)+(2×5)+(6×−1) + (3×0)+(7×−1)+(4×0)
= (0−1+0) + (−8+10−6) + (0−7+0)
= −1 + (−4) + (−7) = −12   →  FM[0,1] = −12

Position [1,0]
5 8 2        0 -1  0
1 3 7   ⊙   -1  5 -1
0 2 5        0 -1  0
(5×0)+(8×−1)+(2×0) + (1×−1)+(3×5)+(7×−1) + (0×0)+(2×−1)+(5×0)
= (0−8+0) + (−1+15−7) + (0−2+0)
= −8 + 7 + (−2) = −3   →  FM[1,0] = −3

Position [1,1]
8 2 6        0 -1  0
3 7 4   ⊙   -1  5 -1
2 5 9        0 -1  0
(8×0)+(2×−1)+(6×0) + (3×−1)+(7×5)+(4×−1) + (2×0)+(5×−1)+(9×0)
= (0−2+0) + (−3+35−4) + (0−5+0)
= −2 + 28 + (−5) = 21   →  FM[1,1] = 21

📈 Feature Map after Convolution (2×2)

26 −12
−3  21

② Step 2: ReLU Activation

Position | Conv Output | ReLU Rule | Result
[0,0] | 26 | → max(0, 26) | 26
[0,1] | −12 | → max(0, −12) | 0
[1,0] | −3 | → max(0, −3) | 0
[1,1] | 21 | → max(0, 21) | 21

⚡ After ReLU (2×2)

26  0
 0 21

③ Step 3: Max Pooling, 2×2 window, Stride 2

Output size: O = ⌊(2 − 2)/2⌋ + 1 = 1 → single scalar output. Only one window, and it covers the entire 2×2 ReLU map:

Window [0:2, 0:2], the entire ReLU map → Out[0,0]
26  0
 0 21
max(26, 0, 0, 21) = 26

🏆 Final Output after Max Pool (1×1)

26
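
When the stride equals the window size, the windows tile the map without overlap, so the pool can be written as a reshape followed by a max. This is a sketch under the assumption that both dimensions are divisible by the pool size; the function name `max_pool_nonoverlap` is hypothetical.

```python
import numpy as np

def max_pool_nonoverlap(x, pool=2):
    """Max pool with stride == pool size, via reshape.

    Assumes both dimensions of x are divisible by pool.
    """
    h, w = x.shape
    # Split rows and columns into (block index, position inside block).
    blocks = x.reshape(h // pool, pool, w // pool, pool)
    # Take the max over the two "inside block" axes.
    return blocks.max(axis=(1, 3))

relu_map = np.array([[26, 0],
                     [0, 21]])
print(max_pool_nonoverlap(relu_map).tolist())  # [[26]]
```

This trick only applies to the non-overlapping case; a stride-1 pool like the one in Numerical 1 needs sliding windows instead.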
🎯 Numerical 2: Full Summary

Input 4×4 → Conv (sharpening 3×3, S=1, P=0) → Feature Map 2×2 [26, −12, −3, 21] → ReLU → [26, 0, 0, 21] → MaxPool (2×2, S=2) → single value 26. The sharpening kernel gave large positive responses for the two patches whose centre pixel is much brighter than its four direct neighbours (centres 8 and 7) and negative responses for the two patches whose centre is darker than its neighbours (centres 2 and 3). ReLU removed the two negative responses. Max pool selected the strongest: 26.
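
Because the sharpening kernel is 5 at the centre, −1 at the four direct neighbours and 0 at the corners, each response reduces to "five times the centre minus its up, down, left and right neighbours". The sketch below recomputes the four feature-map values that way; the helper name `sharpen_response` is an assumption for illustration.

```python
import numpy as np

image = np.array([[2, 4, 1, 3],
                  [5, 8, 2, 6],
                  [1, 3, 7, 4],
                  [0, 2, 5, 9]])

def sharpen_response(img, r, c):
    """5 * centre minus the up, down, left and right neighbours of (r, c)."""
    neighbours = img[r - 1, c] + img[r + 1, c] + img[r, c - 1] + img[r, c + 1]
    return int(5 * img[r, c] - neighbours)

# Centres of the four valid 3x3 patches in the 4x4 input.
centres = [(1, 1), (1, 2), (2, 1), (2, 2)]
responses = [sharpen_response(image, r, c) for r, c in centres]
print(responses)  # [26, -12, -3, 21]
```

The centres 8 and 7 stand well above their neighbours and give 26 and 21; the centres 2 and 3 sit below theirs and give −12 and −3.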


Section 04

Side-by-Side Pipeline Summary

Stage | Numerical 1 (5×5 input) | Numerical 2 (4×4 input)
Input | 5×5 = 25 values | 4×4 = 16 values
Kernel | 3×3 vertical edge detector [1,0,−1 / 1,0,−1 / 1,0,−1] | 3×3 sharpening [0,−1,0 / −1,5,−1 / 0,−1,0]
After Conv | 3×3 feature map [−6,14,12 / −2,9,5 / 4,7,3] | 2×2 feature map [26,−12 / −3,21]
Negatives | 2 values (−6, −2) | 2 values (−12, −3)
After ReLU | [0,14,12 / 0,9,5 / 4,7,3] | [26,0 / 0,21]
Pool Config | 2×2, Stride 1 → overlapping | 2×2, Stride 2 → non-overlapping
Final Output | [[14,14],[9,9]] (2×2) | [[26]] (single scalar)
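
To see the effect of the pool stride in isolation, the sketch below pools the same ReLU map from Numerical 1 with stride 1 and then with stride 2. The stride-2 run is a hypothetical variation that does not appear in the numericals; the helper name `max_pool_windows` is also an assumption, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_windows(x, pool=2, stride=1):
    """Max pool built from all stride-1 windows, then subsampled by stride."""
    windows = sliding_window_view(x, (pool, pool))   # every stride-1 window
    return windows[::stride, ::stride].max(axis=(2, 3))

relu_map = np.array([[0, 14, 12],
                     [0,  9,  5],
                     [4,  7,  3]])

print(max_pool_windows(relu_map, stride=1).tolist())  # [[14, 14], [9, 9]]
print(max_pool_windows(relu_map, stride=2).tolist())  # [[14]]
```

With stride 2 only the top-left window fits in a 3×3 map, which matches O = ⌊(3 − 2)/2⌋ + 1 = 1.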

Section 05

Python: Verify Both Pipelines

import numpy as np

# ── Convolution (cross-correlation, no flip) ──────────────────
def conv2d(x, k):
    """No padding, stride 1."""
    KH, KW = k.shape
    OH, OW = x.shape[0]-KH+1, x.shape[1]-KW+1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = np.sum(x[i:i+KH, j:j+KW] * k)
    return out

# ── ReLU ──────────────────────────────────────────────────────
def relu(x):
    return np.maximum(0, x)

# ── Max Pool ──────────────────────────────────────────────────
def max_pool(x, pool=2, stride=2):
    OH = (x.shape[0] - pool) // stride + 1
    OW = (x.shape[1] - pool) // stride + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = x[i*stride:i*stride+pool, j*stride:j*stride+pool].max()
    return out

# ══════════════════════════════════════════════
# NUMERICAL 1: 5×5 input, vertical edge kernel
# ══════════════════════════════════════════════
inp1 = np.array([
    [1,2,3,0,1],
    [4,5,6,1,2],
    [7,8,9,0,3],
    [2,1,0,4,5],
    [6,3,2,1,0]
])
k1 = np.array([[1,0,-1],[1,0,-1],[1,0,-1]])

fm1 = conv2d(inp1, k1)
r1  = relu(fm1)
p1  = max_pool(r1, pool=2, stride=1)

print("N1 Feature Map:\n", fm1)
print("N1 After ReLU:\n",  r1)
print("N1 Max Pool output:\n", p1)

# ══════════════════════════════════════════════
# NUMERICAL 2: 4×4 input, sharpening kernel
# ══════════════════════════════════════════════
inp2 = np.array([
    [2,4,1,3],
    [5,8,2,6],
    [1,3,7,4],
    [0,2,5,9]
])
k2 = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])

fm2 = conv2d(inp2, k2)
r2  = relu(fm2)
p2  = max_pool(r2, pool=2, stride=2)

print("N2 Feature Map:\n", fm2)
print("N2 After ReLU:\n",  r2)
print("N2 Max Pool output:\n", p2)
OUTPUT
N1 Feature Map:
 [[ -6. 14. 12.]
 [ -2. 9. 5.]
 [ 4. 7. 3.]]
N1 After ReLU:
 [[ 0. 14. 12.]
 [ 0. 9. 5.]
 [ 4. 7. 3.]]
N1 Max Pool output:
 [[14. 14.]
 [ 9. 9.]]
N2 Feature Map:
 [[ 26. -12.]
 [ -3. 21.]]
N2 After ReLU:
 [[26. 0.]
 [ 0. 21.]]
N2 Max Pool output:
 [[26.]]
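
The comment "cross-correlation, no flip" in the code above matters for asymmetric kernels. A mathematically strict convolution rotates the kernel by 180 degrees before sliding it. The sketch below, with hypothetical helper names `correlate_valid` and `convolve_valid`, shows that the flip negates the edge-detector result but leaves the symmetric sharpening result unchanged.

```python
import numpy as np

def correlate_valid(x, k):
    """Cross-correlation, no padding, stride 1 (what the article computes)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k) for j in range(ow)]
                     for i in range(oh)])

def convolve_valid(x, k):
    """Strict convolution: rotate the kernel by 180 degrees, then correlate."""
    return correlate_valid(x, np.flip(k))

inp1 = np.array([[1, 2, 3, 0, 1], [4, 5, 6, 1, 2], [7, 8, 9, 0, 3],
                 [2, 1, 0, 4, 5], [6, 3, 2, 1, 0]])
inp2 = np.array([[2, 4, 1, 3], [5, 8, 2, 6], [1, 3, 7, 4], [0, 2, 5, 9]])
k1 = np.array([[1, 0, -1]] * 3)                        # vertical edge kernel
k2 = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])   # sharpening kernel

# Flipping k1 swaps its +1 and -1 columns, so every response changes sign.
print(np.array_equal(convolve_valid(inp1, k1), -correlate_valid(inp1, k1)))  # True
# k2 is unchanged by a 180-degree rotation, so both operations agree.
print(np.array_equal(convolve_valid(inp2, k2), correlate_valid(inp2, k2)))   # True
```

In a trained CNN the distinction is harmless because the kernel weights are learned either way, but it matters when comparing hand calculations against a signal-processing library.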

Section 06

Golden Rules: The Three-Step Sequence

⚡ Conv → ReLU → MaxPool: What Every Student Must Internalise
1. Always compute output size before you start. O = ⌊(N − F + 2P)/S⌋ + 1. Know your dimensions at every stage; an error here means all subsequent numbers are wrong.
2. Convolution at each position is a dot product, not a matrix multiplication. Multiply element-wise, then sum all 9 (or 4, or 25) products to get one number. Do not multiply entire rows or columns.
3. ReLU is trivial but critical. Every negative → 0, every positive stays. It is the non-linearity that lets the network learn non-linear decision boundaries. Without it, stacking convolutions is just one big linear transform.
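4. Max pooling uses the ReLU output, not the raw feature map. The conventional order, and the one used throughout this article, is Conv → ReLU → Pool. With max pooling, applying ReLU after the pool gives the same numbers because ReLU never changes the ordering of values, but the two orders are not interchangeable for average pooling, so keep to one order.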
5. Pool stride controls spatial compression. Stride 1 = barely any size reduction. Stride 2 = roughly halved spatial size. Stride = pool size = no overlap. These are distinct behaviours with very different effects on the network.
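
Rule 3 can be checked numerically. A linear map f must satisfy f(x) + f(−x) = f(0) = 0. The sketch below stacks the two kernels from this article (edge detector, then sharpening) on the Numerical 1 input, once without and once with ReLU in between. Stacking these two particular kernels is an assumption made for the demonstration and is not a pipeline from the numericals; `block` is a hypothetical helper.

```python
import numpy as np

def conv2d(x, k):
    """Valid cross-correlation, stride 1."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k) for j in range(ow)]
                     for i in range(oh)])

def block(x, k_a, k_b, use_relu):
    """Two stacked convolutions, optionally with ReLU in between."""
    h = conv2d(x, k_a)
    if use_relu:
        h = np.maximum(0, h)
    return conv2d(h, k_b)

x = np.array([[1, 2, 3, 0, 1], [4, 5, 6, 1, 2], [7, 8, 9, 0, 3],
              [2, 1, 0, 4, 5], [6, 3, 2, 1, 0]])
edge = np.array([[1, 0, -1]] * 3)
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])

for use_relu in (False, True):
    of_zero = block(x + (-x), edge, sharpen, use_relu).item()   # f(0)
    summed = (block(x, edge, sharpen, use_relu)
              + block(-x, edge, sharpen, use_relu)).item()      # f(x) + f(-x)
    print(use_relu, of_zero, summed)
# False 0 0   -> additive: the stack is still one linear transform
# True 0 17   -> not additive: ReLU made the stack non-linear
```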