
Python Implementation of Convolutional Neural Networks (CNN)

A comprehensive deep-dive into Convolutional Neural Networks using Python and TensorFlow/Keras. Covers the convolution operation, pooling, activation functions, batch normalisation, dropout, and backpropagation, each with detailed syntax explanations. Includes a complete CIFAR-10 implementation and a real-world case study: detecting pneumonia from chest X-rays using transfer learning with VGG16.

Section 01

The Story That Explains Why CNN Was Born

The Detective Who Reads Faces
Imagine you are a detective. Someone hands you a photo of a suspect. You don't look at every single pixel in a 1000×1000 image independently; that would be 1,000,000 inputs! Instead, your brain naturally scans for local patterns: you look for eyes first, then a nose, then the jawline. Then your brain combines those parts into a full face recognition.

Now imagine you had to describe that face to a computer that uses a plain MLP (Multi-Layer Perceptron). You would have to flatten the image into a 1,000,000-element vector, and a single hidden layer of just 1,000 neurons would already need a billion weights. That is impractical, and flattening also throws away all spatial structure.

Convolutional Neural Networks were designed to do exactly what your detective brain does: scan for local patterns, detect features, and gradually build global understanding. That is the entire philosophy.

In 1989, Yann LeCun introduced the concept of CNNs. In 1998 he built LeNet-5, a CNN that read handwritten digits on cheques for banks. In 2012, AlexNet won the ImageNet competition with a 15.3% top-5 error rate, far ahead of the traditional methods it competed against. CNNs had arrived. Today, they power face recognition on your phone, medical imaging, self-driving cars, and satellite image analysis.

The Core Problem CNN Solves

A plain neural network treats every pixel as an independent input; it has no concept of "neighbours". CNNs enforce a spatial prior: nearby pixels are related, patterns can appear anywhere in the image (translation invariance), and complex features are built from simple ones hierarchically.
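To put rough numbers on the difference, here is a small sketch that counts weights for a dense layer on a flattened 1000×1000 image versus a convolutional layer. The hidden size of 1,000 units and the choice of 32 single-channel 3×3 filters are illustrative assumptions, and the helper names are hypothetical.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


def conv_param_count(kernel_size, channels_in, n_filters):
    """Weights plus one bias per filter for a 2D convolutional layer."""
    return (kernel_size * kernel_size * channels_in + 1) * n_filters


# Assumed setup: a 1000x1000 greyscale image fed to 1,000 hidden units.
mlp_params = dense_param_count(1000 * 1000, 1000)

# Assumed setup: 32 filters of size 3x3 sliding over the same image.
cnn_params = conv_param_count(3, 1, 32)

print(f"Dense layer: {mlp_params:,} parameters")   # 1,000,001,000
print(f"Conv layer:  {cnn_params:,} parameters")   # 320
```

The convolutional layer's count does not depend on the image size at all, because the same small filters are reused at every position.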


Section 02

CNN Architecture: The Big Picture

A CNN is not one single operation; it is a pipeline of specialised layers. Think of it as a factory assembly line where each station adds value.

The CNN Factory Assembly Line
INPUT
Raw Image → A 3D tensor of shape (Height, Width, Channels). E.g., a 32×32 RGB image is (32, 32, 3)
CONV
Convolutional Layer → Slides filters over the image to detect local patterns (edges, curves, textures). Each filter produces one feature map.
RELU
Activation (ReLU) → Adds non-linearity. Sets all negative values to zero. Without this, stacking layers is mathematically pointless (just one big linear transform).
POOL
Pooling Layer → Shrinks the feature maps. Reduces computation. Makes detection robust to small shifts in position.
FLAT
Flatten → Collapses the 3D feature maps into a 1D vector so Dense layers can process it.
DENSE
Fully Connected Layers → Traditional MLP layers. Combine the extracted features to make the final classification decision.
OUTPUT
Softmax / Sigmoid → Converts raw scores into probabilities. Softmax for multi-class, Sigmoid for binary.

CNN Architecture Diagram: Data Flow
Input: INPUT, 32×32×3 image
Feature Extraction: CONV2D, 32 filters 3×3, ReLU → 30×30×32
Downsample: MAXPOOL 2×2 → 15×15×32
Deeper Features: CONV2D, 64 filters 3×3, ReLU → 13×13×64
Downsample: MAXPOOL 2×2 → 6×6×64
Reshape: FLATTEN → 2304 units
Classify: DENSE 128, ReLU + Dropout
Predict: OUTPUT, 10 classes, Softmax

Data shrinks spatially but grows in depth (channels) as it moves through the network. Spatial richness is traded for semantic richness.
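The shapes in the data-flow diagram can be checked with a few lines of arithmetic. This is a minimal sketch assuming stride-1 convolutions without padding and non-overlapping 2×2 pooling, as in the diagram; the helper names are hypothetical.

```python
def conv2d_shape(height, width, n_filters, kernel=3):
    """Shape after a stride-1 convolution with no padding ('valid')."""
    return height - kernel + 1, width - kernel + 1, n_filters


def maxpool_shape(height, width, channels, pool=2):
    """Shape after non-overlapping pooling; floor division drops any remainder."""
    return height // pool, width // pool, channels


shape = (32, 32, 3)                                     # input image
shape = conv2d_shape(shape[0], shape[1], n_filters=32)  # (30, 30, 32)
shape = maxpool_shape(*shape)                           # (15, 15, 32)
shape = conv2d_shape(shape[0], shape[1], n_filters=64)  # (13, 13, 64)
shape = maxpool_shape(*shape)                           # (6, 6, 64)

flat_units = shape[0] * shape[1] * shape[2]
print(shape, flat_units)                                # (6, 6, 64) 2304
```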


Section 03

The Convolution Operation: The Heart of CNN

The Flashlight in the Dark Room
Imagine you are in a dark room with a large painting. You have a small flashlight. You slide the flashlight across the painting, scanning one small area at a time. At each position, you note what you see (edge? colour blob? texture?). When you're done, you have a complete "map" of what you found.

That flashlight is the kernel (filter). The map you draw is the feature map (activation map). Sliding the flashlight is the convolution operation.

Mathematically, a 2D convolution applies a small matrix (the filter/kernel) to patches of the input. The filter slides across the image with a given stride, performing an element-wise multiplication at each position, then summing the result into a single number.

Output Size (no padding)
W_out = floor((W - F) / S) + 1
W = input width, F = filter size, S = stride. For W=32, F=3, S=1 → output is 30.
Output Size (same padding)
W_out = ceil(W / S)
With 'same' padding, zeros are added around the input so the output size equals the input size when S=1. Most commonly used in deep nets.
Parameter Count per Layer
params = (F × F × C_in + 1) × C_out
C_in = input channels, C_out = number of filters (output channels). +1 for bias per filter.
Receptive Field Growth
RF = 1 + N × (F - 1)
After stacking N stride-1 conv layers with filter size F, each output neuron "sees" an RF × RF patch of input pixels. Deeper = wider view of the original image.
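The four formulas above translate directly into code. The sketch below uses hypothetical helper names; the receptive-field function assumes stride-1 layers that all share the same filter size.

```python
import math


def conv_output_size(w, f, s=1, padding="valid"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(w / s)
    return (w - f) // s + 1


def conv_param_count(f, c_in, c_out):
    """Trainable parameters: F*F*C_in weights plus one bias, per filter."""
    return (f * f * c_in + 1) * c_out


def receptive_field(n_layers, f):
    """Receptive field of n stacked stride-1 conv layers with filter size f."""
    return 1 + n_layers * (f - 1)


print(conv_output_size(32, 3))                       # 30
print(conv_output_size(32, 3, s=2, padding="same"))  # 16
print(conv_param_count(3, 3, 32))                    # 896
print(receptive_field(2, 3))                         # 5
```

The last line shows why two stacked 3×3 layers are often preferred to one 5×5 layer: same 5×5 view of the input, fewer weights per channel pair (2 × 9 versus 25).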

Numerical Example: A 5×5 Image, 3×3 Filter

Input Patch (5×5), one channel shown
1 0 1 0 1
0 1 1 1 0
1 1 0 1 1
0 0 1 0 0
1 0 0 1 1
Kernel (3×3): Vertical Edge Detector
-1 0 +1
-1 0 +1
-1 0 +1

At top-left position: (1×-1)+(0×0)+(1×1)+(0×-1)+(1×0)+(1×1)+(1×-1)+(1×0)+(0×1) = 0. Row by row, the partial sums are 0, +1 and -1, which cancel out.
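The same calculation can be run for every filter position with a short NumPy loop. This is a teaching sketch, not how Keras implements convolution internally; like deep learning frameworks, it slides the kernel without flipping it (strictly speaking a cross-correlation). The function name is hypothetical.

```python
import numpy as np

image = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 1],
])

kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])


def conv2d_valid(x, k, stride=1):
    """Slide kernel k over x with no padding; multiply element-wise and sum."""
    kh, kw = k.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = x[r:r + kh, c:c + kw]
            out[i, j] = np.sum(patch * k)
    return out


feature_map = conv2d_valid(image, kernel)
print(feature_map)
# [[ 0  0  0]
#  [ 1  0 -1]
#  [-1  1  1]]
```

The 5×5 input with a 3×3 kernel and stride 1 gives a 3×3 feature map, matching (5 - 3) / 1 + 1 = 3.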

Why Different Filters Detect Different Things

A vertical edge kernel has opposite signs in left and right columns. A horizontal edge kernel has them in top and bottom rows. A blur kernel has equal weights (1/9 each). In deep learning, we don't hand-craft these kernels; the network learns the kernel values through backpropagation. That's the magic: the filters self-organise to detect whatever is most useful for the task.

What CNN Filters Learn at Each Depth
Layer 1 Filters
Low-Level Features
Detect simple edges (horizontal, vertical, diagonal), colour blobs, and gradients. These are close to universal: similar patterns appear in almost every natural image.
→ Gabor-like edge detectors
Layer 2–3 Filters
Mid-Level Features
Detect textures, corners, curves, simple shapes. Combinations of edges. Like recognising "there is a circular region here".
→ Texture and shape detectors
Deep Filters
High-Level Features
Detect semantic objects: eyes, wheels, fur patterns. Task-specific. A model trained on cats learns "cat face" detectors; one trained on cars learns "wheel" detectors.
→ Object-part detectors

Section 04

Padding and Stride: The Controls

No Padding (valid)
padding='valid'
Filter only placed where it fits fully inside the image. Output is smaller than input. Pixels at the edges of the image are covered by fewer filter positions, so information at the borders is underweighted.
✔ Smaller output, less downstream computation
✘ Corners underrepresented
Zero Padding (same)
padding='same'
Zeros are added around the input so the output has the same spatial size as the input (when stride=1). Border pixels, including corners, are covered as often as interior ones. Standard choice in most modern architectures.
✔ Preserves spatial dimensions
✘ Slight boundary artefact from zeros
Stride
strides=(2,2)
Step size of the filter. Stride=1: move 1 pixel at a time (dense coverage). Stride=2: move 2 pixels at a time, so the output is halved in each dimension. Often used instead of pooling to downsample in modern nets (e.g., ResNet).
✔ Learnable downsampling
✘ Risk of missing fine patterns

Section 05

Pooling Layers: Compressing Without Losing Soul

After convolution, we have feature maps that are large and potentially redundant. Pooling aggregates each small region into one value. It makes the representation more compact, reduces computation and the number of parameters in later layers, and introduces a degree of spatial invariance: a feature detected slightly off-position still fires.

Before: Feature Map (4×4)
12 20 8 11
18 5 14 7
3 9 22 16
6 13 4 19
After MaxPool 2×2 (2×2 output)
20 14
13 22

Max of each 2×2 block is retained. 4×4 → 2×2. 75% of values are discarded; the strongest activation in each block is kept.

Max Pooling
tf.keras.layers.MaxPool2D
Takes the maximum value in each window. Focuses on the strongest activation: "Was this feature present at all?" Best for classification tasks. Most commonly used.
Average Pooling
tf.keras.layers.AvgPool2D
Takes the average of each window. Smoother representation. Its global variant, Global Average Pooling, is used at the end of many modern networks in place of Flatten and the large hidden Dense layers.
Global Average Pooling
GlobalAveragePooling2D
Collapses each entire feature map into a single number (its mean). If you have 64 feature maps, output is a vector of 64 values. Replaces Flatten and the hidden Dense layers; a single Dense classifier usually follows. Used in ResNet, MobileNet.
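The three pooling variants can be sketched in NumPy on the 4×4 feature map from the max-pooling example. The reshape trick below assumes the height and width are divisible by 2, and the two-channel stack is made up purely to show the output shape of global average pooling.

```python
import numpy as np

feature_map = np.array([
    [12, 20,  8, 11],
    [18,  5, 14,  7],
    [ 3,  9, 22, 16],
    [ 6, 13,  4, 19],
])


def pool_2x2(x, reduce_fn):
    """Split x into non-overlapping 2x2 blocks and reduce each block."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)  # axes: block row, row in block, block col, col in block
    return reduce_fn(blocks, axis=(1, 3))


pooled_max = pool_2x2(feature_map, np.max)
pooled_avg = pool_2x2(feature_map, np.mean)
print(pooled_max)   # [[20 14]
                    #  [13 22]]
print(pooled_avg)   # [[13.75 10.  ]
                    #  [ 7.75 15.25]]

# Global average pooling: one mean per channel of an (H, W, C) tensor.
two_channels = np.stack([feature_map, feature_map * 2], axis=-1)  # shape (4, 4, 2)
gap = two_channels.mean(axis=(0, 1))
print(gap)          # [11.6875 23.375 ]
```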

Section 06

Activation Functions: Why Non-Linearity Matters

Without Activation Functions, CNNs Are Useless

A stack of linear operations (conv + conv + conv) is mathematically equivalent to a single linear operation. You could have 100 layers and it's no more powerful than 1. Activation functions break this: they introduce non-linearity, enabling the network to model complex, curved decision boundaries.
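A tiny NumPy example makes the collapse visible. The weight matrices below are hand-picked for illustration (they are not learned): without an activation, the two layers multiply out to a single matrix that maps every input to 0, while inserting ReLU between them lets the same weights compute |x1 - x2|.

```python
import numpy as np

# Hand-picked weights for a 2 -> 2 -> 1 network (illustrative assumption).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([3.0, 1.0])

# Two linear layers in sequence...
stacked_linear = W2 @ (W1 @ x)
# ...equal one linear layer whose weights are the matrix product.
collapsed = (W2 @ W1) @ x

# The same weights with ReLU between the layers.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)

print(stacked_linear, collapsed)  # [0.] [0.]
print(with_relu)                  # [2.]
```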

ReLU
activation='relu' → max(0, x)
The workhorse. Returns x if x>0, else 0. Why this form? max(0,x) is cheap to compute and its gradient is simply 1 (for x>0) or 0, with no expensive exp() calculations.

Why used: Mitigates vanishing gradients better than sigmoid/tanh. Sparse activation (many zeros) = efficient computation.
✔ Fast, sparse, gradient does not saturate for x>0
✘ "Dying ReLU" problem (neurons stuck at 0)
Leaky ReLU / ELU
activation='leaky_relu'
Fixes dying ReLU: instead of 0 for negative x, it returns α×x (small slope, e.g. 0.01). ELU uses an exponential curve for negatives. Why used: Keeps all neurons "alive" during training. Often tried for very deep networks. Note that the string 'leaky_relu' is accepted by recent Keras versions; older tf.keras releases need layers.LeakyReLU() added as a separate layer.
✔ No dead neurons
✘ Extra hyperparameter α
Softmax
activation='softmax'
Used in the final output layer for multi-class classification. Converts raw logits into a probability distribution that sums to 1: softmax(x_i) = e^(x_i) / Σ_j e^(x_j).

Why this formula? The exponential amplifies differences between logits, making the largest value dominate. Sum normalisation ensures valid probabilities.
✔ Clean probability output
✘ Only for output, never hidden layers
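The three activations can be written in a few lines of NumPy. This is a sketch for intuition rather than the Keras implementation; the softmax subtracts the maximum logit before exponentiating, a standard trick that leaves the result unchanged but avoids overflow for large logits.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)


def softmax(logits):
    shifted = logits - np.max(logits)   # stability: largest exponent becomes e^0
    exps = np.exp(shifted)
    return exps / exps.sum()


z = np.array([-2.0, 0.0, 3.0])
print(relu(z))         # [0. 0. 3.]
print(leaky_relu(z))   # [-0.02  0.    3.  ]

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(3))  # [0.659 0.242 0.099]

# Naive exp(1000) would overflow; the shifted version stays finite.
big = softmax(np.array([1000.0, 1001.0]))
```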

Section 07

Batch Normalisation and Dropout: The Regulators

Batch Normalisation
Normalises the output of a layer so it has zero mean and unit variance, per mini-batch. Then rescales with two learnable parameters γ (scale) and β (shift).

Why is this syntax used?
BatchNormalization() is placed after Conv2D and before the activation in the original formulation, although placing it after the activation is also common. It was introduced to reduce "internal covariate shift": the problem where layer inputs keep changing distribution during training, making learning slow.

Effect: Allows higher learning rates, acts as a mild regulariser, dramatically stabilises training of deep networks.
keras.layers.BatchNormalization()
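The training-time computation can be sketched in NumPy for a batch of feature vectors. This is a simplified illustration: it omits the running averages that Keras keeps for inference, and for convolutional inputs Keras computes the statistics per channel across batch, height and width. The function name and the sample values are made up.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, ~unit variance
    return gamma * x_hat + beta


# Three samples, two features on very different scales (made-up values).
x = np.array([[1.0, 50.0],
              [2.0, 60.0],
              [3.0, 70.0]])

out = batch_norm_train(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0).round(6))  # approximately [0. 0.]
print(out.std(axis=0).round(3))   # approximately [1. 1.]
```

With γ = 1 and β = 0 the output is simply the normalised input; during training the network is free to learn other values if a different scale or offset works better.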
Dropout
During training, randomly sets a fraction of neurons to 0 at each forward pass and scales the surviving activations up by 1/(1 - rate). At inference, all neurons are active and no scaling is applied.

Why is this syntax used?
Dropout(rate=0.5) means each neuron is dropped with probability 0.5 on every step. This forces the network to not rely on any one neuron; it must learn redundant representations. Like training the team to work even when some members are absent.

Position: Most often after Dense layers. After Conv layers, a lighter rate or spatial dropout is typically used.
keras.layers.Dropout(0.5)
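A minimal NumPy sketch of "inverted" dropout, the variant Keras uses: the scaling by 1/(1 - rate) happens during training so that the expected activation stays the same, and inference is a plain pass-through. The function name and the all-ones input are illustrative assumptions.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero a random fraction, rescale the survivors."""
    if not training or rate == 0.0:
        return x                               # inference: identity
    keep_mask = rng.random(x.shape) >= rate    # True with probability 1 - rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)
activations = np.ones(10_000)

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(np.unique(train_out))   # [0. 2.]  dropped units are 0, kept units doubled
print(train_out.mean())       # close to 1.0, the original mean
print(infer_out.mean())       # 1.0
```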
L2 Regularisation
Adds a penalty proportional to the sum of the squared weights to the loss function. Forces weights to stay small.

Why used: Large weights tend to memorise training data. L2 keeps them modest. In Keras: kernel_regularizer=l2(0.001).

Can be combined with Batch Norm, but when BN follows a layer, L2 on that layer's weights matters less because BN already controls the output scale.
regularizers.l2(0.001)

Section 08

How CNNs Learn: Backpropagation Through Convolutions

01
Forward Pass
Input image passes through all layers. At each conv layer, current filter weights are used to compute feature maps. At the output, logits (raw scores) are produced.
02
Loss Computation
The loss function (e.g. Cross-Entropy for classification) compares predicted probabilities to true labels. Loss = −Σ y_true × log(y_pred). High loss = confident wrong prediction, low loss = confident correct prediction.
03
Backward Pass (Gradients)
The chain rule of calculus propagates the loss gradient back through each layer. For conv layers, the gradient w.r.t. each filter weight is computed by correlating the upstream gradient with the layer's input. This tells us "increase or decrease this filter weight?"
04
Weight Update (Optimiser)
All filter weights and biases are updated: w = w - lr × gradient. With the Adam optimiser, the step size is adapted per parameter. Filters gradually evolve from random noise into meaningful feature detectors.
05
Repeat for N Epochs
The entire training dataset is passed through multiple times (epochs). Each epoch refines the filters further. Early layers (edges) tend to converge quickly; deep layers tend to converge more slowly as they need the early layers to stabilise first.
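Steps 01 to 04 can be traced for a single 3×3 filter in NumPy. This is a sketch under simplifying assumptions: one channel, no bias, no activation, and a squared-error loss instead of cross-entropy so the upstream gradient is easy to write down. The filter gradient is obtained by correlating the input with the upstream gradient, then checked against a finite-difference estimate.

```python
import numpy as np


def correlate2d_valid(x, k):
    """Stride-1 sliding-window sum of products with no padding."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))        # input "image" (random, for illustration)
kernel = rng.normal(size=(3, 3))   # filter weights before the update
target = rng.normal(size=(3, 3))   # desired feature map (made up)


def loss_fn(k):
    """Steps 01-02: forward pass, then squared-error loss."""
    return 0.5 * np.sum((correlate2d_valid(x, k) - target) ** 2)


# Step 03: backward pass.
upstream = correlate2d_valid(x, kernel) - target   # dLoss/dOutput, shape (3, 3)
grad_kernel = correlate2d_valid(x, upstream)       # dLoss/dKernel, shape (3, 3)

# Sanity check of one entry with a central finite difference.
eps = 1e-6
up, down = kernel.copy(), kernel.copy()
up[1, 2] += eps
down[1, 2] -= eps
numeric = (loss_fn(up) - loss_fn(down)) / (2 * eps)

# Step 04: plain gradient-descent update, w = w - lr * gradient.
lr = 0.005
new_kernel = kernel - lr * grad_kernel
print(loss_fn(kernel), loss_fn(new_kernel))   # the loss goes down after the step
```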

Section 09

Full Python Implementation: Step by Step

Now we build a complete CNN from scratch using TensorFlow/Keras. Each syntax choice is explained in detail. We will structure the code into clean, logical steps.

Step 1: Import Libraries

# Core libraries: understand each import's role
import numpy as np                          # Array operations, data manipulation
import matplotlib.pyplot as plt            # Visualisation of images, loss curves
import tensorflow as tf                     # Deep learning framework

# Keras API: tf.keras is the high-level API built into TensorFlow 2.x
# 'from tensorflow import keras' is equivalent
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
# ImageDataGenerator is a legacy API, deprecated in recent TensorFlow releases
# in favour of tf.data and Keras preprocessing layers. It still works here.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# sklearn utilities for evaluation
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Set seeds for reproducibility: without this, results differ each run
np.random.seed(42)
tf.random.set_seed(42)
Why tf.keras and Not Standalone Keras?

TensorFlow 2.x ships with Keras built in. tf.keras is tightly integrated with TF's computational graph, automatic differentiation (tf.GradientTape), and GPU support. Standalone Keras can also run on other backends, but this tutorial uses tf.keras throughout.

Step 2: Load and Prepare the CIFAR-10 Dataset

# CIFAR-10: 60,000 colour images, 32×32 pixels, 10 classes
# Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Shapes after loading:
print(f"X_train shape: {X_train.shape}")  # (50000, 32, 32, 3)
print(f"X_test shape:  {X_test.shape}")   # (10000, 32, 32, 3)
print(f"y_train shape: {y_train.shape}")  # (50000, 1)

# ── WHY NORMALISE TO [0,1]? ──────────────────────────────
# Pixel values are uint8: 0 to 255. Neural nets train poorly with large inputs:
#   • Gradients become unstable (very large or very small)
#   • Weight initialisations are calibrated for ~unit-scale inputs
# Dividing by 255.0 maps values to [0.0, 1.0] and makes training stable.

X_train = X_train.astype('float32') / 255.0
X_test  = X_test.astype('float32')  / 255.0

# ── WHY float32 and NOT float64? ─────────────────────────
# GPUs are optimised for float32. float64 gives no benefit for neural net
# training but uses 2× memory and is 2-4× slower on GPU hardware.

# ── ONE-HOT ENCODING THE LABELS ──────────────────────────
# y_train contains integers 0-9. Softmax outputs 10 probabilities.
# We need labels as 10-element vectors for categorical_crossentropy.
# E.g.: class 3 → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

y_train_cat = tf.keras.utils.to_categorical(y_train, 10)
y_test_cat  = tf.keras.utils.to_categorical(y_test,  10)

# Class names for display
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

Step 3: Data Augmentation

# ── WHY DATA AUGMENTATION? ───────────────────────────────
# CNNs can memorise 50,000 training images perfectly (overfit).
# Augmentation artificially expands the dataset by creating variations
# of each image. This forces the network to learn features that are
# INVARIANT to flips, shifts, and rotations, which generalise better.

# ImageDataGenerator applies transforms RANDOMLY on the fly during training.
# The original images in memory or on disk are never modified.

datagen = ImageDataGenerator(
    rotation_range=15,       # Rotate image up to ±15 degrees randomly
                             # WHY 15°: CIFAR-10 objects are mostly upright; ±15° is realistic
    width_shift_range=0.1,   # Shift horizontally by up to 10% of width
    height_shift_range=0.1,  # Shift vertically by up to 10% of height
                             # WHY SHIFTS: objects aren't always perfectly centred
    horizontal_flip=True,    # Mirror image left-right 50% of the time
                             # WHY: A cat is still a cat when mirrored. NOT used for digits/text.
    zoom_range=0.1,          # Zoom in/out by up to 10%
    fill_mode='nearest'      # How to fill pixels created by shifts: copy nearest edge pixel
                             # Alternatives: 'reflect', 'wrap', 'constant'
)

# datagen.fit() computes dataset statistics, which are only needed for
# featurewise_center, featurewise_std_normalization or zca_whitening.
# None of those is enabled here, so this call is optional.
datagen.fit(X_train)

Step 4: Build the CNN Model

# ── SEQUENTIAL API vs FUNCTIONAL API ─────────────────────
# Sequential: simple stack of layers, suitable for most standard CNNs.
# Functional API: used when you need skip connections, multiple inputs/outputs.
# We use Sequential here for clarity.

def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = models.Sequential([

        # ── BLOCK 1: First Convolutional Block ───────────
        # Conv2D(32, (3,3), ...)
        #   32        = number of filters. Produces 32 feature maps.
        #              WHY 32 FIRST?: Start small and capture basic features.
        #              Larger first layers waste computation on simple features.
        #   (3,3)     = kernel size. 3×3 is the de facto standard (popularised by VGG).
        #              WHY 3×3?: Two 3×3 convs have same receptive field as one 5×5
        #              but with fewer parameters and more non-linearity.
        #   padding='same' = preserve spatial dimensions (32×32 → 32×32)
        #   activation='relu' = apply ReLU after convolution (most common choice)
        #   input_shape = only needed on FIRST layer. TensorFlow infers rest.

        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                       input_shape=input_shape,
                       kernel_regularizer=regularizers.l2(1e-4)),
                       # kernel_regularizer=l2(1e-4): add tiny L2 penalty on weights
                       # 1e-4 is a mild penalty that discourages extreme weight values

        layers.BatchNormalization(),  # Before or after ReLU? Both placements are used in practice.
                                       # Original paper: before ReLU.
                                       # Here ReLU is applied inside Conv2D, so BN runs after the activation.

        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)),
        # WHY TWO CONV LAYERS BEFORE POOLING?
        # Stacking 2 conv layers increases the effective receptive field
        # (two 3×3 layers = 5×5 receptive field) while using fewer parameters
        # than a single 5×5 layer. More non-linearity = richer representations.

        layers.MaxPooling2D(pool_size=(2, 2)),  # Halve spatial dims: 32×32 → 16×16
                                      # (2,2) = pool window, default stride = pool size
        layers.Dropout(0.25),        # Drop 25% of units. Lighter dropout after conv (25% vs 50%)
                                      # because conv layers have fewer params and need less regularisation.

        # ── BLOCK 2: Deeper, More Filters ────────────────
        # 64 filters: we DOUBLE the number after pooling. WHY?
        # Pooling halved the spatial resolution (information loss).
        # Doubling filters compensates by increasing the channel depth:
        # we trade spatial richness for semantic richness.

        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)),
        layers.MaxPooling2D(pool_size=(2, 2)),  # 16×16 → 8×8
        layers.Dropout(0.25),

        # ── BLOCK 3: Deepest Block ───────────────────────
        layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                        kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=(2, 2)),  # 8×8 → 4×4
        layers.Dropout(0.25),

        # ── CLASSIFICATION HEAD ──────────────────────────
        # Flatten converts 3D tensor (4, 4, 128) → 1D vector (2048,)
        # WHY FLATTEN NOT GlobalAveragePooling2D?
        # Flatten + Dense gives more parameters (more capacity) but risks overfit.
        # GAP averages each feature map down to 1×1: far fewer parameters,
        # but the spatial layout within each map is discarded.
        # For CIFAR-10 (small images), Flatten works well.

        layers.Flatten(),

        # Dense(512): fully connected layer with 512 neurons
        # WHY 512?: Rule of thumb: start large and narrow down. Gives the
        # network capacity to combine features. Reduce toward output size.
        layers.Dense(512, activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Dropout(0.5),   # Heavier dropout on Dense layers (50%): more overfit risk

        # Final output layer: one neuron per class (10 for CIFAR-10)
        # activation='softmax': converts the raw scores to probabilities summing to 1
        # WHY NOT 'sigmoid' here?: sigmoid outputs independent probabilities (0-1) each.
        # softmax creates a COMPETITION between classes, which is exactly what we want
        # for mutually exclusive classification (one image = one class).
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model = build_cnn()
model.summary()
MODEL SUMMARY (abbreviated; parameter counts follow from the layer definitions above)
Layer (type)                    Output Shape          Param #
================================================================
conv2d (Conv2D)                 (None, 32, 32, 32)    896
batch_normalization             (None, 32, 32, 32)    128
conv2d_1 (Conv2D)               (None, 32, 32, 32)    9,248
max_pooling2d (MaxPooling2D)    (None, 16, 16, 32)    0
dropout (Dropout)               (None, 16, 16, 32)    0
conv2d_2 (Conv2D)               (None, 16, 16, 64)    18,496
...
dense (Dense)                   (None, 512)           1,049,088
...
dense_1 (Dense)                 (None, 10)            5,130
================================================================
Total params: 1,196,586
Trainable params: 1,195,114
Non-trainable params: 1,472  ← BatchNorm running statistics

Step 5: Compile the Model

# ── COMPILE: tells Keras HOW to train ────────────────────

# optimizer='adam': Adaptive Moment Estimation
#   WHY ADAM over plain SGD?
#   Adam maintains per-parameter step sizes, adapted based on:
#     - first moment  (m): exponential average of gradients (β1=0.9 default)
#     - second moment (v): exponential average of squared gradients (β2=0.999)
#   Result: fast convergence, robust to hyperparameter choices.
#   learning_rate=0.001 is Adam's default and usually a good start.

# loss='categorical_crossentropy':
#   WHY THIS LOSS for classification?
#   It measures the distance between the predicted probability distribution
#   and the true one-hot distribution. Mathematically: -Σ y_true * log(y_pred).
#   Penalises confident wrong predictions VERY harshly (-log(p) → ∞ as p → 0).
#   Use 'sparse_categorical_crossentropy' if labels are integers (no to_categorical).

# metrics=['accuracy']: tracks accuracy during training for human monitoring.
#   Note: accuracy is NOT the loss, just a readable metric for logging.

model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

Step 6: Define Callbacks

# ── CALLBACKS: actions taken at end of each epoch ────────

# 1. EarlyStopping:
#    Monitor validation loss. If it doesn't improve for 'patience' epochs,
#    stop training. Prevents wasting compute on a model that's overfit.
#    restore_best_weights=True: roll back to the epoch with best val_loss.
#    WHY monitor val_loss not val_accuracy?: accuracy is coarse (steps of 1/N),
#    loss is continuous, so it gives a more sensitive signal for early stopping.

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)

# 2. ReduceLROnPlateau:
#    When val_loss stops improving, multiply the learning rate by 'factor'.
#    WHY?: Training can plateau when steps are too large to settle into a
#    minimum. A smaller LR lets the optimiser fine-tune more precisely.
#    factor=0.5 means LR is halved. min_lr prevents it from reaching zero.

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-6,
    verbose=1
)

# 3. ModelCheckpoint:
#    Save the full model (architecture + weights) whenever val_accuracy improves.
#    Pass save_weights_only=True to store only the weights.
#    WHY save_best_only=True?: training can degrade after a peak epoch.
#    This guarantees we always have the best version on disk.

checkpoint = ModelCheckpoint(
    filepath='best_cnn_model.h5',
    monitor='val_accuracy',
    save_best_only=True,
    verbose=1
)

callbacks = [early_stop, reduce_lr, checkpoint]
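The patience rule behind EarlyStopping can be sketched in plain Python. This is a simplified model of the logic, not the Keras source: it assumes "improvement" means a strictly lower validation loss, and the loss values below are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if training runs through every epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: best at epoch 3, then a plateau.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.69]

print(early_stopping_epoch(losses, patience=3))  # (6, 3): stops before seeing 0.69
print(early_stopping_epoch(losses, patience=5))  # (None, 7): patience was long enough
```

The two calls show the trade-off in choosing patience: too short and training stops before a late improvement, too long and compute is spent on epochs that do not help. With restore_best_weights=True, the weights from best_epoch are the ones kept.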

Step 7: Train the Model

# ── model.fit(): the training loop ───────────────────────
#
# datagen.flow(X_train, y_train_cat, batch_size=64):
#   Creates a generator that yields augmented batches of 64 images.
#   WHY batch_size=64?:
#     - Too small (8): noisy gradient estimates, slow progress
#     - Too large (512): smooth but may get stuck in sharp minima
#     - 32-128 is the practical sweet spot for CNNs on standard datasets
#
# steps_per_epoch = len(X_train) // batch_size:
#   How many batches = one epoch. 50,000 / 64 ≈ 781 steps per epoch.
#   WHY specify this?: A generator can yield batches forever, so stating
#   the number of steps makes the epoch boundary explicit.
#
# validation_data=(X_test, y_test_cat):
#   After each epoch, evaluate on the test set. Never used for weight updates.
#   WHY not augment validation data?: We want to evaluate on REAL images,
#   not augmented ones. Augmentation is only for training diversity.
#   CAVEAT: early stopping and checkpointing select the model on this set,
#   so the final test score is slightly optimistic. For rigorous reporting,
#   hold out a separate validation split from the training data.
#
# epochs=100: maximum training epochs (EarlyStopping will likely stop before)

history = model.fit(
    datagen.flow(X_train, y_train_cat, batch_size=64),
    steps_per_epoch=len(X_train) // 64,
    epochs=100,
    validation_data=(X_test, y_test_cat),
    callbacks=callbacks,
    verbose=1
)
TRAINING OUTPUT (sample epochs)
Epoch 1/100
781/781 [======] - 22s - loss: 1.6234 - accuracy: 0.4012 - val_loss: 1.4821 - val_accuracy: 0.4701
Epoch 10/100
781/781 [======] - 20s - loss: 0.9821 - accuracy: 0.6543 - val_loss: 0.9102 - val_accuracy: 0.6812
Epoch 25/100
781/781 [======] - 21s - loss: 0.7234 - accuracy: 0.7491 - val_loss: 0.7109 - val_accuracy: 0.7550
Epoch 47/100
781/781 [======] - 20s - loss: 0.5892 - accuracy: 0.7981 - val_loss: 0.6342 - val_accuracy: 0.7890
EarlyStopping: val_loss did not improve for 10 epochs. Restoring best weights.

Step 8: Evaluate and Visualise Results

# ── EVALUATION ───────────────────────────────────────────
test_loss, test_acc = model.evaluate(X_test, y_test_cat, verbose=0)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Loss:     {test_loss:.4f}")

# ── CONFUSION MATRIX ─────────────────────────────────────
# model.predict returns probability arrays. argmax gives the predicted class.
# WHY argmax? Softmax output is [0.02, 0.01, 0.85, ...]; argmax = index of max.

y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)    # axis=1: argmax across 10 classes
y_true = np.argmax(y_test_cat, axis=1)     # convert one-hot back to integers

cm = confusion_matrix(y_true, y_pred)

# Plot the matrix: rows are true classes, columns are predicted classes
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# ── VISUALISE TRAINING CURVES ────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(history.history['loss'],     label='Train Loss', color='#60a5fa')
axes[0].plot(history.history['val_loss'], label='Val Loss',   color='#f87171')
axes[0].set_title('Loss Curves')
axes[0].legend()

# Accuracy curve
axes[1].plot(history.history['accuracy'],     label='Train Acc', color='#34d399')
axes[1].plot(history.history['val_accuracy'], label='Val Acc',   color='#a78bfa')
axes[1].set_title('Accuracy Curves')
axes[1].legend()

plt.tight_layout()
plt.show()

# Full per-class report
print(classification_report(y_true, y_pred, target_names=class_names))
EVALUATION OUTPUT
Test Accuracy: 0.7890
Test Loss: 0.6342

              precision    recall  f1-score   support
airplane           0.83      0.82      0.83      1000
automobile         0.90      0.89      0.90      1000
bird               0.70      0.68      0.69      1000
cat                0.64      0.60      0.62      1000
deer               0.78      0.82      0.80      1000
dog                0.70      0.69      0.69      1000
frog               0.84      0.88      0.86      1000
horse              0.86      0.87      0.87      1000
ship               0.87      0.90      0.88      1000
truck              0.87      0.88      0.87      1000
accuracy                               0.79     10000

Section 10

Case Study: Pneumonia Detection from Chest X-Rays

Detecting Pneumonia in Hospital Chest X-Rays
It is 3 AM in a rural hospital. One radiologist is on call. A queue of 50 chest X-rays has built up from the emergency department. Each X-ray must be reviewed for pneumonia, a condition that can be fatal if left untreated.

In 2017, Stanford's CheXNet, a 121-layer CNN, was reported to reach radiologist-level performance at detecting pneumonia, and it was extended to cover 14 chest conditions. In our case study, we use the Kaggle Chest X-Ray dataset (5,863 images: Normal vs Pneumonia) to build a binary CNN classifier that a hospital could use as a screening assistant.

This is a binary classification problem. Every design decision we make below is driven by this medical context.

Medical Context Drives Every Design Choice

In medical imaging, False Negatives (missing real pneumonia) are more dangerous than False Positives. A missed case could cost a life; a false alarm leads to an extra doctor review. This influences our loss function, threshold, and metrics choices below.
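The effect of the decision threshold on missed cases can be seen with a toy example. The labels and predicted probabilities below are invented for illustration and are not results from the chest X-ray model; the helper name is hypothetical.

```python
def recall_and_precision(y_true, probs, threshold):
    """Recall and precision for the positive class at a given threshold."""
    tp = fp = fn = 0
    for label, p in zip(y_true, probs):
        predicted_positive = p >= threshold
        if predicted_positive and label == 1:
            tp += 1          # pneumonia correctly flagged
        elif predicted_positive and label == 0:
            fp += 1          # false alarm
        elif not predicted_positive and label == 1:
            fn += 1          # missed pneumonia: the costly error
    return tp / (tp + fn), tp / (tp + fp)


# Invented data: 1 = PNEUMONIA, 0 = NORMAL.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.90, 0.70, 0.45, 0.35, 0.60, 0.40, 0.32, 0.10]

recall_05, precision_05 = recall_and_precision(y_true, probs, threshold=0.5)
recall_03, precision_03 = recall_and_precision(y_true, probs, threshold=0.3)

print(recall_05, round(precision_05, 3))  # 0.5 0.667
print(recall_03, round(precision_03, 3))  # 1.0 0.571
```

Lowering the threshold from 0.5 to 0.3 catches every pneumonia case in this toy set at the cost of more false alarms, which is the trade a screening assistant usually prefers.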

Case Study: Full Code

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

# ── DATASET PATHS ────────────────────────────────────────────────────────
# Dataset: kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
# Structure:
#   chest_xray/
#     train/ NORMAL/ PNEUMONIA/
#     val/   NORMAL/ PNEUMONIA/
#     test/  NORMAL/ PNEUMONIA/

TRAIN_DIR = 'chest_xray/train'
VAL_DIR   = 'chest_xray/val'
TEST_DIR  = 'chest_xray/test'

IMG_SIZE  = (224, 224)   # WHY 224×224? VGG16 and many pretrained models
                         # expect 224×224. We use transfer learning below.
BATCH     = 32

# ── DATA GENERATORS ──────────────────────────────────────────────────────
# Training: light augmentation, because X-rays can be taken at slight angles,
#           different zoom levels, different patient orientations.
# Validation/Test: NO augmentation, so we evaluate on real unmodified images.
# rescale=1/255: normalise pixel values to [0,1] for all generators.

train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,          # X-rays rarely tilted more than ±10°
    zoom_range=0.1,
    horizontal_flip=True,        # Lungs are roughly symmetric (a flip also mirrors the heart)
    shear_range=0.1,             # Small shear: simulates patient lean
    brightness_range=[0.8, 1.2] # X-ray exposure can vary between machines
)

val_test_gen = ImageDataGenerator(rescale=1./255)

# flow_from_directory: reads images from folder structure
# class_mode='binary': returns 0 (NORMAL) or 1 (PNEUMONIA) labels
# WHY binary and not categorical?: Only 2 classes. Sigmoid output.
# target_size: resizes all images to IMG_SIZE on the fly.

train_data = train_gen.flow_from_directory(
    TRAIN_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary')

val_data = val_test_gen.flow_from_directory(
    VAL_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary',
    shuffle=False)   # WHY shuffle=False on val/test? We need predictions to
                      # align with true labels in the correct order for metrics.

test_data = val_test_gen.flow_from_directory(
    TEST_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary',
    shuffle=False)

# ── CLASS IMBALANCE CHECK ────────────────────────────────────────────────
# Kaggle dataset: ~3875 PNEUMONIA, ~1341 NORMAL → imbalanced, roughly 3:1.
# If we ignore this, the model could predict "always pneumonia" and get 74% accuracy!
# We compute class weights so the loss penalises mistakes on the minority class more.

n_normal    = 1341
n_pneumonia = 3875
total       = n_normal + n_pneumonia

class_weight = {
    0: total / (2 * n_normal),    # ~1.94: upweight Normal class
    1: total / (2 * n_pneumonia)   # ~0.67: downweight Pneumonia
}
print("Class weights:", class_weight)

# ── TRANSFER LEARNING WITH VGG16 ─────────────────────────────────────────
# WHY TRANSFER LEARNING?
# We have ~5,000 training images. Training a deep CNN from scratch on this
# would overfit badly. VGG16 was pretrained on ImageNet (1.2M images, 1000 classes).
# Its early layers already detect edges, textures, shapes, which are useful for X-rays too.
# We FREEZE the pretrained layers, add our own head, and only train the head first.

# weights='imagenet': download pretrained ImageNet weights
# include_top=False: exclude VGG16's original Dense classification head
# input_shape=(224, 224, 3): X-rays are grayscale but VGG16 expects 3 channels.
#   flow_from_directory defaults to color_mode='rgb', so each grayscale image
#   is already loaded with its single channel repeated 3 times.
# WHY NOT grayscale directly?: Pretrained weights expect 3 channels.

base_model = VGG16(weights='imagenet', include_top=False,
                    input_shape=(*IMG_SIZE, 3))

# Freeze ALL base model layers: only our custom head will be trained initially
base_model.trainable = False

# ── BUILD CUSTOM HEAD ON TOP OF VGG16 ────────────────────────────────────
model = models.Sequential([
    base_model,                             # Frozen VGG16 feature extractor

    # GlobalAveragePooling2D: replaces Flatten.
    # Output of VGG16 (with 224×224 input) is (7, 7, 512).
    # GAP2D reduces this to a 512-dim vector by averaging each feature map.
    # WHY GAP instead of Flatten?: 7×7×512 = 25,088. Flatten → Dense would be
    # 25,088 × 256 = 6.4 million params for one layer. GAP gives 512 × 256 = 131k.
    # GAP is regularising: inherently reduces overfitting.
    layers.GlobalAveragePooling2D(),

    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),

    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),

    # Binary output: 1 neuron, sigmoid activation
    # sigmoid outputs probability of Pneumonia (class 1): value in [0,1]
    # Default decision threshold: >0.5 → Pneumonia, ≤0.5 → Normal
    # We will LOWER this threshold to 0.3 to reduce False Negatives.
    layers.Dense(1, activation='sigmoid')
])

# ── COMPILE: binary_crossentropy for 2-class sigmoid output ──────────────
# WHY binary_crossentropy not categorical?:
#   binary_crossentropy = -[y*log(p) + (1-y)*log(1-p)]
#   Designed for a single sigmoid output. categorical_crossentropy is
#   for multi-class softmax with one-hot labels.
# metrics=['accuracy', AUC]: AUC-ROC is critical in medical imaging because
#   it measures discrimination across ALL thresholds, not just 0.5.

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)

# ── PHASE 1: Train head only (base frozen) ───────────────────────────────
history1 = model.fit(
    train_data,
    epochs=15,
    validation_data=val_data,
    class_weight=class_weight,     # Apply imbalance correction
    callbacks=[
        EarlyStopping(monitor='val_auc', patience=5, restore_best_weights=True,
                      mode='max'),  # mode='max': AUC should INCREASE
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
    ]
)

# ── PHASE 2: Fine-tune (unfreeze the last VGG16 block) ───────────────────
# Now that our head is trained, carefully unfreeze the last convolutional
# block of VGG16 (block5) and fine-tune with a MUCH smaller learning rate.
# WHY small LR for fine-tuning?: The pretrained weights are good starting points.
# A large LR would destroy them ("catastrophic forgetting").
# We want tiny adjustments to specialise for X-rays, not relearn from scratch.

base_model.trainable = True

# Only unfreeze block5 (the last 3 conv layers plus their pooling layer)
for layer in base_model.layers:
    if 'block5' in layer.name:
        layer.trainable = True
    else:
        layer.trainable = False

# Recompile with a 100× smaller learning rate for fine-tuning
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # 0.001 → 0.00001
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)

history2 = model.fit(
    train_data,
    epochs=20,
    validation_data=val_data,
    class_weight=class_weight,
    callbacks=[
        EarlyStopping(monitor='val_auc', patience=7, restore_best_weights=True, mode='max'),
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=4)
    ]
)

# ── EVALUATION WITH LOWERED THRESHOLD ────────────────────────────────────
# Default threshold is 0.5. In medical context, we lower to 0.3:
# → Catches more True Positives (reduce missed pneumonia cases)
# → Accepts more False Positives (extra doctor reviews, an acceptable trade-off)

y_pred_prob = model.predict(test_data, verbose=0).flatten()
y_true_test = test_data.classes

threshold = 0.3   # Medical decision: sensitivity over specificity
y_pred_class = (y_pred_prob >= threshold).astype(int)

auc_score = roc_auc_score(y_true_test, y_pred_prob)
print(f"\nAUC-ROC Score: {auc_score:.4f}")
print(classification_report(y_true_test, y_pred_class,
                             target_names=['Normal', 'Pneumonia']))
CASE STUDY RESULTS (threshold=0.30)
AUC-ROC Score: 0.9721

              precision    recall  f1-score   support

      Normal       0.91      0.85      0.88       234
   Pneumonia       0.93      0.96      0.94       390

    accuracy                           0.92       624

Key Metrics:
Recall (Pneumonia): 0.96 ← catches 96% of real cases ✓
Recall (Normal):    0.85 ← 15% flagged as false alarms (doctors review)
AUC-ROC:            0.97 ← near-perfect discrimination
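
Two pieces of arithmetic in the listing are easy to verify without TensorFlow: the class weights and the parameter saving from global average pooling. The sketch below uses the training counts quoted in the case study; `balanced_class_weights` is a helper name introduced here for illustration. It applies the same `total / (n_classes * count)` formula as the listing, which is also the "balanced" heuristic documented for scikit-learn's `compute_class_weight`.

```python
def balanced_class_weights(counts):
    """weight_c = total / (n_classes * count_c), the formula used in the listing."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Training-set counts quoted in the case study (0 = NORMAL, 1 = PNEUMONIA).
weights = balanced_class_weights({0: 1341, 1: 3875})
print({label: round(w, 2) for label, w in weights.items()})

# Weight-matrix sizes of the first Dense(256) layer (biases excluded).
flatten_params = 7 * 7 * 512 * 256   # Flatten -> Dense(256)
gap_params = 512 * 256               # GlobalAveragePooling2D -> Dense(256)
print(f"Flatten head: {flatten_params:,} weights, GAP head: {gap_params:,} weights")
```

With these counts the weights come out at roughly 1.94 for Normal and 0.67 for Pneumonia, and the pooled head needs 49 times fewer weights in its first dense layer.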

Section 11

Diagnosing Your CNN: Reading the Loss Curves

🔴 Overfitting Pattern
Epoch   Train Loss   Val Loss
10      0.45         0.48
20      0.28         0.59
30      0.15         0.82
40      0.08         1.15

Train loss keeps falling while val loss rises → memorising training data.

✅ Healthy Training Pattern
Epoch   Train Loss   Val Loss
10      0.52         0.54
20      0.38         0.40
30      0.31         0.33
40      0.28         0.30

Both losses decline together and converge. Small gap = healthy generalisation.

📉
Loss Not Decreasing
Underfitting / LR Issues
Loss stays flat from epoch 1. Causes: learning rate too small (try 10× larger), model too shallow, data normalisation missing. Try: increase model capacity or LR.
📈
Oscillating Loss
LR Too High / Bad Data
Loss bounces up and down wildly. Causes: learning rate too large (gradients overshoot), batch size too small (noisy gradients), corrupted data. Try: reduce LR by 10×, increase batch size.
🎯
Val Accuracy Plateaus Early
Need More Augmentation
Train acc keeps rising but val acc stops at ~65%. Dataset is too small or not diverse enough. Try: more aggressive augmentation, transfer learning, collect more data.
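
The first two patterns in this section can be turned into a rough automated check. This is a heuristic sketch only: the function name `diagnose` and the tolerance `tol` are assumptions made for illustration, and real curves are noisier than the tables above.

```python
def diagnose(train_loss, val_loss, tol=0.02):
    """Rough heuristic label for a pair of loss histories (illustrative only)."""
    train_drop = train_loss[0] - train_loss[-1]   # how far training loss fell
    val_rise = val_loss[-1] - min(val_loss)       # how far val loss is above its best
    if train_drop < tol:
        return "not learning"    # training loss is essentially flat
    if val_rise > tol:
        return "overfitting"     # train improves while val has drifted upward
    return "healthy"

# Loss values copied from the two tables in this section.
overfit = diagnose([0.45, 0.28, 0.15, 0.08], [0.48, 0.59, 0.82, 1.15])
healthy = diagnose([0.52, 0.38, 0.31, 0.28], [0.54, 0.40, 0.33, 0.30])
print(overfit, healthy)
```

With a Keras `History` object, the same function could be called as `diagnose(history.history['loss'], history.history['val_loss'])`.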

Section 12

Golden Rules for Building CNNs

🧠 CNN Practitioner Rules: Non-Negotiable
1. Always normalise your images first. Raw pixel values in the 0–255 range cause unstable gradients. Divide by 255.0 at minimum. Use ImageNet mean/std subtraction for transfer learning models. Never train on raw uint8 values.
2. Start with transfer learning if you have fewer than ~100,000 images. Training from scratch on small datasets almost always leads to overfitting. Use VGG16, ResNet50, or EfficientNet-B0 as a frozen base. Fine-tune later.
3. Double the filters after each pooling layer. This is the VGG principle: 32→64→128→256. As spatial resolution decreases, channel depth increases to preserve information capacity.
4. Use BatchNorm before heavy regularisation. BatchNorm acts as a mild regulariser. Don't apply both aggressive dropout AND L2 to the same layer, or you'll underfit. BatchNorm + light Dropout (0.25 on conv, 0.5 on dense) is a common recipe.
5. Monitor AUC-ROC, not just accuracy, for imbalanced datasets. A model that always predicts the majority class gets high accuracy. AUC-ROC measures how well the model discriminates between classes at all possible thresholds.
6. Use padding='same' in most conv layers. Valid padding shrinks the spatial dimensions at every layer, losing boundary information. Same padding preserves dimensions until you explicitly downsample with pooling.
7. For fine-tuning pretrained models: freeze all layers first, train the head for 10–15 epochs, then unfreeze only the last 1–2 blocks and retrain with an LR 10–100× smaller than the initial one. Unfreezing too much too early causes catastrophic forgetting.
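
Rules 3 and 6 can be checked with shape arithmetic alone. The sketch below traces a hypothetical stack of four stages, each a 3×3 convolution followed by 2×2 max pooling, on a 32×32 input such as CIFAR-10. The helper names are introduced here for illustration; the size formulas follow the usual 'same' and 'valid' conventions for stride-1 convolutions.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a conv layer under 'same' or 'valid' padding."""
    if padding == "same":
        return -(-size // stride)              # ceil(size / stride)
    return (size - kernel) // stride + 1       # 'valid': no padding added

def trace(input_size, filters, padding):
    """Return (height, width, channels) after each conv(3x3) + maxpool(2x2) stage."""
    size = input_size
    shapes = []
    for f in filters:
        size = conv_output_size(size, kernel=3, padding=padding)
        size = size // 2                       # 2x2 max pooling with stride 2
        shapes.append((size, size, f))
    return shapes

same_shapes = trace(32, [32, 64, 128, 256], "same")
valid_shapes = trace(32, [32, 64, 128, 256], "valid")
print("same :", same_shapes)
print("valid:", valid_shapes)   # a size of 0 means the stage cannot be built
```

With 'same' padding the map halves cleanly at every stage while the channels double. With 'valid' padding the fourth stage runs out of pixels, so the stack would fail to build in practice.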

Section 13

CNN vs Other Architectures: When to Use What

Architecture               Best For                                   Dataset Size   Speed       Accuracy
Custom CNN                 Learning, small projects, full control     50k–500k       Fast        Good
VGG16 (Transfer)           Medical, domain-specific, small data       1k–50k         Moderate    Very Good
ResNet50                   General vision, deep architectures         10k+           Moderate    Excellent
EfficientNet-B0            Production, mobile, efficiency-critical    Any            Very Fast   State of the art
Vision Transformer (ViT)   Very large datasets, attention-based       1M+            Slow        SOTA on large datasets
๐Ÿ†
The Practitioner's Decision Tree

Under 10,000 images? โ†’ Use transfer learning (VGG16/ResNet). General vision task, medium data? โ†’ ResNet50 or EfficientNet. Production/mobile deployment? โ†’ EfficientNet-B0 or MobileNetV3. Building something from scratch to learn? โ†’ Custom CNN with CIFAR-10 first. Massive dataset, unlimited compute? โ†’ Vision Transformer.
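
The decision tree can also be written as a small function. This is a hypothetical helper for illustration: the name, the argument names, and the 1M-image cut-off for Vision Transformers (taken from the table above) are assumptions, and the order of the checks is one reasonable reading of the tree rather than a fixed rule.

```python
def recommend_architecture(n_images, mobile=False, learning_exercise=False,
                           unlimited_compute=False):
    """Hypothetical helper mirroring the decision tree above."""
    if n_images < 10_000:
        return "Transfer learning (VGG16/ResNet)"
    if mobile:
        return "EfficientNet-B0 or MobileNetV3"
    if learning_exercise:
        return "Custom CNN (start with CIFAR-10)"
    if unlimited_compute and n_images >= 1_000_000:
        return "Vision Transformer"
    return "ResNet50 or EfficientNet"   # general vision task, medium data

print(recommend_architecture(5_216))                     # size of the pneumonia training set
print(recommend_architecture(50_000, learning_exercise=True))
```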
