CNN Python Tutorial: Build Convolutional Neural Networks

Section 01

The Story That Explains Why CNN Was Born

📖 Real World Analogy

The Detective Who Reads Faces

Imagine you are a detective. Someone hands you a photo of a suspect. You don't look at every single pixel in a 1000×1000 image independently — that's 1,000,000 inputs! Instead, your brain naturally scans for local patterns: you look for eyes first, then a nose, then the jawline. Then your brain combines those parts into a full face recognition.

Now imagine you had to describe that face to a computer that uses a plain MLP (Multi-Layer Perceptron). You would have to flatten the image into a 1,000,000-element vector (three times that for RGB), and a first hidden layer of just 1,000 neurons would already need a billion weights. That is impractical, and flattening discards the spatial structure of the image.
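To put rough numbers on this, the sketch below counts first-layer parameters for a dense layer and for a small convolutional layer on the same image. The hidden width of 1,000 units and the 32 filters of size 3×3 are illustrative assumptions, not figures from any specific network.

```python
# Rough first-layer parameter counts for a 1000x1000 grayscale image.
# hidden_units, filter_size and num_filters are illustrative assumptions.
height, width, channels = 1000, 1000, 1
hidden_units = 1000

inputs = height * width * channels
dense_params = inputs * hidden_units + hidden_units   # weights + biases

filter_size, num_filters = 3, 32
conv_params = (filter_size * filter_size * channels + 1) * num_filters

print(f"Dense layer: {dense_params:,} parameters")   # 1,000,001,000
print(f"Conv layer:  {conv_params:,} parameters")    # 320
```

The convolutional layer stays tiny because the same 3×3 weights are reused at every image position.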

Convolutional Neural Networks were designed to do exactly what your detective brain does: scan for local patterns, detect features, and gradually build global understanding. That is the entire philosophy.

In 1989, Yann LeCun and colleagues applied convolutional networks trained with backpropagation to handwritten digit recognition. In 1998 he published LeNet-5, a CNN that read handwritten digits on cheques for banks. In 2012, AlexNet won the ImageNet competition with a 15.3% top-5 error rate, well ahead of all traditional methods. CNNs had arrived. Today, they power face recognition on your phone, medical imaging, self-driving cars, and satellite image analysis.

🧠

The Core Problem CNN Solves

A plain neural network treats every pixel as an unrelated input: it has no concept of "neighbours". CNNs enforce a spatial prior: nearby pixels are related, the same pattern can be detected anywhere in the image (weight sharing gives translation equivariance, and pooling adds a degree of translation invariance), and complex features are built hierarchically from simple ones.

Section 02

CNN Architecture — The Big Picture

A CNN is not one single operation — it is a pipeline of specialised layers. Think of it as a factory assembly line where each station adds value.

🏭 The CNN Factory Assembly Line

INPUT

Raw Image → A 3D tensor of shape (Height, Width, Channels). E.g., a 32×32 RGB image is (32, 32, 3)

CONV

Convolutional Layer → Slides filters over the image to detect local patterns (edges, curves, textures). Each filter produces one feature map.

RELU

Activation (ReLU) → Adds non-linearity. Kills all negative values. Without this, stacking layers is mathematically pointless (just one big linear transform).

POOL

Pooling Layer → Shrinks the feature maps. Reduces computation. Makes detection robust to small shifts in position.

FLAT

Flatten → Collapses the 3D feature maps into a 1D vector so Dense layers can process it.

DENSE

Fully Connected Layers → Traditional MLP layers. Combine the extracted features to make the final classification decision.

OUTPUT

Softmax / Sigmoid → Converts raw scores into probabilities. Softmax for multi-class, Sigmoid for binary.

📐 CNN Architecture Diagram — Data Flow

⬆ Data shrinks spatially but grows in depth (channels) as it moves through the network. Spatial richness is traded for semantic richness.

Section 03

The Convolution Operation — The Heart of CNN

📖 Story

The Flashlight in the Dark Room

Imagine you are in a dark room with a large painting. You have a small flashlight. You slide the flashlight across the painting, scanning one small area at a time. At each position, you note what you see (edge? colour blob? texture?). When you're done, you have a complete "map" of what you found.

That flashlight is the kernel (filter). The map you draw is the feature map (activation map). Sliding the flashlight is the convolution operation.

Mathematically, a 2D convolution applies a small matrix (the filter/kernel) to patches of the input. The filter slides across the image with a given stride, performing an element-wise multiplication at each position, then summing the result into a single number.

Output Size (no padding)

W_out = ⌊(W - F) / S⌋ + 1

W = input width, F = filter size, S = stride; ⌊·⌋ rounds down when the stride does not divide evenly. For W=32, F=3, S=1 → output is 30.

Output Size (same padding)

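W_out = ⌈W / S⌉

With 'same' padding, zeros are added around the input so that the output size equals the input size when S=1 (for larger strides it is the input size divided by the stride, rounded up). This is the most common choice in deep nets.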

Parameter Count per Layer

params = (F × F × C_in + 1) × C_out

C_in = input channels, C_out = number of filters (output channels). +1 for bias per filter.

Receptive Field Growth

RF = 1 + N × (F - 1)

After stacking N conv layers with filter size F and stride 1 (no pooling in between), each output neuron "sees" a window this many input pixels wide. For example, two 3×3 layers give RF = 5. Deeper = wider view of the original image.
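The formulas above can be checked with a few small helper functions. This is a minimal sketch; the function names are made up for illustration, and the receptive-field helper assumes stride 1 with no pooling between layers.

```python
import math

def conv_output_size(w, f, s=1, padding="valid"):
    """Spatial output size along one dimension."""
    if padding == "same":
        return math.ceil(w / s)
    return (w - f) // s + 1

def conv_params(f, c_in, c_out):
    """Learnable parameters of a conv layer: f x f filters plus one bias per filter."""
    return (f * f * c_in + 1) * c_out

def receptive_field(num_layers, f):
    """Receptive field of a stack of stride-1 conv layers with f x f filters."""
    return 1 + num_layers * (f - 1)

print(conv_output_size(32, 3, 1))           # 30
print(conv_output_size(32, 3, 2, "same"))   # 16
print(conv_params(3, 3, 32))                # 896
print(receptive_field(2, 3))                # 5
```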

🔬 Numerical Example: A 5×5 Image, 3×3 Filter

🖼 Input Patch (5×5) — one channel shown

1	0	1	0	1
0	1	1	1	0
1	1	0	1	1
0	0	1	0	0
1	0	0	1	1

✅ Kernel (3×3) — Vertical Edge Detector

-1	0	+1
-1	0	+1
-1	0	+1

At top-left position: (1×-1)+(0×0)+(1×1)+(0×-1)+(1×0)+(1×1)+(1×-1)+(1×0)+(0×1) = 0
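The same calculation can be run for every filter position with a few lines of NumPy. This is a minimal sketch of the sliding-window operation (strictly a cross-correlation, with no kernel flip, which is what deep learning frameworks compute under the name "convolution"); the helper name `correlate2d_valid` is an assumption for illustration.

```python
import numpy as np

image = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 1],
])
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # multiply element-wise, then sum
    return out

feature_map = correlate2d_valid(image, kernel)
print(feature_map.tolist())
# [[0, 0, 0], [1, 0, -1], [-1, 1, 1]]
```

The 5×5 input shrinks to a 3×3 feature map, matching W_out = (5 - 3) / 1 + 1 = 3, and the top-left entry is the sum of the nine products written out above.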

💡

Why Different Filters Detect Different Things

A vertical edge kernel has opposite signs in left and right columns. A horizontal edge kernel has them in top and bottom rows. A blur kernel has equal weights (1/9 each). In deep learning, we don't hand-craft these kernels — the network learns the optimal kernel values through backpropagation. That's the magic: the filters self-organise to detect whatever is most useful for the task.

🎯 What CNN Filters Learn at Each Depth

📐

Layer 1 Filters

Low-Level Features

Detect simple edges (horizontal, vertical, diagonal), colour blobs, and gradients. These are universal — same patterns appear in every image.

→ Gabor-like edge detectors

🔶

Layer 2–3 Filters

Mid-Level Features

Detect textures, corners, curves, simple shapes. Combinations of edges. Like recognising "there is a circular region here".

→ Texture and shape detectors

🐾

Deep Filters

High-Level Features

Detect semantic objects: eyes, wheels, fur patterns. Task-specific. A model trained on cats learns "cat face" detectors; one trained on cars learns "wheel" detectors.

→ Object-part detectors

Section 04

Padding and Stride — The Controls

🔳

No Padding (valid)

padding='valid'

Filter only placed where it fits fully inside the image. Output is smaller than input. Edges of the image are seen less often by the filter — information loss at borders.

✔ Smaller output, so less computation downstream

✘ Corners underrepresented

⬜

Zero Padding (same)

padding='same'

Zeros are added around the input so the output has the same spatial size as the input (when stride=1). Border and corner pixels are covered by more filter positions than with 'valid' padding, so they contribute more evenly. Standard choice in most modern architectures.

✔ Preserves spatial dimensions

✘ Slight boundary artefact from zeros

➡️

Stride

strides=(2,2)

Step size of the filter. Stride=1: move 1 pixel at a time (dense coverage). Stride=2: skip every other pixel — output is halved in each dimension. Used instead of pooling to downsample in modern nets (e.g., ResNet).

✔ Learnable downsampling

✘ Risk of missing fine patterns
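To see how padding and stride change the output shape, here is a small single-channel sketch in NumPy. The function `conv2d` is a hypothetical helper written for this example, not a library call; its 'same' branch splits the zero padding as evenly as possible, with any extra row or column added at the bottom and right.

```python
import math
import numpy as np

def conv2d(image, kernel, stride=1, padding="valid"):
    """Single-channel 2D cross-correlation with 'valid' or 'same' zero padding."""
    k = kernel.shape[0]   # assumes a square kernel
    if padding == "same":
        pads = []
        for size in image.shape:
            out = math.ceil(size / stride)
            total = max((out - 1) * stride + k - size, 0)
            pads.append((total // 2, total - total // 2))
        image = np.pad(image, pads, mode="constant", constant_values=0)
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0   # simple blur kernel

for padding in ("valid", "same"):
    for stride in (1, 2):
        shape = conv2d(image, kernel, stride, padding).shape
        print(f"padding={padding}, stride={stride} -> {shape}")
# padding=valid, stride=1 -> (4, 4)
# padding=valid, stride=2 -> (2, 2)
# padding=same, stride=1 -> (6, 6)
# padding=same, stride=2 -> (3, 3)
```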

Section 05

Pooling Layers — Compressing Without Losing Soul

After convolution, we have feature maps that are large and potentially redundant. Pooling aggregates each small region into one value. It makes the representation more compact, reduces computation (and the parameter count of any Dense layers that follow), and introduces a degree of translation invariance: a feature detected slightly off-position still fires.

⬛ Before: Feature Map (4×4)

12	20	8	11
18	5	14	7
3	9	22	16
6	13	4	19

✅ After MaxPool 2×2 (2×2 output)

20	14
13	22

Max of each 2×2 block is retained. 4×4 → 2×2. 75% of values discarded, most important kept.
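The pooling step above can be reproduced in NumPy by viewing the 4×4 map as a grid of 2×2 blocks. This is a minimal sketch for non-overlapping windows on a single feature map; the helper name `max_pool_2x2` is an assumption for illustration.

```python
import numpy as np

feature_map = np.array([
    [12, 20,  8, 11],
    [18,  5, 14,  7],
    [ 3,  9, 22, 16],
    [ 6, 13,  4, 19],
])

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (stride 2) on a 2D array with even sides."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)   # axes 1 and 3 index inside each block
    return blocks.max(axis=(1, 3))

print(max_pool_2x2(feature_map).tolist())   # [[20, 14], [13, 22]]
```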

🔝

Max Pooling

tf.keras.layers.MaxPool2D

Takes the maximum value in each window. Focuses on the strongest activation — "Was this feature present at all?" Best for classification tasks. Most commonly used.

➗

Average Pooling

tf.keras.layers.AvgPool2D

Takes the average of each window. Smoother representation. Its global variant, Global Average Pooling, is used at the end of many modern networks in place of Flatten and large hidden Dense layers.

🌍

Global Average Pooling

GlobalAveragePooling2D

Collapses each entire feature map into a single number (its mean). If you have 64 feature maps, output is a vector of 64 values. Replaces Flatten and the large hidden Dense layers; typically only a final classification layer follows. Used in ResNet, MobileNet.
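Average pooling and global average pooling differ only in the size of the window they average over. The sketch below applies both to a made-up stack of three 4×4 feature maps stored in (height, width, channels) order.

```python
import numpy as np

# A hypothetical stack of 3 feature maps, each 4x4, in (height, width, channels) layout.
feature_maps = np.arange(48, dtype=float).reshape(4, 4, 3)

# 2x2 average pooling: mean of each non-overlapping block, per channel.
h, w, c = feature_maps.shape
avg_pooled = feature_maps.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# Global average pooling: one mean per feature map.
gap = feature_maps.mean(axis=(0, 1))

print(avg_pooled.shape)   # (2, 2, 3)
print(gap.tolist())       # [22.5, 23.5, 24.5]
```

Global average pooling returns one value per channel regardless of the spatial size, which is why it can feed a classifier without a Flatten step.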

Section 06

Activation Functions — Why Non-Linearity Matters

⚠️

Without Activation Functions, CNNs Are Useless

A stack of linear operations (conv + conv + conv) is mathematically equivalent to a single linear operation. You could have 100 layers and it's no more powerful than 1. Activation functions break this — they introduce non-linearity, enabling the network to model complex, curved decision boundaries.
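A tiny numerical check makes the point concrete. The matrices below are arbitrary illustrative values standing in for two layers (a convolution is also a linear map, so the same argument applies).

```python
import numpy as np

x = np.array([1.0, -1.0])
W1 = np.array([[1.0, 2.0],
               [3.0, -1.0]])   # first "layer"
W2 = np.array([[2.0, 1.0]])    # second "layer"

# Two linear layers in sequence...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer whose weight matrix is the product W2 @ W1.
one_layer = (W2 @ W1) @ x
print(two_layers, one_layer)   # [2.] [2.]

# With ReLU between the layers, the result is no longer that single linear map.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
print(with_relu)               # [4.]
```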

📈

ReLU

activation='relu' → max(0, x)

The workhorse. Returns x if x>0, else 0. Why this syntax? max(0,x) is cheap to compute and its gradient is simply 1 (for x>0) or 0 — no expensive exp() calculations.

Why used: Prevents vanishing gradients better than sigmoid/tanh. Sparse activation (many zeros) = efficient computation.

✔ Fast, sparse, less prone to vanishing gradients

✘ "Dying ReLU" problem (neurons stuck at 0)

🔄

Leaky ReLU / ELU

layers.LeakyReLU() (or activation='leaky_relu' in Keras 3)

Fixes dying ReLU: instead of 0 for negative x, it returns α×x (small slope, e.g. 0.01). ELU uses an exponential curve for negatives. Why used: Keeps all neurons "alive" during training. Preferred for very deep networks.

✔ No dead neurons

✘ Extra hyperparameter α
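The three activations are short enough to write out directly. This is a minimal NumPy sketch; the default `alpha` values are illustrative choices, not necessarily the defaults of any particular Keras version.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x).tolist())         # [0.0, 0.0, 0.0, 1.5]
print(leaky_relu(x).tolist())   # [-0.02, -0.005, 0.0, 1.5]
print(elu(x).round(3).tolist())
```

Negative inputs are zeroed by ReLU but keep a small non-zero output (and gradient) under Leaky ReLU and ELU.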

📉

Softmax

activation='softmax'

Used in the final output layer for multi-class classification, not in hidden layers. Converts raw logits into a probability distribution that sums to 1: softmax(x_i) = e^(x_i) / Σ_j e^(x_j).

Why this formula? The exponential amplifies differences between logits, making the largest value dominate. Sum normalisation ensures valid probabilities.

✔ Clean probability output

✘ Only for output, never hidden layers
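A minimal NumPy sketch of the formula follows. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the example logits are made up.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)   # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs.round(3).tolist())   # [0.659, 0.242, 0.099]
```

A gap of 1.0 between the two largest logits already turns into a probability ratio of about e ≈ 2.7, which is the "amplifying" effect described above.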

Section 07

Batch Normalisation and Dropout — The Regulators

📊

Batch Normalisation

Normalises the output of a layer so it has zero mean and unit variance, per mini-batch. Then rescales with two learnable parameters γ (scale) and β (shift).

Why is this syntax used?
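BatchNormalization() was placed after Conv2D and before the activation in the original paper; placing it after the activation is also common, and the model in Section 09 does so. It was originally motivated as a fix for "internal covariate shift", the problem where layer inputs keep changing distribution during training, making learning slow.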

Effect: Allows higher learning rates, acts as a mild regulariser, dramatically stabilises training of deep networks.

keras.layers.BatchNormalization()
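The training-time computation can be sketched in a few lines of NumPy. This is a simplified illustration: it omits the running averages that a real layer keeps for inference, and the input tensor is random data generated for the example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch norm for activations shaped (batch, height, width, channels)."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)   # one mean per channel
    var = x.var(axis=(0, 1, 2), keepdims=True)     # one variance per channel
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 4, 2))
gamma = np.ones(2)    # learnable scale, initialised to 1
beta = np.zeros(2)    # learnable shift, initialised to 0

y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=(0, 1, 2)).round(3))   # close to 0 per channel
print(y.std(axis=(0, 1, 2)).round(3))    # close to 1 per channel
```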

🎲

Dropout

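During training, randomly sets a fraction of neurons to 0 at each forward pass and scales the surviving activations by 1/(1 − rate) so the expected sum is unchanged. At inference, all neurons are active and no scaling is applied (this is the "inverted dropout" that Keras implements).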

Why is this syntax used?
Dropout(rate=0.5) means 50% of neurons are killed per step. This forces the network to not rely on any one neuron — it must learn redundant representations. Like training the team to work even when some members are absent.

Position: After Dense layers (rarely after Conv layers in modern nets — spatial dropout exists for that).

keras.layers.Dropout(0.5)
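Here is a minimal NumPy sketch of inverted dropout, the variant Keras uses: surviving units are scaled up during training so that nothing needs to change at inference. The function name and the all-ones input are illustrative.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: scale kept units during training, do nothing at inference."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate   # True for units that survive
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
x = np.ones(10)

print(dropout(x, rate=0.5, training=True, rng=rng))    # a mix of 0.0 and 2.0
print(dropout(x, rate=0.5, training=False, rng=rng))   # unchanged: all 1.0
```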

📐

L2 Regularisation

Adds a penalty proportional to the square of the weight magnitudes to the loss function. Forces weights to stay small.

Why used: Large weights memorise training data. L2 keeps them modest. In Keras: kernel_regularizer=l2(0.001).

Can be combined with Batch Norm, but the effect changes: when BN follows a layer, the scale of that layer's weights no longer affects its normalised output, so L2 acts less as a direct limit on capacity.

regularizers.l2(0.001)
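The penalty itself is one line: the regularisation factor times the sum of squared weights, added to the loss. The sketch below uses two made-up weight vectors to show how strongly large weights are penalised.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * np.sum(np.square(weights))

small = np.array([0.1, -0.2, 0.05])
large = np.array([3.0, -4.0, 2.0])

print(l2_penalty(small, 0.001))   # about 0.0000525
print(l2_penalty(large, 0.001))   # about 0.029
```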

Section 08

How CNNs Learn — Backpropagation Through Convolutions

Forward Pass

Input image passes through all layers. At each conv layer, current filter weights are used to compute feature maps. At the output, logits (raw scores) are produced.

Loss Computation

The loss function (e.g. Cross-Entropy for classification) compares predicted probabilities to true labels. Loss = −Σ y_true × log(y_pred). High loss = wrong prediction, low loss = correct.

Backward Pass (Gradients)

The chain rule of calculus propagates the loss gradient back through each layer. For conv layers, the gradient w.r.t. each filter weight is computed by correlating the upstream gradient with the input patch. This tells us "increase or decrease this filter weight?"

Weight Update (Optimiser)

All filter weights and biases are updated: w = w - lr × gradient. With Adam optimiser, the learning rate is adaptive per-parameter. Filters gradually evolve from random noise into meaningful feature detectors.

Repeat for N Epochs

The entire training dataset is passed through multiple times (epochs). Each epoch refines the filters further. Early layers (edges) tend to settle quickly; deeper layers often take longer, since the features they build on are still changing.
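The five steps above can be sketched on the smallest possible model. The example below trains a logistic-regression classifier (a single neuron, not a CNN) on made-up data, so that every step fits in a few lines of NumPy; a CNN follows the same loop with more layers in the forward and backward passes.

```python
import numpy as np

# Toy data: the label is 1 when the two features sum to more than 1.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5
losses = []

for epoch in range(200):
    # Forward pass: logits -> probabilities
    logits = X @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))

    # Loss: binary cross-entropy (clipped to avoid log(0))
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    losses.append(loss)

    # Backward pass: gradients of the loss w.r.t. w and b
    grad_logits = (probs - y) / len(y)
    grad_w = X.T @ grad_logits
    grad_b = grad_logits.sum()

    # Weight update: w = w - lr * gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The first loss is ln 2 ≈ 0.693 because the untrained model predicts 0.5 for every example; it falls as the weights move along the negative gradient.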

Section 09

Full Python Implementation — Step by Step

Now we build a complete CNN from scratch using TensorFlow/Keras. Each syntax choice is explained in detail. We will structure the code into clean, logical steps.

Step 1 — Import Libraries

# Core libraries — understand each import's role
import numpy as np                          # Array operations, data manipulation
import matplotlib.pyplot as plt            # Visualisation of images, loss curves
import tensorflow as tf                     # Deep learning framework

# Keras API — tf.keras is the high-level API built into TensorFlow 2.x
# 'from tensorflow import keras' is equivalent
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
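# Note: ImageDataGenerator is deprecated in recent TensorFlow releases in favour of
# tf.data pipelines and Keras preprocessing layers; it is kept here to match the tutorial.
from tensorflow.keras.preprocessing.image import ImageDataGenerator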

# sklearn utilities for evaluation
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Set seeds for reproducibility — without this, results differ each run
np.random.seed(42)
tf.random.set_seed(42)

📘

Why tf.keras and Not Standalone Keras?

TensorFlow 2.x ships with Keras built in. tf.keras is tightly integrated with TF's computational graph, automatic differentiation (tf.GradientTape), and GPU support. Standalone Keras (multi-backend again since Keras 3) is an alternative, but tf.keras remains a widely used choice for TensorFlow projects, and it is what this tutorial uses.

Step 2 — Load and Prepare the CIFAR-10 Dataset

# CIFAR-10: 60,000 colour images, 32×32 pixels, 10 classes
# Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Shapes after loading:
print(f"X_train shape: {X_train.shape}")  # (50000, 32, 32, 3)
print(f"X_test shape:  {X_test.shape}")   # (10000, 32, 32, 3)
print(f"y_train shape: {y_train.shape}")  # (50000, 1)

# ── WHY NORMALISE TO [0,1]? ──────────────────────────────────────────────
# Pixel values are uint8: 0 to 255. Neural nets train poorly with large inputs:
#   • Gradients become unstable (very large or very small)
#   • Weight initialisations are calibrated for ~unit-scale inputs
# Dividing by 255.0 maps values to [0.0, 1.0] and makes training stable.

X_train = X_train.astype('float32') / 255.0
X_test  = X_test.astype('float32')  / 255.0

# ── WHY float32 and NOT float64? ────────────────────────────────────────
# GPUs are optimised for float32. float64 gives no benefit for neural net
# training but uses 2× memory and is considerably slower on most GPU hardware.

# ── ONE-HOT ENCODING THE LABELS ─────────────────────────────────────────
# y_train contains integers 0-9. Softmax outputs 10 probabilities.
# We need labels as 10-element vectors for categorical_crossentropy.
# E.g.: class 3 → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

y_train_cat = tf.keras.utils.to_categorical(y_train, 10)
y_test_cat  = tf.keras.utils.to_categorical(y_test,  10)

# Class names for display
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
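For intuition, one-hot encoding can be written in plain NumPy. The helper below is a sketch of the idea behind to_categorical for integer labels, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer labels of shape (n,) or (n, 1) to one-hot rows."""
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype="float32")[labels]

y = np.array([[3], [0], [9]])   # same (n, 1) layout as the CIFAR-10 labels
encoded = one_hot(y, 10)
print(encoded.shape)            # (3, 10)
print(encoded[0].tolist())      # 1.0 at index 3, 0.0 elsewhere
```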

Step 3 — Data Augmentation

# ── WHY DATA AUGMENTATION? ───────────────────────────────────────────────
# CNNs can memorise 50,000 training images perfectly (overfit).
# Augmentation artificially expands the dataset by creating variations
# of each image. This forces the network to learn features that are
# INVARIANT to flips, shifts, and rotations — more generalisable.

# ImageDataGenerator applies transforms RANDOMLY on the fly during training.
# The original images on disk are never modified.

datagen = ImageDataGenerator(
    rotation_range=15,      # Rotate image up to ±15 degrees randomly
                              # WHY 15°: CIFAR-10 objects are mostly upright; ±15° is realistic
    width_shift_range=0.1,   # Shift horizontally by up to 10% of width
    height_shift_range=0.1, # Shift vertically by up to 10% of height
                              # WHY SHIFTS: objects aren't always perfectly centred
    horizontal_flip=True,    # Mirror image left-right 50% of the time
                              # WHY: A cat is still a cat when mirrored. NOT used for digits/text.
    zoom_range=0.1,          # Zoom in/out by up to 10%
    fill_mode='nearest'      # How to fill pixels created by shifts: copy nearest edge pixel
                              # Alternatives: 'reflect', 'wrap', 'constant'
)

# datagen.fit() is only required for the featurewise_* and zca_whitening options,
# which compute dataset statistics. None are used here, so this call is optional.
datagen.fit(X_train)
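To show what two of these transforms do to a single image, here is a simplified NumPy sketch of a random horizontal flip followed by a random shift with edge-pixel filling (the idea behind fill_mode='nearest'). It uses whole-pixel shifts and a hypothetical helper name; ImageDataGenerator's own implementation differs in detail.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=0.1):
    """Randomly mirror an (H, W, C) image and shift it, filling gaps with edge pixels."""
    h, w, _ = image.shape
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                        # horizontal flip
    max_dy, max_dx = int(max_shift * h), int(max_shift * w)
    dy = int(rng.integers(-max_dy, max_dy + 1))          # vertical shift in pixels
    dx = int(rng.integers(-max_dx, max_dx + 1))          # horizontal shift in pixels
    pad = max(abs(dy), abs(dx))
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top, left = pad - dy, pad - dx
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3)).astype("float32")   # stand-in for one CIFAR-10 image
augmented = random_flip_and_shift(image, rng)
print(augmented.shape)   # (32, 32, 3)
```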

Step 4 — Build the CNN Model

# ── SEQUENTIAL API vs FUNCTIONAL API ────────────────────────────────────
# Sequential: simple stack of layers — suitable for most standard CNNs.
# Functional API: used when you need skip connections, multiple inputs/outputs.
# We use Sequential here for clarity.

def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = models.Sequential([

        # ── BLOCK 1: First Convolutional Block ──────────────────────────────
        # Conv2D(32, (3,3), ...)
        #   32        = number of filters. Produces 32 feature maps.
        #              WHY 32 FIRST?: Start small — capture basic features.
        #              Larger first layers waste computation on simple features.
        #   (3,3)     = kernel size. 3×3 is the gold standard (proven by VGG).
        #              WHY 3×3?: Two 3×3 convs have same receptive field as one 5×5
        #              but with fewer parameters and more non-linearity.
        #   padding='same' = preserve spatial dimensions (32×32 → 32×32)
        #   activation='relu' = apply ReLU after convolution (most common choice)
        #   input_shape = only needed on FIRST layer. TensorFlow infers rest.

        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                       input_shape=input_shape,
                       kernel_regularizer=regularizers.l2(1e-4)),
                       # kernel_regularizer=l2(1e-4): add tiny L2 penalty on weights
                       # 1e-4 is a mild penalty — prevents extreme weight values

        layers.BatchNormalization(),  # Whether to normalise before or after ReLU is debated.
                                       # Original paper: before ReLU. Many implementations: after.
                                       # Because activation='relu' is set inside Conv2D above,
                                       # BN here runs after the activation.

        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)),
        # WHY TWO CONV LAYERS BEFORE POOLING?
        # Stacking 2 conv layers increases the effective receptive field
        # (two 3×3 layers = 5×5 receptive field) while using fewer parameters
        # than a single 5×5 layer. More non-linearity = richer representations.

        layers.MaxPooling2D((2, 2)),  # Halve spatial dims: 32×32 → 16×16
                                       # (2, 2) = pool window; strides default to the pool size
        layers.Dropout(0.25),        # Drop 25% of units. Lighter dropout after conv (25% vs 50%)
                                      # because conv layers have fewer params and need less regularisation.

        # ── BLOCK 2: Deeper, More Filters ───────────────────────────────────
        # 64 filters: we DOUBLE the number after pooling. WHY?
        # Pooling halved the spatial resolution (information loss).
        # Doubling filters compensates by increasing the channel depth —
        # we trade spatial richness for semantic richness.

        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                       kernel_regularizer=regularizers.l2(1e-4)),
        layers.MaxPooling2D((2, 2)),  # 16×16 → 8×8
        layers.Dropout(0.25),

        # ── BLOCK 3: Deepest Block ──────────────────────────────────────────
        layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                        kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),  # 8×8 → 4×4
        layers.Dropout(0.25),

        # ── CLASSIFICATION HEAD ─────────────────────────────────────────────
        # Flatten converts 3D tensor (4, 4, 128) → 1D vector (2048,)
        # WHY FLATTEN NOT GlobalAveragePooling2D?
        # Flatten + Dense gives more parameters (more capacity) but risks overfit.
        # GAP averages each feature map down to 1×1: far fewer params, but the
        # spatial layout of the features is discarded.
        # For CIFAR-10 (small images), Flatten works well.

        layers.Flatten(),

        # Dense(512): fully connected layer with 512 neurons
        # WHY 512?: Rule of thumb — start large and narrow down. Gives the
        # network capacity to combine features. Reduce toward output size.
        layers.Dense(512, activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Dropout(0.5),   # Heavier dropout on Dense layers (50%) — more overfit risk

        # Final output layer: one neuron per class (10 for CIFAR-10)
        # activation='softmax': converts the raw scores to probabilities summing to 1
        # WHY NOT 'sigmoid' here?: sigmoid outputs independent probabilities (0-1) each.
        # softmax creates a COMPETITION between classes, which is exactly what we want
        # for mutually exclusive classification (one image = one class).
        layers.Dense(num_classes, activation='softmax')
    ])
    return model

model = build_cnn()
model.summary()

MODEL SUMMARY (abbreviated)

Layer (type)                   Output Shape          Param #
================================================================
conv2d (Conv2D)                (None, 32, 32, 32)    896
batch_normalization            (None, 32, 32, 32)    128
conv2d_1 (Conv2D)              (None, 32, 32, 32)    9,248
max_pooling2d (MaxPooling2D)   (None, 16, 16, 32)    0
dropout (Dropout)              (None, 16, 16, 32)    0
conv2d_2 (Conv2D)              (None, 16, 16, 64)    18,496
...
dense (Dense)                  (None, 512)           1,049,088
dense_1 (Dense)                (None, 10)            5,130
================================================================
Total params: 1,196,586
Trainable params: 1,195,114
Non-trainable params: 1,472   ← BatchNorm running statistics
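The convolutional and dense rows of the summary can be cross-checked by hand with the parameter formula from Section 03. The sketch below walks through the three blocks of build_cnn as written above (3×3 filters, 'same' padding, one 2×2 pooling per block); the helper names are made up for this example, and BatchNormalization parameters are left out.

```python
def conv_params(f, c_in, c_out):
    """(F x F x C_in + 1) x C_out, as in Section 03."""
    return (f * f * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """One weight per input plus one bias, for each output neuron."""
    return (n_in + 1) * n_out

size, channels = 32, 3
blocks = [(32, 2), (64, 2), (128, 1)]   # (filters, number of conv layers) per block

conv_counts = []
for filters, n_convs in blocks:
    for _ in range(n_convs):
        conv_counts.append(conv_params(3, channels, filters))   # 'same' keeps the size
        channels = filters
    size //= 2                                                   # 2x2 max pooling

flat = size * size * channels
print(conv_counts)               # [896, 9248, 18496, 36928, 73856]
print(flat)                      # 2048
print(dense_params(flat, 512))   # 1049088
print(dense_params(512, 10))     # 5130
```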

Step 5 — Compile the Model

# ── COMPILE: tells Keras HOW to train ────────────────────────────────────

# optimizer='adam': Adaptive Moment Estimation
#   WHY ADAM over plain SGD?
#   Adam maintains per-parameter learning rates, adapted based on:
#     - first moment (m): exponential average of gradients, like momentum (β1=0.9 default)
#     - second moment (v): exponential average of squared gradients (β2=0.999)
#   Result: fast convergence, robust to hyperparameter choices.
#   learning_rate=0.001 is Adam's default and usually a good start.

# loss='categorical_crossentropy':
#   WHY THIS LOSS for classification?
#   It measures the distance between the predicted probability distribution
#   and the true one-hot distribution. Mathematically: -Σ y_true * log(y_pred).
#   Penalises confident wrong predictions VERY harshly (log(0) → -∞).
#   Use 'sparse_categorical_crossentropy' if labels are integers (no to_categorical).

# metrics=['accuracy']: tracks accuracy during training for human monitoring.
#   Note: accuracy is NOT the loss — just a readable metric for logging.

model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
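To make the "per-parameter learning rate" idea concrete, here is a minimal NumPy sketch of a single Adam update, following the standard published update rule. The function name and the example gradients are illustrative; this is not how Keras implements the optimiser internally.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; returns the new weights and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])   # two parameters with very different gradient scales
m = np.zeros_like(w)
v = np.zeros_like(w)

w, m, v = adam_step(w, grad, m, v, t=1)
print(w)   # both weights move by about lr = 0.001
```

Although the two gradients differ by a factor of 10,000, both weights take a step of roughly the same size, because each gradient is divided by its own running magnitude.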

Step 6 — Define Callbacks

# ── CALLBACKS: actions taken at end of each epoch ────────────────────────

# 1. EarlyStopping:
#    Monitor validation loss. If it doesn't improve for 'patience' epochs,
#    stop training. Prevents wasting compute on a model that's overfit.
#    restore_best_weights=True: roll back to the epoch with best val_loss.
#    WHY monitor val_loss not val_accuracy?: accuracy is coarse (steps of 1/N),
#    loss is continuous — more sensitive signal for early stopping.

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)

# 2. ReduceLROnPlateau:
#    When val_loss stops improving, reduce learning rate by factor 'factor'.
#    WHY?: With a fixed step size the optimiser can bounce around a minimum
#    without settling. Reducing LR lets it take smaller steps and fine-tune more precisely.
#    factor=0.5 means LR is halved. min_lr prevents it from reaching zero.

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-6,
    verbose=1
)

# 3. ModelCheckpoint:
#    Save the full model (architecture + weights) whenever val_accuracy improves.
#    WHY save_best_only=True?: training can degrade after a peak epoch.
#    This guarantees we always have the best version on disk.

checkpoint = ModelCheckpoint(
    filepath='best_cnn_model.h5',
    monitor='val_accuracy',
    save_best_only=True,
    verbose=1
)

callbacks = [early_stop, reduce_lr, checkpoint]
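The patience logic behind the first two callbacks can be replayed on a list of validation losses. The function below is a simplified pure-Python sketch with the same patience, factor, and min_lr values as above; it is a hypothetical helper, and the real Keras callbacks have extra options (such as min_delta and cooldown) that are ignored here.

```python
def simulate_callbacks(val_losses, stop_patience=10, lr_patience=5,
                       factor=0.5, lr=1e-3, min_lr=1e-6):
    """Replay per-epoch validation losses through simplified
    early-stopping and reduce-on-plateau rules."""
    best = float("inf")
    best_epoch = None
    stop_wait = 0
    lr_wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            stop_wait = 0
            lr_wait = 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)   # halve the learning rate
                lr_wait = 0
            if stop_wait >= stop_patience:
                return epoch, best_epoch, lr    # stop; the best epoch's weights are restored
    return len(val_losses) - 1, best_epoch, lr

# Made-up loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 12
stopped_at, best_epoch, final_lr = simulate_callbacks(losses)
print(stopped_at, best_epoch, final_lr)   # 12 2 0.00025
```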

Step 7 — Train the Model

# ── model.fit() — the training loop ──────────────────────────────────────
#
# datagen.flow(X_train, y_train_cat, batch_size=64):
#   Creates a generator that yields augmented batches of 64 images.
#   WHY batch_size=64?:
#     - Too small (8): noisy gradient estimates, slow progress
#     - Too large (512): smooth but may get stuck in sharp minima
#     - 32-128 is the practical sweet spot for CNNs on standard datasets
#
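# steps_per_epoch = len(X_train) // batch_size:
#   How many batches = one epoch. 50,000 // 64 = 781 steps per epoch.
#   WHY specify this?: A plain Python generator never signals the end of an
#   epoch. datagen.flow() does know its own length, so here the argument is
#   optional and mainly makes the epoch length explicit.
#
# validation_data=(X_test, y_test_cat):
#   After each epoch, evaluate on the test set. It is never used for gradient updates.
#   Caveat: early stopping and checkpointing then select the model on the test set,
#   so the final test score is optimistic. For an unbiased estimate, hold out a
#   separate validation split from the training data.
#   WHY not augment validation data?: We want to evaluate on REAL images,
#   not augmented ones; augmentation is only for training diversity.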
#
# epochs=100: maximum training epochs (EarlyStopping will likely stop before)

history = model.fit(
    datagen.flow(X_train, y_train_cat, batch_size=64),
    steps_per_epoch=len(X_train) // 64,
    epochs=100,
    validation_data=(X_test, y_test_cat),
    callbacks=callbacks,
    verbose=1
)

TRAINING OUTPUT (sample epochs)

Epoch 1/100
781/781 [======] - 22s - loss: 1.6234 - accuracy: 0.4012 - val_loss: 1.4821 - val_accuracy: 0.4701
Epoch 10/100
781/781 [======] - 20s - loss: 0.9821 - accuracy: 0.6543 - val_loss: 0.9102 - val_accuracy: 0.6812
Epoch 25/100
781/781 [======] - 21s - loss: 0.7234 - accuracy: 0.7491 - val_loss: 0.7109 - val_accuracy: 0.7550
Epoch 47/100
781/781 [======] - 20s - loss: 0.5892 - accuracy: 0.7981 - val_loss: 0.6342 - val_accuracy: 0.7890
EarlyStopping: val_loss did not improve for 10 epochs. Restoring best weights.

Step 8 — Evaluate and Visualise Results

# ── EVALUATION ───────────────────────────────────────────────────────────
test_loss, test_acc = model.evaluate(X_test, y_test_cat, verbose=0)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Loss:     {test_loss:.4f}")

# ── CONFUSION MATRIX ─────────────────────────────────────────────────────
# model.predict returns probability arrays. argmax gives the predicted class.
# WHY argmax? Softmax output is [0.02, 0.01, 0.85, ...]; argmax = index of max.

y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)    # axis=1: argmax across 10 classes
y_true = np.argmax(y_test_cat, axis=1)     # convert one-hot back to integers

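cm = confusion_matrix(y_true, y_pred)

# Plot the confusion matrix as a heatmap (rows = true class, columns = predicted class)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()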

# ── VISUALISE TRAINING CURVES ────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(history.history['loss'],     label='Train Loss', color='#60a5fa')
axes[0].plot(history.history['val_loss'], label='Val Loss',   color='#f87171')
axes[0].set_title('Loss Curves')
axes[0].legend()

# Accuracy curve
axes[1].plot(history.history['accuracy'],     label='Train Acc', color='#34d399')
axes[1].plot(history.history['val_accuracy'], label='Val Acc',   color='#a78bfa')
axes[1].set_title('Accuracy Curves')
axes[1].legend()

plt.tight_layout()
plt.show()

# Full per-class report
print(classification_report(y_true, y_pred, target_names=class_names))

EVALUATION OUTPUT

Test Accuracy: 0.7890
Test Loss:     0.6342

              precision    recall  f1-score   support

    airplane       0.83      0.82      0.83      1000
  automobile       0.90      0.89      0.90      1000
        bird       0.70      0.68      0.69      1000
         cat       0.64      0.60      0.62      1000
        deer       0.78      0.82      0.80      1000
         dog       0.70      0.69      0.69      1000
        frog       0.84      0.88      0.86      1000
       horse       0.86      0.87      0.87      1000
        ship       0.87      0.90      0.88      1000
       truck       0.87      0.88      0.87      1000

    accuracy                           0.79     10000
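The precision and recall columns in a report like this come straight from the confusion matrix. The sketch below shows the calculation on a made-up 3-class matrix (the numbers are for illustration only and are unrelated to the CIFAR-10 run above).

```python
import numpy as np

def per_class_metrics(cm):
    """Precision and recall per class from a confusion matrix
    (rows = true class, columns = predicted class, as in scikit-learn)."""
    true_pos = np.diag(cm).astype(float)
    precision = true_pos / cm.sum(axis=0)   # column sums = how often each class was predicted
    recall = true_pos / cm.sum(axis=1)      # row sums = how often each class truly occurs
    return precision, recall

# Hypothetical 3-class confusion matrix, for illustration only.
cm_demo = np.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
])
precision, recall = per_class_metrics(cm_demo)
print(precision.round(2).tolist())   # [0.8, 0.75, 0.75]
print(recall.round(2).tolist())      # [0.8, 0.6, 0.9]
print(round(np.trace(cm_demo) / cm_demo.sum(), 3))   # overall accuracy
```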

Section 10

Case Study — Pneumonia Detection from Chest X-Rays

📖 Real-World Case Study

Detecting Pneumonia in Hospital Chest X-Rays

It is 3 AM in a rural hospital. One radiologist is on call. A queue of 50 chest X-rays has built up from the emergency department. Each X-ray must be reviewed for pneumonia — a condition that, untreated, can be fatal within hours.

In 2017, Stanford's CheXNet CNN was reported to detect pneumonia from chest X-rays at a level comparable to practising radiologists, and the model was extended to 14 chest conditions. In our case study, we use the Kaggle Chest X-Ray dataset (5,863 images: Normal vs Pneumonia) to build a binary CNN classifier that a hospital could use as a screening assistant.

This is a binary classification problem. Every design decision we make below is driven by this medical context.

⚠️

Medical Context Drives Every Design Choice

In medical imaging, False Negatives (missing real pneumonia) are more dangerous than False Positives. A missed case could cost a life; a false alarm leads to an extra doctor review. This influences our loss function, threshold, and metrics choices below.
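One practical consequence is the choice of decision threshold on the sigmoid output. The sketch below uses made-up labels and probabilities to show the trade-off: lowering the threshold from 0.5 to 0.3 catches more true pneumonia cases at the cost of more false alarms. The helper name and the data are illustrative assumptions.

```python
import numpy as np

def recall_and_false_alarms(y_true, probs, threshold):
    """Recall on the positive class and number of false positives at a threshold."""
    y_pred = (probs >= threshold).astype(int)
    true_pos = np.sum((y_pred == 1) & (y_true == 1))
    false_neg = np.sum((y_pred == 0) & (y_true == 1))
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    return true_pos / (true_pos + false_neg), int(false_pos)

# Hypothetical labels (1 = pneumonia) and model probabilities, for illustration only.
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0])
probs_demo = np.array([0.9, 0.7, 0.45, 0.35, 0.4, 0.2, 0.6, 0.1])

for threshold in (0.5, 0.3):
    recall, false_pos = recall_and_false_alarms(y_true_demo, probs_demo, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, false positives={false_pos}")
# threshold=0.5: recall=0.50, false positives=1
# threshold=0.3: recall=1.00, false positives=2
```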

Case Study — Full Code

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score

# ── DATASET PATHS ─────────────────────────────────────────────────────────
# Dataset: kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
# Structure:
#   chest_xray/
#     train/ NORMAL/ PNEUMONIA/
#     val/   NORMAL/ PNEUMONIA/
#     test/  NORMAL/ PNEUMONIA/

TRAIN_DIR = 'chest_xray/train'
VAL_DIR   = 'chest_xray/val'
TEST_DIR  = 'chest_xray/test'

IMG_SIZE  = (224, 224)   # WHY 224×224? VGG16 and many pretrained models
                           # expect 224×224. We use transfer learning below.
BATCH     = 32

# ── DATA GENERATORS ──────────────────────────────────────────────────────
# Training: aggressive augmentation — X-rays can be taken at slight angles,
#           different zoom levels, different patient orientations.
# Validation/Test: NO augmentation — evaluate on real unmodified images.
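# rescale=1/255: normalise pixel values to [0,1] for all generators.
# (VGG16's own preprocess_input uses a different scheme: BGR order with ImageNet
#  mean subtraction. Plain rescaling is a simplification used in this tutorial.)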

train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,          # X-rays rarely tilted more than ±10°
    zoom_range=0.1,
    horizontal_flip=True,        # Lungs are roughly symmetric; note that flipping also mirrors the heart
    shear_range=0.1,             # Small shear — simulates patient lean
    brightness_range=[0.8, 1.2] # X-ray exposure can vary between machines
)

val_test_gen = ImageDataGenerator(rescale=1./255)

# flow_from_directory: reads images from folder structure
# class_mode='binary': returns 0 (NORMAL) or 1 (PNEUMONIA) labels
# WHY binary and not categorical?: Only 2 classes. Sigmoid output.
# target_size: resizes all images to IMG_SIZE on the fly.

train_data = train_gen.flow_from_directory(
    TRAIN_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary')

val_data = val_test_gen.flow_from_directory(
    VAL_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary',
    shuffle=False)   # WHY shuffle=False on val/test? We need predictions to
                      # align with true labels in the correct order for metrics.

test_data = val_test_gen.flow_from_directory(
    TEST_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary',
    shuffle=False)

# ── CLASS IMBALANCE CHECK ─────────────────────────────────────────────────
# Kaggle dataset: ~3875 PNEUMONIA, ~1341 NORMAL → imbalanced 3:1 ratio.
# If we ignore this, the model could predict "always pneumonia" and get 74% accuracy!
# We compute class weights to make the loss penalise minority class more.

n_normal    = 1341
n_pneumonia = 3875
total       = n_normal + n_pneumonia

class_weight = {
    0: total / (2 * n_normal),    # ~1.94: upweight Normal class
    1: total / (2 * n_pneumonia)   # ~0.67: slightly downweight Pneumonia
}
print("Class weights:", class_weight)

# ── TRANSFER LEARNING WITH VGG16 ─────────────────────────────────────────
# WHY TRANSFER LEARNING?
# We have ~5,000 training images. Training a deep CNN from scratch on this
# would overfit badly. VGG16 was pretrained on ImageNet (1.2M images, 1000 classes).
# Its early layers already detect edges, textures, shapes — useful for X-rays too.
# We FREEZE the pretrained layers, add our own head, and only train the head first.

# weights='imagenet': download pretrained ImageNet weights
# include_top=False: exclude VGG16's original Dense classification head
# input_shape=(224, 224, 3): X-rays are grayscale but VGG16 expects 3 channels.
#   flow_from_directory defaults to color_mode='rgb', so each grayscale file
#   is loaded with its single channel repeated 3 times. No extra code needed.
# WHY NOT grayscale directly?: The pretrained first conv layer has 3-channel filters.

base_model = VGG16(weights='imagenet', include_top=False,
                    input_shape=(*IMG_SIZE, 3))

# Freeze ALL base model layers — only our custom head will be trained initially
base_model.trainable = False

# ── BUILD CUSTOM HEAD ON TOP OF VGG16 ───────────────────────────────────
model = models.Sequential([
    base_model,                             # Frozen VGG16 feature extractor

    # GlobalAveragePooling2D: replaces Flatten.
    # Output of VGG16 (with 224×224 input) is (7, 7, 512).
    # GAP2D reduces this to a 512-dim vector by averaging each feature map.
    # WHY GAP instead of Flatten?: 7×7×512 = 25,088. Flatten → Dense would be
    # 25,088 × 256 = 6.4 million params for one layer. GAP gives 512 × 256 = 131k.
    # GAP is regularising: inherently reduces overfitting.
    layers.GlobalAveragePooling2D(),

    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),

    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),

    # Binary output: 1 neuron, sigmoid activation
    # sigmoid outputs probability of Pneumonia (class 1): value in [0,1]
    # Default decision threshold: >0.5 → Pneumonia, ≤0.5 → Normal
    # We will LOWER this threshold to 0.3 to reduce False Negatives.
    layers.Dense(1, activation='sigmoid')
])

# ── COMPILE: binary_crossentropy for 2-class sigmoid output ─────────────
# WHY binary_crossentropy not categorical?:
#   binary_crossentropy = -[y*log(p) + (1-y)*log(1-p)]
#   Designed for a single sigmoid output. categorical_crossentropy is
#   for multi-class softmax with one-hot labels.
# metrics=['accuracy', AUC]: AUC-ROC is critical in medical imaging —
#   it measures discrimination across ALL thresholds, not just 0.5.

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)

# ── PHASE 1: Train head only (base frozen) ───────────────────────────────
history1 = model.fit(
    train_data,
    epochs=15,
    validation_data=val_data,
    class_weight=class_weight,     # Apply imbalance correction
    callbacks=[
        EarlyStopping(monitor='val_auc', patience=5, restore_best_weights=True,
                      mode='max'),  # mode='max': AUC should INCREASE
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
    ]
)

# ── PHASE 2: Fine-tune (unfreeze the last VGG16 block) ──────────────────
# Now that our head is trained, carefully unfreeze the last convolutional
# block of VGG16 (block5) and fine-tune with a MUCH smaller learning rate.
# WHY small LR for fine-tuning?: The pretrained weights are good starting points.
# A large LR would destroy them ("catastrophic forgetting").
# We want tiny adjustments to specialise for X-rays, not relearn from scratch.

base_model.trainable = True

# Only unfreeze block5 (its 3 conv layers; the pooling layer has no weights)
for layer in base_model.layers:
    if 'block5' in layer.name:
        layer.trainable = True
    else:
        layer.trainable = False

# Recompile with a 100× smaller learning rate for fine-tuning
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-5),  # 0.001 → 0.00001
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)

history2 = model.fit(
    train_data,
    epochs=20,
    validation_data=val_data,
    class_weight=class_weight,
    callbacks=[
        EarlyStopping(monitor='val_auc', patience=7, restore_best_weights=True, mode='max'),
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=4)
    ]
)

# ── EVALUATION WITH LOWERED THRESHOLD ────────────────────────────────────
# Default threshold is 0.5. In medical context, we lower to 0.3:
# → Catches more True Positives (reduce missed pneumonia cases)
# → Accepts more False Positives (extra doctor reviews — acceptable trade-off)

y_pred_prob = model.predict(test_data, verbose=0).flatten()
y_true_test = test_data.classes

threshold = 0.3   # Medical decision: sensitivity over specificity
y_pred_class = (y_pred_prob >= threshold).astype(int)

auc_score = roc_auc_score(y_true_test, y_pred_prob)
print(f"\nAUC-ROC Score: {auc_score:.4f}")
print(classification_report(y_true_test, y_pred_class,
                             target_names=['Normal', 'Pneumonia']))

CASE STUDY RESULTS (threshold=0.30)

AUC-ROC Score: 0.9721

Class	precision	recall	f1-score	support
Normal	0.91	0.85	0.88	234
Pneumonia	0.93	0.96	0.94	390
accuracy			0.92	624

Key Metrics:

Recall (Pneumonia): 0.96 ← catches 96% of real cases ✓

Recall (Normal): 0.85 ← 15% flagged as false alarms (doctors review)

AUC-ROC: 0.97 ← near-perfect discrimination
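To make the threshold trade-off concrete, here is a minimal pure-Python sketch. The labels and probabilities are made up for illustration (they are not outputs of the case-study model), and `recall_and_specificity` and `roc_auc` are hypothetical helper names. It shows that lowering the threshold from 0.5 to 0.3 trades specificity for recall, while AUC-ROC does not depend on the threshold at all.

```python
def recall_and_specificity(y_true, y_prob, threshold):
    """Return (recall, specificity) of a binary classifier at one threshold."""
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, y_prob):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    fine for a demo but slow on large datasets.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative data only: 1 = Pneumonia, 0 = Normal
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.55, 0.40, 0.35, 0.45, 0.32, 0.20, 0.10, 0.05]

for t in (0.5, 0.3):
    recall, specificity = recall_and_specificity(y_true, y_prob, t)
    print(f"threshold={t}: recall={recall:.2f}, specificity={specificity:.2f}")

print(f"AUC-ROC (threshold-free): {roc_auc(y_true, y_prob):.2f}")
```

On this toy data, recall rises from 0.60 to 1.00 and specificity falls from 1.00 to 0.60 when the threshold drops to 0.3. A model that gives every image the same score gets an AUC of 0.5 from `roc_auc`, which is why AUC exposes an "always pneumonia" predictor that accuracy would reward.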

Section 11

Diagnosing Your CNN — Reading the Loss Curves

🔴 Overfitting Pattern

Epoch	Train Loss	Val Loss
10	0.45	0.48
20	0.28	0.59
30	0.15	0.82
40	0.08	1.15

Train loss keeps falling while val loss rises → memorising training data.

✅ Healthy Training Pattern

Epoch	Train Loss	Val Loss
10	0.52	0.54
20	0.38	0.40
30	0.31	0.33
40	0.28	0.30

Both losses decline together and converge. Small gap = healthy generalisation.
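The two patterns above can be told apart programmatically. The sketch below is one simple heuristic, not a standard rule: it flags a run as overfitting when validation loss has risen above its best value while training loss kept falling. The function name `diagnose_loss_curves` is hypothetical, and the numbers are the ones from the two tables.

```python
def diagnose_loss_curves(train_loss, val_loss):
    """Return (label, best_index) for a pair of loss curves.

    Heuristic (an assumption): 'overfitting' if the final validation loss
    is above its best value while training loss is below what it was at
    that best point. Otherwise 'healthy'.
    """
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    val_rose = val_loss[-1] > val_loss[best_idx]
    train_fell = train_loss[-1] < train_loss[best_idx]
    label = "overfitting" if val_rose and train_fell else "healthy"
    return label, best_idx


epochs = [10, 20, 30, 40]

overfit_run = diagnose_loss_curves([0.45, 0.28, 0.15, 0.08],
                                   [0.48, 0.59, 0.82, 1.15])
healthy_run = diagnose_loss_curves([0.52, 0.38, 0.31, 0.28],
                                   [0.54, 0.40, 0.33, 0.30])

for label, best_idx in (overfit_run, healthy_run):
    print(f"{label}: best validation loss at epoch {epochs[best_idx]}")
```

The best-index value is the checkpoint you would want to keep, which is the same idea that `EarlyStopping(restore_best_weights=True)` automates during training.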

📉

Loss Not Decreasing

Underfitting / LR Issues

Loss stays flat from epoch 1. Causes: learning rate too small (try 10× larger), model too shallow, data normalisation missing. Try: increase model capacity or LR.

📈

Oscillating Loss

LR Too High / Bad Data

Loss bounces up and down wildly. Causes: learning rate too large (gradients overshoot), batch size too small (noisy gradients), corrupted data. Try: reduce LR by 10×, increase batch size.

🎯

Val Accuracy Plateaus Early

Need More Augmentation

Train acc keeps rising but val acc stops at ~65%. Dataset is too small or not diverse enough. Try: more aggressive augmentation, transfer learning, collect more data.
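The learning-rate symptoms described above can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-parameter loss L(w) = w², which is an assumption chosen for clarity and far simpler than a real CNN loss surface. The function name `run_gradient_descent` is hypothetical.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise the toy loss L(w) = w**2 and return the loss after each step."""
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                 # dL/dw for L(w) = w**2
        w -= learning_rate * grad      # standard gradient descent update
        losses.append(w * w)
    return losses


for lr in (0.001, 0.1, 1.05):
    losses = run_gradient_descent(lr)
    print(f"lr={lr}: first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
```

With lr=0.001 the loss barely moves in 20 steps (the "flat loss" case), with lr=0.1 it falls steadily towards zero, and with lr=1.05 every update overshoots the minimum so the loss grows instead of shrinking. On real networks with mini-batch noise, the overshooting case usually shows up as a loss that bounces rather than one that grows cleanly.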

Section 12

Golden Rules for Building CNNs

🧠 CNN Practitioner Rules — Non-Negotiable

Always normalise your images first. Raw pixel values in the 0–255 range cause unstable gradients, so never train on raw uint8 values. Divide by 255.0 at minimum. For transfer learning, apply the same preprocessing the pretrained weights were trained with (typically ImageNet mean/std normalisation).

Start with transfer learning if you have fewer than ~100,000 images. Training from scratch on small datasets almost always leads to overfitting. Use VGG16, ResNet50, or EfficientNet-B0 as a frozen base. Fine-tune later.

Double the filters after each pooling layer. This is the VGG principle, for example 32→64→128→256 in a small network (VGG16 itself uses 64→128→256→512). As spatial resolution decreases, channel depth increases to preserve information capacity.

Use BatchNorm before heavy regularisation. BatchNorm acts as a mild regulariser. Don't apply both aggressive dropout AND L2 to the same layer — you'll underfit. BatchNorm + light Dropout (0.25 on conv, 0.5 on dense) is the standard recipe.

Monitor AUC-ROC, not just accuracy, for imbalanced datasets. A model that always predicts the majority class still gets high accuracy. AUC-ROC measures how well the model discriminates between classes at all possible thresholds.

Use padding='same' in most conv layers. Valid padding shrinks the spatial dimensions at every layer, losing boundary information. Same padding preserves dimensions until you explicitly downsample with pooling.

For fine-tuning pretrained models: freeze all layers first, train head for 10–15 epochs, then unfreeze only the last 1–2 blocks and retrain with LR 10–100× smaller than initial. Unfreeze too much too early = catastrophic forgetting.
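The filter-doubling and padding rules above are easy to check with shape arithmetic. The sketch below assumes Keras-style semantics (`same` gives ceil(size / stride), `valid` gives floor((size - kernel) / stride) + 1) and a hypothetical stack of three blocks, each one 3×3 convolution followed by a 2×2 max-pool. The helper names `conv_output_size` and `trace_stack` are illustrative, not library functions.

```python
import math


def conv_output_size(size, kernel, stride=1, padding="same"):
    """Spatial output size of a conv or pooling layer (Keras-style padding)."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


def trace_stack(size, filters, n_blocks, conv_padding):
    """Trace output shapes through [3x3 conv -> 2x2 max-pool] blocks.

    Filters double after every pooling layer, as in the VGG pattern.
    """
    shapes = []
    for _ in range(n_blocks):
        # 3x3 convolution, stride 1
        size = conv_output_size(size, kernel=3, padding=conv_padding)
        # 2x2 max-pool, stride 2, 'valid' padding (the Keras default)
        size = conv_output_size(size, kernel=2, stride=2, padding="valid")
        shapes.append((size, size, filters))
        filters *= 2
    return shapes


print("same :", trace_stack(32, 32, 3, "same"))
print("valid:", trace_stack(32, 32, 3, "valid"))
```

For a 32×32 input, `same` padding gives 16×16×32, 8×8×64, 4×4×128, so only pooling changes the resolution. With `valid` padding the same stack shrinks to 15×15, 6×6, then 2×2, leaving no room for a fourth 3×3 convolution.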

Section 13

CNN vs Other Architectures — When to Use What

Architecture	Best For	Dataset Size	Speed	Accuracy
Custom CNN	Learning, small projects, full control	50k–500k	Fast	Good
VGG16 (Transfer)	Medical, domain-specific, small data	1k–50k	Moderate	Very Good
ResNet50	General vision, deep architectures	10k+	Moderate	Excellent
EfficientNet-B0	Production, mobile, efficiency-critical	Any	Very Fast	Excellent for its size
Vision Transformer (ViT)	Very large datasets, attention-based	1M+	Slow	State of the art on large datasets

🏆

The Practitioner's Decision Tree

Under 10,000 images? → Use transfer learning (VGG16/ResNet).

General vision task, medium data? → ResNet50 or EfficientNet.

Production/mobile deployment? → EfficientNet-B0 or MobileNetV3.

Building something from scratch to learn? → Custom CNN with CIFAR-10 first.

Massive dataset, unlimited compute? → Vision Transformer.
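The decision tree can be written down as a small function. This is only a sketch: the function name `suggest_architecture`, its parameters, the order in which the questions are checked, and the 1,000,000-image cut-off for "massive" (borrowed from the comparison table) are all assumptions, not part of any library.

```python
def suggest_architecture(n_images, mobile_deployment=False,
                         learning_exercise=False, unlimited_compute=False):
    """Map the decision tree's questions to a suggested starting point.

    The check order is an assumption; the source lists the questions
    without ranking them.
    """
    if learning_exercise:
        return "Custom CNN (start with CIFAR-10)"
    if mobile_deployment:
        return "EfficientNet-B0 or MobileNetV3"
    if n_images < 10_000:
        return "Transfer learning (VGG16/ResNet)"
    if n_images >= 1_000_000 and unlimited_compute:
        return "Vision Transformer"
    return "ResNet50 or EfficientNet"


# The chest X-ray case study has roughly 5,000 training images
print(suggest_architecture(5_216))
print(suggest_architecture(50_000))
print(suggest_architecture(5_000_000, unlimited_compute=True))
```

Treat the output as a starting point for experiments rather than a verdict; the thresholds are rules of thumb.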