The Story That Explains Why CNN Was Born
Now imagine you had to describe that face to a computer that uses a plain MLP (Multi-Layer Perceptron). You would have to flatten the image into a 1,000,000-element vector, and a first hidden layer of just 1,000 units would already need a billion weights. That is impractical, and flattening also throws away all spatial structure.
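The arithmetic behind that claim can be sketched in a few lines. The layer sizes here are assumptions chosen for illustration (a 1,000-unit dense layer and a convolutional layer with 32 filters of size 3×3 on a single-channel image); they are not taken from any particular network.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def conv_params(kernel_size, in_channels, n_filters):
    """Weights plus biases of one 2D convolutional layer."""
    return kernel_size * kernel_size * in_channels * n_filters + n_filters


# Assumed sizes, for illustration only.
pixels = 1_000_000                    # flattened single-channel image
mlp_first_layer = dense_params(pixels, 1_000)
cnn_first_layer = conv_params(3, 1, 32)

print(f"Dense layer: {mlp_first_layer:,} parameters")
print(f"Conv layer:  {cnn_first_layer:,} parameters")
```

The convolutional layer stays tiny because the same small filter is reused at every image position, whatever the image size.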
Convolutional Neural Networks were designed to do exactly what your detective brain does: scan for local patterns, detect features, and gradually build global understanding. That is the entire philosophy.
In 1989, Yann LeCun introduced CNNs trained with backpropagation. In 1998 he built LeNet-5, a CNN that read handwritten digits on cheques for banks. In 2012, AlexNet won the ImageNet competition with a 15.3% top-5 error rate, far ahead of the traditional methods it competed against. CNNs had arrived. Today, they power face recognition on your phone, medical imaging, self-driving cars, and satellite image analysis.
A plain neural network treats every pixel as an independent input: it has no concept of "neighbours". CNNs enforce a spatial prior: nearby pixels are related, the same pattern can appear anywhere in the image (shared filters respond to it wherever it occurs, and pooling adds a degree of translation invariance), and complex features are built hierarchically from simple ones.
CNN Architecture: The Big Picture
A CNN is not one single operation: it is a pipeline of specialised layers. Think of it as a factory assembly line where each station adds value.
Data shrinks spatially but grows in depth (channels) as it moves through the network. Spatial richness is traded for semantic richness.
The Convolution Operation: The Heart of CNN
That flashlight is the kernel (filter). The map you draw is the feature map (activation map). Sliding the flashlight is the convolution operation.
Mathematically, a 2D convolution applies a small matrix (the filter, or kernel) to patches of the input. The filter slides across the image with a given stride; at each position it is multiplied element-wise with the patch underneath, and the products are summed into a single number. (Deep learning libraries skip the kernel flip of textbook convolution, so strictly speaking this is cross-correlation.)
Numerical Example: A 5×5 Image, 3×3 Filter
Input image (5×5):
| 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 |
Filter (3×3, vertical-edge kernel):
| -1 | 0 | +1 |
| -1 | 0 | +1 |
| -1 | 0 | +1 |
At the top-left position: (1×-1)+(0×0)+(1×1)+(0×-1)+(1×0)+(1×1)+(1×-1)+(1×0)+(0×1) = 0
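The sliding-window computation can be reproduced with a short pure-Python sketch. It assumes a square image, a square kernel, no padding, and the library convention of not flipping the kernel; `conv2d_valid` is a name invented for this illustration.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 1, 1],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

A 5×5 input and a 3×3 kernel with stride 1 give a 3×3 feature map, one value per kernel position.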
A vertical edge kernel has opposite signs in its left and right columns. A horizontal edge kernel has them in its top and bottom rows. A blur kernel has equal weights (1/9 each for a 3×3 kernel). In deep learning, we don't hand-craft these kernels: the network learns the kernel values through backpropagation. That's the magic: the filters self-organise to detect whatever is most useful for the task.
Padding and Stride: The Controls
Pooling Layers: Compressing Without Losing Soul
After convolution, we have feature maps that are large and potentially redundant. Pooling aggregates each small region into one value. It makes the representation more compact, reduces the computation and parameters needed by later layers, and introduces some spatial invariance: a feature detected slightly off-position still fires.
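Input feature map (4×4):
| 12 | 20 | 8 | 11 |
| 18 | 5 | 14 | 7 |
| 3 | 9 | 22 | 16 |
| 6 | 13 | 4 | 19 |
After 2×2 max pooling with stride 2 (2×2):
| 20 | 14 |
| 13 | 22 |
The maximum of each 2×2 block is retained, so 4×4 becomes 2×2: 75% of the values are discarded and the strongest activation in each block is kept.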
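The pooling example can be checked with a small sketch. It assumes non-overlapping square windows (stride equal to the pool size) and a square input whose side is divisible by the pool size; `max_pool2d` is a hypothetical helper, not a library function.

```python
def max_pool2d(feature_map, pool=2):
    """Keep the maximum of each non-overlapping pool x pool block."""
    size = len(feature_map) // pool
    return [
        [
            max(
                feature_map[i * pool + di][j * pool + dj]
                for di in range(pool)
                for dj in range(pool)
            )
            for j in range(size)
        ]
        for i in range(size)
    ]


feature_map = [
    [12, 20, 8, 11],
    [18, 5, 14, 7],
    [3, 9, 22, 16],
    [6, 13, 4, 19],
]
pooled = max_pool2d(feature_map)
print(pooled)
```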
Activation Functions: Why Non-Linearity Matters
A stack of linear operations (conv + conv + conv) is mathematically equivalent to a single linear operation. You could stack 100 such layers and the result would be no more expressive than one. Activation functions break this: they introduce non-linearity, enabling the network to model complex, curved decision boundaries.
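A tiny numerical check makes the collapse concrete. The sketch below uses two arbitrary 2×2 weight matrices (made-up values) in place of convolutions, which are linear maps too: without an activation, two layers equal one merged layer; with ReLU in between, they no longer do.

```python
def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def relu(v):
    return [max(0, vi) for vi in v]


# Arbitrary illustrative weights and input.
W1 = [[1, -2], [3, 1]]
W2 = [[2, 1], [-1, 1]]
x = [1, 2]

two_linear_layers = matvec(W2, matvec(W1, x))
one_merged_layer = matvec(matmul(W2, W1), x)
with_relu_between = matvec(W2, relu(matvec(W1, x)))

print(two_linear_layers, one_merged_layer, with_relu_between)
```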
ReLU, max(0, x), is cheap to compute, and its gradient is simply 1 (for x > 0) or 0 (for x < 0), with no expensive exp() calculations.
Why used: It suffers far less from vanishing gradients than sigmoid/tanh. Sparse activation (many zeros) = efficient computation.
Leaky ReLU replaces the zero output for negative inputs with α×x (a small slope, e.g. α = 0.01). ELU uses an exponential curve for negatives. Why used: Keeps all neurons "alive" during training. Sometimes preferred for very deep networks.
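These activations are short enough to write out directly. This is a scalar sketch for illustration; the default `alpha` values are common choices assumed here, and real frameworks apply the same formulas element-wise to whole tensors.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negatives keep a small slope instead of becoming zero."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """Smooth exponential curve for negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-2.0, 0.5):
    print(x, relu(x), relu_grad(x), leaky_relu(x), round(elu(x), 4))
```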
Softmax: $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$.
Why this formula? The exponential amplifies differences between logits, making the largest value dominate. Dividing by the sum ensures the outputs are valid probabilities: non-negative and summing to 1.
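A minimal softmax sketch follows. Subtracting the largest logit before exponentiating is a standard numerical-stability trick added here; it does not change the result because the shift cancels in the ratio. The example logits are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(softmax([1000.0, 1000.0]))              # large logits stay finite
```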
Batch Normalisation and Dropout: The Regulators
Why is this syntax used?
BatchNormalization() is usually placed right after a Conv2D layer. The original paper puts it before the activation, although placing it after the activation is also common. It was proposed to reduce "internal covariate shift": the problem where the distribution of each layer's inputs keeps changing during training, making learning slow.
Effect: Allows higher learning rates, acts as a mild regulariser, and markedly stabilises the training of deep networks.
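What the layer computes during training can be sketched for a single feature across one batch. This is a simplified illustration: `gamma` and `beta` stand for the learned scale and shift, the `eps` value is an assumption, and a real layer also keeps moving averages of the mean and variance for use at inference time.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]            # made-up batch of 4 values
normalised = batch_norm(activations)

out_mean = sum(normalised) / len(normalised)
out_var = sum((v - out_mean) ** 2 for v in normalised) / len(normalised)
print(normalised, out_mean, out_var)
```

Whatever the scale of the incoming activations, the output has roughly zero mean and unit variance before `gamma` and `beta` are applied.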
Why is this syntax used?
Dropout(rate=0.5) means each unit's output is zeroed with probability 0.5 at every training step (dropout is switched off at inference). This forces the network not to rely on any one neuron: it must learn redundant representations. It is like training a team to work even when some members are absent.
Position: After Dense layers (less often after Conv layers in modern nets, where spatial dropout is the usual variant).
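A sketch of "inverted" dropout, the variant most frameworks use: surviving activations are scaled up by 1/(1 - rate) during training so that nothing needs rescaling at inference. The function name and the seeded generator are choices made for this illustration.

```python
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training:
        return list(values)                   # identity at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(42)                       # seeded for repeatability
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
at_inference = dropout(activations, rate=0.5, rng=rng, training=False)

print(dropped)
print(at_inference)
```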
L2 regularisation. Why used: Large weights tend to memorise training data. L2 keeps them modest by adding a penalty proportional to the squared weights. In Keras:
kernel_regularizer=l2(0.001).
It can be combined with Batch Norm, although when BN is used, L2 on the weights matters less because BN already controls scale.
How CNNs Learn: Backpropagation Through Convolutions
Weight update: w = w - lr × gradient. With the Adam optimiser, the effective step size is adapted per parameter. Filters gradually evolve from random noise into meaningful feature detectors.
Full Python Implementation: Step by Step
Now we build a complete CNN from scratch using TensorFlow/Keras. Each syntax choice is explained in detail. We will structure the code into clean, logical steps.
Step 1: Import Libraries
# Core libraries: understand each import's role
import numpy as np # Array operations, data manipulation
import matplotlib.pyplot as plt # Visualisation of images, loss curves
import tensorflow as tf # Deep learning framework
# Keras API: tf.keras is the high-level API built into TensorFlow 2.x
# 'from tensorflow import keras' is equivalent
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# sklearn utilities for evaluation
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Set seeds for reproducibility; without this, results vary more from run to run
np.random.seed(42)
tf.random.set_seed(42)
Why tf.keras and Not Standalone Keras?
TensorFlow 2.x ships with Keras built in. tf.keras is tightly integrated with TF's computational graph, automatic differentiation (tf.GradientTape), and GPU support. You could use standalone Keras with different backends, but tf.keras is the usual choice for TensorFlow-based projects.
Step 2: Load and Prepare the CIFAR-10 Dataset
# CIFAR-10: 60,000 colour images, 32×32 pixels, 10 classes
# Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
# Shapes after loading:
print(f"X_train shape: {X_train.shape}") # (50000, 32, 32, 3)
print(f"X_test shape: {X_test.shape}") # (10000, 32, 32, 3)
print(f"y_train shape: {y_train.shape}") # (50000, 1)
# --- WHY NORMALISE TO [0,1]? ---
# Pixel values are uint8: 0 to 255. Neural nets train poorly with large inputs:
#   - Gradients become unstable (very large or very small)
#   - Weight initialisations are calibrated for ~unit-scale inputs
# Dividing by 255.0 maps values to [0.0, 1.0] and makes training stable.
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# --- WHY float32 and NOT float64? ---
# GPUs are optimised for float32. float64 gives no benefit for neural net
# training but uses 2× the memory and is considerably slower on most GPUs.
# --- ONE-HOT ENCODING THE LABELS ---
# y_train contains integers 0-9. Softmax outputs 10 probabilities.
# We need labels as 10-element vectors for categorical_crossentropy.
# E.g.: class 3 → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train_cat = tf.keras.utils.to_categorical(y_train, 10)
y_test_cat = tf.keras.utils.to_categorical(y_test, 10)
# Class names for display
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
Step 3: Data Augmentation
# --- WHY DATA AUGMENTATION? ---
# CNNs can memorise 50,000 training images perfectly (overfit).
# Augmentation artificially expands the dataset by creating variations
# of each image. This forces the network to learn features that are
# INVARIANT to flips, shifts, and rotations, so they generalise better.
# ImageDataGenerator applies transforms RANDOMLY on the fly during training.
# The original images are never modified.
datagen = ImageDataGenerator(
    rotation_range=15,       # Rotate image up to ±15 degrees randomly
                             # WHY 15°: CIFAR-10 objects are mostly upright; ±15° is realistic
    width_shift_range=0.1,   # Shift horizontally by up to 10% of width
    height_shift_range=0.1,  # Shift vertically by up to 10% of height
                             # WHY SHIFTS: objects aren't always perfectly centred
    horizontal_flip=True,    # Mirror image left-right 50% of the time
                             # WHY: A cat is still a cat when mirrored. NOT used for digits/text.
    zoom_range=0.1,          # Zoom in/out by up to 10%
    fill_mode='nearest'      # How to fill pixels created by shifts: copy nearest edge pixel
                             # Alternatives: 'reflect', 'wrap', 'constant'
)
# fit() only computes dataset statistics for featurewise_center,
# featurewise_std_normalization or zca_whitening. None of these is enabled
# here, so the call is harmless but not required.
datagen.fit(X_train)
Step 4: Build the CNN Model
# --- SEQUENTIAL API vs FUNCTIONAL API ---
# Sequential: simple stack of layers, suitable for most standard CNNs.
# Functional API: used when you need skip connections, multiple inputs/outputs.
# We use Sequential here for clarity.
def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    model = models.Sequential([
        # --- BLOCK 1: First Convolutional Block ---
        # Conv2D(32, (3,3), ...)
        #   32 = number of filters. Produces 32 feature maps.
        #   WHY 32 FIRST?: Start small and capture basic features.
        #   Larger first layers waste computation on simple features.
        #   (3,3) = kernel size. 3×3 is the de facto standard (popularised by VGG).
        #   WHY 3×3?: Two 3×3 convs have the same receptive field as one 5×5
        #   but with fewer parameters and more non-linearity.
        #   padding='same' = preserve spatial dimensions (32×32 → 32×32)
        #   activation='relu' = apply ReLU after convolution (most common choice)
        #   input_shape = only needed on the FIRST layer. Keras infers the rest.
        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                      input_shape=input_shape,
                      kernel_regularizer=regularizers.l2(1e-4)),
        # kernel_regularizer=l2(1e-4): add a tiny L2 penalty on the weights.
        # 1e-4 is a mild penalty that discourages extreme weight values.
        layers.BatchNormalization(),
        # BatchNorm before or after ReLU? Both orders are used in practice.
        # Original paper: before ReLU. Here it comes after the activation,
        # because ReLU is applied inside the Conv2D layer above.
        layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),
        # WHY TWO CONV LAYERS BEFORE POOLING?
        # Stacking 2 conv layers increases the effective receptive field
        # (two 3×3 layers = 5×5 receptive field) while using fewer parameters
        # than a single 5×5 layer. More non-linearity = richer representations.
        layers.MaxPooling2D((2, 2)),  # Halve spatial dims: 32×32 → 16×16
                                      # (2, 2) = pool window; stride defaults to the pool size
        layers.Dropout(0.25),  # Drop 25% of units. Lighter dropout after conv (25% vs 50%)
                               # because conv layers have fewer params and need less regularisation.
        # --- BLOCK 2: Deeper, More Filters ---
        # 64 filters: we DOUBLE the number after pooling. WHY?
        # Pooling halved the spatial resolution (information loss).
        # Doubling filters compensates by increasing the channel depth:
        # we trade spatial richness for semantic richness.
        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.MaxPooling2D((2, 2)),  # 16×16 → 8×8
        layers.Dropout(0.25),
        # --- BLOCK 3: Deepest Block ---
        layers.Conv2D(128, (3, 3), padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),  # 8×8 → 4×4
        layers.Dropout(0.25),
        # --- CLASSIFICATION HEAD ---
        # Flatten converts the 3D tensor (4, 4, 128) → 1D vector (2048,)
        # WHY FLATTEN NOT GlobalAveragePooling2D?
        # Flatten + Dense gives more parameters (more capacity) but risks overfit.
        # GAP averages each feature map down to 1×1: far fewer params,
        # at the cost of discarding the remaining spatial layout.
        # For CIFAR-10 (small images), Flatten works well.
        layers.Flatten(),
        # Dense(512): fully connected layer with 512 neurons
        # WHY 512?: Rule of thumb: start large and narrow down. Gives the
        # network capacity to combine features. Reduce toward output size.
        layers.Dense(512, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.BatchNormalization(),
        layers.Dropout(0.5),  # Heavier dropout on Dense layers (50%): more overfit risk
        # Final output layer: one neuron per class (10 for CIFAR-10)
        # activation='softmax': converts the raw scores to probabilities summing to 1
        # WHY NOT 'sigmoid' here?: sigmoid outputs independent probabilities (0-1) each.
        # softmax creates a COMPETITION between classes, which is exactly what we
        # want for mutually exclusive classification (one image = one class).
        layers.Dense(num_classes, activation='softmax')
    ])
    return model
model = build_cnn()
model.summary()
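The "two 3×3 layers instead of one 5×5" argument, and the parameter counts that model.summary() reports for convolutional layers, can be verified by hand. This is a small sketch assuming stride-1 convolutions with one bias per filter; the helper names are made up for this example.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Trainable parameters of a Conv2D layer: weights plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


first_conv = conv2d_params(3, 3, 32)          # RGB input, 32 filters
two_3x3 = 2 * conv2d_params(3, 32, 32)        # two stacked 3x3 layers, 32 -> 32 channels
one_5x5 = conv2d_params(5, 32, 32)            # a single 5x5 layer, 32 -> 32 channels

print("First conv layer:", first_conv)
print("Two 3x3 layers:", two_3x3, "receptive field:", receptive_field([3, 3]))
print("One 5x5 layer: ", one_5x5, "receptive field:", receptive_field([5]))
```

Both options see a 5×5 neighbourhood, but the stacked version needs fewer weights and applies a non-linearity twice.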
Step 5: Compile the Model
# --- COMPILE: tells Keras HOW to train ---
# optimizer='adam': Adaptive Moment Estimation
# WHY ADAM over plain SGD?
# Adam maintains per-parameter learning rates, adapted based on:
#   - first moment (m): exponential average of gradients (β1=0.9 default)
#   - second moment (v): exponential average of squared gradients (β2=0.999)
# Result: fast convergence, fairly robust to hyperparameter choices.
# learning_rate=0.001 is Adam's default and usually a good start.
# loss='categorical_crossentropy':
# WHY THIS LOSS for classification?
# It measures the distance between the predicted probability distribution
# and the true one-hot distribution. Mathematically: -Σ y_true * log(y_pred).
# Penalises confident wrong predictions VERY harshly (-log(p) → ∞ as p → 0).
# Use 'sparse_categorical_crossentropy' if labels are integers (no to_categorical).
# metrics=['accuracy']: tracks accuracy during training for human monitoring.
# Note: accuracy is NOT the loss, just a readable metric for logging.
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
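To see why this loss punishes confident mistakes, here is a per-sample sketch of categorical cross-entropy. The clipping constant `eps` and the example probabilities are assumptions for illustration; Keras applies a similar clip internally and averages the result over the batch.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """-sum(y_true * log(y_pred)) for one sample; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 0, 1]                            # one-hot label: class 2
confident_right = categorical_crossentropy(y_true, [0.05, 0.05, 0.90])
unsure = categorical_crossentropy(y_true, [0.30, 0.30, 0.40])
confident_wrong = categorical_crossentropy(y_true, [0.90, 0.05, 0.05])

print(round(confident_right, 4), round(unsure, 4), round(confident_wrong, 4))
```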
Step 6: Define Callbacks
# --- CALLBACKS: actions taken at the end of each epoch ---
# 1. EarlyStopping:
# Monitor validation loss. If it doesn't improve for 'patience' epochs,
# stop training. Prevents wasting compute on a model that has started to overfit.
# restore_best_weights=True: roll back to the epoch with the best val_loss.
# WHY monitor val_loss not val_accuracy?: accuracy is coarse (steps of 1/N),
# loss is continuous, so it is a more sensitive signal for early stopping.
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)
# 2. ReduceLROnPlateau:
# When val_loss stops improving, multiply the learning rate by 'factor'.
# WHY?: Training can stall on a plateau. Reducing the LR lets the optimiser
# take smaller steps and fine-tune more precisely.
# factor=0.5 means the LR is halved. min_lr prevents it from reaching zero.
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-6,
    verbose=1
)
# 3. ModelCheckpoint:
# Save the model whenever val_accuracy improves.
# WHY save_best_only=True?: training can degrade after a peak epoch.
# This guarantees we always have the best version on disk.
checkpoint = ModelCheckpoint(
    filepath='best_cnn_model.h5',
    monitor='val_accuracy',
    save_best_only=True,
    verbose=1
)
callbacks = [early_stop, reduce_lr, checkpoint]
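The patience rule behind early stopping is easy to simulate. The sketch below is a simplified stand-in for the callback's bookkeeping (no min_delta, lower loss is better), and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), 1-based; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.66]  # illustrative values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"Stop after epoch {stop_epoch}; best weights come from epoch {best_epoch}")
```

With restore_best_weights=True, the model would end up with the weights from the best epoch rather than those from the epoch where training stopped.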
Step 7: Train the Model
# --- model.fit(): the training loop ---
#
# datagen.flow(X_train, y_train_cat, batch_size=64):
# Creates a generator that yields augmented batches of 64 images.
# WHY batch_size=64?:
#   - Too small (8): noisy gradient estimates, slow progress
#   - Too large (512): smooth but may get stuck in sharp minima
#   - 32-128 is the practical sweet spot for CNNs on standard datasets
#
# steps_per_epoch = len(X_train) // batch_size:
# How many batches = one epoch. 50,000 / 64 ≈ 781 steps per epoch.
# WHY specify this?: A plain Python generator has no length, so Keras would
# not know when an epoch ends. The iterator returned by datagen.flow() does
# report its length, so here the argument mainly makes the epoch size explicit.
#
# validation_data=(X_test, y_test_cat):
# After each epoch, evaluate on the test set. Never used for weight updates.
# WHY not augment validation data?: We want to evaluate on REAL images,
# not augmented ones; augmentation is only for training diversity.
# NOTE: EarlyStopping and ModelCheckpoint select the model using this data,
# so the final test score is slightly optimistic. Holding out part of the
# training set for validation is the stricter option.
#
# epochs=100: maximum training epochs (EarlyStopping will likely stop before)
history = model.fit(
    datagen.flow(X_train, y_train_cat, batch_size=64),
    steps_per_epoch=len(X_train) // 64,
    epochs=100,
    validation_data=(X_test, y_test_cat),
    callbacks=callbacks,
    verbose=1
)
Step 8: Evaluate and Visualise Results
# --- EVALUATION ---
test_loss, test_acc = model.evaluate(X_test, y_test_cat, verbose=0)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test Loss: {test_loss:.4f}")
# --- CONFUSION MATRIX ---
# model.predict returns probability arrays. argmax gives the predicted class.
# WHY argmax? Softmax output is [0.02, 0.01, 0.85, ...]; argmax = index of max.
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1) # axis=1: argmax across 10 classes
y_true = np.argmax(y_test_cat, axis=1) # convert one-hot back to integers
cm = confusion_matrix(y_true, y_pred)
# --- VISUALISE TRAINING CURVES ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss curve
axes[0].plot(history.history['loss'], label='Train Loss', color='#60a5fa')
axes[0].plot(history.history['val_loss'], label='Val Loss', color='#f87171')
axes[0].set_title('Loss Curves')
axes[0].legend()
# Accuracy curve
axes[1].plot(history.history['accuracy'], label='Train Acc', color='#34d399')
axes[1].plot(history.history['val_accuracy'], label='Val Acc', color='#a78bfa')
axes[1].set_title('Accuracy Curves')
axes[1].legend()
plt.tight_layout()
plt.show()
# Full per-class report
print(classification_report(y_true, y_pred, target_names=class_names))
Case Study: Pneumonia Detection from Chest X-Rays
In 2017, Stanford's CheXNet CNN was reported to detect pneumonia from chest X-rays at a level comparable to practising radiologists, and was extended to 14 chest conditions. In our case study, we use the Kaggle Chest X-Ray dataset (5,863 images: Normal vs Pneumonia) to build a binary CNN classifier that a hospital could use as a screening assistant.
This is a binary classification problem. Every design decision we make below is driven by this medical context.
In medical imaging, False Negatives (missing real pneumonia) are more dangerous than False Positives. A missed case could cost a life; a false alarm leads to an extra doctor review. This influences our loss function, threshold, and metrics choices below.
Case Study: Full Code
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score
# --- DATASET PATHS ---
# Dataset: kaggle datasets download -d paultimothymooney/chest-xray-pneumonia
# Structure:
#   chest_xray/
#     train/ NORMAL/ PNEUMONIA/
#     val/   NORMAL/ PNEUMONIA/
#     test/  NORMAL/ PNEUMONIA/
TRAIN_DIR = 'chest_xray/train'
VAL_DIR = 'chest_xray/val'
TEST_DIR = 'chest_xray/test'
IMG_SIZE = (224, 224)  # WHY 224×224? VGG16 and many pretrained models
                       # expect 224×224. We use transfer learning below.
BATCH = 32
# --- DATA GENERATORS ---
# Training: augmentation, because X-rays can be taken at slight angles,
# different zoom levels, different patient orientations.
# Validation/Test: NO augmentation; evaluate on real unmodified images.
# rescale=1/255: normalise pixel values to [0,1] for all generators.
# (VGG16's ImageNet weights were trained with
# tf.keras.applications.vgg16.preprocess_input; [0,1] scaling is a
# simplification that the trainable layers have to compensate for.)
train_gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,           # X-rays are rarely tilted more than ±10°
    zoom_range=0.1,
    horizontal_flip=True,        # Assumes left/right orientation is not diagnostic;
                                 # the chest is only roughly symmetric
    shear_range=0.1,             # Small shear simulates patient lean
    brightness_range=[0.8, 1.2]  # X-ray exposure can vary between machines
)
val_test_gen = ImageDataGenerator(rescale=1./255)
# flow_from_directory: reads images from the folder structure
# class_mode='binary': returns 0 (NORMAL) or 1 (PNEUMONIA) labels
# WHY binary and not categorical?: Only 2 classes. Sigmoid output.
# target_size: resizes all images to IMG_SIZE on the fly.
train_data = train_gen.flow_from_directory(
    TRAIN_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary')
val_data = val_test_gen.flow_from_directory(
    VAL_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary',
    shuffle=False)  # WHY shuffle=False on val/test? We need predictions to
                    # align with true labels in the correct order for metrics.
test_data = val_test_gen.flow_from_directory(
    TEST_DIR, target_size=IMG_SIZE, batch_size=BATCH, class_mode='binary',
    shuffle=False)
# --- CLASS IMBALANCE CHECK ---
# Kaggle dataset: ~3875 PNEUMONIA, ~1341 NORMAL, an imbalance of roughly 3:1.
# If we ignore this, the model could predict "always pneumonia" and get 74% accuracy!
# We compute class weights to make the loss penalise minority-class errors more.
n_normal = 1341
n_pneumonia = 3875
total = n_normal + n_pneumonia
class_weight = {
    0: total / (2 * n_normal),     # ~1.94: upweight the Normal class
    1: total / (2 * n_pneumonia)   # ~0.67: slightly downweight Pneumonia
}
print("Class weights:", class_weight)
# --- TRANSFER LEARNING WITH VGG16 ---
# WHY TRANSFER LEARNING?
# We have ~5,000 training images. Training a deep CNN from scratch on this
# would overfit badly. VGG16 was pretrained on ImageNet (1.2M images, 1000 classes).
# Its early layers already detect edges, textures, shapes, which are useful for X-rays too.
# We FREEZE the pretrained layers, add our own head, and only train the head first.
# weights='imagenet': download pretrained ImageNet weights
# include_top=False: exclude VGG16's original Dense classification head
# input_shape=(224, 224, 3): X-rays are grayscale but VGG16 expects 3 channels.
# WHY NOT grayscale directly?: Pretrained weights expect 3 channels.
# flow_from_directory uses color_mode='rgb' by default, so each grayscale
# image is loaded with its single channel repeated 3 times.
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(*IMG_SIZE, 3))
# Freeze ALL base model layers: only our custom head will be trained initially
base_model.trainable = False
# --- BUILD CUSTOM HEAD ON TOP OF VGG16 ---
model = models.Sequential([
    base_model,  # Frozen VGG16 feature extractor
    # GlobalAveragePooling2D: replaces Flatten.
    # Output of VGG16 (with 224×224 input) is (7, 7, 512).
    # GAP2D reduces this to a 512-dim vector by averaging each feature map.
    # WHY GAP instead of Flatten?: 7×7×512 = 25,088. Flatten → Dense would be
    # 25,088 × 256 = 6.4 million params for one layer. GAP gives 512 × 256 = 131k.
    # Fewer parameters also means less room to overfit.
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),
    # Binary output: 1 neuron, sigmoid activation
    # sigmoid outputs probability of Pneumonia (class 1): value in [0,1]
    # Default decision threshold: >0.5 → Pneumonia, ≤0.5 → Normal
    # We will LOWER this threshold to 0.3 to reduce False Negatives.
    layers.Dense(1, activation='sigmoid')
])
# --- COMPILE: binary_crossentropy for 2-class sigmoid output ---
# WHY binary_crossentropy not categorical?:
# binary_crossentropy = -[y*log(p) + (1-y)*log(1-p)]
# Designed for a single sigmoid output. categorical_crossentropy is
# for multi-class softmax with one-hot labels.
# metrics=['accuracy', AUC]: AUC-ROC is critical in medical imaging because
# it measures discrimination across ALL thresholds, not just 0.5.
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
# --- PHASE 1: Train head only (base frozen) ---
history1 = model.fit(
    train_data,
    epochs=15,
    validation_data=val_data,
    class_weight=class_weight,  # Apply imbalance correction
    callbacks=[
        EarlyStopping(monitor='val_auc', patience=5, restore_best_weights=True,
                      mode='max'),  # mode='max': AUC should INCREASE
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3)
    ]
)
# --- PHASE 2: Fine-tune by unfreezing the last VGG16 block ---
# Now that our head is trained, carefully unfreeze the last convolutional
# block of VGG16 (block5) and fine-tune with a MUCH smaller learning rate.
# WHY small LR for fine-tuning?: The pretrained weights are good starting points.
# A large LR would destroy them ("catastrophic forgetting").
# We want tiny adjustments to specialise for X-rays, not relearn from scratch.
base_model.trainable = True
# Only unfreeze the block5 layers (its 3 conv layers; the pooling layer has no weights)
for layer in base_model.layers:
    if 'block5' in layer.name:
        layer.trainable = True
    else:
        layer.trainable = False
# Recompile with a 100× smaller learning rate for fine-tuning
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-5),  # 0.001 → 0.00001
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
history2 = model.fit(
    train_data,
    epochs=20,
    validation_data=val_data,
    class_weight=class_weight,
    callbacks=[
        EarlyStopping(monitor='val_auc', patience=7, restore_best_weights=True, mode='max'),
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=4)
    ]
)
# --- EVALUATION WITH LOWERED THRESHOLD ---
# Default threshold is 0.5. In the medical context, we lower it to 0.3:
#   → Catches more True Positives (fewer missed pneumonia cases)
#   → Accepts more False Positives (extra doctor reviews, an acceptable trade-off)
y_pred_prob = model.predict(test_data, verbose=0).flatten()
y_true_test = test_data.classes
threshold = 0.3  # Medical decision: sensitivity over specificity
y_pred_class = (y_pred_prob >= threshold).astype(int)
auc_score = roc_auc_score(y_true_test, y_pred_prob)
print(f"\nAUC-ROC Score: {auc_score:.4f}")
print(classification_report(y_true_test, y_pred_class,
                            target_names=['Normal', 'Pneumonia']))
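The effect of lowering the decision threshold can be shown on a handful of scores. The labels and probabilities below are invented for illustration and are not model outputs; the helper name is also hypothetical.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity (recall on class 1) and specificity (recall on class 0)."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if t == 1 and p >= threshold)
    fn = sum(1 for t, p in pairs if t == 1 and p < threshold)
    tn = sum(1 for t, p in pairs if t == 0 and p < threshold)
    fp = sum(1 for t, p in pairs if t == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels (1 = pneumonia) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.92, 0.61, 0.42, 0.35, 0.45, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, y_prob, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data the lower threshold catches every positive case at the cost of one extra false alarm, which is the trade-off the case study accepts.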
Diagnosing Your CNN: Reading the Loss Curves
| Epoch | Train Loss | Val Loss |
|---|---|---|
| 10 | 0.45 | 0.48 |
| 20 | 0.28 | 0.59 |
| 30 | 0.15 | 0.82 |
| 40 | 0.08 | 1.15 |
Overfitting: train loss keeps falling while val loss rises, which means the model is memorising the training data.
| Epoch | Train Loss | Val Loss |
|---|---|---|
| 10 | 0.52 | 0.54 |
| 20 | 0.38 | 0.40 |
| 30 | 0.31 | 0.33 |
| 40 | 0.28 | 0.30 |
Good fit: both losses decline together and converge. A small gap indicates healthy generalisation.
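The two patterns in the tables can be told apart with a rough heuristic. This is only a sketch: the `gap_tolerance` of 0.1 is an arbitrary assumption, and real curves are noisy enough that they should always be inspected visually as well.

```python
def diagnose(train_losses, val_losses, gap_tolerance=0.1):
    """Label a run 'overfitting' or 'healthy' from its loss curves (rough heuristic)."""
    val_rising = val_losses[-1] > min(val_losses)   # val loss ended above its best value
    final_gap = val_losses[-1] - train_losses[-1]
    if val_rising and final_gap > gap_tolerance:
        return "overfitting"
    return "healthy"


# Losses at epochs 10, 20, 30, 40 from the two tables.
run_a = diagnose([0.45, 0.28, 0.15, 0.08], [0.48, 0.59, 0.82, 1.15])
run_b = diagnose([0.52, 0.38, 0.31, 0.28], [0.54, 0.40, 0.33, 0.30])
print(run_a, run_b)
```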
Golden Rules for Building CNNs
Use padding='same' in most conv layers. Valid padding shrinks the spatial dimensions at every layer, gradually losing boundary information. Same padding preserves dimensions until you explicitly downsample with pooling.
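The size bookkeeping behind this rule fits in one helper. The sketch follows the Keras conventions as understood here: 'valid' gives floor((n - k) / s) + 1 and 'same' gives ceil(n / s). The function name is made up for this example.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Spatial output size along one dimension for a conv or pooling layer."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1


# Three 3x3 convolutions on a 32-pixel-wide input.
valid_sizes, same_sizes = [32], [32]
for _ in range(3):
    valid_sizes.append(conv_output_size(valid_sizes[-1], 3, padding="valid"))
    same_sizes.append(conv_output_size(same_sizes[-1], 3, padding="same"))

# Three 2x2 max-pooling layers with stride 2.
pooled_sizes = [32]
for _ in range(3):
    pooled_sizes.append(conv_output_size(pooled_sizes[-1], 2, stride=2))

print("valid:", valid_sizes)
print("same: ", same_sizes)
print("pool: ", pooled_sizes)
```

With 'same' padding the only shrinkage comes from pooling, so the spatial size at every depth is easy to predict.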
CNN vs Other Architectures: When to Use What
| Architecture | Best For | Typical Dataset Size (images) | Speed | Accuracy |
|---|---|---|---|---|
| Custom CNN | Learning, small projects, full control | 50k-500k | Fast | Good |
| VGG16 (Transfer) | Medical, domain-specific, small data | 1k-50k | Moderate | Very Good |
| ResNet50 | General vision, deep architectures | 10k+ | Moderate | Excellent |
| EfficientNet-B0 | Production, mobile, efficiency-critical | Any | Very Fast | Strong for its size |
| Vision Transformer (ViT) | Very large datasets, attention-based | 1M+ | Slow | SOTA on large datasets |
Under 10,000 images? → Use transfer learning (VGG16/ResNet). General vision task, medium data? → ResNet50 or EfficientNet. Production/mobile deployment? → EfficientNet-B0 or MobileNetV3. Building something from scratch to learn? → Custom CNN with CIFAR-10 first. Massive dataset, unlimited compute? → Vision Transformer.