The Story That Makes Backpropagation Click
The coach doesn't just blame the last archer. She walks backwards down the line, asking: how much did each archer contribute to that final error? Archer 3 contributed 60% of the drift. Archer 2's bad stance caused 30%. Archer 1's grip caused the remaining 10%. Each archer gets a personalised correction β not a generic one.
That is backpropagation. The network fires forward, produces an error, then the coach (the algorithm) walks backwards through every layer, assigning a precise share of blame to every single weight β then corrects them all in one efficient backward pass.
Backpropagation (short for "backward propagation of errors") is the algorithm that makes neural network learning possible. It computes how much each weight in the network contributed to the final loss, using the chain rule of calculus applied efficiently through the computation graph β all weights updated in a single backward sweep.
Before backpropagation (formalised by Rumelhart, Hinton & Williams in 1986), there was no efficient way to train deep networks. Computing gradients by perturbing each weight individually requires two forward passes per weight β absurdly slow for millions of parameters. Backprop does it in one forward + one backward pass, regardless of network depth. This is the algorithmic core of every modern deep learning system.
The Four Core Concepts
The Computation Graph β Your Network's Blueprint
Every neural network operation can be drawn as a graph of primitive computations. During the forward pass, data flows left to right. During the backward pass, gradients flow right to left. The graph makes it possible to apply the chain rule mechanically, node by node.
Each node stores its output during the forward pass so the backward pass can reuse it. This is why backprop only needs one forward + one backward sweep.
The Chain Rule β The Mathematical Engine
The answer is the product of each gear ratio along the chain. If AβB has ratio 2Γ, and BβC has ratio 3Γ, then AβC is 6Γ. You multiply the local rates. The chain rule is precisely this β multiply local derivatives along a path in the computation graph.
The chain rule states: if L depends on z through a, then:
At each node, backprop reuses the stored forward-pass values (activations and pre-activations). It computes the local gradient and multiplies it by the incoming gradient β then passes the result to the next node upstream. Every weight gets its gradient in exactly one pass. No redundant computation. This is the genius of the algorithm.
Full Numerical Example β Step-by-Step Mathematics
We will build a tiny neural network with 2 inputs β 1 hidden neuron β 1 output neuron and walk through every single number in both the forward and backward pass. No hand-waving, no skipping β every calculation shown.
Input: x = [0.5, 0.8] | True label: y = 1.0 | Activation: Sigmoid Ο(z) = 1/(1+eβ»αΆ») | Loss: MSE = Β½(Ε· β y)Β²
👉 Step 1 β Forward Pass
zβ = WββΒ·xβ + WββΒ·xβ + bβ
zβ = (0.4)(0.5) + (0.6)(0.8) + 0.1
zβ = 0.20 + 0.48 + 0.10 = 0.68
aβ = 1 / (1 + eβ»β°Β·βΆβΈ) = 1 / (1 + 0.5066) = 1 / 1.5066 = 0.6637
zβ = WβΒ·aβ + bβ
zβ = (0.9)(0.6637) + 0.1 = 0.5973 + 0.1 = 0.6973
Ε· = 1 / (1 + eβ»β°Β·βΆβΉβ·Β³) = 1 / (1 + 0.4977) = 0.6682
L = Β½(0.6682 β 1.0)Β² = Β½(β0.3318)Β² = Β½(0.1101) = 0.0550
👉 Step 2 β Backward Pass
Now we work backwards from the loss, computing gradients layer by layer using the chain rule. We start at the output and propagate the error signal upstream.
Error signal Ξ΄ flows backward. Each layer multiplies the incoming Ξ΄ by its own local derivative (chain rule), then passes the result upstream.
L = Β½(Ε· β y)Β² β dL/dΕ· = Ε· β y = 0.6682 β 1.0 = β0.3318
Ο'(z) = Ο(z)Β·(1 β Ο(z)) = Ε·Β·(1 β Ε·)
Ο'(zβ) = 0.6682 Γ (1 β 0.6682) = 0.6682 Γ 0.3318 = 0.2217
Ξ΄β = (dL/dΕ·) Γ Ο'(zβ) = (β0.3318) Γ 0.2217 = β0.0736
dL/dWβ = Ξ΄β Γ aβ = (β0.0736) Γ 0.6637 = β0.0488
dL/dbβ = Ξ΄β = β0.0736
dL/daβ = Ξ΄β Γ Wβ = (β0.0736) Γ 0.9 = β0.0662
Ο'(zβ) = aβ Γ (1 β aβ) = 0.6637 Γ (1 β 0.6637) = 0.6637 Γ 0.3363 = 0.2232
Ξ΄β = (dL/daβ) Γ Ο'(zβ) = (β0.0662) Γ 0.2232 = β0.0148
dL/dWββ = Ξ΄β Γ xβ = (β0.0148) Γ 0.5 = β0.0074
dL/dWββ = Ξ΄β Γ xβ = (β0.0148) Γ 0.8 = β0.0118
dL/dbβ = Ξ΄β = β0.0148
👉 Step 3 β Weight Update (Gradient Descent)
All gradients were negative, so all weights increased slightly. Since the prediction (0.668) was below the true label (1.0), the network needed to output a higher value next time β and increasing the weights achieves exactly that. Gradient descent nudged every weight in the right direction, all at once.
Layer-by-Layer Diagram β What Each Layer Computes
Complete Gradient Summary
| Parameter | Old Value | Gradient (dL/dΞΈ) | Update (Ξ·=0.5) | New Value | Direction |
|---|---|---|---|---|---|
| Wβ (output weight) | 0.9000 | β0.0488 | +0.0244 | 0.9244 | β increase |
| bβ (output bias) | 0.1000 | β0.0736 | +0.0368 | 0.1368 | β increase |
| Wββ (hidden weight 1) | 0.4000 | β0.0074 | +0.0037 | 0.4037 | β increase |
| Wββ (hidden weight 2) | 0.6000 | β0.0118 | +0.0059 | 0.6059 | β increase |
| bβ (hidden bias) | 0.1000 | β0.0148 | +0.0074 | 0.1074 | β increase |
All gradients are negative because the loss decreases when all weights increase (our prediction was below the target). Gradient descent subtracts the gradient, so W_new = W_old β Ξ· Γ (negative) = W_old + positive β weights go up. The network learns to predict higher, closer to the true label of 1.0.
Python Implementation β From Scratch
Below is a clean, well-commented Python implementation of the exact network we computed by hand above. It verifies all our numbers programmatically.
import numpy as np
# ββ Network weights (matching our manual example) ββββββββββ
W1 = np.array([[0.4, 0.6]]) # shape (1, 2) β 1 hidden neuron, 2 inputs
b1 = np.array([[0.1]]) # shape (1, 1)
W2 = np.array([[0.9]]) # shape (1, 1) β 1 output, 1 hidden neuron
b2 = np.array([[0.1]]) # shape (1, 1)
x = np.array([[0.5], [0.8]]) # shape (2, 1) β column vector
y = np.array([[1.0]]) # true label
# ββ Activation functions βββββββββββββββββββββββββββββββββββ
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def sigmoid_deriv(z):
s = sigmoid(z)
return s * (1 - s) # Ο'(z) = Ο(z)(1 - Ο(z))
# ββ FORWARD PASS ββββββββββββββββββββββββββββββββββββββββββ
z1 = W1 @ x + b1 # pre-activation hidden: (1,1)
a1 = sigmoid(z1) # activation hidden
z2 = W2 @ a1 + b2 # pre-activation output: (1,1)
y_hat = sigmoid(z2) # prediction Ε·
loss = 0.5 * (y_hat - y)**2 # MSE loss
print(f"z1 = {z1[0,0]:.4f}")
print(f"a1 = {a1[0,0]:.4f}")
print(f"z2 = {z2[0,0]:.4f}")
print(f"y_hat = {y_hat[0,0]:.4f}")
print(f"Loss = {loss[0,0]:.4f}")
# ββ BACKWARD PASS βββββββββββββββββββββββββββββββββββββββββ
# Output layer
dL_dyhat = y_hat - y # dL/dΕ·
delta2 = dL_dyhat * sigmoid_deriv(z2) # Ξ΄β = dL/dzβ
dL_dW2 = delta2 @ a1.T # dL/dWβ
dL_db2 = delta2 # dL/dbβ
# Hidden layer
dL_da1 = W2.T @ delta2 # propagate error upstream
delta1 = dL_da1 * sigmoid_deriv(z1) # Ξ΄β = dL/dzβ
dL_dW1 = delta1 @ x.T # dL/dWβ shape (1,2)
dL_db1 = delta1 # dL/dbβ
print(f"\nGradients:")
print(f"delta2 = {delta2[0,0]:.4f} (output error signal)")
print(f"dL/dW2 = {dL_dW2[0,0]:.4f}")
print(f"dL/db2 = {dL_db2[0,0]:.4f}")
print(f"delta1 = {delta1[0,0]:.4f} (hidden error signal)")
print(f"dL/dW11 = {dL_dW1[0,0]:.4f}")
print(f"dL/dW12 = {dL_dW1[0,1]:.4f}")
print(f"dL/db1 = {dL_db1[0,0]:.4f}")
# ββ GRADIENT DESCENT UPDATE βββββββββββββββββββββββββββββββ
lr = 0.5 # learning rate Ξ·
W2 = W2 - lr * dL_dW2
b2 = b2 - lr * dL_db2
W1 = W1 - lr * dL_dW1
b1 = b1 - lr * dL_db1
print(f"\nUpdated Weights:")
print(f"W2 = {W2[0,0]:.4f}")
print(f"b2 = {b2[0,0]:.4f}")
print(f"W11 = {W1[0,0]:.4f} W12 = {W1[0,1]:.4f}")
print(f"b1 = {b1[0,0]:.4f}")
Every number from the hand-calculation appears in the program output. This confirms the correctness of the chain rule application and the backpropagation logic. In practice you would run this in a loop over thousands of training examples and iterations β that loop is training.
The Full Training Loop β 1000 Iterations
import numpy as np
import matplotlib.pyplot as plt
# ββ Data: XOR problem βββββββββββββββββββββββββββββββββββββ
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float).T # (2,4)
Y = np.array([[0,1,1,0]], dtype=float) # (1,4)
# ββ Initialise weights with small random values βββββββββββ
np.random.seed(42)
W1 = np.random.randn(4, 2) * 0.5 # 4 hidden neurons
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.5
b2 = np.zeros((1, 1))
lr = 2.0
losses = []
for epoch in range(5000):
# ββ Forward ββββββββββββββββββββββββββββββββββββββββββββ
z1 = W1 @ X + b1 # (4,4)
a1 = 1 / (1 + np.exp(-z1)) # sigmoid
z2 = W2 @ a1 + b2 # (1,4)
y_hat = 1 / (1 + np.exp(-z2))
loss = np.mean(0.5 * (y_hat - Y)**2)
losses.append(loss)
# ββ Backward βββββββββββββββββββββββββββββββββββββββββββ
m = X.shape[1] # number of samples = 4
delta2 = (y_hat - Y) * y_hat * (1 - y_hat) / m
dW2 = delta2 @ a1.T
db2 = np.sum(delta2, axis=1, keepdims=True)
dA1 = W2.T @ delta2
delta1 = dA1 * a1 * (1 - a1)
dW1 = delta1 @ X.T
db1 = np.sum(delta1, axis=1, keepdims=True)
# ββ Update βββββββββββββββββββββββββββββββββββββββββββββ
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
if epoch % 1000 == 0:
print(f"Epoch {epoch:5d} Loss: {loss:.5f}")
# ββ Final predictions βββββββββββββββββββββββββββββββββββββ
print(f"\nFinal predictions:")
print(np.round(y_hat, 2), " (target: [0,1,1,0])")
The network learned the XOR function β a classic non-linear problem β purely through repeated forward + backward passes. No magic: just the chain rule, applied thousands of times.
Common Pitfalls in Backpropagation
Each layer multiplies the gradient by Ο'(z) β€ 0.25. With 10 layers: 0.25ΒΉβ° β 0.000001. The gradient at layer 1 is a millionth of the gradient at layer 10. Early layers learn nothing. Fix: use ReLU (derivative is 1 for positive z), batch normalisation, or residual connections.
The opposite problem: if weights are large, gradients blow up exponentially through layers. The loss oscillates wildly or returns NaN. Fix: gradient clipping (cap the norm of the gradient before the update), or careful weight initialisation (Xavier, He).
A neuron using ReLU outputs 0 for any negative input. Its gradient is also 0. If a neuron always gets a negative pre-activation, it will never update β it is "dead." Fix: use Leaky ReLU (small slope for negatives) or ELU, or lower the learning rate to prevent neurons from drifting into permanent negative territory.
Golden Rules of Backpropagation
dL/db = Ξ΄.
No input term. This is because bias has coefficient 1 in the linear transform β
β(Wx + b)/βb = 1. Beginners often overcomplicate bias gradients.
optimizer.zero_grad()
in PyTorch, or tape management in TensorFlow.
Accumulated gradients across batches will cause incorrect updates.
Activation Functions & Their Derivatives
| Activation | Formula Ο(z) | Derivative Ο'(z) | Range | Gradient Issue | Use Case |
|---|---|---|---|---|---|
| Sigmoid | 1 / (1 + eβ»αΆ») | Ο(z)(1 β Ο(z)) | (0, 1) | Vanishes for |z|>5 | Binary output |
| Tanh | (eαΆ» β eβ»αΆ»)/(eαΆ» + eβ»αΆ») | 1 β tanhΒ²(z) | (β1, 1) | Less severe vanishing | Hidden layers |
| ReLU | max(0, z) | 1 if z>0 else 0 | [0, β) | Dead neurons (z<0) | Deep hidden layers |
| Leaky ReLU | z if z>0 else 0.01z | 1 if z>0 else 0.01 | (ββ, β) | No dead neurons | Default hidden |
| Softmax | eαΆ»α΅’ / Ξ£eαΆ»β±Ό | Jacobian matrix | (0, 1) | None (output only) | Multi-class output |
PyTorch β Automatic Backpropagation
In practice, frameworks like PyTorch compute all gradients automatically. But understanding the manual derivation is essential β it tells you why the API works the way it does.
import torch
import torch.nn as nn
# ββ Reproduce our manual example in PyTorch βββββββββββββββ
x = torch.tensor([[0.5], [0.8]], dtype=torch.float32)
y = torch.tensor([[1.0]])
# Weights β require_grad=True tracks the computation graph
W1 = torch.tensor([[0.4, 0.6]], requires_grad=True)
b1 = torch.tensor([[0.1]], requires_grad=True)
W2 = torch.tensor([[0.9]], requires_grad=True)
b2 = torch.tensor([[0.1]], requires_grad=True)
# ββ Forward pass (PyTorch builds the computation graph) βββ
z1 = W1 @ x + b1
a1 = torch.sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = torch.sigmoid(z2)
loss = 0.5 * (y_hat - y)**2
# ββ Backward pass (ONE LINE β PyTorch does all the chain rule)
loss.backward()
# ββ Gradients are now in .grad attributes βββββββββββββββββ
print(f"dL/dW2 = {W2.grad.item():.4f}")
print(f"dL/db2 = {b2.grad.item():.4f}")
print(f"dL/dW11 = {W1.grad[0,0].item():.4f}")
print(f"dL/dW12 = {W1.grad[0,1].item():.4f}")
print(f"dL/db1 = {b1.grad.item():.4f}")
# ββ Gradient descent update βββββββββββββββββββββββββββββββ
with torch.no_grad():
W2 -= 0.5 * W2.grad
b2 -= 0.5 * b2.grad
W1 -= 0.5 * W1.grad
b1 -= 0.5 * b1.grad
print(f"\nUpdated: W2={W2.item():.4f} b2={b2.item():.4f}")
print(f" W1={W1.detach().numpy()} b1={b1.item():.4f}")
loss.backward() does in one line what we computed across 10 manual steps.
Behind the scenes, PyTorch's autograd engine traverses the computation graph
in reverse, applying the chain rule at every node β exactly the algorithm we implemented
by hand. Knowing the manual version means you understand what every framework does internally.