The Story: How a Neural Network Learns from Mistakes
That is backpropagation. The neural network makes a prediction, compares it to the true answer, and then the algorithm walks backwards through every weight, measuring how much each one contributed to the error and nudging it in proportion to that contribution.
In this tutorial we solve a complete 2-input → 2-hidden → 2-output network step by step, exactly as written in the handwritten notes: every number, every formula, every chain rule term.
This network has 2 inputs (x₁, x₂), 2 hidden neurons (H₁, H₂), and 2 output neurons (Y₁, Y₂). All 9 weights are given. We compute the forward pass, the total error, and then backpropagate to update weights w₅ and w₁ in full detail.
Inputs: x₁ = 0.05 | x₂ = 0.10
Weights to hidden: w₁ = 0.15, w₂ = 0.20, w₃ = 0.25, w₄ = 0.30
Weights to output: w₅ = 0.40, w₆ = 0.45, w₇ = 0.50, w₈ = 0.55 (w₉ = 0.85)
Targets: T₁ = 0.01 | T₂ = 0.99
Learning rate η = 0.5
Forward Pass: Every Calculation on Paper
① Hidden Layer Inputs (net values)
net_H₁ = w₁·x₁ + w₂·x₂ + b₁
= (0.15)(0.05) + (0.20)(0.10) + b₁
= 0.0075 + 0.020 + b₁ = 0.3825 (including bias)
net_H₂ = w₃·x₁ + w₄·x₂ + b₂
= (0.25)(0.05) + (0.30)(0.10) + b₂
= 0.0125 + 0.030 + b₂ = 0.3900 (including bias)
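The notes give the final net values but never state b₁ and b₂ themselves. As a small sketch, the weighted sums can be computed directly and the biases implied by the stated totals recovered by subtraction. The names `b1_implied` and `b2_implied` are illustrative, and their values are inferred here, not taken from the notes.

```python
# Inputs and input-to-hidden weights from the problem statement
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30

# Weighted sums before any bias is added
sum_H1 = w1 * x1 + w2 * x2
sum_H2 = w3 * x1 + w4 * x2

# Net values as stated in the notes (bias already included)
net_H1, net_H2 = 0.3825, 0.3900

# Assumption: the bias is whatever remains after removing the weighted sum
b1_implied = net_H1 - sum_H1
b2_implied = net_H2 - sum_H2

print(f"sum_H1 = {sum_H1:.4f}, implied b1 = {b1_implied:.4f}")
print(f"sum_H2 = {sum_H2:.4f}, implied b2 = {b2_implied:.4f}")
```

Under that assumption the two hidden neurons carry slightly different biases, which is worth keeping in mind when re-implementing the network.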
② Hidden Layer Outputs: Sigmoid Activation
out_H₁ = σ(net_H₁) = σ(0.3825) = 0.5944
out_H₂ = σ(net_H₂) = σ(0.3900) = 0.5963
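The two activations can be checked with a few lines of Python. This is a minimal sketch using only the standard library; `sigmoid` is a helper name chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Net inputs as stated in the notes (bias included)
net_H1, net_H2 = 0.3825, 0.3900

out_H1 = sigmoid(net_H1)
out_H2 = sigmoid(net_H2)

print(f"out_H1 = {out_H1:.4f}")
print(f"out_H2 = {out_H2:.4f}")
```

Both results should agree with the hand-computed values to about three decimal places; the last digit can differ because the notes round intermediate steps.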
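③ Output Layer: net and activation
net_Y₁ = w₅·out_H₁ + w₇·out_H₂
= (0.40)(0.5944) + (0.50)(0.5963)
= 0.2378 + 0.2982 = 0.5359
net_Y₂ = w₆·out_H₁ + w₈·out_H₂
= (0.45)(0.5944) + (0.55)(0.5963)
= 0.2675 + 0.3280 = 0.5955
The notes record the resulting activations as ŷ₁ = 0.757 and ŷ₂ = 0.768, and the rest of the walkthrough uses those figures. Evaluating σ directly at the net values above gives about 0.631 and 0.645, so the notes' activations do not follow from these net inputs alone.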
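A short numerical check of this step is sketched below. It reproduces the two net inputs from the listed weights and then applies the sigmoid, so the result can be compared with the 0.757 and 0.768 recorded in the notes. The variable names are illustrative, and no output bias is assumed because none appears in the net formulas.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hidden activations and hidden-to-output weights from the walkthrough
out_H1, out_H2 = 0.5944, 0.5963
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55

# Net inputs to the output neurons (assumption: no output bias)
net_Y1 = w5 * out_H1 + w7 * out_H2
net_Y2 = w6 * out_H1 + w8 * out_H2

# Sigmoid applied directly to those net inputs
act_Y1 = sigmoid(net_Y1)
act_Y2 = sigmoid(net_Y2)

print(f"net_Y1 = {net_Y1:.4f}  sigmoid = {act_Y1:.4f}  (notes: 0.757)")
print(f"net_Y2 = {net_Y2:.4f}  sigmoid = {act_Y2:.4f}  (notes: 0.768)")
```

The net inputs match the hand calculation, while the computed activations come out lower than the notes' values. The listed weights do not explain that gap.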
④ Total Error: MSE Loss
E_Y₁ = ½ × (T₁ − ŷ₁)² = ½ × (0.01 − 0.757)²
= ½ × (−0.747)² = ½ × 0.5580 = 0.279
E_Y₂ = ½ × (T₂ − ŷ₂)² = ½ × (0.99 − 0.768)²
= ½ × (0.222)² = ½ × 0.0493 = 0.025
E_total = E_Y₁ + E_Y₂ = 0.279 + 0.025 = 0.304
| Neuron | Net Input (z) | Activation σ(z) | Target | Error |
|---|---|---|---|---|
| H₁ | 0.3825 | 0.5944 | n/a | hidden |
| H₂ | 0.3900 | 0.5963 | n/a | hidden |
| Y₁ | 0.5359 | 0.757 | 0.01 | 0.279 |
| Y₂ | 0.5955 | 0.768 | 0.99 | 0.025 |
| E_total | | | | 0.304 |
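The error figures in the table can be reproduced from the activations recorded in the notes. The sketch below uses a helper named `half_squared_error`, an illustrative name for the ½ × (target − output)² term used above.

```python
def half_squared_error(target: float, output: float) -> float:
    """Per-neuron loss: one half of the squared difference."""
    return 0.5 * (target - output) ** 2

T1, T2 = 0.01, 0.99
y1, y2 = 0.757, 0.768  # activations as recorded in the notes

E_Y1 = half_squared_error(T1, y1)
E_Y2 = half_squared_error(T2, y2)
E_total = E_Y1 + E_Y2

print(f"E_Y1    = {E_Y1:.4f}")
print(f"E_Y2    = {E_Y2:.4f}")
print(f"E_total = {E_total:.4f}")
```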
Backward Pass: Updating w₅ (Output Layer Weight)
We want to find ∂E_total/∂w₅, that is, how much the total error changes when we tweak w₅. By the chain rule this decomposes into three terms, each answering a specific question about sensitivity.
w₅_new = w₅ − η · (∂E_total/∂w₅)
∂E_total/∂w₅
= ∂E_Y₁/∂out_Y₁ × ∂out_Y₁/∂net_Y₁ × ∂net_Y₁/∂w₅
= (ŷ₁ − T₁) × ŷ₁(1 − ŷ₁) × out_H₁
∂E_Y₁/∂out_Y₁
= ŷ₁ − T₁ = 0.757 − 0.01 = 0.747
∂out_Y₁/∂net_Y₁
= σ'(net_Y₁) = ŷ₁ × (1 − ŷ₁)
= 0.757 × (1 − 0.757) = 0.757 × 0.243 = 0.1840
∂net_Y₁/∂w₅
net_Y₁ = w₅·out_H₁ + w₇·out_H₂ → ∂/∂w₅ = out_H₁
= 0.5944
∂E/∂w₅
= 0.747 × 0.1840 × 0.5944
= 0.1375 × 0.5944 = 0.0817 ≈ 0.081
w₅_new = w₅ − η × 0.081 = 0.40 − 0.5 × 0.081 = 0.40 − 0.0405
= 0.3595
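The three-term product for w₅ translates almost line for line into code. The sketch below uses the activations recorded in the notes and carries full precision. Its result can therefore differ in the last digit from the hand calculation, which rounds the gradient to 0.081 before updating. Variable names are chosen here for illustration.

```python
# Values from the walkthrough
T1 = 0.01        # target for Y1
y1 = 0.757       # activation of Y1 as recorded in the notes
out_H1 = 0.5944  # activation of H1
w5 = 0.40        # weight H1 -> Y1
lr = 0.5         # learning rate

# The three chain-rule terms
dE_dout = y1 - T1           # how wrong is Y1?
dout_dnet = y1 * (1 - y1)   # slope of the sigmoid at Y1
dnet_dw5 = out_H1           # activation that multiplies w5

grad_w5 = dE_dout * dout_dnet * dnet_dw5
w5_new = w5 - lr * grad_w5

print(f"dE/dw5 = {grad_w5:.4f}")
print(f"w5_new = {w5_new:.4f}")
```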
Backward Pass: Updating w₁ (Hidden Layer Weight)
Updating a hidden-layer weight is harder: the error must propagate backwards through the output neurons first. There are four chain rule terms (A, B, C, D), labelled exactly as in the handwritten notes.
w₁ connects x₁ to H₁. Changing w₁ affects H₁'s output, which affects both Y₁ and Y₂, which affects the total error. The chain of influence traced in the notes is: w₁ → net_H₁ → out_H₁ → net_Y₁ → out_Y₁ → E. The notes follow only the Y₁ path and write four terms for it. A complete gradient would also multiply by the hidden sigmoid slope ∂out_H₁/∂net_H₁ = out_H₁(1 − out_H₁) and add the matching contribution through Y₂.
A: ∂E_total/∂ŷ₁
= ŷ₁ − T₁ = 0.75137 − 0.01 = 0.74137
(notes use 0.75137 as the refined ŷ₁ value at this step)
B: ∂ŷ₁/∂net_Y₁
= ŷ₁ × (1 − ŷ₁) = 0.75137 × (1 − 0.75137)
= 0.75137 × 0.24863 = 0.18676
C: ∂net_Y₁/∂out_H₁
net_Y₁ = w₅·out_H₁ + ... → ∂/∂out_H₁ = w₅
= 0.40
D: ∂net_H₁/∂w₁
net_H₁ = w₁·x₁ + w₂·x₂ + b → ∂/∂w₁ = x₁
= 0.05
∂E/∂w₁
= 0.74137 × 0.18676 × 0.40 × 0.05
= 0.13847 × 0.40 × 0.05 = 0.055388 × 0.05
= 0.00277 (the notes record 0.0277, a decimal-place slip)
w₁_new = w₁ − η × 0.00277 = 0.15 − 0.5 × 0.00277 = 0.15 − 0.001385
= 0.148615 (the notes, carrying 0.0277 forward, arrive at 0.13615)
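For comparison, the sketch below computes the complete gradient for w₁, including both output paths and the hidden sigmoid slope, and checks it against a central finite difference. It rests on several assumptions that are not in the notes: the hidden biases are the values implied by the stated net inputs (0.355 and 0.3475), there is no output bias, and every activation is computed by the sigmoid. Its numbers therefore differ from the hand calculation. The helper names `forward` and `sigmoid` are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

X = (0.05, 0.10)      # inputs x1, x2
T = (0.01, 0.99)      # targets T1, T2
B_H = (0.355, 0.3475) # assumption: biases implied by net_H1, net_H2

def forward(w):
    """w = [w1, ..., w8]; returns hidden outputs, outputs, total error."""
    w1, w2, w3, w4, w5, w6, w7, w8 = w
    h1 = sigmoid(w1 * X[0] + w2 * X[1] + B_H[0])
    h2 = sigmoid(w3 * X[0] + w4 * X[1] + B_H[1])
    y1 = sigmoid(w5 * h1 + w7 * h2)  # assumption: no output bias
    y2 = sigmoid(w6 * h1 + w8 * h2)
    error = 0.5 * (T[0] - y1) ** 2 + 0.5 * (T[1] - y2) ** 2
    return (h1, h2), (y1, y2), error

weights = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]
(h1, h2), (y1, y2), _ = forward(weights)

# Error signals at the two output neurons
delta_Y1 = (y1 - T[0]) * y1 * (1 - y1)
delta_Y2 = (y2 - T[1]) * y2 * (1 - y2)

# Full chain rule: both output paths, hidden sigmoid slope, then the input
w5, w6 = weights[4], weights[5]
grad_w1 = (delta_Y1 * w5 + delta_Y2 * w6) * h1 * (1 - h1) * X[0]

# Central finite difference on w1 as an independent check
eps = 1e-6
plus = weights.copy()
plus[0] += eps
minus = weights.copy()
minus[0] -= eps
grad_w1_numeric = (forward(plus)[2] - forward(minus)[2]) / (2 * eps)

print(f"analytic dE/dw1 = {grad_w1:.8f}")
print(f"numeric  dE/dw1 = {grad_w1_numeric:.8f}")
```

Agreement between the analytic and numeric values is a quick way to confirm that no chain-rule term has been dropped.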
All Weight Updates Summary
The two weights solved in the handwritten notes establish the pattern for all remaining weights. Every weight follows the same chain rule; only the path through the network differs.
Y₁'s prediction (0.757) is far above its target (0.01), so the error term ŷ₁ − T₁ is positive. Because the sigmoid slope and the hidden activations feeding Y₁ are also positive, every gradient for a weight into Y₁ is positive. Gradient descent subtracts a positive number, so the weights connected to Y₁ decrease and the network learns to output a smaller value for Y₁ next time.
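The same pattern can be applied to all four hidden-to-output weights at once. The sketch below is one possible vectorised form using NumPy, with the activations recorded in the notes. The matrix layout (rows for output neurons, columns for hidden neurons) and the variable names are choices made here, not something the notes prescribe.

```python
import numpy as np

y_hat = np.array([0.757, 0.768])    # output activations from the notes
target = np.array([0.01, 0.99])     # T1, T2
out_H = np.array([0.5944, 0.5963])  # hidden activations
lr = 0.5

# Rows = output neuron (Y1, Y2), columns = hidden neuron (H1, H2)
W_out = np.array([[0.40, 0.50],    # w5, w7 feed Y1
                  [0.45, 0.55]])   # w6, w8 feed Y2

# Error signal per output neuron: (y_hat - T) * sigmoid slope
delta = (y_hat - target) * y_hat * (1 - y_hat)

# Gradient for each weight = delta of its output neuron * activation feeding it
grad_W_out = np.outer(delta, out_H)
W_out_new = W_out - lr * grad_W_out

print("gradients:\n", grad_W_out.round(4))
print("updated weights:\n", W_out_new.round(4))
```

The first row of gradients is positive, so the weights into Y₁ shrink. The second row is negative because Y₂ is below its target, so the weights into Y₂ grow.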
The Four Chain Rule Questions, Answered
The handwritten notes frame backpropagation as four distinct questions. Each becomes one term in the chain rule product. Here they are explained individually.
∂E/∂out_Y₁ = ŷ₁ − T₁ = 0.757 − 0.01 = 0.747
This term measures how wrong Y₁ is. If ŷ₁ equals T₁ exactly, this term is 0 and the weight update is 0, so no learning is needed.
σ'(z) = σ(z) × (1 − σ(z)) = ŷ₁ × (1 − ŷ₁) = 0.757 × 0.243 = 0.184
This tells you how steep the sigmoid curve is at the current value of Y₁. When the output is near 0 or 1 the sigmoid is flat (derivative ≈ 0), which is the source of the vanishing gradient problem. When the output is near 0.5 (z near 0) the curve is steepest, with a maximum derivative of 0.25.
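The shape of the sigmoid derivative is easy to see numerically. The sketch below evaluates σ'(z) at a handful of arbitrarily chosen points; `sigmoid_prime` is an illustrative helper name.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid: s * (1 - s) with s = sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Sample points chosen for illustration: saturated, moderate, and centre
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}  sigmoid = {sigmoid(z):.4f}  slope = {sigmoid_prime(z):.4f}")
```

The slope peaks at z = 0 and falls towards zero as |z| grows, which is why saturated neurons pass back very little gradient.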
Differentiating with respect to w₅:
∂net_Y₁/∂w₅ = out_H₁ = 0.5944
The derivative of a linear function with respect to a weight is just the activation that multiplies it. This is why gradients are larger for strongly activated neurons.
Differentiating with respect to w₁:
∂net_H₁/∂w₁ = x₁ = 0.05
The input value itself! This is why networks learn slowly from very small input values: the gradient for the weight is scaled by the input. Larger inputs produce larger weight gradients and therefore larger updates.
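To make the scaling concrete, the sketch below holds the error signal at a neuron fixed and varies only the input that multiplies the weight. The value of `delta_H1` is hypothetical and chosen purely for illustration, as is the helper name `weight_gradient`.

```python
def weight_gradient(delta: float, x: float) -> float:
    """Gradient for a weight whose input is x, given the error
    signal delta arriving at the neuron's net input."""
    return delta * x

delta_H1 = 0.01  # hypothetical error signal at net_H1 (not from the notes)

# x1 and x2 from the walkthrough, plus a larger input for contrast
grads = {x: weight_gradient(delta_H1, x) for x in (0.05, 0.10, 1.0)}

for x, g in grads.items():
    print(f"x = {x:<5} gradient = {g:.5f}")
```

With the same error signal, the weight attached to x₂ = 0.10 receives twice the gradient of the weight attached to x₁ = 0.05.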
Python Verification: Recomputing Every Number
import numpy as np

# ── Network from handwritten notes ──
x1, x2 = 0.05, 0.10   # inputs
T1, T2 = 0.01, 0.99   # targets
lr = 0.5              # learning rate η

# Weights: input → hidden layer
w1, w2 = 0.15, 0.20   # x → H1
w3, w4 = 0.25, 0.30   # x → H2

# Weights: hidden → output layer
w5, w6 = 0.40, 0.45   # H1→Y1, H1→Y2
w7, w8 = 0.50, 0.55   # H2→Y1, H2→Y2

# Bias terms are incorporated into net (as per notes).
# Notes give net_H1 = 0.3825 and net_H2 = 0.390 directly.

def sig(z):
    """Sigmoid activation σ(z)."""
    return 1 / (1 + np.exp(-z))

def sigD(z):
    """Sigmoid derivative σ'(z) = σ(z)(1 − σ(z))."""
    s = sig(z)
    return s * (1 - s)

# ── FORWARD PASS ──
net_H1, net_H2 = 0.3825, 0.390  # from notes (include biases)
out_H1 = sig(net_H1)
out_H2 = sig(net_H2)
net_Y1 = w5*out_H1 + w7*out_H2
net_Y2 = w6*out_H1 + w8*out_H2
# Computed directly from the net values: sig(net_Y1) is about 0.631,
# lower than the 0.757 recorded in the notes.
out_Y1 = sig(net_Y1)
out_Y2 = sig(net_Y2)
E_Y1 = 0.5*(T1 - out_Y1)**2
E_Y2 = 0.5*(T2 - out_Y2)**2
E_tot = E_Y1 + E_Y2

print("=== FORWARD PASS ===")
print(f"out_H1 = {out_H1:.4f} out_H2 = {out_H2:.4f}")
print(f"out_Y1 = {out_Y1:.4f} out_Y2 = {out_Y2:.4f}")
print(f"E_Y1 = {E_Y1:.4f} E_Y2 = {E_Y2:.4f}")
print(f"E_total= {E_tot:.4f}")

# ── BACKWARD: w5 ──
# ∂E/∂w5 = (ŷ1−T1) × ŷ1(1−ŷ1) × out_H1
A_w5 = out_Y1 - T1    # term A
B_w5 = sigD(net_Y1)   # term B = ŷ1(1-ŷ1)
C_w5 = out_H1         # term C
grad_w5 = A_w5 * B_w5 * C_w5
w5_new = w5 - lr * grad_w5

print("\n=== BACKWARD: w5 ===")
print(f"A (ŷ1−T1) = {A_w5:.4f}")
print(f"B σ'(netY1) = {B_w5:.4f}")
print(f"C (out_H1) = {C_w5:.4f}")
print(f"∂E/∂w5 = {grad_w5:.4f}")
print(f"w5_new = {w5_new:.4f} (notes: 0.3595)")

# ── BACKWARD: w1 ──
# ∂E/∂w1 = (ŷ1−T1) × ŷ1(1−ŷ1) × w5 × x1
# Four-term product as written in the notes (Y1 path only; it leaves out
# the hidden sigmoid slope σ'(net_H1) and the contribution through Y2).
A_w1 = out_Y1 - T1    # term A
B_w1 = sigD(net_Y1)   # term B
C_w1 = w5             # term C = ∂netY1/∂outH1
D_w1 = x1             # term D = ∂netH1/∂w1 = x1
grad_w1 = A_w1 * B_w1 * C_w1 * D_w1
w1_new = w1 - lr * grad_w1

print("\n=== BACKWARD: w1 ===")
print(f"A × B = {A_w1*B_w1:.5f}")
print(f"× C (w5) = {A_w1*B_w1*C_w1:.5f}")
print(f"× D (x1) = {grad_w1:.5f}")
print(f"∂E/∂w1 = {grad_w1:.4f}")
print(f"w1_new = {w1_new:.5f} (notes: 0.13615)")
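The handwritten notes use rounded intermediate values (e.g., out_H1 = 0.594 instead of 0.5944). Each rounding carries forward into the next calculation. This is completely normal in hand computation: the method is identical, and small last-digit differences come purely from rounding in intermediate steps. Rounding does not explain every difference, however. The script computes ŷ₁ = σ(net_Y₁) ≈ 0.631 from the listed weights, not the 0.757 recorded in the notes, and the notes' ∂E/∂w₁ = 0.0277 is ten times the product of its four listed terms (0.00277). As a result the script's w₅_new (≈ 0.357) is close to the notes' 0.3595, while its w₁_new (≈ 0.1486) differs visibly from the notes' 0.13615.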