
Backpropagation: Solved Numerical Example (2x2)

Section 01

The Story: How a Neural Network Learns from Mistakes

The Music Student and the Tutor
A student plays a piece on the piano. The tutor listens and says: "You were 0.747 too loud on that note." The student doesn't adjust at random. The tutor traces back exactly which finger movement caused how much of the loudness error, then tells each finger precisely how much to ease off.

That is backpropagation. The neural network makes a prediction, compares it to the true answer, and then the algorithm walks backwards through every weight, working out how much each one contributed to the error and nudging it in proportion.

In this tutorial we solve a complete 2-input → 2-hidden → 2-output network step by step, following the handwritten notes: every number, every formula, every chain rule term.

This network has 2 inputs (x₁, x₂), 2 hidden neurons (H₁, H₂), and 2 output neurons (Y₁, Y₂). The notes list nine weights; the calculations below use w₁ to w₈. We compute the forward pass, the total error, and then backpropagate to update weights w₅ and w₁ in full detail.

📌
Network Values: From the Handwritten Notes

Inputs: x₁ = 0.05  |  x₂ = 0.10
Weights to hidden: w₁=0.15, w₂=0.20, w₃=0.25, w₄=0.30
Weights to output: w₅=0.40, w₆=0.45, w₇=0.50, w₈=0.55  (w₉=0.85)
Targets: T₁ = 0.01  |  T₂ = 0.99
Learning rate η = 0.5


Section 02

Interactive Animated Network: Step Through Every Calculation

[Interactive diagram, 16 steps: the 2-2-2 network (inputs x₁ = 0.05 and x₂ = 0.10, hidden neurons H₁ and H₂, outputs Y₁ and Y₂, weights w₁ to w₈) with the active nodes and edges highlighted at each step and the matching formula shown alongside.]

Section 03

Forward Pass: Every Calculation on Paper

① Hidden Layer Inputs (net values)

🔵 Computing net_H₁ and net_H₂: Weighted Sums
net_H₁
net_H₁ = w₁·x₁ + w₂·x₂ + b₁
= (0.15)(0.05) + (0.20)(0.10) + b₁
= 0.0075 + 0.0200 + b₁ = 0.3825  (including bias, which implies b₁ = 0.355)
net_H₂
net_H₂ = w₃·x₁ + w₄·x₂ + b₂
= (0.25)(0.05) + (0.30)(0.10) + b₂
= 0.0125 + 0.0300 + b₂ = 0.3900  (including bias, which implies b₂ = 0.3475)
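The two weighted sums above can be reproduced in a few lines. As a small sketch, the snippet also backs out the bias that each total implies. The values `b1` and `b2` are inferred here, since the notes only quote the totals 0.3825 and 0.3900, and the variable names are illustrative.

```python
# Inputs and input-to-hidden weights from the notes
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30

# Net inputs quoted in the notes (bias already included)
net_H1, net_H2 = 0.3825, 0.3900

# Weighted sums without the bias term
sum_H1 = w1 * x1 + w2 * x2   # 0.0075 + 0.0200
sum_H2 = w3 * x1 + w4 * x2   # 0.0125 + 0.0300

# Implied biases: an inference from the totals, not values stated in the notes
b1 = net_H1 - sum_H1
b2 = net_H2 - sum_H2

print(f"sum_H1 = {sum_H1:.4f}, implied b1 = {b1:.4f}")
print(f"sum_H2 = {sum_H2:.4f}, implied b2 = {b2:.4f}")
```

Keeping the weighted sum and the bias separate makes it easier to see which part of each net input actually depends on w₁ to w₄.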

② Hidden Layer Outputs: Sigmoid Activation

out_H₁ = σ(net_H₁)
= 1 / (1 + e^(−0.3825))
= 1 / (1 + 0.6822) = 1 / 1.6822
≈ 0.5945  (the notes truncate this to 0.5944 and carry that value forward)
out_H₂ = σ(net_H₂)
= 1 / (1 + e^(−0.3900))
= 1 / (1 + 0.6771) = 1 / 1.6771
= 0.5963
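The sigmoid step is easy to check numerically. The following is a minimal sketch using only the standard library; the function name `sigmoid` is an illustrative choice.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

# Net inputs of the hidden neurons, as quoted in the notes
out_H1 = sigmoid(0.3825)
out_H2 = sigmoid(0.3900)

print(f"out_H1 = {out_H1:.5f}")
print(f"out_H2 = {out_H2:.5f}")
```

Printing five decimals shows where hand rounding enters: out_H₁ is close to 0.59448, so a four-digit value of 0.5944 is a truncation rather than a rounding.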

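③ Output Layer: net and activation

🟢 Computing net_Y₁, net_Y₂ and output predictions
net_Y₁ (weighted sum, before bias)
= w₅·out_H₁ + w₇·out_H₂
= (0.40)(0.5944) + (0.50)(0.5963)
= 0.2378 + 0.2982 = 0.5359
ŷ₁
= σ(0.5359 + bias) = 0.757
net_Y₂ (weighted sum, before bias)
= w₆·out_H₁ + w₈·out_H₂
= (0.45)(0.5944) + (0.55)(0.5963)
= 0.2675 + 0.3280 = 0.5955
ŷ₂
= σ(0.5955 + bias) = 0.768
Note on the bias: σ(0.5359) on its own is 0.631 and σ(0.5955) is 0.645. The notes' outputs 0.757 and 0.768 match σ(1.1359) and σ(1.1955), so an output bias of about 0.60 appears to be included. The notes do not state this value; it is inferred from the results.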
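The output activations 0.757 and 0.768 only come out of the sigmoid if a bias is added to the weighted sums. The sketch below compares both cases. The value `b_out = 0.60` is an assumption inferred from the notes' results, not a number the notes state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

out_H1, out_H2 = 0.5944, 0.5963            # hidden activations from the notes
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55    # hidden-to-output weights

# ASSUMPTION: output bias inferred from the notes' results, not stated in them
b_out = 0.60

sum_Y1 = w5 * out_H1 + w7 * out_H2   # weighted sum without bias
sum_Y2 = w6 * out_H1 + w8 * out_H2

y1_no_bias, y2_no_bias = sigmoid(sum_Y1), sigmoid(sum_Y2)
y1, y2 = sigmoid(sum_Y1 + b_out), sigmoid(sum_Y2 + b_out)

print(f"without bias: y1 = {y1_no_bias:.3f}, y2 = {y2_no_bias:.3f}")
print(f"with b_out:   y1 = {y1:.3f}, y2 = {y2:.3f}")
```

Without the bias the predictions are roughly 0.631 and 0.645; with the assumed 0.60 they land on the 0.757 and 0.768 used in the rest of the worked example.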

④ Total Error: MSE Loss

🔴 E_total = ½(T₁−ŷ₁)² + ½(T₂−ŷ₂)²
E_Y₁
= ½(T₁ − ŷ₁)² = ½(0.01 − 0.757)²
= ½ × (−0.747)² = ½ × 0.5580 = 0.279
E_Y₂
= ½(T₂ − ŷ₂)² = ½(0.99 − 0.768)²
= ½ × (0.222)² = ½ × 0.0493 = 0.025
E_total
= E_Y₁ + E_Y₂ = 0.279 + 0.025 = 0.304
Neuron | Net Input (z) | Activation σ(z) | Target | Error
H₁ | 0.3825 | 0.5944 | - | hidden
H₂ | 0.3900 | 0.5963 | - | hidden
Y₁ | 0.5359 (before bias) | 0.757 | 0.01 | 0.279
Y₂ | 0.5955 (before bias) | 0.768 | 0.99 | 0.025
E_total | | | | 0.304
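The error column of the table can be recomputed directly from the targets and predictions. This is a minimal sketch; the helper name `half_squared_error` is illustrative.

```python
def half_squared_error(target: float, output: float) -> float:
    """Per-output error used in the notes: 0.5 * (T - y)^2."""
    return 0.5 * (target - output) ** 2

targets = [0.01, 0.99]     # T1, T2
outputs = [0.757, 0.768]   # rounded predictions from the forward pass

per_output = [half_squared_error(t, y) for t, y in zip(targets, outputs)]
E_total = sum(per_output)

for name, err in zip(("E_Y1", "E_Y2"), per_output):
    print(f"{name} = {err:.3f}")
print(f"E_total = {E_total:.3f}")
```

The factor ½ is a convenience: it cancels the 2 that appears when the square is differentiated, which is why Term A in the backward pass is simply ŷ − T.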

Section 04

Backward Pass: Updating w₅ (Output Layer Weight)

We want to find ∂E_total/∂w₅, that is, how much the total error changes when we tweak w₅. By the chain rule this decomposes into three terms, each answering a specific question about sensitivity.

📈
The Three Questions (Chain Rule for w₅)

w₅_new = w₅ − η · (∂E_total/∂w₅)

∂E_total/∂w₅ = ∂E_Y₁/∂out_Y₁ × ∂out_Y₁/∂net_Y₁ × ∂net_Y₁/∂w₅
             = (ŷ₁ − T₁)  ×  ŷ₁(1−ŷ₁)  ×  out_H₁

🔴 Chain Rule: Three Terms for ∂E/∂w₅
Term A
∂E_Y₁/∂out_Y₁
How much does the error change if Y₁'s output changes?
= ŷ₁ − T₁ = 0.757 − 0.01 = 0.747
Term B
∂out_Y₁/∂net_Y₁
How sensitive is Y₁'s output to its own net input?
= σ'(net_Y₁) = ŷ₁ × (1 − ŷ₁)
= 0.757 × (1 − 0.757) = 0.757 × 0.243 = 0.1840
Term C
∂net_Y₁/∂w₅
How does net_Y₁ change when w₅ changes?
net_Y₁ = w₅·out_H₁ + w₇·out_H₂  →  ∂/∂w₅ = out_H₁
= 0.5944
Full Gradient
∂E/∂w₅
Multiply all three terms:
= 0.747 × 0.1840 × 0.5944
= 0.1374 × 0.5944 = 0.0817  (the notes truncate this to 0.081)
w₅ updated
Gradient descent update (η = 0.5):
w₅_new = w₅ − η × 0.081 = 0.40 − 0.5 × 0.081 = 0.40 − 0.0405
= 0.3595  (about 0.3592 if the unrounded gradient 0.0817 is used)
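One way to gain confidence in the three-term product is to compare it with a numerical derivative. The sketch below does that for w₅ using a central finite difference. The output bias `b_out = 0.60` is an assumption inferred from ŷ₁ = 0.757, and the helper `error_Y1` is a hypothetical name introduced for this check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the notes; b_out = 0.60 is an ASSUMPTION inferred from y1 = 0.757
out_H1, out_H2 = 0.5944, 0.5963
w7 = 0.50
T1, b_out, lr = 0.01, 0.60, 0.5

def error_Y1(w5):
    """E_Y1 as a function of w5 alone (w5 does not feed Y2)."""
    y1 = sigmoid(w5 * out_H1 + w7 * out_H2 + b_out)
    return 0.5 * (T1 - y1) ** 2

w5 = 0.40
y1 = sigmoid(w5 * out_H1 + w7 * out_H2 + b_out)

# Analytic gradient: Term A * Term B * Term C
grad_analytic = (y1 - T1) * y1 * (1 - y1) * out_H1

# Central finite difference as an independent check
eps = 1e-6
grad_numeric = (error_Y1(w5 + eps) - error_Y1(w5 - eps)) / (2 * eps)

w5_new = w5 - lr * grad_analytic
print(f"analytic = {grad_analytic:.4f}, numeric = {grad_numeric:.4f}")
print(f"w5_new   = {w5_new:.4f}")
```

If the two gradients agree to many decimal places, the chain-rule terms were assembled correctly. The same check works for any other weight by changing which parameter the error function varies.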

Section 05

Backward Pass: Updating w₁ (Hidden Layer Weight)

Updating a hidden-layer weight is harder because the error must propagate backwards through the output neurons first. The notes use four chain rule terms (A, B, C, D), labelled here exactly as in the handwritten notes.

🔴
Why Four Terms? Path Matters

w₁ connects x₁ to H₁. Changing w₁ affects H₁'s output, which affects both Y₁ and Y₂, which affects the total error. The notes trace the path through Y₁: w₁ → net_H₁ → out_H₁ → net_Y₁ → out_Y₁ → E, and take one term each for the steps out_Y₁ → E, net_Y₁ → out_Y₁, out_H₁ → net_Y₁ and w₁ → net_H₁. Two pieces are left out of this four-term product: the sigmoid derivative at H₁, out_H₁(1 − out_H₁), for the step net_H₁ → out_H₁, and the second path through Y₂. A complete gradient includes both.

🔴 Chain Rule: Four Terms A×B×C×D for ∂E/∂w₁
Term A
∂E_total/∂out_Y₁
How does total error change if Y₁'s output changes?
= ŷ₁ − T₁ = 0.75137 − 0.01 = 0.74137
(notes use 0.75137 as the refined ŷ₁ value at this step)
Term B
∂out_Y₁/∂net_Y₁
Sigmoid derivative at Y₁: how steep is the curve here?
= ŷ₁ × (1 − ŷ₁) = 0.75137 × (1 − 0.75137)
= 0.75137 × 0.24863 = 0.18681
Term C
∂net_Y₁/∂out_H₁
How does input to Y₁ change when H₁'s output changes?
net_Y₁ = w₅·out_H₁ + ...  →  ∂/∂out_H₁ = w₅
= 0.40
Term D
∂net_H₁/∂w₁
How does H₁'s net input change when w₁ changes?
net_H₁ = w₁·x₁ + w₂·x₂ + b₁  →  ∂/∂w₁ = x₁
= 0.05
Full Gradient
∂E/∂w₁
Multiply A × B × C × D:
= 0.74137 × 0.18681 × 0.40 × 0.05
= 0.13850 × 0.40 × 0.05 = 0.055398 × 0.05
= 0.00277
w₁ updated
Gradient descent update (η = 0.5):
w₁_new = w₁ − η × 0.00277 = 0.15 − 0.5 × 0.00277 = 0.15 − 0.001385
= 0.148615
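The four-term product follows a single path through Y₁ and does not include the sigmoid derivative at H₁. As a hedged sketch, the snippet below compares it with the complete chain-rule gradient (both output paths, times out_H₁(1 − out_H₁)) and checks the complete version against a finite difference. The biases `b1`, `b2` and `b_out` are assumptions inferred from the notes' totals, and `total_error` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w2, w3, w4 = 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
T1, T2 = 0.01, 0.99
# ASSUMPTIONS: biases inferred from the notes' totals, not stated in them
b1, b2, b_out = 0.355, 0.3475, 0.60

def forward(w1):
    h1 = sigmoid(w1 * x1 + w2 * x2 + b1)
    h2 = sigmoid(w3 * x1 + w4 * x2 + b2)
    y1 = sigmoid(w5 * h1 + w7 * h2 + b_out)
    y2 = sigmoid(w6 * h1 + w8 * h2 + b_out)
    return h1, y1, y2

def total_error(w1):
    _, y1, y2 = forward(w1)
    return 0.5 * (T1 - y1) ** 2 + 0.5 * (T2 - y2) ** 2

w1 = 0.15
h1, y1, y2 = forward(w1)
delta1 = (y1 - T1) * y1 * (1 - y1)   # error signal at Y1
delta2 = (y2 - T2) * y2 * (1 - y2)   # error signal at Y2

# Product used in the notes: Y1 path only, no sigmoid derivative at H1
four_term = delta1 * w5 * x1
# Complete gradient: both paths, times sigma'(net_H1), times x1
full = (delta1 * w5 + delta2 * w6) * h1 * (1 - h1) * x1

eps = 1e-6
numeric = (total_error(w1 + eps) - total_error(w1 - eps)) / (2 * eps)

print(f"four-term product = {four_term:.5f}")
print(f"full gradient     = {full:.5f}")
print(f"finite difference = {numeric:.5f}")
```

Under these assumed biases the finite difference agrees with the complete gradient, which comes out around 0.00045, roughly six times smaller than the four-term figure. Both are positive, so w₁ decreases in either case; only the step size differs.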

Section 06

All Weight Updates Summary

The table below shows the two weights solved in the handwritten notes plus the pattern to apply for all remaining weights. Every weight follows the same chain rule; only the path through the network differs.

Weight | Old | Gradient | New | Change
w₅ | 0.4000 | ∂E/∂w₅ = 0.0817 | 0.3595 | −0.0405
w₁ | 0.1500 | ∂E/∂w₁ = 0.00277 | 0.148615 | −0.001385
w₂…w₈ | Same pattern | Apply A×B×C×D | W − η × grad | similar
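Why Do w₅ and w₁ Decrease Here?

Y₁'s prediction (0.757) is far above its target (0.01), so ŷ₁ − T₁ is positive. The sigmoid derivative and the hidden activations are positive too, so the gradients flowing back through Y₁ are positive. Gradient descent subtracts a positive number, so the weights feeding Y₁ (w₅ and w₇) decrease and the network outputs a smaller value for Y₁ next time. The weights feeding Y₂ (w₆ and w₈) move the other way: ŷ₂ (0.768) is below its target (0.99), so their gradients are negative and they increase.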
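The direction of each output-layer update follows from the sign of the output error signal δ = (ŷ − T) × ŷ(1 − ŷ), because the hidden activations that multiply it are positive. A minimal sketch with the rounded predictions from the forward pass:

```python
outputs = {"Y1": 0.757, "Y2": 0.768}   # predictions from the forward pass
targets = {"Y1": 0.01, "Y2": 0.99}

deltas = {}
for name, y in outputs.items():
    # Output error signal: (y - T) * y * (1 - y)
    deltas[name] = (y - targets[name]) * y * (1 - y)
    move = "decrease" if deltas[name] > 0 else "increase"
    print(f"{name}: delta = {deltas[name]:+.4f} -> weights into {name} {move}")
```

A positive δ means the neuron overshot its target and its incoming weights shrink; a negative δ means it undershot and they grow.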


Section 07

The Four Chain Rule Questions, Answered

The handwritten notes frame backpropagation as four distinct questions. Each becomes one term in the chain rule product. Here they are explained individually.

Question A: How much does the total error change if we change Y₁'s output?
This is ∂E_total/∂out_Y₁. Since E_total = ½(T₁−ŷ₁)² + ½(T₂−ŷ₂)², differentiating with respect to ŷ₁ gives:

∂E/∂out_Y₁ = ŷ₁ − T₁ = 0.757 − 0.01 = 0.747

This term measures how wrong Y₁ is. If ŷ₁ equals T₁ exactly, this term is 0 and so is the gradient along this path: no correction is needed.
Question B: How much does Y₁'s output change if its net input changes?
This is ∂out_Y₁/∂net_Y₁, the sigmoid derivative.

σ'(z) = σ(z) × (1 − σ(z)) = ŷ₁ × (1 − ŷ₁) = 0.757 × 0.243 = 0.184

This tells you how steep the sigmoid curve is at Y₁'s current net input. When the output is near 0 or 1 the sigmoid is flat (derivative ≈ 0), which is the source of the vanishing gradient problem. When the output is 0.5 the curve is steepest (maximum derivative = 0.25).
Question C: How does net_Y₁ change when we adjust w₅?
net_Y₁ = w₅·out_H₁ + w₇·out_H₂ + bias

Differentiating with respect to w₅:
∂net_Y₁/∂w₅ = out_H₁ = 0.5944

The derivative of a linear function with respect to a weight is just the activation that multiplies it. This is why gradients are larger for weights fed by strongly activated neurons.
Question D: How does net_H₁ change when we adjust w₁? (hidden weight only)
net_H₁ = w₁·x₁ + w₂·x₂ + bias

Differentiating with respect to w₁:
∂net_H₁/∂w₁ = x₁ = 0.05

The input value itself! This is why networks learn slowly from very small input values: the gradient for the weight is scaled by the input, so a larger input gives a proportionally larger weight gradient.
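The claim in Question B about where the sigmoid is steep and where it is flat can be tabulated quickly. This is a small sketch; `sigmoid_prime` is an illustrative name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, written in terms of its output."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Slope of the sigmoid at a few net inputs, from saturated to centred
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}  sigma = {sigmoid(z):.4f}  slope = {sigmoid_prime(z):.4f}")
```

The slope peaks at exactly 0.25 when z = 0 and falls off symmetrically on both sides, so a saturated neuron passes back only a small fraction of the error signal it receives.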

Section 08

Python Verification: Checking Every Number

import numpy as np

# --- Network from handwritten notes ---
x1, x2 = 0.05, 0.10
T1, T2  = 0.01, 0.99
lr      = 0.5

# Weights to hidden layer
w1, w2 = 0.15, 0.20   # x→H1
w3, w4 = 0.25, 0.30   # x→H2
# Weights to output layer
w5, w6 = 0.40, 0.45   # H1→Y1, H1→Y2
w7, w8 = 0.50, 0.55   # H2→Y1, H2→Y2

# Hidden bias terms are folded into net (as per notes):
# the notes give net_H1=0.3825, net_H2=0.390 directly.
# ASSUMPTION: the notes do not state the output bias. b_out = 0.60 is inferred,
# because it reproduces the notes' y1 = 0.757 (without it, y1 would be 0.631).
b_out = 0.60

def sig(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-z))

def sigD(z):
    """Sigmoid derivative, s * (1 - s)."""
    s = sig(z)
    return s * (1 - s)

# --- FORWARD PASS ---
net_H1, net_H2 = 0.3825, 0.390   # from notes (include biases)
out_H1 = sig(net_H1)
out_H2 = sig(net_H2)

net_Y1 = w5*out_H1 + w7*out_H2 + b_out
net_Y2 = w6*out_H1 + w8*out_H2 + b_out
out_Y1 = sig(net_Y1)
out_Y2 = sig(net_Y2)

E_Y1   = 0.5*(T1 - out_Y1)**2
E_Y2   = 0.5*(T2 - out_Y2)**2
E_tot  = E_Y1 + E_Y2

print("=== FORWARD PASS ===")
print(f"out_H1 = {out_H1:.4f}   out_H2 = {out_H2:.4f}")
print(f"out_Y1 = {out_Y1:.4f}   out_Y2 = {out_Y2:.4f}")
print(f"E_Y1   = {E_Y1:.4f}   E_Y2   = {E_Y2:.4f}")
print(f"E_total= {E_tot:.4f}")

# --- BACKWARD: w5 ---
# ∂E/∂w5 = (ŷ1−T1) × ŷ1(1−ŷ1) × out_H1
A_w5 = out_Y1 - T1              # term A
B_w5 = sigD(net_Y1)             # term B = ŷ1(1-ŷ1)
C_w5 = out_H1                   # term C
grad_w5 = A_w5 * B_w5 * C_w5
w5_new  = w5 - lr * grad_w5

print("\n=== BACKWARD: w5 ===")
print(f"A (ŷ1−T1)   = {A_w5:.4f}")
print(f"B σ'(netY1) = {B_w5:.4f}")
print(f"C (out_H1)  = {C_w5:.4f}")
print(f"∂E/∂w5     = {grad_w5:.4f}")
print(f"w5_new     = {w5_new:.4f}  (notes: 0.3595)")

# --- BACKWARD: w1 (four-term product from the notes) ---
# ∂E/∂w1 = (ŷ1−T1) × ŷ1(1−ŷ1) × w5 × x1
A_w1 = out_Y1 - T1              # term A
B_w1 = sigD(net_Y1)             # term B
C_w1 = w5                       # term C = ∂netY1/∂outH1
D_w1 = x1                       # term D = ∂netH1/∂w1 = x1
grad_w1 = A_w1 * B_w1 * C_w1 * D_w1
w1_new  = w1 - lr * grad_w1

print("\n=== BACKWARD: w1 ===")
print(f"A × B       = {A_w1*B_w1:.4f}")
print(f"× C (w5)    = {A_w1*B_w1*C_w1:.5f}")
print(f"× D (x1)    = {grad_w1:.5f}")
print(f"∂E/∂w1     = {grad_w1:.4f}")
print(f"w1_new     = {w1_new:.5f}")
OUTPUT
=== FORWARD PASS ===
out_H1 = 0.5945   out_H2 = 0.5963
out_Y1 = 0.7569   out_Y2 = 0.7677
E_Y1   = 0.2790   E_Y2   = 0.0247
E_total= 0.3037

=== BACKWARD: w5 ===
A (ŷ1−T1)   = 0.7469
B σ'(netY1) = 0.1840
C (out_H1)  = 0.5945
∂E/∂w5     = 0.0817
w5_new     = 0.3592  (notes: 0.3595)

=== BACKWARD: w1 ===
A × B       = 0.1374
× C (w5)    = 0.05497
× D (x1)    = 0.00275
∂E/∂w1     = 0.0027
w1_new     = 0.14863
💡
Why Slight Differences from the Notes?

The handwritten notes use rounded intermediate values (e.g., ŷ₁ = 0.757 instead of 0.7569, and a gradient of 0.081 instead of 0.0817). Each rounding carries forward into the next calculation. This is normal in hand computation: the method is identical, and for w₅ the two results agree to three significant figures (0.3592 vs 0.3595). For w₁ the script uses ŷ₁ = 0.7569 where the notes use 0.75137; the four-term product A × B × C × D comes to about 0.0028 either way, which gives w₁_new ≈ 0.1486.


Section 09

Golden Rules: How to Solve Any ANN on Paper

📄 Exam-Ready Rules for ANN Forward + Backprop
1. Forward pass first, always. Compute net (weighted sum) and activation (sigmoid) for every neuron, layer by layer, left to right. Write down both z and σ(z); you need them both in the backward pass.
2. Loss before backprop. Compute E_total = Σ ½(Tᵢ−ŷᵢ)² across all output neurons. This is the number you are trying to reduce.
3. Output layer weights: 3 chain rule terms. ∂E/∂w = (ŷ−T) × ŷ(1−ŷ) × out_prev_layer. Output error signal δ = (ŷ−T) × ŷ(1−ŷ).
4. Hidden layer weights: the error must travel through the output layer first. For each output neuron, multiply its δ by the weight connecting the hidden neuron to it; sum over the outputs; then multiply by the sigmoid derivative at the hidden neuron and by the input on the weight. The four-term product A×B×C×D in the notes traces the single path through Y₁.
5. ∂net/∂w = the activation that fed through that weight. This is always true: net = w·a + ..., so ∂net/∂w = a. For the first layer, a = x (the raw input).
6. Update rule: W_new = W_old − η × ∂E/∂W. A negative gradient means the weight goes up (for an output weight, the prediction was too low). A positive gradient means the weight goes down (the prediction was too high).
7. All weights update simultaneously at the end of each pass, not one at a time. Compute all gradients first using the old weights, then apply all updates together.
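The rules above can be put together into one complete training step for all eight weights. The following is a sketch under stated assumptions: the biases `b1`, `b2` and `b_out` are inferred from the notes' totals rather than given in them, the biases themselves are not updated, and the function and dictionary names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b):
    """Rule 1: forward pass, returning hidden and output activations."""
    h1 = sigmoid(w["w1"] * x[0] + w["w2"] * x[1] + b["b1"])
    h2 = sigmoid(w["w3"] * x[0] + w["w4"] * x[1] + b["b2"])
    y1 = sigmoid(w["w5"] * h1 + w["w7"] * h2 + b["b_out"])
    y2 = sigmoid(w["w6"] * h1 + w["w8"] * h2 + b["b_out"])
    return (h1, h2), (y1, y2)

def total_error(w, x, t, b):
    """Rule 2: sum of half squared errors over the outputs."""
    _, (y1, y2) = forward(w, x, b)
    return 0.5 * (t[0] - y1) ** 2 + 0.5 * (t[1] - y2) ** 2

def gradients(w, x, t, b):
    """Rules 3 to 5: gradients for all eight weights from the old weights."""
    (h1, h2), (y1, y2) = forward(w, x, b)
    d1 = (y1 - t[0]) * y1 * (1 - y1)                      # delta at Y1
    d2 = (y2 - t[1]) * y2 * (1 - y2)                      # delta at Y2
    dh1 = (d1 * w["w5"] + d2 * w["w6"]) * h1 * (1 - h1)   # signal at H1
    dh2 = (d1 * w["w7"] + d2 * w["w8"]) * h2 * (1 - h2)   # signal at H2
    return {
        "w1": dh1 * x[0], "w2": dh1 * x[1],
        "w3": dh2 * x[0], "w4": dh2 * x[1],
        "w5": d1 * h1, "w6": d2 * h1,
        "w7": d1 * h2, "w8": d2 * h2,
    }

x, t, lr = (0.05, 0.10), (0.01, 0.99), 0.5
# ASSUMPTIONS: biases inferred from the notes' totals, not stated in them
b = {"b1": 0.355, "b2": 0.3475, "b_out": 0.60}
w = {"w1": 0.15, "w2": 0.20, "w3": 0.25, "w4": 0.30,
     "w5": 0.40, "w6": 0.45, "w7": 0.50, "w8": 0.55}

# Rule 7: compute every gradient first, then apply all updates together
grads = gradients(w, x, t, b)
w_new = {name: w[name] - lr * grads[name] for name in w}

for name in w:
    print(f"{name}: {w[name]:.4f} -> {w_new[name]:.4f}  (grad {grads[name]:+.5f})")
print(f"E before = {total_error(w, x, t, b):.4f}")
print(f"E after  = {total_error(w_new, x, t, b):.4f}")
```

Building `w_new` in a single comprehension from the old dictionary is what keeps the update simultaneous: no gradient is ever computed from a weight that has already been changed in the same pass.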