
Forward Propagation in Neural Networks

Forward propagation is the process by which input data flows through a neural network, layer by layer, using weighted sums, bias additions, and activation functions, until it reaches the output as a probability distribution. It is the network's prediction engine, with no learning involved.

Section 01

The Story: A Whisper Telephone Through Many Rooms

The Translated Message
Imagine you stand at the entrance of a building with many rooms in series. You whisper a message (say, a photo of a cat) into Room 1. The workers in Room 1 don't understand the raw photo. Instead, they each take a weighted blend of every pixel, add their own bias, and pass their result to Room 2. Room 2 does the same, then Room 3, and so on.

By the last room, the original pixels have been transformed into something much more abstract: a sentence of probabilities, "90% cat, 7% fox, 3% dog."

That journey (input → weighted sums → activations → output) is forward propagation. Nothing learns yet. It is pure, deterministic arithmetic flowing in one direction.
One-Line Definition

Forward propagation is the process of passing an input through every layer of a neural network, computing weighted sums and applying activations, to produce a final prediction. No weights change during the forward pass.


Section 02

The Computation Graph

Each layer is a station. Data flows strictly left → right. Every station performs two operations: an affine transformation and an activation. The full forward pass runs through these stages:

INPUT: x [2×1] (raw features)
AFFINE 1: z¹ = W¹x + b¹, with W¹: [3×2] and b¹: [3×1] (weighted sum)
ACTIVATE: a¹ = σ(z¹), ReLU / tanh (non-linearity)
AFFINE 2: z² = W²a¹ + b², with W²: [2×3] and b²: [2×1] (output weights)
SOFTMAX: ŷ = softmax(z²), Σ ŷᵢ = 1 (probabilities)
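The shapes in the graph can be checked with a short NumPy sketch. Only the shapes are taken from the graph; the random values, the seed, and the choice of ReLU as the hidden activation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, illustration only

x  = rng.normal(size=(2, 1))     # INPUT:    x  is [2x1]
W1 = rng.normal(size=(3, 2))     # AFFINE 1: W1 is [3x2]
b1 = rng.normal(size=(3, 1))     #           b1 is [3x1]
W2 = rng.normal(size=(2, 3))     # AFFINE 2: W2 is [2x3]
b2 = rng.normal(size=(2, 1))     #           b2 is [2x1]

z1 = W1 @ x + b1                 # [3x2] @ [2x1] + [3x1] -> [3x1]
a1 = np.maximum(0, z1)           # ACTIVATE (ReLU assumed), shape unchanged
z2 = W2 @ a1 + b2                # [2x3] @ [3x1] + [2x1] -> [2x1]

e = np.exp(z2 - z2.max())        # SOFTMAX, with the max subtracted for stability
y_hat = e / e.sum()

print(z1.shape, a1.shape, z2.shape, y_hat.shape)
print(float(y_hat.sum()))        # should be 1 up to floating-point rounding
```

If a weight matrix had the wrong shape, the `@` product would raise an error, which makes a shape check like this a cheap sanity test before training.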

Section 03

The Four Core Operations

Affine Transformation
z = Wx + b
Each neuron computes a weighted sum of every input, then adds a bias. W rotates and scales; b shifts the result. Pure linear algebra.

Activation Function
a = f(z)
Applied element-wise after the affine step. Introduces non-linearity so the network can learn curved decision boundaries, not just straight lines.

Layer-by-Layer
aˡ = f(Wˡ aˡ⁻¹ + bˡ)
Output of one layer becomes the input to the next. Depth lets each layer learn increasingly abstract representations of the data.

Softmax Output
ŷᵢ = eᶻⁱ / Σⱼ eᶻʲ
Converts raw output scores (logits) into a valid probability distribution: all values between 0 and 1, and they sum to exactly 1.
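The four operations combine into one short loop. The sketch below is a minimal illustration, not a reference implementation: the `forward` helper, the `(W, b, activation)` tuple layout, and the 2-3-2 weights are all hypothetical choices.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

def forward(x, layers):
    """Apply each (W, b, activation) tuple in order and return the last activation."""
    a = x
    for W, b, f in layers:
        z = W @ a + b            # affine transformation
        a = f(z)                 # activation; output feeds the next layer
    return a

# Hypothetical 2-3-2 network with made-up weights and zero biases
layers = [
    (np.array([[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]), np.zeros(3), relu),
    (np.array([[0.3, -0.2, 0.1], [-0.1, 0.4, 0.2]]), np.zeros(2), softmax),
]

x = np.array([1.0, 2.0])
y_hat = forward(x, layers)
print(y_hat, y_hat.sum())
```

Keeping the activation as part of each layer tuple means the same loop handles hidden layers and the softmax output without special cases.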
Common Activation Functions
ReLU: f(z) = max(0, z). Risk of dead neurons, since the gradient is zero for z < 0.
Sigmoid: f(z) = 1/(1+e⁻ᶻ). Output ∈ (0, 1); prone to vanishing gradients.
Tanh: f(z) = (eᶻ−e⁻ᶻ)/(eᶻ+e⁻ᶻ). Output ∈ (−1, 1); zero-centred.
Softmax: ŷᵢ = eᶻⁱ / Σⱼ eᶻʲ. Typically used at the output layer; outputs sum to 1.
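The four activations can be written in a few lines of NumPy. This is a sketch for checking the output ranges listed above; the sample vector `z` is an arbitrary choice.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)            # same as (e^z - e^-z) / (e^z + e^-z)

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])   # arbitrary sample inputs

for name, f in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh), ("softmax", softmax)]:
    print(f"{name:8s}", f(z))
```

ReLU zeroes the negative entry, sigmoid and tanh squash values into (0, 1) and (−1, 1) respectively, and softmax is the only one whose outputs depend on the whole vector rather than on each element alone.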

Section 04

Numerical 1: Single Neuron, One Layer

Setup

A neuron receives inputs x = [2, 3]ᵀ, weights W = [0.5, −0.4], bias b = 1. Activation: ReLU.

Step-by-Step Computation
Step 1
Affine: z = (0.5×2) + (−0.4×3) + 1 = 1.0 − 1.2 + 1.0 = 0.8
Step 2
ReLU: a = max(0, 0.8) = 0.8
Output
Neuron fires with a = 0.8
Figure: inputs x₁ = 2 (w₁ = 0.5) and x₂ = 3 (w₂ = −0.4) plus bias b = 1 feed the sum Σ, giving z = 0.8; ReLU then outputs a = 0.8.
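The same single-neuron computation in plain Python, as a quick check of the arithmetic. The variable names are illustrative; the values are the ones from the setup.

```python
x = [2.0, 3.0]     # inputs
w = [0.5, -0.4]    # weights
b = 1.0            # bias

# Step 1, affine: weighted sum of the inputs plus the bias
z = sum(wi * xi for wi, xi in zip(w, x)) + b

# Step 2, ReLU: keep positive values, clamp negatives to zero
a = max(0.0, z)

print(f"z = {z:.1f}, a = {a:.1f}")   # prints: z = 0.8, a = 0.8
```

The values are formatted to one decimal because binary floating point stores 0.8 only approximately.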

Section 05

Numerical 2: Full 2-Layer Network + Softmax

Network Architecture

Input: 2 neurons  |  Hidden: 2 neurons (ReLU)  |  Output: 2 neurons (Softmax), for binary classification.

Given Values
Input: x = [1, 2]ᵀ
W¹ = [[0.1, 0.2], [0.3, 0.4]],   b¹ = [0, 0]ᵀ
W² = [[0.5, −0.3], [−0.1, 0.6]],   b² = [0, 0]ᵀ
L1
Layer 1, Affine: z¹ = W¹x + b¹
z¹₁ = 0.1×1 + 0.2×2 = 0.1 + 0.4 = 0.5
z¹₂ = 0.3×1 + 0.4×2 = 0.3 + 0.8 = 1.1
∴ z¹ = [0.5, 1.1]ᵀ
A1
Layer 1, Activation: a¹ = ReLU(z¹)
a¹₁ = ReLU(0.5) = 0.5
a¹₂ = ReLU(1.1) = 1.1
∴ a¹ = [0.5, 1.1]ᵀ  (both positive, unchanged)
L2
Layer 2, Affine: z² = W²a¹ + b²
z²₁ = 0.5×0.5 + (−0.3)×1.1 = 0.25 − 0.33 = −0.08
z²₂ = (−0.1)×0.5 + 0.6×1.1 = −0.05 + 0.66 = 0.61
∴ z² = [−0.08, 0.61]ᵀ
SM
Output, Softmax: ŷ = softmax(z²)
e^(−0.08) ≈ 0.923    e^(0.61) ≈ 1.840    Sum ≈ 2.763
ŷ₁ = 0.923 ÷ 2.763 ≈ 0.334 → 33.4%
ŷ₂ = 1.840 ÷ 2.763 ≈ 0.666 → 66.6%
Sum = 1.000, a valid probability distribution
Softmax Output: Probability Distribution
Class 0: 33.4%
Class 1: 66.6% (predicted)
Verdict

The network predicts Class 1 with 66.6% confidence. These weights are arbitrary, untrained values; no learning has happened yet. Backpropagation will later adjust W¹, W², b¹, b² to improve this output.


Section 06

Python Implementation

import numpy as np

# -- Inputs and weights ---------------------------------
x  = np.array([1, 2], dtype=float)

W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.zeros(2)

W2 = np.array([[ 0.5, -0.3],
               [-0.1,  0.6]])
b2 = np.zeros(2)

# -- Activation helpers ---------------------------------
def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

# -- Forward propagation --------------------------------
z1    = W1 @ x + b1        # Layer 1 affine
a1    = relu(z1)            # Layer 1 activation

z2    = W2 @ a1 + b2       # Layer 2 affine
y_hat = softmax(z2)         # Softmax output

print(f"z1    = {z1}")
print(f"a1    = {a1}")
print(f"z2    = {z2}")
print(f"y_hat = {y_hat}")
print(f"Pred  = Class {np.argmax(y_hat)}")
OUTPUT (values rounded to three decimals)
z1    = [0.5 1.1]
a1    = [0.5 1.1]
z2    = [-0.08 0.61]
y_hat = [0.334 0.666]
Pred  = Class 1

Section 07

Golden Rules

Forward Propagation: Non-Negotiable Rules
1. No weights change during the forward pass. It is purely arithmetic: multiply, add, activate, repeat. Learning happens only afterwards, when backpropagation computes gradients and the weights are updated.
2. Every hidden layer must have a non-linear activation. Without it, stacking layers is pointless: a chain of purely linear transforms collapses into a single linear transform.
3. For single-label multi-class output, use Softmax + Cross-Entropy loss. Softmax ensures outputs are valid probabilities; Cross-Entropy measures how wrong they are.
4. When computing Softmax, always subtract the max logit first: e^(z − max(z)). This prevents numerical overflow with zero effect on the final probabilities.
5. The forward pass follows the same affine-activation chain at test time. Only layers such as Dropout and BatchNorm behave differently between training and inference.
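Rules 2 and 4 can both be checked numerically. The sketch below assumes random weights with an arbitrary seed, and `softmax_naive` / `softmax_stable` are hypothetical helper names introduced here for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, illustration only

# Rule 2: two affine layers with no activation equal one affine layer
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=2)

stacked = W2 @ (W1 @ x + b1) + b2        # layer 2 applied to layer 1
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2       # the equivalent single layer
collapsed = W_eq @ x + b_eq
print("layers collapse:", np.allclose(stacked, collapsed))

# Rule 4: subtracting the max logit leaves the probabilities unchanged
def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

small = np.array([-0.08, 0.61])          # logits from the worked example
print("same result:", np.allclose(softmax_naive(small), softmax_stable(small)))

large = np.array([1000.0, 1001.0])       # exp(1000) overflows float64
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)   # inf / inf gives nan
stable_large = softmax_stable(large)
print("naive :", naive_large)
print("stable:", stable_large)
```

The naive version returns `nan` for large logits, while the stable version still returns a valid distribution, because only the differences between logits matter to softmax.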