Multilayer Perceptron (MLP) Tutorial

Section 01

The Story That Explains an MLP

📖 Real World Analogy

The Telephone Exchange — Signals That Grow Wiser at Every Floor

Imagine a tall building that processes mail. The ground floor receives raw envelopes — each slot takes one piece of information (age, income, pixel value). Workers on the ground floor do nothing clever; they just receive and pass the signal up.

The middle floors are where the real thinking happens. Workers there can only talk to the floor directly below and above. Each worker combines what they hear, applies their own twist, and forwards a refined signal upward. By Floor 3 the signal is no longer "pixel brightness" — it has become "edge", then "curve", then "nose shape".

The top floor has just a few people — one per possible answer. They listen to the refined signals from below and produce the final verdict: cat or dog, buy or sell, spam or not spam.

That building is a Multilayer Perceptron. The floors are layers. The workers are neurons. The twist each worker applies is an activation function.

A Multilayer Perceptron (MLP) is the simplest fully connected feedforward neural network. Every neuron in one layer connects to every neuron in the next: no skips, no loops, no convolutions. Data flows in one direction, from input to output. Despite that simplicity, a sufficiently wide MLP can approximate any continuous function on a bounded domain, a fact so powerful it earned its own theorem (Section 05).

⚡

Why MLP First?

CNNs, RNNs and Transformers all contain fully connected layers, for example the classifier head of a CNN or the feed-forward block inside every Transformer layer. Master the MLP and you understand a building block that every modern architecture is assembled from. It is the hydrogen atom of deep learning.

Section 02

Input, Hidden & Output Layers — The Three Tribes

Every MLP has exactly three kinds of layer. They have fundamentally different jobs, and confusing them is the fastest way to make wrong architectural decisions.

🔘

Input Layer

No computation. Pure ingestion.

One neuron per feature. A 28×28 image → 784 input neurons. These neurons hold the raw data values — they do not apply weights or activations. Think of them as sockets, not processors.

Often labelled Layer 0 and left out of the layer count, because it has no trainable parameters.

🧠

Hidden Layers

Where all the learning lives.

Each layer computes z = Wx + b (one weighted sum per neuron), then applies an activation (ReLU, Sigmoid, Tanh). "Hidden" means the outside world never directly observes these values: they are internal representations.

You can have 1 hidden layer or 1,000. The number and size define the model's representational power.

🎯

Output Layer

Task-dependent design.

Size is dictated by the task: 1 neuron → binary or regression. N neurons → N-class classification.

Activation is also task-driven: Sigmoid (binary), Softmax (multi-class), Linear (regression). This layer produces your prediction — treat its design like a contract with your loss function.
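The three output activations named above can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework implementation: the logit values and variable names are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map a vector of logits to a probability distribution."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical raw outputs (logits) of the last linear layer
binary_logit = np.array([0.0])            # 1 neuron: binary classification
class_logits = np.array([2.0, 1.0, 0.1])  # 3 neurons: 3-class classification
regression_out = np.array([37.5])         # 1 neuron: regression

p_binary = sigmoid(binary_logit)    # probability of the positive class
p_classes = softmax(class_logits)   # one probability per class, summing to 1
y_hat = regression_out              # linear (identity) activation: used as-is

print(p_binary, p_classes, p_classes.sum(), y_hat)
```

A logit of 0 maps to a probability of 0.5 under Sigmoid, and the largest logit always receives the largest Softmax probability.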

💡 Data Flow — Forward Pass at a Glance

Input

Raw feature vector x enters. Shape: (batch_size, n_features). No weights applied.

Hidden 1

Compute h₁ = ReLU(W₁x + b₁). Each neuron detects a low-level pattern.

Hidden 2

Compute h₂ = ReLU(W₂h₁ + b₂). Combinations of patterns become concepts.

Output

Compute ŷ = Softmax(W₃h₂ + b₃). Concepts become a probability distribution.
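The four steps above can be sketched in NumPy for a 3 → 4 → 4 → 2 network. The random weights, the seed and the batch of five samples are assumptions for illustration only. Because the input has shape (batch_size, n_features), the per-sample formula Wx + b becomes x @ W.T + b in batched form.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# 3 -> 4 -> 4 -> 2 network; each W has shape (n_out, n_in), as in z = Wx + b
sizes = [3, 4, 4, 2]
weights = [rng.normal(0.0, 0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    """x has shape (batch_size, n_features); returns class probabilities."""
    h = x                                    # input layer: no weights applied
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W.T + b)                # hidden layers: ReLU(Wx + b), batched
    logits = h @ weights[-1].T + biases[-1]  # output layer, before Softmax
    return softmax(logits)

x = rng.normal(size=(5, 3))   # batch of 5 samples, 3 features each
y_hat = forward(x)
print(y_hat.shape)            # (5, 2)
print(y_hat.sum(axis=1))      # every row sums to 1 (up to rounding)
```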

Section 03

3D Architecture Diagram

The diagram below shows a 3 → 4 → 4 → 2 MLP: three input neurons, two hidden layers of four neurons each, and two output neurons. Every connection carries a trainable weight. Depth gives the network its hierarchical power.

🌐 MLP Architecture — 3D Perspective View (3 → 4 → 4 → 2)

ⓘ Each node in one layer connects to every node in the next — hence "fully connected." Weights on those connections are what the network learns.

Depth is not just about adding more neurons — it's about adding more representational stages. Here's how the same neuron budget changes in power depending on how it's arranged:

📈 Shallow vs Deep — Same Neuron Budget, Different Power

ⓘ The deep network uses the same neuron budget but arranges it across 3 hidden layers, which lets it learn compositional, hierarchical patterns that the shallow version would need far more neurons to capture.

Section 04

Hidden Layers & Depth — What Each Layer Learns

📖 Story

The Detective and the Chain of Evidence

A detective doesn't jump straight from "raw crime scene" to "arrested suspect." They build a chain: clues → patterns → motives → conclusion. Each step builds on the last. Strip one step out and you cannot make the logical leap.

Deep networks learn exactly this way. Layer 1 sees edges. Layer 2 sees corners and curves (combinations of edges). Layer 3 sees object parts. Layer 4 sees faces. Each layer is one step in the detective's chain — and you cannot skip the middle steps.

Depth is the primary lever for expressiveness in a neural network. Here is what that means in practice across common problem types:

🏫

Tabular / Structured Data

1–3 hidden layers almost always sufficient. Width (neurons per layer) matters more than depth here. Start with 2 layers, 64–256 neurons each.

typical: 2 hidden layers

🖼️

Image Classification (MLP)

3–5 hidden layers to handle spatial composition. Modern practice replaces pure MLPs with CNNs here, but understanding this is the foundation.

typical: 3–5 hidden layers

🗣️

NLP Embeddings / Classification

2–4 hidden layers after embedding. More layers help capture semantic depth — relationships between concepts rather than surface patterns.

typical: 2–4 hidden layers

⚠️

The Vanishing Gradient Problem — Why Depth Has a Cost

Every time the gradient travels backward through a layer, it gets multiplied by that layer's weights and activation derivative. With Sigmoid activations, that derivative is at most 0.25 (its value at z = 0), so the gradient shrinks exponentially with depth. After 10 sigmoid layers the activation derivatives alone contribute a factor of at most 0.25¹⁰ ≈ 10⁻⁶, and the first layers learn extremely slowly or stall. Modern fixes: ReLU activations, batch normalisation, residual connections.
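The arithmetic behind that shrinkage is easy to check. The sketch below looks only at the activation-derivative factor and deliberately ignores the weights, so it is a best-case bound rather than a simulation of real training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)   # the derivative is largest at z = 0
print(peak)                # 0.25

# Upper bound on the product of activation derivatives across `depth` sigmoid layers
shrink = {depth: peak ** depth for depth in (1, 5, 10, 20)}
for depth, factor in shrink.items():
    print(f"{depth:2d} layers: gradient scaled by at most {factor:.1e}")

# For comparison, ReLU's derivative is exactly 1 for positive inputs,
# so the same product over active units stays at 1.
```

Away from z = 0 the sigmoid derivative is smaller still, so real networks usually fare worse than this bound.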

Section 05

Universal Approximation Theorem

📖 Story

The Sculptor With Infinite Clay

Imagine you are given a lump of clay and told to sculpt any shape in the world — a face, a mountain range, a spiral staircase. The only constraint: you must build it by stacking and smearing flat slabs. Can you do it?

Yes — if you have enough slabs and enough time. Any curved surface can be approximated by enough flat pieces placed at the right angles. The approximation gets better as you add more slabs.

The UAT says the same thing about neural networks. A single hidden layer, wide enough, can approximate any continuous function on a bounded domain. The neurons are your flat slabs; the weights are the angles; activation functions are the bends.

📚

The Theorem — Hornik, 1991

A feedforward network with a single hidden layer containing a finite number of neurons and a continuous, non-constant, bounded, monotonically increasing activation function can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision.

✅

What It Guarantees

An approximating network exists: with enough hidden neurons, some setting of the weights comes within any error tolerance you choose. It says nothing about how to find those weights via gradient descent, or how long training will take.

❌

What It Does NOT Say

It does not guarantee that gradient descent will converge to it, that it will generalise to unseen data, or that the required number of neurons is computationally feasible. "Can" ≠ "will in practice."

🔍

Why Depth Helps Anyway

The theorem holds for shallow networks, but for many function families a deep network can reach the same accuracy with far fewer neurons, in some proven cases exponentially fewer. Depth trades neuron count for layer count, often an enormous practical advantage.

🔑

The Practitioner Takeaway

The UAT is a theoretical existence proof, not a design recipe. It tells you that the search space is rich enough — your model architecture is not the bottleneck. In practice, the bottleneck is data quality, training stability, and generalisation. Don't use the UAT to justify over-engineering your architecture.
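The "wider means a better approximation" claim can be probed numerically. The sketch below is an assumption-laden shortcut, not how MLPs are normally trained: the hidden layer uses random, untrained tanh units, and only the output weights are fitted by least squares. The target sin(2x), the seed and the widths are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact interval
x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
y = np.sin(2.0 * x).ravel()

# One hidden layer of tanh units with random (untrained) weights and biases
max_width = 200
W = rng.normal(0.0, 2.0, size=(1, max_width))
b = rng.uniform(-np.pi, np.pi, size=max_width)
H_full = np.tanh(x @ W + b)            # hidden activations, shape (400, max_width)

def fit_error(width):
    """Fit only the output layer by least squares, using the first `width` units."""
    ones = np.ones((len(x), 1))        # extra column acts as the output bias
    H = np.hstack([H_full[:, :width], ones])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.sqrt(np.mean((H @ coef - y) ** 2)))   # RMSE on the grid

errors = {width: fit_error(width) for width in (2, 10, 50, 200)}
for width, err in errors.items():
    print(f"width={width:4d}  RMSE={err:.2e}")
```

The error shrinks as the hidden layer widens, which is the existence statement in miniature. Nothing here shows that gradient descent would find such weights or that the fit generalises beyond the grid.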

Section 06

Parameter Counting — The Exact Formula

Understanding parameter count is non-negotiable. It tells you memory footprint, training compute, overfitting risk, and whether your model can even fit on the hardware. The formula is always the same: weights + biases, layer by layer.

Weights per Layer

n_in × n_out

Every neuron in the current layer connects to every neuron in the previous layer. That's n_in × n_out weight scalars.

Biases per Layer

n_out

One bias per output neuron in the layer. Biases shift the activation independently of the input.

Params per Layer

(n_in × n_out) + n_out

Weights plus biases. Equivalent to n_out × (n_in + 1), which is why adding one neuron to the previous layer adds n_out new weights to this layer, not one.

Total Model Params

Σ [(nᵢ × nᵢ₊₁) + nᵢ₊₁]

Sum across all layers (input→H1, H1→H2, …, Hₙ→output). Count each transition once.

🔢 Worked Example — Network: 3 → 4 → 4 → 2

Layer 1

Input(3) → Hidden1(4) | (3 × 4) + 4 = 16 parameters (12 weights + 4 biases)

Layer 2

Hidden1(4) → Hidden2(4) | (4 × 4) + 4 = 20 parameters (16 weights + 4 biases)

Layer 3

Hidden2(4) → Output(2) | (4 × 2) + 2 = 10 parameters (8 weights + 2 biases)

TOTAL

16 + 20 + 10 = 46 trainable parameters
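The same count can be scripted. The helper name count_mlp_params is made up for this sketch; it applies the per-layer formula (n_in × n_out) + n_out to each transition and sums the results.

```python
def count_mlp_params(layer_sizes):
    """Return (per_layer, total) parameter counts for a fully connected MLP.

    layer_sizes lists the neurons per layer, input first, e.g. [3, 4, 4, 2].
    """
    per_layer = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out   # one weight per connection
        biases = n_out           # one bias per neuron in the receiving layer
        per_layer.append(weights + biases)
    return per_layer, sum(per_layer)

per_layer, total = count_mlp_params([3, 4, 4, 2])
print(per_layer, total)   # [16, 20, 10] 46
```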

Architecture	Total Parameters	Regime	Practical Context
784 → 128 → 64 → 10	109,386	Small	MNIST digit classifier
784 → 512 → 512 → 256 → 10	798,474	Medium	Complex image tasks, tabular data
784 → 2048 → 2048 → 1024 → 10	7,912,458	Large	Needs dropout + regularisation to avoid overfitting

⚠️

The Width Trap

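Weights between two layers scale as n_in × n_out, so doubling the width of two adjacent hidden layers quadruples the weights between them. Layers attached to the fixed-size input or output only double, so the whole network grows by somewhere between 2× and 4×. Before adding width, always ask: do I need more representational power, or do I need more data and better regularisation?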
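How close a network gets to the 4× figure depends on how much of its budget sits between hidden layers. The sketch below doubles the hidden widths of two architectures; the second one (20 → 512 → 512 → 512 → 2) is a hypothetical tabular-style network chosen to contrast with the input-heavy MNIST shape.

```python
def total_params(layer_sizes):
    """Weights plus biases, summed over every layer transition."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def doubled(layer_sizes):
    """Double every hidden width; input and output sizes stay fixed."""
    return [layer_sizes[0]] + [2 * h for h in layer_sizes[1:-1]] + [layer_sizes[-1]]

ratios = []
for sizes in ([784, 128, 64, 10], [20, 512, 512, 512, 2]):
    before = total_params(sizes)
    after = total_params(doubled(sizes))
    ratios.append(after / before)
    print(sizes, before, after, round(after / before, 2))
# [784, 128, 64, 10]      109386  235146   2.15
# [20, 512, 512, 512, 2]  537090  2122754  3.95
```

The MNIST-shaped network is dominated by its 784-wide input layer, so doubling the hidden widths only slightly more than doubles its size. The network dominated by hidden-to-hidden layers lands close to 4×.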

Section 07

Python Code — MLP from Scratch to Training

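Two implementations: scikit-learn (fast baseline) and PyTorch (transparent, production-ready). Both build the same 784 → 256 → 128 → 10 architecture; the PyTorch version adds BatchNorm and Dropout, so expect similar accuracy rather than identical numbers.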

🐍 scikit-learn — Quick Baseline (MNIST subset)

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load MNIST (70,000 samples, 784 features)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MLPs need feature scaling: standardise each pixel to zero mean, unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Architecture: 784 → 256 → 128 → 10
mlp = MLPClassifier(
    hidden_layer_sizes=(256, 128),   # 2 hidden layers
    activation='relu',              # ReLU on every hidden neuron
    solver='adam',                  # adaptive moment optimiser
    learning_rate_init=1e-3,
    max_iter=30,
    batch_size=256,
    random_state=42,
    verbose=True
)

mlp.fit(X_train, y_train)
preds = mlp.predict(X_test)

print(f"Test Accuracy : {accuracy_score(y_test, preds):.4f}")
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
print(f"Total Params  : {n_params:,}")

OUTPUT

Iteration 1, loss = 0.3812
...
Iteration 30, loss = 0.0241
Test Accuracy : 0.9764
Total Params  : 235,146   ← (784×256+256) + (256×128+128) + (128×10+10)

🔥 PyTorch — Transparent MLP with Parameter Inspector

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# ── Define the MLP ───────────────────────────────────────
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            layers.append(nn.Linear(prev, h))
            layers.append(nn.ReLU())
            layers.append(nn.BatchNorm1d(h))    # stabilises deep MLPs
            layers.append(nn.Dropout(0.3))       # regularisation
            prev = h
        layers.append(nn.Linear(prev, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# ── Instantiate: 784 → 256 → 128 → 10 ───────────────────
model = MLP(input_dim=784, hidden_dims=[256, 128], output_dim=10)

# ── Count parameters ─────────────────────────────────────
total_params    = sum(p.numel() for p in model.parameters())
trainable       = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params    : {total_params:,}")
print(f"Trainable params: {trainable:,}")

# ── Layer-by-layer breakdown ──────────────────────────────
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name:30s} {str(param.shape):20s} {param.numel():>8,} params")

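# ── Data: reuse the scaled NumPy arrays from the scikit-learn block ──
# fetch_openml returns the labels as strings, hence the int64 cast
X_tensor = torch.tensor(X_train, dtype=torch.float32)
y_tensor = torch.tensor(y_train.astype("int64"))
train_loader = DataLoader(TensorDataset(X_tensor, y_tensor), batch_size=256, shuffle=True)

# ── Training loop (abbreviated) ──────────────────────────
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()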

for epoch in range(1, 6):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch} | Loss: {running_loss/len(train_loader):.4f}")

OUTPUT

Total params    : 235,914
Trainable params: 235,914
  net.0.weight    torch.Size([256, 784])    200,704 params
  net.0.bias      torch.Size([256])             256 params
  net.2.weight    torch.Size([256])             256 params   ← BatchNorm γ
  net.2.bias      torch.Size([256])             256 params   ← BatchNorm β
  net.4.weight    torch.Size([128, 256])     32,768 params
  net.4.bias      torch.Size([128])             128 params
  net.6.weight    torch.Size([128])             128 params
  net.6.bias      torch.Size([128])             128 params
  net.8.weight    torch.Size([10, 128])       1,280 params
  net.8.bias      torch.Size([10])               10 params
Epoch 1 | Loss: 0.3241
Epoch 2 | Loss: 0.1487
Epoch 3 | Loss: 0.1102
Epoch 4 | Loss: 0.0893
Epoch 5 | Loss: 0.0754

💡

BatchNorm Adds Parameters Too

Notice net.2.weight and net.2.bias — those are the BatchNorm scale (γ) and shift (β) parameters. BatchNorm layers add 2 × n_neurons parameters per hidden layer. Always count them when estimating model size.
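The extra 2 × n_neurons per hidden layer can be folded into the counting formula from Section 06. The helper below is a hypothetical sketch that assumes one BatchNorm layer after every hidden layer and none on the output. The running mean and variance that PyTorch's BatchNorm1d tracks are buffers rather than parameters, so they are not counted.

```python
def mlp_params_with_batchnorm(layer_sizes):
    """Trainable parameters of an MLP with BatchNorm after every hidden layer.

    Returns (linear, batchnorm, total).
    """
    linear = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    batchnorm = sum(2 * h for h in layer_sizes[1:-1])   # gamma and beta per hidden neuron
    return linear, batchnorm, linear + batchnorm

print(mlp_params_with_batchnorm([3, 4, 4, 2]))   # (46, 16, 62)
```

For the small 3 → 4 → 4 → 2 network, BatchNorm raises the count from 46 to 62. On wide layers the overhead is proportionally much smaller, because the weights grow with n_in × n_out while BatchNorm grows only with n_out.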

Section 08

Golden Rules

⚡ MLP — Non-Negotiable Design Rules

Always scale your inputs. Unlike tree models, MLPs are extremely sensitive to feature magnitude. Use StandardScaler (zero mean, unit variance) as the default. Unscaled inputs → some weights explode while others stay near zero → gradient chaos from epoch one.

Default to ReLU. Sigmoid and Tanh saturate: once the output sits close to its limit (0 or 1 for Sigmoid, ±1 for Tanh), the gradient is nearly zero. ReLU does not saturate on the positive side, so gradients flow cleanly in the early epochs. Only switch to something else (GELU, SiLU) if you have a specific architectural reason.

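Count your parameters before training. Run sum(p.numel() for p in model.parameters()) before the first epoch. If you have 10,000 samples and 10,000,000 parameters, the model can memorise the training set outright. A conservative rule of thumb from classical statistics is 10–20 training samples per parameter; networks routinely work with far fewer (the MNIST models in Section 07 have more parameters than training samples), but only with regularisation and a validation set you actually watch.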

Start shallow, go wide, then go deep. The order matters. A single wider hidden layer is much easier to debug than 10 thin layers. Add depth only after confirming that width alone cannot capture the function complexity you need.

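Use BatchNorm for networks deeper than 3 hidden layers. It normalises the input to each layer, which was originally motivated as a fix for internal covariate shift and in practice stabilises and accelerates convergence. The original placement is after the linear transformation and before the activation (Linear → BN → ReLU); the post-activation order used in the Section 07 code (Linear → ReLU → BN) is also common.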

Match your output layer to your task exactly. Binary classification → 1 neuron + Sigmoid + BCELoss (or 1 neuron + no activation + BCEWithLogitsLoss, which is numerically more stable). Multi-class → N neurons + no activation + CrossEntropyLoss (PyTorch applies log-softmax inside the loss, so the model must output raw logits). Regression → 1 neuron + no activation + MSELoss. A mismatch raises no error: the model trains, but on the wrong objective.

The Universal Approximation Theorem is not a design licence. It proves a solution exists — not that you will find it, not that it will generalise, and not that you need the biggest network possible. Occam's razor still applies: the simplest model that achieves your validation target is the correct model.
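The output-layer rule is the easiest one to break without noticing, so here is a NumPy sketch of what goes wrong. It re-implements cross-entropy from logits (log-softmax followed by negative log-likelihood, the combination that PyTorch's CrossEntropyLoss computes) and then feeds it already-softmaxed outputs. The logits and targets are invented for the example.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_from_logits(logits, targets):
    """Mean negative log-likelihood; expects raw logits, not probabilities."""
    log_probs = log_softmax(logits)
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Hypothetical outputs of the final linear layer for two samples, three classes
logits = np.array([[4.0, 0.5, -1.0],
                   [0.2, 3.0, 0.1]])
targets = np.array([0, 1])   # index of the correct class for each sample

correct = cross_entropy_from_logits(logits, targets)

# Mistake: Softmax applied in the model AND again inside the loss
probs = np.exp(log_softmax(logits))
double_softmax = cross_entropy_from_logits(probs, targets)

print(f"logits in : {correct:.4f}")
print(f"probs in  : {double_softmax:.4f}")
```

Both calls run without error, yet the second reports a much larger loss for the same confident, correct predictions. With three classes the double-softmax loss can never drop below about 0.55, however good the model becomes, which is why this mismatch slows training silently instead of crashing.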