The Story That Explains an MLP
The middle floors are where the real thinking happens. Workers there can only talk to the floor directly below and above. Each worker combines what they hear, applies their own twist, and forwards a refined signal upward. By Floor 3 the signal is no longer "pixel brightness"; it has become "edge", then "curve", then "nose shape".
The top floor has just a few people, one per possible answer. They listen to the refined signals from below and produce the final verdict: cat or dog, buy or sell, spam or not spam.
That building is a Multilayer Perceptron. The floors are layers. The workers are neurons. The twist each worker applies is an activation function.
A Multilayer Perceptron (MLP) is the simplest fully connected feedforward neural network. Every neuron in one layer connects to every neuron in the next: no skips, no loops, no convolutions. Data flows in one direction: left to right, input to output. Despite that simplicity, a sufficiently wide MLP can approximate any continuous function on a bounded domain, a fact so powerful it earned its own theorem.
CNNs, RNNs, Transformers: they all contain MLP-style fully connected layers inside them. Master the MLP and you understand a building block that nearly every modern architecture is assembled from. It is the hydrogen atom of deep learning.
Input, Hidden & Output Layers: The Three Tribes
Every MLP has exactly three kinds of layer. They have fundamentally different jobs, and confusing them is the fastest way to make wrong architectural decisions.
Input layer: Counted as Layer 0 in many frameworks. Not a "real" layer because it has no trainable parameters.
Hidden layers: You can have 1 hidden layer or 1,000. The number and size define the model's representational power.
Output layer: Activation is also task-driven: Sigmoid (binary), Softmax (multi-class), Linear (regression). This layer produces your prediction, so treat its design like a contract with your loss function.
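The three output activations named above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a framework implementation: the function names and the sample logits are assumptions chosen for the example. The usual pairings with a loss are sigmoid with binary cross-entropy, softmax with categorical cross-entropy, and a linear output with mean squared error.

```python
import math

def sigmoid(z):
    """Squash one logit into a probability in (0, 1) for binary tasks."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a list of logits into class probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def identity(z):
    """Linear output: pass the raw value through unchanged (regression)."""
    return z

binary_prob = sigmoid(0.0)              # a logit of 0 means "undecided"
class_probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
regression_out = identity(3.7)

print(binary_prob)
print(class_probs, sum(class_probs))
print(regression_out)
```

Note that some loss functions expect raw logits rather than probabilities. PyTorch's `nn.CrossEntropyLoss`, for example, applies the softmax internally, so the model's last layer stays linear.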
3D Architecture Diagram
The diagram below shows a 3 → 4 → 4 → 2 MLP: three input neurons, two hidden layers of four neurons each, and two output neurons. Every connection carries a trainable weight. Depth gives the network its hierarchical power.
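To make the 3 → 4 → 4 → 2 layout concrete, here is a minimal NumPy sketch of a forward pass through it. The random initialisation, the choice of ReLU on the hidden layers, and the sample input are assumptions made for illustration; nothing is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 4, 2]  # the 3 -> 4 -> 4 -> 2 network described above

# One weight matrix and one bias vector per pair of adjacent layers.
weights = [rng.normal(0.0, 0.5, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x, weights, biases):
    """Run one input vector through the network; ReLU on hidden layers only."""
    a = x
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b                                # weighted sum plus bias
        a = z if i == last else np.maximum(z, 0.0)   # output layer stays linear
    return a

x = np.array([0.2, -1.0, 0.5])            # hypothetical 3-feature input
out = forward(x, weights, biases)
n_weights = sum(W.size for W in weights)  # 3*4 + 4*4 + 4*2 connections

print(out.shape, n_weights)
```

Each weight matrix has shape (neurons in, neurons out), which is why "every connection carries a trainable weight" translates into 12 + 16 + 8 = 36 weights for this network.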
Depth is not just about adding more neurons; it's about adding more representational stages. Here's how the same neuron budget changes in power depending on how it's arranged:
Hidden Layers & Depth: What Each Layer Learns
Deep networks learn exactly this way. Layer 1 sees edges. Layer 2 sees corners and curves (combinations of edges). Layer 3 sees object parts. Layer 4 sees faces. Each layer is one step in the detective's chain, and you cannot skip the middle steps.
Depth is the primary lever for expressiveness in a neural network. Here is what that means in practice across common problem types:
Every time the gradient travels backward through a layer, it gets multiplied by that layer's weights and activation derivative. With Sigmoid activations, those derivatives are at most 0.25, so the gradient can shrink exponentially with depth. After 10 sigmoid layers, the activation derivatives alone scale the gradient by at most 0.25¹⁰ ≈ 10⁻⁶ of what it started as. The first layers all but stop learning. Modern fixes: ReLU activations, batch normalisation, residual connections.
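The arithmetic behind that shrinkage is easy to check. The sketch below looks only at the sigmoid derivative and assumes the weight factors are around 1, which is a simplification; real networks also multiply by the weights at every layer.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the sigmoid, s(z) * (1 - s(z)); it peaks when z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

max_derivative = sigmoid_derivative(0.0)  # the largest value the derivative can take

# Best-case scaling of the gradient from the activation derivatives alone.
for depth in (1, 5, 10, 20):
    factor = max_derivative ** depth
    print(f"{depth:>2} sigmoid layers: gradient scaled by at most {factor:.2e}")
```

Because 0.25 is the best case, gradients in a real sigmoid network usually shrink faster than this bound suggests.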
Universal Approximation Theorem
Yes, if you have enough slabs and enough time. Any curved surface can be approximated by enough flat pieces placed at the right angles. The approximation gets better as you add more slabs.
The UAT says the same thing about neural networks. A single hidden layer, wide enough, can approximate any continuous function on a bounded domain. The neurons are your flat slabs; the weights are the angles; activation functions are the bends.
A feedforward network with a single hidden layer containing a finite number of neurons and a non-constant, bounded, monotonically increasing, continuous activation function can approximate any continuous function on a compact subset of ℝⁿ to arbitrary precision.
The UAT is a theoretical existence proof, not a design recipe. It tells you that the search space is rich enough: your model architecture is not the bottleneck. In practice, the bottleneck is data quality, training stability, and generalisation. Don't use the UAT to justify over-engineering your architecture.
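The "more slabs, better fit" intuition can be illustrated numerically. The sketch below is not a trained MLP and not a proof. It assumes random, frozen hidden weights, uses ReLU units as the flat slabs (the formal statement above covers bounded activations), and fits only the output weights by least squares to sin(x) on a bounded interval. All names and constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a smooth curve on a bounded interval.
x = np.linspace(-np.pi, np.pi, 400)
y = np.sin(x)

# Draw the hidden weights once, so a narrower network reuses a prefix
# of the wider network's neurons.
max_width = 64
w = rng.normal(0.0, 2.0, size=max_width)
b = rng.uniform(-np.pi, np.pi, size=max_width)

def fit_rmse(width):
    """Fit only the output weights by least squares; return the RMS error."""
    hidden = np.maximum(np.outer(x, w[:width]) + b[:width], 0.0)  # ReLU "slabs"
    design = np.column_stack([hidden, np.ones_like(x)])           # add an output bias
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sqrt(np.mean((design @ coef - y) ** 2)))

errors = {width: fit_rmse(width) for width in (1, 4, 16, 64)}
for width, err in errors.items():
    print(f"hidden width {width:>2}: RMS error {err:.4f}")
```

Widening the hidden layer gives the least-squares fit more pieces to work with, so the error on these sample points does not get worse as width grows. That matches the theorem's message that capacity exists, while saying nothing about how easy a real network is to train or how well it generalises.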
Parameter Counting: The Exact Formula
Understanding parameter count is non-negotiable. It tells you memory footprint, training compute, overfitting risk, and whether your model can even fit on the hardware. The formula is always the same: weights + biases, layer by layer.
| Architecture | Total Parameters | Regime | Practical Context |
|---|---|---|---|
| 784 → 128 → 64 → 10 | 109,386 | Small | MNIST digit classifier |
| 784 → 512 → 512 → 256 → 10 | 798,474 | Medium | Complex image tasks, tabular data |
| 784 → 2048 → 2048 → 1024 → 10 | 7,912,458 | Large | Needs dropout + regularisation to not overfit |
Parameters scale as n_in × n_out, that is, quadratically with layer width. Doubling every hidden layer width roughly quadruples the parameters. Before adding width, always ask: do I need more representational power, or do I need more data and better regularisation?
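The weights-plus-biases rule can be written as a small helper: each pair of adjacent layers contributes n_in × n_out weights and n_out biases. The function name below is an assumption for illustration; it counts plain fully connected layers only, with no normalisation layers.

```python
def count_mlp_params(layer_sizes):
    """Return (weights, biases) for a fully connected net with the given layer widths."""
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    weights = sum(n_in * n_out for n_in, n_out in pairs)
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights, biases

for sizes in ([784, 128, 64, 10], [784, 256, 128, 10]):
    n_weights, n_biases = count_mlp_params(sizes)
    label = " -> ".join(str(s) for s in sizes)
    print(f"{label}: {n_weights:,} weights + {n_biases:,} biases = {n_weights + n_biases:,}")
```

For 784 → 128 → 64 → 10 this gives 109,184 weights and 202 biases, which is the 109,386 total in the first row of the table.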
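Python Code: MLP from Scratch to Training
Two implementations: scikit-learn (fast baseline) and PyTorch (transparent, production-ready). Both use the same 784 → 256 → 128 → 10 architecture on the same data; the PyTorch version adds batch normalisation and dropout, so expect similar rather than identical accuracy.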
```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load MNIST (70,000 samples, 784 features)
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MLPs need feature scaling: standardise each pixel to zero mean, unit variance.
# Fit on the training split only, then reuse those statistics on the test split.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Architecture: 784 → 256 → 128 → 10
mlp = MLPClassifier(
    hidden_layer_sizes=(256, 128),  # 2 hidden layers
    activation='relu',              # ReLU on every hidden neuron
    solver='adam',                  # adaptive moment optimiser
    learning_rate_init=1e-3,
    max_iter=30,                    # number of epochs for the adam solver
    batch_size=256,
    random_state=42,
    verbose=True
)
mlp.fit(X_train, y_train)

preds = mlp.predict(X_test)
print(f"Test Accuracy : {accuracy_score(y_test, preds):.4f}")
print(f"Total Params  : {sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)}")
```
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# --- Define the MLP ---
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            layers.append(nn.Linear(prev, h))
            layers.append(nn.ReLU())
            layers.append(nn.BatchNorm1d(h))  # stabilises deep MLPs
            layers.append(nn.Dropout(0.3))    # regularisation
            prev = h
        layers.append(nn.Linear(prev, output_dim))  # raw logits, no activation
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# --- Instantiate: 784 → 256 → 128 → 10 ---
model = MLP(input_dim=784, hidden_dims=[256, 128], output_dim=10)

# --- Count parameters ---
total_params = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params    : {total_params:,}")
print(f"Trainable params: {trainable:,}")

# --- Layer-by-layer breakdown ---
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name:30s} {str(param.shape):20s} {param.numel():>8,} params")

# --- Data loader ---
# Assumes X_train / y_train from the scikit-learn example above
# (scaled float features; labels stored as the strings '0'..'9').
X_tensor = torch.tensor(X_train, dtype=torch.float32)
y_tensor = torch.tensor(y_train.astype("int64"))
train_loader = DataLoader(TensorDataset(X_tensor, y_tensor), batch_size=256, shuffle=True)

# --- Training loop (abbreviated) ---
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class labels

for epoch in range(1, 6):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch} | Loss: {running_loss/len(train_loader):.4f}")
```
Notice net.2.weight and net.2.bias (and net.6.weight and net.6.bias for the second hidden layer): those are the BatchNorm scale (γ) and shift (β) parameters. BatchNorm layers add 2 × n_neurons parameters per hidden layer. Always count them when estimating model size.
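As a cross-check on that breakdown, the expected totals can be worked out by hand. The helper below is a hypothetical sketch that assumes one BatchNorm per hidden layer, as in the model above, and counts only trainable parameters. ReLU and Dropout contribute none, and BatchNorm's running mean and variance are buffers rather than parameters.

```python
def count_with_batchnorm(input_dim, hidden_dims, output_dim):
    """Trainable parameters: Linear layers plus one BatchNorm per hidden layer."""
    sizes = [input_dim, *hidden_dims, output_dim]
    linear = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    batchnorm = sum(2 * h for h in hidden_dims)  # scale and shift per hidden neuron
    return linear, batchnorm

linear, batchnorm = count_with_batchnorm(784, [256, 128], 10)
print(f"Linear: {linear:,}  BatchNorm: {batchnorm:,}  Total: {linear + batchnorm:,}")
```

For 784 → 256 → 128 → 10 the BatchNorm layers add 768 parameters on top of the 235,146 in the Linear layers: a small share here, but worth including when estimating model size.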
Golden Rules
Use StandardScaler (zero mean, unit variance) as the default. With unscaled inputs, some weights explode while others stay near zero, and the result is gradient chaos from epoch one.
Run sum(p.numel() for p in model.parameters()) before the first epoch. If you have 10,000 samples and 10,000,000 parameters, the model will most likely memorise the training set. The rule of thumb: aim for at least 10 to 20 training samples per parameter.
A common ordering for each hidden block is Linear → BN → ReLU. The PyTorch example above applies BatchNorm after ReLU instead; both orderings appear in practice.