Why Neurons Need to Be Non-Linear
The fix? Introduce a small non-linear "kink" at every neuron: an activation function. Suddenly your stack of glass becomes a prism, then a kaleidoscope. Each layer now transforms space in ways the next layer can exploit. That is what makes deep networks deep.
An activation function takes a neuron's raw weighted sum z = Wx + b
and applies a non-linear squeeze, shift, or gate before passing it forward.
Without this step, any stack of layers collapses into a single linear map, and depth becomes meaningless.
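To make the collapse concrete, here is a minimal NumPy sketch. The weights, biases, and input are small hand-picked values chosen only for illustration (they are not from the original article). Two linear layers with no activation between them equal one linear layer with merged weights; putting a ReLU between them breaks that equivalence.

```python
import numpy as np

# Hand-picked illustrative values: a 2 -> 2 -> 1 network
W1 = np.array([[1.0, -1.0],
               [2.0,  0.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 2.0])

# Two stacked linear layers, no activation in between
stacked = W2 @ (W1 @ x + b1) + b2

# The same function written as ONE linear layer
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
single = W_merged @ x + b_merged

# Now insert a ReLU between the two layers
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print("stacked  :", stacked)    # [0.5]
print("single   :", single)     # [0.5]
print("with ReLU:", with_relu)  # [1.5]
```

The merged weights `W2 @ W1` and bias `W2 @ b1 + b2` reproduce the two-layer output exactly, so the second layer added no expressive power until the non-linearity was introduced.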
A good activation function must be differentiable almost everywhere so gradients can flow backwards during training. The shape of that derivative (whether it saturates, clips, or stays linear) largely determines how well your network learns.
Sigmoid & tanh: The Saturation Trap
Yellow dashed = derivative (gradient). Both functions saturate at the extremes: gradient → 0 ⟹ vanishing gradient problem.
When gradients pass through many saturated sigmoid/tanh neurons, they shrink roughly exponentially with depth. By the time they reach the first layers of a deep network, they're essentially zero: those weights barely update, and the early layers barely learn. This is a large part of why very deep networks were so hard to train before the early 2010s.
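A rough back-of-the-envelope sketch of that shrinkage: the sigmoid derivative is σ'(z) = σ(z)(1 − σ(z)), which peaks at 0.25 when z = 0. Ignoring the weight matrices entirely (a simplifying assumption made only for this illustration), the gradient passing through n sigmoid layers is scaled by at most 0.25 to the power n.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1 - s)

max_grad = sigmoid_grad(0.0)        # best case: exactly 0.25 at z = 0
saturated_grad = sigmoid_grad(5.0)  # a saturated neuron: well below 0.01

# Best-case scaling factor after passing through `depth` sigmoid layers
scale = {depth: max_grad ** depth for depth in (5, 10, 20)}

print("max sigmoid gradient:", max_grad)
print("gradient at z = 5   :", saturated_grad)
for depth, factor in scale.items():
    print(f"{depth:2d} layers -> factor {factor:.2e}")
```

Even in the best case the factor after 20 layers is below 1e-12, and saturated neurons make it far smaller.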
ReLU: The Revolution & Its Fatal Flaw
This simplicity is ReLU's superpower: the gradient for positive activations is exactly 1, so the activation itself never shrinks the signal, even through 100 layers. Training became up to 6× faster in early comparisons against tanh. Deep networks finally worked. The 2010s deep learning renaissance was largely built on this one dumb function.
But the bouncer has a problem: once a neuron's input goes negative and stays negative, it gets frozen at zero permanently. The gradient is zero, no update happens, it's dead forever. This is the dying neuron problem.
Red dashed = zero-gradient (dead) zone. Positive z flows freely with gradient = 1. A high learning rate can kill entire layers.
If a large gradient update pushes a neuron's weights so that z is
always negative for every training sample, that neuron outputs zero and receives
zero gradient from then on. It's dead, contributing nothing.
With aggressive learning rates or poor initialisation, as much as 40% of a ReLU
network can die before training is complete.
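A minimal sketch of a dead neuron in NumPy. The inputs, weights, and the strongly negative bias are hypothetical values standing in for the result of one bad update; the upstream gradient is set to 1 for every sample purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # hypothetical training inputs

w = np.array([0.5, -0.3, 0.2])      # hypothetical weights
b = -10.0                           # bias pushed far negative by a bad update

z = X @ w + b                       # pre-activation for every sample
a = np.maximum(0, z)                # ReLU output
relu_grad = (z > 0).astype(float)   # dReLU/dz: 1 where z > 0, else 0

# Backprop through the neuron, pretending dL/da = 1 for every sample
upstream = np.ones_like(z)
grad_w = (upstream * relu_grad) @ X
grad_b = (upstream * relu_grad).sum()

print("fraction of samples with z > 0:", (z > 0).mean())
print("grad_w:", grad_w)
print("grad_b:", grad_b)
```

Because z is negative for every sample, both gradients are exactly zero, so no gradient step can move `w` or `b` back into the active region.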
Leaky ReLU, ELU & GELU: Fixing What ReLU Broke
Three activations emerged to solve the dying neuron problem, each with a
different philosophy about what happens when z < 0.
GELU dips slightly below zero, reaching a minimum of about −0.17 near z = −0.75 before rising, creating a soft gating effect. ELU saturates at −α. Leaky ReLU stays linear with a small slope.
| Activation | Dying Neurons | Zero-Centred | Smooth | Compute Cost | Best Use Case |
|---|---|---|---|---|---|
| ReLU | Yes, common | No | No (kink at 0) | Very low | CNNs, fast training baseline |
| Leaky ReLU | No | No | No | Very low | CNNs, GANs |
| ELU | No | Yes (approx.) | Yes | Medium (exp) | Deep MLPs, regression tasks |
| GELU | No | Yes (approx.) | Yes | Medium | Transformers, LLMs, BERT, GPT |
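To see how the three alternatives differ on the negative side, here is a small NumPy sketch. It uses the default parameters alpha=0.01 for Leaky ReLU and alpha=1.0 for ELU, plus the tanh approximation of GELU; the sample inputs and the search grid are arbitrary choices for illustration.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# A few negative inputs, chosen arbitrarily
z = np.array([-5.0, -2.0, -0.75, -0.1])
for name, fn in [("Leaky ReLU", leaky_relu), ("ELU", elu), ("GELU", gelu)]:
    print(f"{name:10s}", np.round(fn(z), 4))

# Locate GELU's minimum on a fine grid of negative inputs
grid = np.linspace(-5.0, 0.0, 5001)
vals = gelu(grid)
z_min = grid[np.argmin(vals)]
g_min = vals.min()
print(f"GELU minimum of about {g_min:.3f} near z = {z_min:.2f}")
```

Leaky ReLU keeps falling linearly, ELU flattens out just above −alpha, and GELU bottoms out at roughly −0.17 before returning towards zero for very negative inputs.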
Softmax: Output Distributions
In a neural network, the final layer produces raw scores called logits. Softmax turns these into a probability distribution over all classes. The class with the highest probability is your prediction.
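For reference, the standard softmax definition for K logits, followed by a worked example. The logits (3.2, 1.8, 0.5) are the same values used in this article's Python demo; the intermediate numbers are rounded.

```latex
\[
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
\]

\[
z = (3.2,\ 1.8,\ 0.5): \quad
e^{3.2} \approx 24.53, \quad e^{1.8} \approx 6.05, \quad e^{0.5} \approx 1.65, \quad
\sum_{j} e^{z_j} \approx 32.23
\]

\[
\mathrm{softmax}(z) \approx (0.761,\ 0.188,\ 0.051)
\]
```

The outputs are positive and sum to 1, and the largest logit takes the largest share.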
Softmax amplifies differences: a logit advantage of 1.4 can translate into a probability gap of more than 50 percentage points. As an activation it belongs in the final classification layer, not in hidden layers.
Softmax is not used as a hidden-layer activation: it would force the neurons in a layer to compete in a zero-sum probability game, destroying the independent signal in each neuron. (Attention layers in transformers do use softmax internally, but to normalise attention weights rather than as a neuron activation.) It belongs at the final step, converting logits to a probability distribution for multi-class classification. For binary classification, a single sigmoid output is preferred.
Python: All Activations From Scratch
import numpy as np
import matplotlib.pyplot as plt

# -- Define all activation functions --
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # Approximation used in practice (Hendrycks & Gimpel 2016)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # Numerically stable version: subtract max before exp (expects a 1-D array)
    e = np.exp(z - np.max(z))
    return e / e.sum()

# -- Quick demo --
z = np.array([3.2, 1.8, 0.5])  # raw logits
print("Logits: ", z)
print("Softmax: ", np.round(softmax(z), 3))

z_range = np.linspace(-5, 5, 400)
fns = {
    "Sigmoid": sigmoid(z_range),
    "tanh": tanh(z_range),
    "ReLU": relu(z_range),
    "Leaky ReLU": leaky_relu(z_range),
    "ELU": elu(z_range),
    "GELU": gelu(z_range),
}
for name, vals in fns.items():
    print(f"{name:12s} | min={vals.min():.3f} max={vals.max():.3f}")
PyTorch: Activations in a Real Network
import torch
import torch.nn as nn

# -- Model with GELU (modern default) --
class DeepMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),                   # modern choice
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim),  # raw logits
        )

    def forward(self, x):
        return self.net(x)  # raw logits; softmax is applied inside the loss fn

# -- Comparing activations on gradient flow --
activations = {
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(0.01),
    "ELU": nn.ELU(1.0),
    "GELU": nn.GELU(),
}
x = torch.linspace(-3, 3, 10, requires_grad=True)
for name, act in activations.items():
    y = act(x).sum()
    y.backward()
    # Count inputs whose gradient is exactly zero (a stand-in for dead neurons)
    dead = (x.grad == 0).sum().item()
    print(f"{name:12s} | dead neurons: {dead}/10")
    x.grad.zero_()  # reset so gradients do not accumulate across activations

# -- Loss function: softmax is baked in here --
# Use CrossEntropyLoss: it applies LogSoftmax + NLLLoss
# NEVER add a Softmax layer before CrossEntropyLoss in PyTorch!
criterion = nn.CrossEntropyLoss()
Never add nn.Softmax() before
nn.CrossEntropyLoss() in PyTorch. The loss function already
applies log-softmax internally. Adding your own softmax first means softmax
is applied twice: the logits are squashed into [0, 1] before the loss sees them,
which flattens the predicted distribution and weakens the gradients. Output raw
logits from your final linear layer and let the loss handle the rest.
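The effect of applying softmax twice can be reproduced in plain NumPy. This sketch re-implements the log-softmax plus negative log-likelihood recipe by hand; the helper names are illustrative and are not PyTorch API, and the logits and target class are arbitrary example values.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax for a 1-D array
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def cross_entropy_from_logits(logits, target):
    # Hand-written stand-in for a loss that applies log-softmax internally
    return -log_softmax(logits)[target]

logits = np.array([3.2, 1.8, 0.5])  # example raw scores
target = 0                          # index of the true class

# Correct usage: feed raw logits to the loss
correct = cross_entropy_from_logits(logits, target)

# Mistake: apply softmax first, then feed the probabilities to the same loss
probs = np.exp(log_softmax(logits))
double = cross_entropy_from_logits(probs, target)

# Even a perfectly confident "probability" vector cannot push the loss to zero
floor = cross_entropy_from_logits(np.array([1.0, 0.0, 0.0]), target)

print(f"loss on raw logits       : {correct:.3f}")
print(f"loss after extra softmax : {double:.3f}")
print(f"best possible with extra : {floor:.3f}")
```

With the extra softmax the inputs to the loss are confined to [0, 1], so for three classes the loss can never drop below about 0.55 however confident the model becomes.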
How to Choose: The Decision Map
When in doubt: use GELU in hidden layers + nn.CrossEntropyLoss (which handles softmax) for classification. This is the setup used by transformer models such as BERT and GPT-2, and it often matches or beats ReLU in deeper networks with minimal tuning.
Golden Rules
nn.CrossEntropyLoss handles log-softmax internally. Adding a softmax layer
before it applies softmax twice, which flattens your gradients and slows or
stalls training.