Activation Functions Explained

Section 01

Why Neurons Need to Be Non-Linear

📖 The Core Problem

100 Layers of Glass — Still Just Glass

Imagine stacking 100 sheets of clear glass on top of each other. No matter how many layers you add, light still passes straight through: it's still just glass. A stack of linear layers behaves the same way, because composing linear maps always yields another linear map. No matter how deep your network, it can only represent a single linear transformation. It can never separate spirals, model speech, or understand images.

The fix? Introduce a small non-linear "kink" at every neuron — an activation function. Suddenly your stack of glass becomes a prism, then a kaleidoscope. Each layer now transforms space in ways the next layer can exploit. That is what makes deep networks deep.

An activation function takes a neuron's raw weighted sum z = Wx + b and applies a non-linear squeeze, shift, or gate before passing it forward. Without this step, every layer collapses into one — and depth becomes meaningless.
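The collapse of stacked linear layers can be checked numerically. The sketch below is illustrative only: the layer shapes, the random seed, and the names `W_merged` and `b_merged` are assumptions made for the demo, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between (shapes are arbitrary).
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)

# Forward pass through both layers.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as ONE linear layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

print(np.allclose(two_layers, one_layer))  # the two computations agree

# With a ReLU between the layers, the merged form no longer matches in general.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
```

The same algebra applies to any number of layers, which is why depth without an activation adds no expressive power.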

🔑

The One Rule

A good activation function must be differentiable almost everywhere so gradients can flow backwards during training. The shape of that derivative — whether it saturates, clips, or stays linear — determines everything about how well your network learns.

📊 Linear vs Non-Linear — Decision Space

Section 02

Sigmoid & tanh — The Saturation Trap

📖 Analogy

The Volume Knob That Gets Stuck

Imagine a volume knob that only goes from 0 to 1. At the extremes it's glued to the floor or ceiling — no matter how hard you turn it, nothing changes. That is exactly what happens when a sigmoid neuron receives very large or very small inputs. Its output flatlines, its gradient vanishes to near zero, and the neuron stops learning entirely. In a 50-layer network, this silence echoes backward — nothing moves.

📈 Sigmoid σ(z) and tanh(z) — Curves & Gradients

Yellow dashed = derivative (gradient). Both functions saturate at extremes — gradient ≈ 0 ⟹ vanishing gradient problem.

Sigmoid Output Range

σ(z) ∈ (0, 1)

Always positive output → outputs are not zero-centred → zig-zag gradient updates

tanh Output Range

tanh(z) ∈ (−1, +1)

Zero-centred → better gradient flow than sigmoid, but still saturates

Sigmoid Max Gradient

σ′(z) ≤ 0.25

Each sigmoid layer scales the gradient by at most 0.25 (a cut of ≥ 75%, ignoring the weights). Across 10 layers that is a factor of at most 0.25¹⁰ ≈ 10⁻⁶.

tanh Max Gradient

tanh′(z) ≤ 1.0

Better than sigmoid but still collapses at large |z| values
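The two gradient bounds above follow from the closed-form derivatives σ′(z) = σ(z)(1 − σ(z)) and tanh′(z) = 1 − tanh²(z). Here is a minimal sketch that evaluates both at a few points; the helper names `sigmoid_grad` and `tanh_grad` are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at z = 0 with value 0.25

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # peaks at z = 0 with value 1.0

z = np.array([0.0, 2.0, 5.0, 10.0])
print("sigmoid'(z):", sigmoid_grad(z))
print("tanh'(z):   ", tanh_grad(z))
```

Both derivatives are largest at z = 0 and fall toward zero as |z| grows, which is the saturation the curves show.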

⚠️

The Vanishing Gradient Problem

When gradients pass through many saturated sigmoid/tanh neurons, they shrink exponentially. By the time they reach the first layers of a deep network, they're essentially zero: those weights barely update, and the network stops learning. This is a major reason deep networks were so hard to train before ReLU activations and better initialisation schemes became common around 2010.
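To get a feel for the scale of the effect, the sketch below backpropagates through a random 20-layer network by hand and compares the gradient that arrives at the input for sigmoid versus ReLU. Everything about the setup is an assumption chosen for illustration: the width, the depth, the seed, the He-style weight scale, and the function name `first_layer_grad_norm`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each entry: (activation, derivative of the activation)
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "relu": (lambda z: np.maximum(0.0, z), lambda z: (z > 0).astype(float)),
}

def first_layer_grad_norm(name, n_layers=20, width=64, seed=0):
    """Norm of d(sum of final activations)/d(input) for a random MLP."""
    act, act_grad = ACTIVATIONS[name]
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    cache = []
    # Forward pass: remember weights and pre-activations for the backward pass.
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        z = W @ h
        cache.append((W, z))
        h = act(z)
    # Backward pass: chain rule, one layer at a time, from output to input.
    grad = np.ones(width)
    for W, z in reversed(cache):
        grad = W.T @ (grad * act_grad(z))
    return np.linalg.norm(grad)

for name in ACTIVATIONS:
    print(f"{name:8s} gradient norm at input: {first_layer_grad_norm(name):.3e}")
```

In this toy setup the sigmoid gradient reaching the input is many orders of magnitude smaller than the ReLU one. Exact numbers depend on the seed and the weight scale.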

Section 03

ReLU — The Revolution & Its Fatal Flaw

📖 Story

The Bouncer Who Lets Everyone In — Except the Negatives

ReLU (Rectified Linear Unit) has a brutally simple rule: if you are positive, pass through unchanged. If you are negative, you get set to zero. That's it. No sigmoid squeeze. No tanh drama. Just a single fold at zero.

This simplicity is ReLU's superpower: the gradient for any positive activation is exactly 1, so it passes through many layers without shrinking. In the AlexNet paper (Krizhevsky et al., 2012), a four-layer CNN reached 25% training error on CIFAR-10 about 6× faster with ReLU than with tanh. Deep networks finally worked. The 2010s deep learning renaissance was largely built on this one dumb function.

But the bouncer has a problem: once a neuron's input goes negative and stays negative, it gets frozen at zero permanently. The gradient is zero, no update happens, it's dead forever. This is the dying neuron problem.
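A single dead neuron is easy to reproduce. In the sketch below the inputs, the weights `w`, and the bias `b` are made-up values chosen so that the pre-activation is negative for every sample; the upstream gradient is set to 1 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # hypothetical inputs in [0, 1)

# Hypothetical weights after a bad update: z = X @ w + b is negative for every sample.
w = np.array([-1.0, -2.0, -0.5])
b = -0.1

z = X @ w + b
a = np.maximum(0.0, z)                      # ReLU output

# Backprop through ReLU: dL/dz = dL/da * 1[z > 0]; take dL/da = 1 for illustration.
dz = (z > 0).astype(float)
grad_w = X.T @ dz                           # dL/dw summed over the batch
grad_b = dz.sum()

print("max pre-activation:", z.max())
print("grad_w:", grad_w, "grad_b:", grad_b)
```

Because every gradient is zero, no gradient-descent step can move `w` or `b`, so the neuron stays silent on this data.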

📈 ReLU(z) = max(0, z) — Curve & Dying Neuron

Red dashed = zero-gradient (dead) zone. Positive z flows freely with gradient = 1. A high learning rate can kill entire layers.

💀

The Dying Neuron Problem

If a large gradient update pushes a neuron's weights so that z is always negative for every training sample, that neuron outputs zero and receives zero gradient forever. It's dead — contributing nothing. With aggressive learning rates or poor initialisation, up to 40% of a ReLU network can die before training is complete.
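One way to put a number on this is to count the units that output zero for every sample in a batch. The helper name `dead_fraction` and the toy layer below (positive weights, two strongly negative biases) are assumptions made for this sketch.

```python
import numpy as np

def dead_fraction(activations):
    """Fraction of units whose ReLU output is zero for EVERY sample.

    activations: array of shape (n_samples, n_units), taken after the ReLU.
    """
    dead_units = np.all(activations == 0, axis=0)
    return dead_units.mean()

# Toy layer with 4 units; the last two get strongly negative biases (made-up values).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(256, 8))
W = np.abs(rng.uniform(-1.0, 1.0, size=(8, 4)))   # positive weights keep the demo predictable
b = np.array([0.0, 0.0, -20.0, -20.0])

acts = np.maximum(0.0, X @ W + b)
print("dead fraction:", dead_fraction(acts))
```

A unit that is zero on some samples but not others is just sparse, not dead, which is why the check uses `np.all` over the batch axis.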

Section 04

Leaky ReLU, ELU & GELU — Fixing What ReLU Broke

Three activations emerged to solve the dying neuron problem — each with a different philosophy about what happens when z < 0.

📉

Leaky ReLU

f(z) = max(αz, z), α ≈ 0.01

Instead of a hard zero for negatives, it allows a small slope α (typically 0.01). Neurons can no longer die completely: negative inputs still receive a small gradient of α. Fast to compute. Can work better than ReLU in many vision tasks.

〰️

ELU

f(z) = z if z>0 else α(eᶻ − 1)

Exponential Linear Unit. Negative inputs curve smoothly toward −α instead of continuing linearly. Mean activations sit closer to zero, and with α = 1 the gradient is continuous at z = 0. Slightly more expensive because of the exp, and the gradient still shrinks toward zero for very negative z. Can outperform Leaky ReLU on deeper nets.

🔔

GELU

f(z) = z · Φ(z)

Gaussian Error Linear Unit. Multiplies z by Φ(z), the standard normal CDF, i.e. the probability that a standard Gaussian variable falls below z. The result is a smooth, deterministic soft gate. Used in GPT, BERT, and transformers broadly. Smooth, differentiable everywhere. A common default in large language models, alongside gated variants such as SwiGLU.
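The different philosophies for z < 0 show up most clearly in the gradients. The sketch below writes out the derivatives of ReLU, Leaky ReLU, and ELU on the negative side; the function names are made up for this example, and the α values are the typical defaults quoted above.

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)                     # 0 for every negative input

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)               # constant small slope alpha

def elu_grad(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(z))   # decays smoothly as z gets more negative

z = np.array([-5.0, -1.0, -0.1, 0.5])
for name, fn in [("ReLU", relu_grad), ("Leaky ReLU", leaky_relu_grad), ("ELU", elu_grad)]:
    print(f"{name:10s}", np.round(fn(z), 4))
```

Note that the ELU gradient is never exactly zero, but it does become very small for strongly negative inputs.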

📈 ReLU vs Leaky ReLU vs ELU vs GELU — Side by Side

GELU dips slightly below zero, reaching a minimum of about −0.17 near z ≈ −0.75 before rising, which creates a soft gating effect. ELU saturates at −α. Leaky ReLU stays linear with a small slope.
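The shape of the GELU dip can be checked directly from f(z) = z · Φ(z). This sketch writes Φ with the standard library's `math.erf`, locates the minimum on a grid, and compares the exact form with the common tanh approximation. The grid ranges and the names `gelu_exact` and `gelu_tanh` are choices made for this example.

```python
import math
import numpy as np

def gelu_exact(z):
    # Phi(z) is the standard normal CDF, written with the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # Widely used tanh-based approximation of GELU.
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)))

# Locate the lowest point of the curve on a fine grid of negative inputs.
grid = np.linspace(-3.0, 0.0, 3001)
values = np.array([gelu_exact(float(z)) for z in grid])
z_min, f_min = grid[values.argmin()], values.min()
print(f"minimum of GELU ~ {f_min:.3f} at z ~ {z_min:.2f}")

# Largest disagreement between the exact and approximate forms on [-5, 5].
max_gap = max(abs(gelu_exact(float(z)) - gelu_tanh(float(z))) for z in np.linspace(-5, 5, 1001))
print(f"largest gap between exact and tanh form on [-5, 5]: {max_gap:.1e}")
```

The minimum value is about −0.17 and it occurs at an input of roughly −0.75; the two numbers are easy to mix up when reading the plot.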

Activation	Dying Neurons	Zero-Centred	Smooth	Compute Cost	Best Use Case
ReLU	Yes — common	No	No (kink at 0)	Very low	CNNs, fast training baseline
Leaky ReLU	No	No	No	Very low	CNNs, GANs
ELU	No	Yes (≈)	Yes	Medium (exp)	Deep MLPs, regression tasks
GELU	No	Yes (≈)	Yes	Medium	Transformers, LLMs, BERT, GPT

Section 05

Softmax — Output Distributions

📖 Analogy

The Ballot Counter

After an election, you have raw vote counts: Cat got 400 votes, Dog got 300, Fish got 300. A ballot counter converts these into shares that sum to 100%: Cat 40%, Dog 30%, Fish 30%. Softmax does the same job for a network's raw scores, with one twist: it exponentiates each score before normalising, so gaps between scores are amplified rather than preserved proportionally. The largest score wins the most probability mass. The smallest is never completely silenced: it still gets some probability.

In a neural network, the final layer produces raw scores called logits. Softmax turns these into a probability distribution over all classes. The class with the highest probability is your prediction.
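It helps to see how softmax differs from taking a plain share of the total. The scores below are made up for illustration; the `softmax` helper is the usual max-subtracted form.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 1.0])      # made-up scores for Cat, Dog, Fish

plain_share = logits / logits.sum()     # simple "percentage of the total"
probs = softmax(logits)                 # exponentiate, then normalise

print("plain share:", np.round(plain_share, 3))
print("softmax:    ", np.round(probs, 3))

# Fed raw vote counts, softmax would hand essentially everything to the leader.
votes = softmax(np.array([400.0, 300.0, 300.0]))
print("softmax on vote counts:", votes)
```

Because of the exponential, only the differences between logits matter, and a larger gap means a much more lopsided distribution.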

📊 Softmax — Logits to Probability Distribution

Softmax amplifies differences: for logits [3.2, 1.8, 0.5], a lead of 1.4 over the runner-up becomes roughly 76% versus 19%. As an activation, it is used in the final classification layer, not hidden layers.

💡

Softmax is Only for the Output Layer

Softmax is not used as a hidden-layer activation: it would force the neurons in a layer into a zero-sum probability game, destroying the independent signal in each neuron. (Attention layers do apply softmax internally, but to attention scores rather than as a neuron activation.) As an activation, it belongs at the final step, converting logits to a probability distribution for multi-class classification. For binary classification, a single sigmoid output is preferred.
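The sigmoid recommendation for binary problems is not a separate trick: a two-class softmax with one logit pinned at 0 gives the same probability as a sigmoid on the other logit. A small check, using an arbitrary logit value chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = 1.4                                      # arbitrary logit
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]   # second logit pinned at 0

print(p_sigmoid, p_softmax)
```

It also follows that thresholding the sigmoid output at 0.5 is the same as checking whether the logit is positive.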

Section 06

Python — All Activations From Scratch

import numpy as np
import matplotlib.pyplot as plt

# ── Define all activation functions ───────────────────────

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # Approximation used in practice (Hendrycks & Gimpel 2016)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # Numerically stable version — subtract max before exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

# ── Quick demo ─────────────────────────────────────────────

z = np.array([3.2, 1.8, 0.5])  # raw logits
print("Logits:  ", z)
print("Softmax: ", np.round(softmax(z), 3))

z_range = np.linspace(-5, 5, 400)
fns = {
    "Sigmoid":    sigmoid(z_range),
    "tanh":       tanh(z_range),
    "ReLU":       relu(z_range),
    "Leaky ReLU": leaky_relu(z_range),
    "ELU":        elu(z_range),
    "GELU":       gelu(z_range),
}

for name, vals in fns.items():
    print(f"{name:12s} | min={vals.min():.3f}  max={vals.max():.3f}")

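OUTPUT

Logits:   [3.2 1.8 0.5]
Softmax:  [0.761 0.188 0.051]
Sigmoid      | min=0.007  max=0.993
tanh         | min=-1.000  max=1.000
ReLU         | min=0.000  max=5.000
Leaky ReLU   | min=-0.050  max=5.000
ELU          | min=-0.993  max=5.000
GELU         | min=-0.170  max=5.000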
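One caveat about the softmax listing: it takes the max and the sum over the whole array, so it is only correct for a single 1-D vector of logits. For a batch with one row per sample, a variant along these lines is one option; the name `softmax_batched` and the `axis` parameter are assumptions for this sketch.

```python
import numpy as np

def softmax_batched(z, axis=-1):
    """Softmax along one axis, so each row of a batch is normalised separately."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=axis, keepdims=True)   # per-row max keeps exp() from overflowing
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

batch = np.array([[3.2, 1.8, 0.5],
                  [1000.0, 1000.0, 1000.0]])         # second row would overflow a naive exp
probs = softmax_batched(batch)
print(probs.sum(axis=1))                             # each row sums to 1
```

`keepdims=True` keeps the reduced axis so the subtraction and division broadcast row by row.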

Section 07

PyTorch — Activations in a Real Network

import torch
import torch.nn as nn

# ── Model with GELU (modern default) ──────────────────────
class DeepMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),                  # ← modern choice
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim),   # raw logits
        )

    def forward(self, x):
        return self.net(x)             # Softmax in loss fn

# ── Comparing activations on gradient flow ─────────────────
activations = {
    "ReLU":       nn.ReLU(),
    "LeakyReLU":  nn.LeakyReLU(0.01),
    "ELU":        nn.ELU(1.0),
    "GELU":       nn.GELU(),
}

x = torch.linspace(-3, 3, 10, requires_grad=True)

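for name, act in activations.items():
    y = act(x).sum()
    y.backward()                        # fills x.grad with d(act)/dx at each input
    # Count inputs whose gradient is exactly zero (a proxy for "dead" units)
    dead = (x.grad == 0).sum().item()
    print(f"{name:12s} | dead neurons: {dead}/10")
    x.grad.zero_()                      # reset before testing the next activation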

# ── Loss function — Softmax is baked in here ───────────────
# Use CrossEntropyLoss: it applies LogSoftmax + NLLLoss
# NEVER add a Softmax layer before CrossEntropyLoss in PyTorch!
criterion = nn.CrossEntropyLoss()

OUTPUT

ReLU         | dead neurons: 5/10   ← 50% of neurons have zero gradient
LeakyReLU    | dead neurons: 0/10
ELU          | dead neurons: 0/10
GELU         | dead neurons: 0/10

⚠️

PyTorch Softmax Trap

Never add nn.Softmax() before nn.CrossEntropyLoss() in PyTorch. The loss function already applies log-softmax internally, so adding your own softmax applies it twice. The loss then sees values squeezed into (0, 1) instead of raw logits, which flattens the predicted distribution, weakens gradients, and slows or stalls training. Output raw logits from your final linear layer and let the loss handle the rest.
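The cost of the double softmax can be seen without PyTorch. The sketch below is a NumPy stand-in for a softmax-based cross-entropy on a single sample, not the library's implementation; `cross_entropy_from_logits` is a made-up helper and the logits are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    # Stand-in for a softmax-based cross-entropy: -log softmax(logits)[target]
    return -np.log(softmax(logits)[target])

logits = np.array([20.0, 0.0, 0.0])   # a very confident, correct prediction
target = 0

correct = cross_entropy_from_logits(logits, target)
doubled = cross_entropy_from_logits(softmax(logits), target)  # softmax applied twice

print(f"loss on raw logits:     {correct:.4f}")
print(f"loss on softmax output: {doubled:.4f}")
```

With three classes, the doubled version cannot push the loss below log((e + 2) / e) ≈ 0.55 no matter how confident the model is, because its inputs are confined to [0, 1].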

Section 08

How to Choose — The Decision Map

🗺️ Activation Function Decision Guide

Output Layer — Multi-class

Use Softmax. Converts logits to a probability distribution. Not meant as a hidden-layer activation.

Output Layer — Binary

Use Sigmoid. Single probability in (0,1). Threshold at 0.5 for class decision.

Transformers / LLMs

Use GELU. Smooth, differentiable everywhere. Used in BERT and GPT-2/3 and common across attention-based models; some newer LLMs use gated variants such as SwiGLU instead.

CNNs / Vision models

Start with ReLU. If >10% of neurons die, switch to Leaky ReLU. Batch normalisation also reduces the risk of dying neurons.

Deep MLP / Tabular

Try ELU or GELU. Zero-centred, smooth gradients. Better than ReLU on deep feed-forward nets.

GANs / Discriminator

Use Leaky ReLU (α = 0.2). Lets gradients flow for both positive and negative activations. Standard GAN practice.

🏆

The Modern Practitioner Default

When in doubt: use GELU in hidden layers + nn.CrossEntropyLoss (which handles softmax) for classification. This mirrors the setup of many transformer-based models and is usually a competitive starting point for deeper networks with minimal tuning.

Section 09

Golden Rules

⚡ Activation Functions — Non-Negotiable Rules

Avoid sigmoid and tanh as hidden-layer activations in deep feed-forward networks. Their gradients saturate, and your early layers will stop learning. Reserve sigmoid for binary output neurons (the gates inside LSTM/GRU cells are a deliberate exception).

Monitor your dead neuron rate. If a ReLU network's accuracy plateaus early, log the fraction of neurons per layer that output zero for every sample in a batch. Above 20% dead is a red flag: switch to Leaky ReLU or lower your learning rate.

As an activation, softmax belongs in the final output layer. It creates a zero-sum competition between neurons. Hidden layers need independent signals, not a probability race.

In PyTorch, output raw logits. nn.CrossEntropyLoss handles log-softmax internally. Adding a softmax layer before it applies softmax twice, which weakens your gradients and can stall training.

GELU is a strong modern default for deep networks. It is smooth, allows small negative outputs, and is the standard choice in many architectures built on residual connections and attention. Use it unless you have a reason not to.

Pair activations with appropriate initialisations. ReLU works best with He (Kaiming) initialisation. Sigmoid/tanh with Glorot (Xavier). Wrong initialisation + wrong activation = vanishing or exploding gradients before training even starts.
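For reference, the two initialisation schemes named in the last rule differ only in how they scale the random weights. The sketch below uses the commonly quoted formulas, He normal with std = sqrt(2 / fan_in) and Glorot uniform with limit = sqrt(6 / (fan_in + fan_out)); the function names and the layer size are assumptions for this example.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    # He (Kaiming) normal: std = sqrt(2 / fan_in), usually paired with ReLU-family activations
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

def glorot_uniform(fan_in, fan_out, rng):
    # Glorot (Xavier) uniform: limit = sqrt(6 / (fan_in + fan_out)), usually paired with sigmoid/tanh
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_relu = he_normal(512, 512, rng)
W_tanh = glorot_uniform(512, 512, rng)

print("He std:    ", W_relu.std())
print("Glorot std:", W_tanh.std())
```

For a square layer the He weights come out about √2 times larger, which compensates for ReLU zeroing roughly half of its inputs.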