
Activation Functions in Deep Learning

A visual, story-driven guide to non-linear activations: how Sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU and Softmax work, where they break, and how to choose the right one.

Section 01

Why Neurons Need to Be Non-Linear

100 Layers of Glass: Still Just Glass
Imagine stacking 100 sheets of clear glass on top of each other. No matter how many layers you add, light still passes straight through; it's still just glass. A stack of linear layers behaves the same way: no matter how deep your network, it can only represent a single linear (affine) transformation. It can never separate spirals, model speech, or understand images.

The fix? Introduce a small non-linear "kink" at every neuron: an activation function. Suddenly your stack of glass becomes a prism, then a kaleidoscope. Each layer now transforms space in ways the next layer can exploit. That is what makes deep networks deep.

An activation function takes a neuron's raw weighted sum z = Wx + b and applies a non-linear squeeze, shift, or gate before passing it forward. Without this step, every layer collapses into one, and depth becomes meaningless.

🔑 The One Rule

A good activation function must be differentiable almost everywhere so gradients can flow backwards during training. The shape of that derivative (whether it saturates, clips, or stays linear) largely determines how well your network learns.

📊 Linear vs Non-Linear: Decision Space
[Figure: two decision-space panels. No activation (linear): can only draw a straight line, misclassifies spirals. With activation (non-linear): can draw curves, separates complex distributions.]
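
The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses small hand-picked weights (W1, b1, W2, b2 and x are illustrative values, not taken from any real model) and assumes NumPy is available.

```python
import numpy as np

# Two "layers" with no activation between them (illustrative weights)
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])

x = np.array([1.0, 0.0])

# Layer by layer: z1 = W1 x + b1, then z2 = W2 z1 + b2
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping folded into ONE affine layer
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff

# With a ReLU between the layers, the folded version no longer matches
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print(two_layers, one_layer, with_relu)   # [0.5] [0.5] [1.5]
```

The first two results agree for any input because matrix products compose; the third differs because the ReLU zeroes the negative hidden unit before the second layer sees it.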

Section 02

Sigmoid & tanh: The Saturation Trap

The Volume Knob That Gets Stuck
Imagine a volume knob that only goes from 0 to 1. At the extremes it's glued to the floor or ceiling: no matter how hard you turn it, nothing changes. That is exactly what happens when a sigmoid neuron receives very large or very small inputs. Its output flatlines, its gradient vanishes to near zero, and the neuron stops learning entirely. In a 50-layer network, this silence echoes backward and nothing moves.
📈 Sigmoid σ(z) and tanh(z): Curves & Gradients
[Figure: curves of σ(z) = 1 / (1 + e^(−z)), ranging from 0 to 1, and tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)), ranging from −1 to +1, each with its gradient overlaid and the saturated regions marked at both extremes.]

Yellow dashed = derivative (gradient). Both functions saturate at the extremes: gradient ≈ 0 ⟹ vanishing gradient problem.

Sigmoid Output Range
σ(z) ∈ (0, 1)
Always positive output → outputs are not zero-centred → zig-zag gradient updates
tanh Output Range
tanh(z) ∈ (−1, +1)
Zero-centred → better gradient flow than sigmoid, but still saturates
Sigmoid Max Gradient
σ′(z) ≤ 0.25
Each sigmoid layer scales the gradient by at most 0.25 (a cut of ≥ 75%, before weights are counted): 10 layers = gradient × 0.25¹⁰ ≈ 10⁻⁶
tanh Max Gradient
tanh′(z) ≤ 1.0
Better than sigmoid but still collapses at large |z| values
⚠️ The Vanishing Gradient Problem

When gradients pass through many saturated sigmoid/tanh neurons, they shrink exponentially. By the time they reach the first layers of a deep network, they're essentially zero: those weights barely update and the early layers learn almost nothing. This is a major reason deep networks were so hard to train before roughly 2010.
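
The gradient bounds quoted above follow from the closed-form derivatives σ′(z) = σ(z)(1 − σ(z)) and tanh′(z) = 1 − tanh²(z). A minimal sketch using only the standard library (the helper names are illustrative, and the ten-layer product ignores weights, so it is a best-case bound rather than a simulation of real training):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # sigma'(z) = sigma(z) * (1 - sigma(z))

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2    # tanh'(z) = 1 - tanh(z)^2

peak_sigmoid = sigmoid_grad(0.0)      # largest possible sigmoid gradient
peak_tanh = tanh_grad(0.0)            # largest possible tanh gradient
saturated = sigmoid_grad(10.0)        # gradient deep in the saturated zone

# Best case for 10 stacked sigmoids: every neuron sits exactly at its peak
best_case_10_layers = peak_sigmoid ** 10

print(peak_sigmoid, peak_tanh)        # 0.25 1.0
print(f"{saturated:.1e}")             # 4.5e-05
print(f"{best_case_10_layers:.1e}")   # 9.5e-07
```

Even in the best case, ten sigmoid layers leave less than a millionth of the original gradient, and a single saturated neuron is already far worse than the peak.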


Section 03

ReLU: The Revolution & Its Fatal Flaw

The Bouncer Who Lets Everyone In, Except the Negatives
ReLU (Rectified Linear Unit) has a brutally simple rule: if you are positive, pass through unchanged. If you are negative, you get set to zero. That's it. No sigmoid squeeze. No tanh drama. Just a single fold at zero.

This simplicity is ReLU's superpower: the gradient for a positive activation is always exactly 1, so the activation itself never shrinks the signal, even through 100 layers. Training got markedly faster (the AlexNet paper reported a small CNN reaching the same training error about 6× sooner with ReLU than with tanh). Deep networks finally worked. The 2010s deep learning renaissance was largely built on this one dumb function.

But the bouncer has a problem: once a neuron's input goes negative and stays negative for every sample, it gets frozen at zero. The gradient is zero, no update happens, and the neuron is effectively dead. This is the dying neuron problem.
📈 ReLU(z) = max(0, z): Curve & Dying Neuron
[Figure: left panel, the ReLU curve with gradient = 0 for z < 0 and gradient = 1 for z > 0. Right panel, the dying neuron problem: an input of z = +2.1 passes through, while z = −3.4 gives ReLU → 0 and ∇ = 0, so no signal flows and the neuron is dead. Annotation: high LR → weights push z < 0 → neuron frozen; can lose 40–50% of neurons in aggressive training.]

Red dashed = zero-gradient (dead) zone. Positive z flows freely with gradient = 1. A high learning rate can kill entire layers.

💀 The Dying Neuron Problem

If a large gradient update pushes a neuron's weights so that z is negative for every training sample, that neuron outputs zero and receives zero gradient from then on. It's dead, contributing nothing. With aggressive learning rates or poor initialisation, a large share of a ReLU network (figures of 40% or more are often quoted) can die before training is complete.
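
One way to spot dead neurons in practice is to check, per neuron, whether the ReLU output is zero for every sample in a batch. The sketch below is a NumPy illustration under assumed conditions: the batch X, the weights W, and the bias vector b are made up, and the fourth neuron's bias is deliberately pushed far negative to imitate the aftermath of a bad update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))            # hypothetical batch: 256 samples, 8 features

W = rng.normal(size=(8, 4)) * 0.5        # one hidden layer with 4 neurons
b = np.array([0.0, 0.0, 0.0, -100.0])    # neuron 3 was pushed far into the negatives

Z = X @ W + b                            # pre-activations, shape (256, 4)
A = np.maximum(0, Z)                     # ReLU outputs

# A neuron counts as dead on this batch if it outputs zero for every sample
dead = (A == 0).all(axis=0)
dead_fraction = dead.mean()

# The ReLU gradient mask (1 where z > 0) is all zeros for a dead neuron,
# so no gradient reaches its weights
grad_mask = (Z > 0).astype(float)

print(dead)            # [False False False  True]
print(dead_fraction)   # 0.25
```

Logging dead_fraction per layer during training gives an early warning before accuracy stalls.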


Section 04

Leaky ReLU, ELU & GELU: Fixing What ReLU Broke

Three activations emerged to solve the dying neuron problem, each with a different philosophy about what happens when z < 0.

📉 Leaky ReLU
f(z) = max(αz, z), α ≈ 0.01
Instead of a hard zero for negatives, allows a small slope α (typically 0.01). Neurons can never fully die: they still get a tiny gradient. Fast to compute. Works better than ReLU in some vision tasks.
〰️ ELU
f(z) = z if z > 0 else α(e^z − 1)
Exponential Linear Unit. Negatives smoothly curve toward −α rather than going linearly negative. Roughly zero-centred outputs, and a continuous gradient at zero when α = 1. Slightly more expensive due to the exp, but often reported to beat Leaky ReLU on deeper nets.
🔔 GELU
f(z) = z · Φ(z)
Gaussian Error Linear Unit. Multiplies z by Φ(z), the probability that a standard Gaussian variable falls below z: a smooth, deterministic version of a stochastic gate. Used in BERT, the GPT family, and transformers broadly. Smooth, differentiable everywhere. A common default in modern large language models.
📈 ReLU vs Leaky ReLU vs ELU vs GELU: Side by Side
[Figure: the four curves (ReLU, Leaky ReLU, ELU, GELU) overlaid on shared axes, z from −3 to +3.]

GELU dips slightly below zero (minimum ≈ −0.17 near z ≈ −0.75) before rising, creating a soft gating effect. ELU saturates at −α. Leaky ReLU stays linear with a small slope.
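
The location of GELU's dip can be verified with the exact form z · Φ(z), writing the standard normal CDF via the error function. This is a rough grid search using only the standard library; gelu_exact and the grid spacing are illustrative choices.

```python
import math

def gelu_exact(z):
    # z * Phi(z), where Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# Coarse grid over [-3, 0] in steps of 0.001
grid = [-3.0 + 0.001 * i for i in range(3001)]
z_min = min(grid, key=gelu_exact)

print(round(z_min, 2), round(gelu_exact(z_min), 2))   # -0.75 -0.17
```

So −0.17 is the depth of the dip, and it occurs at an input of roughly −0.75.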

| Activation | Dying Neurons | Zero-Centred | Smooth | Compute Cost | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| ReLU | Yes, common | No | No (kink at 0) | Very low | CNNs, fast training baseline |
| Leaky ReLU | No | No | No | Very low | CNNs, GANs |
| ELU | No | Yes (≈) | Yes | Medium (exp) | Deep MLPs, regression tasks |
| GELU | No | Yes (≈) | Yes | Medium | Transformers, LLMs, BERT, GPT |

Section 05

Softmax: Output Distributions

The Ballot Counter
After an election, you have raw vote counts: Cat got 400 votes, Dog got 300, Fish got 300. Softmax is the ballot counter that converts these into percentages that sum to 100%: Cat 40%, Dog 30%, Fish 30%. (Real softmax exponentiates each score before normalising, so the leader's share grows faster than its raw lead.) The largest number wins the most probability mass. The smallest number is never completely silenced: it still gets some probability.

In a neural network, the final layer produces raw scores called logits. Softmax turns these into a probability distribution over all classes. The class with the highest probability is your prediction.
📊 Softmax: Logits to Probability Distribution
[Figure: logits (z) Cat: 3.2, Dog: 1.8, Fish: 0.5 → softmax(z)_i = e^(z_i) / Σ_j e^(z_j) → probabilities Cat 76%, Dog 19%, Fish 5%. Σ = 100% ✓ Never negative, always sums to 1.]

Softmax amplifies differences: here a logit advantage of 1.4 (Cat over Dog) becomes a gap of about 57 percentage points. As an activation it is used in the final classification layer, not in hidden layers.

💡 Softmax is Only for the Output Layer

As a neuron activation, softmax is not used in hidden layers: it would force every layer into a zero-sum probability game, destroying the independent signal in each neuron. (The softmax inside transformer attention is a different use: it normalises attention weights.) It belongs at the final step, converting logits to a probability distribution for multi-class classification. For binary classification, a single sigmoid output is preferred.
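
The preference for a single sigmoid in the binary case loses nothing: a sigmoid on one logit z gives the same probability as a two-class softmax over the logits [z, 0]. A small standard-library check (the logit value 1.4 is arbitrary, and the helper functions are written here for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)                            # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.4
p_sigmoid = sigmoid(z)               # one output neuron
p_softmax = softmax([z, 0.0])[0]     # two output neurons, second logit fixed at 0

print(round(p_sigmoid, 4), round(p_softmax, 4))   # 0.8022 0.8022
```

The second softmax logit is redundant, which is why a binary classifier needs only one output.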


Section 06

Python: All Activations From Scratch

import numpy as np

# ── Define all activation functions ───────────────────────

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def gelu(z):
    # tanh approximation to z * Phi(z) (Hendrycks & Gimpel, 2016)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # Numerically stable version: subtract max before exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

# ── Quick demo ─────────────────────────────────────────────

z = np.array([3.2, 1.8, 0.5])  # raw logits
print("Logits:  ", z)
print("Softmax: ", np.round(softmax(z), 3))

z_range = np.linspace(-5, 5, 400)
fns = {
    "Sigmoid":    sigmoid(z_range),
    "tanh":       tanh(z_range),
    "ReLU":       relu(z_range),
    "Leaky ReLU": leaky_relu(z_range),
    "ELU":        elu(z_range),
    "GELU":       gelu(z_range),
}

for name, vals in fns.items():
    print(f"{name:12s} | min={vals.min():.3f}  max={vals.max():.3f}")
OUTPUT
Logits:   [3.2 1.8 0.5]
Softmax:  [0.761 0.188 0.051]
Sigmoid      | min=0.007  max=0.993
tanh         | min=-1.000  max=1.000
ReLU         | min=0.000  max=5.000
Leaky ReLU   | min=-0.050  max=5.000
ELU          | min=-0.993  max=5.000
GELU         | min=-0.170  max=5.000

Section 07

PyTorch: Activations in a Real Network

import torch
import torch.nn as nn

# ── Model with GELU (modern default) ──────────────────────
class DeepMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),                  # ← modern choice
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim),   # raw logits
        )

    def forward(self, x):
        return self.net(x)             # Softmax in loss fn

# ── Comparing activations on gradient flow ─────────────────
activations = {
    "ReLU":       nn.ReLU(),
    "LeakyReLU":  nn.LeakyReLU(0.01),
    "ELU":        nn.ELU(1.0),
    "GELU":       nn.GELU(),
}

x = torch.linspace(-3, 3, 10, requires_grad=True)

for name, act in activations.items():
    y = act(x).sum()
    y.backward()
    # Count inputs whose gradient is exactly zero (a stand-in for "dead" units)
    dead = (x.grad == 0).sum().item()
    print(f"{name:12s} | dead neurons: {dead}/10")
    x.grad.zero_()                     # reset before the next activation

# ── Loss function: Softmax is baked in here ────────────────
# Use CrossEntropyLoss: it applies LogSoftmax + NLLLoss
# NEVER add a Softmax layer before CrossEntropyLoss in PyTorch!
criterion = nn.CrossEntropyLoss()
OUTPUT
ReLU         | dead neurons: 5/10   ← 50% of inputs have zero gradient
LeakyReLU    | dead neurons: 0/10
ELU          | dead neurons: 0/10
GELU         | dead neurons: 0/10
⚠️ PyTorch Softmax Trap

Never add nn.Softmax() before nn.CrossEntropyLoss() in PyTorch. The loss function already applies log-softmax internally. Adding your own softmax first means softmax is applied twice: no error is raised, but the predicted distribution is flattened, gradients weaken, and training slows or stalls. Output raw logits from your final linear layer and let the loss handle the rest.
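
The effect of the trap can be reproduced without PyTorch. The sketch below is a NumPy stand-in, not the library's implementation: cross_entropy_from_logits is a hypothetical helper that mimics a loss with log-softmax built in, and it is fed first raw logits, then already-softmaxed values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    # Mimics a loss with softmax built in: log-softmax, then negative log-likelihood
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([3.2, 1.8, 0.5])
target = 0                                    # the true class is index 0

correct = cross_entropy_from_logits(logits, target)            # softmax applied once
doubled = cross_entropy_from_logits(softmax(logits), target)   # softmax applied twice

probs_once = softmax(logits)
probs_twice = softmax(softmax(logits))

print(np.round(probs_once, 3))    # [0.761 0.188 0.051]
print(np.round(probs_twice, 3))   # [0.487 0.274 0.239]
print(round(float(correct), 3), round(float(doubled), 3))   # 0.273 0.72
```

Because the second softmax only ever sees inputs between 0 and 1, with three classes it can never assign more than e / (e + 2) ≈ 0.58 to any class, so the loss stays high however confident the model becomes.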


Section 08

How to Choose: The Decision Map

🗺️ Activation Function Decision Guide
Output Layer: Multi-class
Use Softmax. Converts logits to a probability distribution. Not a hidden-layer activation.
Output Layer: Binary
Use Sigmoid. Single probability in (0,1). Threshold at 0.5 for class decision.
Transformers / LLMs
Use GELU. Smooth, differentiable everywhere. Used in BERT, GPT-2 and GPT-3, and a common default in attention-based models (some recent LLMs use gated variants such as SwiGLU instead).
CNNs / Vision models
Start with ReLU. If more than about 10% of neurons die, switch to Leaky ReLU. Batch normalisation helps reduce dying.
Deep MLP / Tabular
Try ELU or GELU. Roughly zero-centred, smooth gradients. Often better than ReLU on deep feed-forward nets.
GANs / Discriminator
Use Leaky ReLU (α = 0.2). Lets gradients flow for both positive and negative activations. Standard GAN practice.
🏆 The Modern Practitioner Default

When in doubt: use GELU in hidden layers + nn.CrossEntropyLoss (which handles softmax) for classification. This is the setup used by many major transformer-based models, and it often matches or beats ReLU on deeper networks with minimal tuning.


Section 09

Golden Rules

⚡ Activation Functions: Non-Negotiable Rules
1. Avoid sigmoid or tanh as hidden-layer activations in deep feed-forward networks. Their gradients saturate, and your deep layers will stop learning. Reserve sigmoid for binary output neurons.
2. Monitor your dead neuron rate. If a ReLU network's accuracy plateaus early, log the fraction of zero activations per layer. Above 20% dead is a red flag: switch to Leaky ReLU or lower your learning rate.
3. Softmax is only for the final output layer. It creates a zero-sum competition between neurons. Hidden layers need independent signals, not a probability race.
4. In PyTorch, output raw logits. nn.CrossEntropyLoss handles log-softmax internally. Adding a softmax layer before it applies softmax twice, which flattens the predicted distribution and weakens your gradients.
5. GELU is a strong modern default for deep networks. It is smooth, roughly zero-centred, and tends to give better gradient flow than ReLU in networks with residual connections or attention. Use it unless you have a strong reason not to.
6. Pair activations with appropriate initialisations. ReLU works best with He (Kaiming) initialisation. Sigmoid/tanh with Glorot (Xavier). Wrong initialisation + wrong activation = vanishing or exploding gradients before training even starts.
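
Rule 6 can be illustrated by pushing random data through a stack of ReLU layers and watching the activation scale. The sketch below is a NumPy simulation under assumed settings (20 layers, width 256, Gaussian weights, no biases); forward_std, he_std and glorot_std are illustrative helpers, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_std(fan_in, fan_out):
    return np.sqrt(2.0 / fan_in)               # He / Kaiming scaling

def glorot_std(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))   # Glorot / Xavier scaling

def relu(z):
    return np.maximum(0, z)

def forward_std(init_std, activation, depth=20, width=256, batch=256):
    """Pass random inputs through `depth` layers and return the final activation std."""
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * init_std(width, width)
        h = activation(h @ W)
    return h.std()

std_relu_he = forward_std(he_std, relu)          # scale stays of order 1
std_relu_glorot = forward_std(glorot_std, relu)  # scale shrinks layer by layer

print(f"ReLU + He:     {std_relu_he:.4f}")
print(f"ReLU + Glorot: {std_relu_glorot:.6f}")
```

With He scaling the signal keeps roughly its original magnitude after 20 layers, while Glorot scaling paired with ReLU loses about half the signal energy per layer, so the forward signal (and with it the gradient) fades before training begins.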