
Optimizers in Deep Learning

A comprehensive guide to the four core optimization algorithms used in machine learning: SGD, Adagrad, RMSProp, and Adam.

Section 01

The Story That Explains Optimizers

The Blindfolded Hiker Looking for the Valley
Imagine you are blindfolded on a mountain range. Your goal is simple: reach the lowest valley. You cannot see the terrain; you can only feel the slope beneath your feet right now.

One strategy: take small, careful steps downhill every time. This is Gradient Descent, the grandfather of all optimizers. But what if the slope is very gentle? You shuffle forward for hours. What if it's steep? You leap too far and overshoot the valley.

Different optimizers are different strategies for this blindfolded hiker. SGD is the original cautious stepper. Adagrad remembers every rocky patch and slows down there. RMSProp forgets the old history and adapts to recent terrain. Adam is the hiker with GPS, momentum, and memory, and it is the most popular for good reason.

In machine learning, an optimizer is the algorithm that adjusts a model's parameters (weights and biases) to minimize the loss function, the measure of how wrong the model's predictions are. The process is called training, and every optimizer approaches it differently.

Core Concept: What Is a Gradient?

The gradient points in the direction of steepest ascent in the loss landscape. Optimizers move in the opposite direction, downhill, by a step size called the learning rate (η). The central update rule for all gradient-based optimizers is:

θ ← θ − η · ∇L(θ)

where θ are the parameters, η is the learning rate, and ∇L(θ) is the gradient of the loss.
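A minimal sketch of this update rule, assuming a hypothetical one-parameter loss L(θ) = (θ − 3)² chosen only for illustration. The names `loss`, `grad`, and `gradient_descent` are made up for this example and do not come from any library.

```python
def loss(theta):
    # Hypothetical loss with its minimum at theta = 3
    return (theta - 3.0) ** 2

def grad(theta):
    # Analytic derivative of the loss above
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, lr=0.1, steps=50):
    for _ in range(steps):
        theta = theta - lr * grad(theta)   # θ ← θ − η·∇L(θ)
    return theta

theta_final = gradient_descent(theta=0.0)
print(f"theta = {theta_final:.4f}, loss = {loss(theta_final):.2e}")
```

With lr = 0.1, each step shrinks the distance to the minimum by a factor of 0.8, so θ ends very close to 3.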


Section 02

The Loss Landscape: A Visual Introduction

Before diving into individual optimizers, it helps to understand what they are navigating. The loss landscape is a high-dimensional surface where each point represents a configuration of model weights and its corresponding loss value.

[Figure: Loss Landscape, What Optimizers Navigate. A loss surface showing a local minimum, a saddle point (both traps), and the global minimum (the goal), with the paths taken by SGD and Adam overlaid.]

Blue path = SGD (erratic, can get stuck). Purple path = Adam (smooth, efficient). Green star = global minimum (what we want).

| Challenge | Description | Problematic For |
|---|---|---|
| Local Minima | A valley that isn't the deepest; the optimizer gets stuck | SGD without momentum |
| Saddle Points | Flat regions that aren't minima; gradient ≈ 0 in all directions | First-order methods |
| Plateaus | Large flat regions with tiny gradients; training stalls | Fixed learning rate methods |
| Ravines | Narrow valleys with steep walls; the optimizer oscillates | SGD without momentum |
| Ill-conditioning | Very different curvatures in different directions | SGD, needs adaptive rates |
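To make the saddle-point row concrete, here is a small sketch on the textbook saddle f(x, y) = x² − y², which is an assumption of this example and not a loss from any real model. Started exactly on the line y = 0, plain gradient descent walks into the saddle and stays there; a tiny perturbation in y is enough to escape.

```python
def grad(x, y):
    # f(x, y) = x**2 - y**2 has a saddle point at the origin
    return 2.0 * x, -2.0 * y

def descend(x, y, lr=0.1, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

stuck = descend(1.0, 0.0)      # starts exactly on the line y = 0
escaped = descend(1.0, 1e-6)   # same start, nudged slightly off that line

print("on the line :", stuck)
print("perturbed   :", escaped)
```

This toy function is unbounded below, so the perturbed run keeps sliding away from the saddle. In a real loss landscape the escape direction eventually leads to a minimum, and mini-batch noise often supplies the perturbation.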

Section 03

SGD: Stochastic Gradient Descent

The Stubborn Hiker with One Rule
Classic SGD is the stubborn hiker who always applies the same step-size rule, downhill, using only the terrain directly underfoot at this very moment. They ignore the history of where they've been. If the slope is gentle, the step is too small and progress crawls. If it's steep, the step is too big and they overshoot. They bounce off ravine walls. But they get there, just not gracefully.

The "stochastic" part means they estimate the slope by looking at just one rock (one data point) or a handful of rocks (a mini-batch) instead of the entire mountain map. Fast, but noisy.
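The "fast, but noisy" trade-off can be measured directly. The sketch below assumes a synthetic one-weight regression problem (y ≈ 2x plus noise, invented for this example) and compares how much the gradient estimate fluctuates for different batch sizes. All names are hypothetical.

```python
import random
import statistics

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

def grad_w(w, indices):
    # d/dw of the mean squared error (w*x - y)**2 over the chosen samples
    return sum(2.0 * xs[i] * (w * xs[i] - ys[i]) for i in indices) / len(indices)

w = 0.0
full_grad = grad_w(w, range(len(xs)))   # exact gradient over all samples

def estimate_spread(batch_size, trials=200):
    # Standard deviation of the mini-batch gradient across random batches
    estimates = [
        grad_w(w, rng.sample(range(len(xs)), batch_size)) for _ in range(trials)
    ]
    return statistics.stdev(estimates)

spread = {b: estimate_spread(b) for b in (1, 32, 256)}
print(f"full gradient: {full_grad:.3f}")
for b, s in spread.items():
    print(f"batch size {b:>3d}: spread of estimate = {s:.3f}")
```

Larger batches give estimates that cluster more tightly around the full gradient, at the cost of more computation per update.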

The Update Rule

SGD Update Rule

θ_{t+1} = θ_t − η · ∇L(θ_t)

where θ_{t+1} are the new weights, θ_t the old weights, η the learning rate, and ∇L(θ_t) the gradient of the loss.

SGD Variants
Vanilla GD
Uses the full dataset to compute the gradient. Exact but very slow on large datasets, since each update requires seeing all N samples.
SGD
Uses a single random sample per update. Very fast but extremely noisy: the gradient estimate is highly variable, causing erratic paths.
Mini-batch
Uses a batch of B samples (typically 32 to 512). The sweet spot: estimates the gradient more accurately than pure SGD while remaining fast. This is what people usually mean by "SGD" in practice.
SGD + Momentum
Adds a velocity term v = βv + η∇L. Accumulates gradient direction like a ball rolling downhill, which smooths oscillations and accelerates convergence.

[Animation: SGD convergence with vs. without momentum. Without momentum, SGD oscillates across the ravine; with momentum, it follows a smooth curved path.]

SGD with Momentum: The Update Rule

Velocity Update
v_t = β·v_{t-1} + η·∇L(θ_t)
β is the momentum coefficient (typically 0.9). Accumulates past gradients like a rolling ball gaining speed.
Parameter Update
θ_{t+1} = θ_t − v_t
Subtract the velocity instead of the raw gradient. The accumulated velocity carries the optimizer past small obstacles.
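The two momentum equations above can be sketched in a few lines. The example assumes a hypothetical ravine-shaped loss L(x, y) = 0.5·(x² + 25y²) and counts how many steps are needed to reach a small loss with and without momentum. The function names and the tolerance are illustrative choices.

```python
def grad(x, y):
    # L(x, y) = 0.5 * (x**2 + 25 * y**2): steep in y, shallow in x
    return x, 25.0 * y

def loss(x, y):
    return 0.5 * (x * x + 25.0 * y * y)

def steps_to_converge(beta, lr=0.03, tol=1e-6, max_steps=10_000):
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        vx = beta * vx + lr * gx      # v_t = β·v_{t-1} + η·∇L(θ_t)
        vy = beta * vy + lr * gy
        x, y = x - vx, y - vy         # θ_{t+1} = θ_t − v_t
        if loss(x, y) < tol:
            return step
    return max_steps

plain_steps = steps_to_converge(beta=0.0)      # β = 0 reduces to plain gradient descent
momentum_steps = steps_to_converge(beta=0.9)
print(f"no momentum: {plain_steps} steps, momentum 0.9: {momentum_steps} steps")
```

Note that PyTorch's `optim.SGD` uses a slightly different but closely related formulation, in which the learning rate multiplies the velocity instead of the gradient.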

Python Implementation

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# SGD: plain vanilla
optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.01           # learning rate
)

# SGD with momentum: almost always better
optimizer_sgd_m = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,     # β: accumulates velocity
    weight_decay=1e-4 # L2 regularization
)

# SGD with Nesterov momentum (lookahead variant, often better)
optimizer_nag = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True     # Nesterov Accelerated Gradient
)

# Training loop
criterion = nn.CrossEntropyLoss()

def train_step(X_batch, y_batch, optimizer):
    optimizer.zero_grad()           # clear old gradients
    output = model(X_batch)
    loss = criterion(output, y_batch)
    loss.backward()                  # compute gradients
    optimizer.step()                 # update weights
    return loss.item()
SGD's Fatal Flaw: One Learning Rate for All

SGD applies the same learning rate η to every parameter. But in a deep neural network, some weights see frequent, informative gradients (common features) while others see sparse, rare ones. A single η can't be simultaneously optimal for both. This is the problem all subsequent optimizers solve.
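A small sketch of why one shared rate is limiting, assuming a hypothetical two-parameter loss L = 0.5·(1·a² + 100·b²). With a shared rate, stability of the steep coordinate forces η below 2/100 = 0.02, which leaves the shallow coordinate crawling. The hand-picked per-coordinate rates are only possible here because the curvatures are known in advance, which is not the case in a real network.

```python
def run(lrs, steps=100):
    """Gradient descent on L = 0.5 * (1 * a**2 + 100 * b**2), one rate per coordinate."""
    curvatures = (1.0, 100.0)
    theta = [5.0, 5.0]
    for _ in range(steps):
        grads = [c * t for c, t in zip(curvatures, theta)]
        theta = [t - lr * g for t, lr, g in zip(theta, lrs, grads)]
    return theta

shared = run(lrs=(0.015, 0.015))     # one rate, capped by the steep coordinate
per_param = run(lrs=(0.5, 0.005))    # each rate matched to its coordinate's curvature

print("shared rate    :", shared)
print("per-param rates:", per_param)
```

After 100 steps the shared-rate run has barely reduced the shallow coordinate, while the per-coordinate run has driven both to essentially zero. Adaptive optimizers try to estimate such per-parameter scales automatically from the gradients.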


Section 04

Adagrad: Adaptive Gradient Algorithm

The Hiker Who Remembers Every Rocky Patch
Our Adagrad hiker has a detailed diary. Every time they step on a rocky patch (a steep gradient for a feature), they write it down. The next time they approach that patch, they take a smaller step, because it was rough last time. Smooth features they've rarely crossed? They step more boldly there.

This is adaptive learning: frequent features get smaller rates; rare features get larger ones. It was a revelation for sparse data: NLP, recommendation systems, search. The catch: the diary fills up forever. Eventually, every step is so tiny the hiker stops moving.

The Update Rule

Adagrad Update Rule

Step 1: Accumulate squared gradients
G_t = G_{t-1} + (∇L(θ_t))²
(running sum of squared gradients, kept per parameter; grows monotonically)

Step 2: Adaptive parameter update
θ_{t+1} = θ_t − η / √(G_t + ε) · ∇L(θ_t)
(η is the global learning rate; the divisor √(G_t + ε) grows over time, so the effective step shrinks)

Where Adagrad Shines
Sparse data, NLP
Rare words/features get large updates (small accumulated G). Frequent ones get small updates. Well suited to word embeddings and bag-of-words models.
Hyperparameters
η=0.01, ε=1e-8
η = global learning rate (can be set higher than SGD since it's automatically scaled). ε = tiny constant for numerical stability (prevents division by zero).
The Fatal Problem
Monotonically decreasing LR
G only grows; it never shrinks. After many iterations, every parameter's effective learning rate → 0. Training stops learning. This is what limits Adagrad in long training runs of deep networks.

Python Implementation

import torch.optim as optim

# Adagrad: great for sparse problems (NLP, recommendation)
optimizer_adagrad = optim.Adagrad(
    model.parameters(),
    lr=0.01,             # global learning rate
    lr_decay=0,           # optional extra decay
    eps=1e-8,             # numerical stability
    weight_decay=0        # L2 regularization
)

# From scratch: educational implementation
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """
    theta : current parameters (array)
    grad  : gradient of loss w.r.t theta
    G     : accumulated sum of squared gradients
    """
    G = G + grad**2                       # accumulate squared grads
    effective_lr = lr / (np.sqrt(G) + eps)  # per-param learning rate
    theta = theta - effective_lr * grad      # update parameters
    return theta, G

# Example with a sparse gradient scenario
theta = np.array([1.0, 1.0, 1.0])
G     = np.zeros_like(theta)           # starts at zero

# Feature 0 seen frequently, feature 2 rarely
grads = [
    np.array([0.5, 0.0, 0.0]),
    np.array([0.5, 0.0, 0.5]),
    np.array([0.5, 0.3, 0.0]),
    np.array([0.5, 0.0, 0.0]),
]

for i, grad in enumerate(grads):
    theta, G = adagrad_update(theta, grad, G)
    eff_lr = 0.01 / (np.sqrt(G) + 1e-8)
    print(f"Step {i+1}: θ={np.round(theta, 4).tolist()}, eff_lr={np.round(eff_lr, 4).tolist()}")
OUTPUT
Step 1: θ=[0.99, 1.0, 1.0], eff_lr=[0.02, 1000000.0, 1000000.0]
Step 2: θ=[0.9829, 1.0, 0.99], eff_lr=[0.0141, 1000000.0, 0.02]
Step 3: θ=[0.9772, 0.99, 0.99], eff_lr=[0.0115, 0.0333, 0.02]
Step 4: θ=[0.9722, 0.99, 0.99], eff_lr=[0.01, 0.0333, 0.02]
← Feature 0 (frequent): effective LR shrinks at every step
← Features 1 and 2 (rare): effective LR stays higher
← An effective LR of 1000000.0 means no gradient has been seen yet (G = 0, so the rate is lr/eps); it has no effect because that gradient is zero

Section 05

RMSProp: Root Mean Square Propagation

The Hiker Who Only Trusts Recent Memory
Adagrad's hiker keeps a diary that grows forever, eventually making every step microscopic. RMSProp's hiker has a smarter diary: a fading memory. Yesterday's rocky patches count less than today's. Last week's? Almost forgotten.

The technical term is an exponentially weighted moving average (EWMA) of squared gradients. The accumulated history decays with coefficient β (typically 0.9 or 0.99), preventing the learning rate from collapsing to zero.

Geoffrey Hinton introduced RMSProp in 2012 in a Coursera lecture. It was never formally published, yet it became one of the most widely used optimizers for recurrent neural networks and RL.

The Update Rule

RMSProp Update Rule

Step 1: Exponentially weighted moving average of squared gradients
E[g²]_t = β · E[g²]_{t-1} + (1−β) · (∇L(θ_t))²
(β is the decay factor; β=0.9 means 90% old history + 10% new squared gradient, so history fades)

Step 2: Update with adaptive rate
θ_{t+1} = θ_t − η / √(E[g²]_t + ε) · ∇L(θ_t)
(the divisor stays bounded because of the fading memory)

The two tables below assume a constant gradient g = 0.5, η = 0.01, and β = 0.9.

Adagrad: Accumulates Forever

| Step | G (sum of g²) | Eff. LR (η/√G) |
|---|---|---|
| 1 | 0.25 | 0.0200 |
| 10 | 2.50 | 0.0063 |
| 100 | 25.0 | 0.0020 |
| 1000 | 250 | 0.0006 |
| 10000 | 2500 | 0.0002 (dying) |

RMSProp: Stays Bounded

| Step | E[g²] (EWMA) | Eff. LR (η/√E[g²]) |
|---|---|---|
| 1 | 0.025 | 0.0632 |
| 10 | ~0.16 | 0.0248 |
| 100 | ~0.25 | 0.0200 |
| 1000 | ~0.25 | 0.0200 |
| 10000 | ~0.25 | 0.0200 (stable) |
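The effective learning rates in the two tables can be recomputed with a few lines. This is a sketch assuming the same constant gradient g = 0.5, η = 0.01, and β = 0.9, and ignoring ε since the accumulators are never zero here. The helper names are hypothetical.

```python
import math

def adagrad_eff_lr(step, g=0.5, lr=0.01):
    G = step * g * g                      # running sum of g² after `step` identical gradients
    return lr / math.sqrt(G)

def rmsprop_eff_lr(step, g=0.5, lr=0.01, beta=0.9):
    eg2 = 0.0
    for _ in range(step):
        eg2 = beta * eg2 + (1 - beta) * g * g   # EWMA of squared gradients
    return lr / math.sqrt(eg2)

for step in (1, 10, 100, 1000, 10000):
    print(f"{step:>6d}  adagrad={adagrad_eff_lr(step):.4f}  rmsprop={rmsprop_eff_lr(step):.4f}")
```

Adagrad's rate falls off like 1/√t, while RMSProp's settles at η/|g| once the moving average has warmed up.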

Python Implementation

import torch.optim as optim

# RMSProp: excellent for RNNs and RL
optimizer_rms = optim.RMSprop(
    model.parameters(),
    lr=0.001,            # learning rate
    alpha=0.99,           # β: decay for moving average (0.9 or 0.99)
    eps=1e-8,             # numerical stability
    weight_decay=0,       # L2 regularization
    momentum=0,           # optional momentum term
    centered=False         # if True: normalize by the variance estimate E[g²] − (E[g])² (centered variant)
)

# From scratch: numpy educational version
import numpy as np

def rmsprop_update(theta, grad, eg2, lr=0.001, beta=0.9, eps=1e-8):
    """
    theta : current parameters
    grad  : gradient of loss
    eg2   : running EWMA of squared gradients
    beta  : decay factor (0.9 = 10% new info per step)
    """
    eg2   = beta * eg2 + (1 - beta) * grad**2   # EWMA update
    theta = theta - lr / (np.sqrt(eg2) + eps) * grad  # adaptive update
    return theta, eg2

# Apply to a simple 1D loss: L(θ) = θ²  (minimum at θ=0)
theta = 5.0
eg2   = 0.0

for step in range(1, 21):
    grad       = 2 * theta                          # ∇(θ²) = 2θ
    theta, eg2 = rmsprop_update(theta, grad, eg2)
    if step % 5 == 0:
        print(f"Step {step:2d}: θ = {theta:.3f}, E[g²] = {eg2:.1f}")
OUTPUT
Step  5: θ = 4.989, E[g²] = 40.9
Step 10: θ = 4.983, E[g²] = 64.9
Step 15: θ = 4.977, E[g²] = 78.9
Step 20: θ = 4.971, E[g²] = 87.1
→ θ moves toward 0 in small, nearly uniform steps. The normalized step is at most lr/√(1−β) ≈ 0.003, so with lr=0.001 reaching the minimum from θ=5 takes a few thousand steps; a larger lr converges faster.
RMSProp is the go-to for RNNs and Reinforcement Learning

RNNs suffer from exploding and vanishing gradients across time steps. RMSProp's adaptive scaling helps here: large gradients get divided by their own running magnitude, which limits the size of any single update. In RL, reward signals are highly non-stationary; the fading memory adapts quickly to new reward scales. DeepMind used RMSProp in the original DQN Atari paper (Mnih et al., 2015).
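The claim that large gradients are divided by their own magnitude can be checked with a short sketch. It assumes an arbitrary, made-up gradient sequence and compares the RMSProp step sizes when every gradient is multiplied by 1000. The function name is hypothetical.

```python
import math

def rmsprop_steps(grads, lr=0.001, beta=0.9, eps=1e-8):
    eg2, steps = 0.0, []
    for g in grads:
        eg2 = beta * eg2 + (1 - beta) * g * g          # fading average of g²
        steps.append(lr * g / (math.sqrt(eg2) + eps))  # normalized update
    return steps

base = [0.5, -0.2, 0.8, 0.1, -0.6]
small = rmsprop_steps(base)
large = rmsprop_steps([1000.0 * g for g in base])      # same sequence, 1000x larger

for s, l in zip(small, large):
    print(f"step with g: {s:+.6f}   step with 1000*g: {l:+.6f}")
```

The two columns agree up to the tiny effect of ε, and no step exceeds lr/√(1−β). This bounds the update size, but it does not replace gradient clipping when individual gradients spike far above the running average.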


Section 06

Adam: Adaptive Moment Estimation

The Hiker with GPS, Compass, and a Fading Diary
Adam is the synthesis of everything we've learned. It combines momentum (which plain RMSProp lacks) with per-parameter adaptive learning rates (which SGD lacks).

The GPS is the first moment: a running average of the gradient itself (direction). The diary is the second moment: a running EWMA of squared gradients (scale). The compass is the bias correction: early in training, both moments start at zero and underestimate the true values, and Adam corrects for this automatically.

Kingma & Ba published Adam in 2014. It is now the default optimizer for the vast majority of deep learning research and production systems.

The Update Rule: All 4 Steps

Adam: Complete Update Rule

Step 1: First moment (momentum-like)
m_t = β₁·m_{t-1} + (1−β₁)·∇L(θ_t)
β₁ ≈ 0.9. Decaying average of gradients (direction).

Step 2: Second moment (RMSProp-like)
v_t = β₂·v_{t-1} + (1−β₂)·(∇L(θ_t))²
β₂ ≈ 0.999. Decaying average of squared gradients (scale).

Step 3: Bias correction (the piece the other three optimizers lack)
m̂_t = m_t / (1 − β₁^t)
v̂_t = v_t / (1 − β₂^t)
Corrects initialization bias: both moments start at 0 and underestimate at t = 1, 2, 3, ... Since β₁^t and β₂^t → 0 as t → ∞, the correction vanishes naturally.

Step 4: Parameter update
θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)
The step is the momentum direction (m̂) divided by the gradient magnitude (√v̂), like a normalised gradient with adaptive scale.

Why Bias Correction Matters: A Concrete Example
t=1
β₁=0.9, first gradient g₁=0.5. Raw: m₁ = 0.9·0 + 0.1·0.5 = 0.05. But the true mean ≈ 0.5. Correction: m̂₁ = 0.05 / (1−0.9¹) = 0.05/0.1 = 0.5 ✓
t=10
β₁¹⁰ ≈ 0.35. Correction factor = 1/(1−0.35) ≈ 1.54. The correction is still meaningful but getting smaller.
t=100
β₁¹⁰⁰ ≈ 2.66×10⁻⁵ ≈ 0. Correction factor ≈ 1.00003, so effectively no correction is needed. Adam self-calibrates.
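The three cases above can be reproduced numerically. This sketch assumes a constant gradient of 0.5 and β₁ = 0.9, as in the example. The helper names are hypothetical.

```python
def correction_factor(beta, t):
    # The factor 1 / (1 - beta^t) applied to the raw moment at step t
    return 1.0 / (1.0 - beta ** t)

def corrected_mean(g, beta, t):
    m = 0.0
    for _ in range(t):
        m = beta * m + (1 - beta) * g     # raw first moment, biased toward 0
    return m, m * correction_factor(beta, t)

for t in (1, 10, 100):
    raw, fixed = corrected_mean(0.5, 0.9, t)
    print(f"t={t:3d}  factor={correction_factor(0.9, t):.5f}  raw m={raw:.4f}  corrected={fixed:.4f}")
```

For a constant gradient the corrected estimate equals the true mean at every step, which is exactly what the correction is designed to achieve.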

Python Implementation

import torch.optim as optim

# Adam: the default choice for most deep learning
optimizer_adam = optim.Adam(
    model.parameters(),
    lr=3e-4,              # a commonly cited starting point, popularized by Andrej Karpathy
    betas=(0.9, 0.999),   # β₁, β₂: almost never tuned
    eps=1e-8,             # ε: numerical stability
    weight_decay=0,       # L2 regularization (see AdamW below)
    amsgrad=False          # AMSGrad variant (more stable, rarely needed)
)

# AdamW: Adam with decoupled weight decay (strongly preferred)
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01      # decoupled decay: applied to the weights, not added to the gradient
)

# From scratch: full educational implementation
import numpy as np

class Adam:
    def __init__(self, params, lr=3e-4, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr     = lr
        self.b1, self.b2 = betas
        self.eps    = eps
        self.t      = 0
        self.m      = [np.zeros_like(p) for p in params]  # 1st moment
        self.v      = [np.zeros_like(p) for p in params]  # 2nd moment

    def step(self, grads):
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(self.params, grads)):
            # Step 1: first moment
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            # Step 2: second moment
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g**2
            # Step 3: bias correction
            m_hat = self.m[i] / (1 - self.b1**self.t)
            v_hat = self.v[i] / (1 - self.b2**self.t)
            # Step 4: update
            p_new = p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
            updated.append(p_new)
        self.params = updated   # store the new values so the next step starts from them
        return updated

# Demo: minimise L(θ₁,θ₂) = θ₁² + 10·θ₂² (ill-conditioned)
params = [np.array(5.0), np.array(5.0)]
adam   = Adam(params, lr=0.1)

for step in range(1, 31):
    grads  = [2*params[0], 20*params[1]]    # ∇L
    params = adam.step(grads)
    if step % 10 == 0:
        loss = params[0]**2 + 10*params[1]**2
        print(f"Step {step:2d}: θ₁={params[0]:.2f}, θ₂={params[1]:.2f}, L={loss:.1f}")
OUTPUT
Step 10: θ₁=4.01, θ₂=4.01, L=176.7
Step 20: θ₁=3.06, θ₂=3.06, L=103.0
Step 30: θ₁=2.20, θ₂=2.20, L=53.4
→ Both coordinates follow the same path even though the θ₂ gradient is 10× larger. Adam divides each coordinate's step by that coordinate's own gradient magnitude, so every step is about lr = 0.1 or smaller regardless of curvature.
AdamW vs Adam: Always Prefer AdamW

Standard Adam's weight_decay is implemented as L2 regularization added to the gradient, so the decay term is rescaled by the adaptive denominator along with everything else: weights with a large gradient history are regularized less than intended. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update, applying it directly to the weights instead. This gives the intended regularization and is the standard choice for modern transformers (GPT, BERT, etc.). Use optim.AdamW, not optim.Adam(weight_decay=...).
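The difference can be seen on a single weight. The sketch below is a simplified, from-scratch comparison and not PyTorch's implementation: it assumes a made-up loss gradient that alternates between +G and −G (pure noise around zero), and it applies the decoupled decay after the Adam step, whereas library implementations may order these differently. All names are hypothetical.

```python
import math

def run(G, decoupled, steps=200, lr=0.01, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Train one weight whose loss gradient alternates between +G and -G."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = G if t % 2 == 1 else -G
        if not decoupled:
            g = g + wd * w                  # L2 penalty folded into the gradient
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            w = w - lr * wd * w             # decay applied directly to the weight
    return w

l2 = {G: run(G, decoupled=False) for G in (0.1, 10.0)}
adamw = {G: run(G, decoupled=True) for G in (0.1, 10.0)}
print("Adam + L2 :", l2)
print("AdamW     :", adamw)
```

With L2 folded into the gradient, the weight with large gradients is shrunk far less than the one with small gradients. With decoupled decay, both weights end up at essentially the same value, because the shrinkage no longer depends on gradient scale.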


Section 07

Side-by-Side Comparison

| Property | SGD | Adagrad | RMSProp | Adam |
|---|---|---|---|---|
| Year | 1951 (stochastic approximation), 1998 (mini-batch) | 2011 | 2012 | 2014 |
| Adaptive LR | No | Yes | Yes | Yes |
| Momentum | Optional | No | Optional | Yes (built-in) |
| LR stability | Fixed | Monotone ↓ to 0 | Bounded (EWMA) | Bounded (EWMA) |
| Memory (extra state) | 0 extra (1 array with momentum) | G (1 array) | E[g²] (1 array) | m + v (2 arrays) |
| Bias correction | No | No | No | Yes |
| Sparse data | Poor | Excellent | Good | Excellent |
| RNN / RL | Works | Fails in long training | Excellent | Excellent |
| Generalization | Often best (tends toward flatter minima) | Good | Good | Sometimes worse than SGD |
| Default recommendation | CV tasks, large batch | Sparse NLP | RNN/RL | Everything else |

[Figure: Convergence Speed Comparison, All 4 Optimizers. Loss (0 to 2.0) against training steps: SGD (slow, noisy), Adagrad (stalls), RMSProp (fast), Adam (fastest).]

Section 08

Complete Python Implementation: All Four Optimizers

Full Training Loop Comparison on MNIST

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ── Data ──────────────────────────────────────────────
transform = transforms.ToTensor()
train_data = datasets.MNIST('.', train=True,  download=True, transform=transform)
test_data  = datasets.MNIST('.', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=256)

# ── Model ─────────────────────────────────────────────
def build_model():
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10)
    )

# ── Optimizers to compare ─────────────────────────────
configs = {
    'SGD':      lambda m: optim.SGD(m.parameters(), lr=0.01, momentum=0.9),
    'Adagrad':  lambda m: optim.Adagrad(m.parameters(), lr=0.01),
    'RMSProp':  lambda m: optim.RMSprop(m.parameters(), lr=0.001),
    'Adam':     lambda m: optim.Adam(m.parameters(),    lr=3e-4),
}

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
criterion = nn.CrossEntropyLoss()

def train_and_evaluate(name, optimizer_fn, epochs=5):
    model     = build_model().to(device)
    optimizer = optimizer_fn(model)
    history   = []

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # Evaluate
        model.eval()
        correct = 0
        with torch.no_grad():
            for X, y in test_loader:
                X, y = X.to(device), y.to(device)
                preds   = model(X).argmax(dim=1)
                correct += (preds == y).sum().item()

        acc = correct / len(test_data)
        history.append(acc)
        print(f"[{name}] Epoch {epoch+1}/{epochs} | Loss={total_loss/len(train_loader):.4f} | Test Acc={acc:.4f}")

    return history

# Run all four
results = {}
for name, opt_fn in configs.items():
    print(f"\n{'='*50}")
    results[name] = train_and_evaluate(name, opt_fn)
OUTPUT (final epoch of 5 on MNIST; exact numbers vary with random seed and hardware)
[SGD] Epoch 5/5 | Loss=0.1043 | Test Acc=0.9745
[Adagrad] Epoch 5/5 | Loss=0.0891 | Test Acc=0.9712
[RMSProp] Epoch 5/5 | Loss=0.0763 | Test Acc=0.9821
[Adam] Epoch 5/5 | Loss=0.0614 | Test Acc=0.9864 ← fastest convergence

Learning Rate Schedules with Optimizers

# Best practice: combine Adam with a learning rate schedule
model     = build_model().to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Option 1: Cosine Annealing, smoothly decays LR along a cosine curve
# (shown for reference; only scheduler_one is stepped in the loop below)
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6
)

# Option 2: OneCycleLR, fast warmup then decay (Leslie Smith's policy)
scheduler_one = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=len(train_loader),
    epochs=20
)

# Training loop with scheduler
for epoch in range(20):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        scheduler_one.step()   # step INSIDE the batch loop for OneCycleLR

    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1:2d} | LR = {current_lr:.6f}")
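For intuition about what cosine annealing computes, here is a sketch of the closed-form schedule usually quoted for it, written in plain Python. The function name `cosine_lr` is hypothetical; the default values mirror the `lr=1e-3`, `eta_min=1e-6`, and `T_max=50` used above.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    # Half a cosine wave from lr_max (step 0) down to lr_min (step == total_steps)
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

for step in (0, 10, 25, 40, 50):
    print(f"step {step:2d}: lr = {cosine_lr(step, total_steps=50):.6f}")
```

The rate starts at its maximum, falls slowly at first, fastest in the middle, and flattens out again near the minimum.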

Section 09

Hyperparameter Reference

| Optimizer | Parameter | Default | Effect | Tuning Advice |
|---|---|---|---|---|
| SGD | lr (η) | none (set explicitly) | Step size | Critical. Grid: 0.1, 0.01, 0.001 |
| SGD | momentum (β) | 0 | Velocity decay | Set 0.9 always. Try 0.99 for LSTMs. |
| Adagrad | lr (η) | 0.01 | Global scale | Can be larger than SGD. Try 0.01 to 0.1. |
| Adagrad | eps | 1e-10 | Numerical stability | Rarely tuned. Keep at 1e-8 to 1e-10. |
| RMSProp | lr (η) | 0.01 in PyTorch (0.001 in Keras) | Global scale | 0.001 is a common starting point. Try 1e-3 to 1e-4. |
| RMSProp | alpha (β) | 0.99 | EWMA decay | 0.9 for fast-changing, 0.99 for stable. |
| RMSProp | eps | 1e-8 | Stability | Try 1e-6 if training is unstable. |
| Adam | lr (η) | 1e-3 | Global scale | 3e-4 is often best. Rarely go above 1e-3. |
| Adam | β₁ | 0.9 | 1st moment decay | Almost never change. Try 0.85 to 0.95. |
| Adam | β₂ | 0.999 | 2nd moment decay | Try 0.99 for sparse, 0.9999 for large batch. |
| Adam | eps | 1e-8 | Stability | Increase to 1e-6 or 0.1 for transformers. |
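The "Grid: 0.1, 0.01, 0.001" advice amounts to a small sweep. The sketch below runs one on a hypothetical one-parameter quadratic loss (curvature 4, chosen for illustration), with 1.0 added to the grid to show what an overly large rate does. The names are hypothetical.

```python
def final_loss(lr, steps=50, curvature=4.0, theta=1.0):
    for _ in range(steps):
        theta = theta - lr * curvature * theta    # gradient of 0.5 * curvature * theta**2
    return 0.5 * curvature * theta * theta

grid = [1.0, 0.1, 0.01, 0.001]
results = {lr: final_loss(lr) for lr in grid}
best_lr = min(results, key=results.get)

for lr, value in results.items():
    print(f"lr={lr:<6} final loss={value:.3e}")
print("best:", best_lr)
```

The largest rate diverges, the smallest barely moves, and the best value sits in between. On a real model the same sweep would use validation loss after a short training run.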

Section 10

When to Use Which Optimizer

Computer Vision (CNNs)
SGD with momentum and a cosine LR schedule often gives better generalization than Adam. ResNets are typically trained with SGD. Training time is longer but test accuracy is often higher.
→ Use: SGD(lr=0.1, momentum=0.9) + CosineAnnealingLR
NLP / Transformers
Adam/AdamW with a warmup schedule is standard. BERT and GPT-style models are typically trained with AdamW. A larger eps (1e-6 up to 0.1) is sometimes needed for transformer stability due to large gradient magnitudes.
→ Use: AdamW(lr=3e-4, weight_decay=0.01) + Linear warmup
Recurrent Networks (RNNs/LSTMs)
RMSProp is the classic choice: its adaptive LR helps with vanishing/exploding gradients. Adam also works. Always use gradient clipping with RNNs regardless of optimizer.
→ Use: RMSprop(lr=0.001) + clip_grad_norm_(max_norm=1.0)
Reinforcement Learning
RMSProp (DQN) or Adam (PPO, A3C). Non-stationary reward signals need adaptive rates. RMSProp's fading memory adapts well to changing reward distributions.
→ Use: RMSprop(lr=0.00025, alpha=0.95) or Adam(lr=3e-4)
Sparse NLP / Embeddings
Adagrad or Adam. Rare words/features need larger updates; Adagrad's design suits this well. For long training, prefer Adam, which won't drive the learning rate to zero.
→ Use: Adagrad(lr=0.05) or Adam for long runs
Quick Prototyping
Adam with default parameters. Works reasonably well on almost any architecture without tuning. Convergence is fast enough to iterate quickly on architecture and data decisions.
→ Use: Adam(lr=3e-4), the universal starting point

Section 11

Golden Rules

Optimizer: Non-Negotiable Rules
1. Start with Adam(lr=3e-4) as your first experiment. It works on nearly any architecture without tuning. Only switch to SGD if you care about squeezing the last 1 to 2% of generalization for a vision model.
2. Use AdamW, not Adam, whenever you use weight decay. The difference is subtle in code but meaningful in practice: optim.AdamW(weight_decay=0.01) applies decay correctly; optim.Adam(weight_decay=0.01) does not.
3. The learning rate is your most important hyperparameter, more important than the choice of optimizer. A good LR with SGD beats a bad LR with Adam. Always do a learning rate range test or grid search.
4. Always call optimizer.zero_grad() before loss.backward(). PyTorch accumulates gradients by default. Forgetting this silently corrupts training: gradients from multiple batches stack and your model diverges mysteriously.
5. Add a learning rate schedule. A fixed learning rate is almost never optimal. Use CosineAnnealingLR for most cases, OneCycleLR for fast training, or a simple StepLR as a baseline.
6. Use gradient clipping with RNNs and transformers. Call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between backward() and optimizer.step(). This prevents exploding gradients from destabilising training.
7. Adagrad is not obsolete. It is still the best choice for sparse features in shallow models (logistic regression on text, collaborative filtering). Its dying learning rate is a problem only in deep, long-training scenarios.
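Rule 6 relies on gradient norm clipping. The sketch below shows the idea in plain Python on a flat list of gradient values. It loosely mirrors what a call such as clip_grad_norm_ does, but the function name `clip_by_global_norm` and the small `eps` guard are assumptions of this example, not the library's code.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    # Norm of all gradients taken together, as if they were one long vector
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])
untouched, _ = clip_by_global_norm([0.3, 0.4])

print("norm before:", norm_before)
print("clipped    :", clipped)
print("untouched  :", untouched)
```

Clipping rescales the whole gradient vector, so its direction is preserved and only its length is capped.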