The Story That Explains Optimizers
One strategy: take small, careful steps downhill every time. This is Gradient Descent, the grandfather of all optimizers. But what if the slope is very gentle? You shuffle forward for hours. What if it's steep? You leap too far and overshoot the valley.
Different optimizers are different strategies for this blindfolded hiker. SGD is the original cautious stepper. Adagrad remembers every rocky patch and slows down there. RMSProp forgets the old history and adapts to recent terrain. Adam is the hiker with GPS, momentum, and memory, and the most popular for good reason.
In machine learning, an optimizer is the algorithm that adjusts a model's parameters (weights and biases) to minimize the loss function, the measure of how wrong the model's predictions are. The process is called training, and every optimizer approaches it differently.
A gradient is the direction of steepest ascent in the loss landscape.
Optimizers move in the opposite direction (downhill) by a step size called the
learning rate (η). The central update rule for all gradient-based optimizers is:
θ ← θ − η · ∇L(θ)
where θ are the parameters, η is the learning rate, and ∇L(θ) is the gradient of the loss.
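As a minimal sketch of the update rule above, the snippet below runs plain gradient descent on the toy loss L(θ) = θ². The function name `gradient_descent`, the starting point, and the two learning rates are illustrative assumptions, not values from the text. It also shows the overshoot problem from the hiker story: with a step size that is too large, every step lands farther from the minimum than the last.

```python
def gradient_descent(theta, lr, steps):
    """Repeatedly apply theta <- theta - lr * dL/dtheta for L(theta) = theta**2."""
    for _ in range(steps):
        grad = 2.0 * theta          # gradient of theta**2
        theta = theta - lr * grad   # step against the gradient
    return theta

# Hypothetical settings chosen only for illustration.
careful = gradient_descent(theta=5.0, lr=0.1, steps=50)    # each step scales theta by 0.8
reckless = gradient_descent(theta=5.0, lr=1.1, steps=50)   # each step scales theta by -1.2

print(f"lr=0.1 -> theta = {careful:.6f}")
print(f"lr=1.1 -> theta = {reckless:.3e}")
```

The small learning rate shrinks θ toward the minimum at 0, while the large one flips sign and grows on every step.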
The Loss Landscape: A Visual Introduction
Before diving into individual optimizers, it helps to understand what they are navigating. The loss landscape is a high-dimensional surface where each point represents a configuration of model weights and its corresponding loss value.
Blue path = SGD (erratic, can get stuck). Purple path = Adam (smooth, efficient). Green star = global minimum (what we want).
| Challenge | Description | Problematic For |
|---|---|---|
| Local Minima | A valley that isn't the deepest; the optimizer gets stuck | SGD without momentum |
| Saddle Points | Flat regions that aren't minima; gradient ≈ 0 in all directions | First-order methods |
| Plateaus | Large flat regions with tiny gradients; training stalls | Fixed learning rate methods |
| Ravines | Narrow valleys with steep walls; the optimizer oscillates | SGD without momentum |
| Ill-conditioning | Very different curvatures in different directions | SGD, needs adaptive rates |
SGD: Stochastic Gradient Descent
The "stochastic" part means they estimate the slope by looking at just one rock (one data point) or a handful of rocks (a mini-batch) instead of the entire mountain map. Fast, but noisy.
The Update Rule
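A standard way to write the mini-batch update, consistent with the general rule θ ← θ − η · ∇L(θ), is sketched below. The symbols B (the sampled mini-batch) and ℓ_i (the loss on example i) are notation assumed for this sketch.

```latex
\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla \ell_i(\theta_t)
```

With |B| = 1 this is the one-data-point case; with B equal to the whole dataset it reduces to full-batch gradient descent.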
SGD with Momentum: The Update Rule
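One common way to write momentum keeps a velocity v that accumulates past gradients: v ← β·v + ∇L(θ), then θ ← θ − η·v. The sketch below uses that convention (other texts fold η into the velocity instead) and compares it with plain gradient descent on an ill-conditioned quadratic. The function name `descend`, the loss, and all constants are assumptions chosen for illustration, and full-batch gradients are used so the comparison is deterministic.

```python
def descend(lr, beta, steps=100):
    """Minimise L = t1**2 + 10*t2**2, optionally with momentum.

    beta = 0.0 gives plain gradient descent; beta > 0 adds a velocity term.
    Returns the final loss.
    """
    theta = [5.0, 5.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        grads = [2.0 * theta[0], 20.0 * theta[1]]
        for i, g in enumerate(grads):
            velocity[i] = beta * velocity[i] + g      # accumulate velocity
            theta[i] = theta[i] - lr * velocity[i]    # step along the velocity
    return theta[0] ** 2 + 10.0 * theta[1] ** 2

loss_plain = descend(lr=0.01, beta=0.0)
loss_momentum = descend(lr=0.01, beta=0.9)
print(f"plain descent : loss = {loss_plain:.6f}")
print(f"with momentum : loss = {loss_momentum:.6f}")
```

With the same learning rate, the momentum run makes much faster progress along the shallow direction, which is the ravine behaviour listed in the challenges table.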
Python Implementation
import torch
import torch.nn as nn
import torch.optim as optim
# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
# SGD: plain vanilla
optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.01  # learning rate
)
# SGD with momentum: almost always better
optimizer_sgd_m = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,      # β: accumulates velocity
    weight_decay=1e-4  # L2 regularization
)
# SGD with Nesterov momentum (lookahead variant, often better)
optimizer_nag = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True  # Nesterov Accelerated Gradient
)
# Training loop
criterion = nn.CrossEntropyLoss()
def train_step(X_batch, y_batch, optimizer):
    optimizer.zero_grad()              # clear old gradients
    output = model(X_batch)
    loss = criterion(output, y_batch)
    loss.backward()                    # compute gradients
    optimizer.step()                   # update weights
    return loss.item()
SGD applies the same learning rate η to every parameter. But in a deep neural network, some weights see frequent, informative gradients (common features) while others see sparse, rare ones. A single η can't be simultaneously optimal for both. This is the problem the adaptive optimizers below set out to solve.
Adagrad: Adaptive Gradient Algorithm
This is adaptive learning: frequent features get smaller rates; rare features get larger ones. It was a revelation for sparse data such as NLP, recommendation systems, and search. The catch: the diary fills up forever. Eventually, every step is so tiny the hiker stops moving.
The Update Rule
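Written out, a common form of the Adagrad update, using the same names as the NumPy function in this section (G for the accumulated squared gradients, ε for the stability constant), looks like the sketch below. The time index t and the shorthand g_t are notation assumed here.

```latex
\begin{aligned}
g_t &= \nabla L(\theta_t) \\
G_t &= G_{t-1} + g_t^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \, g_t
\end{aligned}
```

All operations are element-wise, so each parameter gets its own effective learning rate η / (√G_t + ε). Because G_t only ever grows, that rate can only shrink.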
Python Implementation
import torch.optim as optim
# Adagrad: great for sparse problems (NLP, recommendation)
optimizer_adagrad = optim.Adagrad(
    model.parameters(),
    lr=0.01,         # global learning rate
    lr_decay=0,      # optional extra decay
    eps=1e-8,        # numerical stability
    weight_decay=0   # L2 regularization
)
# From scratch: educational implementation
import numpy as np
def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """
    theta : current parameters (array)
    grad  : gradient of loss w.r.t. theta
    G     : accumulated sum of squared gradients
    """
    G = G + grad**2                          # accumulate squared grads
    effective_lr = lr / (np.sqrt(G) + eps)   # per-param learning rate
    theta = theta - effective_lr * grad      # update parameters
    return theta, G
# Example with a sparse gradient scenario
theta = np.array([1.0, 1.0, 1.0])
G = np.zeros_like(theta)   # starts at zero
# Feature 0 seen at every step, features 1 and 2 only once each
grads = [
    np.array([0.5, 0.0, 0.0]),
    np.array([0.5, 0.0, 0.5]),
    np.array([0.5, 0.3, 0.0]),
    np.array([0.5, 0.0, 0.0]),
]
for i, grad in enumerate(grads):
    theta, G = adagrad_update(theta, grad, G)
    print(f"Step {i+1}: θ={theta.round(4)}, eff_lr={(0.01/(np.sqrt(G)+1e-8)).round(4)}")
RMSProp: Root Mean Square Propagation
The technical term is an exponentially weighted moving average (EWMA) of squared gradients. The accumulated history decays with coefficient β (typically 0.9 or 0.99), preventing the learning rate from collapsing to zero.
Geoffrey Hinton introduced RMSProp in 2012 in a Coursera lecture. It was never formally published, yet it became one of the most widely used optimizers for recurrent neural networks and RL.
The Update Rule
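In symbols, a common form of the RMSProp update, matching the EWMA description above and the NumPy function in this section, is sketched below. The time index t and the shorthand g_t = ∇L(θ_t) are notation assumed here.

```latex
\begin{aligned}
E[g^2]_t &= \beta \, E[g^2]_{t-1} + (1 - \beta) \, g_t^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \, g_t
\end{aligned}
```

The only change from Adagrad is that the running sum is replaced by a decaying average, so old squared gradients fade instead of piling up.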
Adagrad, constant gradient g = 0.5, η = 0.01:

| Step | G (sum of g²) | Eff. LR (η/√G) |
|---|---|---|
| 1 | 0.25 | 0.0200 |
| 10 | 2.50 | 0.0063 |
| 100 | 25.0 | 0.0020 |
| 1000 | 250 | 0.0006 |
| 10000 | 2500 | 0.0002 (dying) |

RMSProp, same gradient and η, β = 0.9:

| Step | E[g²] (EWMA) | Eff. LR (η/√E[g²]) |
|---|---|---|
| 1 | 0.025 | 0.0632 |
| 10 | ~0.16 | 0.0248 |
| 100 | ~0.25 | 0.0200 |
| 1000 | ~0.25 | 0.0200 |
| 10000 | ~0.25 | 0.0200 (stable) |
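The two tables above can be recomputed with a few lines of Python. This is a sketch that assumes a constant gradient g = 0.5, a learning rate η = 0.01 and, for RMSProp, β = 0.9 with the average starting at zero (values inferred from the first row of each table, not stated in the text); `effective_lrs` is a hypothetical helper name.

```python
import math

def effective_lrs(steps, g=0.5, lr=0.01, beta=0.9):
    """Return (adagrad_lr, rmsprop_lr) after `steps` updates with a constant gradient."""
    G = 0.0      # Adagrad: running sum of squared gradients
    eg2 = 0.0    # RMSProp: exponentially weighted average of squared gradients
    for _ in range(steps):
        G += g ** 2
        eg2 = beta * eg2 + (1 - beta) * g ** 2
    return lr / math.sqrt(G), lr / math.sqrt(eg2)

for steps in (1, 10, 100, 1000, 10000):
    ada, rms = effective_lrs(steps)
    print(f"step {steps:5d}: Adagrad eff. LR = {ada:.4f} | RMSProp eff. LR = {rms:.4f}")
```

The Adagrad column keeps falling as the sum grows, while the RMSProp column levels off once the moving average has caught up with g².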
Python Implementation
import torch.optim as optim
# RMSProp: excellent for RNNs and RL
optimizer_rms = optim.RMSprop(
    model.parameters(),
    lr=0.001,        # learning rate
    alpha=0.99,      # β: decay for moving average (0.9 or 0.99)
    eps=1e-8,        # numerical stability
    weight_decay=0,  # L2 regularization
    momentum=0,      # optional momentum term
    centered=False   # if True: normalize by the variance estimate E[g²] - E[g]² (centered variant)
)
# From scratch: numpy educational version
import numpy as np
def rmsprop_update(theta, grad, eg2, lr=0.001, beta=0.9, eps=1e-8):
    """
    theta : current parameters
    grad  : gradient of loss
    eg2   : running EWMA of squared gradients
    beta  : decay factor (0.9 = 10% new info per step)
    """
    eg2 = beta * eg2 + (1 - beta) * grad**2            # EWMA update
    theta = theta - lr / (np.sqrt(eg2) + eps) * grad   # adaptive update
    return theta, eg2
# Apply to a simple 1D loss: L(θ) = θ² (minimum at θ=0)
theta = 5.0
eg2 = 0.0
for step in range(1, 21):
    grad = 2 * theta   # d(θ²)/dθ = 2θ
    theta, eg2 = rmsprop_update(theta, grad, eg2)
    if step % 5 == 0:
        print(f"Step {step:2d}: θ = {theta:.5f}, E[g²] = {eg2:.5f}")
RNNs suffer from exploding and vanishing gradients across time steps. RMSProp's adaptive scaling helps here: large gradients get divided by their own running magnitude, which damps explosions. In RL, reward signals are highly non-stationary; the fading memory adapts quickly to new reward scales. DeepMind used RMSProp in the original DQN Atari paper (Mnih et al., 2015).
Adam: Adaptive Moment Estimation
The GPS is the first moment: a running average of the gradient itself (direction). The diary is the second moment: a running EWMA of squared gradients (scale). The compass is the bias correction: because both moments start at zero, they underestimate the true values early in training. Adam corrects for this automatically.
Kingma & Ba published Adam in 2014. It is now the default optimizer for the vast majority of deep learning research and production systems.
The Update Rule: All 4 Steps
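The four steps, in the order used by the from-scratch class in this section, can be sketched as follows. The shorthand g_t = ∇L(θ_t), the initial values m_0 = v_0 = 0, and t counting from 1 are conventions assumed here.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) \, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

At t = 1 the correction divides m_1 by (1 − β₁) and v_1 by (1 − β₂), which exactly undoes the shrinkage caused by starting both averages at zero.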
Python Implementation
import torch.optim as optim
# Adam: the default choice for most deep learning
optimizer_adam = optim.Adam(
    model.parameters(),
    lr=3e-4,             # Andrej Karpathy's well-known suggested default for Adam
    betas=(0.9, 0.999),  # β₁, β₂: almost never tuned
    eps=1e-8,            # ε: numerical stability
    weight_decay=0,      # L2 regularization (see AdamW below)
    amsgrad=False        # AMSGrad variant (more stable, rarely needed)
)
# AdamW: Adam with CORRECT weight decay (strongly preferred)
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01    # decoupled weight decay (not L2 added to the gradient)
)
# From scratch: full educational implementation
import numpy as np
class Adam:
    def __init__(self, params, lr=3e-4, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr = lr
        self.b1, self.b2 = betas
        self.eps = eps
        self.t = 0
        self.m = [np.zeros_like(p) for p in params]  # 1st moment
        self.v = [np.zeros_like(p) for p in params]  # 2nd moment
    def step(self, grads):
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(self.params, grads)):
            # Step 1: first moment
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            # Step 2: second moment
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g**2
            # Step 3: bias correction
            m_hat = self.m[i] / (1 - self.b1**self.t)
            v_hat = self.v[i] / (1 - self.b2**self.t)
            # Step 4: update
            p_new = p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
            updated.append(p_new)
        self.params = updated  # store the new values so the next step starts from them
        return updated
# Demo: minimise L(θ₁,θ₂) = θ₁² + 10·θ₂² (ill-conditioned)
params = [np.array(5.0), np.array(5.0)]
adam = Adam(params, lr=0.1)
for step in range(1, 31):
    grads = [2*params[0], 20*params[1]]  # ∇L
    params = adam.step(grads)
    if step % 10 == 0:
        loss = params[0]**2 + 10*params[1]**2
        print(f"Step {step:2d}: θ₁={params[0]:.4f}, θ₂={params[1]:.4f}, L={loss:.4f}")
Standard Adam's weight_decay is implemented as L2 regularization added to the gradient,
which interacts poorly with the adaptive scaling. AdamW (Loshchilov & Hutter, 2019)
decouples weight decay from the gradient update, applying it directly to the weights instead.
This gives correct regularization and is the default in modern transformers (GPT, BERT, etc.).
Use optim.AdamW, not optim.Adam(weight_decay=...).
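To make the difference concrete, here is a minimal scalar sketch of one Adam step with the two styles of weight decay. The function `adam_step` and its `decoupled` flag are hypothetical names for illustration, not PyTorch's API, and the zero loss gradient is chosen so that only the decay term acts.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update on a scalar weight.

    decoupled=False: weight decay is added to the gradient (L2 style).
    decoupled=True : weight decay shrinks the weight directly (AdamW style).
    """
    if weight_decay and not decoupled:
        grad = grad + weight_decay * theta          # decay enters the moment estimates
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    if weight_decay and decoupled:
        theta = theta - lr * weight_decay * theta   # decay bypasses the adaptive scaling
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# One step from theta = 1.0 with a zero loss gradient, so only the decay acts.
theta_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.01, decoupled=False)
theta_w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.01, decoupled=True)
print(f"L2-style shrink : {1.0 - theta_l2:.2e}")
print(f"AdamW shrink    : {1.0 - theta_w:.2e}")
```

In this zero-gradient example the L2 form is normalised by √v̂ and comes out close to a full step of size lr, whatever the decay coefficient, whereas the decoupled form shrinks the weight by lr · weight_decay · θ.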
Side-by-Side Comparison
| Property | SGD | Adagrad | RMSProp | Adam |
|---|---|---|---|---|
| Year | 1951 (stochastic approximation), 1998 (mini-batch) | 2011 | 2012 | 2014 |
| Adaptive LR | No | Yes | Yes | Yes |
| Momentum | Optional | No | Optional | Yes (built-in) |
| LR stability | Fixed | Monotone ↓ to 0 | Bounded (EWMA) | Bounded (EWMA) |
| Memory (extra params) | 0 extra | G (1 array) | E[g²] (1 array) | m + v (2 arrays) |
| Bias correction | No | No | No | Yes |
| Sparse data | Poor | Excellent | Good | Excellent |
| RNN / RL | Works | Fails long training | Excellent | Excellent |
| Generalization | Best (tends toward flat minima) | Good | Good | Sometimes worse than SGD |
| Default recommendation | CV tasks, large batch | Sparse NLP | RNN/RL | Everything else ✓ |
Complete Python Implementation: All Four Optimizers
Full Training Loop Comparison on MNIST
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# -- Data --------------------------------------------------
transform = transforms.ToTensor()
train_data = datasets.MNIST('.', train=True, download=True, transform=transform)
test_data = datasets.MNIST('.', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
test_loader = DataLoader(test_data, batch_size=256)
# -- Model -------------------------------------------------
def build_model():
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10)
    )
# -- Optimizers to compare ---------------------------------
configs = {
    'SGD':     lambda m: optim.SGD(m.parameters(), lr=0.01, momentum=0.9),
    'Adagrad': lambda m: optim.Adagrad(m.parameters(), lr=0.01),
    'RMSProp': lambda m: optim.RMSprop(m.parameters(), lr=0.001),
    'Adam':    lambda m: optim.Adam(m.parameters(), lr=3e-4),
}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
criterion = nn.CrossEntropyLoss()
def train_and_evaluate(name, optimizer_fn, epochs=5):
    model = build_model().to(device)
    optimizer = optimizer_fn(model)
    history = []
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # Evaluate on the test set after each epoch
        model.eval()
        correct = 0
        with torch.no_grad():
            for X, y in test_loader:
                X, y = X.to(device), y.to(device)
                preds = model(X).argmax(dim=1)
                correct += (preds == y).sum().item()
        acc = correct / len(test_data)
        history.append(acc)
        print(f"[{name}] Epoch {epoch+1}/{epochs} | Loss={total_loss/len(train_loader):.4f} | Test Acc={acc:.4f}")
    return history
# Run all four
results = {}
for name, opt_fn in configs.items():
    print(f"\n{'='*50}")
    results[name] = train_and_evaluate(name, opt_fn)
Learning Rate Schedules with Optimizers
# Best practice: combine Adam with a learning rate schedule
model = build_model().to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# Option 1: Cosine Annealing, which smoothly decays the LR along a cosine curve
scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-6
)
# Option 2: OneCycleLR, fast warmup then decay (Leslie Smith's policy)
scheduler_one = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=len(train_loader),
    epochs=20
)
# Training loop with scheduler
# (both options are shown for comparison; in practice create only the one you step)
for epoch in range(20):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        scheduler_one.step()  # step INSIDE the batch loop for OneCycleLR
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1:2d} | LR = {current_lr:.6f}")
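For intuition about what the cosine schedule does to the learning rate, the curve can be computed by hand. The sketch below uses the textbook closed form without restarts, η_t = η_min + ½(η_max − η_min)(1 + cos(π·t / T_max)); `cosine_annealing_lr` is a hypothetical helper, and PyTorch's own scheduler may differ in numerical detail.

```python
import math

def cosine_annealing_lr(step, t_max, eta_max, eta_min=0.0):
    """Closed-form cosine schedule: eta_max at step 0, eta_min at step t_max."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * cos_term

# Same settings as the CosineAnnealingLR call above: lr=1e-3, T_max=50, eta_min=1e-6.
schedule = [cosine_annealing_lr(s, t_max=50, eta_max=1e-3, eta_min=1e-6)
            for s in range(51)]
for s in (0, 25, 50):
    print(f"step {s:2d}: lr = {schedule[s]:.6f}")
```

The rate starts at the optimizer's initial value, passes the midpoint halfway through, and reaches eta_min at T_max, changing slowly at both ends and fastest in the middle.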
Hyperparameter Reference
| Optimizer | Parameter | Default | Effect | Tuning Advice |
|---|---|---|---|---|
| SGD | lr (η) | - | Step size | Critical. Grid: 0.1, 0.01, 0.001 |
| SGD | momentum (β) | 0 | Velocity decay | Set 0.9 always. Try 0.99 for LSTMs. |
| Adagrad | lr (η) | 0.01 | Global scale | Can be larger than SGD. Try 0.01 to 0.1. |
| Adagrad | eps | 1e-10 | Numerical stability | Rarely tuned. Keep at 1e-8 to 1e-10. |
| RMSProp | lr (η) | 0.001 | Global scale | Default usually good. Try 1e-3 to 1e-4. |
| RMSProp | alpha (β) | 0.99 | EWMA decay | 0.9 for fast-changing, 0.99 for stable. |
| RMSProp | eps | 1e-8 | Stability | Try 1e-6 if training is unstable. |
| Adam | lr (η) | 1e-3 | Global scale | 3e-4 is often best. Rarely go above 1e-3. |
| Adam | β₁ | 0.9 | 1st moment decay | Almost never change. Try 0.85 to 0.95. |
| Adam | β₂ | 0.999 | 2nd moment decay | Try 0.99 for sparse, 0.9999 for large batch. |
| Adam | eps | 1e-8 | Stability | Increase to 1e-6 or 0.1 for transformers. |
When to Use Which Optimizer
Golden Rules
Use AdamW when you want weight decay: optim.AdamW(weight_decay=0.01)
applies decay correctly; optim.Adam(weight_decay=0.01) does not.
Call optimizer.zero_grad() before loss.backward().
PyTorch accumulates gradients by default. Forgetting this silently corrupts training:
gradients from multiple batches stack and your model diverges mysteriously.
Use a learning rate schedule: CosineAnnealingLR for most cases, OneCycleLR
for fast training, or a simple StepLR as a baseline.
Clip gradients: call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between
backward() and optimizer.step(). This prevents exploding gradients
from destabilising training.
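The idea behind norm-based clipping fits in a few lines. The sketch below is a simplified pure-Python illustration, not PyTorch's implementation: `clip_by_global_norm` is a hypothetical helper, and gradients are represented as plain lists rather than tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Hypothetical gradients for two parameter tensors, flattened to lists.
grads = [[3.0, 0.0], [0.0, 4.0]]                      # global norm = 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(f"norm before = {norm_before:.3f}, norm after = {norm_after:.3f}")
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped; gradients already below the threshold pass through unchanged.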