Optimizers in Deep Learning: SGD vs Adam

Section 01

The Story That Explains Optimizers

📖 Real World Analogy

The Blindfolded Hiker Looking for the Valley

Imagine you are blindfolded on a mountain range. Your goal is simple: reach the lowest valley. You cannot see the terrain — you can only feel the slope beneath your feet right now.

One strategy: take small, careful steps downhill every time. This is Gradient Descent — the grandfather of all optimizers. But what if the slope is very gentle? You shuffle forward for hours. What if it's steep? You leap too far and overshoot the valley.

Different optimizers are different strategies for this blindfolded hiker. SGD is the original cautious stepper. Adagrad remembers every rocky patch and slows down there. RMSProp forgets the old history and adapts to recent terrain. Adam is the hiker with GPS, momentum, and memory — the most popular for good reason.

In machine learning, an optimizer is the algorithm that adjusts a model's parameters (weights and biases) to minimize the loss function — the measure of how wrong the model's predictions are. The process is called training, and every optimizer approaches it differently.

🌿

Core Concept: What Is a Gradient?

A gradient is the direction of steepest ascent in the loss landscape. Optimizers move in the opposite direction — downhill — by a step size called the learning rate (η). The central update rule for all gradient-based optimizers is:

θ ← θ − η · ∇L(θ)

where θ are the parameters, η is the learning rate, and ∇L(θ) is the gradient of the loss.
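To make the rule concrete, here is a minimal sketch of plain gradient descent on the toy loss L(θ) = θ², whose gradient is 2θ. The loss, the starting point, and the learning rate are illustrative assumptions, and `gradient_descent` is a hypothetical helper rather than a library function.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = theta**2."""
    return 2.0 * theta

def gradient_descent(theta, lr, steps):
    """Apply theta <- theta - lr * grad(theta) for a fixed number of steps."""
    history = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        history.append(theta)
    return history

path = gradient_descent(theta=5.0, lr=0.1, steps=3)
# Each step multiplies theta by (1 - 2 * lr) = 0.8 for this particular loss
print([round(t, 4) for t in path])
```

With a larger learning rate the factor (1 − 2η) becomes negative and θ jumps across the minimum, which is the overshooting described in the hiker analogy.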

Section 02

The Loss Landscape — A Visual Introduction

Before diving into individual optimizers, it helps to understand what they are navigating. The loss landscape is a high-dimensional surface where each point represents a configuration of model weights and its corresponding loss value.

🏔️ Loss Landscape — What Optimizers Navigate

Blue path = SGD (erratic, can get stuck). Purple path = Adam (smooth, efficient). Green star = global minimum (what we want).

Challenge	Description	Problematic For
Local Minima	A valley that isn't the deepest — optimizer gets stuck	SGD without momentum
Saddle Points	Points where the gradient is ≈ 0 but the surface curves up in some directions and down in others	First-order methods
Plateaus	Large flat regions with tiny gradients — training stalls	Fixed learning rate methods
Ravines	Narrow valleys with steep walls — optimizer oscillates	SGD without momentum
Ill-conditioning	Very different curvatures in different directions	SGD, needs adaptive rates

Section 03

SGD — Stochastic Gradient Descent

📖 Story

The Stubborn Hiker with One Rule

Classic SGD is the stubborn hiker who always applies the same learning rate, using only the terrain directly underfoot at this very moment. They ignore the history of where they've been. If the slope is gentle, their step is too small and progress crawls. If it's steep, the step is suddenly too big and they overshoot. They bounce off ravine walls. But they get there, just not gracefully.

The "stochastic" part means they estimate the slope by looking at just one rock (one data point) or a handful of rocks (a mini-batch) instead of the entire mountain map. Fast, but noisy.

The Update Rule

📐 SGD Update Rule — Animated

⚙️ SGD Variants

Vanilla GD

Uses the full dataset to compute the gradient. Exact but very slow on large datasets — each update requires seeing all N samples.

SGD

Uses a single random sample per update. Very fast but extremely noisy — gradient estimate is highly variable, causing erratic paths.

Mini-batch

Uses a batch of B samples (typically 32–512). The sweet spot: estimates gradient more accurately than pure SGD while remaining fast. This is what people usually mean by "SGD" in practice.

SGD + Momentum

Adds a velocity term v = βv + η∇L. Accumulates gradient direction like a ball rolling downhill — smooths oscillations and accelerates convergence.
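The trade-off between these variants can be seen by comparing gradient estimates on a toy problem. The sketch below fits y = 3x with a single weight w and measures how noisy the gradient estimate is for a single sample versus a mini-batch. The dataset, the seed, and the batch size of 64 are illustrative assumptions.

```python
import random

random.seed(0)

# Toy data for y = 3x plus a little noise (illustrative only)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def grad_mse(w, indices):
    """Average gradient of 0.5 * (w*x - y)**2 with respect to w over the given samples."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

def spread(batch_size, trials=200, w=0.0):
    """Standard deviation of the gradient estimate across random batches."""
    estimates = [grad_mse(w, random.sample(range(len(xs)), batch_size))
                 for _ in range(trials)]
    mean = sum(estimates) / trials
    return (sum((e - mean) ** 2 for e in estimates) / trials) ** 0.5

full_gradient = grad_mse(0.0, range(len(xs)))   # vanilla GD: uses every sample
sgd_noise = spread(batch_size=1)                # pure SGD: one sample per estimate
minibatch_noise = spread(batch_size=64)         # mini-batch: 64 samples per estimate

print(f"full-batch gradient:           {full_gradient:.3f}")
print(f"std of single-sample estimate: {sgd_noise:.3f}")
print(f"std of 64-sample estimate:     {minibatch_noise:.3f}")
```

The mini-batch estimate costs 64 gradient evaluations instead of 1000, yet its spread is far smaller than the single-sample one.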

🎬 SGD Convergence Animation — With vs Without Momentum

SGD with Momentum — The Update Rule

Velocity Update

v_t = β·v_{t-1} + η·∇L(θ_t)

β is the momentum coefficient (typically 0.9). Accumulates past gradients like a rolling ball gaining speed.

Parameter Update

θ_{t+1} = θ_t − v_t

Subtract the velocity instead of the raw gradient. The accumulated velocity carries the optimizer past small obstacles.
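A minimal sketch of the two-line update above in plain Python. It minimizes an ill-conditioned toy quadratic, L(θ₁, θ₂) = 0.5·(θ₁² + 50·θ₂²), chosen here as an assumption so that a ravine appears. `sgd_momentum` is a hypothetical helper, and the learning rate and step count are illustrative. Note that PyTorch's `optim.SGD` uses a closely related but slightly different formulation: v ← βv + g followed by θ ← θ − η·v.

```python
def grad(theta):
    """Gradient of L = 0.5 * (theta1**2 + 50 * theta2**2)."""
    return [theta[0], 50.0 * theta[1]]

def loss(theta):
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)

def sgd_momentum(theta, lr=0.02, beta=0.9, steps=100):
    """v_t = beta * v_{t-1} + lr * grad;  theta_{t+1} = theta_t - v_t."""
    v = [0.0 for _ in theta]
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi + lr * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

start = [5.0, 1.0]
plain = sgd_momentum(start, beta=0.0)      # beta = 0 recovers plain gradient descent
with_mom = sgd_momentum(start, beta=0.9)

print(f"loss without momentum: {loss(plain):.6f}")
print(f"loss with momentum:    {loss(with_mom):.6f}")
```

With the same learning rate, the momentum run makes much faster progress along the shallow θ₁ direction, which is what limits plain gradient descent on this surface.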

Python Implementation

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# SGD — plain vanilla
optimizer_sgd = optim.SGD(
    model.parameters(),
    lr=0.01           # learning rate
)

# SGD with momentum — almost always better
optimizer_sgd_m = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,     # β — accumulates velocity
    weight_decay=1e-4 # L2 regularization
)

# SGD with Nesterov momentum (lookahead variant — often better)
optimizer_nag = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True     # Nesterov Accelerated Gradient
)

# Training loop
criterion = nn.CrossEntropyLoss()

def train_step(X_batch, y_batch, optimizer):
    optimizer.zero_grad()           # clear old gradients
    output = model(X_batch)
    loss = criterion(output, y_batch)
    loss.backward()                  # compute gradients
    optimizer.step()                 # update weights
    return loss.item()

⚠️

SGD's Fatal Flaw — One Learning Rate for All

SGD applies the same learning rate η to every parameter. But in a deep neural network, some weights see frequent, informative gradients (common features) while others see sparse, rare ones. A single η can't be simultaneously optimal for both. This is the problem the adaptive optimizers in the following sections set out to solve.

Section 04

Adagrad — Adaptive Gradient Algorithm

📖 Story

The Hiker Who Remembers Every Rocky Patch

Our Adagrad hiker has a detailed diary. Every time they step on a rocky patch (a steep gradient for a feature), they write it down. The next time they approach that patch, they take a smaller step — it was rough last time. Smooth features they've rarely crossed? They step more boldly there.

This is adaptive learning: frequent features get smaller rates; rare features get larger ones. It was a revelation for sparse data — NLP, recommendation systems, search. The catch: the diary fills up forever. Eventually, every step is so tiny the hiker stops moving.

The Update Rule

📐 Adagrad Update Rule — Animated
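For reference, Adagrad's per-parameter update is usually written as follows, with g_t = ∇L(θ_t) and all operations applied element-wise. The symbols match the from-scratch implementation later in this section.

```latex
\begin{aligned}
G_t &= G_{t-1} + g_t^{2} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t} + \varepsilon}\, g_t
\end{aligned}
```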

✅

Where Adagrad Shines

Sparse data, NLP

Rare words/features get large updates (small accumulated G). Frequent ones get small updates. Perfect for word embeddings and bag-of-words models.

⚙️

Hyperparameters

η=0.01, ε=1e-8

η = global learning rate (can be set higher than SGD since it's automatically scaled). ε = tiny constant for numerical stability (prevents ÷0).

❌

The Fatal Problem

Monotonically decreasing LR

G only grows — it never shrinks. After many iterations, every parameter's effective learning rate → 0. Training stops learning. This kills Adagrad in deep networks.

Python Implementation

import torch.optim as optim

# Adagrad — great for sparse problems (NLP, recommendation)
optimizer_adagrad = optim.Adagrad(
    model.parameters(),
    lr=0.01,             # global learning rate
    lr_decay=0,           # optional extra decay
    eps=1e-8,             # numerical stability
    weight_decay=0        # L2 regularization
)

# From scratch — educational implementation
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """
    theta : current parameters (array)
    grad  : gradient of loss w.r.t theta
    G     : accumulated sum of squared gradients
    """
    G = G + grad**2                       # accumulate squared grads
    effective_lr = lr / (np.sqrt(G) + eps)  # per-param learning rate
    theta = theta - effective_lr * grad      # update parameters
    return theta, G

# Example with a sparse gradient scenario
theta = np.array([1.0, 1.0, 1.0])
G     = np.zeros_like(theta)           # starts at zero

# Feature 0 seen frequently, feature 2 rarely
grads = [
    np.array([0.5, 0.0, 0.0]),
    np.array([0.5, 0.0, 0.5]),
    np.array([0.5, 0.3, 0.0]),
    np.array([0.5, 0.0, 0.0]),
]

for i, grad in enumerate(grads):
    theta, G = adagrad_update(theta, grad, G)
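    eff_lr = 0.01 / (np.sqrt(G) + 1e-8)    # same lr and eps as adagrad_update's defaults
    print(f"Step {i+1}: θ={np.round(theta, 4).tolist()}, eff_lr={np.round(eff_lr, 4).tolist()}")

OUTPUT

Step 1: θ=[0.99, 1.0, 1.0], eff_lr=[0.02, 1000000.0, 1000000.0]
Step 2: θ=[0.9829, 1.0, 0.99], eff_lr=[0.0141, 1000000.0, 0.02]
Step 3: θ=[0.9772, 0.99, 0.99], eff_lr=[0.0115, 0.0333, 0.02]
Step 4: θ=[0.9722, 0.99, 0.99], eff_lr=[0.01, 0.0333, 0.02]
← Feature 0 (frequent): effective LR shrinks at every step
← Feature 2 (rare): effective LR stays at 0.02 after its single update
← A feature that has not yet received a gradient shows lr/ε = 1e6, which is harmless because its gradient is zero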

Section 05

RMSProp — Root Mean Square Propagation

📖 Story

The Hiker Who Only Trusts Recent Memory

Adagrad's hiker keeps a diary that grows forever, eventually making every step microscopic. RMSProp's hiker has a smarter diary: a fading memory. Yesterday's rocky patches count less than today's. Last week's? Almost forgotten.

The technical term is an exponentially weighted moving average (EWMA) of squared gradients. The accumulated history decays with coefficient β (typically 0.9 or 0.99), which prevents the learning rate from collapsing to zero.

Geoffrey Hinton introduced RMSProp in 2012 in a Coursera lecture — it was never formally published, yet it became one of the most widely used optimizers for recurrent neural networks and RL.

The Update Rule

📐 RMSProp Update Rule — Animated
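For reference, the RMSProp update is usually written as follows, with g_t = ∇L(θ_t), decay coefficient β, and all operations applied element-wise. The only change from Adagrad is that the running sum of squared gradients becomes a decaying average.

```latex
\begin{aligned}
E[g^{2}]_t &= \beta\, E[g^{2}]_{t-1} + (1 - \beta)\, g_t^{2} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t} + \varepsilon}\, g_t
\end{aligned}
```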

📛 Adagrad — Accumulates Forever

Step	G (sum of g²)	Eff. LR (η/√G)
1	0.25	0.0200
10	2.50	0.0063
100	25.0	0.0020
1000	250	0.0006
10000	2500	0.0002 ← dying

✅ RMSProp — Stays Bounded

Step	E[g²] (EWMA)	Eff. LR (η/√E[g²])
1	0.025	0.0632
10	~0.16	0.0248
100	~0.25	0.0200
1000	~0.25	0.0200
10000	~0.25	0.0200 ← stable!
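The behaviour in the two tables can be checked with a few lines of Python. The sketch assumes a constant gradient g = 0.5, η = 0.01, and β = 0.9, which are the values implied by the first row of each table; `effective_lrs` is a hypothetical helper.

```python
import math

def effective_lrs(steps, g=0.5, lr=0.01, beta=0.9):
    """Return {step: (adagrad_lr, rmsprop_lr)} for a constant gradient g."""
    G = 0.0      # Adagrad: running sum of squared gradients
    eg2 = 0.0    # RMSProp: exponentially weighted average of squared gradients
    out = {}
    for t in range(1, max(steps) + 1):
        G += g ** 2
        eg2 = beta * eg2 + (1 - beta) * g ** 2
        if t in steps:
            out[t] = (lr / math.sqrt(G), lr / math.sqrt(eg2))
    return out

table = effective_lrs(steps={1, 10, 100, 1000, 10000})
for t in sorted(table):
    ada, rms = table[t]
    print(f"step {t:5d}: Adagrad lr = {ada:.4f}, RMSProp lr = {rms:.4f}")
```

Adagrad's effective rate keeps falling like 1/√t, while RMSProp's settles at η/|g| once the moving average has warmed up.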

Python Implementation

import torch.optim as optim

# RMSProp — excellent for RNNs and RL
optimizer_rms = optim.RMSprop(
    model.parameters(),
    lr=0.001,            # learning rate
    alpha=0.99,           # β — decay for moving average (0.9 or 0.99)
    eps=1e-8,             # numerical stability
    weight_decay=0,       # L2 regularization
    momentum=0,           # optional momentum term
    centered=False         # if True: normalize by the variance estimate E[g²] - (E[g])² (centered variant)
)

# From scratch — numpy educational version
import numpy as np

def rmsprop_update(theta, grad, eg2, lr=0.001, beta=0.9, eps=1e-8):
    """
    theta : current parameters
    grad  : gradient of loss
    eg2   : running EWMA of squared gradients
    beta  : decay factor (0.9 = 10% new info per step)
    """
    eg2   = beta * eg2 + (1 - beta) * grad**2   # EWMA update
    theta = theta - lr / (np.sqrt(eg2) + eps) * grad  # adaptive update
    return theta, eg2

# Apply to a simple 1D loss: L(θ) = θ²  (minimum at θ=0)
theta = 5.0
eg2   = 0.0

for step in range(1, 21):
    grad       = 2 * theta                          # ∇(θ²) = 2θ
    theta, eg2 = rmsprop_update(theta, grad, eg2)
    if step % 5 == 0:
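        print(f"Step {step:2d}: θ = {theta:.3f}, E[g²] = {eg2:.0f}")

OUTPUT

Step  5: θ = 4.989, E[g²] = 41
Step 10: θ = 4.983, E[g²] = 65
Step 15: θ = 4.977, E[g²] = 79
Step 20: θ = 4.971, E[g²] = 87
↳ θ moves steadily toward the minimum at θ=0, but slowly: each step is lr·g/√E[g²], roughly 0.001 to 0.003 here regardless of how large the gradient is, so reaching the minimum needs a larger learning rate or several thousand steps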

⚡

RMSProp is the go-to for RNNs and Reinforcement Learning

RNNs suffer from exploding and vanishing gradients across time steps. RMSProp's adaptive scaling naturally handles this — large gradients get divided by their own magnitude, preventing explosions. In RL, reward signals are highly non-stationary; the fading memory adapts quickly to new reward scales. DeepMind used RMSProp in the original DQN Atari paper (Mnih et al., 2015).

Section 06

Adam — Adaptive Moment Estimation

📖 Story

The Hiker with GPS, Compass, and a Fading Diary

Adam is the synthesis of everything we've learned. It combines momentum (which plain RMSProp lacks) with per-parameter adaptive learning rates (which SGD lacks).

The GPS is the first moment — a running average of the gradient itself (direction). The diary is the second moment — a running EWMA of squared gradients (scale). The compass is the bias correction — because early in training, both moments start at zero and underestimate the true values. Adam corrects for this automatically.

Kingma & Ba published Adam in 2014. It is now the default optimizer for the vast majority of deep learning research and production systems.

The Update Rule — All 4 Steps

📐 Adam — Complete Update Rule (Animated)
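For reference, the four steps of Adam's update are usually written as follows, with g_t = ∇L(θ_t) and all operations applied element-wise. The notation matches the from-scratch implementation later in this section.

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t
  && \text{(1) first moment} \\
v_t &= \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^{2}
  && \text{(2) second moment} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}
  && \text{(3) bias correction} \\
\theta_{t+1} &= \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
  && \text{(4) parameter update}
\end{aligned}
```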

🔍 Why Bias Correction Matters — A Concrete Example

t=1

β₁=0.9, first gradient g₁=0.5. Raw: m₁ = 0.9·0 + 0.1·0.5 = 0.05. But true mean ≈ 0.5. Correction: m̂₁ = 0.05 / (1-0.9¹) = 0.05/0.1 = 0.5 ✓

t=10

β₁¹⁰ ≈ 0.35. Correction factor = 1/(1-0.35) ≈ 1.54. The correction is still meaningful but getting smaller.

t=100

β₁¹⁰⁰ ≈ 2.66×10⁻⁵ ≈ 0. Correction factor ≈ 1.00003, so effectively no correction is needed. The bias correction fades out on its own.
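The correction factors in this example can be checked directly. The short sketch below computes 1/(1 − β₁ᵗ) for the three time steps and repeats the t=1 calculation; `bias_correction_factor` is a hypothetical helper name.

```python
def bias_correction_factor(beta, t):
    """Factor 1 / (1 - beta**t) that rescales a zero-initialised moving average."""
    return 1.0 / (1.0 - beta ** t)

beta1 = 0.9
for t in (1, 10, 100):
    print(f"t={t:3d}: correction factor = {bias_correction_factor(beta1, t):.6f}")

# The t=1 case: raw first moment versus corrected value
g1 = 0.5
m1 = beta1 * 0.0 + (1 - beta1) * g1              # raw estimate, biased toward zero
m1_hat = m1 * bias_correction_factor(beta1, 1)   # corrected estimate
print(f"m1 = {m1:.3f}, corrected m1 = {m1_hat:.3f}")
```

The same factor with β₂ = 0.999 decays far more slowly, which is why the second-moment correction matters for many more steps than the first-moment one.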

Python Implementation

import torch.optim as optim

# Adam — the default choice for most deep learning
optimizer_adam = optim.Adam(
    model.parameters(),
    lr=3e-4,              # Andrej Karpathy's "learning rate 3e-4 is king"
    betas=(0.9, 0.999),   # β₁, β₂ — almost never tuned
    eps=1e-8,             # ε — numerical stability
    weight_decay=0,       # L2 regularization (see AdamW below)
    amsgrad=False          # AMSGrad variant (more stable, rarely needed)
)

# AdamW — Adam with CORRECT weight decay (strongly preferred)
optimizer_adamw = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01      # decoupled weight decay (applied to the weights, not added to the gradient)
)

# From scratch — full educational implementation
import numpy as np

class Adam:
    def __init__(self, params, lr=3e-4, betas=(0.9, 0.999), eps=1e-8):
        self.params = params
        self.lr     = lr
        self.b1, self.b2 = betas
        self.eps    = eps
        self.t      = 0
        self.m      = [np.zeros_like(p) for p in params]  # 1st moment
        self.v      = [np.zeros_like(p) for p in params]  # 2nd moment

    def step(self, grads):
        self.t += 1
        updated = []
        for i, (p, g) in enumerate(zip(self.params, grads)):
            # Step 1: first moment
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            # Step 2: second moment
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g**2
            # Step 3: bias correction
            m_hat = self.m[i] / (1 - self.b1**self.t)
            v_hat = self.v[i] / (1 - self.b2**self.t)
            # Step 4: update
            p_new = p - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
            updated.append(p_new)
        self.params = updated   # keep the new values so the next step starts from them
        return updated

# Demo: minimise L(θ₁,θ₂) = θ₁² + 10·θ₂² (ill-conditioned)
params = [np.array(5.0), np.array(5.0)]
adam   = Adam(params, lr=0.1)

for step in range(1, 31):
    grads  = [2*params[0], 20*params[1]]    # ∇L
    params = adam.step(grads)
    if step % 10 == 0:
        loss = params[0]**2 + 10*params[1]**2
        print(f"Step {step:2d}: θ₁={params[0]:.2f}, θ₂={params[1]:.2f}, L={loss:.1f}")

OUTPUT

Step 10: θ₁=4.01, θ₂=4.01, L=176.7
Step 20: θ₁=3.06, θ₂=3.06, L=103.0
Step 30: θ₁=2.20, θ₂=2.20, L=53.4
↳ θ₂'s gradient is 10× larger than θ₁'s, yet both parameters follow the same path: Adam divides each update by that parameter's own gradient scale, so every step stays close to lr = 0.1 in each direction. The steep direction does not overshoot and the shallow one does not lag.

🎯

AdamW vs Adam — Always Prefer AdamW

Standard Adam's weight_decay is implemented as L2 regularization added to the gradient, which interacts incorrectly with the adaptive scaling. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update — applying it directly to the weights instead. This gives correct regularization and is the default in modern transformers (GPT, BERT, etc.). Use optim.AdamW, not optim.Adam(weight_decay=...).
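The difference can be seen in a single update of a scalar weight. The sketch below is a simplified illustration, not PyTorch's implementation: `adam_step` is a hypothetical helper, the values θ = 1.0 and weight_decay = 0.1 are assumptions, and only the very first step (t = 1, zero-initialised moments) is shown.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam step for a scalar parameter.

    decoupled=False: weight decay is added to the gradient (L2 style, as in Adam).
    decoupled=True:  weight decay shrinks the weight directly (AdamW style).
    """
    if weight_decay and not decoupled:
        g = g + weight_decay * theta               # decay passes through the adaptive scaling
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    if weight_decay and decoupled:
        theta = theta - lr * weight_decay * theta  # decay bypasses the adaptive scaling
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta0, wd = 1.0, 0.1
for g in (0.01, 10.0):                             # a small-gradient and a large-gradient weight
    base, _, _ = adam_step(theta0, g, 0.0, 0.0, t=1)
    l2, _, _ = adam_step(theta0, g, 0.0, 0.0, t=1, weight_decay=wd)
    dec, _, _ = adam_step(theta0, g, 0.0, 0.0, t=1, weight_decay=wd, decoupled=True)
    print(f"g={g:5.2f}: extra shrink from L2 = {base - l2:.2e}, "
          f"from decoupled decay = {base - dec:.2e}")
```

On this first step the L2 term is almost entirely normalised away, because it is divided by √v̂ along with the gradient, while the decoupled version shrinks both weights by the same lr·weight_decay·θ. Over many steps the L2 variant does have an effect, but one that depends on each weight's gradient history.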

Section 07

Side-by-Side Comparison

Property	SGD	Adagrad	RMSProp	Adam
Year	1951 (Robbins & Monro)	2011	2012	2014
Adaptive LR	No	Yes	Yes	Yes
Momentum	Optional	No	Optional	Yes (built-in)
LR stability	Fixed	Monotone ↓ to 0	Bounded (EWMA)	Bounded (EWMA)
Memory (extra state)	0 (1 array with momentum)	G (1 array)	E[g²] (1 array)	m + v (2 arrays)
Bias correction	No	No	No	Yes
Sparse data	Poor	Excellent	Good	Excellent
RNN / RL	Works	Fails long training	Excellent	Excellent
Generalization	Often best (tends toward flatter minima)	Good	Good	Sometimes worse than SGD
Default recommendation	CV tasks, large batch	Sparse NLP	RNN/RL	Everything else ✅

📉 Convergence Speed Comparison — All 4 Optimizers

Section 08

Complete Python Implementation — All Four Optimizers

Full Training Loop Comparison on MNIST

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# ── Data ─────────────────────────────────────────────────────
transform = transforms.ToTensor()
train_data = datasets.MNIST('.', train=True,  download=True, transform=transform)
test_data  = datasets.MNIST('.', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=256)

# ── Model ─────────────────────────────────────────────────────
def build_model():
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10)
    )

# ── Optimizers to compare ─────────────────────────────────────
configs = {
    'SGD':      lambda m: optim.SGD(m.parameters(), lr=0.01, momentum=0.9),
    'Adagrad':  lambda m: optim.Adagrad(m.parameters(), lr=0.01),
    'RMSProp':  lambda m: optim.RMSprop(m.parameters(), lr=0.001),
    'Adam':     lambda m: optim.Adam(m.parameters(),    lr=3e-4),
}

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
criterion = nn.CrossEntropyLoss()

def train_and_evaluate(name, optimizer_fn, epochs=5):
    model     = build_model().to(device)
    optimizer = optimizer_fn(model)
    history   = []

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # Evaluate
        model.eval()
        correct = 0
        with torch.no_grad():
            for X, y in test_loader:
                X, y = X.to(device), y.to(device)
                preds   = model(X).argmax(dim=1)
                correct += (preds == y).sum().item()

        acc = correct / len(test_data)
        history.append(acc)
        print(f"[{name}] Epoch {epoch+1}/{epochs} | Loss={total_loss/len(train_loader):.4f} | Test Acc={acc:.4f}")

    return history

# Run all four
results = {}
for name, opt_fn in configs.items():
    print(f"\n{'='*50}")
    results[name] = train_and_evaluate(name, opt_fn)

Learning Rate Schedules with Optimizers

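# Best practice: combine AdamW with a learning rate schedule
model     = build_model().to(device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Option 1: Cosine Annealing, which decays the LR smoothly along a cosine curve.
# Call scheduler_cos.step() once per epoch. Attach only one scheduler to an
# optimizer; this one is commented out because the loop below uses OneCycleLR.
# scheduler_cos = optim.lr_scheduler.CosineAnnealingLR(
#     optimizer, T_max=50, eta_min=1e-6
# )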

# Option 2: OneCycleLR — fast warmup then decay (Leslie Smith's policy)
scheduler_one = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=len(train_loader),
    epochs=20
)

# Training loop with scheduler
for epoch in range(20):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        scheduler_one.step()   # step INSIDE the batch loop for OneCycleLR

    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1:2d} | LR = {current_lr:.6f}")
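For intuition, the cosine schedule has a simple closed form: η_t = η_min + ½·(η_max − η_min)·(1 + cos(π·t / T_max)). The sketch below evaluates it in plain Python using the T_max and eta_min values from the CosineAnnealingLR option above. `cosine_lr` is a hypothetical helper, and the starting rate of 1e-3 is an assumption taken from the optimizer's lr.

```python
import math

def cosine_lr(t, base_lr=1e-3, eta_min=1e-6, t_max=50):
    """Cosine-annealed learning rate at epoch t (0 <= t <= t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))

for t in (0, 12, 25, 38, 50):
    print(f"epoch {t:2d}: lr = {cosine_lr(t):.6f}")
```

The rate starts at base_lr, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches eta_min.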

Section 09

Hyperparameter Reference

Optimizer	Parameter	Default	Effect	Tuning Advice
SGD	lr (η)	—	Step size	Critical. Grid: 0.1, 0.01, 0.001
SGD	momentum (β)	0	Velocity decay	Set 0.9 always. Try 0.99 for LSTMs.
Adagrad	lr (η)	0.01	Global scale	Can be larger than SGD. Try 0.01–0.1.
Adagrad	eps	1e-10	Numerical stability	Rarely tuned. Keep at 1e-8 to 1e-10.
RMSProp	lr (η)	0.001	Global scale	Default usually good. Try 1e-3 to 1e-4.
RMSProp	alpha (β)	0.99	EWMA decay	0.9 for fast-changing, 0.99 for stable.
RMSProp	eps	1e-8	Stability	Try 1e-6 if training is unstable.
Adam	lr (η)	1e-3	Global scale	3e-4 is often best. Rarely go above 1e-3.
Adam	β₁	0.9	1st moment decay	Almost never change. Try 0.85–0.95.
Adam	β₂	0.999	2nd moment decay	Try 0.99 for sparse, 0.9999 for large batch.
Adam	eps	1e-8	Stability	Raise to 1e-6 if training is unstable, as is common for transformers.

Section 10

When to Use Which Optimizer

🖼️

Computer Vision (CNNs)

SGD with momentum and a cosine LR schedule often generalizes better than Adam on vision tasks. ResNets, for example, are typically trained with SGD. Training usually takes longer, but test accuracy is often higher.

→ Use: SGD(lr=0.1, momentum=0.9) + CosineAnnealingLR

💬

NLP / Transformers

Adam/AdamW with a warmup schedule is standard. BERT and the GPT family are trained with Adam or AdamW. A larger eps (for example 1e-6 instead of 1e-8) is sometimes needed for transformer stability.

→ Use: AdamW(lr=3e-4, weight_decay=0.01) + Linear warmup

🔁

Recurrent Networks (RNNs/LSTMs)

RMSProp is the classic choice — adaptive LR handles vanishing/exploding gradients well. Adam also works. Always use gradient clipping with RNNs regardless of optimizer.

→ Use: RMSprop(lr=0.001) + clip_grad_norm_(max_norm=1.0)

🎮

Reinforcement Learning

RMSProp (DQN, A3C) or Adam (PPO). Non-stationary reward signals need adaptive rates. RMSProp's fading memory adapts well to changing reward distributions.

→ Use: RMSprop(lr=0.00025, alpha=0.95) or Adam(lr=3e-4)

📊

Sparse NLP / Embeddings

Adagrad or Adam. Rare words/features need larger updates; Adagrad's design is perfect for this. For long training, prefer Adam which won't zero out the learning rate.

→ Use: Adagrad(lr=0.05) or Adam for long runs

⚡

Quick Prototyping

Adam with default parameters. Works reasonably well on almost any architecture without tuning. Convergence is fast enough to iterate quickly on architecture and data decisions.

→ Use: Adam(lr=3e-4) — the universal starting point

Section 11

Golden Rules

🌿 Optimizer — Non-Negotiable Rules

Start with Adam(lr=3e-4) as your first experiment. It works on nearly any architecture without tuning. Only switch to SGD if you care about squeezing the last 1–2% generalization for a vision model.

Use AdamW, not Adam, whenever you use weight decay. The difference is subtle in code but meaningful in practice — optim.AdamW(weight_decay=0.01) applies decay correctly; optim.Adam(weight_decay=0.01) does not.

The learning rate is your most important hyperparameter — more important than the choice of optimizer. A good LR with SGD beats a bad LR with Adam. Always do a learning rate range test or grid search.

Always call optimizer.zero_grad() before loss.backward(). PyTorch accumulates gradients by default. Forgetting this silently corrupts training — gradients from multiple batches stack and your model diverges mysteriously.

Add a learning rate schedule. A fixed learning rate is almost never optimal. Use CosineAnnealingLR for most cases, OneCycleLR for fast training, or a simple StepLR as a baseline.

Use gradient clipping with RNNs and transformers. Call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) between loss.backward() and optimizer.step(). This prevents exploding gradients from destabilising training.

Adagrad is not obsolete — it is still the best choice for sparse features in shallow models (logistic regression on text, collaborative filtering). Its dying learning rate is a problem only in deep, long-training scenarios.
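As a sketch of what the norm-based clipping in the gradient clipping rule does, the plain-Python function below rescales a set of gradients so that their combined L2 norm does not exceed `max_norm`. `clip_by_global_norm` is a hypothetical helper written for illustration, and the gradient values are made up; in PyTorch, use `torch.nn.utils.clip_grad_norm_` as described in that rule.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one common factor so their global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [[g * scale for g in group] for group in grads], total_norm

grads = [[3.0, 4.0], [12.0]]        # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for group in clipped for g in group))

print(f"norm before = {norm_before:.1f}, norm after = {norm_after:.4f}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.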