The Story That Explains Recurrent Neural Networks
When you reach the word "bank", you don't forget everything you just read. Your brain holds a running memory (context) of all the prior words. So you understand this is a riverbank, not a financial institution, even before you reach the word "river".
Now imagine a forgetful reader who reads each word in complete isolation, with no memory of what came before. Every word is processed cold. Would they understand this sentence? No. They would guess "bank" means money every time.
That forgetful reader is a standard feedforward neural network applied to sequences. Recurrent Neural Networks (RNNs) are the reader who remembers: they maintain a hidden state that carries context forward from every previous step.
An RNN is a type of neural network designed for sequential data, that is, data where the order and history of inputs matter. Unlike a standard network that maps one input to one output independently, an RNN loops: it feeds its output (a hidden state) back into itself at the next time step, building a running memory of the sequence it has seen so far.
The world is full of sequences: language, music, time series, video, DNA. Most real phenomena unfold in time. RNNs are the first major class of deep learning model built to process information in the order it arrives, respecting the temporal structure of data.
Why a Standard Neural Network Can't Do This
A feedforward network is that forgetful reader. It sees x₁, produces an output, then forgets x₁ entirely before seeing x₂. For images, where the position of pixels matters but not their sequential order, this is fine. For language, music, or stock prices, it's fatal.
The RNN Architecture: Animated
The magic of an RNN is its recurrent connection: the hidden state h at each time step depends on both the current input and the previous hidden state. Below is an animated diagram of information flowing through an RNN.
Each RNN cell receives the current input xₜ and the prior hidden state hₜ₋₁. It produces a new hidden state hₜ (passed right) and an optional output yₜ. The same weights W are shared at every time step.
The Mathematics: Animated Equations
The equations of a vanilla RNN are surprisingly compact. Two lines define everything that happens inside a single cell at time step t.
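Written out, and assuming the weight names used in the NumPy implementation later in this article (Wx, Wh, Wy, bh, by), those two lines would read roughly as follows. This is a sketch of the standard formulation, not a transcription of the animation.

```latex
\begin{aligned}
h_t &= \tanh\left(W_x x_t + W_h h_{t-1} + b_h\right) \\
y_t &= W_y h_t + b_y
\end{aligned}
```

The first line updates the memory and the second reads an output from it. A softmax over y_t is only needed when the output is a probability distribution.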
The Four RNN Input/Output Patterns
RNNs are flexible about how they consume and produce data. The same core architecture supports four distinct sequence patterns, each suited to different tasks.
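As a rough illustration of two of these patterns, the sketch below runs a tiny NumPy RNN over a sequence and then reads the result in two ways: every hidden state (many-to-many, as in the character-level model later on) or only the last one (many-to-one, as in the sentiment classifier later on). The sizes and the helper name `run_rnn` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4   # illustrative sizes (assumptions)

Wx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def run_rnn(xs):
    """Return the hidden state at every time step, shape (T, hidden_dim)."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)   # same weights at every step
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(T, input_dim))
states = run_rnn(xs)

many_to_many = states       # one vector per step, e.g. next-character prediction
many_to_one = states[-1]    # only the final state, e.g. sentiment classification
print(many_to_many.shape, many_to_one.shape)
```

The recurrent core is identical in both cases; only the choice of which hidden states feed the output layer changes.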
The Vanishing Gradient Problem
The vanishing gradient problem arises during backpropagation through time (BPTT), where gradients are multiplied at every time step by a factor that depends on the recurrent weight matrix. If that factor is slightly less than 1 in magnitude, the gradient shrinks exponentially as it flows backward. For a sequence of 100 steps with a per-step factor of 0.9, the gradient reaching step 1 is scaled by 0.9¹⁰⁰ ≈ 0.000027, effectively zero. The early parts of the sequence stop influencing the model's learning.
If the per-step factor is < 1, gradients vanish and the model cannot learn long-term patterns. If it is > 1, gradients explode and training becomes numerically unstable (NaN values, divergence). Both are caused by repeated matrix multiplication through time. Gradient clipping handles explosion. LSTMs and GRUs were invented to handle vanishing.
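The arithmetic is easy to check directly. The snippet below treats the per-step multiplier as a single scalar, which is a simplification (in a real RNN it is a matrix product), and raises it to the 100th power for a shrinking and a growing case.

```python
steps = 100

vanish = 0.9 ** steps    # per-step factor slightly below 1
explode = 1.1 ** steps   # per-step factor slightly above 1

print(f"0.9^{steps} = {vanish:.3e}")
print(f"1.1^{steps} = {explode:.3e}")
```

A 10% deviation from 1 in either direction is enough to push the gradient several orders of magnitude away from its starting scale within 100 steps.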
LSTM: Long Short-Term Memory
An LSTM is like a reader carrying a notebook. When they encounter new information, they consciously decide: Does this go in the notebook? Should I erase something? What should I tell you right now? They have three dedicated mental operations (three gates) governing what to remember, what to forget, and what to output. The notebook (cell state) can carry information across many steps largely untouched, because it is updated only by elementwise scaling and addition, not by squashing activations.
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)
Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ
hₜ = oₜ ⊙ tanh(Cₜ)
The cell state Cₜ flows through the LSTM with only pointwise multiplication (⊙) by the forget gate and addition of the gated candidate, with no squashing activation. This creates a nearly uninterrupted gradient highway from the end of the sequence back to the beginning. Along this path the gradient no longer has to pass through tanh at every step, so it does not vanish exponentially as long as the forget gate stays close to 1.
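To make the update concrete, here is a minimal NumPy sketch of a single LSTM step. It assumes the standard formulation in which the input gate, output gate, and candidate are computed from [hₜ₋₁, xₜ] in the same way as the forget gate. Stacking the four weight blocks into one matrix `W` and the function name `lstm_step` are illustrative choices, not something this article prescribes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked pre-activations."""
    H = h_prev.shape[0]
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    z = W @ concat + b                       # shape (4H,)
    f = sigmoid(z[0:H])                      # forget gate
    i = sigmoid(z[H:2 * H])                  # input gate
    o = sigmoid(z[2 * H:3 * H])              # output gate
    c_tilde = np.tanh(z[3 * H:4 * H])        # candidate cell state
    c_t = f * c_prev + i * c_tilde           # scaling and addition only
    h_t = o * np.tanh(c_t)                   # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 4, 3                                  # hidden and input sizes (assumptions)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, D)):          # a random 6-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Note that `c_t` is never passed through tanh before being carried to the next step; tanh is applied only when producing `h_t`.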
GRU: Gated Recurrent Unit
Introduced by Cho et al. in 2014, the GRU is a streamlined LSTM. It merges the forget and input gates into a single update gate and eliminates the separate cell state: the hidden state is the memory. The result is fewer parameters, faster training, and often comparable performance.
| LSTM component | Formula |
|---|---|
| Forget gate | fₜ = σ(Wf·[hₜ₋₁, xₜ]) |
| Input gate | iₜ = σ(Wi·[hₜ₋₁, xₜ]) |
| Output gate | oₜ = σ(Wo·[hₜ₋₁, xₜ]) |
| Cell state | Cₜ (separate) |
| Hidden state | hₜ (separate) |

| GRU component | Formula |
|---|---|
| Reset gate | rₜ = σ(Wr·[hₜ₋₁, xₜ]) |
| Update gate | zₜ = σ(Wz·[hₜ₋₁, xₜ]) |
| Candidate | h̃ₜ = tanh(W·[rₜ ⊙ hₜ₋₁, xₜ]) |
| Hidden state | hₜ = (1−zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ |
| Parameters | ~25% fewer than LSTM |
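The GRU formulas translate almost line for line into NumPy. The sketch below follows the table's bias-free form. The function name `gru_step`, the weight names, and the sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, Wc):
    """One GRU step without biases, mirroring the table's formulas."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    r = sigmoid(Wr @ concat)                                    # reset gate
    z = sigmoid(Wz @ concat)                                    # update gate
    h_tilde = np.tanh(Wc @ np.concatenate([r * h_prev, x_t]))   # candidate
    return (1.0 - z) * h_prev + z * h_tilde                     # blend old and new

rng = np.random.default_rng(0)
H, D = 4, 3                                                     # sizes (assumptions)
Wr, Wz, Wc = (rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3))

h = np.zeros(H)
for x_t in rng.normal(size=(6, D)):                             # random 6-step sequence
    h = gru_step(x_t, h, Wr, Wz, Wc)
print(h.shape)
```

With the update gate near 0 the old hidden state is copied through almost unchanged, which is how the GRU preserves information over many steps without a separate cell state.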
| Property | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Memory mechanism | Hidden state only | Cell state + hidden state | Hidden state (gated) |
| Vanishing gradient | Severe problem | Largely solved | Largely solved |
| Number of gates | 0 | 3 (forget, input, output) | 2 (reset, update) |
| Parameter count | Smallest | Largest | Medium (~25% less than LSTM) |
| Training speed | Fastest | Slowest | Fast |
| Long sequences | Fails | Excellent | Very good |
| When to use | Short sequences, baselines | Long sequences, NLP, time-series | When you want LSTM performance with less compute |
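The "~25% less" figure follows from counting weight blocks: an LSTM has four transforms of [hₜ₋₁, xₜ] (three gates plus the candidate), a GRU has three. The helper below is a simplified count assuming one weight matrix and one bias vector per block. Library implementations may differ slightly (PyTorch, for instance, keeps two bias vectors per block), so treat the numbers as approximate.

```python
def recurrent_params(n_blocks, input_dim, hidden_dim):
    """Weights plus biases for n_blocks transforms of [h_{t-1}, x_t] to hidden_dim."""
    per_block = hidden_dim * (hidden_dim + input_dim) + hidden_dim
    return n_blocks * per_block

E, H = 128, 256   # embedding and hidden sizes, matching the sentiment example

rnn_count = recurrent_params(1, E, H)
gru_count = recurrent_params(3, E, H)
lstm_count = recurrent_params(4, E, H)

print(f"RNN:  {rnn_count:,}")
print(f"GRU:  {gru_count:,}")
print(f"LSTM: {lstm_count:,}")
print(f"GRU saves {1 - gru_count / lstm_count:.0%} relative to LSTM")
```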
Backpropagation Through Time (BPTT)
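Training an RNN requires computing gradients across time. Because the same weights W are used at every step, the chain rule must unroll backward through all the time steps; this is Backpropagation Through Time (BPTT).
Long unrolls make the gradients prone to exploding, so in PyTorch the gradient norm is usually clipped after loss.backward() and before optimizer.step(): nn.utils.clip_grad_norm_(model.parameters(), 1.0)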
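To show what norm-based clipping does, here is a minimal pure-Python sketch of the idea: compute one global L2 norm over all gradients and rescale them if it exceeds a threshold. The function name `clip_by_global_norm` is hypothetical, and the code is an illustration rather than PyTorch's actual implementation.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]   # toy gradients for two parameters
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.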
Python: Vanilla RNN from Scratch (NumPy)
Before using PyTorch's built-in RNN, let's build a one-step RNN cell entirely in NumPy to demystify the maths. The code covers the forward pass, the loss, and the BPTT gradients for a character-level model on a tiny corpus; the parameter-update loop is left out.
import numpy as np

# -- Hyperparameters --
hidden_size = 64       # Number of hidden units
seq_length = 25        # Truncated BPTT length
learning_rate = 1e-1

# -- Tiny corpus --
data = "hello world, this is a recurrent neural network tutorial"
chars = sorted(set(data))   # sorted so the index mapping is reproducible
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# -- Weight initialisation (small random values) --
Wx = np.random.randn(hidden_size, vocab_size) * 0.01    # Input -> hidden
Wh = np.random.randn(hidden_size, hidden_size) * 0.01   # Hidden -> hidden
Wy = np.random.randn(vocab_size, hidden_size) * 0.01    # Hidden -> output
bh = np.zeros((hidden_size, 1))                         # Hidden bias
by = np.zeros((vocab_size, 1))                          # Output bias

def forward(inputs, targets, hprev):
    """Forward pass, loss, and BPTT gradients for one chunk of character indices."""
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0

    # -- Forward pass --
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1                               # One-hot encode
        hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t-1] + bh)    # Hidden state
        ys[t] = Wy @ hs[t] + by                            # Logits
        shifted = ys[t] - np.max(ys[t])                    # Subtract max for numerical stability
        ps[t] = np.exp(shifted) / np.sum(np.exp(shifted))  # Softmax
        loss += -np.log(ps[t][targets[t], 0])              # Cross-entropy

    # -- Backward pass (BPTT) --
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1                # dL/dy (softmax + CE gradient)
        dWy += dy @ hs[t].T
        dby += dy
        dh = Wy.T @ dy + dhnext            # Backprop into hidden state
        dhraw = (1 - hs[t] ** 2) * dh      # tanh derivative
        dbh += dhraw
        dWx += dhraw @ xs[t].T
        dWh += dhraw @ hs[t-1].T
        dhnext = Wh.T @ dhraw              # Gradient passed to the previous step

    # Gradient clipping (elementwise) to limit exploding gradients
    for dparam in [dWx, dWh, dWy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)

    return loss, dWx, dWh, dWy, dbh, dby, hs[len(inputs) - 1]
The line hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t-1] + bh) is the entire RNN equation: it combines the current input and previous hidden state, squashes the sum through tanh, and produces the new memory. That's the whole recurrence in one line.
Python: LSTM for Sentiment Analysis (PyTorch)
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import numpy as np

# -- 1. Define the LSTM Model --
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_layers, dropout=0.3):
        super().__init__()
        # Word -> dense vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Stacked LSTM layers
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,                           # (batch, seq, feature)
            dropout=dropout if n_layers > 1 else 0.0,
            bidirectional=False                         # Use True for Bi-LSTM
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, 1)              # Binary output
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        emb = self.dropout(self.embedding(x))           # (B, T, E)
        out, (hn, cn) = self.lstm(emb)                  # out: (B, T, H)
        # Use ONLY the top layer's final hidden state.
        # Note: with right-padded inputs this state has also processed the
        # padding tokens; nn.utils.rnn.pack_padded_sequence avoids that.
        last_hidden = self.dropout(hn[-1])              # (B, H)
        logit = self.fc(last_hidden)                    # (B, 1)
        return self.sigmoid(logit).squeeze(1)           # (B,)

# -- 2. Instantiate and inspect --
VOCAB_SIZE = 10_000
EMBED_DIM = 128
HIDDEN_DIM = 256
N_LAYERS = 2
BATCH_SIZE = 64

model = SentimentLSTM(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, N_LAYERS)
print(model)

# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total_params:,}")

# -- 3. Training loop --
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0, 0
    for texts, labels in loader:
        texts, labels = texts.to(device), labels.float().to(device)
        optimizer.zero_grad()
        preds = model(texts)
        loss = criterion(preds, labels)
        loss.backward()
        # Gradient clipping: important for RNN/LSTM stability
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
        correct += ((preds > 0.5).float() == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / len(loader), correct / n

# -- 4. Inference on new text --
def predict_sentiment(model, tokenized_text, word2idx, device, max_len=200):
    model.eval()
    idxs = [word2idx.get(w, 1) for w in tokenized_text[:max_len]]   # 1 = unknown word
    idxs += [0] * (max_len - len(idxs))                             # Pad to fixed length
    x = torch.tensor([idxs], dtype=torch.long).to(device)
    with torch.no_grad():
        prob = model(x).item()
    return ("Positive" if prob > 0.5 else "Negative"), prob
Python: GRU for Stock Price Forecasting
import torch
import torch.nn as nn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# -- GRU Model --
class StockGRU(nn.Module):
    def __init__(self, input_features, hidden_dim, n_layers, output_size=1):
        super().__init__()
        self.gru = nn.GRU(
            input_size=input_features,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            dropout=0.2 if n_layers > 1 else 0
        )
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len=60, features=5)
        gru_out, hn = self.gru(x)        # gru_out: (B, 60, H)
        last = gru_out[:, -1, :]         # Take last time step (B, H)
        return self.fc(last)             # (B, 1): tomorrow's price

# -- Data preparation: sliding window --
def create_sequences(data, seq_len=60):
    """Slide a window of length seq_len over the data."""
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i : i + seq_len])
        y.append(data[i + seq_len, 0])   # Predict closing price (col 0)
    return np.array(X), np.array(y)

# -- Simulate some stock data (5 features per day) --
np.random.seed(42)
raw_data = np.random.randn(500, 5)       # close, open, high, low, volume

# Fit the scaler on the earliest 80% of rows only, then apply it to the
# whole series, so no statistics from the test period leak into training
n_train_rows = int(len(raw_data) * 0.8)
scaler = MinMaxScaler()
scaler.fit(raw_data[:n_train_rows])
scaled = scaler.transform(raw_data)
X, y = create_sequences(scaled, seq_len=60)

# -- Train/test split (no shuffle: time series!) --
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

# -- Instantiate and train (full-batch for simplicity) --
model = StockGRU(input_features=5, hidden_dim=128, n_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    pred = model(X_train)
    loss = criterion(pred, y_train)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Train Loss: {loss.item():.6f}")
In a standard classification task, shuffling training data is good practice. For time series, shuffling before the split is catastrophic: windows from the future end up in the training set and leak information about the test period. Always split train/test by time (e.g. first 80% is train, last 20% is test). Never use train_test_split(..., shuffle=True) on sequential data; shuffling is its default, so pass shuffle=False explicitly.
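A chronological split combined with train-only scaling can be sketched in a few lines of plain Python. The helper names `chronological_split` and `fit_min_max` and the toy price list are made up for illustration. In a real project the scikit-learn scaler would play the role of `fit_min_max`.

```python
def chronological_split(series, train_frac=0.8):
    """Split a time-ordered sequence without shuffling: earliest part is train."""
    split = int(len(series) * train_frac)
    return series[:split], series[split:]

def fit_min_max(train):
    """Return a scaling function whose min and max come from the training data only."""
    lo, hi = min(train), max(train)
    span = (hi - lo) or 1.0              # guard against a constant series
    return lambda v: (v - lo) / span

prices = [10.0, 11.0, 12.0, 11.5, 13.0, 14.0, 13.5, 15.0, 16.0, 18.0]  # toy data

train, test = chronological_split(prices)
scale = fit_min_max(train)
train_scaled = [scale(v) for v in train]
test_scaled = [scale(v) for v in test]

print(train_scaled)
print(test_scaled)
```

Test values can land outside the [0, 1] range because the scaler never saw them. That is expected and is the honest picture of what the model will face on future data.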
Bidirectional RNNs
A Bidirectional RNN runs two RNN layers, one forward through the sequence and one backward, then concatenates their hidden states at each time step. The model sees full context at every position: both what came before and what comes after.
| Property | Unidirectional RNN |
|---|---|
| Context at step t | Only x₁…xₜ (past) |
| Hidden dim | H |
| Output dim at each t | H |
| Good for | Generation, streaming, future is unknown |
| PyTorch param | bidirectional=False |

| Property | Bidirectional RNN |
|---|---|
| Context at step t | Full sequence x₁…xT |
| Hidden dim | H (each direction) |
| Output dim at each t | 2H (concatenated) |
| Good for | Classification, tagging, translation encoder |
| PyTorch param | bidirectional=True |
import torch
import torch.nn as nn

# Bidirectional LSTM: a 2-line change from the standard LSTM
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True               # <- key change
        )
        # hidden_dim * 2 because forward + backward states are concatenated
        self.fc = nn.Linear(hidden_dim * 2, n_classes)

    def forward(self, x):
        emb = self.embedding(x)              # (B, T, E)
        out, (hn, _) = self.lstm(emb)        # hn: (2*layers, B, H)
        # Concatenate the final forward and final backward hidden states
        fwd = hn[-2, :, :]                   # Last layer, forward direction
        bwd = hn[-1, :, :]                   # Last layer, backward direction
        combined = torch.cat([fwd, bwd], dim=1)   # (B, 2H)
        return self.fc(combined)             # (B, n_classes)
Real-World Applications of RNNs
RNN vs. Transformer: What Changed
Since 2017 and the "Attention Is All You Need" paper, Transformers have largely replaced RNNs for NLP tasks. Understanding why helps you choose the right tool.
| Property | RNN / LSTM / GRU | Transformer |
|---|---|---|
| Parallelisation | Sequential: can't parallelise over time steps | Fully parallel: all tokens processed at once |
| Long-range dependencies | Hard: gradient must travel through every step | Easy: direct attention between any two tokens |
| Memory of sequence | Compressed into fixed-size hidden state | Full context via O(T²) attention |
| Compute cost | O(T): linear in sequence length | O(T²): quadratic in sequence length |
| Streaming / online inference | Natural: process one token at a time | Requires full sequence upfront (KV cache helps) |
| Small data | Often better: less data hungry | Needs large datasets to shine |
| Best use today | Time series, edge devices, streaming, small datasets | Large-scale NLP, vision, any task where data is abundant |
RNNs remain the go-to architecture for time-series forecasting, embedded/edge devices with limited memory, online streaming applications, and small-data regimes where a Transformer would massively overfit. New variants like Mamba (SSM) and RWKV combine RNN-like sequential efficiency with Transformer-like expressiveness, representing the next evolution of recurrent architectures.
Golden Rules
Clip gradients with nn.utils.clip_grad_norm_(model.parameters(), 1.0) before every optimizer step. Without this, training an RNN on long sequences can diverge due to exploding gradients. This is not optional.
Normalise inputs with MinMaxScaler or StandardScaler fitted only on the training set. Apply the same transform to test data without refitting.
Use batch_first=True in PyTorch. PyTorch's default RNN input shape is (seq, batch, features), which is counterintuitive. Always pass batch_first=True to get the natural (batch, seq, features) shape.
When carrying the hidden state from one batch to the next, call h = h.detach() before the next batch. Without this, gradients will flow through the entire history, causing memory explosion.
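The detach rule can be sketched on a single long sequence processed in chunks, with the hidden state carried from one chunk to the next. The sizes, the random data, and the small GRU-plus-linear model below are illustrative assumptions, not part of the examples above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.GRU(input_size=3, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

series = torch.randn(1, 40, 3)    # one long sequence: (batch, time, features)
targets = torch.randn(1, 40, 1)
chunk_len = 10

h = None                          # hidden state carried across chunks
losses = []
for start in range(0, series.size(1), chunk_len):
    x = series[:, start:start + chunk_len]
    y = targets[:, start:start + chunk_len]
    if h is not None:
        h = h.detach()            # keep the values, cut the graph to earlier chunks
    out, h = rnn(x, h)
    loss = nn.functional.mse_loss(head(out), y)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(params, 1.0)
    optimizer.step()
    losses.append(loss.item())

print(len(losses), tuple(h.shape))
```

The memory still flows forward through `h`, but each backward pass stops at the chunk boundary, which is exactly truncated BPTT. For an LSTM the state is an (h, c) tuple, and both tensors need detaching.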