
Recurrent Neural Networks (RNN)

A comprehensive, story-driven tutorial on Recurrent Neural Networks, covering the core architecture, vanishing gradients, LSTM and GRU cells, and backpropagation through time.

Section 01

The Story That Explains Recurrent Neural Networks

Reading a Novel, One Word at a Time
Imagine you are reading the sentence: "The bank by the river was steep, so she decided to climb it carefully."

As you read past the word "bank", you don't forget everything you just read. Your brain holds a running memory (the context) of all the prior words. So when "river", "steep" and "climb" arrive, you connect them back to "bank" and understand this is a riverbank, not a financial institution.

Now imagine a forgetful reader who reads each word in complete isolation, with no memory of what came before. Every word is processed cold. Would they understand this sentence? No. They would guess "bank" means money every time.

That forgetful reader is a standard feedforward neural network applied to sequences. Recurrent Neural Networks (RNNs) are the reader who remembers: they maintain a hidden state that carries context forward from every previous step.

An RNN is a type of neural network designed for sequential data, that is, data where the order and history of inputs matter. Unlike a standard network that maps one input to one output independently, an RNN loops: it feeds its hidden state back into itself at the next time step, building a running memory of the sequence it has seen so far.

🧠
The Core Insight

The world is full of sequences: language, music, time series, video, DNA. Most real phenomena unfold in time. RNNs are the first major class of deep learning model built to process information in the order it arrives, respecting the temporal structure of data.


Section 02

Why a Standard Neural Network Can't Do This

The Amnesiac Translator
You hire a translator who, between translating each word, takes a pill that wipes their short-term memory. They translate word by word, forgetting everything they just said. The output is grammatically broken gibberish ("King... I... see... cat... the") because meaning lives in the relationships between words, not in words alone.

A feedforward network is that amnesiac translator. It sees x₁, produces an output, then forgets x₁ entirely before seeing x₂. For images, where the position of pixels matters but not their sequential order, this is fine. For language, music, or stock prices, it's fatal.
🚫
Fixed Input Size
Feedforward Limitation
A standard network requires a fixed input size, but sentences vary in length. You can't know upfront whether the input will be 5 words or 500. An RNN processes one step at a time, so any length is fine.
🚫
No Temporal Ordering
Feedforward Limitation
"Cat bites dog" and "Dog bites cat" contain identical words. A bag-of-words model (or any network that ignores order) treats them as the same input. An RNN processes the words left to right and produces a different hidden state for each sentence.
🚫
No Shared Parameters Across Time
Feedforward Limitation
A feedforward model would need separate weights for "word at position 1", "word at position 2", and so on. An RNN uses the same weights at every time step, so it can generalise patterns regardless of where in the sequence they appear.
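
The three limitations above can be made concrete with a minimal NumPy sketch. The toy vocabulary, the hidden size, the random weights and the helper name `encode` are all illustrative assumptions; the point is only that one set of weights handles inputs of any length and still reacts to word order.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "bites": 1, "dog": 2, "the": 3}   # toy vocabulary (assumption)
hidden_size = 4

# One set of weights, reused at every time step
Wx = rng.normal(scale=0.5, size=(hidden_size, len(vocab)))
Wh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
bh = np.zeros(hidden_size)

def encode(words):
    """Run the recurrence over a word list of any length; return the final hidden state."""
    h = np.zeros(hidden_size)
    for w in words:
        x = np.zeros(len(vocab))
        x[vocab[w]] = 1.0                      # one-hot input
        h = np.tanh(Wx @ x + Wh @ h + bh)      # same Wx, Wh, bh at every step
    return h

h_a = encode(["cat", "bites", "dog"])
h_b = encode(["dog", "bites", "cat"])                  # same words, different order
h_c = encode(["the", "cat", "bites", "the", "dog"])    # longer input, same weights

print(np.allclose(h_a, h_b))   # are the two final states identical?
print(h_c.shape)               # final state size does not depend on input length
```

With random weights the two three-word sentences should end in different hidden states, while the five-word sentence is handled by exactly the same parameters.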

Section 03

The RNN Architecture

The magic of an RNN is its recurrent connection: the hidden state h at each time step depends on both the current input and the previous hidden state. The diagram below shows information flowing through an RNN unrolled through time.

🔁 RNN Unrolled Through Time
[Figure: three RNN cells at t = 1, 2, 3 receive inputs x₁, x₂, x₃. The hidden states h₀, h₁, h₂ pass from each cell to the next, and each cell emits an output y₁, y₂, y₃. Legend: xₜ = input, hₜ = hidden state, yₜ = output.]

Each RNN cell receives the current input xₜ and the prior hidden state hₜ₋₁. It produces a new hidden state hₜ (passed on to the next cell) and an optional output yₜ. The same weights W are shared at every time step.


Section 04

The Mathematics

The equations of a vanilla RNN are surprisingly compact. Two lines define everything that happens inside a single cell at time step t.

∑ Core RNN Equations
Hidden state update: hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + bₕ)
Output (optional): yₜ = softmax(Wᵧhₜ + bᵧ)
where hₜ = new hidden state, Wₓ = input weight matrix, Wₕ = hidden weight matrix, bₕ = hidden bias vector, Wᵧ = output weight matrix, bᵧ = output bias vector.
Wₓ: Input Weight
shape: [hidden_size × input_size]
Projects the raw input xₜ into the hidden space. It is shared identically at every time step, which is why the RNN can handle variable-length sequences.
Wₕ: Hidden Weight
shape: [hidden_size × hidden_size]
The recurrent weight. Transforms the previous hidden state hₜ₋₁ and adds it to the current input contribution. This is the memory mechanism.
tanh: Activation
output range: (−1, 1)
Squashes the combined input and memory signal. It keeps the hidden-state values bounded, so activations cannot blow up; gradients can still vanish or explode, as Section 06 explains.
softmax: Output
Σᵢ softmax(z)ᵢ = 1
Converts the raw output scores into a probability distribution over classes. Used wherever the task needs a prediction: at every time step (many-to-many) or only at the final step (many-to-one).

Section 05

The Four RNN Input/Output Patterns

RNNs are flexible about how they consume and produce data. The same core architecture supports four distinct sequence patterns, each suited to different tasks.

📐 RNN Sequence Patterns
One-to-One: a single input produces a single output. Example: image classification.
One-to-Many: a single input produces a sequence of outputs y₁, y₂, y₃. Examples: image captioning, music generation.
Many-to-One: a sequence of inputs x₁, x₂, x₃ produces a single output. Example: sentiment analysis (review → positive/negative).
Many-to-Many: a sequence of inputs x₁, x₂, x₃ produces a sequence of outputs y₁, y₂, y₃. Example: machine translation (English → French).
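
The difference between the many-to-one and many-to-many patterns comes down to which hidden states are passed to the output layer. The sketch below illustrates this with a plain tanh recurrence; the sizes, the random weights and the helper name `hidden_states` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, n_classes, T = 3, 5, 2, 4   # toy sizes (assumptions)

Wx = rng.normal(scale=0.3, size=(hidden_size, input_size))
Wh = rng.normal(scale=0.3, size=(hidden_size, hidden_size))
Wy = rng.normal(scale=0.3, size=(n_classes, hidden_size))

def hidden_states(xs):
    """Return the hidden state at every time step, shape (T, hidden_size)."""
    h = np.zeros(hidden_size)
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(T, input_size))
hs = hidden_states(xs)

many_to_many = hs @ Wy.T       # one output per time step: shape (T, n_classes)
many_to_one = hs[-1] @ Wy.T    # output from the final state only: shape (n_classes,)

print(many_to_many.shape, many_to_one.shape)   # (4, 2) (2,)
```

The recurrence is identical in both cases; only the read-out changes.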

Section 06

The Vanishing Gradient Problem

The Whisper in a Long Corridor
Imagine a corridor of 100 people. A message is whispered from person 1 to person 2, then relayed down the line. But each person speaks at only 70% of the volume of the person before them. By the time the message reaches person 100, it is barely a whisper, essentially gone.

That is the vanishing gradient problem. During backpropagation through time (BPTT), the gradient is multiplied at every time step by the recurrent weight matrix (and by the derivative of tanh, which is at most 1). If the combined factor is slightly smaller than 1, the gradient shrinks exponentially as it flows backward. With a factor of 0.9 over a sequence of 100 steps, the gradient reaching step 1 is scaled by 0.9¹⁰⁰ ≈ 0.000027, which is effectively zero. The early parts of the sequence stop influencing the model's learning.
⚠️
Vanishing vs. Exploding Gradients

If the per-step factor is below 1, gradients vanish and the model cannot learn long-term patterns. If it is above 1, gradients explode and training becomes numerically unstable (NaN values, divergence). Both are caused by repeated matrix multiplication through time. Gradient clipping handles explosion. LSTMs and GRUs were invented to handle vanishing.
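
The arithmetic behind both failure modes can be checked directly. The sketch below first reproduces the scalar figure quoted above, then repeats the experiment with a matrix. Using a scaled orthogonal matrix is a simplifying assumption that makes the per-step factor exact, and the tanh derivative is left out (it could only shrink the gradient further).

```python
import numpy as np

steps = 100
shrink = 0.9 ** steps          # per-step factor slightly below 1
grow = 1.1 ** steps            # per-step factor slightly above 1
print(f"{shrink:.6f}")         # 0.000027
print(f"{grow:.0f}")           # about 13,781

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # orthogonal matrix: preserves vector norms

def backprop_norm(scale, steps=100):
    """Norm of a unit gradient after `steps` multiplications by (scale * Q) transposed."""
    g = np.ones(8) / np.sqrt(8)                # unit-norm starting gradient
    W = scale * Q
    for _ in range(steps):
        g = W.T @ g                            # one step backward through time
    return np.linalg.norm(g)

vanish = backprop_norm(0.9)
explode = backprop_norm(1.1)
print(vanish, explode)
```

The matrix case lands on the same two numbers as the scalar case, which is the sense in which "weights below 1" and "weights above 1" should be read: it is the scaling applied per step that matters.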

📉 Gradient Magnitude Decay Through Time
[Figure: gradient magnitude (vertical axis, 0 to 1.0) as the signal flows backward from t=10 to t=1. By t=1 the gradient is nearly zero, so long-range dependencies are lost.]

Section 07

LSTM: Long Short-Term Memory

The Notebook with a Delete Button
A plain RNN is like a person who tries to remember everything in their head. Their mental space is limited, and important old memories get overwritten by new, noisy data. They're exhausted and forgetful by the time they reach step 100.

An LSTM is like the same person carrying a notebook. When they encounter new information, they consciously decide: Does this go in the notebook? Should I erase something? What should I tell you right now? They have three dedicated mental operations (three gates) governing what to remember, what to forget, and what to output. The notebook (cell state) can carry information across many steps largely untouched, because it is updated only by elementwise scaling and addition, not by squashing activations.
🚪
Forget Gate (fₜ)
sigmoid → (0, 1)
Decides what to erase from the cell state. An output near 0 means forget completely; an output near 1 means keep fully. This selective forgetting is what makes LSTMs powerful.

fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)
📝
Input Gate (iₜ)
sigmoid + tanh
Decides what new information to write into the cell state. The sigmoid decides which positions to update; the tanh creates the candidate values C̃ₜ. Together with the forget gate, they produce the new cell state.

iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi)
C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)
Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ
📤
Output Gate (oₜ)
sigmoid → filtered tanh
Decides what to expose as the next hidden state. The cell state is squashed by tanh and then filtered elementwise by the sigmoid gate oₜ before being emitted.

oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)
hₜ = oₜ ⊙ tanh(Cₜ)
✅
Why LSTM Mitigates the Vanishing Gradient

Along the cell state, Cₜ depends on Cₜ₋₁ only through pointwise multiplication (⊙) by the forget gate, with no squashing activation and no weight matrix in between. This creates a nearly uninterrupted gradient highway from the end of the sequence back to the beginning. As long as the forget gate stays close to 1, the gradient along this path does not have to pass through tanh at every step, so it does not vanish exponentially.
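
To see the three gates working together, here is a minimal NumPy sketch of a single LSTM step. It follows the forget-gate, cell-state and hidden-state equations given in the gate cards; the input gate, output gate and candidate use the standard LSTM formulation. The `params` dictionary, its key names and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates are computed from the concatenation [h_{t-1}, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["Wf"] @ concat + params["bf"])      # forget gate
    i_t = sigmoid(params["Wi"] @ concat + params["bi"])      # input gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate values
    o_t = sigmoid(params["Wo"] @ concat + params["bo"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                       # cell state update
    h_t = o_t * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                               # toy sizes (assumptions)
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W{name}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b{name}"] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):                 # a 6-step toy sequence
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)   # (4,) (4,)
```

Note that `c_t` is built from `c_prev` by an elementwise product and a sum only, which is the gradient highway described above.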


Section 08

GRU: Gated Recurrent Unit

Introduced by Cho et al. in 2014, the GRU is a streamlined LSTM. It merges the forget and input gates into a single update gate and eliminates the separate cell state: the hidden state is the memory. The result is fewer parameters, faster training, and often comparable performance.

LSTM: 3 Gates, 2 States
| Component | Formula |
| --- | --- |
| Forget gate | fₜ = σ(Wf·[hₜ₋₁, xₜ]) |
| Input gate | iₜ = σ(Wi·[hₜ₋₁, xₜ]) |
| Output gate | oₜ = σ(Wo·[hₜ₋₁, xₜ]) |
| Cell state | Cₜ (separate) |
| Hidden state | hₜ (separate) |

GRU: 2 Gates, 1 State
| Component | Formula |
| --- | --- |
| Reset gate | rₜ = σ(Wr·[hₜ₋₁, xₜ]) |
| Update gate | zₜ = σ(Wz·[hₜ₋₁, xₜ]) |
| Candidate | h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) |
| Hidden state | hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ |
| Parameters | ~25% fewer than LSTM |

(Bias terms are omitted from both tables for brevity.)

| Property | Vanilla RNN | LSTM | GRU |
| --- | --- | --- | --- |
| Memory mechanism | Hidden state only | Cell state + hidden state | Hidden state (gated) |
| Vanishing gradient | Severe problem | Largely solved | Largely solved |
| Number of gates | 0 | 3 (forget, input, output) | 2 (reset, update) |
| Parameter count | Smallest | Largest | Medium (~25% less than LSTM) |
| Training speed | Fastest | Slowest | Fast |
| Long sequences | Fails | Excellent | Very good |
| When to use | Short sequences, baselines | Long sequences, NLP, time-series | When you want LSTM performance with less compute |
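
The GRU formulas and the "~25% fewer parameters" figure can both be checked with a short NumPy sketch. The function names `gru_step` and `gated_param_count` are hypothetical helpers, biases are left out of the step for brevity, and the count assumes one bias vector per weight block (PyTorch stores two, which changes the totals but not the 3:4 ratio).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, W):
    """One GRU step (biases omitted for brevity)."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ concat)                                   # reset gate
    z_t = sigmoid(Wz @ concat)                                   # update gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # blend old state and candidate

def gated_param_count(n_blocks, input_size, hidden_size):
    """Weights plus one bias vector per block: an LSTM has 4 blocks, a GRU has 3."""
    return n_blocks * (hidden_size * (hidden_size + input_size) + hidden_size)

rng = np.random.default_rng(0)
H, I = 4, 3                                                      # toy sizes (assumptions)
Wr, Wz, W = (rng.normal(scale=0.1, size=(H, H + I)) for _ in range(3))
h = np.zeros(H)
for x_t in rng.normal(size=(5, I)):
    h = gru_step(x_t, h, Wr, Wz, W)

lstm_params = gated_param_count(4, input_size=128, hidden_size=256)
gru_params = gated_param_count(3, input_size=128, hidden_size=256)
print(lstm_params, gru_params, gru_params / lstm_params)   # 394240 295680 0.75
```

The saving comes entirely from the GRU having three weight blocks (reset, update, candidate) where the LSTM has four (forget, input, output, candidate).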

Section 09

Backpropagation Through Time (BPTT)

Training an RNN requires computing gradients across time. Because the same weights W are used at every step, the chain rule must unroll backward through all the time steps. This is Backpropagation Through Time (BPTT).

🔙 BPTT: How Gradients Flow Backward
Step 1
Compute the forward pass: process the sequence x₁, x₂, …, xT to get all hidden states h₁…hT and outputs y₁…yT.
Step 2
Compute the total loss L = L₁ + L₂ + … + LT (the sum of the losses at each step, or just the final step's loss for many-to-one tasks).
Step 3
Propagate gradients backward through time: ∂L/∂W is the sum of ∂Lₜ/∂W over all t, and each term requires the chain rule through hₜ → hₜ₋₁ → … → h₁.
Step 4
Truncated BPTT: for very long sequences, unroll only k steps backward (e.g. k=32). This trades some gradient accuracy for memory efficiency and speed.
Step 5
Gradient clipping: if the gradient norm exceeds a threshold (e.g. 1.0 or 5.0), scale the gradient down so that its norm equals the threshold. This prevents the exploding gradient problem. PyTorch: nn.utils.clip_grad_norm_(model.parameters(), 1.0)
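
Step 5 describes clipping by norm: all gradients are rescaled together so that their direction is preserved. The NumPy sketch below shows the idea under simple assumptions; `clip_by_global_norm` is a hypothetical helper written for this example, not a library function, and the toy gradients are made up.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale up, only down
    return [g * scale for g in grads], total_norm

grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]        # toy gradients (assumption)
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(round(float(norm_before), 3), round(float(norm_after), 3))   # 9.165 1.0
```

This differs from clipping each value into a fixed range (as `np.clip` does): value clipping can change the direction of the update, whereas norm clipping only shortens it.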

Section 10

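Python: Vanilla RNN from Scratch (NumPy)

Before using PyTorch's built-in RNN, let's build a vanilla RNN entirely in NumPy to demystify the maths. This is a character-level model on a tiny corpus: the code sets up the weights and defines one function that runs the forward pass and BPTT over a chunk of characters. The parameter-update loop itself is not shown.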

import numpy as np

# ── Hyperparameters ────────────────────────────────────────────
hidden_size   = 64    # Number of hidden units
seq_length    = 25    # Truncated BPTT length
learning_rate = 1e-1

# ── Tiny corpus ────────────────────────────────────────────────
data = "hello world, this is a recurrent neural network tutorial"
chars = sorted(set(data))   # sorted so the index mapping is reproducible
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# ── Weight initialisation (small random values) ────────────────
Wx = np.random.randn(hidden_size, vocab_size) * 0.01   # Input → hidden
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden → hidden
Wy = np.random.randn(vocab_size, hidden_size) * 0.01   # Hidden → output
bh = np.zeros((hidden_size, 1))                        # Hidden bias
by = np.zeros((vocab_size, 1))                         # Output bias

def forward(inputs, targets, hprev):
    """Run the forward pass and BPTT over one chunk of character indices.

    inputs, targets: lists of character indices (targets are inputs shifted by one).
    hprev: hidden state carried over from the previous chunk, shape (hidden_size, 1).
    Returns the loss, the clipped gradients, and the last hidden state.
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0

    # ── Forward pass ───────────────────────────────────────────
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1                          # One-hot encode
        hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t-1] + bh)  # Hidden state
        ys[t] = Wy @ hs[t] + by                       # Logits
        exp_y = np.exp(ys[t] - np.max(ys[t]))         # Subtract max for numerical stability
        ps[t] = exp_y / np.sum(exp_y)                 # Softmax
        loss += -np.log(ps[t][targets[t], 0])         # Cross-entropy

    # ── Backward pass (BPTT) ───────────────────────────────────
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])

    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1            # dL/dy (softmax + CE gradient)
        dWy += dy @ hs[t].T
        dby += dy
        dh = Wy.T @ dy + dhnext         # Backprop into hidden state
        dhraw = (1 - hs[t] ** 2) * dh  # tanh derivative
        dbh += dhraw
        dWx += dhraw @ xs[t].T
        dWh += dhraw @ hs[t-1].T
        dhnext = Wh.T @ dhraw

    # Gradient clipping (elementwise, to [-5, 5]) to limit exploding gradients
    for dparam in [dWx, dWh, dWy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)

    return loss, dWx, dWh, dWy, dbh, dby, hs[len(inputs)-1]
💡
What's Happening in One Line

The line hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t-1] + bh) is the entire RNN equation: it combines the current input and the previous hidden state, squashes the sum through tanh, and produces the new memory. That's the whole recurrence in a single line of code.
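
The `forward` function expects `inputs` and `targets` as lists of character indices. One plausible way to cut the corpus into such pairs for truncated BPTT is sketched below; `make_chunks` is a hypothetical helper, and the character mapping is sorted here only so the example is reproducible.

```python
data = "hello world, this is a recurrent neural network tutorial"
chars = sorted(set(data))                       # sorted so the mapping is reproducible
char_to_ix = {ch: i for i, ch in enumerate(chars)}
seq_length = 25

def make_chunks(text, seq_length):
    """Yield (inputs, targets) index lists; targets are inputs shifted by one character."""
    for p in range(0, len(text) - 1, seq_length):
        chunk = text[p : p + seq_length + 1]    # one extra character for the last target
        inputs = [char_to_ix[ch] for ch in chunk[:-1]]
        targets = [char_to_ix[ch] for ch in chunk[1:]]
        yield inputs, targets

chunks = list(make_chunks(data, seq_length))
for inputs, targets in chunks:
    print(len(inputs), len(targets))
```

In a training loop, each pair would be passed to `forward` together with the hidden state returned for the previous chunk, so that memory carries across chunk boundaries while gradients stay within one chunk.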


Section 11

Python: LSTM for Sentiment Analysis (PyTorch)

Netflix Reviews → Positive / Negative
You're building an automated review system. You receive thousands of text reviews per day. You need a model that reads each review word by word, understands the full context, and outputs: positive or negative sentiment. This is a Many-to-One LSTM task: we read the full sequence, then produce a single binary output at the end.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import numpy as np

# ── 1. Define the LSTM Model ───────────────────────────────────
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_layers, dropout=0.3):
        super().__init__()
        # Word → dense vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Stacked LSTM layers
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,       # (batch, seq, feature)
            dropout=dropout if n_layers > 1 else 0.0,
            bidirectional=False     # Use True for Bi-LSTM
        )
        self.dropout = nn.Dropout(dropout)
        self.fc      = nn.Linear(hidden_dim, 1)   # Binary output
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        emb = self.dropout(self.embedding(x))      # (B, T, E)
        out, (hn, cn) = self.lstm(emb)             # out: (B, T, H)
        # Use ONLY the last layer's final hidden state. Note: with right-padded
        # batches this state is taken after the padding tokens have been read.
        last_hidden = self.dropout(hn[-1])          # (B, H)
        logit = self.fc(last_hidden)                # (B, 1)
        return self.sigmoid(logit).squeeze(1)      # (B,)

# ── 2. Instantiate and inspect ─────────────────────────────────
VOCAB_SIZE  = 10_000
EMBED_DIM   = 128
HIDDEN_DIM  = 256
N_LAYERS    = 2
BATCH_SIZE  = 64

model = SentimentLSTM(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, N_LAYERS)
print(model)

# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total_params:,}")

# ── 3. Training loop ───────────────────────────────────────────
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0, 0
    for texts, labels in loader:
        texts, labels = texts.to(device), labels.float().to(device)
        optimizer.zero_grad()
        preds = model(texts)
        loss  = criterion(preds, labels)
        loss.backward()
        # Gradient clipping: critical for RNN/LSTM stability
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
        correct += ((preds > 0.5).float() == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / len(loader), correct / n

# ── 4. Inference on new text ───────────────────────────────────
def predict_sentiment(model, tokenized_text, word2idx, device, max_len=200):
    model.eval()
    idxs = [word2idx.get(w, 1) for w in tokenized_text[:max_len]]   # 1 = index used for unknown words
    idxs += [0] * (max_len - len(idxs))   # Pad to fixed length (0 = padding_idx)
    x = torch.tensor([idxs], dtype=torch.long).to(device)
    with torch.no_grad():
        prob = model(x).item()
    return "Positive 😊" if prob > 0.5 else "Negative 😞", prob
OUTPUT
SentimentLSTM(
  (embedding): Embedding(10000, 128, padding_idx=0)
  (lstm): LSTM(128, 256, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
Trainable parameters: 2,201,857
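
The trainable-parameter count for this configuration can be checked by hand. The sketch below assumes PyTorch's parameter layout for `nn.LSTM` (four gate blocks per layer, each with input weights, hidden weights and two bias vectors); `lstm_layer_params` is a hypothetical helper written for this check.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters in one nn.LSTM layer: 4 gate blocks, each with W_ih, W_hh and two biases."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size)

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 10_000, 128, 256

embedding = VOCAB_SIZE * EMBED_DIM                     # 1,280,000
lstm_1 = lstm_layer_params(EMBED_DIM, HIDDEN_DIM)      # 395,264 (layer 1 reads embeddings)
lstm_2 = lstm_layer_params(HIDDEN_DIM, HIDDEN_DIM)     # 526,336 (layer 2 reads layer 1's output)
fc = HIDDEN_DIM * 1 + 1                                # 257 (weights + bias)

total = embedding + lstm_1 + lstm_2 + fc
print(f"{total:,}")                                    # 2,201,857
```

Well over half of the parameters sit in the embedding table, which is typical for text models with a large vocabulary and a modest hidden size.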

Section 12

Python: GRU for Stock Price Forecasting

Predicting Tomorrow's Closing Price from 60 Days of History
A financial analyst wants to predict tomorrow's closing price of a stock, given the past 60 trading days of prices, volumes, and technical indicators. Each day is a time step. The past 60 steps influence the prediction. This is a Many-to-One GRU task on multivariate time series.
import torch
import torch.nn as nn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# ── GRU Model ──────────────────────────────────────────────────
class StockGRU(nn.Module):
    def __init__(self, input_features, hidden_dim, n_layers, output_size=1):
        super().__init__()
        self.gru = nn.GRU(
            input_size=input_features,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            dropout=0.2 if n_layers > 1 else 0
        )
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len=60, features=5)
        gru_out, hn = self.gru(x)           # gru_out: (B, 60, H)
        last = gru_out[:, -1, :]            # Take last time step (B, H)
        return self.fc(last)               # (B, 1): tomorrow's price

# ── Data preparation: sliding window ───────────────────────────
def create_sequences(data, seq_len=60):
    """Slide a window of length seq_len over the data."""
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i : i + seq_len])
        y.append(data[i + seq_len, 0])   # Predict closing price (col 0)
    return np.array(X), np.array(y)

# ── Simulate some stock data (5 features per day) ──────────────
np.random.seed(42)
raw_data = np.random.randn(500, 5)  # close, open, high, low, volume
seq_len  = 60

# ── Scale using statistics from the training period only ───────
n_train_rows = int(len(raw_data) * 0.8)   # first 80% of days
scaler = MinMaxScaler()
scaler.fit(raw_data[:n_train_rows])       # fit on training rows only (no look-ahead)
scaled = scaler.transform(raw_data)
X, y   = create_sequences(scaled, seq_len=seq_len)

# ── Train/test split (chronological, never random) ─────────────
split = n_train_rows - seq_len            # windows whose target day is a training day
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

X_train = torch.FloatTensor(X_train)
X_test  = torch.FloatTensor(X_test)
y_train = torch.FloatTensor(y_train).unsqueeze(1)
y_test  = torch.FloatTensor(y_test).unsqueeze(1)

# ── Instantiate and train ──────────────────────────────────────
model     = StockGRU(input_features=5, hidden_dim=128, n_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    pred = model(X_train)
    loss = criterion(pred, y_train)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Train Loss: {loss.item():.6f}")
OUTPUT (illustrative; exact values depend on the random weight initialisation, which is not seeded here)
Epoch   0 | Train Loss: 0.088412
Epoch  10 | Train Loss: 0.052317
Epoch  20 | Train Loss: 0.031084
Epoch  30 | Train Loss: 0.019762
Epoch  40 | Train Loss: 0.013511
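⚠️
Critical: Never Shuffle Time Series Data

In a standard classification task, shuffling the data before splitting is good practice. For time series, it is catastrophic: a random split puts windows from the future into the training set, so the model is evaluated on periods it has effectively already seen. Always split train/test by time (e.g. the first 80% is train, the last 20% is test), and never reorder the time steps inside a window, because that ordering is exactly what the GRU learns from. Shuffling the order of whole training windows after a chronological split is harmless. Never use train_test_split(..., shuffle=True) on sequential data.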
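
A chronological split, with scaling statistics taken from the training period only, might look like the NumPy sketch below. The random-walk data, the manual min-max scaling and the variable names are assumptions for illustration; with scikit-learn the same idea is to call `fit` on the training rows and `transform` on everything.

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(size=(500, 5)).cumsum(axis=0)   # toy random-walk "prices" (assumption)
seq_len = 60

n_train_rows = int(len(raw) * 0.8)               # first 80% of days form the training period
lo = raw[:n_train_rows].min(axis=0)              # min-max statistics from training rows only
hi = raw[:n_train_rows].max(axis=0)
scaled = (raw - lo) / (hi - lo)                  # test rows may fall outside [0, 1]; that is expected

# Sliding windows: 60 days of features, target is the next day's first column
X = np.stack([scaled[i : i + seq_len] for i in range(len(scaled) - seq_len)])
y = scaled[seq_len:, 0]

split = n_train_rows - seq_len                   # windows whose target day is a training day
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

print(X_train.shape, X_test.shape)               # (340, 60, 5) (100, 60, 5)
```

Every training window, including its target, lies inside the first 400 days, so nothing from the test period influences either the scaler or the model.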


Section 13

Bidirectional RNNs

The Editor Who Reads Twice
A good editor doesn't just read a sentence left to right; they also scan it right to left to catch things they missed. The word "lead" in "The lead singer leads the band" means different things depending on both what comes before and after it.

A Bidirectional RNN runs two RNN layers, one forward through the sequence and one backward, then concatenates their hidden states at each time step. The model sees full context at every position: both what came before and what comes after.

Unidirectional RNN
| Property | Value |
| --- | --- |
| Context at step t | Only x₁…xₜ (past) |
| Hidden dim | H |
| Output dim at each t | H |
| Good for | Generation, streaming, cases where the future is unknown |
| PyTorch param | bidirectional=False |

Bidirectional RNN
| Property | Value |
| --- | --- |
| Context at step t | Full sequence x₁…xT |
| Hidden dim | H (each direction) |
| Output dim at each t | 2H (concatenated) |
| Good for | Classification, tagging, translation encoder |
| PyTorch param | bidirectional=True |
import torch
import torch.nn as nn

# Bidirectional LSTM: a 2-line change from the standard LSTM
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True   # ← key change
        )
        # hidden_dim * 2 because forward + backward states are concatenated
        self.fc = nn.Linear(hidden_dim * 2, n_classes)

    def forward(self, x):
        emb = self.embedding(x)            # (B, T, E)
        out, (hn, _) = self.lstm(emb)      # hn: (2*layers, B, H)
        # Concatenate last forward and last backward hidden state
        fwd = hn[-2, :, :]                 # Forward last layer
        bwd = hn[-1, :, :]                 # Backward last layer
        combined = torch.cat([fwd, bwd], dim=1)  # (B, 2H)
        return self.fc(combined)           # (B, n_classes)
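
What the bidirectional flag does can be reproduced with two plain tanh recurrences in NumPy. This is a sketch under toy assumptions (random weights, tiny sizes, hypothetical helper names `make_weights` and `run`), not a re-implementation of PyTorch's layer.

```python
import numpy as np

rng = np.random.default_rng(0)
I, H, T = 3, 4, 5                                   # toy sizes (assumptions)

def make_weights():
    return rng.normal(scale=0.3, size=(H, I)), rng.normal(scale=0.3, size=(H, H))

def run(xs, Wx, Wh):
    """Plain tanh RNN over xs; returns the hidden states, shape (T, H)."""
    h, states = np.zeros(H), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

W_fwd, W_bwd = make_weights(), make_weights()       # each direction has its own weights
xs = rng.normal(size=(T, I))

fwd = run(xs, *W_fwd)                               # reads x_1 ... x_T
bwd = run(xs[::-1], *W_bwd)[::-1]                   # reads x_T ... x_1, then re-aligned to time order
out = np.concatenate([fwd, bwd], axis=1)            # (T, 2H): past and future context at each step
print(out.shape)                                    # (5, 8)

# Perturb only the final input and compare the states at the first time step
xs_changed = xs.copy()
xs_changed[-1] += 1.0
fwd2 = run(xs_changed, *W_fwd)
bwd2 = run(xs_changed[::-1], *W_bwd)[::-1]
print(np.allclose(fwd[0], fwd2[0]), np.allclose(bwd[0], bwd2[0]))
```

The forward state at the first step cannot see the change to the last input, while the backward state at the same position should differ: that is the extra "future" context a bidirectional model provides.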

Section 14

Real-World Applications of RNNs

💬
Natural Language Processing
Sentiment analysis, named entity recognition, part-of-speech tagging. For tagging, the LSTM reads a sentence and labels each word, a many-to-many task. Still used in low-resource settings where transformers are overkill.
many-to-many / many-to-one
🌐
Machine Translation
Encoder-decoder architecture with attention: an LSTM encodes a source sentence into a context vector; a decoder LSTM generates the target translation word by word. The precursor to the Transformer.
many-to-many (seq2seq)
📈
Time Series Forecasting
Predicting stock prices, energy demand, weather patterns, IoT sensor readings. GRUs and LSTMs excel at learning temporal patterns across dozens or hundreds of past steps.
many-to-one / many-to-many
🎵
Music Generation
An LSTM trained on MIDI sequences learns chord progressions, rhythmic patterns, and melodic structure. At inference time, it autoregressively generates new music note by note.
one-to-many (generative)
🗣️
Speech Recognition
Audio frames → phonemes → words. Bidirectional LSTMs with CTC loss were the state of the art before transformers. They process variable-length audio and align it with transcriptions.
many-to-many (CTC)
🧬
Genomics
DNA sequences are strings over a 4-character alphabet (ACGT). LSTMs learn motifs, splice sites, and regulatory elements across long genomic contexts, patterns that are hard to capture with fixed sliding-window methods.
many-to-many / classification

Section 15

RNN vs. Transformer: What Changed

Since 2017 and the "Attention Is All You Need" paper, Transformers have largely replaced RNNs for NLP tasks. Understanding why helps you choose the right tool.

| Property | RNN / LSTM / GRU | Transformer |
| --- | --- | --- |
| Parallelisation | Sequential: can't parallelise over time steps | Fully parallel during training: all tokens processed at once |
| Long-range dependencies | Hard: gradient must travel through every step | Easy: direct attention between any two tokens |
| Memory of sequence | Compressed into fixed-size hidden state | Full context via O(T²) attention |
| Compute cost | O(T): linear in sequence length | O(T²): quadratic in sequence length |
| Streaming / online inference | Natural: process one token at a time with constant memory | Needs the whole context so far; a KV cache avoids recomputation but grows with sequence length |
| Small data | Often better: less data hungry | Needs large datasets to shine |
| Best use today | Time series, edge devices, streaming, small datasets | Large-scale NLP, vision, any task where data is abundant |
ℹ️
RNNs Are Not Dead

RNNs remain a common choice for time-series forecasting, embedded/edge devices with limited memory, online streaming applications, and small-data regimes where a Transformer would massively overfit. Newer architectures like Mamba (a state-space model) and RWKV combine RNN-like sequential efficiency with Transformer-like expressiveness, representing the next evolution of recurrent architectures.


Section 16

Golden Rules

🟡 RNN / LSTM / GRU: Non-Negotiable Rules
1
Always clip gradients. Call nn.utils.clip_grad_norm_(model.parameters(), 1.0) after loss.backward() and before every optimizer.step(). Without this, training an RNN on long sequences can diverge due to exploding gradients. This is not optional.
2
Never shuffle time-series data. Split chronologically: train on the past, test on the future. Shuffling leaks future information into training and produces fraudulently optimistic metrics.
3
Use GRU as your default, not vanilla RNN. The vanilla RNN has no gating mechanism and typically struggles to learn dependencies spanning more than a few dozen steps because of vanishing gradients. Start with GRU; switch to LSTM if the task requires modelling very long dependencies.
4
Scale your inputs. RNNs are sensitive to input magnitude. Always normalise time-series features with MinMaxScaler or StandardScaler fitted only on the training set. Apply the same transform to test data without refitting.
5
Use batch_first=True in PyTorch. PyTorch's default RNN input shape is (seq, batch, features), which is counterintuitive. Always pass batch_first=True to get the natural (batch, seq, features) shape.
6
Detach hidden states between batches. When processing sequences in batches (e.g. language modelling), always do h = h.detach() before the next batch. Without this, gradients will flow through the entire history, causing memory explosion.
7
Prefer bidirectional LSTM for understanding tasks, unidirectional for generation. A bidirectional model sees the full sequence before producing outputs, which is great for classification or translation encoding. A generative model (text, music) must be unidirectional: it cannot look into the future it hasn't produced yet.
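
Rules 1 and 6 combine naturally in a truncated-BPTT loop. The PyTorch sketch below is one minimal way to write it; the toy data, layer sizes, window length `k` and the choice of SGD are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=3, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(gru.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

stream = torch.randn(4, 40, 3)         # toy data: 4 parallel streams, 40 steps (assumption)
targets = torch.randn(4, 40, 1)
k = 10                                  # truncated-BPTT window length

h = None                                # nn.GRU treats None as an all-zero initial state
for start in range(0, stream.size(1), k):
    x = stream[:, start : start + k]
    y = targets[:, start : start + k]
    if h is not None:
        h = h.detach()                  # keep the values, cut the graph back to earlier chunks
    out, h = gru(x, h)
    loss = nn.functional.mse_loss(head(out), y)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(params, 1.0)   # clip after backward, before the step
    optimizer.step()

print(h.shape)
```

Without the `detach()` call, the second chunk's backward pass would try to reach into the first chunk's graph, which PyTorch has already freed; the hidden state's values are still carried forward, only the gradient path is cut.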