Recurrent Neural Networks (RNN)

Section 01

The Story That Explains Recurrent Neural Networks

📖 Real World Analogy

Reading a Novel — One Word at a Time

Imagine you are reading the sentence: "The bank by the river was steep, so she decided to climb it carefully."

When you reach the word "bank", you don't forget what you have just read, and you keep reading with it in mind. Your brain holds a running memory (context) of all the prior words, so as soon as "by the river" arrives you settle on riverbank rather than financial institution, without having to start the sentence over.

Now imagine a forgetful reader who reads each word in complete isolation, with no memory of what came before. Every word is processed cold. Would they understand this sentence? No. They would guess "bank" means money every time.

That forgetful reader is a standard feedforward neural network applied to sequences. Recurrent Neural Networks (RNNs) are the reader who remembers — they maintain a hidden state that carries context from every previous step forward.

An RNN is a type of neural network designed for sequential data — data where the order and history of inputs matters. Unlike a standard network that maps one input to one output independently, an RNN loops: it feeds its output (a hidden state) back into itself at the next time step, building a running memory of the sequence it has seen so far.

🧠

The Core Insight

The world is full of sequences — language, music, time-series, video, DNA. Most real phenomena unfold in time. RNNs are the first major class of deep learning model built to process information in the order it arrives, respecting the temporal structure of data.

Section 02

Why a Standard Neural Network Can't Do This

📖 Story

The Amnesiac Translator

You hire a translator who, between translating each word, takes a pill that wipes their short-term memory. They translate word by word, forgetting everything they just said. The output is grammatically broken gibberish — "King... I... see... cat... the" — because meaning lives in relationships between words, not in words alone.

A feedforward network is that amnesiac translator. It sees x₁, produces an output, then forgets x₁ entirely before seeing x₂. For images — where the position of pixels matters but not their sequential order — this is fine. For language, music, or stock prices, it's fatal.

🚫

Fixed Input Size

Feedforward Limitation

A standard network requires a fixed input size. But sentences have variable lengths. You can't know upfront whether the input will be 5 words or 500. RNNs process one step at a time — any length is fine.

🚫

No Temporal Ordering

Feedforward Limitation

"Cat bites dog" and "Dog bites cat" have identical words. A bag-of-words model (or a network ignoring order) treats them as the same. An RNN processes left to right and produces different hidden states for each sentence.

🚫

No Shared Parameters Across Time

Feedforward Limitation

A feedforward model would need separate weights for "word at position 1", "word at position 2", etc. An RNN uses the same weights at every time step — meaning it can generalise patterns regardless of where in the sequence they appear.
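The three limitations above can be made concrete with a small NumPy sketch. The weights are random and untrained, and the toy three-word vocabulary and the names `bag_of_words` and `rnn_final_state` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "bites": 1, "dog": 2}   # toy vocabulary (assumption)
hidden_size = 4

# One set of weights, reused at every time step
Wx = rng.normal(size=(hidden_size, len(vocab)))
Wh = rng.normal(size=(hidden_size, hidden_size))

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def bag_of_words(sentence):
    """Order-free representation: sum of one-hot vectors."""
    return sum(one_hot(w) for w in sentence)

def rnn_final_state(sentence):
    """Run the recurrence h_t = tanh(Wx x_t + Wh h_{t-1}) over the sentence."""
    h = np.zeros(hidden_size)
    for w in sentence:
        h = np.tanh(Wx @ one_hot(w) + Wh @ h)
    return h

a = ["cat", "bites", "dog"]
b = ["dog", "bites", "cat"]

same_bag = np.array_equal(bag_of_words(a), bag_of_words(b))
same_rnn = np.allclose(rnn_final_state(a), rnn_final_state(b))
print("bag-of-words identical:", same_bag)   # True: word order is lost
print("RNN states identical:  ", same_rnn)   # False: word order changes h

# The same two matrices handle a 3-word and a 300-word input
long_state = rnn_final_state(a * 100)
print("state shape for 300 words:", long_state.shape)
```

The number of parameters never changes with sequence length; only the number of loop iterations does.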

Section 03

The RNN Architecture — Animated

The magic of an RNN is its recurrent connection: the hidden state h at each time step depends on both the current input and the previous hidden state. Below is an animated diagram of information flowing through an RNN.

🔁 RNN — Unrolled Through Time (Animated)

Each RNN cell receives the current input xₜ and the prior hidden state hₜ₋₁. It produces a new hidden state hₜ (passed right) and an optional output yₜ. The same weights W are shared at every time step.

Section 04

The Mathematics — Animated Equations

The equations of a vanilla RNN are surprisingly compact. Two lines define everything that happens inside a single cell at time step t.

∑ Core RNN Equations (Animated)
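The two equations themselves are not present in this text. Judging from the component descriptions that follow and from the NumPy code in Section 10, they are most likely the standard vanilla-RNN pair below. The output weight W_y and the biases b_h and b_y are taken from that code and are assumptions as far as this section is concerned.

```latex
\begin{aligned}
h_t &= \tanh\left(W_x x_t + W_h h_{t-1} + b_h\right) \\
y_t &= \operatorname{softmax}\left(W_y h_t + b_y\right)
\end{aligned}
```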

Wₓ — Input Weight

shape: [hidden_size × input_size]

Projects the raw input xₜ into the hidden space. Shared identically at every time step — this is why the RNN can handle variable-length sequences.

Wₕ — Hidden Weight

shape: [hidden_size × hidden_size]

The recurrent weight. Transforms the previous hidden state hₜ₋₁ and adds it to the current input contribution. This is the memory mechanism.

tanh — Activation

output range: (−1, 1)

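Squashes the combined input+memory signal. Keeps hidden-state values bounded, so activations cannot blow up as the sequence grows (gradients still can). Because the output saturates near ±1, very large pre-activations are compressed rather than passed on at full strength.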

softmax — Output

Σ softmax(z) = 1

Converts the raw output scores into a probability distribution over classes. Applied wherever the task needs a class prediction: at every time step (many-to-many) or only after the final step (many-to-one).

Section 05

The Four RNN Input/Output Patterns

RNNs are flexible about how they consume and produce data. The same core architecture supports four distinct sequence patterns, each suited to different tasks.

📐 RNN Sequence Patterns
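As a rough illustration of two of these patterns, many-to-many and many-to-one, the sketch below runs one shared recurrence and reads its hidden states in two different ways. The weights are random, and the sizes and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
T, input_size, hidden_size, n_classes = 6, 3, 5, 2

Wx = rng.normal(scale=0.5, size=(hidden_size, input_size))
Wh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
Wy = rng.normal(scale=0.5, size=(n_classes, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.normal(size=(T, input_size))   # one sequence of T steps

# Shared recurrence: collect the hidden state at every step
h = np.zeros(hidden_size)
states = []
for x in xs:
    h = np.tanh(Wx @ x + Wh @ h)
    states.append(h)

# Many-to-many: one prediction per time step (e.g. tagging)
per_step = np.stack([softmax(Wy @ s) for s in states])   # shape (T, n_classes)

# Many-to-one: a single prediction from the final state (e.g. sentiment)
final = softmax(Wy @ states[-1])                         # shape (n_classes,)

print(per_step.shape, final.shape)
```

The recurrence is identical in both cases; the pattern is decided by which hidden states are passed to the output layer.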

Section 06

The Vanishing Gradient Problem

📖 Story

The Whisper in a Long Corridor

Imagine a corridor of 100 people. A message is whispered from person 1 to person 2, then relayed down the line. But each person speaks at only 70% the volume of the person before them. By the time the message reaches person 100, it is barely a whisper — essentially gone.

That is the vanishing gradient problem. During backpropagation through time (BPTT), the gradient is multiplied at every time step by a factor built from the recurrent weight matrix and the tanh derivative. If the size of that factor is slightly below 1, the gradient shrinks exponentially as it flows backward. With a per-step factor of 0.9 and a sequence of 100 steps, the gradient reaching step 1 is scaled by 0.9¹⁰⁰ ≈ 0.000027, which is effectively zero. (At the corridor's 70% per step it would be about 3 × 10⁻¹⁶.) The early parts of the sequence stop influencing the model's learning.

⚠️

Vanishing vs. Exploding Gradients

If the per-step factor (roughly, the largest singular value of the recurrent Jacobian) stays below 1, gradients vanish and the model cannot learn long-term patterns. If it stays above 1, gradients explode and training becomes numerically unstable (NaN values, divergence). Both are caused by repeated matrix multiplication through time. Gradient clipping handles explosion. LSTMs and GRUs were invented to handle vanishing.

📉 Gradient Magnitude Decay Through Time (Animated)
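The decay can be reproduced with plain arithmetic. This sketch replaces the real per-step factor (a matrix) with a single scalar, which is a simplifying assumption; `gradient_scale` is an illustrative helper name.

```python
# Scalar stand-in for the per-step gradient factor (assumption: in a real
# RNN this factor is a matrix, Wh^T scaled by the tanh derivative).
def gradient_scale(factor, steps):
    """Size of a gradient after being multiplied by `factor` at each step."""
    return factor ** steps

for factor in (0.7, 0.9, 1.0, 1.1):
    scale = gradient_scale(factor, 100)
    print(f"factor {factor}: after 100 steps -> {scale:.3e}")
```

A factor only 10% below 1 shrinks the signal by almost five orders of magnitude over 100 steps, while a factor 10% above 1 grows it by about four.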

Section 07

LSTM — Long Short-Term Memory

📖 Story

The Notebook with a Delete Button

A plain RNN is like a person who tries to remember everything in their head — but their mental space is limited, and important old memories get overwritten by new noisy data. They're exhausted and forgetful by the time they reach step 100.

An LSTM is like the same person carrying a notebook. When they encounter new information, they consciously decide: Does this go in the notebook? Should I erase something? What should I tell you right now? They have three dedicated mental operations, the three gates, governing what to remember, what to forget, and what to output. The notebook (cell state) can carry information across hundreds of steps largely intact, because it is updated only by element-wise scaling and addition, not by squashing activations.

🚪

Forget Gate (fₜ)

sigmoid → (0, 1)

Decides what to erase from the cell state. Output near 0 = forget almost completely. Output near 1 = keep almost fully. This selective forgetting is what makes LSTMs powerful.

fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf)

📝

Input Gate (iₜ)

sigmoid + tanh

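Decides what new information to write into the cell state. The sigmoid (iₜ) decides which positions to update; the tanh creates candidate values (C̃ₜ). The cell state update then combines the forget gate and the input gate.

iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi)

C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)

Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ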

📤

Output Gate (oₜ)

sigmoid → filtered tanh

Decides what to expose as the next hidden state. The cell state is squashed by tanh and then filtered by the sigmoid gate oₜ before being emitted.

oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo)

hₜ = oₜ ⊙ tanh(Cₜ)

✅

Why LSTM Solves the Vanishing Gradient

The cell state Cₜ is carried forward by pointwise multiplication (⊙) with the forget gate plus an additive update, with no squashing activation on that path. This creates a nearly uninterrupted gradient highway from the end of the sequence back to the beginning. The gradient along the cell state no longer has to pass through tanh and the recurrent weight matrix at every step, so as long as the forget gate stays close to 1 it does not vanish exponentially.
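Putting the gates together, a single LSTM time step might look like the NumPy sketch below. It follows the gate formulas in this section and uses the standard candidate definition C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc). The `params` dictionary layout, the function name `lstm_step`, and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step built from the forget, input and output gates."""
    concat = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    f_t = sigmoid(params["Wf"] @ concat + params["bf"])      # forget gate
    i_t = sigmoid(params["Wi"] @ concat + params["bi"])      # input gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate values
    o_t = sigmoid(params["Wo"] @ concat + params["bo"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                       # cell state update
    h_t = o_t * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b{gate}"] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, params)

print("h:", h.shape, "c:", c.shape)
```

Note that `c_t` is never passed through a weight matrix or an activation on its way to the next step; only `h_t` is.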

Section 08

GRU — Gated Recurrent Unit

Introduced by Cho et al. in 2014, the GRU is a streamlined LSTM. It merges the forget and input gates into a single update gate, and eliminates the separate cell state — the hidden state is the memory. The result: fewer parameters, faster training, often comparable performance.

LSTM — 3 Gates, 2 States

Component	Formula
Forget gate	fₜ = σ(Wf·[hₜ₋₁, xₜ])
Input gate	iₜ = σ(Wi·[hₜ₋₁, xₜ])
Output gate	oₜ = σ(Wo·[hₜ₋₁, xₜ])
Cell state	Cₜ (separate)
Hidden state	hₜ (separate)

GRU — 2 Gates, 1 State

Component	Formula
Reset gate	rₜ = σ(Wr·[hₜ₋₁, xₜ])
Update gate	zₜ = σ(Wz·[hₜ₋₁, xₜ])
Candidate	h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
Hidden state	hₜ = (1−zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
Parameters	~25% fewer than LSTM
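The GRU formulas in the table translate almost line for line into NumPy. This is a minimal sketch with biases omitted, as in the table; `gru_step` and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, W):
    """One GRU step using the reset/update/candidate formulas (no biases)."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ concat)                                   # reset gate
    z_t = sigmoid(Wz @ concat)                                   # update gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # blend old and new

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
shape = (hidden_size, hidden_size + input_size)
Wr, Wz, W = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a 5-step toy sequence
    h = gru_step(x_t, h, Wr, Wz, W)
print(h.shape)
```

One caveat when comparing against library code: PyTorch's nn.GRU documents the update gate with the roles swapped (there zₜ weights the old state and 1−zₜ the candidate), so the two conventions are equivalent but not interchangeable weight for weight.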

Property	Vanilla RNN	LSTM	GRU
Memory mechanism	Hidden state only	Cell state + hidden state	Hidden state (gated)
Vanishing gradient	Severe problem	Largely solved	Largely solved
Number of gates	0	3 (forget, input, output)	2 (reset, update)
Parameter count	Smallest	Largest	Medium (~25% less than LSTM)
Training speed	Fastest	Slowest	Fast
Long sequences	Fails	Excellent	Very good
When to use	Short sequences, baselines	Long sequences, NLP, time-series	When you want LSTM performance with less compute
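The "~25% fewer parameters" figure can be checked with a back-of-the-envelope count. The sketch assumes a single layer with one bias vector per weight block; PyTorch stores two bias vectors per block, so its reported totals are slightly higher. `recurrent_params` is a hypothetical helper.

```python
def recurrent_params(input_size, hidden_size, n_blocks):
    """Weights plus one bias vector per block, for a single recurrent layer."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

I, H = 128, 256
rnn  = recurrent_params(I, H, 1)   # one tanh block
gru  = recurrent_params(I, H, 3)   # reset, update, candidate
lstm = recurrent_params(I, H, 4)   # forget, input, output, candidate

print(f"RNN {rnn:,} | GRU {gru:,} | LSTM {lstm:,}")
print(f"GRU / LSTM = {gru / lstm:.2f}")
```

The ratio is 3/4 regardless of the sizes chosen, because the GRU has three weight blocks where the LSTM has four.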

Section 09

Backpropagation Through Time (BPTT)

Training an RNN requires computing gradients across time. Because the same weights W are used at every step, the chain rule must unroll backward through all the time steps — this is Backpropagation Through Time (BPTT).

🔙 BPTT — How Gradients Flow Backward

Step 1

Compute the forward pass: process the sequence x₁, x₂, …, x_T to get all hidden states h₁…h_T and outputs y₁…y_T.

Step 2

Compute the total loss L = L₁ + L₂ + … + L_T (sum of losses at each step, or just the final step's loss L_T for many-to-one tasks).

Step 3

Propagate gradients backward through time: ∂L/∂W is the sum of ∂Lₜ/∂W over t = 1…T, and each term requires the chain rule through hₜ → hₜ₋₁ → … → h₁.

Step 4

Truncated BPTT: For very long sequences, unroll only k steps backward (e.g. k=32). This trades some gradient accuracy for memory efficiency and speed.

Step 5

Gradient clipping: If the gradient norm exceeds a threshold (e.g. 1.0 or 5.0), scale it down. This prevents the exploding gradient problem. PyTorch: nn.utils.clip_grad_norm_(model.parameters(), 1.0)
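Steps 4 and 5 can be sketched without any framework. The code below is a simplified NumPy illustration of clipping by global norm and of cutting a long sequence into truncation windows; the helper names `clip_by_global_norm` and `truncated_chunks` and the toy gradients are assumptions, not PyTorch internals.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

def truncated_chunks(sequence, k):
    """Split a long sequence into consecutive chunks of at most k steps."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]     # toy gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")

chunks = truncated_chunks(list(range(100)), k=32)
print([len(c) for c in chunks])   # [32, 32, 32, 4]
```

Clipping by norm rescales every gradient by the same factor, so the update direction is preserved and only its length changes.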

Section 10

Python — Vanilla RNN from Scratch (NumPy)

Before using PyTorch's built-in RNN, let's build a vanilla RNN entirely in NumPy to demystify the maths. The function below runs the forward pass and BPTT over one chunk of a character-level sequence drawn from a tiny corpus; the parameter-update loop is left out.

import numpy as np

# ── Hyperparameters ────────────────────────────────────────────
hidden_size  = 64    # Number of hidden units
seq_length   = 25    # Truncated BPTT length
learning_rate = 1e-1

# ── Tiny corpus ────────────────────────────────────────────────
data = "hello world, this is a recurrent neural network tutorial"
chars = sorted(set(data))                # Sorted so the index mapping is reproducible
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# ── Weight initialisation (small random Gaussian values) ──────
Wx = np.random.randn(hidden_size, vocab_size) * 0.01   # Input → hidden
Wh = np.random.randn(hidden_size, hidden_size) * 0.01  # Hidden → hidden
Wy = np.random.randn(vocab_size, hidden_size) * 0.01   # Hidden → output
bh = np.zeros((hidden_size, 1))                          # Hidden bias
by = np.zeros((vocab_size, 1))                           # Output bias

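def forward(inputs, targets, hprev):
    """Run the forward pass and BPTT over one chunk.

    inputs, targets: lists of character indices; hprev: (hidden_size, 1) state.
    Returns the loss, the five parameter gradients, and the last hidden state.
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0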

    # ── Forward pass ──────────────────────────────────────────────
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1                          # One-hot encode
        hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t-1] + bh)  # Hidden state
        ys[t] = Wy @ hs[t] + by                       # Logits
        e = np.exp(ys[t] - np.max(ys[t]))             # Shift logits for numerical stability
        ps[t] = e / np.sum(e)                         # Softmax
        loss += -np.log(ps[t][targets[t], 0])        # Cross-entropy

    # ── Backward pass (BPTT) ──────────────────────────────────────
    dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])

    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1            # dL/dy (softmax + CE gradient)
        dWy += dy @ hs[t].T
        dby += dy
        dh = Wy.T @ dy + dhnext         # Backprop into hidden state
        dhraw = (1 - hs[t] ** 2) * dh  # tanh derivative
        dbh += dhraw
        dWx += dhraw @ xs[t].T
        dWh += dhraw @ hs[t-1].T
        dhnext = Wh.T @ dhraw

    # Element-wise value clipping to limit exploding gradients
    # (clipping by global norm is the common alternative)
    for dparam in [dWx, dWh, dWy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)

    return loss, dWx, dWh, dWy, dbh, dby, hs[len(inputs)-1]

💡

What's Happening in One Line

The line hs[t] = np.tanh(Wx @ xs[t] + Wh @ hs[t-1] + bh) is the entire RNN equation: it combines the current input and previous hidden state, squashes the sum through tanh, and produces the new memory. That's the whole recurrence in one line of code.

Section 11

Python — LSTM for Sentiment Analysis (PyTorch)

📖 Real Problem

Netflix Reviews → Positive / Negative

You're building an automated review system. You receive thousands of text reviews per day. You need a model that reads each review word by word, understands the full context, and outputs: positive or negative sentiment. This is a Many-to-One LSTM task — we read the full sequence, then produce a single binary output at the end.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import numpy as np

# ── 1. Define the LSTM Model ───────────────────────────────────
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_layers, dropout=0.3):
        super().__init__()
        # Word → dense vector
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Stacked LSTM layers
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,       # (batch, seq, feature)
            dropout=dropout if n_layers > 1 else 0.0,
            bidirectional=False     # Use True for Bi-LSTM
        )
        self.dropout = nn.Dropout(dropout)
        self.fc      = nn.Linear(hidden_dim, 1)   # Binary output
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        emb = self.dropout(self.embedding(x))      # (B, T, E)
        out, (hn, cn) = self.lstm(emb)             # out: (B, T, H)
        # Use ONLY the last time step's hidden state
        last_hidden = self.dropout(hn[-1])          # (B, H)
        logit = self.fc(last_hidden)                # (B, 1)
        return self.sigmoid(logit).squeeze(1)      # (B,)

# ── 2. Instantiate and inspect ─────────────────────────────────
VOCAB_SIZE  = 10_000
EMBED_DIM   = 128
HIDDEN_DIM  = 256
N_LAYERS    = 2
BATCH_SIZE  = 64

model = SentimentLSTM(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, N_LAYERS)
print(model)

# Count trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total_params:,}")

# ── 3. Training loop ───────────────────────────────────────────
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss, correct = 0, 0
    for texts, labels in loader:
        texts, labels = texts.to(device), labels.float().to(device)
        optimizer.zero_grad()
        preds = model(texts)
        loss  = criterion(preds, labels)
        loss.backward()
        # ── Gradient clipping — critical for RNN/LSTM stability
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        total_loss += loss.item()
        correct += ((preds > 0.5).float() == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / len(loader), correct / n

# ── 4. Inference on new text ───────────────────────────────────
def predict_sentiment(model, tokenized_text, word2idx, device, max_len=200):
    model.eval()
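    idxs = [word2idx.get(w, 1) for w in tokenized_text[:max_len]]   # 1 = unknown-word index (assumed)
    # Left-pad so the final time step (the one hn reads) is a real token, not padding;
    # this assumes training batches are padded on the left as well
    idxs = [0] * (max_len - len(idxs)) + idxs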
    x = torch.tensor([idxs], dtype=torch.long).to(device)
    with torch.no_grad():
        prob = model(x).item()
    return "Positive 😊" if prob > 0.5 else "Negative 😞", prob

OUTPUT

SentimentLSTM(
  (embedding): Embedding(10000, 128, padding_idx=0)
  (lstm): LSTM(128, 256, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
Trainable parameters: 2,201,857

Section 12

Python — GRU for Stock Price Forecasting

📖 Real Problem

Predicting Tomorrow's Closing Price from 60 Days of History

A financial analyst wants to predict tomorrow's closing price of a stock, given the past 60 trading days of prices, volumes, and technical indicators. Each day is a time step. The past 60 steps influence the prediction. This is a Many-to-One GRU task on multivariate time series.

import torch
import torch.nn as nn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# ── GRU Model ──────────────────────────────────────────────────
class StockGRU(nn.Module):
    def __init__(self, input_features, hidden_dim, n_layers, output_size=1):
        super().__init__()
        self.gru = nn.GRU(
            input_size=input_features,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            dropout=0.2 if n_layers > 1 else 0
        )
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        # x shape: (batch, seq_len=60, features=5)
        gru_out, hn = self.gru(x)           # gru_out: (B, 60, H)
        last = gru_out[:, -1, :]            # Take last time step (B, H)
        return self.fc(last)               # (B, 1) — tomorrow's price

# ── Data preparation: sliding window ──────────────────────────
def create_sequences(data, seq_len=60):
    """Slide a window of length seq_len over the data."""
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i : i + seq_len])
        y.append(data[i + seq_len, 0])   # Predict closing price (col 0)
    return np.array(X), np.array(y)

# ── Simulate some stock data (5 features per day) ─────────────
np.random.seed(42)
raw_data = np.random.randn(500, 5)  # close, open, high, low, volume

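# Fit the scaler on the rows used for training only, then apply it to all rows
seq_len      = 60
n_train_rows = int((len(raw_data) - seq_len) * 0.8) + seq_len
scaler = MinMaxScaler().fit(raw_data[:n_train_rows])
scaled = scaler.transform(raw_data)
X, y   = create_sequences(scaled, seq_len=seq_len)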

# ── Train/test split (no shuffle — time series!) ──────────────
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

X_train = torch.FloatTensor(X_train)
X_test  = torch.FloatTensor(X_test)
y_train = torch.FloatTensor(y_train).unsqueeze(1)
y_test  = torch.FloatTensor(y_test).unsqueeze(1)

# ── Instantiate and train ──────────────────────────────────────
model     = StockGRU(input_features=5, hidden_dim=128, n_layers=2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    pred = model(X_train)
    loss = criterion(pred, y_train)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Train Loss: {loss.item():.6f}")

⚠️

Critical: Never Shuffle Time Series Data

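In a standard classification task, shuffling before the train/test split is good practice. For time series, it is catastrophic: windows drawn from the future end up in the training set, so the test score no longer measures forecasting ability. Always split train/test by time (e.g. first 80% is train, last 20% is test), and keep the order of steps inside each window intact. Never use train_test_split(..., shuffle=True) on sequential data; shuffle=True is the default, so pass shuffle=False explicitly.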
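A chronological pipeline can be sketched in plain NumPy. The example below builds sliding windows, splits them by time, and takes the min-max statistics from the training rows only. The random-walk data, the window length of 10, and the helper name `make_windows` are assumptions chosen to keep the example small.

```python
import numpy as np

def make_windows(data, seq_len):
    """Return (X, y): windows of seq_len rows and the next row's column 0."""
    X = np.stack([data[i:i + seq_len] for i in range(len(data) - seq_len)])
    y = data[seq_len:, 0]
    return X, y

rng = np.random.default_rng(42)
raw = rng.normal(size=(100, 2)).cumsum(axis=0)   # toy random-walk series
seq_len, train_frac = 10, 0.8

n_windows = len(raw) - seq_len
split = int(n_windows * train_frac)              # windows [0, split) are training
n_train_rows = split + seq_len                   # rows touched by training windows

# Min-max statistics come from the training rows only
lo = raw[:n_train_rows].min(axis=0)
hi = raw[:n_train_rows].max(axis=0)
scaled = (raw - lo) / (hi - lo)                  # later rows may fall outside [0, 1]

X, y = make_windows(scaled, seq_len)
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
print(X_train.shape, X_test.shape)
```

Every target in `y_test` lies strictly after every row the scaler and the training windows have seen, which is the property a shuffled split would break.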

Section 13

Bidirectional RNNs

📖 Story

The Editor Who Reads Twice

A good editor doesn't just read a sentence left to right — they also scan it right to left to catch things they missed. The word "lead" in "The lead singer leads the band" means different things depending on both what comes before and after it.

A Bidirectional RNN runs two RNN layers — one forward through the sequence, one backward — then concatenates their hidden states at each time step. The model sees full context at every position: both what came before and what comes after.

Unidirectional RNN

Property	Value
Context at step t	Only x₁…xₜ (past)
Hidden dim	H
Output dim at each t	H
Good for	Generation, streaming, future is unknown
PyTorch param	`bidirectional=False`

Bidirectional RNN

Property	Value
Context at step t	Full sequence x₁…xT
Hidden dim	H (each direction)
Output dim at each t	2H (concatenated)
Good for	Classification, tagging, translation encoder
PyTorch param	`bidirectional=True`

import torch
import torch.nn as nn

# Bidirectional LSTM: a 2-line change from the standard LSTM
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True   # ← key change
        )
        # hidden_dim * 2 because forward + backward states are concatenated
        self.fc = nn.Linear(hidden_dim * 2, n_classes)

    def forward(self, x):
        emb = self.embedding(x)            # (B, T, E)
        out, (hn, _) = self.lstm(emb)      # hn: (2*layers, B, H)
        # Concatenate last forward and last backward hidden state
        fwd = hn[-2, :, :]                 # Forward last layer
        bwd = hn[-1, :, :]                 # Backward last layer
        combined = torch.cat([fwd, bwd], dim=1)  # (B, 2H)
        return self.fc(combined)           # (B, n_classes)
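To see what the bidirectional flag does conceptually, here is a small NumPy sketch that runs a plain tanh RNN in both directions and concatenates the states. It illustrates the idea only and is not how PyTorch implements it; `run_rnn` and the random weights are assumptions.

```python
import numpy as np

def run_rnn(xs, Wx, Wh):
    """Return the hidden state at every step of a simple tanh RNN."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)                       # (T, H)

rng = np.random.default_rng(0)
T, input_size, H = 7, 3, 4
xs = rng.normal(size=(T, input_size))

# Separate weights for each direction
Wx_f, Wh_f = rng.normal(size=(H, input_size)), rng.normal(size=(H, H))
Wx_b, Wh_b = rng.normal(size=(H, input_size)), rng.normal(size=(H, H))

forward_states = run_rnn(xs, Wx_f, Wh_f)               # reads x_1 ... x_T
backward_states = run_rnn(xs[::-1], Wx_b, Wh_b)[::-1]  # reads x_T ... x_1, re-aligned

outputs = np.concatenate([forward_states, backward_states], axis=1)
print(outputs.shape)   # (7, 8): 2H features at each time step
```

After re-alignment, the backward state at the last position has seen only the last input, while the backward state at the first position has seen the whole sequence. That is why the classifier above reads the final states from hn rather than taking the last time step of out.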

Section 14

Real-World Applications of RNNs

💬

Natural Language Processing

Sentiment analysis, named entity recognition, part-of-speech tagging. The LSTM reads a sentence and tags each word — a many-to-many task. Still used in low-resource settings where transformers are overkill.

many-to-many / many-to-one

🌐

Machine Translation

Encoder-decoder architecture with attention: an LSTM encodes a source sentence into a context vector; a decoder LSTM generates the target translation word by word. The precursor to the Transformer.

many-to-many (seq2seq)

📈

Time Series Forecasting

Predicting stock prices, energy demand, weather patterns, IoT sensor readings. GRUs and LSTMs excel at learning temporal patterns across dozens or hundreds of past steps.

many-to-one / many-to-many

🎵

Music Generation

An LSTM trained on MIDI sequences learns chord progressions, rhythmic patterns, and melodic structure. At inference time, it autoregressively generates new music note by note.

one-to-many (generative)

🗣️

Speech Recognition

Audio frames → phonemes → words. Bidirectional LSTMs with CTC loss were the state-of-the-art before transformers. They process variable-length audio and align it with transcriptions.

many-to-many (CTC)

🧬

Genomics

DNA sequences are strings over a 4-character alphabet (ACGT). LSTMs learn motifs, splice sites, and regulatory elements across long genomic contexts — patterns invisible to sliding-window methods.

many-to-many / classification

Section 15

RNN vs. Transformer — What Changed

Since 2017 and the "Attention Is All You Need" paper, Transformers have largely replaced RNNs for NLP tasks. Understanding why helps you choose the right tool.

Property	RNN / LSTM / GRU	Transformer
Parallelisation	Sequential — can't parallelise over time steps	Fully parallel — all tokens processed at once
Long-range dependencies	Hard — gradient must travel through every step	Easy — direct attention between any two tokens
Memory of sequence	Compressed into fixed-size hidden state	Full context via O(T²) attention
Compute cost	O(T) — linear in sequence length	O(T²) — quadratic in sequence length
Streaming / online inference	Natural — process one token at a time	Requires full sequence upfront (KV cache helps)
Small data	Often better — less data hungry	Needs large datasets to shine
Best use today	Time series, edge devices, streaming, small datasets	Large-scale NLP, vision, any task where data is abundant

ℹ️

RNNs Are Not Dead

RNNs remain a strong choice for time-series forecasting, embedded/edge devices with limited memory, online streaming applications, and small-data regimes where a Transformer is prone to overfitting. Newer architectures such as Mamba (a state-space model) and RWKV combine RNN-like sequential efficiency with Transformer-like expressiveness, and can be seen as the next evolution of recurrent architectures.

Section 16

Golden Rules

🟡 RNN / LSTM / GRU — Non-Negotiable Rules

Always clip gradients. Call nn.utils.clip_grad_norm_(model.parameters(), 1.0) after loss.backward() and before every optimizer step. Without this, training an RNN on long sequences can diverge due to exploding gradients. Treat it as mandatory.

Never shuffle time-series data before splitting. Split chronologically: train on the past, test on the future. Shuffling leaks future information into training and produces misleadingly optimistic metrics.

Use GRU as your default, not vanilla RNN. The vanilla RNN has no gating mechanism and typically struggles once dependencies span more than a few tens of steps, because of vanishing gradients. Start with GRU; upgrade to LSTM if the task requires modelling very long dependencies.

Scale your inputs. RNNs are sensitive to input magnitude. Always normalise time-series features with MinMaxScaler or StandardScaler fitted only on the training set. Apply the same transform to test data without refitting.

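Use batch_first=True in PyTorch. PyTorch's default RNN input shape is (seq, batch, features), which is counterintuitive. Pass batch_first=True to get the natural (batch, seq, features) shape for inputs and outputs. The returned hidden state keeps the shape (num_layers × directions, batch, hidden) either way.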

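Detach hidden states between batches. When one long sequence is processed as consecutive batches (e.g. language modelling) and the hidden state is carried over, always do h = h.detach() before the next batch (for an LSTM, detach both h and c). Without this, backpropagation tries to reach back through the entire history, which either raises an error in PyTorch or makes memory use grow without bound.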

Prefer bidirectional LSTM for understanding tasks, unidirectional for generation. A bidirectional model sees the full sequence before producing outputs — great for classification or translation encoding. A generative model (text, music) must be unidirectional: it cannot look into the future it hasn't produced yet.
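Several of these rules (batch_first, detaching the hidden state, clipping before the optimizer step) fit into one short PyTorch loop. The sketch below trains on random toy tensors with a truncation length of 32; the model sizes, the SGD optimizer, and the data are assumptions made only so the example runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(gru.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

long_seq = torch.randn(4, 96, 8)      # (batch, time, features), toy data
targets = torch.randn(4, 96, 1)
k = 32                                # truncation length
h = None                              # None lets PyTorch start from zeros

for start in range(0, long_seq.size(1), k):
    x_chunk = long_seq[:, start:start + k]
    y_chunk = targets[:, start:start + k]

    if h is not None:
        h = h.detach()                # keep the values, cut the graph

    out, h = gru(x_chunk, h)
    loss = nn.functional.mse_loss(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()                   # backprop stops at the chunk boundary
    nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()

print(h.shape)   # torch.Size([1, 4, 16])
```

With an nn.LSTM the carried state is a tuple, so the equivalent step would be something like `hidden = tuple(s.detach() for s in hidden)`.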