LSTM Deep Learning

Section 01

The Story That Explains LSTMs

📖 Real World Analogy

The Detective Who Never Forgets (But Knows What to Forget)

Imagine you are reading a novel. On page 1, a character named Marcus mentions he has a fear of water. Two hundred pages later, when Marcus stands at the edge of a lake, your brain instantly connects that moment to the earlier detail — without re-reading those 200 pages. You remembered what mattered and quietly discarded the lunch order from page 47 that was irrelevant.

Now imagine a different kind of reader — one who only remembers the last sentence they read. When they reach the lake scene, they have no idea who Marcus is or why he's trembling. This is the problem that plagued early neural networks called Vanilla RNNs.

Long Short-Term Memory (LSTM) networks were invented to solve exactly this. They are neural networks that can selectively remember, selectively forget, and selectively output — carrying relevant context across hundreds or thousands of time steps, just like a great detective who knows which clues to file away and which to ignore.

An LSTM is a special type of Recurrent Neural Network (RNN) designed to learn long-range dependencies in sequential data. Introduced by Hochreiter & Schmidhuber in 1997, it was built to counter the vanishing gradient problem that made vanilla RNNs ineffective on long sequences. LSTMs have powered speech recognition, machine translation, time-series forecasting, and music generation.

🧠

The Core Insight

An LSTM maintains two separate memory channels: a cell state (long-term memory — the novel's plot) and a hidden state (working memory — what the reader is currently focused on). Three learned gates control what flows in, what flows out, and what gets erased. The elegance is that these gates are differentiable, so they train end-to-end with backpropagation.

Section 02

Why Vanilla RNNs Fail — The Vanishing Gradient

To understand why LSTMs exist, you must feel the pain of what came before. A Vanilla RNN processes sequences step by step, passing a hidden state from one time step to the next. The hidden state is the only memory it has.

⚙️ Vanilla RNN — How Memory Degrades Over Time

Step t=1

Input "The cat" → hidden state h₁ encodes some memory of "cat"

Step t=5

Hidden state h₅ is now a faint, distorted echo — matrix multiplications diluted the signal

Step t=20

The trace of "cat" has shrunk to roughly 0.0001 of its original strength, and the gradient back to step 1 has effectively vanished

Predict

Model tries to predict "sat" after "The cat … [18 words] …" — fails completely

⚠️

The Vanishing Gradient Problem — Mathematically

During backpropagation through time (BPTT), the gradient is multiplied by the recurrent Jacobian at every time step. If the recurrent weight matrix has eigenvalues with magnitude below 1, these products shrink exponentially and vanish before reaching distant steps. The network cannot learn that "cat" (step 1) influences "sat" (step 20). Conversely, eigenvalues with magnitude above 1 cause exploding gradients: unstable training that diverges. Vanilla RNNs sit on a knife's edge between these two failure modes.
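
The shrinking and growing products can be seen numerically. The sketch below is a simplified illustration, not a trained network: it assumes a linear recurrence (the activation derivative is ignored) and a hypothetical weight matrix built as a scaled random orthogonal matrix, so every singular value equals `scale`.

```python
import numpy as np

def backprop_norm(scale: float, steps: int, size: int = 8, seed: int = 0) -> float:
    """Norm of a gradient vector after `steps` multiplications by W^T.

    W is `scale` times a random orthogonal matrix, so every singular value
    of W equals `scale` (an illustrative choice, not a trained matrix).
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    grad = np.ones(size) / np.sqrt(size)  # unit-norm gradient at the last step
    for _ in range(steps):
        grad = w.T @ grad  # one step of backpropagation through time
    return float(np.linalg.norm(grad))

for scale in (0.9, 1.0, 1.1):
    print(f"scale={scale}: gradient norm after 50 steps = {backprop_norm(scale, 50):.4g}")
```

With a scale slightly below 1 the norm collapses towards zero, and slightly above 1 it grows by orders of magnitude, which is the knife's edge described above.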

📉

Vanishing Gradient

Eigenvalue < 1

Gradients shrink exponentially. Long-range dependencies are invisible to the optimizer. Model learns only local patterns.

💥

Exploding Gradient

Eigenvalue > 1

Gradients grow exponentially. Weights update wildly. Training diverges. Fixed with gradient clipping but root cause remains.

🧬

LSTM's Solution

Constant Error Carousel

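The cell state highway lets gradients flow backwards without repeated multiplication by the recurrent weight matrix. Along this path the gradient is only scaled element-wise by the forget gate, so when that gate stays near 1 the "constant error carousel" preserves information across hundreds of steps.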

Section 03

LSTM Cell Anatomy — The Animated Working Diagram

An LSTM cell takes three inputs — the current input xₜ, the previous hidden state h₍ₜ₋₁₎, and the previous cell state C₍ₜ₋₁₎ — and produces two outputs: a new hidden state hₜ and a new cell state Cₜ. Inside, four learned transformations govern everything.

🔬 LSTM Cell Diagram

Diagram: LSTM cell. The highway at the top is the cell state; information can flow across the entire sequence largely untouched.

Section 04

The Four Equations — What Actually Happens Inside

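Every LSTM step computes four learned, vector-valued transformations (three gates and one candidate) and combines them with two element-wise updates, six equations in total. Everything else is just wiring those equations together. Learn these, and you understand the entire model.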

🔴 Gate 1 — Forget Gate

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)

Outputs values 0–1 for each cell-state position. 0 = erase completely, 1 = keep entirely. Decides what old memory to throw away.

🟢 Gate 2 — Input Gate

iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)

Outputs 0–1 for each position. Decides which new values to let into the cell state. Works together with the candidate.

🟡 Candidate Values

C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc)

Proposes new candidate values (-1 to 1) that could be added to cell state. The input gate acts as a filter on these proposals.

🔵 Cell State Update

Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ

Element-wise: erase old memory (forget gate) then write new memory (input gate × candidate). The highway update — no full matrix multiply here.

🟣 Gate 3 — Output Gate

oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)

Decides which parts of the cell state to expose as the output hidden state. The cell state may hold more than what the output gate reveals.

⚪ Hidden State Output

hₜ = oₜ ⊙ tanh(Cₜ)

The final working memory. Cell state is squashed through tanh (-1 to 1) then filtered by output gate. This is what the next layer or output head reads.
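
The equations above can be written out directly in NumPy. This is a minimal sketch for a single time step and a single example; the `params` dictionary, the sizes, and the random weights are hypothetical and only serve to exercise the function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, following the equations listed above."""
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(params["Wf"] @ z + params["bf"])       # forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])       # input gate
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                   # cell state update
    o_t = sigmoid(params["Wo"] @ z + params["bo"])       # output gate
    h_t = o_t * np.tanh(c_t)                             # hidden state output
    return h_t, c_t

# Hypothetical sizes and random weights, only to exercise the function
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W{name}"] = rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
    params[f"b{name}"] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.standard_normal(input_size), h, c, params)
print("h_t:", h.shape, "C_t:", c.shape)
```

Each weight matrix has shape (hidden_size, hidden_size + input_size) because it acts on the concatenation of the previous hidden state and the current input.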

💡

The Key Difference: ⊙ vs Matrix Multiply

The cell state update Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ uses element-wise multiplication (⊙), not matrix multiplication. This is what counters vanishing gradients: along the cell-state path, the gradient from Cₜ back to Cₜ₋₁ is simply fₜ, with no weight matrix and no squashing nonlinearity in between, so it survives as long as the forget gate stays close to 1. It is like a highway with on-ramps (iₜ ⊙ C̃ₜ) and off-ramps (1 − fₜ), rather than a roundabout where information must fully interact at every step.
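
A small numerical check makes the highway concrete. The sketch below assumes a scalar cell and treats the gate values as fixed numbers (it ignores the indirect dependence of the gates on the hidden state); the gate and write values are randomly generated for illustration. Under those assumptions, the gradient of the final cell state with respect to the initial one equals the product of the forget gates.

```python
import random

def run_cell_path(c0, forget, writes):
    """Apply C_t = f_t * C_{t-1} + (i_t * candidate_t) for a scalar cell."""
    c = c0
    for f_t, w_t in zip(forget, writes):
        c = f_t * c + w_t
    return c

random.seed(0)
steps = 100
forget = [random.uniform(0.97, 1.0) for _ in range(steps)]   # gates near 1
writes = [random.uniform(-0.1, 0.1) for _ in range(steps)]   # i_t * candidate_t

# Finite-difference estimate of dC_T / dC_0
eps = 1e-6
numeric = (run_cell_path(1.0 + eps, forget, writes)
           - run_cell_path(1.0, forget, writes)) / eps

# Analytic value: the product of the forget gates along the path
analytic = 1.0
for f_t in forget:
    analytic *= f_t

print(f"numeric  dC_T/dC_0 = {numeric:.6f}")
print(f"analytic prod(f_t) = {analytic:.6f}")
```

The gradient is still a product, but of gate values the network can learn to hold near 1, rather than of a fixed weight matrix.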

Section 05

Gate Intuition — The Three Security Guards

📖 Story

The Filing Room with Three Guards

Imagine a government filing room that stores every important fact about a country's citizens. Three security guards control what happens:

Guard 1 — The Archivist (Forget Gate): Every morning, she reviews the files and stamps "DESTROY" on anything outdated. If a citizen moved cities, the old address gets shredded. Her stamp strength (0–1) decides how much of each record to keep.

Guard 2 — The Intake Officer (Input Gate × Candidate): New documents arrive at the door. He first decides which documents are worth filing (input gate, 0–1), then decides what those documents actually say (candidate values, -1 to +1). Only approved content in the right amount gets filed.

Guard 3 — The Spokesperson (Output Gate): A reporter asks for information. The spokesperson decides which filed records to share publicly — even if the room holds sensitive long-term data, only selected parts get released as the "hidden state" that downstream layers read.

The filing room itself is the cell state — a long-term memory that can hold information for hundreds of "days" (time steps) without degradation.

🔴

Forget Gate

σ → [0, 1]

Output near 0: erase that memory position entirely.
Output near 1: preserve it unchanged.
Example: when parsing "he left the bank" after many finance words — forget gate resets the "bank = finance" memory.

🟢

Input Gate

σ × tanh → selective write

Controls how much of the candidate gets written. Near 0: ignore new info. Near 1: write fully.
Example: when "bank" in context of "river" appears — write "bank = geography" strongly.

🟣

Output Gate

σ × tanh(C) → hidden state

Decouples long-term storage from short-term output. The cell might store "subject is plural" for 15 steps but only expose it when a verb needs to agree.
Near 0: keep the fact private. Near 1: broadcast it.

Section 06

LSTM Unrolled Through Time

An LSTM doesn't just process one step — it processes a sequence. The same cell (same weights) is applied repeatedly, with the hidden state and cell state passed forward at each step. This is called unrolling.

📊 LSTM Unrolled — Sequence Processing (3 time steps shown)

🔗

Weight Sharing Across Time

Crucially, all three LSTM cells in the diagram above share the exact same weights (Wf, Wi, Wc, Wo and their biases). The "same cell applied repeatedly" means the total number of parameters does not grow with sequence length. A sequence of 1,000 words uses the same parameter count as a sequence of 10 words. A transformer's parameter count is also independent of input length; what differs is the cost per sequence. An LSTM carries a fixed-size state from step to step, whereas a transformer's attention compute and memory grow with input length.
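
Unrolling is just a loop that reuses one set of weights. The following NumPy sketch assumes the four weight matrices are stacked into a single matrix `W` (a common implementation convenience, not something the text above prescribes), with hypothetical sizes and random values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_lstm(sequence, W, b, hidden_size):
    """Unroll one LSTM cell over `sequence` (shape: steps x input_size).

    W stacks Wf, Wi, Wc, Wo row-wise, so it has shape
    (4 * hidden_size, hidden_size + input_size); b stacks the four biases.
    """
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x_t in sequence:                       # same W and b at every step
        z = W @ np.concatenate([h, x_t]) + b
        f, i, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h, c

hidden_size, input_size = 8, 3   # hypothetical sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(4 * hidden_size)
n_params = W.size + b.size

for steps in (10, 1000):
    h, _ = run_lstm(rng.standard_normal((steps, input_size)), W, b, hidden_size)
    print(f"{steps:>4} steps -> h shape {h.shape}, parameters used: {n_params}")
```

The parameter count is fixed before the loop runs; only the number of loop iterations changes with sequence length.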

Section 07

LSTM Variants — The Family Tree

🔁

Bidirectional LSTM

BiLSTM

Two LSTM layers: one processes the sequence forward, one backward. Outputs are concatenated. Crucial for NLP tasks where future context matters (e.g. Named Entity Recognition). Unsuitable for real-time generation, because the backward layer needs the whole sequence before it can run.

📚

Stacked LSTM

Deep LSTM

Multiple LSTM layers stacked vertically. Layer N's hidden state becomes Layer N+1's input. Learns hierarchical temporal patterns — lower layers capture local patterns, upper layers capture global structure. Most practical LSTM models stack 2–4 layers.

⚡

GRU (Cousin)

Gated Recurrent Unit

Simplified LSTM with only 2 gates: reset and update. Merges cell state and hidden state into one. Fewer parameters, faster training, similar performance on many tasks. Often the first choice when compute is limited.

🧠

Peephole LSTM

Gates see cell state

Gates get an additional connection to the cell state (not just hidden state). Allows more precise timing — useful for music generation or precise time-delay tasks. Adds complexity without always improving performance.

🎯

Attention + LSTM

seq2seq with Attention

Encoder LSTM produces hidden states for each input. Decoder LSTM attends over all encoder states. This was the state of the art in machine translation before transformers, and the precursor to the modern transformer attention mechanism.

🔬

ConvLSTM

Spatial + Temporal

Replaces matrix multiplications with convolutions. Designed for spatiotemporal data like weather maps or video frames. The cell state carries both spatial and temporal structure.
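
The "fewer parameters" point on the GRU card can be checked with simple arithmetic. This sketch assumes each gate or candidate is one weight block of shape (hidden_size, hidden_size + input_size) with a single bias vector; frameworks such as PyTorch keep two bias vectors per block, so their exact counts are slightly higher. The helper name is hypothetical.

```python
def gated_rnn_params(input_size: int, hidden_size: int, n_blocks: int) -> int:
    """Parameters of one recurrent layer with `n_blocks` weight blocks,
    each of shape (hidden_size, hidden_size + input_size) plus one bias."""
    return n_blocks * (hidden_size * (hidden_size + input_size) + hidden_size)

input_size, hidden_size = 1, 128
lstm = gated_rnn_params(input_size, hidden_size, n_blocks=4)  # f, i, candidate, o
gru = gated_rnn_params(input_size, hidden_size, n_blocks=3)   # reset, update, candidate
print(f"LSTM: {lstm:,}  GRU: {gru:,}  ratio: {gru / lstm:.2f}")
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width.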

Section 08

Key Hyperparameters — What You Actually Tune

Hyperparameter	Typical Range	Effect	Tip
`hidden_size` / units	64 – 512	Dimensions of h and C vectors. Bigger = more capacity, more compute	Start at 128. Double if model underfits.
`num_layers`	1 – 4	Depth of stacked LSTM. More layers = more hierarchical abstraction	2 is usually enough; 3+ needs strong regularisation
`dropout`	0.1 – 0.5	Applied between LSTM layers (not inside recurrent connections)	Use 0.2–0.3 first. Variational dropout if overfitting persists
`sequence_length`	task-dependent	How many time steps the LSTM sees at once (BPTT window)	Keep to ≤ 200 for stability. Truncated BPTT for longer sequences
`learning_rate`	1e-4 – 1e-2	LSTMs are sensitive to LR. Too high = divergence; too low = slow	Use Adam with lr=1e-3. Add scheduler to reduce on plateau
`gradient_clipping`	0.5 – 5.0	Prevents exploding gradients. Clips gradient norm to this threshold	Always use with LSTMs. Value of 1.0 is safe default
`bidirectional`	True / False	Doubles parameters and computation; allows future-context awareness	Use for classification/NER. Avoid for generation or real-time prediction

Section 09

Python Implementation — From Scratch to Production (PyTorch)

We'll build an LSTM for stock price forecasting — a classic time-series task. The model predicts the next day's closing price given the last 60 days of price data.

🛠️ Implementation Roadmap

Step 1

Install dependencies and import libraries

Step 2

Load and normalise the time-series data

Step 3

Create sliding-window sequences (X, y pairs)

Step 4

Define the LSTM model class

Step 5

Train with Adam + gradient clipping

Step 6

Evaluate and inverse-transform predictions

Step 1 — Imports

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Step 2 — Data Preparation

# Load CSV with columns: Date, Close
df = pd.read_csv('stock_prices.csv', parse_dates=['Date'], index_col='Date')
prices = df[['Close']].values  # shape: (N, 1)

# Train / test split (80% / 20%), chronological: no shuffling before the split
split = int(len(prices) * 0.8)

# Normalise to [0, 1], which is critical for stable LSTM training.
# Fit the scaler on the training portion only so no test statistics leak in.
scaler = MinMaxScaler(feature_range=(0, 1))
train_data = scaler.fit_transform(prices[:split])
test_data  = scaler.transform(prices[split:])

print(f"Train samples: {len(train_data)} | Test samples: {len(test_data)}")

Step 3 — Sliding Window Dataset

class StockDataset(Dataset):
    def __init__(self, data, seq_len=60):
        self.seq_len = seq_len
        X, y = [], []
        for i in range(len(data) - seq_len):
            X.append(data[i : i + seq_len])        # shape: (60, 1)
            y.append(data[i + seq_len])             # next value
        self.X = torch.FloatTensor(np.array(X))  # (N, 60, 1)
        self.y = torch.FloatTensor(np.array(y))  # (N, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

SEQ_LEN   = 60
BATCH     = 32

train_ds  = StockDataset(train_data, SEQ_LEN)
test_ds   = StockDataset(test_data,  SEQ_LEN)
train_dl  = DataLoader(train_ds, batch_size=BATCH, shuffle=True)
test_dl   = DataLoader(test_ds,  batch_size=BATCH, shuffle=False)
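
To see what the sliding window produces, it can help to run the same indexing on a tiny array. The sketch below uses a hypothetical helper, `make_windows`, and a toy series of ten values with a window of three instead of sixty.

```python
import numpy as np

def make_windows(series: np.ndarray, seq_len: int):
    """Return (X, y) where X[k] is series[k : k + seq_len] and y[k] is the next value."""
    X = np.stack([series[i : i + seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]
    return X, y

toy = np.arange(10, dtype=np.float32).reshape(-1, 1)   # stand-in for scaled prices
X, y = make_windows(toy, seq_len=3)
print(X.shape, y.shape)          # (7, 3, 1) (7, 1)
print(X[0].ravel(), "->", y[0])  # first window and its target
```

A series of N values yields N minus seq_len windows, which is why the test split needs to be comfortably longer than the window.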

Step 4 — The LSTM Model

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=128,
                 num_layers=2, dropout=0.2, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers  = num_layers

        # Core LSTM — 2 stacked layers with dropout between them
        self.lstm = nn.LSTM(
            input_size  = input_size,
            hidden_size = hidden_size,
            num_layers  = num_layers,
            dropout     = dropout,        # between layers only
            batch_first = True           # input: (batch, seq, features)
        )

        # Prediction head
        self.fc = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, output_size)
        )

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        # Initialise h₀ and C₀ to zeros
        h0 = torch.zeros(self.num_layers, x.size(0),
                          self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0),
                          self.hidden_size).to(x.device)

        # out: (batch, seq_len, hidden_size)
        # hn, cn: (num_layers, batch, hidden_size)
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # Take ONLY the last time step's hidden state
        last_hidden = out[:, -1, :]  # (batch, hidden_size)
        return self.fc(last_hidden)   # (batch, 1)

model = LSTMForecaster(
    input_size  = 1,
    hidden_size = 128,
    num_layers  = 2,
    dropout     = 0.2,
    output_size = 1
).to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {total_params:,}")

OUTPUT

Model parameters: 207,489
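
The count for this architecture can also be reproduced by hand. A minimal sketch, assuming PyTorch's parameter layout for nn.LSTM (four gate blocks acting on the input, four on the hidden state, and two bias vectors per layer); the helper names are hypothetical.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    # Four weight blocks on the input, four on the hidden state,
    # and two bias vectors of length 4 * hidden_size (PyTorch's layout).
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

def linear_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features

layer1 = lstm_layer_params(1, 128)      # first LSTM layer reads the raw input
layer2 = lstm_layer_params(128, 128)    # second layer reads layer 1's hidden state
head = linear_params(128, 64) + linear_params(64, 1)
total = layer1 + layer2 + head
print(f"LSTM: {layer1 + layer2:,} | head: {head:,} | total: {total:,}")
```

Almost all of the parameters sit in the recurrent layers; the prediction head is a small fraction of the total.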

Step 5 — Training Loop with Gradient Clipping

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, patience=5, factor=0.5   # `verbose` is deprecated in recent PyTorch releases
)

EPOCHS       = 50
CLIP_NORM    = 1.0   # gradient clipping threshold
train_losses = []
val_losses   = []

for epoch in range(1, EPOCHS + 1):

    # ── Training phase ──────────────────────────────────────
    model.train()
    epoch_loss = 0.0
    for X_batch, y_batch in train_dl:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        optimizer.zero_grad()
        preds = model(X_batch)                       # forward pass
        loss  = criterion(preds, y_batch)

        loss.backward()                              # BPTT
        nn.utils.clip_grad_norm_(model.parameters(),
                                   CLIP_NORM)        # ← critical
        optimizer.step()
        epoch_loss += loss.item() * len(X_batch)

    train_loss = epoch_loss / len(train_ds)
    train_losses.append(train_loss)

    # ── Validation phase ─────────────────────────────────────
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for X_val, y_val in test_dl:
            X_val = X_val.to(device)
            y_val = y_val.to(device)
            val_preds = model(X_val)
            val_loss += criterion(val_preds, y_val).item() * len(X_val)
    val_loss /= len(test_ds)
    val_losses.append(val_loss)

    scheduler.step(val_loss)

    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d}/{EPOCHS} | Train MSE: {train_loss:.6f} | Val MSE: {val_loss:.6f}")
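
What the clipping call in the loop does can be sketched in a few lines of NumPy. This is an illustration of norm-based clipping in general, not PyTorch's source code: the helper name and the example gradients are hypothetical, and the small epsilon in the denominator is an assumption made to avoid division by zero.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm: float):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # combined norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(f"norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.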

Step 6 — Inference and Inverse Transform

model.eval()
all_preds, all_true = [], []

with torch.no_grad():
    for X_batch, y_batch in test_dl:
        X_batch = X_batch.to(device)
        pred = model(X_batch).cpu().numpy()
        all_preds.append(pred)
        all_true.append(y_batch.numpy())

preds_scaled = np.concatenate(all_preds)       # (N, 1)
true_scaled  = np.concatenate(all_true)

# Inverse MinMax transform → real price values
preds_price  = scaler.inverse_transform(preds_scaled)
true_price   = scaler.inverse_transform(true_scaled)

mae  = np.mean(np.abs(preds_price - true_price))
rmse = np.sqrt(np.mean((preds_price - true_price) ** 2))

print(f"MAE:  ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")

# Save model
torch.save(model.state_dict(), 'lstm_forecaster.pth')
print("Model saved.")

OUTPUT

MAE:  $3.47
RMSE: $5.12
Model saved.

Section 10

Keras / TensorFlow Implementation

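A comparable model in Keras: fewer lines, same concept. Keras is often preferable for rapid prototyping and research iteration. The snippet assumes X_train, y_train, X_test, and y_test are NumPy arrays of sliding windows with shapes (N, 60, 1) and (N, 1).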

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, callbacks

# Build model — functional API for clarity
inputs  = keras.Input(shape=(60, 1))          # (seq_len, features)
x       = layers.LSTM(128, return_sequences=True,
                       dropout=0.2)(inputs)    # layer 1 → passes full sequence
x       = layers.LSTM(64,  return_sequences=False,
                       dropout=0.2)(x)          # layer 2 → returns last step
x       = layers.Dense(32, activation='relu')(x)
outputs = layers.Dense(1)(x)

model = keras.Model(inputs, outputs)
model.compile(
    optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),  # ← gradient clip
    loss      = 'mse',
    metrics   = ['mae']
)
model.summary()

# Callbacks
cbs = [
    callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(patience=5, factor=0.5),
    callbacks.ModelCheckpoint('best_lstm.keras', save_best_only=True)
]

history = model.fit(
    X_train, y_train,
    validation_data = (X_test, y_test),
    epochs          = 100,
    batch_size      = 32,
    callbacks       = cbs,
    verbose         = 1
)

⚡

return_sequences=True vs False

In a stacked LSTM, all layers except the last must use return_sequences=True so they pass the full sequence of hidden states to the next layer. The final LSTM layer uses return_sequences=False to return only the last hidden state — which then feeds into the Dense prediction head. Getting this wrong is one of the most common LSTM bugs.
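
The difference is easiest to see in the tensor shapes. The sketch below uses plain NumPy arrays as stand-ins for layer outputs (it does not call Keras), with the batch size, window length, and width taken from the model above.

```python
import numpy as np

batch, seq_len, hidden = 32, 60, 128

# What a layer with return_sequences=True hands to the next layer:
all_states = np.zeros((batch, seq_len, hidden))   # one hidden state per time step

# What a layer with return_sequences=False hands on:
last_state = all_states[:, -1, :]                 # only the final hidden state

print(all_states.shape)   # (32, 60, 128)
print(last_state.shape)   # (32, 128)
```

A following LSTM layer expects the three-dimensional form, while a Dense head expects the two-dimensional one.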

Section 11

LSTM vs Transformer — When to Use Which

Property	LSTM	Transformer
Memory mechanism	Recurrent cell state	Self-attention over all tokens
Long-range dependency	Good (60–200 steps)	Excellent (1000s of tokens)
Parallelisation during training	Sequential across time steps; parallel only across the batch	Fully parallel, GPU-friendly
Real-time / streaming inference	Excellent — O(1) per step	Expensive — needs full context window
Parameter count vs performance	Very efficient at small scale	Needs scale to shine
Time-series forecasting	Still competitive	Patching approaches catching up
On-device / edge deployment	Lightweight, fast	Often too large
When to choose	Streaming data, IoT, low-latency systems, moderate-length sequences	Large NLP tasks, long context, when compute is abundant

🎯

LSTMs Are Not Dead

Despite transformers dominating NLP headlines, LSTMs remain the go-to choice for real-time time-series tasks: ECG monitoring, financial tick data, sensor fusion in robotics, and speech synthesis on embedded devices. Their O(1) per-step inference cost and small memory footprint make them irreplaceable in production edge systems where a 70B parameter transformer is simply not an option.
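
The constant per-step cost comes from carrying a fixed-size state forward. A minimal PyTorch sketch, assuming an untrained nn.LSTM with hypothetical sizes and random input: feeding one reading at a time while passing the (h, c) state along reproduces the output of processing the whole sequence at once.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
lstm.eval()

stream = torch.randn(1, 50, 1)          # one sequence of 50 readings

with torch.no_grad():
    # Offline: process the whole sequence in one call
    full_out, _ = lstm(stream)

    # Streaming: feed one reading at a time, carrying (h, c) forward
    state = None                        # None means "start from zeros"
    step_outputs = []
    for t in range(stream.size(1)):
        out_t, state = lstm(stream[:, t : t + 1, :], state)
        step_outputs.append(out_t)
    streamed_out = torch.cat(step_outputs, dim=1)

print(torch.allclose(full_out, streamed_out, atol=1e-5))
```

Each streaming step touches only the newest input and the stored state, so latency and memory do not grow with how long the stream has been running.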

Section 12

Real-World Applications — Where LSTMs Live in Production

📈

Financial Forecasting

Time Series

Stock prices, forex rates, volatility prediction. Input: multi-feature OHLCV sequences. Output: next-period price or directional signal. Requires careful walk-forward validation.

🗣️

Speech Recognition

Audio → Text

BiLSTM + CTC loss converts spectrogram sequences to character sequences. Recurrent models trained with CTC were popularised by Baidu's Deep Speech, and LSTM-based variants still run in low-latency voice assistants.

🌍

Machine Translation

seq2seq + Attention

Encoder LSTM reads source language; Decoder LSTM generates target language word by word. Attention mechanism lets decoder focus on relevant source tokens.

🏥

Medical Monitoring

Anomaly Detection

ECG / EEG / ICU vitals streams. LSTM learns normal patterns; high reconstruction error flags anomalies. Runs on embedded hardware in real-time at 250Hz+ sampling rates.

🎵

Music Generation

Creative AI

LSTM trained on MIDI sequences learns musical structure, key changes, and rhythmic patterns. Generates note-by-note with temperature-controlled sampling.

🌦️

Weather Prediction

Spatiotemporal

ConvLSTM processes sequences of weather maps (temperature, pressure grids). Learns how weather systems evolve and move spatially over time.
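
The walk-forward validation mentioned on the Financial Forecasting card can be sketched as an index generator. This is one common variant (an expanding training window with fixed-size test blocks); the function name and the fold sizes are hypothetical.

```python
def walk_forward_splits(n_samples: int, n_folds: int, test_size: int):
    """Yield (train_indices, test_indices) with an expanding training window.

    Every test block lies strictly after its training block in time.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        test_start = first_test_start + k * test_size
        yield range(0, test_start), range(test_start, test_start + test_size)

for train_idx, test_idx in walk_forward_splits(n_samples=100, n_folds=3, test_size=10):
    print(f"train 0..{train_idx[-1]}  |  test {test_idx[0]}..{test_idx[-1]}")
```

Any scaler should be refitted on each fold's training indices so that no later data influences the preprocessing.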

Section 13

Golden Rules — Non-Negotiable LSTM Practices

🧠 LSTM — Practitioner's Non-Negotiable Rules

Always clip gradients. Use clip_grad_norm_(..., 1.0) in PyTorch or clipnorm=1.0 in Keras. Exploding gradients will silently destroy training and produce NaN losses. This is the single most common LSTM training bug.

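Never shuffle time-series data before splitting. Time series have temporal order, and shuffling before the split mixes future windows into the training set, which causes severe data leakage. Always split by time: train on earlier data, test on later data. Shuffling the training windows after the split, as the PyTorch DataLoader above does, is fine when every window starts from a zero state.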

Normalise your inputs. LSTMs use sigmoid and tanh activations that saturate once their inputs move much beyond roughly [-3, 3]. Raw price data ($50 – $500) can push the gates into saturation and stall learning. Fit MinMaxScaler or StandardScaler on the training data only, then use it to transform both the training and test sets.

In a stacked LSTM in Keras, use return_sequences=True for all layers except the last. A common mistake is forgetting this and getting a shape mismatch error — or silently feeding only the last hidden state to a layer that expects a full sequence.

Reinitialise hidden state between unrelated sequences. If training on independent sentences (not a single book), start every batch from a zero state: in PyTorch, omit the (h0, c0) argument so it defaults to zeros, or explicitly zero the tensors between batches. Carrying stale state from a different sequence is a subtle but devastating bug.

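Use model.eval() and torch.no_grad() during inference. Stacked LSTMs and prediction heads often include dropout, which behaves differently during training and inference. Forgetting model.eval() leaves dropout active, so activations are randomly zeroed and predictions become noisy; your model will appear unreliable when it isn't. torch.no_grad() separately turns off gradient tracking, which saves memory and time.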

Start with 1–2 layers and 64–128 units. More layers and units are not always better — they dramatically increase training time and can overfit on small datasets. Only add depth after confirming the simpler model has genuinely underfit.

For long sequences (> 200 steps), use Truncated BPTT: split the sequence into chunks and detach the hidden and cell states between chunks with .detach(). Full BPTT over 1,000 steps is often unstable and memory-hungry.
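
The Truncated BPTT rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the model sizes, the synthetic sine-wave series, and the chunk length of 100 are all hypothetical, and the task (predict the next value at every step) is chosen only to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.MSELoss()

# Hypothetical data: one long sequence, predict the next value at every step
series = torch.sin(torch.linspace(0, 60, 1001)).view(1, -1, 1)
inputs, targets = series[:, :-1], series[:, 1:]

CHUNK = 100          # BPTT window
state = None         # zero state at the start of the sequence
losses = []
for start in range(0, inputs.size(1), CHUNK):
    x = inputs[:, start : start + CHUNK]
    y = targets[:, start : start + CHUNK]

    out, state = lstm(x, state)
    loss = criterion(head(out), y)

    optimizer.zero_grad()
    loss.backward()                       # gradients stop at the chunk boundary
    nn.utils.clip_grad_norm_(params, 1.0)
    optimizer.step()

    # Keep the values of (h, c) but cut the graph back to earlier chunks
    state = tuple(s.detach() for s in state)
    losses.append(loss.item())

print(f"chunks processed: {len(losses)}")
```

The state values still flow forward across chunks, so the model keeps its memory; only the backward pass is limited to one chunk at a time.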

Section 14

Quick Reference — LSTM Cheat Sheet

Forget Gate

fₜ = σ(Wf·[h,x] + bf)

Erases irrelevant cell state. Output ∈ [0,1]

Input Gate

iₜ = σ(Wi·[h,x] + bi)

Decides what new info to write. Output ∈ [0,1]

Candidate Values

C̃ₜ = tanh(Wc·[h,x] + bc)

Proposed new values. Output ∈ [-1,1]

Cell State Update

Cₜ = fₜ⊙Cₜ₋₁ + iₜ⊙C̃ₜ

Erase + write. The gradient highway

Output Gate

oₜ = σ(Wo·[h,x] + bo)

Controls what hidden state exposes

Hidden State

hₜ = oₜ ⊙ tanh(Cₜ)

Working memory output for next layer

Task	Architecture	Loss	Activation
Regression / Forecasting	Stacked LSTM → Dense(1)	MSE	Linear output
Binary Classification	LSTM → Dense(1)	BCE	Sigmoid output
Multi-class Classification	LSTM → Dense(n_classes)	CrossEntropy	Softmax output
NER / Sequence Labelling	BiLSTM → CRF	CRF NLL	Linear emission scores, Viterbi decoding
Text Generation	Stacked LSTM → Dense(vocab)	CrossEntropy	Temperature softmax
Anomaly Detection	LSTM Autoencoder	Reconstruction MSE	Linear decoder
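
The "temperature softmax" used for text and music generation can be sketched in NumPy. The logits and the three-token vocabulary below are hypothetical; in a real model the logits would come from the final Dense layer.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax of logits / temperature; lower temperature sharpens the distribution."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()        # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])        # hypothetical scores over a 3-token vocabulary
rng = np.random.default_rng(0)
for temperature in (0.5, 1.0, 2.0):
    probs = temperature_softmax(logits, temperature)
    token = int(rng.choice(len(logits), p=probs))
    print(f"T={temperature}: probs={np.round(probs, 3)}, sampled index={token}")
```

Low temperatures make generation more conservative and repetitive, while high temperatures flatten the distribution and produce more varied output.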

🏆

You Now Understand LSTMs — The Complete Picture

You have gone from the filing room analogy to the gate equations to a full PyTorch training loop with gradient clipping. The key insight to carry forward: an LSTM works because its cell state is a gradient highway. The additive update, scaled only element-wise by the forget gate, lets errors propagate backwards across hundreds of time steps without vanishing. The three gates are all learned, differentiable valves that the optimizer adjusts to control information flow. That is the entire secret.