LSTM Networks: Cell State, Gates

Section 01

What Are LSTM Networks?

Long Short-Term Memory networks — usually just called LSTMs — are a special kind of Recurrent Neural Network (RNN), capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many researchers in the work that followed. They work tremendously well on a large variety of problems and are now widely used across industry and research.

📖 The Core Promise

Remembering Is Their Default — Not a Struggle

LSTMs are explicitly designed to avoid the long-term dependency problem. Consider reading a novel: when you reach the climax on page 300, you still remember the key character trait planted on page 5. Standard RNNs struggle to do this in practice: as the gap grows, the influence of earlier context fades. LSTMs address this by maintaining a dedicated memory path, the cell state, that can carry information across many time steps with little degradation.

Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

🔗

The Chain of Repeating Modules

All recurrent neural networks take the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer. LSTMs share this chain-like structure, but the repeating module is more sophisticated: it contains four interacting layers (three sigmoid gates and one tanh candidate layer) instead of one.
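To make the "single tanh layer" concrete, here is a minimal NumPy sketch of one standard RNN step applied along a short chain. The function name rnn_step, the sizes, and the random weights are illustrative assumptions, not part of the original material.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One vanilla RNN step: a single tanh layer over [h_prev, x_t]."""
    combined = np.concatenate([h_prev, x_t])
    return np.tanh(W @ combined + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                 # illustrative sizes
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                      # initial hidden state
sequence = rng.normal(size=(5, input_size))    # 5 time steps of made-up input
for x_t in sequence:
    h = rnn_step(x_t, h, W, b)                 # the same W and b are reused at every step

print(h.shape)
```

The hidden state h is the only thing passed from one step to the next here, and it is rewritten through tanh at every step. An LSTM adds a second state that is updated far more gently.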

⛓️ RNN Chain Structure: Standard vs LSTM

Figure: the highlighted middle cell (t) is the currently active one. The blue horizontal line is the cell state, the long-term memory path that runs through the whole chain.

Section 02

Why Do We Need LSTMs?

Traditional RNNs often struggle to retain information over long sequences due to issues like vanishing and exploding gradients.

📖 Concrete Example

Language Modelling — When Context Disappears

Consider language modelling, where predicting the next word may depend on context established much earlier in a sentence or paragraph:

"The clouds in the sky are very ____."

To predict a likely completion ("dark", "gray", "bright"), the model needs to remember the words clouds and sky from earlier in the sentence. Over a gap this short a standard RNN often copes, but as the distance between the relevant context and the prediction grows, it tends to lose that context and predict poorly. An LSTM handles longer gaps better because its cell state acts as a selective memory, keeping "clouds" and "sky" available until they are needed.

📉 The Vanishing Gradient — Why RNN Memory Fades

📉

Vanishing Gradient

In RNNs

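During backpropagation through time, the gradient is multiplied at every step by a factor that depends on the recurrent weights and the tanh derivative. When these factors are consistently smaller than 1 in magnitude, the gradient shrinks exponentially with distance and soon becomes too small to drive learning. The network then fails to learn long-range dependencies and effectively sees only a short window.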

💥

Exploding Gradient

Also in RNNs

If those per-step factors are larger than 1 in magnitude, the gradient grows exponentially instead. Training becomes unstable and the weights jump erratically. Gradient clipping limits the damage but does not address the underlying cause.

🛤️

LSTM Solution

Cell State Highway

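The cell state is updated by addition rather than by a repeated matrix multiplication and squashing. Along the cell-state path the gradient is scaled only by the forget gate fₜ at each step, so when fₜ stays close to 1 it flows back almost unchanged. This greatly reduces vanishing gradients; exploding gradients can still occur, so clipping remains common in practice. This path is the "constant error carousel."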
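The arithmetic behind these three cards can be sketched in a few lines of plain Python. The per-step factors 0.9, 1.1, and 0.999 and the 50-step horizon are assumptions chosen for illustration; real networks have a different factor at every step.

```python
def product_of_factors(factor, steps):
    """Multiply the same per-step factor together `steps` times."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

steps = 50
shrinking = product_of_factors(0.9, steps)    # per-step factor below 1: vanishes
growing = product_of_factors(1.1, steps)      # per-step factor above 1: explodes
gated = product_of_factors(0.999, steps)      # forget gate held near 1: survives

print(f"0.9   over {steps} steps: {shrinking:.4f}")
print(f"1.1   over {steps} steps: {growing:.1f}")
print(f"0.999 over {steps} steps: {gated:.4f}")
```

The third line is the LSTM case: the gradient along the cell state is still a product, but of forget-gate values the network can learn to keep close to 1.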

Section 03

How Do LSTMs Work? — The Cell State Conveyor Belt

The crucial innovation in LSTMs is the cell state. Think of it as a conveyor belt that carries information through the network. Information can flow along this conveyor belt almost unchanged, allowing long-term information retention.

The key to LSTMs is the cell state — the horizontal line running through the top of the diagram. The cell state runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.

🏭 The Cell State: Conveyor Belt Animation

The packages on the conveyor belt represent pieces of information ("clouds", "sky", "plural") being carried unchanged across time steps. Gates decide what to load, remove, or read.

Section 04

The Complete LSTM Cell: Step by Step

An LSTM cell consists of three gates and one candidate state computation. The sections that follow walk through each stage in sequence.

🔬 Full LSTM Cell: Gate-by-Gate Walkthrough

LSTM Overview: The LSTM cell takes the current input xₜ and the previous hidden state hₜ₋₁, plus the previous cell state Cₜ₋₁, and produces a new hidden state hₜ and a new cell state Cₜ.

Section 05

Gate 1 — The Forget Gate

🔴

Forget Gate — Purpose & Formula

Purpose: Decides what information from the previous cell state should be discarded.
Formula: fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
Activation: Sigmoid σ — outputs values between 0 and 1
If fₜ = 0 → information is completely forgotten
If fₜ = 1 → information is fully retained
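As a rough illustration of the forget gate formula, the NumPy sketch below computes fₜ for made-up inputs and applies it to a made-up previous cell state. The sizes, the random weights W_f, and the vectors h_prev, x_t, and c_prev are all assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and values, chosen only for illustration.
hidden_size, input_size = 3, 2
rng = np.random.default_rng(1)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = np.array([0.5, -0.2, 0.1])   # previous hidden state
x_t = np.array([1.0, 0.3])            # current input
c_prev = np.array([2.0, -1.0, 0.5])   # previous cell state

# f_t = sigmoid(W_f . [h_prev, x_t] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
retained = f_t * c_prev               # element-wise scaling of old memory

# The two extremes from the text, using hand-picked pre-activations.
extremes = sigmoid(np.array([-20.0, 20.0]))   # close to 0 and close to 1
print(f_t, retained, extremes)
```

Each component of the cell state gets its own gate value, so the cell can erase one feature while keeping another.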

🔴 Forget Gate Animation: What Gets Erased

📖 Example — "The clouds in the sky are very ____"

Input

xₜ = "are" (current word); hₜ₋₁ = hidden state from "sky"

fₜ ≈ 1.0

"clouds" and "sky" → keep in cell state — they are still relevant

fₜ ≈ 0.0

Unrelated context from previous sentences → erase

Result

Cₜ still contains "clouds", "sky" — ready to predict the next word

Section 06

Gate 2 — The Input Gate & Candidate Values

🟢

Input Gate — Purpose & Formula

Purpose: Determines what new information should be added to the cell state.
iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi) — gate filter (0–1)
C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc) — candidate values (-1 to 1)
The input gate iₜ decides how much of the candidate memory C̃ₜ should be written into the cell state. Together: iₜ ⊙ C̃ₜ
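The two formulas above can be sketched together in NumPy. The weights W_i and W_c, the sizes, and the input vector are hypothetical values used only to show the value ranges and the element-wise product iₜ ⊙ C̃ₜ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes and random weights, for illustration only.
hidden_size, input_size = 3, 2
rng = np.random.default_rng(2)
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

h_prev = np.array([0.5, -0.2, 0.1])
x_t = np.array([1.0, 0.3])
combined = np.concatenate([h_prev, x_t])      # [h_prev, x_t]

i_t = sigmoid(W_i @ combined + b_i)           # gate filter, each value in (0, 1)
c_tilde = np.tanh(W_c @ combined + b_c)       # candidate values, each in (-1, 1)
write = i_t * c_tilde                         # what would be added to the cell state

print(i_t, c_tilde, write)
```

Because iₜ is at most 1 in every component, the written amount can never exceed the candidate in magnitude; the gate can only attenuate it.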

🟢 Input Gate: What Gets Written into Memory

Section 07

The Cell State Update — Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ

The overall cell state update equation is the heart of the LSTM. This single equation combines both gates to produce the new memory state.

🔴 Forget Term

fₜ · Cₜ₋₁

Scale old memory by forget gate. Values near 0 are erased. Values near 1 survive unchanged. This is the "what to throw away" operation.

🟢 Input Term

iₜ · C̃ₜ

Scale new candidate values by input gate. Only selected new information gets added. The input gate acts as a volume knob on new data.

🟡 Combined Update

Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ

Element-wise: erase old irrelevant parts, add relevant new parts. The addition (not multiplication) here is what protects gradients from vanishing.

🔑 Why Addition Matters

∂Cₜ / ∂Cₜ₋₁ = fₜ

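With the gates held fixed, the gradient of the cell state with respect to the previous cell state equals fₜ, a learned value between 0 and 1. Unlike a vanilla RNN, this path involves no repeated weight-matrix multiplication or tanh derivative, so when the network keeps fₜ near 1 the gradient does not vanish. (The gates themselves depend on hₜ₋₁, which adds further, indirect gradient paths.)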
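The derivative can be checked numerically with a central finite difference. This is a scalar sketch with made-up gate values, and it treats the gates as constants, which is the direct-path assumption behind the formula.

```python
def cell_update(c_prev, f_t, i_t, c_tilde):
    """Scalar version of C_t = f_t * C_{t-1} + i_t * C~_t."""
    return f_t * c_prev + i_t * c_tilde

# Made-up values for one cell-state component.
f_t, i_t, c_tilde = 0.8, 0.3, 0.5
c_prev = 1.2
eps = 1e-6

# Central difference approximation of dC_t / dC_{t-1}, gates held fixed.
numeric_grad = (
    cell_update(c_prev + eps, f_t, i_t, c_tilde)
    - cell_update(c_prev - eps, f_t, i_t, c_tilde)
) / (2 * eps)

print(f"numeric gradient: {numeric_grad:.6f}, forget gate: {f_t}")
```

The numeric gradient matches fₜ regardless of iₜ and C̃ₜ, because the input term is added rather than multiplied in.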

💡

This ensures relevant long-term information is retained

The update equation lets the network retain relevant long-term information while discarding unnecessary details. The network learns, through backpropagation, what to forget and what to write at every step. Neither the forget gate nor the input gate is manually programmed: both are optimised by gradient descent to minimise the task's loss function.
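A small worked example of the element-wise update, in plain Python. The three-component vectors are hand-picked assumptions so that each slot shows a different behaviour: keep, overwrite, and blend.

```python
# Hand-picked values for a 3-component cell state (illustrative only).
c_prev  = [2.0, -1.0, 4.0]   # old memory
f_t     = [1.0,  0.0, 0.5]   # keep slot 0, erase slot 1, halve slot 2
i_t     = [0.0,  1.0, 0.5]   # write nothing, write fully, write half
c_tilde = [0.9, -0.5, 0.2]   # candidate values

# C_t = f_t * C_{t-1} + i_t * C~_t, applied slot by slot.
c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]

print(c_t)   # approximately [2.0, -0.5, 2.1]
```

Slot 0 is carried over untouched, slot 1 is replaced by its candidate, and slot 2 mixes half of the old value with half of the new one.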

Section 08

Gate 3 — The Output Gate

🟣

Output Gate — Purpose & Formula

Purpose: Determines what part of the cell state should be output as the hidden state.
oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
hₜ = oₜ · tanh(Cₜ)
The output gate oₜ, computed from hₜ₋₁ and xₜ, decides how much of each cell-state component is exposed. The cell state is first squashed through tanh and then filtered by oₜ, so the model may "know" something long-term but choose not to express it until the right moment.
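The "squash, then filter" step can be illustrated with a few hand-picked numbers in plain Python. The cell state and gate values below are assumptions for the example, not outputs of a trained model.

```python
import math

# Hand-picked values for a 3-component cell (illustrative only).
c_t = [2.0, -0.5, 2.1]    # current cell state
o_t = [0.0,  1.0, 0.5]    # output gate: hide, expose fully, expose half

# h_t = o_t * tanh(C_t), applied slot by slot.
h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]

for c, o, h in zip(c_t, o_t, h_t):
    print(f"C = {c:+.2f}  o = {o:.1f}  h = {h:+.4f}")
```

Slot 0 holds a large value in the cell state but contributes nothing to hₜ, which is the "knows it but does not express it yet" case described above.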

🟣 Output Gate Animation: What the Cell Exposes

Section 09

Complete Equation Summary

🔴 Forget Gate

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)

What to erase from old cell state. Sigmoid output ∈ [0, 1]. Learned weights Wf.

🟢 Input Gate

iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)

What new values to let through. Sigmoid output ∈ [0, 1]. Learned weights Wi.

🟡 Candidate Values

C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc)

Proposed new cell state values. tanh output ∈ [-1, 1]. Learned weights Wc.

⚡ Cell Update

Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ

New cell state = erase old + write new. The gradient highway: the additive update helps gradients avoid vanishing.

🟣 Output Gate

oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)

What to expose from cell state. Sigmoid output ∈ [0, 1]. Learned weights Wo.

⚪ Hidden State

hₜ = oₜ · tanh(Cₜ)

Working memory output → next layer or prediction. Squashed and filtered version of cell state.
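The six equations can be chained in order and applied across a short sequence. The following NumPy sketch shows how hₜ and Cₜ are threaded from one time step to the next; the function name lstm_step, the params dictionary, the sizes, and the random inputs are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """Apply the six summary equations once and return (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])                     # [h_prev, x_t]
    f_t = sigmoid(params["Wf"] @ z + params["bf"])        # forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])        # input gate
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])    # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                    # cell update
    o_t = sigmoid(params["Wo"] @ z + params["bo"])        # output gate
    h_t = o_t * np.tanh(c_t)                              # hidden state
    return h_t, c_t

input_size, hidden_size, seq_len = 4, 3, 6    # illustrative sizes
rng = np.random.default_rng(0)
params = {}
for gate in "fico":
    shape = (hidden_size, hidden_size + input_size)
    params["W" + gate] = rng.normal(scale=0.5, size=shape)
    params["b" + gate] = np.zeros(hidden_size)

h = np.zeros(hidden_size)    # initial hidden state
c = np.zeros(hidden_size)    # initial cell state
for x_t in rng.normal(size=(seq_len, input_size)):
    h, c = lstm_step(x_t, h, c, params)   # same params reused at every step

print("h:", h)
print("c:", c)
```

Note that hₜ is always bounded by tanh and the output gate, while Cₜ can accumulate beyond the range (-1, 1) over several steps.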

Section 10

Worked Example — "The Clouds in the Sky Are Very ____"

Let's trace an LSTM through the sentence word by word to see what each gate might do. The gate values below are illustrative, not measurements from a trained model.

📝 Step-by-Step LSTM Trace — Weather Sentence Completion

xₜ = "The"

Forget gate ≈ 1.0 (nothing relevant to erase yet). Input gate writes "start of sentence" signal to cell state. Cell state initialised.

xₜ = "clouds"

Forget gate stays high. Input gate writes weather_subject = clouds to cell state with high strength (iₜ ≈ 0.9). C̃ₜ encodes "clouds → weather context".

xₜ = "in the"

Filler words — input gate nearly closes (iₜ ≈ 0.05). Cell state barely changes. "clouds" memory preserved unchanged.

xₜ = "sky"

Input gate opens again — adds context = outdoor/weather to cell state. Forget gate keeps "clouds" memory. Cell state now encodes {clouds, sky, weather}.

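xₜ = "are very"

Intensifier words. Input gate writes an "intensifier follows" signal. In this illustration the output gate stays low while "are" is read, since there is little to expose yet. Cell state carries all context forward.

Predicting "____"

The blank is the prediction target, not an input. After "very" has been read, the output gate opens fully (oₜ ≈ 1.0) and exposes {clouds, sky, weather, intensifier} from the cell state via tanh. hₜ feeds into a softmax over the vocabulary. Top predictions: "dark", "gray", "bright".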

🎯

Why This Works — The Cell State Carried Everything

A vanilla RNN tends to lose "clouds" as the gap to the blank grows, because repeated multiplications degrade the signal. The LSTM's cell state acted as a protected conveyor belt: "clouds" rode through the five following time steps essentially unchanged because the forget gate stayed near 1.0 and the input gate stayed near 0 for irrelevant words. The context survives to the prediction, and during training the gradient survives along the same path.
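The trace can be replayed numerically for a single cell-state slot that stores the "clouds" signal. Every number below is an assumption in the spirit of the trace (forget gate near 1, input gate near 0 for filler words); the 0.5 decay factor used for contrast is likewise made up.

```python
# (word, forget gate, input gate, candidate) for one cell-state slot
# that stores the "clouds" signal. All numbers are illustrative assumptions.
trace = [
    ("The",    1.00, 0.00, 0.0),
    ("clouds", 1.00, 0.90, 1.0),   # strong write of the "clouds" signal
    ("in",     0.98, 0.05, 0.0),   # filler: gate almost closed, nothing to add
    ("the",    0.98, 0.05, 0.0),
    ("sky",    0.98, 0.00, 0.0),   # "sky" is assumed to go to a different slot
    ("are",    0.98, 0.00, 0.0),
    ("very",   0.98, 0.00, 0.0),
]

c = 0.0
history = []
for word, f_t, i_t, c_tilde in trace:
    c = f_t * c + i_t * c_tilde      # C_t = f_t * C_{t-1} + i_t * C~_t
    history.append((word, c))
    print(f"{word:>6}: C = {c:.3f}")

# For contrast, assume a plain recurrent path that scales the same
# signal by 0.5 at each of the five steps after "clouds".
decayed = 0.9 * 0.5 ** 5
print(f"decayed signal after five steps: {decayed:.3f}")
```

With a forget gate of 0.98 the stored value drops only from 0.9 to about 0.81 over five steps, whereas the assumed 0.5 decay leaves under 0.03.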

Section 11

Real-World Applications of LSTMs

📚

Language Modelling & Translation

NLP

Capturing context over long sentences. An encoder LSTM reads the source sentence; a decoder LSTM generates the target. The encoder's final states carry the meaning of the source sentence to the decoder. Stacked LSTMs were the basis of Google's 2016 neural machine translation system (GNMT).

🗣️

Speech Recognition

Audio Sequences

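Understanding sequences of phonemes and predicting subsequent words. Bidirectional LSTMs (BiLSTMs) process audio features such as MFCCs both forward and backward, which requires the whole utterance to be available. Low-latency voice assistants running on embedded hardware typically rely on unidirectional LSTMs instead, so that audio can be processed as it streams in.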

📈

Time Series Prediction

Sequential Data

Stock prices, weather forecasts, and other sequential datasets. LSTMs can learn temporal patterns such as seasonal cycles, trend changes, and anomalies. They are well suited to data where the recent past helps predict the near future.

Section 12

Python Implementation — PyTorch LSTM in Practice

Below is an annotated PyTorch implementation of a single LSTM cell that mirrors the architecture studied above (forget, input, candidate, and output computations), followed by a usage example of PyTorch's built-in nn.LSTM.

import torch
import torch.nn as nn

class LSTMCell_Manual(nn.Module):
    """
    Manual LSTM cell for a single time step: implements the four learned
    layers (forget, input, candidate, output) and both state updates explicitly.
    Matches the PPT diagram step by step.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        n = input_size + hidden_size

        # Four weight matrices (one per gate/candidate)
        self.Wf = nn.Linear(n, hidden_size)  # Forget gate   → fₜ
        self.Wi = nn.Linear(n, hidden_size)  # Input gate    → iₜ
        self.Wc = nn.Linear(n, hidden_size)  # Candidate     → C̃ₜ
        self.Wo = nn.Linear(n, hidden_size)  # Output gate   → oₜ

    def forward(self, x, h_prev, c_prev):
        # Concatenate previous hidden state + current input
        combined = torch.cat([h_prev, x], dim=1)  # [hₜ₋₁, xₜ]

        # ── Step 1: Forget Gate ──────────────────────────────
        f_t = torch.sigmoid(self.Wf(combined))
        # f_t ∈ [0,1] — how much of old cell state to keep

        # ── Step 2: Input Gate + Candidate ───────────────────
        i_t   = torch.sigmoid(self.Wi(combined))
        c_tld = torch.tanh   (self.Wc(combined))
        # i_t   ∈ [0,1]  — how much new info to write
        # c_tld ∈ [-1,1] — candidate values to potentially write

        # ── Step 3: Cell State Update ────────────────────────
        c_t = f_t * c_prev + i_t * c_tld
        # Cₜ = fₜ ⊙ Cₜ₋₁  +  iₜ ⊙ C̃ₜ
        # ADDITION = gradient highway (no vanishing!)

        # ── Step 4: Output Gate ──────────────────────────────
        o_t = torch.sigmoid(self.Wo(combined))
        h_t = o_t * torch.tanh(c_t)
        # hₜ = oₜ ⊙ tanh(Cₜ)

        return h_t, c_t  # new hidden state, new cell state


# ── Using PyTorch's built-in (optimised CUDA version) ─────
model = nn.LSTM(
    input_size  = 10,    # xₜ dimension
    hidden_size = 128,   # hₜ and Cₜ dimension
    num_layers  = 2,     # stacked LSTMs
    dropout     = 0.2,   # between layers only
    batch_first = True,  # input: (batch, seq_len, features)
    bidirectional = False
)

# Forward pass
x   = torch.randn(4, 20, 10)    # batch=4, seq=20, features=10
h0  = torch.zeros(2, 4, 128)    # (num_layers, batch, hidden)
c0  = torch.zeros(2, 4, 128)

output, (hn, cn) = model(x, (h0, c0))
print(f"Output shape: {output.shape}")     # (4, 20, 128)
print(f"Final hidden: {hn.shape}")          # (2, 4, 128)
print(f"Final cell:   {cn.shape}")          # (2, 4, 128)

OUTPUT

Output shape: torch.Size([4, 20, 128])
Final hidden: torch.Size([2, 4, 128])
Final cell:   torch.Size([2, 4, 128])

🔎

Manual vs Built-in — Same Math, Different Speed

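The LSTMCell_Manual class above implements the same gate equations that PyTorch's nn.LSTM applies at each time step. nn.LSTM additionally loops over the sequence and stacks layers, and it keeps separate input and hidden bias vectors, which is mathematically equivalent to a single bias. The built-in version stores the four gates' weights concatenated into larger matrices, so the gate pre-activations come from a few large matrix multiplies instead of four small ones, and on GPU it can use fused cuDNN kernels. This makes it significantly faster. Use nn.LSTM in production, and LSTMCell_Manual for learning and debugging.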
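The idea of fusing the gate computations can be sketched in NumPy: stacking the four weight matrices and multiplying once gives the same pre-activations as four separate products. The stacking order (f, i, c, o), the sizes, and the random values are assumptions for this sketch and do not reflect PyTorch's internal layout.

```python
import numpy as np

hidden_size, input_size = 3, 2                # illustrative sizes
rng = np.random.default_rng(3)
shape = (hidden_size, hidden_size + input_size)
W_f, W_i, W_c, W_o = (rng.normal(size=shape) for _ in range(4))
z = rng.normal(size=hidden_size + input_size)     # stands in for [h_prev, x_t]

# Four separate matrix-vector products, one per gate or candidate.
separate = [W_f @ z, W_i @ z, W_c @ z, W_o @ z]

# One fused product: stack the four matrices row-wise, multiply once,
# then split the result back into four chunks of length hidden_size.
W_fused = np.concatenate([W_f, W_i, W_c, W_o], axis=0)   # shape (4H, H + D)
fused = np.split(W_fused @ z, 4)

matches = [np.allclose(a, b) for a, b in zip(separate, fused)]
print(W_fused.shape, matches)
```

The result is identical; the benefit is that one large multiply uses the hardware more efficiently than four small ones.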

Section 13

Key Takeaways — What You Now Know

🧠 LSTM — The Five Things to Remember

The cell state is the long-term memory. It runs as a horizontal highway through the entire sequence — not passing through a dense layer at every step. This is what makes LSTMs fundamentally different from vanilla RNNs.

Three gates, four learned layers. Forget (erase), Input + Candidate (write), Output (read). All four sets of weights are learned jointly via backpropagation through time. Nothing about what to remember is programmed by hand; the network learns it from data.

Addition counters vanishing gradients. The cell state update Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ is additive. Along this path the gradient is scaled only by the forget gate at each step, so when fₜ stays near 1 it does not shrink exponentially and can survive across hundreds of time steps.

Cell state ≠ hidden state. The cell state (Cₜ) is the private long-term storage. The hidden state (hₜ) is the public working memory — a filtered, squashed version of the cell state that the next layer or prediction head reads. They are decoupled by design.

All weights are shared across time. The same Wf, Wi, Wc, Wo are applied at every time step. This gives LSTMs their generalisability — they learn patterns, not positions — and keeps parameter count independent of sequence length.
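The last point can be checked with simple arithmetic. The helper below, lstm_param_count, is a hypothetical function written for this example; it assumes PyTorch's nn.LSTM parameter layout for a unidirectional network (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per layer, each covering four gates).

```python
def lstm_param_count(input_size, hidden_size, num_layers=1):
    """Count weights and biases for a stacked, unidirectional LSTM.

    Assumes PyTorch's nn.LSTM layout: per layer, an input-to-hidden matrix,
    a hidden-to-hidden matrix, and two bias vectors, each covering 4 gates.
    """
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += 4 * hidden_size * layer_input   # input-to-hidden weights
        total += 4 * hidden_size * hidden_size   # hidden-to-hidden weights
        total += 2 * 4 * hidden_size             # two bias vectors
        layer_input = hidden_size                # deeper layers read h from the layer below
    return total

# Same configuration as the nn.LSTM example: input 10, hidden 128, 2 layers.
count = lstm_param_count(input_size=10, hidden_size=128, num_layers=2)
print(count)  # 203776
```

Sequence length does not appear anywhere in the function: the count is the same whether the input has 20 time steps or 2,000.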