
LSTM Networks From Cell State to Gates

A visual, story-driven tutorial on Long Short-Term Memory (LSTM) networks, built from slide-by-slide PPT content. It covers why vanilla RNNs fail, how the cell state acts as a memory highway, and how the three gates (forget, input, and output) work, with animated SVG diagrams, a worked sentence example, and annotated PyTorch code.

Section 01

What Are LSTM Networks?

Long Short-Term Memory networks, usually just called LSTMs, are a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many researchers in the work that followed. They work well on a large variety of sequence problems and have been widely used across industry and research.

Remembering Is Their Default, Not a Struggle
LSTMs are explicitly designed to avoid the long-term dependency problem. Consider reading a novel: when you reach the climax on page 300, you still remember the key character trait planted on page 5. Standard RNNs struggle to do this: after a handful of steps, earlier context fades. LSTMs address this by maintaining a dedicated memory highway that can carry information across many time steps with little degradation.

Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
🔗
The Chain of Repeating Modules

All recurrent neural networks have the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer. LSTMs also have this chain-like structure, but the repeating module has a more sophisticated internal design, with four interacting components instead of one.

⛓️ RNN Chain Structure: Standard vs LSTM
[Figure: LSTM chain of three unrolled cells (A at t-1, t, and t+1) sharing the same weights. Each cell contains a forget gate (σ), an input gate (σ), a candidate layer (tanh), and an output gate (σ). Inputs xₜ₋₁, xₜ, xₜ₊₁ enter from below, hidden states hₜ₋₁, hₜ, hₜ₊₁ are passed along and output, and the cell state C (the long-term memory highway) runs along the top from Cₜ₋₁ to Cₜ to Cₜ₊₁.]

The middle cell (t) is the currently active one. The horizontal line is the cell state: the long-term memory highway that runs through the chain with only minor changes at each step.


Section 02

Why Do We Need LSTMs?

Traditional RNNs often struggle to retain information over long sequences due to issues like vanishing and exploding gradients.

Language Modelling: When Context Disappears
Consider language modelling, where predicting the next word may depend on context established much earlier in a sentence or paragraph:

"The clouds in the sky are very ____."

To predict a likely completion ("dark", "gray", "bright"), you need to remember the words clouds and sky from earlier in the sentence. Standard RNNs can lose this context and make incorrect predictions. An LSTM can retain these earlier words because its cell state acts as a selective memory, keeping "clouds" and "sky" alive until they are needed.
📉 The Vanishing Gradient: Why RNN Memory Fades
[Figure: gradient signal across time steps. Vanilla RNN: 100% at t=1, 70% at t=2, 44% at t=3, 25% at t=4, 12% at t=5, 4% at t=6, about 0% at t=7 (vanished). LSTM: signal shown as preserved via the cell state.]
📉
Vanishing Gradient
In RNNs
During backpropagation through time, the gradient is multiplied at every step by a factor that depends on the recurrent weights and the activation derivative. When that factor is consistently below 1, the gradient shrinks exponentially, and after roughly 7 to 10 steps it can be effectively zero. The network then cannot learn long-range dependencies: it forgets anything beyond a short window.
💥
Exploding Gradient
Also in RNNs
When the per-step factor is consistently above 1, the gradient grows exponentially. Training becomes unstable and weights jump erratically. Gradient clipping limits the damage but does not address the cause.
🛤️
LSTM Solution
Cell State Highway
The cell state is updated by addition rather than by repeated matrix multiplication and squashing. Along the cell state, the gradient is scaled only by the forget gate, so it can pass through many steps nearly unchanged when the forget gate stays close to 1. This greatly reduces vanishing gradients; exploding gradients can still occur and are usually handled with clipping. This path is the "constant error carousel."
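The exponential shrink and growth described in the cards above can be sketched with plain arithmetic. This is a minimal illustration, assuming a single scalar factor stands in for the per-step gradient scaling; the function name `gradient_after` and the factors 0.7 and 1.3 are hypothetical choices, not values from a trained network.

```python
# Illustrative only: a scalar stand-in for the per-step gradient factor.
def gradient_after(steps, factor):
    """Size of a gradient of 1.0 after being scaled by `factor` at every step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

for steps in (1, 5, 10, 50):
    shrinking = gradient_after(steps, 0.7)   # factor below 1: vanishes
    growing = gradient_after(steps, 1.3)     # factor above 1: explodes
    print(f"{steps:>3} steps   0.7^n = {shrinking:.2e}   1.3^n = {growing:.2e}")
```

In a real RNN the factor is a Jacobian matrix rather than a scalar, but the same compounding effect applies to its dominant scale.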

Section 03

How Do LSTMs Work? The Cell State Conveyor Belt

The crucial innovation in LSTMs is the cell state. Think of it as a conveyor belt that carries information through the network. Information can flow along this conveyor belt almost unchanged, allowing long-term information retention.

In the diagrams, the cell state is the horizontal line running through the top. It runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to flow along it unchanged.

🏭 The Cell State: Conveyor Belt
[Figure: the cell state Cₜ drawn as a conveyor belt (long-term memory highway) carrying packages labelled "clouds", "sky", and "plural". A forget gate (σ, fₜ ∈ [0,1]), an input gate (σ, iₜ ∈ [0,1]), and an output gate (σ, oₜ ∈ [0,1]) sit beneath it, fed by the input xₜ and the previous hidden state hₜ₋₁, and produce hₜ.]

The packages on the conveyor belt represent pieces of information ("clouds", "sky", "plural") being carried largely unchanged across time steps. Gates decide what to load, remove, or read.


Section 04

The Complete LSTM Cell, Step by Step

An LSTM cell consists of three gates and one candidate state computation. The diagram below shows each stage in sequence.

🔬 Full LSTM Cell
LSTM Overview: The LSTM cell takes inputs xₜ (current input) and hₜ₋₁ (previous hidden state), plus the previous cell state Cₜ₋₁.
[Figure: full LSTM cell. The cell state runs from Cₜ₋₁ to Cₜ along the top through a multiply (×) and an add (+). Below it, all fed by [hₜ₋₁, xₜ]: forget gate fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf), input gate σ(Wi · [hₜ₋₁, xₜ] + bi), candidate values C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc), and output gate σ(Wo · [hₜ₋₁, xₜ] + bo), which multiplies tanh(Cₜ) to give hₜ.]

Section 05

Gate 1: The Forget Gate

🔴
Forget Gate: Purpose & Formula

Purpose: Decides what information from the previous cell state should be discarded.
Formula: fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
Activation: Sigmoid σ, which outputs values between 0 and 1
If fₜ = 0 → information is completely forgotten
If fₜ = 1 → information is fully retained

🔴 Forget Gate: What Gets Erased
[Figure: forget gate. The gate fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf), with outputs in [0.0, 1.0], multiplies the cell state on its way from Cₜ₋₁ to Cₜ. Example entries subj=he, loc=Paris, tense=past: an entry with fₜ = 0.02 is erased, one with fₜ = 0.97 is kept.]
📖 Example: "The clouds in the sky are very ____"
Input: xₜ = "are" (current word); hₜ₋₁ = hidden state from "sky"
fₜ ≈ 1.0: "clouds" and "sky" → keep in cell state; they are still relevant
fₜ ≈ 0.0: unrelated context from previous sentences → erase
Result: Cₜ still contains "clouds", "sky", ready to predict the next word
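The forget gate formula can be worked through by hand on tiny vectors. The sketch below uses only the standard library; the sizes (2 cell-state units, 1 input feature) and every weight value are hypothetical, chosen so that one unit is kept and the other is erased.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state unit."""
    combined = h_prev + x_t                      # list concatenation: [h_prev, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, combined)) + b)
        for row, b in zip(W_f, b_f)
    ]

# Hypothetical values: 2 hidden units, 1 input feature.
h_prev = [0.5, -0.5]
x_t = [1.0]
W_f = [[4.0, 0.0, 4.0],     # pre-activation +6 -> gate near 1 (keep)
       [0.0, 4.0, -4.0]]    # pre-activation -6 -> gate near 0 (erase)
b_f = [0.0, 0.0]

f_t = forget_gate(h_prev, x_t, W_f, b_f)
c_prev = [2.0, 2.0]
kept = [f * c for f, c in zip(f_t, c_prev)]      # f_t applied to the old cell state
print("f_t :", f_t)
print("kept:", kept)
```

The first unit of the old cell state survives almost intact, while the second is scaled to nearly zero.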

Section 06

Gate 2: The Input Gate & Candidate Values

🟢
Input Gate: Purpose & Formula

Purpose: Determines what new information should be added to the cell state.
iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi): gate filter (0 to 1)
C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc): candidate values (-1 to 1)
The input gate iₜ decides how much of the candidate memory C̃ₜ should be written into the cell state. Together: iₜ ⊙ C̃ₜ

🟢 Input Gate: What Gets Written into Memory
[Figure: input gate. After the forget step, the cell state receives new information through an add (+). The input gate iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi), range [0, 1], multiplies the candidate C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc), range [-1, 1]; the product iₜ ⊙ C̃ₜ is the new information to add.]
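To make the write path concrete, the sketch below computes iₜ, C̃ₜ, and their product iₜ ⊙ C̃ₜ on small lists. It is an illustration under assumed values: the helper names `affine` and `proposed_write` and all weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Row-wise W . v + b for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def proposed_write(h_prev, x_t, W_i, b_i, W_c, b_c):
    combined = h_prev + x_t                                        # [h_prev, x_t]
    i_t = [sigmoid(z) for z in affine(W_i, b_i, combined)]         # gate, in (0, 1)
    c_tilde = [math.tanh(z) for z in affine(W_c, b_c, combined)]   # candidate, in (-1, 1)
    write = [i * c for i, c in zip(i_t, c_tilde)]                  # i_t * c_tilde
    return i_t, c_tilde, write

# Hypothetical values: 2 hidden units, 1 input feature.
h_prev = [0.0, 0.0]
x_t = [1.0]
W_i = [[0.0, 0.0, 5.0], [0.0, 0.0, -5.0]]   # open the gate for unit 0, close it for unit 1
W_c = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]    # same candidate value for both units
zeros = [0.0, 0.0]

i_t, c_tilde, write = proposed_write(h_prev, x_t, W_i, zeros, W_c, zeros)
print("i_t    :", i_t)
print("c_tilde:", c_tilde)
print("write  :", write)
```

Both units propose the same candidate value, but only the unit whose input gate is open contributes a meaningful amount to the cell state.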

Section 07

The Cell State Update: Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ

The overall cell state update equation is the heart of the LSTM. This single equation combines both gates to produce the new memory state.

🔴 Forget Term
fₜ · Cₜ₋₁
Scale old memory by the forget gate. Values near 0 are erased. Values near 1 survive unchanged. This is the "what to throw away" operation.
🟢 Input Term
iₜ · C̃ₜ
Scale new candidate values by the input gate. Only selected new information gets added. The input gate acts as a volume knob on new data.
🟡 Combined Update
Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ
Element-wise: erase old irrelevant parts, add relevant new parts. The additive form is what helps protect gradients from vanishing.
🔑 Why Addition Matters
∂Cₜ / ∂Cₜ₋₁ = fₜ
Along the direct cell-state path (treating the gates as fixed), the gradient of the cell state with respect to the previous cell state is fₜ: a learned value between 0 and 1. When the network keeps fₜ near 1, the gradient passes through many steps with little shrinkage, unlike the repeated weight-and-activation multiplication in a vanilla RNN.
💡
This ensures relevant long-term information is retained

The update equation lets the network retain relevant long-term information while discarding unnecessary details. The network learns, through backpropagation, what to forget and what to write at every step. Neither the forget gate nor the input gate is manually programmed: both are optimised by gradient descent to minimize the task's loss function.
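A small numeric check of the update equation and of the gradient along the cell state is sketched below. All gate values are hypothetical, and the finite-difference check holds the gates fixed, which is an assumption: in a full LSTM the gates also depend on earlier states through hₜ₋₁.

```python
def cell_update(f_t, c_prev, i_t, c_tilde):
    """C_t = f_t * C_prev + i_t * C_tilde, element-wise."""
    return [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]

# Hypothetical gate values for two cell-state units.
f_t = [0.97, 0.02]        # keep unit 0, erase unit 1
c_prev = [0.80, -0.60]
i_t = [0.05, 0.90]        # barely write to unit 0, overwrite unit 1
c_tilde = [0.30, 0.70]

c_t = cell_update(f_t, c_prev, i_t, c_tilde)
print("c_t   :", c_t)

# Finite-difference check: with the gates held fixed, dC_t/dC_prev equals f_t.
eps = 1e-6
bumped = cell_update(f_t, [c + eps for c in c_prev], i_t, c_tilde)
slopes = [(b - a) / eps for a, b in zip(c_t, bumped)]
print("slopes:", slopes)

# Over many steps the direct-path gradient is a product of forget gates,
# so it survives when they stay near 1 and still decays when they do not.
for f in (0.99, 0.9):
    print(f"forget gate {f} over 100 steps -> {f ** 100:.2e}")
```

The last loop shows why the forget gate has to stay close to 1 for a memory to last: the additive path removes the weight-and-activation multiplication, but the product of forget gates still sets how much gradient reaches distant steps.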


Section 08

Gate 3: The Output Gate

🟣
Output Gate: Purpose & Formula

Purpose: Determines what part of the cell state should be output as the hidden state.
oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
hₜ = oₜ · tanh(Cₜ)
The output gate oₜ decides how much of the squashed cell state is exposed. The cell state is passed through tanh and then filtered by oₜ, so the model may "know" something long-term but choose not to express it until the right moment.

🟣 Output Gate: What the Cell Exposes
[Figure: output gate. The updated cell state Cₜ (full long-term memory) passes through tanh and is multiplied (×) by the output gate oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo), range [0, 1], giving hₜ = oₜ · tanh(Cₜ), which goes to the next layer or prediction head.]
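The "know but do not express" behaviour of the output gate can be seen on a few numbers. The values below are hypothetical: two units hold the same strong memory, but only one of them has its output gate open.

```python
import math

def hidden_state(o_t, c_t):
    """h_t = o_t * tanh(C_t), element-wise."""
    return [o * math.tanh(c) for o, c in zip(o_t, c_t)]

# Hypothetical values for three cell-state units.
c_t = [3.0, 3.0, -0.5]     # units 0 and 1 store the same strong memory
o_t = [0.01, 0.99, 0.99]   # unit 0 is hidden, units 1 and 2 are exposed

h_t = hidden_state(o_t, c_t)
print("c_t:", c_t)         # the cell state itself is not modified
print("h_t:", h_t)
```

Unit 0 keeps its memory in the cell state for later steps even though almost none of it reaches the hidden state now.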

Section 09

Complete Equation Summary

🔴 Forget Gate
fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)
What to erase from the old cell state. Sigmoid output ∈ [0, 1]. Learned weights Wf.
🟢 Input Gate
iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)
What new values to let through. Sigmoid output ∈ [0, 1]. Learned weights Wi.
🟡 Candidate Values
C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc)
Proposed new cell state values. tanh output ∈ [-1, 1]. Learned weights Wc.
⚡ Cell Update
Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ
New cell state = erase old + write new. The gradient highway. Addition helps prevent vanishing.
🟣 Output Gate
oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
What to expose from the cell state. Sigmoid output ∈ [0, 1]. Learned weights Wo.
⚪ Hidden State
hₜ = oₜ · tanh(Cₜ)
Working memory output → next layer or prediction. Squashed and filtered version of the cell state.

Section 10

Worked Example: "The Clouds in the Sky Are Very ____"

Let's trace an LSTM through the sentence word by word to see what each gate does. The gate values are illustrative.

📝 Step-by-Step LSTM Trace: Weather Sentence Completion
xₜ = "The"
Forget gate ≈ 1.0 (nothing relevant to erase yet). Input gate writes a "start of sentence" signal to the cell state. Cell state initialised.
xₜ = "clouds"
Forget gate stays high. Input gate writes weather_subject = clouds to the cell state with high strength (iₜ ≈ 0.9). C̃ₜ encodes "clouds → weather context".
xₜ = "in", "the"
Filler words: the input gate nearly closes (iₜ ≈ 0.05). The cell state barely changes. The "clouds" memory is preserved.
xₜ = "sky"
Input gate opens again and adds context = outdoor/weather to the cell state. Forget gate keeps the "clouds" memory. Cell state now encodes {clouds, sky, weather}.
xₜ = "are", "very"
Intensifier words. Input gate writes an "intensifier follows" signal. Output gate stays low: nothing to expose yet. Cell state carries all context forward.
Predicting "____"
Output gate opens fully (oₜ ≈ 1.0). It exposes {clouds, sky, weather, intensifier} from the cell state via tanh. hₜ feeds into softmax. Top predictions: "dark", "gray", "bright".
🎯
Why This Works: The Cell State Carried Everything

A vanilla RNN would likely have lost "clouds" by the time it reached the blank, because too many multiplications degrade the signal. The LSTM's cell state acted as a protected conveyor belt: "clouds" and "sky" rode through five time steps essentially unchanged because the forget gate stayed near 1.0 and the input gate stayed near 0 for irrelevant words. The gradient path stayed open and no context was lost.
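The trace above can be replayed on a single scalar memory slot that tracks only the "clouds" feature. This is a sketch with hand-picked gate values: iₜ ≈ 0.9 and iₜ ≈ 0.05 follow the trace, while the forget gate of 0.98, the candidate values, and the 0.5 decay used as a crude vanilla-RNN stand-in are assumptions made for illustration.

```python
def run_slot(gates, c0=0.0):
    """Apply c_t = f_t * c_prev + i_t * c_tilde to one scalar memory slot."""
    c, history = c0, []
    for token, f_t, i_t, c_tilde in gates:
        c = f_t * c + i_t * c_tilde
        history.append((token, c))
    return history

# (token, forget gate, input gate, candidate): hand-picked, hypothetical values.
gates = [
    ("The",    1.00, 0.05, 0.2),
    ("clouds", 0.98, 0.90, 1.0),   # strong write of the weather subject
    ("in",     0.98, 0.05, 0.2),   # filler: input gate nearly closed
    ("the",    0.98, 0.05, 0.2),
    ("sky",    0.98, 0.05, 0.2),   # "sky" would be written to a different slot
    ("are",    0.98, 0.05, 0.2),
    ("very",   0.98, 0.05, 0.2),
]

history = run_slot(gates)
for token, c in history:
    print(f"{token:<7} c = {c:.3f}")

# Crude stand-in for a vanilla RNN: the same 0.9 write, halved at each of 5 steps.
rnn_like = 0.9 * 0.5 ** 5
print(f"decayed stand-in after 5 steps: {rnn_like:.3f}")
```

With the forget gate near 1 and the input gate nearly closed on filler words, the slot still holds most of what "clouds" wrote when the blank is reached.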


Section 11

Real-World Applications of LSTMs

📚
Language Modelling & Translation
NLP
Capturing context over long sentences. An encoder LSTM reads the source language; a decoder LSTM generates the target. The cell state bridges meaning across the sentence. Foundational to early neural versions of Google Translate.
🗣️
Speech Recognition
Audio Sequences
Understanding sequences of phonemes and predicting subsequent words. BiLSTMs process audio features (MFCCs) both forward and backward, which requires the full utterance; unidirectional LSTMs suit low-latency voice assistants running on embedded hardware, where output must be produced as audio arrives.
📈
Time Series Prediction
Sequential Data
Stock prices, weather forecasts, and other sequential datasets. LSTMs learn temporal patterns: seasonal cycles, trend changes, and anomalies. Well suited to data where the recent past predicts the near future.

Section 12

Python Implementation: PyTorch LSTM in Practice

Below is a fully annotated PyTorch LSTM that mirrors the architecture we've studied, with the forget, input, candidate, and output computations written out explicitly.

import torch
import torch.nn as nn

class LSTMCell_Manual(nn.Module):
    """
    Manual LSTM cell: implements the four learned transforms (forget, input,
    candidate, output) and the two state updates explicitly.
    Matches the PPT diagram step by step.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        n = input_size + hidden_size

        # Four weight matrices (one per gate/candidate), each with its own bias
        self.Wf = nn.Linear(n, hidden_size)  # Forget gate   → fₜ
        self.Wi = nn.Linear(n, hidden_size)  # Input gate    → iₜ
        self.Wc = nn.Linear(n, hidden_size)  # Candidate     → C̃ₜ
        self.Wo = nn.Linear(n, hidden_size)  # Output gate   → oₜ

    def forward(self, x, h_prev, c_prev):
        # x: (batch, input_size); h_prev, c_prev: (batch, hidden_size)
        # Concatenate previous hidden state + current input
        combined = torch.cat([h_prev, x], dim=1)  # [hₜ₋₁, xₜ]

        # ── Step 1: Forget Gate ──────────────────────────────
        f_t = torch.sigmoid(self.Wf(combined))
        # f_t ∈ [0,1]: how much of the old cell state to keep

        # ── Step 2: Input Gate + Candidate ───────────────────
        i_t = torch.sigmoid(self.Wi(combined))
        c_tld = torch.tanh(self.Wc(combined))
        # i_t   ∈ [0,1]:  how much new info to write
        # c_tld ∈ [-1,1]: candidate values to potentially write

        # ── Step 3: Cell State Update ────────────────────────
        c_t = f_t * c_prev + i_t * c_tld
        # Cₜ = fₜ ⊙ Cₜ₋₁  +  iₜ ⊙ C̃ₜ
        # The additive update is the gradient highway

        # ── Step 4: Output Gate ──────────────────────────────
        o_t = torch.sigmoid(self.Wo(combined))
        h_t = o_t * torch.tanh(c_t)
        # hₜ = oₜ ⊙ tanh(Cₜ)

        return h_t, c_t  # new hidden state, new cell state


# ── Using PyTorch's built-in (optimised) version ─────────
model = nn.LSTM(
    input_size=10,       # xₜ dimension
    hidden_size=128,     # hₜ and Cₜ dimension
    num_layers=2,        # stacked LSTMs
    dropout=0.2,         # between layers only
    batch_first=True,    # input: (batch, seq_len, features)
    bidirectional=False,
)

# Forward pass
x = torch.randn(4, 20, 10)     # batch=4, seq=20, features=10
h0 = torch.zeros(2, 4, 128)    # (num_layers, batch, hidden)
c0 = torch.zeros(2, 4, 128)    # (num_layers, batch, hidden)

output, (hn, cn) = model(x, (h0, c0))
print(f"Output shape: {output.shape}")     # (4, 20, 128)
print(f"Final hidden: {hn.shape}")         # (2, 4, 128)
print(f"Final cell:   {cn.shape}")         # (2, 4, 128)
OUTPUT
Output shape: torch.Size([4, 20, 128])
Final hidden: torch.Size([2, 4, 128])
Final cell:   torch.Size([2, 4, 128])
🔎
Manual vs Built-in: Same Math, Different Speed

The LSTMCell_Manual class above implements the same gate equations as PyTorch's nn.LSTM for a single time step. nn.LSTM additionally loops over the sequence and stacks layers, and it stores separate input and hidden weight matrices with two bias vectors, which is mathematically equivalent to one matrix over [hₜ₋₁, xₜ] with one bias. The built-in version stacks the four gate weight matrices into one large matrix so the gate pre-activations come from fused matrix multiplies, making it significantly faster on GPU. Use nn.LSTM in production, and LSTMCell_Manual for learning and debugging.
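One way to check that the tutorial's equations match the built-in implementation is to recompute a single step from the weights of nn.LSTMCell, PyTorch's single-step counterpart of nn.LSTM. The sketch below assumes PyTorch's documented layout, in which the four gate blocks are stacked in the order input, forget, candidate, output; the sizes are arbitrary small values chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 3, 5, 2   # arbitrary small sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)
c_prev = torch.randn(batch, hidden_size)

with torch.no_grad():
    # One fused pre-activation of shape (batch, 4 * hidden_size).
    # PyTorch keeps separate input/hidden weights and two bias vectors.
    pre = (x @ cell.weight_ih.T + cell.bias_ih
           + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # Gate blocks are stacked in the order: input, forget, candidate, output.
    i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

    f_t = torch.sigmoid(f_pre)
    i_t = torch.sigmoid(i_pre)
    c_tilde = torch.tanh(g_pre)
    o_t = torch.sigmoid(o_pre)
    c_t = f_t * c_prev + i_t * c_tilde      # cell state update
    h_t = o_t * torch.tanh(c_t)             # hidden state

    h_ref, c_ref = cell(x, (h_prev, c_prev))

print("hidden states match:", torch.allclose(h_t, h_ref, atol=1e-5))
print("cell states match:  ", torch.allclose(c_t, c_ref, atol=1e-5))
```

The single `pre` tensor is the fused computation mentioned above: one pass produces all four gate pre-activations, which are then split into blocks.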


Section 13

Key Takeaways: What You Now Know

🧠 LSTM: The Five Things to Remember
1
The cell state is the long-term memory. It runs as a horizontal highway through the entire sequence and does not pass through a dense layer at every step. This is what makes LSTMs fundamentally different from vanilla RNNs.
2
Three gates, four learned transforms. Forget (erase), Input + Candidate (write), Output (read). All four are learned simultaneously via backpropagation through time. Nobody programs what to remember by hand; the network learns it from data.
3
Addition counters vanishing gradients. The cell state update Cₜ = fₜ · Cₜ₋₁ + iₜ · C̃ₜ uses addition. Gradients flowing back along the cell state are scaled only by the forget gate, so they need not shrink exponentially and can survive across many time steps when the forget gate stays near 1.
4
Cell state ≠ hidden state. The cell state (Cₜ) is the private long-term storage. The hidden state (hₜ) is the public working memory: a filtered, squashed version of the cell state that the next layer or prediction head reads. They are decoupled by design.
5
All weights are shared across time. The same Wf, Wi, Wc, Wo are applied at every time step. This gives LSTMs their generalisability (they learn patterns, not positions) and keeps the parameter count independent of sequence length.
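The point about parameter count can be checked with a short calculation. The sketch below counts the parameters of the manual cell from this tutorial (four linear layers over the concatenated [hₜ₋₁, xₜ], each with one bias vector); the function name `manual_cell_params` is hypothetical.

```python
def manual_cell_params(input_size, hidden_size):
    """Parameters in four Linear(input + hidden -> hidden) layers with bias."""
    weights_per_gate = (input_size + hidden_size) * hidden_size
    biases_per_gate = hidden_size
    return 4 * (weights_per_gate + biases_per_gate)

# Sequence length never appears in the formula: the same weights are reused
# at every time step, whether the sequence has 5 steps or 500.
total = manual_cell_params(input_size=10, hidden_size=128)
print(total)
```

PyTorch's nn.LSTM keeps two bias vectors per layer instead of one, so its own count for the same sizes is slightly higher than this figure.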