What Are LSTM Networks?
Long Short-Term Memory networks, usually just called LSTMs, are a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many researchers in the work that followed. They work well on a wide variety of sequence problems and are now widely used across industry and research.
Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer. LSTMs also have this chain-like structure, but the repeating module has a more sophisticated internal design, with four interacting components instead of one.
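As a rough sketch of the "single tanh layer" module described above, the snippet below applies one standard RNN step repeatedly over a short random sequence. The sizes, the names `rnn_layer` and `rnn_step`, and the random input are illustrative assumptions, not taken from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 3, 4  # illustrative sizes (assumption)

# The whole repeating module of a standard RNN: one linear layer followed by tanh
rnn_layer = nn.Linear(hidden_size + input_size, hidden_size)


def rnn_step(x_t, h_prev):
    """One step of the chain: h_t = tanh(W [h_prev, x_t] + b)."""
    return torch.tanh(rnn_layer(torch.cat([h_prev, x_t], dim=1)))


h = torch.zeros(1, hidden_size)            # initial hidden state
sequence = torch.randn(5, 1, input_size)   # 5 time steps, batch of 1

for x_t in sequence:                       # the same module is reused at every step
    h = rnn_step(x_t, h)

print(h.shape)
```

The hidden state `h` is the only thing carried from step to step here, which is the limitation the LSTM's separate cell state is meant to address.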
Diagram: a chain of LSTM cells in which the highlighted middle cell (t) is the currently active one. The blue horizontal line is the cell state, the long-term memory highway running through the chain.
Why Do We Need LSTMs?
Traditional RNNs often struggle to retain information over long sequences because of vanishing and exploding gradients. Consider filling in the blank in this sentence:
"The clouds in the sky are very ____."
To predict a likely completion ("dark", "gray", "bright"), you need to remember the words "clouds" and "sky" from earlier in the sentence. Standard RNNs can lose this context and make incorrect predictions. An LSTM can retain these earlier words because its cell state acts as a selective memory, keeping "clouds" and "sky" available until they are needed.
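The vanishing and exploding gradient problem mentioned above can be illustrated with plain arithmetic. The sketch below assumes, as a simplification, that backpropagation multiplies the gradient by the same scalar factor at every time step; the function name `backprop_scale` and the factors 0.5 and 1.5 are made up for illustration.

```python
def backprop_scale(factor: float, steps: int) -> float:
    """Size of a unit gradient after being multiplied by `factor` once per time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


for steps in (5, 20, 50):
    shrinking = backprop_scale(0.5, steps)  # per-step factor below 1: gradient vanishes
    growing = backprop_scale(1.5, steps)    # per-step factor above 1: gradient explodes
    print(f"steps={steps:2d}  factor 0.5 -> {shrinking:.3e}   factor 1.5 -> {growing:.3e}")
```

A factor of exactly 1.0 leaves the gradient untouched regardless of sequence length, which is the behavior the LSTM's cell state approximates when its forget gate stays close to 1.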
How Do LSTMs Work? The Cell State Conveyor Belt
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. Think of it as a conveyor belt that carries information through the network. The cell state runs straight down the entire chain with only minor linear interactions (an elementwise scaling by the forget gate and an addition of new content), so information can flow along it almost unchanged, which allows long-term retention.
The packages on the conveyor belt represent pieces of information ("clouds", "sky", "plural") being carried unchanged across time steps. Gates decide what to load, remove, or read.
The Complete LSTM Cell: Step by Step
An LSTM cell consists of three gates and one candidate state computation. The sections below walk through each stage in sequence: forget gate, input gate and candidate values, cell state update, and output gate.
Gate 1: The Forget Gate
Purpose: Decides what information from the previous cell state should be discarded.
Formula: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
Activation: sigmoid $\sigma$, which outputs values between 0 and 1
If $f_t = 0$: the information is completely forgotten
If $f_t = 1$: the information is fully retained
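A minimal numeric sketch of the forget gate's effect follows. The pre-activation values and the previous cell state are hypothetical numbers chosen to show the two extremes and the midpoint; in a real cell the pre-activations come from the learned weights.

```python
import torch

# Hypothetical pre-activations W_f · [h_{t-1}, x_t] + b_f for a 3-unit cell
pre_activation = torch.tensor([-10.0, 0.0, 10.0])
f_t = torch.sigmoid(pre_activation)      # close to 0, exactly 0.5, close to 1

c_prev = torch.tensor([2.0, 2.0, 2.0])   # previous cell state (made-up values)
kept = f_t * c_prev                      # elementwise: what survives the forget gate

print(f_t)
print(kept)
```

The first unit is almost entirely erased, the second is halved, and the third passes through nearly intact.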
Gate 2: The Input Gate & Candidate Values
Purpose: Determines what new information should be added to the cell state.
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$: gate filter (0 to 1)
$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$: candidate values (-1 to 1)
The input gate $i_t$ decides how much of the candidate memory $\tilde{C}_t$ should be written into the cell state. Together: $i_t \odot \tilde{C}_t$
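The two computations above can be sketched with a pair of randomly initialized linear layers. The sizes and the names `W_i`, `W_c`, and `write` are assumptions for illustration; the weights are untrained, so the numbers only show shapes and ranges.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes (assumption)

W_i = nn.Linear(hidden_size + input_size, hidden_size)  # input gate weights and bias
W_c = nn.Linear(hidden_size + input_size, hidden_size)  # candidate weights and bias

h_prev = torch.zeros(1, hidden_size)
x_t = torch.randn(1, input_size)
combined = torch.cat([h_prev, x_t], dim=1)   # [h_{t-1}, x_t]

i_t = torch.sigmoid(W_i(combined))           # gate filter, each entry in (0, 1)
c_tilde = torch.tanh(W_c(combined))          # candidate values, each entry in (-1, 1)
write = i_t * c_tilde                        # what would be added to the cell state

print(i_t)
print(c_tilde)
print(write)
```

Because every entry of `i_t` is below 1, each entry of `write` is no larger in magnitude than the corresponding candidate value: the gate can only attenuate what the candidate proposes.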
The Cell State Update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
The overall cell state update equation is the heart of the LSTM. This single equation combines both gates to produce the new memory state.
The update equation lets relevant long-term information be retained while unnecessary details are discarded. The network learns, through backpropagation, what to forget and what to write at every step. Neither the forget gate nor the input gate is manually programmed: both are optimized by gradient descent to minimize the task's loss function.
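As a worked instance of the update equation, the sketch below uses hand-picked (hypothetical) gate values so that each of the three memory slots shows a different behavior: keep, overwrite, and blend.

```python
import torch

c_prev  = torch.tensor([0.8, -0.5,  0.3])   # old memory C_{t-1}
f_t     = torch.tensor([1.0,  0.0,  0.5])   # keep all, erase all, keep half
i_t     = torch.tensor([0.0,  1.0,  0.5])   # write nothing, write all, write half
c_tilde = torch.tensor([0.9,  0.7, -0.1])   # candidate values

# C_t = f_t * C_{t-1} + i_t * C~_t, all products elementwise
c_t = f_t * c_prev + i_t * c_tilde

print(c_t)  # approximately [0.8, 0.7, 0.1]
```

Slot 0 carries its old value forward untouched, slot 1 is replaced by the candidate, and slot 2 ends up as 0.5 · 0.3 + 0.5 · (-0.1) = 0.1.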
Gate 3: The Output Gate
Purpose: Determines what part of the cell state should be output as the hidden state.
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \odot \tanh(C_t)$
The output gate $o_t$, computed from the previous hidden state and the current input, controls how much of the current cell state is exposed. The cell state is squashed through tanh and then filtered, so the model may "know" something long-term but choose not to express it until the right moment.
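The idea that the cell can hold information without expressing it can be shown with a few hypothetical numbers. The output gate values below are set by hand to fully closed or fully open; a real sigmoid gate only approaches those extremes.

```python
import torch

c_t = torch.tensor([3.0, 3.0, -3.0])   # cell state: all three slots hold strong values
o_t = torch.tensor([0.0, 1.0, 1.0])    # hypothetical gate: closed, open, open

h_t = o_t * torch.tanh(c_t)            # h_t = o_t * tanh(C_t), elementwise

print(h_t)   # slot 0 is hidden from the output even though c_t[0] is large
print(c_t)   # the cell state itself is not modified by the output gate
```

Slot 0 stays in memory for later steps while contributing nothing to the current hidden state.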
Complete Equation Summary
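Collected in one place, the six equations from the sections above are as follows, with $\odot$ denoting elementwise multiplication and $[h_{t-1}, x_t]$ the concatenation of the previous hidden state and the current input:

```latex
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{C}_t &= \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state}
\end{aligned}
```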
Worked Example: "The Clouds in the Sky Are Very ____"
Let's trace an LSTM through the sentence word-by-word to see exactly what each gate does.
A vanilla RNN would likely have lost "clouds" by the time it reached the blank, because repeated multiplications through the recurrent layer degrade the signal. The LSTM's cell state acted as a protected conveyor belt: "clouds" and "sky" rode through five time steps essentially unchanged because the forget gate stayed near 1.0 and the input gate stayed near 0 for irrelevant words. With the forget gate near 1, the gradient along the cell state barely shrinks, so the context is not lost.
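To put rough numbers on this, the sketch below stores a value of 1.0 in one memory slot and carries it through five steps with hypothetical gate values (forget gate 0.99, input gate 0.01), then uses autograd to measure the gradient flowing back along the cell state. The gates are treated as fixed constants here, which is a simplification: in a trained LSTM they depend on the input and the hidden state.

```python
import torch

c0 = torch.tensor([1.0], requires_grad=True)   # "clouds" stored in one memory slot

# Hypothetical gate values for five irrelevant words
f = torch.full((5,), 0.99)     # forget gate near 1: keep almost everything
i = torch.full((5,), 0.01)     # input gate near 0: write almost nothing
c_tilde = torch.zeros(5)       # candidate content, set to zero for simplicity

c = c0
for t in range(5):
    c = f[t] * c + i[t] * c_tilde[t]   # C_t = f_t * C_{t-1} + i_t * C~_t

c.sum().backward()             # gradient of the final cell state w.r.t. the initial one

print(c.item())                # value remaining after five steps
print(c0.grad.item())          # gradient along the cell state: product of the forget gates
```

Both numbers equal 0.99^5, roughly 0.951. With a per-step factor of 0.5 instead, the same five steps would leave about 0.031.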
Real-World Applications of LSTMs
Python Implementation: PyTorch LSTM in Practice
Below is an annotated PyTorch LSTM cell that mirrors the architecture described above, with the forget, input, candidate, and output computations written out explicitly.
import torch
import torch.nn as nn


class LSTMCell_Manual(nn.Module):
    """
    Manual LSTM cell: implements all 4 gate/candidate equations explicitly.
    Matches the PPT diagram step by step.
    """

    def __init__(self, input_size, hidden_size):
        super().__init__()
        n = input_size + hidden_size
        # Four weight matrices (one per gate/candidate); each nn.Linear also holds its bias
        self.Wf = nn.Linear(n, hidden_size)  # Forget gate -> f_t
        self.Wi = nn.Linear(n, hidden_size)  # Input gate  -> i_t
        self.Wc = nn.Linear(n, hidden_size)  # Candidate   -> C~_t
        self.Wo = nn.Linear(n, hidden_size)  # Output gate -> o_t

    def forward(self, x, h_prev, c_prev):
        # Concatenate previous hidden state + current input
        combined = torch.cat([h_prev, x], dim=1)  # [h_{t-1}, x_t]

        # -- Step 1: Forget Gate --
        f_t = torch.sigmoid(self.Wf(combined))
        # f_t in (0, 1): how much of the old cell state to keep

        # -- Step 2: Input Gate + Candidate --
        i_t = torch.sigmoid(self.Wi(combined))
        c_tld = torch.tanh(self.Wc(combined))
        # i_t   in (0, 1):  how much new info to write
        # c_tld in (-1, 1): candidate values to potentially write

        # -- Step 3: Cell State Update --
        c_t = f_t * c_prev + i_t * c_tld
        # C_t = f_t * C_{t-1} + i_t * C~_t  (elementwise products)
        # The additive update gives gradients a direct path, which mitigates vanishing

        # -- Step 4: Output Gate --
        o_t = torch.sigmoid(self.Wo(combined))
        h_t = o_t * torch.tanh(c_t)
        # h_t = o_t * tanh(C_t)
        return h_t, c_t  # new hidden state, new cell state


# -- Using PyTorch's built-in nn.LSTM (optimised CUDA version) --
model = nn.LSTM(
    input_size=10,        # x_t dimension
    hidden_size=128,      # h_t and C_t dimension
    num_layers=2,         # stacked LSTMs
    dropout=0.2,          # applied between layers only
    batch_first=True,     # input: (batch, seq_len, features)
    bidirectional=False,
)

# Forward pass
x = torch.randn(4, 20, 10)    # batch=4, seq=20, features=10
h0 = torch.zeros(2, 4, 128)   # (num_layers, batch, hidden)
c0 = torch.zeros(2, 4, 128)
output, (hn, cn) = model(x, (h0, c0))

print(f"Output shape: {output.shape}")  # (4, 20, 128)
print(f"Final hidden: {hn.shape}")      # (2, 4, 128)
print(f"Final cell:   {cn.shape}")      # (2, 4, 128)
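The LSTMCell_Manual class above implements the same gate equations as PyTorch's nn.LSTM, up to parameterization: nn.LSTM keeps separate input-to-hidden and hidden-to-hidden weights, each with its own bias. The built-in version stacks the four gates' weights into these two matrices (weight_ih_l0 and weight_hh_l0 for the first layer), so all four gate pre-activations come from a pair of large matrix multiplies rather than four small ones, and on CUDA devices it can use cuDNN's fused kernels. This typically makes it much faster on GPU. Use nn.LSTM in production, and LSTMCell_Manual for learning and debugging.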
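One way to check that a hand-written cell matches the built-in one is to copy the weights across and compare outputs. The sketch below does this against `nn.LSTMCell` (the single-step counterpart of `nn.LSTM`), using a condensed copy of the manual cell, here named `ManualCell`, so that the snippet is self-contained. The sizes and the tolerance are arbitrary choices. It relies on `nn.LSTMCell` stacking its gates in the order input, forget, cell (candidate), output, and on its two bias vectors being summed.

```python
import torch
import torch.nn as nn


class ManualCell(nn.Module):
    """Condensed copy of the four-equation cell, for comparison purposes."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        n = input_size + hidden_size
        self.Wf = nn.Linear(n, hidden_size)
        self.Wi = nn.Linear(n, hidden_size)
        self.Wc = nn.Linear(n, hidden_size)
        self.Wo = nn.Linear(n, hidden_size)

    def forward(self, x, h_prev, c_prev):
        combined = torch.cat([h_prev, x], dim=1)
        f_t = torch.sigmoid(self.Wf(combined))
        i_t = torch.sigmoid(self.Wi(combined))
        c_tld = torch.tanh(self.Wc(combined))
        c_t = f_t * c_prev + i_t * c_tld
        o_t = torch.sigmoid(self.Wo(combined))
        return o_t * torch.tanh(c_t), c_t


torch.manual_seed(0)
input_size, hidden_size = 10, 16
reference = nn.LSTMCell(input_size, hidden_size)
manual = ManualCell(input_size, hidden_size)

with torch.no_grad():
    # Split the stacked parameters into four per-gate blocks: i, f, g (candidate), o
    w_ih = reference.weight_ih.chunk(4, dim=0)   # each (hidden, input)
    w_hh = reference.weight_hh.chunk(4, dim=0)   # each (hidden, hidden)
    bias = (reference.bias_ih + reference.bias_hh).chunk(4, dim=0)
    for layer, k in ((manual.Wi, 0), (manual.Wf, 1), (manual.Wc, 2), (manual.Wo, 3)):
        # Column order must follow the concatenation [h_prev, x]
        layer.weight.copy_(torch.cat([w_hh[k], w_ih[k]], dim=1))
        layer.bias.copy_(bias[k])

x = torch.randn(4, input_size)
h0 = torch.randn(4, hidden_size)
c0 = torch.randn(4, hidden_size)

h_ref, c_ref = reference(x, (h0, c0))
h_man, c_man = manual(x, h0, c0)

print(torch.allclose(h_ref, h_man, atol=1e-5))
print(torch.allclose(c_ref, c_man, atol=1e-5))
```

Agreement up to floating-point tolerance indicates that the two cells compute the same function and differ only in how the parameters are laid out.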