LLM Architecture Deep Dive: Transformers, Attention

Section 01

The Story That Explains LLMs

📖 Real World Analogy

The World's Most Well-Read Librarian

Imagine a librarian who has read every book, article, forum post, and website ever written — billions of documents — and memorised not just the words, but the patterns of how ideas connect. Ask them anything and they don't look it up. They reconstruct the answer from compressed pattern memory, word by word.

They never "know" facts the way a database does. Instead, they've learned that after the phrase "The capital of France is", the word "Paris" follows with overwhelming probability — because they've seen that pattern ten million times.

That is exactly how a Large Language Model works. It is a statistical pattern machine of staggering scale, trained to predict the next token given all previous tokens. Everything else — reasoning, coding, translation — emerges from that single objective.

A Large Language Model (LLM) is a neural network — specifically a Transformer — trained on massive text corpora to model the probability distribution over sequences of tokens. Given a sequence of tokens as context, the model outputs a probability distribution over its vocabulary for the next token. Sampling from this distribution, repeatedly, produces fluent, coherent text.

🌎

The Core Objective — Next Token Prediction

LLMs are trained with a deceptively simple objective: given tokens [t₁, t₂, …, tₙ], predict tₙ₊₁. The loss is cross-entropy between the predicted distribution and the true next token. Everything — grammar, facts, logic, code generation — is learned as a side-effect of minimising this loss at massive scale. This is called self-supervised learning: the labels (next tokens) come free from the data itself.
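To make the objective concrete, here is a minimal sketch of the loss at a single position. The probability values and the four-word vocabulary are made up for illustration; a real model produces a distribution over its entire vocabulary.

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy at one position: -log p(target)."""
    return -math.log(probs[target])

# Hypothetical predicted distribution after "The capital of France is"
predicted = {"Paris": 0.90, "Lyon": 0.05, "London": 0.03, "the": 0.02}

loss_good = next_token_loss(predicted, "Paris")   # true token was likely: low loss
loss_bad = next_token_loss(predicted, "London")   # true token was unlikely: high loss
print(f"{loss_good:.3f} {loss_bad:.3f}")
```

Training averages this quantity over every position in every sequence, so the model is rewarded for putting probability mass on whatever actually came next.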

Section 02

From Words to Tokens — Tokenisation

📖 Story

Breaking Language into LEGO Bricks

Before the model can process text, it must convert it into numbers. But you can't just assign a number to every word — there are too many (millions with inflections, compounds, slang). Instead, LLMs use sub-word tokenisation: splitting text into chunks that balance vocabulary size with coverage. "unbelievable" might become ["un", "believ", "able"]. This lets the model handle any word, even ones it's never seen whole.

🔄 How BPE Tokenisation Works — Step by Step

Step 1

Start with characters: Every character is a token. Vocabulary is tiny but every word is representable.

Step 2

Count pairs: Find the most frequent adjacent pair in the training corpus. E.g. ("e", "r") appears 2.4M times.

Step 3

Merge: Replace every occurrence of that pair with a new token "er". Add "er" to vocabulary.

Step 4

Repeat: Count pairs again, merge most frequent again. Repeat ~50,000 times until vocabulary is full.

Result

Common words become single tokens. Rare words are split into known sub-word pieces. ~50k–100k vocabulary is typical.
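The merge loop described in the steps above can be sketched in a few lines. This is a toy version under simplifying assumptions (whitespace pre-splitting, no byte-level fallback, no end-of-word marker); the function names are illustrative and do not come from any tokenizer library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Start from characters, then repeatedly merge the most frequent pair."""
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = train_bpe("lower lower lowest newer newer newer", num_merges=3)
print(merges[0])  # ('w', 'e'): the most frequent adjacent pair in this toy corpus
```

A production tokenizer runs the same loop tens of thousands of times over a large corpus and stores the ordered merge list, which is then replayed to tokenise new text.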

Text	GPT-4 Tokens	Token Count	Note
"Hello, world!"	["Hello", ",", " world", "!"]	4	Common words → single tokens
"unbelievable"	["un", "believ", "able"]	3	Rare word → sub-word split
"antidisestablishmentarianism"	["ant", "idis", "estab", "lishment", "arian", "ism"]	6	Very rare → many pieces
"🚀"	["<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"]	4	Emoji = raw UTF-8 bytes
"Paris"	["Paris"]	1	Very common → single token

⚠️

Tokens ≠ Words — This Matters

Context windows are measured in tokens, not words. On average, 1 word ≈ 1.3 tokens in English. A 128K token context window holds roughly 96,000 words — about 192 pages of text. Non-English languages are often less efficient: Chinese and Arabic can use 2–3× more tokens per word since the tokenizer was trained on mostly English text.

Section 03

Embeddings — Turning Tokens into Vectors

After tokenisation, each token ID is mapped to a dense vector called an embedding. This is a lookup table: token 4821 → a vector of 4096 numbers (for a 7B parameter model). These vectors live in a high-dimensional space where semantic similarity maps to geometric proximity.

📖 Analogy

Tokens as Points in a City

Imagine a city where every word has an address (a point in space). Words with similar meanings live in the same neighbourhood. "King" and "Queen" are two blocks apart. "Cat" and "Dog" are nearby. "Cat" and "Mortgage" are across town.

The famous example: King − Man + Woman ≈ Queen. The vector arithmetic works (approximately) because the embedding space has learned fairly consistent directions for concepts such as "royalty" and "gender", so moving along one leaves the other largely unchanged.

📌

Token Embedding

shape: [vocab_size, d_model]

Each token ID maps to a learnable vector of dimension d_model (e.g. 4096). This table is learned during training and is often shared with the output layer (weight tying), although some models keep a separate output matrix.

📐

Positional Encoding

RoPE / ALiBi / Sinusoidal

Transformers have no built-in notion of order. Position encodings inject sequence position information into each token vector. Modern LLMs use Rotary Position Embedding (RoPE) which encodes position in the attention mechanism directly.

➕

Input to Transformer

token_emb + pos_enc

The final input is token embedding + positional encoding, giving each position a unique starting vector. Shape entering the first layer: [batch, seq_len, d_model].

import torch
import torch.nn as nn

# Token embedding only (positional information is added separately)
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, x):  # x: [batch, seq_len] token IDs
        return self.embed(x) * (self.d_model ** 0.5)  # scale by √d_model

# Example usage
vocab_size, d_model, seq_len = 50257, 768, 10
embedder = TokenEmbedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (1, seq_len))  # 1 batch, 10 tokens
embedded = embedder(tokens)
print(embedded.shape)  # torch.Size([1, 10, 768])

OUTPUT

torch.Size([1, 10, 768])
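The module above covers only the token half of token_emb + pos_enc. As a rough sketch of the other half, the fixed sinusoidal table from the original Transformer paper can be built with the standard library alone. The function name is illustrative, and models that use RoPE skip this additive step entirely.

```python
import math

def sinusoidal_position_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):      # i plays the role of 2i in the formula
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:d_model])         # trim if d_model is odd
    return table

pe = sinusoidal_position_encoding(seq_len=10, d_model=8)
print(len(pe), len(pe[0]))  # 10 8
print(pe[0][:4])            # [0.0, 1.0, 0.0, 1.0]
```

Each row would be added element-wise to the embedding at the same position, so the input to the first layer keeps the shape [batch, seq_len, d_model].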

Section 04

The Transformer — Architecture Overview

The Transformer (Vaswani et al., 2017, "Attention Is All You Need") is the backbone of every modern LLM. It replaced RNNs and LSTMs by discarding recurrence entirely and relying solely on attention mechanisms. The result: massively parallelisable training and far superior long-range dependency modelling.

📖 Story

The Parliamentary Debate

Imagine 512 MPs in Parliament, each holding a topic card. For every bill being debated, each MP simultaneously asks: "How relevant is every other MP's card to mine right now?" They score all others, weight their information accordingly, and update their own understanding. Every MP does this at the same time, in parallel. After many rounds of discussion (layers), each MP's card reflects a rich understanding of the whole debate — with full context from everyone.

That parallel, mutual-awareness process is self-attention. The Transformer runs it across all token positions simultaneously, which is why it parallelises so well on GPUs.

🏗 Transformer Decoder Block — The Repeating Unit

Input

Token embeddings + positional encodings → shape [B, T, D] where B=batch, T=sequence length, D=d_model

Layer 1

RMSNorm — normalise each token vector to unit scale. Faster than LayerNorm (no mean subtraction).

Layer 2

Masked Multi-Head Self-Attention — each token attends to itself and all previous tokens. Future tokens are masked out (causal). Output shape same as input: [B, T, D].

Layer 3

Residual Add — add the attention output to the original input (skip connection). Prevents vanishing gradients.

Layer 4

RMSNorm — normalise again before the feed-forward network.

Layer 5

Feed-Forward Network (FFN) — two linear layers with a non-linearity (SwiGLU/GeLU). Operates per-token independently. Expands to 4×D then back to D.

Layer 6

Residual Add — add FFN output to input. This completes one decoder block.

Stack

Repeat N times (e.g. N=32 for a 7B model, N=80 for a 70B model). Final output → LM head → logits over vocabulary.

Transformer Input

X = TokenEmbed(t) + PosEncode(pos)

Each position gets a unique starting vector combining semantic identity with positional information.

Residual Stream

x = x + Attn(Norm(x))
x = x + FFN(Norm(x))

Each sub-layer adds its output to the input (residual connection), allowing gradient flow through deep stacks.

Output Logits

logits = x_final · W_embed^T

In models that tie weights, the final hidden state is projected back to vocabulary size using the transposed embedding matrix; other models learn a separate output matrix. Softmax gives probabilities.

Training Loss

L = −(1/T) Σₙ log P(tₙ₊₁ | t₁…tₙ)

Cross-entropy averaged over all token positions in the batch. Minimising this teaches the model to predict text accurately.

Section 05

Self-Attention — The Heart of the Transformer

Self-attention is the mechanism that lets every token "look at" every other token in the sequence and decide how much information to gather from each. It computes three vectors for every token — Query (Q), Key (K), and Value (V) — and uses dot products to compute relevance scores.

📖 Story — The Library Search

Query, Key, Value in Plain English

Imagine a library search system. You write a Query (what you're looking for). Every book has a Key on its spine (what it contains). You match your Query against all Keys to get relevance scores. Then you retrieve the Values (the actual book content) — but you get a weighted blend of all books, where weights come from your match scores.

In self-attention: every token is simultaneously the searcher and a book on the shelf. The word "bank" in "river bank" searches for context and finds "river" and "fish" most relevant — so it blends their information to resolve its meaning. This is how attention resolves ambiguity.

✍️ Scaled Dot-Product Attention — Step by Step

Step 1

Project each token embedding into Q, K, V: Q = X·Wq, K = X·Wk, V = X·Wv. Each Wq/Wk/Wv is a learned matrix of shape [d_model, d_k].

Step 2

Compute raw attention scores: scores = Q · Kᵀ. Shape: [T, T]. Entry [i,j] = how much token i attends to token j.

Step 3

Scale: divide by √d_k. Without this, dot products grow with the dimension, the softmax saturates, and its gradients become vanishingly small.

Step 4

Apply causal mask: set future positions to −∞ so they become 0 after softmax. Decoder-only LLMs cannot attend to future tokens.

Step 5

Softmax across each row → attention weights (sum to 1 per token). Each token now has a distribution over all past tokens.

Step 6

Weighted sum of Values: output = weights · V. Each token's output is a blend of all Value vectors, weighted by how much it attended to each.

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: [batch, heads, seq, d_k]
    K: [batch, heads, seq, d_k]
    V: [batch, heads, seq, d_v]
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # scores: [batch, heads, seq, seq]

    # Step 2: Apply causal mask (positions where mask == 0 are set to -inf)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))

    # Step 3: Softmax → attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # Step 4: Weighted sum of values
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

# --- Demo ---
batch, heads, seq_len, d_k = 1, 8, 5, 64
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)
V = torch.randn(batch, heads, seq_len, d_k)

# Causal mask: lower triangle = 1 (allowed), upper = 0 (masked)
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0).unsqueeze(0)
out, weights = scaled_dot_product_attention(Q, K, V, mask)
print(f"Output shape:  {out.shape}")
print(f"Weights shape: {weights.shape}")
print(f"Weight row sum (should be 1.0): {weights[0,0,2].sum():.4f}")

OUTPUT

Output shape:  torch.Size([1, 8, 5, 64])
Weights shape: torch.Size([1, 8, 5, 5])
Weight row sum (should be 1.0): 1.0000

Section 06

Multi-Head Attention — Many Perspectives at Once

A single attention head might learn to track grammatical subject-verb agreement. Another might track pronoun references. Another might group semantically related words. Multi-Head Attention runs H parallel attention heads, each with its own Q/K/V projection weights, then concatenates and projects their outputs.

🔑

Why Multiple Heads?

A single head learns one "type" of relationship. Multiple heads in parallel learn many relationship types simultaneously. GPT-3 (175B) uses 96 heads per layer. Each head operates on a d_head = d_model / H sub-space (e.g. 4096/32 = 128 dimensions per head). The total computation is roughly the same as one big head, but the multi-perspective representation is far richer.

🔍

Head 1 — Syntax

subject → verb agreement

Learns to link "The dogs" to "are" rather than "is". Captures grammatical number agreement across token distances.

🔗

Head 2 — Coreference

pronoun → antecedent

Links "she" back to "Maria" mentioned 10 tokens earlier. Crucial for coherent generation over long contexts.

📄

Head N — Semantics

word meaning context

Resolves ambiguity: determines whether "bank" means financial institution or river bank based on surrounding context words.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head  = d_model // n_heads

        # Projection matrices for Q, K, V and output
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, D = x.shape

        # Project and split into heads: [B, T, D] → [B, H, T, d_head]
        def split_heads(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        Q = split_heads(self.W_q(x))
        K = split_heads(self.W_k(x))
        V = split_heads(self.W_v(x))

        # Attention for all heads in parallel
        attn_out, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Recombine heads: [B, H, T, d_head] → [B, T, D]
        attn_out = attn_out.transpose(1, 2).contiguous().view(B, T, D)

        return self.W_o(attn_out)  # final output projection

Section 07

The Feed-Forward Network — Where Knowledge Lives

After attention gathers context, a Feed-Forward Network (FFN) processes each token independently. It's a 2-layer MLP with a wide hidden layer (4×d_model), applying a non-linear activation. Researchers believe this is where factual knowledge is stored — attention routes information, the FFN transforms it.

⚠ Original FFN (ReLU)

Component	Detail
Layer 1	Linear: d_model → 4·d_model
Activation	ReLU (max(0, x))
Layer 2	Linear: 4·d_model → d_model
Used in	Original Transformer (GPT-1, GPT-2 and BERT keep this layout but use GELU)

✔ Modern FFN (SwiGLU)

Component	Detail
Layer 1	Two parallel linears: gate + value
Activation	Swish(gate) · value (gated)
Layer 2	Linear: hidden → d_model
Used in	LLaMA, Mistral, PaLM (Gemma uses the closely related GeGLU)

class SwiGLU_FFN(nn.Module):
    """SwiGLU feed-forward, as used in LLaMA and Mistral."""
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.gate  = nn.Linear(d_model, hidden, bias=False)
        self.value = nn.Linear(d_model, hidden, bias=False)
        self.proj  = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        # Swish(gate(x)) acts as a learned, smooth gating signal
        return self.proj(
            F.silu(self.gate(x)) * self.value(x)
        )

# Sanity check
ffn = SwiGLU_FFN(d_model=4096, hidden=11008)  # LLaMA-7B dims
x   = torch.randn(1, 10, 4096)
print(ffn(x).shape)  # [1, 10, 4096] — same shape in, same shape out

# Count parameters in FFN alone (7B model has 32 layers)
total_ffn_params = sum(p.numel() for p in ffn.parameters())
print(f"FFN params per layer: {total_ffn_params:,}")
print(f"All 32 layers FFN total: {total_ffn_params * 32:,}")

OUTPUT

torch.Size([1, 10, 4096])
FFN params per layer: 135,266,304
All 32 layers FFN total: 4,328,521,728
← ~4.3B of the 7B params live in FFNs

🧠

The "Knowledge Neurons" Discovery

Research (Dai et al., 2022) found that specific neurons in the FFN layers activate for specific factual associations — e.g. a neuron that fires when the model is about to output "Paris" for the query "capital of France". Suppressing these neurons degrades factual accuracy. This suggests the FFN is where factual memories are stored, while attention is the routing mechanism.

Section 08

Normalisation — Keeping Training Stable

Deep networks suffer from vanishing and exploding gradients. Normalisation layers stabilise the distribution of activations, enabling training of 100+ layer networks. Modern LLMs have largely moved from LayerNorm to RMSNorm.

📈

Layer Normalisation

mean=0, var=1 per token

Normalises across the feature dimension for each token independently. Subtracts mean, divides by std, then applies learnable scale (γ) and shift (β). Used in GPT-2, original BERT.

⚡

RMS Normalisation

scale only, no mean shift

Divides by the root-mean-square of activations only, with no mean subtraction. Cheaper to compute than LayerNorm, with comparable empirical quality. Used in LLaMA, Mistral, Gemma.

📌

Pre-Norm vs Post-Norm

placement matters

Original Transformer: Post-Norm (normalise after residual add). Modern LLMs: Pre-Norm (normalise before attention/FFN). Pre-Norm is more stable for deep networks and enables training without careful warmup.

class RMSNorm(nn.Module):
    """RMS Normalisation — used in LLaMA, Mistral, Gemma."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps   = eps
        self.scale = nn.Parameter(torch.ones(d_model))  # learnable γ

    def forward(self, x):
        # RMS = sqrt(mean(x²) + eps)
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).sqrt()
        return (x / rms) * self.scale

Section 09

Positional Encoding — Giving Tokens a Sense of Place

Self-attention without positional information is permutation-equivariant: shuffle the input tokens and you get the same output vectors, shuffled the same way. Positional encodings inject sequence order so the model knows "this token is at position 47".

〰️

Sinusoidal (Absolute)

original paper

Fixed sine/cosine functions at different frequencies. Added to token embeddings at the start. Simple but doesn't extrapolate well beyond training context length. Used in: original Transformer.

📌

Learned Absolute

trainable lookup

A learnable embedding table indexed by position. The model learns the "best" position vectors from data. Fixed maximum length. Used in: BERT, GPT-2, GPT-3. Cannot generalise beyond max training length.

🔄

RoPE (Rotary)

state of the art

Encodes relative position by rotating Q and K vectors in pairs. The dot product Q·Kᵀ naturally encodes the relative distance between tokens. Extrapolates to unseen lengths with tricks (YaRN, rope scaling). Used in: LLaMA, Mistral, Gemma, Qwen.

📉

ALiBi (Attention Bias)

linear bias on scores

Adds a linear penalty to attention scores proportional to token distance. No positional vectors at all — just a slope per head. Excellent length generalisation. Used in: MPT, BLOOM, some Falcon models.

📋

NoPE (No Position)

implicit in architecture

Some recent work shows models can learn positional information implicitly from causal masking and data patterns alone — no explicit encoding. Experimental but promising for very long contexts.

⚡

YaRN / RoPE Scaling

extend beyond training

Interpolation tricks applied to RoPE to extend context windows beyond the original training length. LLaMA-3.1 was pre-trained mostly at 8K context and then extended to 128K through continued long-context training combined with RoPE frequency scaling. Now industry standard for long-context models.
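The key RoPE property described above, that the Q·K score depends only on the relative distance between two tokens, can be checked numerically. This is a small sketch using plain Python lists; the vectors are arbitrary and the base of 10000 is the commonly used default, assumed here rather than taken from any specific model.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)      # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same relative offset (3 positions) at two very different absolute positions
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)  # True
```

Because only the offset matters, context-extension methods such as RoPE scaling can adjust the rotation frequencies without retraining the model from scratch.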

Section 10

Scaling Laws — Why Bigger Works Better

📖 Story

The Chinchilla Paper That Changed Everything

In 2020, OpenAI published the Scaling Laws paper and showed that model loss follows predictable power laws with compute, data, and parameters. The field assumed "bigger model = more compute = better performance".

In 2022, DeepMind published the Chinchilla paper (Hoffmann et al.) and showed that GPT-3 (175B params, 300B tokens) was severely undertrained. The compute-optimal recipe: for a model of N parameters, train on approximately 20×N tokens. Chinchilla (70B params, 1.4T tokens) outperformed the much larger GPT-3 at a fraction of the inference cost.

Modern LLMs like LLaMA-3 push further: 8B params on 15T tokens. They are deliberately overtrained: training cost is paid once, while inference cost is paid on every request, so a smaller model trained on far more tokens is cheaper to serve at the same quality.

Model	Parameters	Training Tokens	Tokens/Param	Status
GPT-3	175B	300B	1.7×	Undertrained (pre-Chinchilla)
Chinchilla	70B	1.4T	20×	Compute-optimal
LLaMA-2 7B	7B	2T	286×	Intentionally overtrained
LLaMA-3 8B	8B	15T	1,875×	Heavily overtrained for inference
Mistral 7B	7B	Not disclosed	n/a	Efficient, architecture innovations

Scaling Law (Loss)

L(N, D) = A/Nᵅ + B/Dᵝ + L∞

Loss decreases predictably as a power law of both model size N (parameters) and data D (tokens). α ≈ 0.34, β ≈ 0.28.

Chinchilla Optimum

D_optimal = 20 × N

For a fixed compute budget C, scale model size and training data in roughly equal proportion. In practice this means training a smaller model on more tokens than was previously thought optimal.

Compute Budget

FLOPs ≈ 6 × N × D

Training a model of N parameters on D tokens requires roughly 6·N·D floating point operations. Useful for estimating training cost.

Emergent Abilities

~10²³ FLOPs threshold

Complex capabilities (chain-of-thought reasoning, multi-step arithmetic) appear to emerge discontinuously above certain compute thresholds — not present in smaller models.
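The Chinchilla rule of thumb and the 6·N·D estimate above are easy to apply. The sketch below recomputes tokens-per-parameter and training FLOPs for three rows of the table; the helper names are illustrative, and the results are order-of-magnitude estimates rather than reported figures.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal token count: D is about 20 x N."""
    return tokens_per_param * n_params

def training_flops(n_params, n_tokens):
    """Approximate training cost: 6 x N x D floating point operations."""
    return 6 * n_params * n_tokens

# (parameters, training tokens) taken from the table above
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-3 8B": (8e9, 15e12),
}

for name, (n, d) in models.items():
    print(
        f"{name}: tokens/param={d / n:.1f}, "
        f"FLOPs={training_flops(n, d):.2e}, "
        f"Chinchilla-optimal tokens={chinchilla_optimal_tokens(n):.2e}"
    )
```

Note that GPT-3 and Chinchilla land within a factor of two of each other in training compute, yet split it very differently between parameters and data.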

Section 11

Grouped Query Attention & KV Cache

During inference, the model generates tokens one at a time. Without caching, it would recompute K and V for all previous tokens at every step — O(T²) computation. The KV Cache stores past K and V tensors, reducing generation to O(T) per step.

⚠️

The KV Cache Memory Problem

For a 70B-class model with 80 layers, 64 KV heads of dimension 128 (d_model = 8192), a 128K context, and 16-bit values: KV cache = 2 (K and V) × 80 × 64 × 128 × 128,000 × 2 bytes ≈ 336 GB per sequence, more than the ~140 GB of 16-bit model weights. This is why long-context inference is typically memory-constrained rather than compute-constrained.

⚠ Multi-Head Attention (MHA)

Property	Value
KV heads	Equal to Q heads (e.g. 32)
KV cache size	2 × n_heads × d_head × seq_len values per layer
Memory cost	Very high at long contexts
Used in	GPT-2, original GPT-3

✔ Grouped Query Attention (GQA)

Property	Value
KV heads	Fraction of Q heads (e.g. 8)
KV cache size	2 × n_kv_heads × d_head × seq_len values per layer
Memory cost	4–8× smaller cache
Used in	LLaMA-2 70B, LLaMA-3, Mistral

✅

Multi-Query Attention (MQA) — The Extreme Case

Multi-Query Attention (Shazeer 2019) takes it to the extreme: all Q heads share a single K and V head. Reduces KV cache by n_heads×. Small quality loss, massive memory saving. Used in Falcon and PaLM. GQA is a compromise: the Q heads are split into G groups, and each group shares one KV head. LLaMA-3 8B uses G=8 (32 Q heads, 8 KV heads), a 4× cache reduction vs MHA.
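To see how much the number of KV heads matters, here is a small calculator for the cache size. The configuration (80 layers, head dimension 128, 128K context, 16-bit values, batch of 1) is an assumed 70B-class setup chosen for illustration, not the published spec of a particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2, batch=1):
    """K and V tensors (factor 2) for every layer, KV head, position and sequence."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value * batch

cfg = dict(n_layers=80, d_head=128, seq_len=128_000)

mha = kv_cache_bytes(n_kv_heads=64, **cfg)  # every Q head has its own K/V head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 8 groups share K/V
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all Q heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.1f} GB")
# MHA: 335.5 GB
# GQA: 41.9 GB
# MQA: 5.2 GB
```

The cache grows linearly with both context length and batch size, which is why serving systems care far more about n_kv_heads than about the number of query heads.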

Section 12

Pre-Training — The Foundation Phase

Pre-training is the most expensive phase: billions of parameters, trillions of tokens, thousands of GPUs, months of wall-clock time. The goal is to build a model with rich world knowledge and language understanding. Everything downstream — instruction following, RLHF — fine-tunes this foundation.

Data Collection & Curation

Scrape web (Common Crawl, C4), add books, code (GitHub), academic papers (ArXiv), Wikipedia, curated high-quality sources. Apply quality filters: deduplication (MinHash), language detection, perplexity filtering, safe content filtering. LLaMA-3 was pre-trained on more than 15T tokens.

Tokenisation & Sharding

Train a BPE tokenizer on a representative data sample. Tokenise all training data. Shuffle globally (critical — local correlations in web crawl hurt training). Shard across thousands of files for distributed loading.

Distributed Training (3D Parallelism)

Data Parallelism: Same model on N GPUs, different batches. Gradients averaged across GPUs.
Tensor Parallelism: Split attention heads across GPUs within a node.
Pipeline Parallelism: Different layers on different nodes. Micro-batches fill the pipeline.
Modern: ZeRO sharding (DeepSpeed/FSDP) eliminates optimizer state redundancy.

Optimiser & Hyperparameters

AdamW (β₁=0.9, β₂=0.95, ε=1e-8, weight_decay=0.1). Cosine LR schedule with linear warmup (≈2000 steps). Gradient clipping (max norm=1.0). bf16 mixed precision. Batch size schedule: start small, grow to millions of tokens/batch for stable late-stage training.

Loss Monitoring & Recovery

Monitor training loss, gradient norm, and loss spikes. "Loss spikes" (sudden jumps in cross-entropy) are commonly handled by rolling back to a checkpoint 500–1000 steps earlier and resuming with a lower LR. Published reports on large training runs such as PaLM and OPT describe this kind of spike recovery.

# Simplified pre-training loop (conceptual).
# `LLM` and `dataloader` are placeholders: any decoder-only model that returns
# [B, T, vocab] logits, and any iterable of {"input_ids": LongTensor[B, T]} batches.
import math

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model    = LLM(vocab_size=128256, d_model=4096, n_layers=32, n_heads=32)
optimiser = AdamW(model.parameters(), lr=3e-4,
                   betas=(0.9, 0.95), weight_decay=0.1)
scheduler = CosineAnnealingLR(optimiser, T_max=1_000_000, eta_min=3e-5)

for step, batch in enumerate(dataloader):
    input_ids = batch["input_ids"]        # [B, T]
    labels    = input_ids[:, 1:]          # next-token labels: inputs shifted left by one
    logits    = model(input_ids[:, :-1])  # [B, T-1, vocab]

    # Cross-entropy loss over all positions
    loss = F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        labels.reshape(-1),
        ignore_index=-100   # label value to skip (set padding positions to -100)
    )

    optimiser.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimiser.step()
    scheduler.step()

    if step % 100 == 0:
        # Perplexity is exp(cross-entropy); loss.item() is a plain Python float
        print(f"Step {step}: loss={loss.item():.4f}, perplexity={math.exp(loss.item()):.2f}")
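The loop above uses a plain cosine schedule, whereas the hyperparameter notes call for linear warmup followed by cosine decay. A minimal sketch of that combined schedule follows; the peak and minimum learning rates mirror the values in the loop, the 2000-step warmup follows the text, and the function name is illustrative.

```python
import math

def lr_at(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 1000, 2000, 500_000, 1_000_000):
    print(s, f"{lr_at(s):.2e}")
```

In PyTorch the same function could be plugged into torch.optim.lr_scheduler.LambdaLR by returning lr_at(step) / peak_lr as the multiplier.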

Section 13

Post-Training — Instruction Tuning & Alignment

A pre-trained LLM is a raw text completer, not an assistant. If you prompt it "What is the capital of France?", it might continue "...was debated by historians...". Post-training transforms it into a helpful, safe, instruction-following model.

📝

SFT — Supervised Fine-Tuning

Phase 1

Fine-tune on (instruction, response) pairs from human demonstrations. Teaches the format: "User: [question]\nAssistant: [helpful answer]". Uses cross-entropy loss only on the response tokens (not the instruction). Typically 10k–1M examples, 1–3 epochs.

🏅

RLHF — Reward Model

Phase 2a

Train a reward model to predict human preference. Humans rank pairs of model responses. Reward model learns: "response A is better than B" → output a scalar reward score. This step is the expensive bottleneck: it requires human labellers at scale.

📈

PPO — Policy Optimisation

Phase 2b

Use the reward model to fine-tune the LLM via Proximal Policy Optimisation (PPO). The LLM (policy) generates responses, reward model scores them, PPO maximises reward while keeping outputs close to the SFT baseline via KL-divergence penalty.

🔥

DPO — Direct Preference Optimisation

Phase 2 (modern alternative)

Reformulates RLHF as a classification problem — no separate reward model needed. Directly fine-tunes on (prompt, chosen, rejected) triples using a contrastive loss. Simpler, more stable than PPO. Used by many open models (Zephyr, Tulu, OpenHermes).

🛡

Constitutional AI (CAI)

Anthropic's approach

Uses a set of principles (a "constitution") to guide the model's own self-critique and revision. The model critiques its outputs against the constitution and revises them. Reduces reliance on human preference labelling. Foundation of Claude's alignment approach.

💡

RLAIF — AI Feedback

Scale labelling

Replace human raters with a capable LLM (GPT-4/Claude). The AI generates preference judgements at scale. Quality depends on the judge model. Dramatically reduces cost and scales far beyond what human labelling allows. Increasingly dominant approach.
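The DPO objective described above fits in a few lines once the log-probabilities are available. This sketch assumes each argument is the summed log-probability of a whole response under either the policy being trained or the frozen reference (SFT) model; beta=0.1 is a commonly used value, chosen here as an assumption.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one preference triple."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow for large |z|
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: margin is 0, loss is log(2)
start = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# Policy now favours the chosen response more than the reference does
improved = dpo_loss(-10.0, -14.0, -12.0, -11.0)

print(f"{start:.4f} {improved:.4f}")
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which plays the role of the KL penalty in PPO-based RLHF.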

Section 14

Mixture of Experts — Efficient Scaling

📖 Story

The Hospital with 8 Specialist Departments

A hospital with 8 specialist departments (cardiology, neurology, orthopaedics…) doesn't route every patient to every department. A router (triage nurse) examines each patient and sends them to the 1–2 most relevant specialists. Most departments are idle for any given patient — but the total hospital capacity is the sum of all departments.

This is Mixture of Experts (MoE). The FFN in each Transformer layer is replaced with E expert FFNs. A learned router selects the top-K experts for each token. Only K of the E experts run per token, so a model with 8× the FFN parameters and top-2 routing spends roughly 2× the FFN compute of a dense model with a single FFN of the same size, rather than 8×.

Model	Total Params	Active Params	Experts	Top-K
Mixtral 8×7B	47B	13B	8 experts	Top 2 per token
Mixtral 8×22B	141B	39B	8 experts	Top 2 per token
GPT-4 (rumoured)	~1.8T	~220B	~16 experts	Top 2 per token
DeepSeek-V3	671B	37B	256 experts	Top 8 per token
Qwen2-57B-A14B	57B	14B	64 experts	Top 8 per token

class MoELayer(nn.Module):
    """Sparse Mixture of Experts FFN layer."""
    def __init__(self, d_model: int, n_experts: int, top_k: int, hidden: int):
        super().__init__()
        self.n_experts = n_experts
        self.top_k     = top_k

        # Router: maps each token to expert logits
        self.router = nn.Linear(d_model, n_experts, bias=False)

        # E independent FFN experts
        self.experts = nn.ModuleList([
            SwiGLU_FFN(d_model, hidden) for _ in range(n_experts)
        ])

    def forward(self, x):
        B, T, D = x.shape
        x_flat = x.view(-1, D)  # [B*T, D]

        # Route: pick top-K experts per token
        logits  = self.router(x_flat)                   # [B*T, E]
        weights = F.softmax(logits, dim=-1)
        top_w, top_idx = torch.topk(weights, self.top_k, dim=-1)  # [B*T, K]
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalise

        # Aggregate expert outputs
        out = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            expert_id = top_idx[:, k]   # [B*T]
            for e in range(self.n_experts):
                mask = (expert_id == e)
                if mask.any():
                    out[mask] += top_w[mask, k:k+1] * self.experts[e](x_flat[mask])

        return out.view(B, T, D)

Section 15

Quantisation — Smaller, Faster, Cheaper

A 70B parameter model in FP32 requires ~280 GB of VRAM for the weights alone, which means at least four 80 GB A100s. Quantisation reduces the precision of weights and/or activations, dramatically shrinking memory and speeding up inference. The tradeoff: some accuracy loss.

FP32

32 bits / weight

70B model = 280 GB

Training standard. No quality loss. Rarely used for inference.

BF16

16 bits / weight

70B model = 140 GB

Default for modern training & inference. Minimal quality loss. Supported natively on A100/H100.

INT8

8 bits / weight

70B model = 70 GB

LLM.int8() (bitsandbytes). ~1% quality loss. Fits 70B on 2×A100-80GB. Common production choice.

INT4 / GGUF

4 bits / weight

70B model = 35 GB

GPTQ, AWQ, GGUF (llama.cpp). ~2–4% quality loss. Fits 70B on a single A100-40GB or Mac Studio.
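The memory figures in the cards above follow from bits per weight, and the basic idea of quantisation can be shown with a simple absmax scheme. This is a deliberately simplified sketch: it is not the algorithm used by LLM.int8(), GPTQ, AWQ, or NF4, all of which add per-block scales, outlier handling, or calibration. The function names and sample weights are illustrative.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all zeros
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_err:.4f}")
```

The round-trip error is bounded by half the scale, which is why a single large outlier weight hurts: it inflates the scale for every other value that shares it.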

# Load a 70B model in 4-bit quantisation using bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantisation_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for speed
    bnb_4bit_quant_type="nf4",              # NormalFloat4 — best quality at 4-bit
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=quantisation_config,
    device_map="auto",                       # auto-distribute across available GPUs
)

print(f"Model loaded. Memory: {model.get_memory_footprint() / 1e9:.1f} GB")

OUTPUT

Model loaded. Memory: 37.2 GB ← 70B model on a single 40GB A100!

Section 16

Fine-Tuning Efficiently — LoRA & PEFT

Full fine-tuning of a 70B model must hold the weights (70B × 4 bytes in FP32), their gradients, and the optimiser states, which adds up to ≈ 1–2 TB of VRAM. Parameter-Efficient Fine-Tuning (PEFT) methods freeze most of the model and add a tiny number of trainable parameters.

📖 Analogy

Adding a Post-It Note to a Textbook

A 70B model is like the British Library: 170 million books' worth of knowledge. To teach it one new thing (your company's writing style, a domain speciality), you don't rewrite every book. You add a set of Post-It notes, tiny targeted annotations that steer the model without touching the base knowledge. These Post-It notes are LoRA adapters: pairs of small matrices whose low-rank product stands in for the full weight update.

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# LoRA config: rank 16, applied to Q and V projection matrices
lora_config = LoraConfig(
    r=16,               # rank of the decomposition (A·B where A:[d,r], B:[r,d])
    lora_alpha=32,      # scaling factor: the update A·B is multiplied by alpha/r (here 2.0)
    target_modules=["q_proj", "v_proj"],   # which weight matrices to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

OUTPUT

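trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848% ← Under 0.1% of parameters are trained. That is a share of parameters, not of compute: every forward and backward pass still runs through the frozen base model, so the main saving is in gradient and optimiser-state memory.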

LoRA Weight Update

W' = W + ΔW = W + A·B

W is frozen (d×d). ΔW is decomposed as A (d×r) times B (r×d), where r ≪ d. Only A and B are trained.

Parameter Reduction

Params = 2 × r × d_model

At r=16, d=4096: 2×16×4096 = 131,072 parameters per matrix vs 4096² ≈ 16.8M for the full matrix, a 128× reduction per adapted matrix.
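The update rule and the parameter count can be checked on a toy example. The sketch below uses plain Python lists with d=4 and r=1 so the numbers are easy to follow. It also applies the alpha/r scaling that `lora_alpha` controls, which the formula above leaves out for brevity. All helper names are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x k) and Y (k x m)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def lora_merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * A.B as a new matrix; W itself stays frozen."""
    delta = matmul(A, B)
    s = alpha / r
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


def lora_param_count(d, r):
    """Trainable parameters for one adapted d x d matrix: A (d x r) plus B (r x d)."""
    return 2 * r * d


d, r, alpha = 4, 1, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
A = [[0.5], [0.0], [0.0], [0.0]]   # d x r, trainable
B = [[0.0, 1.0, 0.0, 0.0]]         # r x d, trainable

W_merged = lora_merge(W, A, B, alpha, r)
print(W_merged[0])                                   # only row 0 is changed
print(lora_param_count(4096, 16))                    # adapter size at r=16, d=4096
print(4096 ** 2 // lora_param_count(4096, 16))       # reduction factor
```

Because the merged matrix has the same shape as W, the adapter can be folded into the base weights after training and adds no inference latency.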

Section 17

Inference & Decoding Strategies

After training, the model outputs logits (unnormalised scores) over its vocabulary at each step. How you convert those logits to actual text is a decoding strategy — and the choice dramatically affects the character of the output.

Strategy	How It Works	When to Use	Risk
Greedy	Always pick the highest probability token	Deterministic tasks, structured outputs	Repetitive, degenerate loops
Temperature Sampling	Divide logits by T before softmax. T<1 = sharp, T>1 = flat	Creative writing, dialogue	High T → incoherent text
Top-K	Sample only from K most probable tokens (e.g. K=50)	General purpose	Fixed K ignores dynamic prob. mass
Top-P (Nucleus)	Sample from the smallest set of tokens whose cumulative prob. ≥ P	Best general-purpose default	Slight overhead
Beam Search	Keep B candidate sequences, expand all, prune to B best	Translation, summarisation	Slow, verbose, "safe" outputs
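The "sharp vs flat" effect of temperature in the table is easy to see on three made-up logits. This is a standard-library sketch for illustration only; the logit values are arbitrary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

As T shrinks, the top token takes nearly all the probability mass and sampling approaches greedy decoding. As T grows, the distribution moves towards uniform.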

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9, top_k=50):
    """
    logits: [vocab_size] — raw model output for next position
    Returns: sampled token id (int)
    """
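    # 1. Temperature scaling (clamped so temperature=0 cannot divide by zero)
    logits = logits / max(temperature, 1e-5)

    # 2. Top-K filtering: mask everything below the K-th largest logit with -inf
    if top_k > 0:
        k = min(top_k, logits.size(-1))   # guard against K > vocab size
        kth_val = torch.topk(logits, k).values[-1]
        logits[logits < kth_val] = -float("inf")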

    # 3. Top-P (nucleus) filtering
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Remove tokens once cumulative prob exceeds top_p
    remove_mask = cumulative - sorted_probs > top_p
    sorted_probs[remove_mask] = 0
    probs = torch.zeros_like(logits).scatter_(0, sorted_idx, sorted_probs)
    probs = probs / probs.sum()   # renormalise

    # 4. Sample
    return torch.multinomial(probs, num_samples=1).item()
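The sampler above covers temperature, Top-K and Top-P, but not beam search. Here is a toy sketch of the idea with a hypothetical bigram "model" written as a dictionary, so it runs without any framework. It scores sequences by summed log-probability and applies no length normalisation, which real implementations usually add.

```python
import math

# Hypothetical toy model: next-token probabilities given only the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a":   {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, beam_width=2, max_len=4):
    """Return the surviving beams as (log_prob, tokens), best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":            # finished sequences carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, p in model[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [tok]))
        # Prune: keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams


best_score, best_seq = beam_search(TOY_MODEL, beam_width=2)[0]
greedy_score, greedy_seq = beam_search(TOY_MODEL, beam_width=1)[0]  # width 1 == greedy
print("beam  :", best_seq, round(math.exp(best_score), 2))
print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
```

In this toy setup greedy decoding commits to "the" because it is the likeliest first token, while a beam of width 2 keeps "a" alive and ends with a more probable full sequence.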

Section 18

LLM Architecture Comparison — Modern Landscape

Model	Params	Context	Architecture Innovations	Licence
GPT-4	~1.8T (rumoured MoE)	128K	Closed — architecture undisclosed	Proprietary
Claude 3.5	Undisclosed	200K	Constitutional AI, long context	Proprietary
LLaMA-3 70B	70B	8K (128K from Llama 3.1)	GQA, RoPE, SwiGLU, 15T tokens	Open weights
Mistral 7B	7B	32K	Sliding window attention, GQA	Apache 2.0
Mixtral 8×7B	47B (13B active)	32K	Sparse MoE, 8 experts, top-2	Apache 2.0
DeepSeek-V3	671B (37B active)	128K	MoE 256 experts, MLA attention	MIT
Gemma 2 27B	27B	8K	Alternating sliding/global attn	Open weights

Section 19

Evaluation — How We Measure LLM Quality

📊

Perplexity

exp(avg cross-entropy loss)

Measures how surprised the model is by test text. Lower = better. PPL of 5 means the model is as uncertain as choosing between 5 equally likely options at each step. Comparable only within the same tokenizer.
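The "5 equally likely options" reading can be verified directly. This is a minimal sketch, assuming we already have the probability the model assigned to each true next token; the probability values are made up.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-likelihood of the true next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always spreads its belief over 5 equally likely tokens.
uniform_ppl = perplexity([0.2] * 10)

# A more confident model on a short sequence (illustrative probabilities).
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])

print(round(uniform_ppl, 3), round(confident_ppl, 3))
```

The first value comes out at 5 up to floating-point error, and the confident model scores much closer to the minimum of 1.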

📚

Benchmark Suites

MMLU / HellaSwag / GSM8K

MMLU: 57-subject academic MCQ (science, law, history). HellaSwag: Commonsense reasoning. GSM8K: Grade-school math. HumanEval: Python code generation. MATH: Competition-level maths.

👥

Human & LLM-as-Judge

MT-Bench / Chatbot Arena

MT-Bench: GPT-4 grades model responses on 80 multi-turn questions. Chatbot Arena: humans compare models blindly and an Elo-style ranking emerges. Increasingly the gold standard, since static benchmarks get "trained into" models.
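How a ranking "emerges" from pairwise votes can be sketched with the classic Elo update. This uses the conventional constants (400-point scale, K=32) and hypothetical model names. It illustrates the idea and is not a description of Chatbot Arena's exact rating procedure.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


ratings = {"model_x": 1000.0, "model_y": 1000.0}

# Each vote: (model A, model B, score for A) from one blind comparison.
votes = [("model_x", "model_y", 1.0),
         ("model_x", "model_y", 1.0),
         ("model_x", "model_y", 0.0)]

for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

print(ratings)
```

An upset against a higher-rated model moves the ratings more than an expected win, which is why a few thousand votes are enough to separate models of different quality.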

⚠️

Benchmark Contamination

If benchmark questions appear in training data, scores are inflated and meaningless. Major labs run contamination analyses, but it's nearly impossible to fully verify at 15T token scale. As a rule: high benchmark scores with no methodology → treat with scepticism. Chatbot Arena (real users, blind comparison) is the hardest to contaminate and the most trusted signal.
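One common style of contamination analysis looks for long n-gram overlaps between benchmark items and training documents. The sketch below is a deliberately simple version, assuming whitespace tokenisation and exact matching. The function names, the example strings and the choice of n are illustrative, not any lab's actual pipeline.

```python
def ngrams(text, n):
    """Set of lower-cased word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


question = "what is the capital of france"
leaked_doc = "quiz dump: what is the capital of france answer paris"
clean_doc = "paris is a city in europe with many museums"

# n=3 only because the toy strings are short; real checks use longer n-grams.
print(overlap_ratio(question, leaked_doc, n=3))
print(overlap_ratio(question, clean_doc, n=3))
```

Exact matching misses paraphrased or translated copies of a question, which is part of why contamination is so hard to rule out at trillion-token scale.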

Section 20

Golden Rules for LLM Practitioners

⚡ LLM Architecture — Non-Negotiable Principles

More data beats more parameters — given a fixed compute budget, an undertrained large model loses to a well-trained smaller one. Follow Chinchilla scaling: ~20 tokens per parameter at minimum, and overtrain if you care about inference cost.

Always use RMSNorm + Pre-Norm + residual connections. Post-Norm (original Transformer) requires careful LR warmup and is prone to training instability. Pre-Norm with residual streams is the stable modern default.

KV cache is your inference bottleneck. Use Grouped Query Attention (GQA). For 128K+ contexts, also consider sliding window attention (Mistral), ring attention, or FlashAttention-3 for memory efficiency.

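Use LoRA before full fine-tuning. For domain adaptation, instruction tuning on custom data, or style transfer, LoRA at rank 16–64 typically achieves 90–95% of full fine-tuning quality while training roughly 0.1% of the parameters. That cuts gradient and optimiser-state memory sharply, although each forward pass still runs the full model. Always start here.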

Quantise for inference — INT4 is almost always acceptable. 4-bit GPTQ or AWQ quantisation reduces memory 8× from FP32 with ~2–4% quality drop on most tasks. The gap is negligible for conversational applications.

Temperature 0.7–0.8, Top-P 0.9 is a safe default. For coding or structured outputs: temperature near 0 (near-greedy). For creative generation: temperature 0.9–1.1. Never set temperature to 0 for tasks requiring diversity.

Emergent capabilities are not guaranteed to be reliable. A model that "can do" multi-step reasoning on benchmarks may fail on your specific problem. Always validate on your task distribution before deploying. Benchmark scores are indicators, not guarantees.
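Two of the rules above reduce to back-of-envelope arithmetic: the Chinchilla token budget and the size of the KV cache. The sketch below assumes a Llama-3-70B-like shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and BF16 cache entries. Those shape values and the helper names are assumptions for illustration.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb minimum training tokens for a given parameter count."""
    return n_params * tokens_per_param


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: one K and one V tensor per layer (hence the factor 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


seq_len = 128 * 1024  # a 128K-token context

# Assumed shape: 80 layers, head_dim 128; GQA shares 8 KV heads across 64 query heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=seq_len)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=seq_len)

print(f"KV cache with GQA: {gqa / 1e9:.1f} GB")
print(f"KV cache with full multi-head attention: {mha / 1e9:.1f} GB")
print(f"Token budget for 8B params: {chinchilla_tokens(8_000_000_000) / 1e9:.0f}B tokens")
```

Under these assumptions, a single 128K-token sequence needs roughly 43 GB of cache with GQA and eight times that without it, which is why the cache rather than the weights often limits long-context serving.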