How Large Language Models Work:

Section 01

The Story That Explains Large Language Models

📖 Real World Analogy

The World's Most Well-Read Intern

Imagine you hire an intern who, before starting work, has read every book in the Library of Congress, every Wikipedia article, every research paper, every forum thread, and billions of web pages — all in one summer. They didn't just skim; they absorbed the patterns of how ideas connect, how sentences flow, how problems get solved.

Now you ask this intern: "Complete this sentence: The capital of France is…"
They say "Paris" — not because they memorised a flashcard, but because they've seen that phrase completed correctly thousands of times and learned the statistical pattern.

That intern is a Large Language Model. It doesn't "know" things the way you do. It predicts the most probable next token given everything that came before — and from billions of such predictions, something remarkably intelligent emerges.

A Large Language Model (LLM) is a neural network with billions of parameters trained on massive text corpora to predict the next token in a sequence. From this single objective — next-token prediction — LLMs learn grammar, facts, reasoning, code, translation, and much more, without any of these being explicitly programmed.

🌿

The Core Insight

Language is a compressed representation of human thought. A model that perfectly predicts language must, by necessity, learn the underlying concepts, logic, and world knowledge that generate that language. Next-token prediction is a proxy for understanding.

Section 02

A Brief History — From N-Grams to GPT

N-Gram Models (1980s–2000s)

Statistical models that predict the next word from counts of what followed the previous N−1 words. Simple but brittle: they can't capture long-range dependencies. "The cat sat on the ___" works fine; paragraphs don't.

Recurrent Neural Networks — RNN / LSTM (2010–2017)

Neural networks with loops that carry a hidden state forward through a sequence. Much better at long text, but they still struggle with very long dependencies and are slow to train, because each time step depends on the previous one and so cannot be parallelised across the sequence.

"Attention Is All You Need" — The Transformer (2017)

Google researchers publish the Transformer architecture. Self-attention lets every token attend to every other token simultaneously. Training is fully parallelisable. This paper changes everything.

BERT and GPT-2 (2018–2019)

OpenAI's GPT-2 (1.5B parameters) demonstrates that scale matters. BERT introduces bidirectional pretraining. The "pre-train then fine-tune" paradigm is born.

GPT-3, PaLM, Chinchilla (2020–2022)

GPT-3 (175B parameters) shows emergent few-shot abilities: given a handful of examples in the prompt, it performs tasks it was never explicitly trained on. The Chinchilla paper reveals optimal compute-data trade-offs. "Scaling laws" become a science.

ChatGPT, Claude, Gemini, LLaMA (2022–Present)

RLHF (Reinforcement Learning from Human Feedback) aligns LLMs to follow instructions and be helpful. Open-source models emerge. LLMs become mainstream consumer products.
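The n-gram idea at the start of this timeline is small enough to sketch in a few lines. The following is a minimal illustration with N=2 (a bigram model); the corpus and the helper names `train_bigram` and `predict_next` are invented for this example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count, for every word, which words follow it (an n-gram model with N=2)."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(model: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    return counts.most_common(1)[0][0] if counts else None

# Toy corpus, invented for illustration
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."
bigram = train_bigram(corpus)

print(predict_next(bigram, "sat"))    # "on" follows "sat" both times it appears
print(predict_next(bigram, "the"))    # "cat" is the most frequent follower of "the"
print(predict_next(bigram, "piano"))  # None: an unseen context gives no prediction
```

The last call shows the brittleness described above: the model has nothing to say about any context it has not literally counted, and it can never look further back than one word.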

Section 03

Tokens — The Atoms of Language

📖 Story

The Morse Code of Language

Before a telegram could be sent, the message had to be broken into dots and dashes. LLMs do something similar: before any text can be processed, it must be broken into tokens, small chunks of characters. "unbelievable" might become ["un", "believ", "able"]. A leading space is usually folded into the token that follows it (" is", " great"). The model never sees letters or words; it sees integers (token IDs), and everything happens in that numerical space.

🔤

What is a Token?

~4 characters on average

A token is a chunk of text — roughly a word-piece. "ChatGPT is great" → ["Chat", "G", "PT", " is", " great"] = 5 tokens. Common words are single tokens; rare words are split into sub-words.

🗂️

Vocabulary

~50,000–100,000 tokens

The tokeniser has a fixed vocabulary (e.g., GPT-4 uses ~100K tokens via the BPE algorithm). Every token maps to an integer ID. The model's embedding layer converts these IDs into dense vectors.

📏

Context Window

The model's "working memory"

The maximum number of tokens the model can process at once. GPT-4 Turbo accepts 128K tokens (~300 pages). Text beyond this limit has to be truncated or summarised before the model can see it. Everything the model "knows" about the current conversation is in this window.

import tiktoken  # OpenAI's tokeniser library

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

text = "Large Language Models are transformers trained on massive datasets."
tokens = enc.encode(text)

print(f"Text:    {text}")
print(f"Tokens:  {tokens}")
print(f"Count:   {len(tokens)} tokens")

# Decode individual tokens to see the chunks
for tok_id in tokens:
    print(f"  {tok_id:6d} → '{enc.decode([tok_id])}'")

OUTPUT

Text:    Large Language Models are transformers trained on massive datasets.
Tokens:  [35021, 11688, 27773, 527, 2678, 388, 16572, 389, 11191, 30525, 13]
Count:   11 tokens
   35021 → 'Large'
   11688 → ' Language'
   27773 → ' Models'
     527 → ' are'
    2678 → ' transform'
     388 → 'ers'
   16572 → ' trained'
     389 → ' on'
   11191 → ' massive'
   30525 → ' datasets'
      13 → '.'
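The vocabulary card above mentions the BPE algorithm without showing how it builds sub-word tokens. The following is a toy sketch of the core training loop (repeatedly merge the most frequent adjacent pair of symbols). The word frequencies and the helper names `most_frequent_pair` and `merge_pair` are invented for illustration, and production tokenisers work on bytes with many additional details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy word frequencies (invented); each word starts as a tuple of characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print("learned merges:", merges)
print("segmented words:", list(words))
```

After a few merges the frequent suffix "est" becomes a single symbol, which is the same effect that lets a real tokeniser keep common words whole while splitting rare ones into pieces.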

Section 04

The Transformer Architecture — The Engine of LLMs

Nearly every modern LLM is built on the Transformer architecture, so understanding it is essential for building LLMs. The original design has two components: the Encoder (understands input) and the Decoder (generates output). Most LLMs (GPT family, LLaMA, Claude) use the decoder-only variant.

🔩 Decoder-Only Transformer — Layer by Layer

Step 1

Token Embedding: Each token ID is looked up in an embedding matrix, producing a dense vector of dimension d_model (e.g. 4096 for LLaMA-2 7B). This is the token's "meaning" in vector space.

Step 2

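Positional Encoding: Since Transformers have no built-in sense of order, position information must be injected. The original design adds a sinusoidal position vector to each token embedding; most modern LLMs instead use RoPE (Rotary Position Embeddings), which rotates the query and key vectors inside attention according to each token's position.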

Step 3

Multi-Head Self-Attention: The core operation. Each token asks: "Which other tokens are most relevant to me?" It computes attention scores across all tokens and produces a weighted sum. Multiple "heads" do this in parallel, each learning different types of relationships.

Step 4

Feed-Forward Network (FFN): A two-layer MLP (a gated, three-matrix variant in LLaMA-style models) applied independently to each token. Much of the model's "knowledge" is thought to be stored here. Roughly 2/3 of transformer parameters live in these layers.

Step 5

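Layer Norm + Residual Connections: Each sublayer is wrapped in a residual (skip) connection. The original Transformer used post-norm, output = LayerNorm(x + sublayer(x)); most modern LLMs, and the code in Section 06, use pre-norm, output = x + sublayer(LayerNorm(x)), which trains more stably in very deep networks.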

Step 6

Repeat N Times: A typical LLM stacks 32–96 of these Transformer blocks. Each layer refines the representation further. Deeper = more abstraction.

Step 7

LM Head (Output Projection): The final hidden state is projected back to vocabulary size via a linear layer + softmax. The output is a probability distribution over all tokens — the next token prediction.

📐

Key Dimensions for a 7B-Parameter Model (e.g. LLaMA-2 7B)

d_model = 4096 (hidden size) · n_heads = 32 (attention heads) · n_layers = 32 (transformer blocks) · d_ff = 11008 (FFN width) · vocab_size ≈ 32,000 tokens. Total parameters: ~7 billion.
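The "~7 billion" total can be checked by hand from the dimensions in this card. The arithmetic below is a sketch under assumptions that the card does not state: no bias terms, a gated FFN with three weight matrices, one norm vector of size d_model before each sublayer, and an output projection that is not tied to the input embedding.

```python
# Dimensions from the card above
d_model, n_layers, d_ff, vocab_size = 4096, 32, 11008, 32_000

# Assumptions: no biases, gated 3-matrix FFN, untied output projection
embedding  = vocab_size * d_model
attention  = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn        = 3 * d_model * d_ff      # gate, up and down projections
norms      = 2 * d_model             # one norm weight vector per sublayer
per_layer  = attention + ffn + norms
final_norm = d_model
lm_head    = d_model * vocab_size

total = embedding + n_layers * per_layer + final_norm + lm_head
ffn_share = n_layers * ffn / total

print(f"total parameters: {total:,}")   # → 6,738,415,616
print(f"FFN share: {ffn_share:.1%}")    # → 64.2%
```

The FFN share of about 64% is where the "roughly 2/3 of parameters" figure in Step 4 comes from.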

Section 05

Self-Attention — The Heart of the Transformer

📖 Story

The Board Meeting

Every token in a sentence is like a person at a board meeting. When "bank" needs to understand what it means, it turns to all the other words and asks: "How relevant are you to me?" If the sentence is "The bank approved the loan", the words "approved" and "loan" get high attention weights — they signal this is a financial bank. If the sentence is "The bank of the river flooded", "river" and "flooded" dominate.

Attention is the mechanism that lets each word dynamically decide which context to use, based on all the other tokens in the sequence rather than only its immediate neighbours.

Query

Q = X · W_Q

"What am I looking for?" Each token projects itself into query space.

Key

K = X · W_K

"What do I offer?" Each token projects itself into key space for others to match against.

Value

V = X · W_V

"What information do I carry?" The actual content passed to attending tokens.

Attention Score

softmax(QKᵀ / √d_k) · V

Dot product of queries and keys, scaled and softmaxed → weights applied to values.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    mask:    optional (seq_len, seq_len) array; 0 marks positions to hide
    Returns: (output, weights) with shapes
             (batch, seq_len, d_k) and (batch, seq_len, seq_len)
    """
    d_k = Q.shape[-1]

    # Step 1: dot product of Q and K^T → scores
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)  # scale

    # Step 2: optional causal mask (decoder only sees past tokens)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # mask future tokens

    # Step 3: softmax → attention weights
    # (subtract the row max first so np.exp cannot overflow)
    scores  = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # Step 4: weighted sum of values
    output = np.matmul(weights, V)
    return output, weights

# Toy example: 3 tokens, d_k=4
seq_len, d_k = 3, 4
Q = np.random.randn(1, seq_len, d_k)
K = np.random.randn(1, seq_len, d_k)
V = np.random.randn(1, seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (3×3 matrix):")
print(np.round(weights[0], 3))
print(f"\nOutput shape: {output.shape}")

OUTPUT

Attention weights (3×3 matrix):
[[0.412 0.291 0.297]
 [0.183 0.504 0.313]
 [0.274 0.341 0.385]]

Output shape: (1, 3, 4)
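The toy run above calls the function without a mask, so every token attends to every other token. A decoder-only model additionally hides future positions. The sketch below shows only that masking step in plain Python; the 3×3 score values are invented stand-ins for QKᵀ / √d_k, and `causal_attention_weights` is a hypothetical helper name.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Hide future positions (column j > row i), then softmax each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Invented scores for 3 tokens
scores = [
    [0.5, 1.2, -0.3],
    [0.1, 0.9,  0.4],
    [1.0, 0.2,  0.7],
]

weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The result is lower-triangular: the first token can only attend to itself, the second to the first two, and so on, while each row still sums to 1.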

Section 06

Building a Minimal GPT from Scratch in PyTorch

⚡

What We Are Building

A character-level language model: the simplest possible GPT. It reads text, tokenises it into individual characters, and trains to predict the next character. The same basic architecture, scaled up by several orders of magnitude, underlies GPT-style production models. Understanding this tiny version is understanding the core.

Step 1 — Data Preparation

import torch
import torch.nn as nn
from torch.nn import functional as F

# ── Config ─────────────────────────────────────────────────
batch_size    = 32
block_size    = 128   # context window (tokens)
n_embd        = 192   # embedding dimension
n_head        = 6     # attention heads
n_layer       = 6     # transformer blocks
dropout       = 0.1
learning_rate = 3e-4
max_iters     = 3000
device        = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── Load text data ─────────────────────────────────────────
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Character-level vocabulary
chars   = sorted(list(set(text)))
vocab_size = len(chars)
stoi    = { ch: i for i, ch in enumerate(chars) }  # char → index
itos    = { i: ch for i, ch in enumerate(chars) }  # index → char
encode  = lambda s: [stoi[c] for c in s]
decode  = lambda l: ''.join([itos[i] for i in l])

# Train / val split
data = torch.tensor(encode(text), dtype=torch.long)
n    = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]

def get_batch(split):
    d   = train_data if split == 'train' else val_data
    ix  = torch.randint(len(d) - block_size, (batch_size,))
    x   = torch.stack([d[i : i+block_size]   for i in ix])
    y   = torch.stack([d[i+1 : i+block_size+1] for i in ix])
    return x.to(device), y.to(device)
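The one-position offset between `x` and `y` in `get_batch` is what turns raw text into next-token training examples. The following is a small illustration on a plain string, with an invented text, a tiny `block_size` and a fixed start position instead of a random one.

```python
text = "hello world"
block_size = 4   # tiny context window, for illustration only
start = 2        # fixed here; get_batch picks start positions at random

x = text[start : start + block_size]          # model input
y = text[start + 1 : start + block_size + 1]  # targets: the input shifted by one

# Every position t yields one training example: given x[:t+1], predict y[t]
examples = [(x[: t + 1], y[t]) for t in range(block_size)]
for context, target in examples:
    print(f"context={context!r:8} -> target={target!r}")
```

A single window of `block_size` tokens therefore provides `block_size` prediction problems at once, which is why the loss in Step 3 is averaged over every position of every sequence in the batch.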

Step 2 — The Transformer Blocks

# ── Single Attention Head ───────────────────────────────────
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Attention scores — scaled dot product
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5
        wei = wei.masked_fill(self.tril[:T,:T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v   = self.value(x)
        return wei @ v  # (B, T, head_size)

# ── Multi-Head Attention ────────────────────────────────────
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads   = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj    = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))

# ── Feed-Forward Network ────────────────────────────────────
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x): return self.net(x)

# ── Transformer Block ───────────────────────────────────────
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size   = n_embd // n_head
        self.sa     = MultiHeadAttention(n_head, head_size)
        self.ffwd   = FeedForward(n_embd)
        self.ln1    = nn.LayerNorm(n_embd)
        self.ln2    = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual + attention
        x = x + self.ffwd(self.ln2(x))  # residual + FFN
        return x

Step 3 — The Full GPT Model

class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table    = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks     = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f       = nn.LayerNorm(n_embd)         # final layer norm
        self.lm_head    = nn.Linear(n_embd, vocab_size) # output projection
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)              # (B,T,n_embd)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x       = tok_emb + pos_emb                              # (B,T,n_embd)
        x       = self.blocks(x)
        x       = self.ln_f(x)
        logits  = self.lm_head(x)                              # (B,T,vocab_size)

        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits  = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss    = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]                      # crop to context
            logits, _ = self(idx_cond)
            logits    = logits[:, -1, :]                       # last token only
            probs     = F.softmax(logits, dim=-1)
            idx_next  = torch.multinomial(probs, num_samples=1) # sample
            idx       = torch.cat((idx, idx_next), dim=1)
        return idx

Step 4 — Training Loop

model     = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")

for step in range(max_iters):
    xb, yb    = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step % 500 == 0:
        print(f"Step {step:4d}  |  loss: {loss.item():.4f}")

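# ── Generate text ───────────────────────────────────────────
model.eval()  # switch off dropout for sampling
context = torch.zeros((1,1), dtype=torch.long, device=device)  # start from token ID 0
with torch.no_grad():  # no gradients needed at inference time
    generated = decode(model.generate(context, max_new_tokens=200)[0].tolist())
print(generated)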


Section 07

Pre-Training — Teaching the Model Everything

Pre-training is the most compute-intensive phase. The model sees trillions of tokens and learns to predict the next one. No labels, no human annotation — just raw text and the self-supervised next-token prediction objective. This is where the model "learns the world."

📚

Data

Web crawl (CommonCrawl), books, code, Wikipedia, academic papers. Quality filtering is critical — garbage in, garbage out. Deduplicate aggressively.

scale: 1T–15T tokens

🎯

Objective

Causal Language Modelling (CLM): given tokens 1…t, predict token t+1. Loss is cross-entropy over the vocabulary. Gradients flow through all layers on every step.

next-token prediction

⚡

Compute

Figures reported for LLaMA-2 (2T training tokens): the 7B model took ~184,000 GPU-hours on A100s and the 70B model ~1.7M GPU-hours. Cost: $1M–$100M+, depending on model size. Only a handful of organisations can do this.

distributed across 1000s of GPUs

📐

Scaling Laws

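Chinchilla (2022): for a fixed compute budget, the lowest loss comes from training on roughly 20 tokens per parameter, so a compute-optimal 7B model sees ~140B tokens. Model size and token count should grow together; scaling only one of them wastes compute. Many earlier large models were undertrained by this measure.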

Hoffmann et al., 2022
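The 20-tokens-per-parameter rule is easy to turn into numbers. The sketch below also uses a widely quoted approximation that is not stated in this section: training costs about 6 FLOPs per parameter per token. Treat the FLOP figures as order-of-magnitude estimates; the function names are hypothetical.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget under the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Assumed approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

for n_params in (1e9, 7e9, 70e9):
    tokens = chinchilla_tokens(n_params)
    flops = training_flops(n_params, tokens)
    print(f"{n_params / 1e9:>4.0f}B params -> {tokens / 1e9:>6.0f}B tokens, ~{flops:.2e} FLOPs")
```

Because both factors grow with model size, compute-optimal training cost rises with the square of the parameter count: a model 10× larger needs roughly 100× the compute.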

🔧

Optimiser

AdamW with cosine learning rate schedule, gradient clipping at 1.0, mixed-precision (bfloat16). ZeRO optimizer sharding and tensor/pipeline parallelism for multi-GPU training.

AdamW + bf16 + ZeRO
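The cosine learning-rate schedule named in this card can be sketched as a pure function of the step number. The linear warmup phase, the minimum learning rate and all default values here are assumptions added for illustration; the card itself only specifies "cosine".

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=3000):
    """Linear warmup to max_lr (assumed), then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine

for step in (0, 50, 100, 1550, 3000):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

The rate peaks right after warmup and then decays smoothly, so the largest updates happen early and the final steps make only small adjustments.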

📊

Evaluation

Perplexity on held-out data. Lower = better. Also evaluated on few-shot benchmarks (MMLU, HellaSwag, ARC) throughout training to track emergent capabilities.

perplexity + benchmark evals
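The objective and evaluation cards above are two views of the same quantity: perplexity is the exponential of the mean cross-entropy loss. A minimal sketch, using invented probabilities that a model assigned to the correct next token at four positions:

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-likelihood (in nats) of the tokens that actually came next."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

def perplexity(probs_of_correct_token):
    """Perplexity is exp(mean cross-entropy)."""
    return math.exp(cross_entropy(probs_of_correct_token))

confident = [0.9, 0.8, 0.95, 0.85]   # invented values
unsure    = [0.2, 0.1, 0.3, 0.25]    # invented values

print(f"confident model: {perplexity(confident):.2f}")
print(f"unsure model:    {perplexity(unsure):.2f}")

# A model that guesses uniformly over V tokens has perplexity V
V = 32_000
print(f"uniform baseline: {perplexity([1 / V] * 4):.0f}")
```

One way to read the number: a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 tokens at every step.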

⚠️

Pre-trained ≠ Usable

A raw pre-trained model is a next-token predictor. Ask it "What is the capital of France?" and it might continue with "…What is the capital of Germany? What is the capital of Spain?" — because it's seen quiz-style text. It doesn't "answer" questions. That behaviour requires the next phase: instruction fine-tuning and alignment.

Section 08

Fine-Tuning — Teaching the Model to Be Helpful

Fine-tuning adapts a pre-trained LLM to a specific task or to follow instructions. There are three main approaches, each with a different cost/performance profile.

🎯

Full Fine-Tuning

All parameters updated

Update every weight in the model on task-specific data. Highest performance but requires as much VRAM as pre-training. Risks catastrophic forgetting of base knowledge.

✅ Best accuracy

❌ Most expensive, can't run on consumer GPUs

🔗

LoRA

Low-Rank Adaptation

Freeze the original model. Add small trainable rank-decomposition matrices to each attention layer. Train only these adapters (~0.1% of parameters). Merge at inference — zero latency cost.

✅ 10x memory reduction, near-full-FT quality

❌ Slight quality gap on very specific tasks

⚡

QLoRA

Quantised LoRA

Load the base model in 4-bit quantisation. Apply LoRA adapters in 16-bit. This lets a 65B model be fine-tuned on a single 48GB GPU. The democratisation breakthrough for LLM fine-tuning.

✅ Fine-tune 65B-class models on a single 48GB GPU

❌ Slightly lower quality than 16-bit LoRA
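The memory difference between these approaches comes mostly from how many bits each frozen weight occupies. The estimate below is a rough sketch that counts the weights only; it ignores activations, optimiser state, adapter weights and quantisation constants, and the helper name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for n_params in (7e9, 65e9, 70e9):
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{n_params / 1e9:>3.0f}B params: 16-bit = {fp16:6.1f} GB, 4-bit = {int4:5.1f} GB")
```

At 16 bits, the weights of a 65B model alone need about 130 GB; at 4 bits they need about 32.5 GB, which is what makes a single 48GB card plausible once the small adapters and activations are added.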

LoRA Fine-Tuning with Hugging Face + PEFT

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer
from datasets import load_dataset

# ── Load base model ─────────────────────────────────────────
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(   # QLoRA: 4-bit quantisation
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # make the quantised model trainable

# ── LoRA config ─────────────────────────────────────────────
lora_config = LoraConfig(
    task_type    = TaskType.CAUSAL_LM,
    r            = 8,            # rank of the adapter matrices
    lora_alpha   = 16,           # scaling factor (alpha / r = 2)
    lora_dropout = 0.05,
    target_modules = ["q_proj", "v_proj"],  # add "k_proj", "o_proj" for more capacity
    bias         = "none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 4,194,304 || all params: 6,742,609,920 || 0.0622%
# (r=16 on all four attention projections would give 16,777,216 trainable params)

# ── Dataset ─────────────────────────────────────────────────
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# ── Training ────────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir          = "./llama2-lora",
    num_train_epochs    = 3,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 4,
    learning_rate       = 2e-4,
    fp16                = True,
    logging_steps       = 50,
    save_steps          = 500,
)

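# Note: recent TRL releases expect dataset_text_field and the sequence-length
# limit on an SFTConfig passed as `args`; check the version you have installed.
trainer = SFTTrainer(
    model           = model,
    train_dataset   = dataset,
    args            = training_args,
    dataset_text_field = "text",
    max_seq_length  = 512,
)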

trainer.train()

🎁

Only 0.06% of Parameters Need Updating

LoRA trains ~4 million parameters out of 6.7 billion — yet achieves near full fine-tuning quality. The adapter adds two small matrices A and B to each target layer such that the update is ΔW = B·A where rank(ΔW) ≤ r. The low-rank assumption works because most adaptation is low-dimensional.
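The "~4 million out of 6.7 billion" figure follows directly from the shapes of A and B. The calculation below is a sketch under stated assumptions: LLaMA-2 7B shapes (4096×4096 attention projections, 32 layers, 6,738,415,616 base parameters), rank 8, and adapters on two projections per layer (for example query and value). The function name is hypothetical.

```python
def lora_trainable_params(d_in, d_out, r, n_target_modules, n_layers):
    """Each adapted weight W (d_out x d_in) gets A (r x d_in) and B (d_out x r)."""
    per_module = r * d_in + d_out * r
    return per_module * n_target_modules * n_layers

d_model, n_layers = 4096, 32
base = 6_738_415_616  # assumed base model size (LLaMA-2 7B)

trainable = lora_trainable_params(d_model, d_model, r=8, n_target_modules=2, n_layers=n_layers)
share = trainable / (base + trainable)

print(f"trainable: {trainable:,}")   # → 4,194,304
print(f"share:     {share:.4%}")     # → 0.0622%

# The count scales linearly with rank and with the number of adapted projections
bigger = lora_trainable_params(d_model, d_model, r=16, n_target_modules=4, n_layers=n_layers)
print(f"r=16 on 4 projections: {bigger:,}")   # → 16,777,216
```

A full update of one 4096×4096 projection would touch about 16.8 million weights; the rank-8 adapter for the same projection has 65,536.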

Section 09

RLHF — Making the Model Safe and Helpful

📖 Story

Training a Dog vs Training an LLM

You've taught a dog to sit (pre-training — it knows the behaviour exists). Now you want it to sit on command, not randomly. You reward good sits with treats and ignore bad ones. Over time it learns what earns rewards.

RLHF does the same to an LLM. Humans rate model outputs ("this response is helpful, that one is harmful"). A reward model learns to score responses. Then the LLM is optimised with reinforcement learning to maximise that reward — producing helpful, harmless, and honest outputs.

Supervised Fine-Tuning (SFT)

Fine-tune the base model on high-quality human-written demonstrations. "Question: … Answer: …" pairs written by expert labellers. This teaches the model the instruction-following format.

Reward Model Training

Human raters compare pairs of model responses and rank them. A separate model (also a Transformer) is trained to predict human preference scores. This is the reward model — it approximates "what humans like."

Reinforcement Learning (PPO)

The SFT model is treated as the policy. It generates responses; the reward model scores them. PPO (Proximal Policy Optimisation) updates the LLM to generate higher-scoring responses, with a KL divergence penalty to prevent it drifting too far from the SFT baseline.

DPO — The Modern Simplification

Direct Preference Optimisation (2023) eliminates the separate reward model and RL loop. It directly optimises the LLM on preference pairs using a re-derived loss function. Simpler, more stable, and often matches PPO quality.

from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# DPO Dataset format: prompt, chosen response, rejected response
dpo_data = {
    "prompt":   ["Explain photosynthesis.", "What is 2+2?"],
    "chosen":   ["Photosynthesis is the process by which plants use sunlight...",
                  "2+2 equals 4."],
    "rejected": ["Plants eat sunlight lol",
                  "I don't know, maybe 5?"],
}
dataset = Dataset.from_dict(dpo_data)

dpo_config = DPOConfig(
    beta              = 0.1,  # KL penalty coefficient — lower = more aggressive updates
    max_prompt_length = 512,
    max_length        = 1024,
    output_dir        = "./dpo_model",
    num_train_epochs  = 1,
    per_device_train_batch_size = 2,
    learning_rate     = 5e-5,
)

dpo_trainer = DPOTrainer(
    model     = model,       # your SFT-fine-tuned model
    ref_model = ref_model,   # frozen copy of SFT model (KL reference)
    args      = dpo_config,
    train_dataset = dataset,
    tokenizer = tokenizer,
)

dpo_trainer.train()
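Behind the trainer call, the DPO objective for a single preference pair is compact enough to write out. The sketch below takes the summed log-probabilities of each response under the policy and under the frozen reference model; the numeric inputs are invented, and `dpo_loss` is a hypothetical helper, not part of TRL.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio   = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))  # identical to -log(sigmoid(margin))

# At the start of training the policy equals the reference: margin 0, loss log(2)
print(f"{dpo_loss(-10.0, -12.0, -10.0, -12.0):.4f}")

# Policy has shifted probability towards the chosen response: loss falls
print(f"{dpo_loss(-8.0, -15.0, -10.0, -12.0):.4f}")

# Policy has shifted towards the rejected response: loss rises
print(f"{dpo_loss(-13.0, -9.0, -10.0, -12.0):.4f}")
```

Only the change relative to the reference model matters, which is how DPO keeps the policy close to the SFT baseline without a separate reward model or an explicit KL term.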

Section 10

Inference — How LLMs Generate Text

Inference is the process of generating tokens from a trained model. The model takes an input (the prompt) and auto-regressively generates one token at a time. The way tokens are sampled from the output distribution dramatically affects output quality.

🎯

Greedy Decoding

Always pick the top token

At each step, pick the single highest-probability token. Fast but repetitive and often degenerate. A locally optimal choice at each step can lead to globally poor outputs.

❌ Repetitive, boring, no diversity

🌡️

Temperature Sampling

Divide logits by T before softmax

T=1.0: original distribution. T<1.0: sharper (more predictable). T>1.0: flatter (more random). T≈0: greedy. T=0.7 is a popular creative writing default.

✅ Controllable diversity

📊

Top-p (Nucleus) Sampling

Sample from top-p probability mass

Sort tokens by probability. Include them until cumulative probability reaches p (e.g. 0.9). Sample only from this "nucleus." Adapts the candidate set size to the distribution — more robust than top-k.

✅ Widely used in production

import torch
import torch.nn.functional as F

def generate(model, tokenizer, prompt, max_new_tokens=200,
             temperature=0.8, top_p=0.9, top_k=50):

    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            outputs = model(inputs)
            logits  = outputs.logits[:, -1, :]   # last token's distribution

            # ── Temperature ──────────────────────────────
            logits = logits / temperature

            # ── Top-k filter ─────────────────────────────
            top_k_vals, _ = torch.topk(logits, top_k)
            threshold = top_k_vals[:, [-1]]
            logits = logits.masked_fill(logits < threshold, -float('inf'))

            # ── Top-p filter (nucleus sampling) ──────────
            probs       = F.softmax(logits, dim=-1)
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            cumulative  = torch.cumsum(sorted_probs, dim=-1)
            to_remove   = cumulative - sorted_probs > top_p
            sorted_probs[to_remove] = 0
            sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)

            # ── Sample ───────────────────────────────────
            next_token_idx = torch.multinomial(sorted_probs, num_samples=1)
            next_token     = sorted_idx.gather(-1, next_token_idx)
            inputs         = torch.cat([inputs, next_token], dim=1)

            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(inputs[0], skip_special_tokens=True)

output = generate(model, tokenizer, "Explain quantum entanglement simply:")
print(output)
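To see what temperature and top-p do to a distribution without loading a model, the following sketch applies both to five invented logits in plain Python. The helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p=0.9):
    """Indices of the smallest high-probability set whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # invented logits for five candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: top prob = {probs[0]:.3f}, nucleus size = {len(nucleus(probs))}")
```

Lower temperatures concentrate probability on the top token and shrink the nucleus (2 candidates at T=0.5), while higher temperatures flatten the distribution and widen it (4 candidates at T=2.0). This is the adaptive candidate set that distinguishes top-p from a fixed top-k.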

Section 11

Using LLMs via API — The Practical Path

For most practitioners, training an LLM from scratch is not the goal. The goal is to use LLMs. The fastest path is through APIs.

OpenAI / Anthropic API

from openai import OpenAI
import anthropic

# ── OpenAI ──────────────────────────────────────────────────
openai_client = OpenAI(api_key="sk-...")

response = openai_client.chat.completions.create(
    model   = "gpt-4o",
    messages= [
        { "role": "system", "content": "You are a helpful data scientist." },
        { "role": "user",   "content": "Explain p-values in one paragraph."  },
    ],
    max_tokens  = 300,
    temperature = 0.7,
)
print(response.choices[0].message.content)

# ── Anthropic (Claude) ───────────────────────────────────────
claude = anthropic.Anthropic(api_key="sk-ant-...")

message = claude.messages.create(
    model      = "claude-sonnet-4-6",
    max_tokens = 1024,
    messages   = [
        { "role": "user", "content": "Write a Python function to detect outliers using IQR." }
    ]
)
print(message.content[0].text)

Running LLMs Locally with Ollama + LangChain

# First: curl -fsSL https://ollama.ai/install.sh | sh
# Then:  ollama pull llama3.2

from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate

# Completely free, runs on your machine
llm = OllamaLLM(model="llama3.2", temperature=0.7)

prompt = PromptTemplate(
    input_variables=["topic"],
    template="You are an expert data scientist. Explain {topic} in simple terms with an example."
)

chain = prompt | llm
result = chain.invoke({ "topic": "gradient descent" })
print(result)

Section 12

RAG — Retrieval-Augmented Generation

📖 Story

The Open-Book Exam

An LLM trained in 2024 doesn't know about events in 2025. It also doesn't know your company's private documents. RAG solves this by giving the LLM an open book: before answering, it searches a database for relevant documents, then uses those as context for its answer. The LLM doesn't need to have memorised anything — it reads and reasons on demand.

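# Import paths follow the split LangChain packages (langchain-community,
# langchain-openai, langchain-text-splitters); older releases expose the
# same classes under langchain.*
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA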

# ── Step 1: Load and chunk documents ───────────────────────
loader   = PyPDFLoader("company_docs.pdf")
docs     = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks   = splitter.split_documents(docs)

# ── Step 2: Embed chunks and store in vector DB ────────────
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore     = Chroma.from_documents(chunks, embedding_model)

# ── Step 3: Retrieval QA chain ──────────────────────────────
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks

qa_chain = RetrievalQA.from_chain_type(
    llm       = ChatOpenAI(model="gpt-4o", temperature=0),
    chain_type= "stuff",           # stuff all chunks into prompt
    retriever = retriever,
    return_source_documents = True,
)

result = qa_chain.invoke("What is our refund policy?")
print("Answer:", result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])

🔑

The RAG Pipeline at a Glance

Offline: Chunk documents → embed each chunk → store vectors in database.
Online: Embed user query → find top-k similar chunks → inject into LLM prompt → LLM answers grounded in retrieved context.
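The offline and online halves of this pipeline can be demonstrated without any external service. In the sketch below a bag-of-words count vector stands in for the neural embedding, a Python list stands in for the vector database, and the chunks and query are invented; a real system would use an embedding model and a vector store as in the code above.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Offline: chunk documents, embed each chunk, store the vectors
chunks = [
    "refunds are issued within 14 days of purchase",
    "our office is closed on public holidays",
    "shipping takes three to five business days",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Online: embed the query, rank chunks by similarity, keep the top k
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

query = "how many days until refunds are issued"
context = retrieve(query, k=1)

# Inject the retrieved text into the prompt that would be sent to the LLM
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
print(prompt)
```

Word overlap is a crude notion of similarity; neural embeddings fill the same role but also match paraphrases that share no words with the query.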

Section 13

Evaluating LLMs — How Do You Know It's Good?

Evaluation Method	What It Measures	Scoring	Best For
Perplexity	How well the model predicts held-out text. Lower = better.	Automatic	Comparing model versions during training
MMLU	Massive Multitask Language Understanding: 57 academic subjects, 4-choice MCQ	Automatic	General knowledge and reasoning capability
HumanEval	Pass@k on 164 Python coding problems from docstrings	Automatic	Code generation ability
MT-Bench	Multi-turn conversation quality, scored by GPT-4 as judge	LLM-as-Judge	Instruction-following, helpfulness
ROUGE / BLEU	N-gram overlap between generated and reference text	Automatic	Summarisation, translation
Human evaluation	Direct human preference ratings (A vs B)	Human raters (expensive)	Gold standard for safety and helpfulness

from evaluate import load
import numpy as np

# ── Perplexity ─────────────────────────────────────────────
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(
    predictions=["The model generated this sentence.", "Another example output."],
    model_id="gpt2"
)
print(f"Mean Perplexity: {results['mean_perplexity']:.2f}")

# ── ROUGE for summarisation ─────────────────────────────────
rouge = load("rouge")
predictions = ["The cat sat on the mat near the window."]
references  = ["The cat sat on the mat."]

scores = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {scores['rouge1']:.4f}")
print(f"ROUGE-L: {scores['rougeL']:.4f}")

OUTPUT

Mean Perplexity: 24.31
ROUGE-1: 0.8000
ROUGE-L: 0.8000
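The table lists pass@k for HumanEval without defining it. A common way to compute it is the unbiased estimator 1 − C(n−c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. A minimal sketch with invented counts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: samples generated per problem
    c: samples that passed the tests
    k: number of attempts the metric allows
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Invented example: 20 samples for one problem, 3 of them pass
print(f"pass@1:  {pass_at_k(20, 3, 1):.3f}")   # → 0.150
print(f"pass@10: {pass_at_k(20, 3, 10):.3f}")  # → 0.895
```

The benchmark score is the mean of this value over all problems, so a model that solves a problem only occasionally still earns partial credit at larger k.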

Section 14

LLM Architecture Comparison

Model	Params	Architecture	Key Innovation	Open Weights?
GPT-2	1.5B	Decoder-only	Showed zero-shot task transfer from language modelling alone	Yes
GPT-3	175B	Decoder-only	Few-shot in-context learning at scale	No
BERT	340M (Large)	Encoder-only	Masked language modelling with bidirectional context; NLU benchmark dominance	Yes
T5	11B (largest)	Encoder-Decoder	"Text-to-text" unified framing for all NLP tasks	Yes
LLaMA-3 / 3.1	8B–405B	Decoder-only	Open weights; GQA; RoPE; competitive with closed models	Yes
Mistral 7B	7B	Decoder-only	Sliding window attention; outperforms LLaMA-2 13B at 7B size	Yes
GPT-4o	Undisclosed	Undisclosed	Natively multimodal (text, vision, audio) in a single model	No
Claude 3.5+	Undisclosed	Undisclosed	Constitutional AI; long context; coding excellence	No

Section 15

Prompt Engineering — Communicating with LLMs

You don't need to train an LLM to get dramatically better results. The right prompt can unlock capabilities the model already has. Prompt engineering is now a professional skill.

🧠

Chain-of-Thought (CoT)

"Let's think step by step"

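Adding "think step by step" to a prompt often improves accuracy on math, logic, and multi-step problems. The model writes out its intermediate steps before committing to an answer, so each step conditions the next and fewer errors slip through.

✅ Often double-digit accuracy gains on reasoning benchmarks, though the size varies by model and task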

📋

Few-Shot Prompting

Show examples in the prompt

Provide 3–5 input/output examples directly in the prompt. The model learns the pattern from context alone — no gradient updates needed. Works surprisingly well for classification and formatting tasks.

✅ No fine-tuning or labelled training set required

🔧

System Prompts

Define the model's persona and rules

The system message is processed before the user's input and sets the model's "mode". "You are a strict JSON API" constrains output format. "You are a Socratic tutor" changes interaction style entirely.

✅ Consistent behaviour across all turns

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def ask(prompt: str) -> str:
    """Send a single user message and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# ── Chain-of-Thought example ────────────────────────────────
cot_prompt = """Solve the following problem. Think through it step by step.

Problem: A train travels 120 km at 60 km/h, then 80 km at 40 km/h.
What is the average speed for the whole journey?

Think step by step:"""
print(ask(cot_prompt))
# Correct reasoning: 120/60 = 2 h and 80/40 = 2 h, so 200 km / 4 h = 50 km/h

# ── Few-Shot example ────────────────────────────────────────
few_shot_prompt = """Classify the sentiment of each review.

Review: "The food was absolutely amazing!" → Positive
Review: "Terrible service, never coming back." → Negative
Review: "It was okay, nothing special." → Neutral

Review: "Best burger I've ever had, will definitely return!" → """
print(ask(few_shot_prompt))
# Typical reply: Positive

# ── Structured JSON output ───────────────────────────────────
# JSON mode requires the word "JSON" to appear somewhere in the messages.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a JSON API. Always respond with valid JSON only. No explanation."},
        {"role": "user",
         "content": "Extract: name, age, city from: 'Alice Smith, 29, lives in London.'"}
    ],
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)
# Typical reply: {"name": "Alice Smith", "age": 29, "city": "London"}
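
In practice, few-shot prompts are usually assembled from a list of labelled examples rather than typed by hand. The helper below is a minimal sketch of that idea; `build_few_shot_prompt` is a hypothetical name, and the "Review: ... → label" layout simply mirrors the example above.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from (text, label) pairs plus one unlabelled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f'Review: "{text}" → {label}')
    lines.append("")
    # The trailing arrow leaves the label for the model to complete.
    lines.append(f'Review: "{query}" → ')
    return "\n".join(lines)


examples = [
    ("The food was absolutely amazing!", "Positive"),
    ("Terrible service, never coming back.", "Negative"),
    ("It was okay, nothing special.", "Neutral"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "Best burger I've ever had, will definitely return!",
)
print(prompt)
```

Keeping the examples in data makes it easy to swap them, shuffle their order, or measure how many are actually needed.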

Section 16

Golden Rules for LLM Practitioners

🌿 LLM Engineering — Non-Negotiable Rules

Never train from scratch unless you have a frontier-scale compute budget and trillions of tokens of training data. Start with a pre-trained model. Fine-tune with LoRA/QLoRA for custom behaviour. The large majority of production use cases are API calls or fine-tuning, not pre-training.

Prompt engineering before fine-tuning. Fine-tuning before pre-training. Try a well-crafted prompt first. If that's insufficient, fine-tune. Only pre-train if you have domain-specific data unavailable in existing models (medical, legal, scientific).

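Temperature controls creativity, not intelligence. It rescales the next-token probabilities before sampling: low values concentrate probability on the most likely tokens, high values flatten the distribution. For factual tasks, use temperature=0 or very low (0.1–0.3). For creative tasks, use 0.7–1.0. Avoid temperatures much above 1.2 in production, where outputs tend to become incoherent.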
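
The effect of temperature is easiest to see on a toy distribution. The sketch below divides the logits by the temperature before applying softmax, which is the standard formulation; the three logits are made-up scores for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # temperature=0 is conventionally handled as greedy argmax instead
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all of the probability lands on the top token; at higher temperature the less likely tokens get sampled more often.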

LLMs hallucinate — always verify factual claims. LLMs are trained to produce plausible text, not necessarily true text. For factual applications (medical, legal, financial), use RAG with trusted sources and require citations. Never trust a model's confidence as ground truth.

Chunk size matters enormously in RAG. Too small: chunks lack context. Too large: irrelevant text dilutes the signal. Experiment with 256–1024 tokens with 10–20% overlap. Evaluate retrieval quality, not just generation quality — retrieval is the most common bottleneck.
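
A sliding-window chunker with overlap can be sketched as follows. This is an illustration only: it works on any list of tokens, and the whitespace split used in the usage example is a stand-in for a real tokenizer. The function name and defaults are assumptions, not a library API.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
text = "one two three four five six seven eight nine ten"
for chunk in chunk_tokens(text.split(), chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With `chunk_size=512` and `overlap=64` the overlap is 12.5%, inside the 10–20% range suggested above. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.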

Use structured outputs in production. Request JSON output via response_format={"type": "json_object"} or use tools/function calling, because parsing free-form text in production is a maintenance nightmare. JSON mode guarantees syntactically valid JSON, not a particular schema or identical output across runs, so validate the fields you depend on. Structured outputs are parseable and testable.
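
Even with JSON mode enabled, it is prudent to validate what comes back before using it. The sketch below checks a reply against a small hand-written schema; `REQUIRED_FIELDS` and `parse_model_json` are hypothetical names, and the raw strings stand in for real model replies.

```python
import json

# Hypothetical schema for the extraction example: field name -> expected type.
REQUIRED_FIELDS = {"name": str, "age": int, "city": str}


def parse_model_json(raw):
    """Parse a model reply and check that the expected fields are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return data


record = parse_model_json('{"name": "Alice Smith", "age": 29, "city": "London"}')
print(record["name"], record["age"], record["city"])

try:
    parse_model_json('{"name": "Alice Smith", "age": "twenty-nine"}')
except ValueError as err:
    print("rejected:", err)
```

A failed validation is a natural place to retry the request or fall back to a default, rather than letting a malformed record flow downstream.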

Monitor token costs obsessively. GPT-4o output tokens have been listed at roughly $10 per million (check the provider's current price list). A naive chatbot processing 100,000 user messages per day with 500 output tokens each generates 50 million output tokens, or about $500/day before input tokens are counted. Cache aggressively, compress prompts, and use smaller models for simpler tasks (GPT-4o-mini vs GPT-4o).
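
The arithmetic behind such an estimate is simple enough to keep in a helper. In this sketch the per-million prices are placeholder assumptions, not quoted rates; substitute the provider's current price list. Input tokens are ignored for brevity.

```python
def daily_output_cost(messages_per_day, output_tokens_per_message,
                      usd_per_million_output_tokens):
    """Estimated daily spend on output tokens only (input tokens not included)."""
    total_tokens = messages_per_day * output_tokens_per_message
    return total_tokens / 1_000_000 * usd_per_million_output_tokens


# Placeholder prices in USD per million output tokens (assumptions, not quotes).
for price in (5.0, 10.0):
    cost = daily_output_cost(100_000, 500, price)
    print(f"${price:.2f} per million tokens -> ${cost:,.2f} per day")
```

Because the cost is linear in every factor, halving the output length or routing half the traffic to a cheaper model cuts the bill proportionally.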

The context window is not free memory. Every token in the context is processed at every layer during inference. Cost scales quadratically with sequence length for standard attention. For very long contexts, explore sparse attention, sliding window, or summarisation strategies.
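
The quadratic term is easy to underestimate. As a rough sketch, standard attention computes one score for every pair of positions, in every head of every layer. The layer and head counts below are hypothetical, and the feed-forward layers, which scale linearly with length, are ignored.

```python
def attention_score_entries(seq_len, n_layers=32, n_heads=32):
    """Number of query-key scores computed by full attention for one sequence."""
    return seq_len * seq_len * n_layers * n_heads


baseline = attention_score_entries(1_000)
for seq_len in (1_000, 2_000, 8_000, 32_000):
    ratio = attention_score_entries(seq_len) / baseline
    print(f"{seq_len:>6} tokens -> {ratio:,.0f}x the attention work of 1,000 tokens")
```

Doubling the context quadruples the attention work, which is why sliding-window and sparse attention variants become attractive for very long inputs.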