The Story That Explains Building an LLM
At first the child guesses wildly. Cat follows "The dog chased the…"? Probably not. But after reading ten million sentences, the child builds deep intuitions. They know that "The stock market…" probably continues with fell, rose, or closed — never with barked.
That is precisely how a Large Language Model is trained. It reads an enormous slice of human text. At each position, it predicts the next token. Every wrong prediction fires a correction signal through the network. After billions of corrections, the model does not just predict text — it understands language well enough to reason, translate, summarise, and code.
In this tutorial you will build a working LLM from scratch using Python. We start from raw text, write a tokeniser, implement a Transformer architecture, train it on real data, and finish with a model that can generate coherent language. Every line of code is explained. Every concept has a story.
This tutorial assumes you have already completed the LLM Basics tutorial and know what tokens, embeddings, attention, and the Transformer architecture are conceptually. Here we focus entirely on building, training, and running a model using Python, PyTorch, and real data.
The LLM Creation Pipeline — The Big Picture
Building an LLM is a six-stage journey. Each stage feeds the next. Skip one and the whole system breaks down.
Setting Up Your Environment
Every professional LLM experiment lives in an isolated environment. Here is how to set yours up in under five minutes.
# Create and activate a virtual environment
python -m venv llm_env
source llm_env/bin/activate # Linux / macOS
# llm_env\Scripts\activate # Windows
# Install the full stack
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tokenizers datasets transformers wandb tqdm numpy
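# Verify GPU is available (run this part in Python, not in the shell)
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # get_device_name raises an error on machines without a CUDA device
    print("GPU:", torch.cuda.get_device_name(0))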
Step 1 — Data Collection & Cleaning
What Makes Good Training Data?
Below is the full data pipeline — from raw download to clean training-ready text:
from datasets import load_dataset
import re, unicodedata
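# ── 1. Load a public dataset (streaming to save RAM) ─────────
# Note: recent releases of `datasets` may refuse to run dataset loading
# scripts. If this call fails, the "wikimedia/wikipedia" repository with a
# dump name such as "20231101.en" is a likely substitute.
dataset = load_dataset(
    "wikipedia",
    "20220301.en",
    split="train",
    streaming=True,
)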
# ── 2. Text cleaning function ─────────────────────────────────
def clean_text(text: str) -> str:
    # Normalise unicode: NFKC folds compatibility forms (ligatures,
    # full-width characters) but keeps accented letters such as é
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs
    text = re.sub(r"https?://\S+", "", text)
    # Collapse runs of three or more whitespace characters into a paragraph break
    text = re.sub(r"\s{3,}", "\n\n", text)
    # Remove very short lines (likely navigation noise); this also drops blank lines
    lines = [line for line in text.splitlines() if len(line.split()) > 5]
    return "\n".join(lines).strip()
# ── 3. Stream, clean, and write to disk ──────────────────────
with open("train.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        if i >= 50_000:          # 50k articles for demo
            break
        clean = clean_text(example["text"])
        if len(clean) > 200:     # Skip tiny stubs
            f.write(clean + "\n\n")
print("Done. Dataset written to train.txt")
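To see what the NFKC normalisation step does and does not change, here is a small standard-library sketch. The sample strings are made up for illustration and are not taken from the dataset.

```python
import unicodedata

# NFKC folds "compatibility" characters into their plain equivalents
# and composes base letter + combining accent, but it does not strip accents.
samples = {
    "ligature": "\ufb01le",              # U+FB01 LATIN SMALL LIGATURE FI + "le"
    "fullwidth": "\uff27\uff30\uff34",   # full-width letters G, P, T
    "accent": "caf\u00e9",               # precomposed e-acute
    "decomposed": "cafe\u0301",          # "e" followed by a combining acute accent
}

normalised = {name: unicodedata.normalize("NFKC", s) for name, s in samples.items()}

for name, value in normalised.items():
    print(f"{name:>10}: {value!r}")
```

The ligature and full-width forms collapse to plain ASCII, while both spellings of "café" end up as the same precomposed string, so the tokeniser sees one form instead of two.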
Step 2 — Building the BPE Tokeniser
A tokeniser converts raw text into integer IDs the model can process. We train a Byte-Pair Encoding (BPE) tokeniser — the same algorithm used by GPT-2, GPT-3, and LLaMA. BPE starts with individual characters and iteratively merges the most frequent pairs, building a vocabulary of subword units.
Count all adjacent pairs across your whole corpus. If "a b" is the most common pair, merge it into a single token "ab". Repeat until your vocabulary reaches the target size (e.g. 32,768). The result: common words become single tokens ("the" → 1 token), rare words are split into known subwords ("unbelievable" → un + believ + able).
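The merge loop described here can be sketched in a few lines of plain Python. This is a toy version that works on a word-frequency table and characters rather than bytes; the function name `train_bpe` and the four-word corpus are illustrative assumptions, not part of the `tokenizers` library.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Represent every word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (real implementations may break ties differently).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

toy_words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_words, num_merges=4)
print("Merges:   ", merges)
print("Segmented:", segmented)
```

After four merges the frequent word "low" is a single symbol and "est" has become a reusable suffix, which is the same effect the full tokeniser achieves at vocabulary scale.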
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
# ── Initialise a blank BPE tokeniser ─────────────────────────
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()
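# ── Configure the trainer and special tokens ──────────────────
trainer = BpeTrainer(
    vocab_size=32768,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
    # Seed the vocabulary with all 256 byte symbols so that no input
    # ever has to fall back to [UNK]
    initial_alphabet=ByteLevel.alphabet(),
    show_progress=True,
)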
# ── Train on our cleaned corpus ───────────────────────────────
tokenizer.train(files=["train.txt"], trainer=trainer)
# ── Save for reuse ────────────────────────────────────────────
tokenizer.save("my_tokenizer.json")
print(f"Vocab size: {tokenizer.get_vocab_size()}")
# ── Quick test ────────────────────────────────────────────────
sample = "Large language models learn from text."
encoded = tokenizer.encode(sample)
print("Tokens:", encoded.tokens)
print("IDs: ", encoded.ids)
The Ġ symbol (a byte-level encoding of a space) marks the beginning of a new word. This is GPT-style byte-level BPE: it operates on raw bytes rather than Unicode characters, so it is language-agnostic and, as long as all 256 byte symbols are in the vocabulary, never needs an out-of-vocabulary token.
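Where the Ġ comes from is easier to see in code. The sketch below reproduces the byte-to-character table popularised by GPT-2's byte-level BPE, on the assumption that the `ByteLevel` pre-tokeniser follows the same scheme. The helper names `bytes_to_unicode` and `to_visible` are illustrative.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character.

    Bytes that are already printable keep their own character; all others
    are shifted to code points starting at 256.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))   # visible ASCII
        + list(range(0xA1, 0xAC + 1))         # Latin-1 block, part 1
        + list(range(0xAE, 0xFF + 1))         # Latin-1 block, part 2
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

table = bytes_to_unicode()

def to_visible(text):
    """Show a string the way a byte-level tokeniser sees it."""
    return "".join(table[b] for b in text.encode("utf-8"))

print(to_visible("Large language models"))
print(hex(ord(table[32])))  # code point assigned to the space byte
```

The space byte (32) is the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is Ġ. Every possible byte has an entry, which is why no input is ever out of vocabulary at the byte level.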
Step 3 — The Transformer Architecture in Code
We now implement the decoder-only Transformer used by GPT-style models. It has five key components.
Token embedding: maps each token ID to a vector of size d_model. These vectors are learned during training.
Output head: projects from d_model back to vocab_size. The softmax of this output gives a probability over the entire vocabulary for the next token.
Full Model Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
# ─────────────────────────────────────────────────────────────
# CONFIG: small model for experimentation (~36M parameters)
# ─────────────────────────────────────────────────────────────
@dataclass
class GPTConfig:
    vocab_size: int = 32768
    context_len: int = 512   # sequence length
    d_model: int = 512       # embedding dimension
    n_heads: int = 8         # attention heads
    n_layers: int = 6        # transformer blocks
    dropout: float = 0.1
# ─────────────────────────────────────────────────────────────
# MULTI-HEAD CAUSAL SELF-ATTENTION
# ─────────────────────────────────────────────────────────────
class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg.d_model % cfg.n_heads == 0
        self.n_heads = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.d_model = cfg.d_model
        # Fused Q, K, V projection
        self.qkv_proj = nn.Linear(cfg.d_model, 3 * cfg.d_model, bias=False)
        self.out_proj = nn.Linear(cfg.d_model, cfg.d_model, bias=False)
        self.attn_drop = nn.Dropout(cfg.dropout)
        self.res_drop = nn.Dropout(cfg.dropout)
        # Causal mask: lower-triangular matrix of ones. Positions where the
        # mask is 0 (the future) are set to -inf in forward()
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(cfg.context_len, cfg.context_len))
            .view(1, 1, cfg.context_len, cfg.context_len)
        )
    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, d_model
        # Compute Q, K, V in one shot then split
        q, k, v = self.qkv_proj(x).split(self.d_model, dim=2)
        # Reshape for multi-head: (B, n_heads, T, head_dim)
        def reshape(t):
            return t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = reshape(q), reshape(k), reshape(v)
        # Scaled dot-product attention
        scale = math.sqrt(self.head_dim)
        attn = (q @ k.transpose(-2, -1)) / scale
        attn = attn.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        attn = F.softmax(attn, dim=-1)
        attn = self.attn_drop(attn)
        # Weighted sum of values, merge heads
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.res_drop(self.out_proj(out))
# ─────────────────────────────────────────────────────────────
# FEED-FORWARD NETWORK (with GELU activation)
# ─────────────────────────────────────────────────────────────
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, 4 * cfg.d_model),
            nn.GELU(),
            nn.Linear(4 * cfg.d_model, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )
    def forward(self, x):
        return self.net(x)
# ─────────────────────────────────────────────────────────────
# TRANSFORMER BLOCK = Attention + FFN + Residuals + LayerNorm
# ─────────────────────────────────────────────────────────────
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.d_model)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.d_model)
        self.ffn = FeedForward(cfg)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # Pre-LN residual
        x = x + self.ffn(self.ln2(x))   # Pre-LN residual
        return x
# ─────────────────────────────────────────────────────────────
# FULL GPT MODEL
# ─────────────────────────────────────────────────────────────
class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.transformer = nn.ModuleDict({
            "tok_emb": nn.Embedding(cfg.vocab_size, cfg.d_model),
            "pos_emb": nn.Embedding(cfg.context_len, cfg.d_model),
            "drop": nn.Dropout(cfg.dropout),
            "blocks": nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)]),
            "ln_f": nn.LayerNorm(cfg.d_model),
        })
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)
        # Weight tying: share token embeddings with lm_head
        self.transformer["tok_emb"].weight = self.lm_head.weight
        # Initialise weights (GPT-2 style)
        self.apply(self._init_weights)
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.cfg.context_len
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x = self.transformer["drop"](
            self.transformer["tok_emb"](idx) +
            self.transformer["pos_emb"](pos)
        )
        for block in self.transformer["blocks"]:
            x = block(x)
        x = self.transformer["ln_f"](x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.cfg.vocab_size),
                                   targets.view(-1))
        return logits, loss
# ── Count parameters ──────────────────────────────────────────
cfg = GPTConfig()
model = GPT(cfg)
n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params/1e6:.2f}M")
We share weights between the token embedding matrix and the final LM head. This trick (used in GPT-2) removes a second 32768 × 512 matrix, about 16.8M parameters, and usually does not hurt performance, because both layers are learning the same token semantics from opposite ends.
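The effect of weight tying on model size can be checked by hand. The sketch below counts parameters for the demo configuration in plain Python, assuming the layer shapes used in this tutorial's model code (bias-free attention projections, a 4× feed-forward expansion with biases, and two LayerNorms per block). The function name `count_gpt_params` is illustrative.

```python
def count_gpt_params(vocab_size, context_len, d_model, n_layers, tied=True):
    """Hand count of trainable parameters for the GPT model in this tutorial."""
    tok_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    # Attention: fused QKV projection plus output projection, no biases.
    attn = d_model * 3 * d_model + d_model * d_model
    # Feed-forward: two linear layers with biases.
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Each LayerNorm has a weight and a bias vector; two per block.
    norms = 2 * (2 * d_model)
    block = attn + ffn + norms
    final_norm = 2 * d_model
    # With weight tying the LM head reuses the token embedding matrix.
    lm_head = 0 if tied else vocab_size * d_model
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head

tied = count_gpt_params(32768, 512, 512, 6, tied=True)
untied = count_gpt_params(32768, 512, 512, 6, tied=False)
print(f"Tied:   {tied / 1e6:.2f}M")
print(f"Untied: {untied / 1e6:.2f}M")
print(f"Saved:  {(untied - tied) / 1e6:.2f}M")
```

With these shapes the tied model comes to roughly 36M parameters, of which almost half sit in the embedding table. That is typical for small models with a large vocabulary.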
Step 4 — The Data Loader
Before training, we need a fast data pipeline. We memory-map the training file and sample random context windows of length context_len. Each window becomes one training example: input = tokens[0:T], target = tokens[1:T+1].
At each position t, the model must predict token t+1. So if input is [The, cat, sat], targets are [cat, sat, on]. This one-position shift is the entire training objective — called causal language modelling.
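The one-position shift is small enough to show with a plain Python list. This sketch uses words instead of integer IDs for readability; the helper name `make_example` is an illustrative assumption.

```python
def make_example(tokens, start, context_len):
    """Return (inputs, targets) for one window; targets are inputs shifted by one."""
    window = tokens[start : start + context_len + 1]
    return window[:-1], window[1:]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
x, y = make_example(tokens, start=0, context_len=3)
print("input :", x)
print("target:", y)

# Every position is its own prediction task: the context grows by one
# token at a time and the target is always the next token.
for t in range(len(x)):
    print(f"context={x[: t + 1]} -> predict {y[t]!r}")
```

A single window of length T therefore yields T training signals, one per position, which is part of why causal language modelling is data-efficient per sequence.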
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
from tokenizers import Tokenizer
class TextDataset(Dataset):
    def __init__(self, token_file: str, context_len: int):
        # Memory-map the binary token file (fast, no RAM copy)
        self.data = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.context_len = context_len
    def __len__(self):
        # One sample per start position, so consecutive windows overlap heavily
        return len(self.data) - self.context_len
    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.context_len + 1]
        x = torch.from_numpy(chunk[:-1].astype(np.int64))
        y = torch.from_numpy(chunk[1:].astype(np.int64))
        return x, y
# ── Tokenise and save as binary (done once) ───────────────────
def encode_to_binary(txt_path: str, out_path: str, tokenizer_path: str):
    tok = Tokenizer.from_file(tokenizer_path)
    # uint16 can only hold IDs up to 65,535
    assert tok.get_vocab_size() <= 2**16, "vocabulary too large for uint16"
    eos_id = tok.token_to_id("[EOS]")
    with open(txt_path, "r", encoding="utf-8") as f:
        text = f.read()
    ids = []
    # Articles are separated by blank lines in train.txt; mark each boundary
    # with [EOS] so the model can learn where a document ends
    for doc in text.split("\n\n"):
        if doc.strip():
            ids.extend(tok.encode(doc).ids)
            ids.append(eos_id)
    arr = np.array(ids, dtype=np.uint16)
    arr.tofile(out_path)
    print(f"Saved {len(arr):,} tokens to {out_path}")
# Run once: encode_to_binary("train.txt", "train.bin", "my_tokenizer.json")
# ── Build the data loader ─────────────────────────────────────
train_ds = TextDataset("train.bin", context_len=512)
train_dl = DataLoader(
train_ds,
batch_size=32,
shuffle=True,
num_workers=4,
pin_memory=True, # faster GPU transfer
)
print(f"Batches per epoch: {len(train_dl):,}")
Step 5 — The Training Loop
The training loop is the heart of everything. At each step: forward pass → compute loss → backward pass → update weights. We add gradient clipping, a learning rate scheduler, and periodic checkpointing.
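The scheduler in this tutorial's training loop warms up linearly and then follows a cosine decay to 10% of the peak learning rate. The sketch below isolates that curve in plain Python so you can inspect it before training. The function name `lr_multiplier` and the 10,000-step horizon are illustrative assumptions.

```python
import math

def lr_multiplier(step, warmup, total_steps, floor=0.1):
    """Linear warm-up to 1.0, then cosine decay down to `floor`."""
    if step < warmup:
        return step / warmup
    # Clamp progress so steps past the horizon stay at the floor.
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return floor + 0.5 * (1 - floor) * (1 + math.cos(math.pi * progress))

peak_lr = 3e-4
for step in (0, 250, 500, 5_250, 10_000):
    lr = peak_lr * lr_multiplier(step, warmup=500, total_steps=10_000)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The multiplier is what `LambdaLR` expects: it is applied to the optimiser's base learning rate, so the peak value itself is set once in the optimiser.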
What is Cross-Entropy Loss?
import math
import torch
from torch.optim import AdamW
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # FP16 mixed precision is only enabled on a GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16
model = GPT(GPTConfig()).to(device)
# ── Optimiser: AdamW with decoupled weight decay ──────────────
optimizer = AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8,
)
# ── Cosine LR schedule with warm-up ──────────────────────────
n_epochs = 3
total_steps = n_epochs * len(train_dl)
warmup = 500
def lr_lambda(step):
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    # Cosine decay from 1.0 down to 0.1 of the peak learning rate
    return 0.1 + 0.5 * (1 - 0.1) * (1 + math.cos(math.pi * progress))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# ── Mixed precision scaler (FP16 training) ───────────────────
# torch.amp.GradScaler needs PyTorch 2.3+; older releases use torch.cuda.amp.GradScaler()
scaler = torch.amp.GradScaler("cuda", enabled=use_amp)
# ── Training loop ─────────────────────────────────────────────
model.train()
for epoch in range(n_epochs):
    for step, (x, y) in enumerate(train_dl):
        x, y = x.to(device), y.to(device)
        # Forward pass with automatic mixed precision
        with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
            _, loss = model(x, targets=y)
        # Backward pass
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        # Gradient clipping (prevents exploding gradients);
        # gradients must be unscaled first so max_norm applies to their true size
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        # Log every 100 steps
        if step % 100 == 0:
            ppl = math.exp(loss.item())
            lr = scheduler.get_last_lr()[0]
            print(f"Epoch {epoch} | Step {step:5d} | Loss {loss.item():.4f} | PPL {ppl:.1f} | LR {lr:.2e}")
    # Save checkpoint after each epoch
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, f"checkpoint_epoch{epoch}.pt")
Loss starts near ln(vocab_size) ≈ 10.4 (random guessing). It should drop sharply in the first 500 steps as the model learns basic structure, then gradually reduce over training. If loss plateaus above 4.0, check your learning rate. If loss suddenly spikes and stays high, you likely have a gradient explosion — lower the max_norm or reduce the learning rate.
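The starting value of about 10.4 follows directly from the definition of cross-entropy: the loss at one position is the negative log of the probability the model gave to the correct token. A short plain-Python check, with illustrative helper names, makes the link between loss and perplexity concrete.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])

vocab_size = 32768

# An untrained model spreads probability evenly over the vocabulary.
uniform = [1.0 / vocab_size] * vocab_size
loss_random = cross_entropy(uniform, target=123)
ppl_random = math.exp(loss_random)
print(f"random guessing: loss = {loss_random:.3f}, perplexity = {ppl_random:.0f}")

# Perplexity is exp(loss): the number of equally likely choices the model
# is effectively hesitating between at each position.
for loss in (6.0, 4.0, 3.0):
    print(f"loss {loss:.1f} -> perplexity {math.exp(loss):.1f}")
```

Read this way, a drop from 10.4 to 4.0 means the model has gone from choosing among all 32,768 tokens to hesitating between roughly 55 candidates.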
Hyperparameter Reference Table
These are the key knobs. Understanding what each one does prevents hours of debugging.
| Hyperparameter | Our Demo Value | GPT-2 Small | Effect of Increasing | Typical Range |
|---|---|---|---|---|
| d_model | 512 | 768 | More capacity, more compute | 256 – 8192 |
| n_layers | 6 | 12 | Deeper reasoning, slower training | 4 – 96 |
| n_heads | 8 | 12 | More attention patterns (limited gain) | 4 – 64 |
| context_len | 512 | 1024 | Attention memory grows quadratically (O(T²)) | 512 – 128k |
| learning_rate | 3e-4 | 2.5e-4 | Unstable training above 1e-3 | 1e-5 – 1e-3 |
| batch_size | 32 | 512 | Smoother gradients, needs LR scaling | 16 – 4096 |
| dropout | 0.1 | 0.1 | Reduces overfitting (set to 0 at inference) | 0.0 – 0.3 |
Step 6 — Text Generation at Inference Time
Once trained, we generate text by feeding a prompt and sampling tokens one at a time. The quality of generation depends heavily on your sampling strategy.
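Before the full PyTorch version, it can help to see the three sampling controls on a five-token toy distribution. This is a minimal plain-Python sketch: the function name `sampling_distribution` is illustrative, and top-p is implemented as "keep the smallest set of top tokens whose mass reaches top_p", which is one common convention.

```python
import math
import random

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(probs[i] for i in ranked)
    # Top-p: walk down the ranking until the kept mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        if mass >= top_p:
            break
        kept.append(i)
        mass += probs[i] / total
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

logits = [2.0, 1.0, 0.5, 0.0, -1.0]
dist = sampling_distribution(logits, temperature=1.0, top_k=3, top_p=0.8)
print(dist)

random.seed(0)
token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print("sampled token id:", token)
```

With these settings only the two most likely tokens survive, so sampling can never pick the unlikely tail. That trade of diversity for coherence is what the three parameters tune.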
import torch
import torch.nn.functional as F
from tokenizers import Tokenizer
def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 200,
    temperature: float = 0.8,
    top_k: int = 50,
    top_p: float = 0.9,
    device: str = "cuda",
):
    model.eval()
    eos_id = tokenizer.token_to_id("[EOS]")  # looked up once, not on every step
    tokens = tokenizer.encode(prompt).ids
    x = torch.tensor([tokens], dtype=torch.long, device=device)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Crop context if needed
            x_ctx = x[:, -model.cfg.context_len:]
            logits, _ = model(x_ctx)
            logits = logits[:, -1, :]  # last position only
            # Temperature scaling (temperature must be greater than 0)
            logits = logits / temperature
            # Top-k filtering: mask everything below the k-th largest logit
            if top_k > 0:
                vals, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < vals[:, [-1]]] = float("-inf")
            # Top-p (nucleus) filtering: drop a token once the probability
            # mass of the tokens ranked before it already exceeds top_p
            if top_p < 1.0:
                sorted_logits, sorted_idx = torch.sort(logits, descending=True)
                sorted_probs = F.softmax(sorted_logits, dim=-1)
                cum_probs = torch.cumsum(sorted_probs, dim=-1)
                remove = cum_probs - sorted_probs > top_p
                sorted_logits[remove] = float("-inf")
                logits.scatter_(1, sorted_idx, sorted_logits)
            # Sample next token
            probs = F.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_tok], dim=1)
            # Stop on EOS token
            if eos_id is not None and next_tok.item() == eos_id:
                break
    generated_ids = x[0].tolist()
    return tokenizer.decode(generated_ids)
# ── Load checkpoint and generate ──────────────────────────────
ckpt = torch.load("checkpoint_epoch2.pt", map_location=device)
model.load_state_dict(ckpt["model_state"])
tok = Tokenizer.from_file("my_tokenizer.json")
output = generate(model, tok, prompt="The history of artificial intelligence", device=device)
print(output)
Scaling Up — From Toy to Real Model
You cannot match GPT-4 on a laptop. But you can build something useful. A 100M-parameter model trained on a single RTX 3090 for 3 days can generate surprisingly good text.
| Model Size | Parameters | Training Tokens | GPU Time (A100) | Target Use Case |
|---|---|---|---|---|
| 🐜 Nano | 10–50M | 1–5B | 2–8 hours | Learning, quick experiments |
| 🐦 Small | 100–300M | 15–50B | 1–5 days (1 GPU) | Domain-specific text generation |
| 🦅 Medium | 1–7B | 100–200B | 2–8 weeks (8 GPUs) | Competitive with GPT-2, good at reasoning |
| 🦒 Large | 13–70B | 1–2T | Months (100+ GPUs) | LLaMA-class, general purpose assistant |
| 🌐 Frontier | 100B+ | 10T+ | Years of GPU-time ($100M+) | GPT-4, Claude, Gemini class |
Step 7 — Fine-Tuning with LoRA
Fully fine-tuning a large model means retraining all of its parameters, which is too expensive for most use cases. LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable low-rank matrices to the attention layers. With LoRA, you fine-tune just 0.1–1% of the parameters and often achieve results competitive with full fine-tuning.
Instead of updating the full weight matrix W (shape d×d), decompose the update as ΔW = B × A, where B has shape d×r, A has shape r×d, and r ≪ d. With r=8 on a 512×512 matrix, you train 2 × 512 × 8 = 8,192 parameters instead of 262,144.
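The arithmetic, and the reason the adapter starts out as a no-op, can be checked in plain Python. The sketch follows the naming of this section's `LoRALinear` code, where `lora_A` has shape r×d_in and `lora_B` has shape d_out×r. The helper functions and the tiny 4×4 example are illustrative assumptions.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d, r = 512, 8
full = d * d
lora = lora_param_count(d, d, r)
print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({100 * lora / full:.1f}% of full)")

# Tiny 4x4 example with rank 2: B starts at zero, A starts random-ish,
# so the product B @ A (the weight update) is exactly zero at step 0.
B = [[0.0, 0.0] for _ in range(4)]                      # d_out x r
A = [[0.1, -0.2, 0.3, 0.0], [0.05, 0.0, -0.1, 0.2]]     # r x d_in
delta_W = matmul(B, A)                                  # d_out x d_in
print(delta_W)
```

Because the update is zero at initialisation, wrapping a layer with LoRA does not change the model's outputs until training begins, so fine-tuning starts from exactly the pretrained behaviour.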
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
    """Wraps an existing nn.Linear with a LoRA adapter."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        self.rank = rank
        self.scale = alpha / rank
        d_in, d_out = linear.in_features, linear.out_features
        # Create the adapter on the same device and dtype as the wrapped layer
        factory = {"device": linear.weight.device, "dtype": linear.weight.dtype}
        # LoRA matrices: A is random, B is zero (so ΔW = B @ A starts at 0)
        self.lora_A = nn.Parameter(torch.randn(rank, d_in, **factory) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank, **factory))
        # Freeze the original weights
        for p in self.linear.parameters():
            p.requires_grad = False
    def forward(self, x):
        base = self.linear(x)                        # original (frozen)
        delta = (x @ self.lora_A.T) @ self.lora_B.T  # LoRA adapter
        return base + self.scale * delta
def inject_lora(model, rank=8, alpha=16):
    """Freeze the base model and wrap each fused QKV projection with LoRA."""
    # Freeze everything first; only the adapters created below stay trainable
    for p in model.parameters():
        p.requires_grad = False
    for module in list(model.modules()):
        if isinstance(module, CausalSelfAttention):
            # The projection is fused, so the adapter covers Q, K and V together
            module.qkv_proj = LoRALinear(module.qkv_proj, rank, alpha)
    return model
# ── Apply LoRA ────────────────────────────────────────────────
model = inject_lora(model, rank=8, alpha=16)
# Count trainable vs frozen params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} ({100*trainable/total:.2f}% of total)")
Golden Rules — Things That Will Save You Days
Use mixed precision: wrap the forward pass in torch.cuda.amp.autocast() and scale the loss with GradScaler.
Clip gradients with max_norm=1.0. If loss spikes, lower it to 0.5. Monitor the global gradient norm in your logs; a norm consistently above 10 is a red flag.
Never load the full corpus into RAM: use numpy.memmap or HuggingFace datasets in streaming mode. A 1B-token dataset needs 2 GB as uint16, which is fine on disk but fatal in RAM alongside the model.
Checkpoint both model.state_dict() and optimizer.state_dict(). Without the optimiser state, resuming training will destabilise the loss curve.
Tie the input and output embeddings: lm_head.weight = tok_emb.weight.
Call model.eval() before any text generation. Forgetting this leaves dropout active and produces inconsistent, noisy outputs, a subtle bug that's hard to spot.
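To make the gradient-clipping rule concrete, here is a plain-Python sketch of clipping by global norm. It mirrors the idea behind `torch.nn.utils.clip_grad_norm_` but works on nested lists. The function names and toy gradients are illustrative assumptions, not the library's implementation.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every tensor as one long vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor so the global norm is at most max_norm."""
    norm = global_grad_norm(grads)
    # Gradients already inside the limit are left untouched (scale = 1).
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], norm

# Two toy "parameter tensors"; the global norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

print(f"norm before clipping: {norm_before:.2f}")
print(f"norm after clipping:  {global_grad_norm(clipped):.4f}")
print("clipped gradients:", clipped)
```

Every gradient is scaled by the same factor, so the update direction is preserved and only its length shrinks. The norm before clipping is the number worth logging, since that is the value the "above 10" warning refers to.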
From-Scratch vs Fine-Tuning vs Prompt Engineering
| Approach | From Scratch | Fine-Tuning (Full) | LoRA / PEFT | Prompt Engineering |
|---|---|---|---|---|
| Control | Total — you own everything | High | Moderate | Low |
| Cost | Very High ($10k–$1M+) | High ($100–$10k) | Low ($5–$100) | Near zero |
| Data needed | Billions of tokens | 10k–1M examples | 500–50k examples | 0–10 examples |
| Time to first result | Days to months | Hours to days | Minutes to hours | Seconds |
| Best for | Proprietary domain, research | Domain specialisation | Task adaptation, tight budgets | Prototyping, off-the-shelf models |
Start with prompt engineering — it's free and might be enough. If you need customisation, use LoRA fine-tuning on an existing open model (LLaMA 3, Mistral). Only build from scratch if your data is proprietary, your domain is highly specialised, or you need complete control over the model weights for regulatory reasons.
Common Errors & How to Fix Them
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss immediately NaN | Learning rate too high, bad data (inf/nan values) | Lower LR to 1e-4; run torch.isnan(loss).any() after each step; inspect your dataset |
| Loss stuck at ~10.4 | Model outputting near-uniform distribution (untrained) | Check weight initialisation; verify gradient flow; ensure LR warms up correctly |
| Loss drops then spikes | Exploding gradients, bad batch in data | Clip gradients; lower max_norm; scan training file for corrupt entries |
| CUDA out of memory | Batch too large, sequence too long, no gradient checkpointing | Halve batch_size; reduce context_len; enable torch.utils.checkpoint |
| Model generates gibberish | Undertrained, temperature too high, context window exceeded | Train more; lower temperature to 0.6; crop input to context_len |
| Training works but eval is random | Forgot to call model.eval() | Always call model.eval() and torch.no_grad() before inference |