
Building Your Own Large Language Model

A hands-on, step-by-step tutorial for building a GPT-style language model from scratch using Python and PyTorch. Covers data collection and cleaning, BPE tokeniser

Section 01

The Story That Explains Building an LLM

Teaching a Child to Read — Then to Write
Imagine you hand a child every book ever written. You don't explain grammar. You don't teach vocabulary lists. You simply say: "Read all of this — and then predict what word comes next in any sentence."

At first the child guesses wildly. Cat follows "The dog chased the…"? Probably not. But after reading ten million sentences, the child builds deep intuitions. They know that "The stock market…" probably continues with fell, rose, or closed — never with barked.

That is precisely how a Large Language Model is trained. It reads an enormous slice of human text. At each position, it predicts the next token. Every wrong prediction fires a correction signal through the network. After billions of corrections, the model does not just predict text — it understands language well enough to reason, translate, summarise, and code.

In this tutorial you will build a working LLM from scratch using Python. We start from raw text, write a tokeniser, implement a Transformer architecture, train it on real data, and finish with a model that can generate coherent language. Every line of code is explained. Every concept has a story.

💡
What This Tutorial Assumes

You have already completed the LLM Basics tutorial — you know what tokens, embeddings, attention, and the Transformer architecture are conceptually. Here we focus entirely on building, training, and running one using Python, PyTorch, and real data.


Section 02

The LLM Creation Pipeline — The Big Picture

Building an LLM is a six-stage journey. Each stage feeds the next. Skip one and the whole system breaks down.

01
📄 Data Collection & Cleaning
Gather raw text — books, code, web pages, Wikipedia. Clean it: remove HTML tags, duplicates, toxic content, and non-UTF-8 garbage. Quality beats quantity at this stage.
02
🔠 Tokenisation
Convert raw text into integer token IDs. Train a Byte-Pair Encoding (BPE) tokeniser on your corpus. This builds a vocabulary of 32k–100k subword tokens.
03
🧠 Model Architecture
Define the Transformer: token embeddings, positional encodings, multi-head attention, feed-forward layers, layer norms, and a final language model head.
04
🔥 Pre-training
Train on next-token prediction (causal language modelling). This is where 90% of compute goes. For a small model: 1–3 days on a single GPU.
05
🎯 Fine-tuning (Optional)
Take your pre-trained model and train it on a task-specific dataset (instructions, Q&A pairs) to align it with desired behaviour. RLHF/LoRA techniques live here.
06
🚀 Inference & Serving
Load your trained weights, pass a prompt through the model, and decode token-by-token using sampling strategies (greedy, top-k, nucleus). Serve via a REST API.

Section 03

Setting Up Your Environment

Every professional LLM experiment lives in an isolated environment. Here is how to set yours up in under five minutes.

⚙️ Required Stack — Python 3.10+
Core
PyTorch 2.x — The neural network engine. GPU acceleration via CUDA.
Tokenise
tokenizers — Hugging Face library for training BPE tokenisers. (tiktoken, from OpenAI, only loads pre-built encodings and is not needed for this tutorial.)
Data
datasets — HuggingFace datasets for streaming large corpora without RAM limits.
Track
wandb / tensorboard — Experiment tracking. Log loss curves, learning rates, gradients.
Speed
flash-attn — 2–4× faster attention with 10× less memory. Install only on GPU machines.
# Create and activate a virtual environment
python -m venv llm_env
source llm_env/bin/activate          # Linux / macOS
# llm_env\Scripts\activate           # Windows

# Install the full stack
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tokenizers datasets transformers wandb tqdm numpy
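# Verify GPU is available (run this part in Python, not the shell)
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():        # get_device_name raises if no GPU is present
    print("GPU:", torch.cuda.get_device_name(0))
print("PyTorch version:", torch.__version__)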
OUTPUT
CUDA available: True
GPU: NVIDIA GeForce RTX 3090
PyTorch version: 2.2.1+cu121

Section 04

Step 1 — Data Collection & Cleaning

The Chef's Secret: Garbage In, Garbage Out
A world-class chef can follow a perfect recipe, but if the ingredients are rotten, the dish will be inedible. LLMs work the same way. The Common Crawl portion of GPT-3's training data was filtered from roughly 45 TB of compressed text down to 570 GB, so the filtering step discarded far more data than it kept. The quality of your training data often shapes the quality of your model's language more than architecture tweaks or extra training time.

What Makes Good Training Data?

📚
Diversity
Many Domains
Mix books, code, scientific papers, news, and web text. A model trained only on Wikipedia will fail at poetry. A model trained only on code won't understand human emotion.
Quality
Signal over Noise
Prioritise well-written, coherent text. Web scrapes contain spam, SEO filler, and gibberish. Filter aggressively. A smaller, cleaner dataset beats a huge dirty one every time.
📋
Volume
Tokens Matter
Chinchilla scaling laws suggest you need roughly 20 tokens per model parameter for optimal compute. A 1B parameter model needs ~20B tokens. Start small: 1M tokens for experiments.
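As a rough worked example of the 20-tokens-per-parameter rule of thumb, the sketch below turns a parameter count into an approximate token budget. The function name `chinchilla_tokens` and the constant are illustrative only; treat the ratio as an approximation, not an exact law.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted above; an approximation

def chinchilla_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Approximate compute-optimal number of training tokens for n_params parameters."""
    return n_params * tokens_per_param

for n_params in (50_000_000, 1_000_000_000, 7_000_000_000):
    tokens = chinchilla_tokens(n_params)
    print(f"{n_params / 1e6:>8,.0f}M params -> ~{tokens / 1e9:,.0f}B tokens")
```

For the 1B-parameter case this reproduces the ~20B tokens mentioned above; smaller experimental runs will sit well below the budget and should be read as under-trained by this measure.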

Below is the full data pipeline — from raw download to clean training-ready text:

from datasets import load_dataset
import re, unicodedata

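# ── 1. Load a public dataset (streaming to save RAM) ─────────
# Note: recent `datasets` releases no longer run loader scripts; there the same
# corpus is hosted as "wikimedia/wikipedia" (for example config "20231101.en").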
dataset = load_dataset(
    "wikipedia",
    "20220301.en",
    split="train",
    streaming=True
)

# ── 2. Text cleaning function ─────────────────────────────────
def clean_text(text: str) -> str:
    # Normalise unicode to NFKC (e.g. the ligature ﬁ → fi, full-width Ａ → A; accents such as é are kept)
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs
    text = re.sub(r"https?://\S+", "", text)
    # Remove excessive whitespace
    text = re.sub(r"\s{3,}", "\n\n", text)
    # Remove very short lines (likely navigation noise)
    lines = [l for l in text.splitlines() if len(l.split()) > 5]
    return "\n".join(lines).strip()

# ── 3. Stream, clean, and write to disk ──────────────────────
with open("train.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        clean = clean_text(example["text"])
        if len(clean) > 200:          # Skip tiny stubs
            f.write(clean + "\n\n")
        if i >= 50_000: break      # 50k articles for demo

print("Done. Dataset written to train.txt")
OUTPUT
Done. Dataset written to train.txt
(File size: ~1.4 GB | ~380M tokens, estimated)
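The pipeline overview lists duplicate removal as part of cleaning, but the script above does not do it. A minimal sketch of exact-duplicate filtering is shown here, assuming documents fit through a streaming generator; `dedupe_exact` is a hypothetical helper, and near-duplicate detection (for example MinHash) is a separate, heavier step not covered.

```python
import hashlib
from typing import Iterable, Iterator

def dedupe_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document once, skipping exact duplicates after light normalisation."""
    seen = set()
    for doc in docs:
        # Collapse whitespace and case so trivial variations still count as duplicates
        key = " ".join(doc.lower().split())
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc

docs = ["The cat sat.", "the  cat sat.", "A different article."]
unique = list(dedupe_exact(docs))
print(unique)  # ['The cat sat.', 'A different article.']
```

Storing only the digests keeps memory proportional to the number of documents rather than their size, which is why it can sit inside the streaming loop.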

Section 05

Step 2 — Building the BPE Tokeniser

A tokeniser converts raw text into integer IDs the model can process. We train a Byte-Pair Encoding (BPE) tokeniser — the same algorithm used by GPT-2, GPT-3, and LLaMA. BPE starts with individual characters and iteratively merges the most frequent pairs, building a vocabulary of subword units.

Merging the Most Popular Pairs
Start with the word "unbelievable" split into characters: u · n · b · e · l · i · e · v · a · b · l · e.

Count all adjacent pairs across your whole corpus. If "a b" is the most common pair, merge it into a single token "ab". Repeat until your vocabulary reaches the target size (e.g. 32,768). The result: common words become single tokens ("the" → 1 token), rare words are split into known subwords ("unbelievable" → un + believ + able).
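To make the merge loop concrete, here is a toy sketch of BPE training on a four-word corpus. It is a simplified illustration, not the `tokenizers` implementation: the helper names are made up, ties are broken by first occurrence, and real trainers work on bytes and far larger counts.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starts as a tuple of characters
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):                 # three merges instead of ~32k
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    merges.append(pair)

print(merges)
print(list(words))
```

After three merges the frequent suffix "est" has become a single symbol, which is the same mechanism that turns common words into single tokens at full scale.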
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# ── Initialise a blank BPE tokeniser ─────────────────────────
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

# ── Define special tokens ─────────────────────────────────────
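trainer = BpeTrainer(
    vocab_size=32768,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
    initial_alphabet=ByteLevel.alphabet(),   # seed all 256 byte symbols so no input is out-of-vocabulary
    show_progress=True
)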

# ── Train on our cleaned corpus ───────────────────────────────
tokenizer.train(files=["train.txt"], trainer=trainer)

# ── Save for reuse ────────────────────────────────────────────
tokenizer.save("my_tokenizer.json")
print(f"Vocab size: {tokenizer.get_vocab_size()}")

# ── Quick test ────────────────────────────────────────────────
sample = "Large language models learn from text."
encoded = tokenizer.encode(sample)
print("Tokens:", encoded.tokens)
print("IDs:   ", encoded.ids)
OUTPUT (example; the exact IDs depend on the vocabulary you trained)
Vocab size: 32768
Tokens: ['Large', 'Ġlanguage', 'Ġmodels', 'Ġlearn', 'Ġfrom', 'Ġtext', '.']
IDs:    [12045, 3303, 4981, 3537, 422, 2420, 13]
🔑
The Ġ Prefix Explained

The Ġ symbol (a byte-level encoding of a space) marks the beginning of a new word. This is GPT-style byte-level BPE — it operates on raw bytes rather than Unicode characters, making it fully language-agnostic and immune to out-of-vocabulary words.
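Where the Ġ comes from can be shown with a short sketch of the GPT-2-style byte-to-character table, which byte-level BPE tokenisers are generally understood to follow. The function below is a re-implementation written for illustration, not a call into the `tokenizers` library.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable character (GPT-2-style scheme)."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (space, control characters, ...) are shifted to code point 256 and up
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

table = bytes_to_unicode()
print(repr(table[ord(" ")]))                                  # 'Ġ'
print("".join(table[b] for b in " café".encode("utf-8")))     # ĠcafÃ©
```

The space byte (0x20) is the 33rd non-printable byte, so it lands on code point 288, which is Ġ. Because every byte has a symbol, any UTF-8 string can be represented, including the two-byte "é".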


Section 06

Step 3 — The Transformer Architecture in Code

We now implement the decoder-only Transformer used by GPT-style models. It has six key components.

📋
Token Embedding
Maps each integer token ID to a dense vector of size d_model. These vectors are learned during training.
nn.Embedding(vocab_size, d_model)
📍
Positional Encoding
Injects information about where each token sits in the sequence. Without this, "dog bites man" and "man bites dog" look identical to the model.
Learned or Sinusoidal
👀
Multi-Head Attention
Lets every token attend to every earlier token. The "multi-head" part runs attention in parallel across multiple subspaces, catching different relationship types.
Causal (masked) attention
▶️
Feed-Forward Network
A two-layer MLP applied independently to each token position. Typically 4× the model dimension. This is where most factual knowledge is believed to be stored.
d_model → 4*d_model → d_model
📒
Layer Normalisation
Stabilises training by normalising activations. Modern models use Pre-LN (before attention/FFN) rather than Post-LN for better gradient flow.
nn.LayerNorm(d_model)
🎯
LM Head
A final linear layer that projects from d_model back to vocab_size. The softmax of this output gives a probability over the entire vocabulary for the next token.
nn.Linear(d_model, vocab_size)
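The positional-encoding card mentions a sinusoidal option. The model in this tutorial uses learned position embeddings, so the following is only a sketch of the fixed alternative, written in plain Python for readability; `sinusoidal_encoding` is an illustrative name.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Fixed positional vector: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(0, d_model, 2):
        # Wavelengths grow geometrically with the dimension index
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

pe0 = sinusoidal_encoding(0, 8)
pe1 = sinusoidal_encoding(1, 8)
print([round(v, 3) for v in pe0])
print([round(v, 3) for v in pe1])
```

Each position gets a distinct vector without any trainable parameters, which is the property that lets the model tell "dog bites man" from "man bites dog".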

Full Model Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ─────────────────────────────────────────────────────────────
# CONFIG: small model for experimentation (~36M parameters with tied embeddings)
# ─────────────────────────────────────────────────────────────
class GPTConfig:
    vocab_size: int  = 32768
    context_len: int = 512      # sequence length
    d_model: int     = 512      # embedding dimension
    n_heads: int     = 8        # attention heads
    n_layers: int    = 6        # transformer blocks
    dropout: float  = 0.1

# ─────────────────────────────────────────────────────────────
# MULTI-HEAD CAUSAL SELF-ATTENTION
# ─────────────────────────────────────────────────────────────
class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg.d_model % cfg.n_heads == 0
        self.n_heads  = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.d_model  = cfg.d_model

        # Fused Q, K, V projection
        self.qkv_proj = nn.Linear(cfg.d_model, 3 * cfg.d_model, bias=False)
        self.out_proj  = nn.Linear(cfg.d_model, cfg.d_model, bias=False)
        self.attn_drop = nn.Dropout(cfg.dropout)
        self.res_drop  = nn.Dropout(cfg.dropout)

        # Causal mask: lower-triangular ones; the zero entries (future positions) are filled with -inf in forward()
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(cfg.context_len, cfg.context_len))
                  .view(1, 1, cfg.context_len, cfg.context_len)
        )

    def forward(self, x):
        B, T, C = x.shape      # batch, sequence length, d_model

        # Compute Q, K, V in one shot then split
        q, k, v = self.qkv_proj(x).split(self.d_model, dim=2)

        # Reshape for multi-head: (B, n_heads, T, head_dim)
        def reshape(t):
            return t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = reshape(q), reshape(k), reshape(v)

        # Scaled dot-product attention
        scale = math.sqrt(self.head_dim)
        attn  = (q @ k.transpose(-2, -1)) / scale
        attn  = attn.masked_fill(self.mask[:,:,:T,:T] == 0, float("-inf"))
        attn  = F.softmax(attn, dim=-1)
        attn  = self.attn_drop(attn)

        # Weighted sum of values, merge heads
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.res_drop(self.out_proj(out))

# ─────────────────────────────────────────────────────────────
# FEED-FORWARD NETWORK (with GELU activation)
# ─────────────────────────────────────────────────────────────
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, 4 * cfg.d_model),
            nn.GELU(),
            nn.Linear(4 * cfg.d_model, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )
    def forward(self, x): return self.net(x)

# ─────────────────────────────────────────────────────────────
# TRANSFORMER BLOCK = Attention + FFN + Residuals + LayerNorm
# ─────────────────────────────────────────────────────────────
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1  = nn.LayerNorm(cfg.d_model)
        self.attn = CausalSelfAttention(cfg)
        self.ln2  = nn.LayerNorm(cfg.d_model)
        self.ffn  = FeedForward(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # Pre-LN residual
        x = x + self.ffn(self.ln2(x))    # Pre-LN residual
        return x

# ─────────────────────────────────────────────────────────────
# FULL GPT MODEL
# ─────────────────────────────────────────────────────────────
class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.transformer = nn.ModuleDict({
            "tok_emb": nn.Embedding(cfg.vocab_size, cfg.d_model),
            "pos_emb": nn.Embedding(cfg.context_len, cfg.d_model),
            "drop"   : nn.Dropout(cfg.dropout),
            "blocks" : nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)]),
            "ln_f"   : nn.LayerNorm(cfg.d_model),
        })
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

        # Weight tying: share token embeddings with lm_head
        self.transformer["tok_emb"].weight = self.lm_head.weight

        # Initialise weights (GPT-2 style)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.cfg.context_len

        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x   = self.transformer["drop"](
                  self.transformer["tok_emb"](idx) +
                  self.transformer["pos_emb"](pos)
              )
        for block in self.transformer["blocks"]:
            x = block(x)
        x = self.transformer["ln_f"](x)
        logits = self.lm_head(x)           # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.cfg.vocab_size),
                                   targets.view(-1))
        return logits, loss

# ── Count parameters ──────────────────────────────────────────
cfg   = GPTConfig()
model = GPT(cfg)
n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params/1e6:.2f}M")
OUTPUT
Model parameters: 35.94M
📚
Weight Tying — Free Accuracy

We share weights between the token embedding matrix and the final LM head. This trick (used in GPT-2) reduces parameters by about 16.8M (32,768 × 512) without hurting performance, because both layers are learning the same token semantics from opposite ends.
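As a sanity check on the model size, the sketch below tallies the parameters of the GPT class by hand, assuming the layer shapes defined in the code above (bias-free attention projections, biased feed-forward layers and LayerNorms). `count_gpt_params` is an illustrative helper, not part of the model.

```python
def count_gpt_params(vocab_size=32768, context_len=512, d_model=512, n_layers=6, tied=True):
    """Hand count of parameters for the GPT class defined in this tutorial."""
    tok_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    attn = d_model * 3 * d_model + d_model * d_model          # qkv_proj + out_proj, no bias
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                           # weight + bias, two per block
    block = attn + ffn + layer_norms
    final_ln = 2 * d_model
    lm_head = 0 if tied else vocab_size * d_model             # shared with tok_emb when tied
    return tok_emb + pos_emb + n_layers * block + final_ln + lm_head

tied = count_gpt_params(tied=True)
untied = count_gpt_params(tied=False)
print(f"tied:   {tied / 1e6:.2f}M")
print(f"untied: {untied / 1e6:.2f}M")
print(f"saved:  {(untied - tied) / 1e6:.2f}M")
```

Nearly half of the tied total sits in the embedding table, which is typical for small models with a large vocabulary.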


Section 07

Step 4 — The Data Loader

Before training, we need a fast data pipeline. We memory-map the training file and sample random context windows of length context_len. Each window becomes one training example: input = tokens[0:T], target = tokens[1:T+1].

⚠️
Why targets are shifted by 1

At each position t, the model must predict token t+1. So if input is [The, cat, sat], targets are [cat, sat, on]. This one-position shift is the entire training objective — called causal language modelling.
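The one-position shift can be seen on a toy sequence. This sketch uses plain Python lists and words in place of token IDs; `make_example` is a hypothetical helper that mirrors the slicing described above.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
context_len = 3

def make_example(seq, start, context_len):
    """Slice context_len + 1 tokens, then split into input and shifted target."""
    chunk = seq[start : start + context_len + 1]
    return chunk[:-1], chunk[1:]

x, y = make_example(tokens, 0, context_len)
print(x)  # ['The', 'cat', 'sat']
print(y)  # ['cat', 'sat', 'on']
```

Every position in `x` is paired with the token that follows it, so one window of length T yields T training signals at once.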

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
from tokenizers import Tokenizer

class TextDataset(Dataset):
    def __init__(self, token_file: str, context_len: int):
        # Memory-map the binary token file (fast, no RAM copy)
        self.data = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.context_len = context_len

    def __len__(self):
        return len(self.data) - self.context_len

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.context_len + 1]
        x = torch.tensor(chunk[:-1].astype(np.int64))
        y = torch.tensor(chunk[1:].astype(np.int64))
        return x, y


# ── Tokenise and save as binary (done once) ───────────────────
def encode_to_binary(txt_path: str, out_path: str, tokenizer_path: str):
    tok = Tokenizer.from_file(tokenizer_path)
    with open(txt_path, "r", encoding="utf-8") as f:
        text = f.read()
    ids = tok.encode(text).ids
    arr = np.array(ids, dtype=np.uint16)
    arr.tofile(out_path)
    print(f"Saved {len(arr):,} tokens to {out_path}")

# Run once: encode_to_binary("train.txt", "train.bin", "my_tokenizer.json")

# ── Build the data loader ─────────────────────────────────────
train_ds = TextDataset("train.bin", context_len=512)
train_dl = DataLoader(
    train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,         # faster GPU transfer
)
print(f"Batches per epoch: {len(train_dl):,}")

Section 08

Step 5 — The Training Loop

The training loop is the heart of everything. At each step: forward pass → compute loss → backward pass → update weights. We add gradient clipping, a learning rate scheduler, and periodic checkpointing.

What is Cross-Entropy Loss?

Loss Formula
L = -log P(correct token)
For each position, the loss is the negative log probability the model assigns to the true next token. Perfect prediction → loss = 0. Random guessing on 32k vocab → loss ≈ 10.4.
Perplexity
PPL = e^L
A more interpretable metric. PPL = 10 means the model is as confused as if it had 10 equally likely choices at every step. GPT-2 (1.5B) achieves ~18 PPL on WikiText-103.
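Both formulas can be checked numerically. The sketch below, using only the standard library, confirms the "random guessing" baseline for a 32,768-token vocabulary and shows the loss for one confident prediction on a toy 4-token vocabulary; the function names are illustrative.

```python
import math

def cross_entropy(probs, target):
    """Negative log probability assigned to the correct token."""
    return -math.log(probs[target])

def perplexity(mean_loss):
    """Exponentiated average loss."""
    return math.exp(mean_loss)

vocab_size = 32768
uniform_loss = cross_entropy([1.0 / vocab_size] * vocab_size, target=0)
print(f"uniform loss: {uniform_loss:.3f}")                 # ln(32768)
print(f"uniform PPL:  {perplexity(uniform_loss):.0f}")

confident = [0.7, 0.1, 0.1, 0.1]                           # toy 4-token vocabulary
confident_loss = cross_entropy(confident, target=0)
print(f"loss at p=0.7: {confident_loss:.3f}")
```

A uniform model has perplexity equal to the vocabulary size, which is why an untrained network starts near a loss of 10.4 here.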
import torch, math
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = GPT(GPTConfig()).to(device)

# ── Optimiser: AdamW with decoupled weight decay ──────────────
optimizer = AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8
)

# ── Cosine LR schedule with warm-up ──────────────────────────
n_epochs   = 3
total_steps = n_epochs * len(train_dl)
warmup      = 500

def lr_lambda(step):
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.1 + 0.5 * (1 - 0.1) * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# ── Mixed precision scaler (FP16 training) ───────────────────
scaler = torch.cuda.amp.GradScaler()

# ── Training loop ─────────────────────────────────────────────
model.train()
for epoch in range(n_epochs):
    for step, (x, y) in enumerate(train_dl):
        x, y = x.to(device), y.to(device)

        # Forward pass with automatic mixed precision
        with torch.cuda.amp.autocast():
            _, loss = model(x, targets=y)

        # Backward pass
        optimizer.zero_grad()
        scaler.scale(loss).backward()

        # Gradient clipping (prevents exploding gradients)
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        # Log every 100 steps
        if step % 100 == 0:
            ppl = math.exp(loss.item())
            lr  = scheduler.get_last_lr()[0]
            print(f"Epoch {epoch} | Step {step:5d} | Loss {loss.item():.4f} | PPL {ppl:.1f} | LR {lr:.2e}")

    # Save checkpoint after each epoch
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, f"checkpoint_epoch{epoch}.pt")
OUTPUT (example)
Epoch 0 | Step     0 | Loss 10.3912 | PPL 32571.7 | LR 6.00e-07
Epoch 0 | Step   100 | Loss 6.8241 | PPL 919.7 | LR 6.06e-05
Epoch 0 | Step   500 | Loss 4.2183 | PPL 67.9 | LR 3.00e-04
Epoch 0 | Step  2000 | Loss 3.1072 | PPL 22.4 | LR 2.81e-04
Epoch 1 | Step  4000 | Loss 2.7441 | PPL 15.6 | LR 2.14e-04
📈
Reading the Loss Curve — What's Normal

Loss starts near ln(vocab_size) ≈ 10.4 (random guessing). It should drop sharply in the first 500 steps as the model learns basic structure, then gradually reduce over training. If loss plateaus above 4.0, check your learning rate. If loss suddenly spikes and stays high, you likely have a gradient explosion — lower the max_norm or reduce the learning rate.


Section 09

Hyperparameter Reference Table

These are the key knobs. Understanding what each one does prevents hours of debugging.

| Hyperparameter | Our Demo Value | GPT-2 Small | Effect of Increasing | Typical Range |
| --- | --- | --- | --- | --- |
| d_model | 512 | 768 | More capacity, more compute | 256 – 8192 |
| n_layers | 6 | 12 | Deeper reasoning, slower training | 4 – 96 |
| n_heads | 8 | 12 | More attention patterns (limited gain) | 4 – 64 |
| context_len | 512 | 1024 | RAM grows quadratically (O(T²)) | 512 – 128k |
| learning_rate | 3e-4 | 2.5e-4 | Unstable training above 1e-3 | 1e-5 – 1e-3 |
| batch_size | 32 | 512 | Smoother gradients, needs LR scaling | 16 – 4096 |
| dropout | 0.1 | 0.1 | Reduces overfitting (set to 0 at inference) | 0.0 – 0.3 |
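The quadratic cost of context_len is easy to estimate. This sketch computes the size of one layer's attention score matrix of shape (B, n_heads, T, T), as materialised by the plain attention implementation in this tutorial; it assumes 2 bytes per value (FP16) and ignores all other activations. Fused kernels such as FlashAttention avoid storing this matrix.

```python
def attention_scores_bytes(batch_size, n_heads, context_len, bytes_per_value=2):
    """Memory for one layer's (B, n_heads, T, T) attention score matrix."""
    return batch_size * n_heads * context_len * context_len * bytes_per_value

for T in (512, 1024, 2048):
    mib = attention_scores_bytes(32, 8, T) / 2**20
    print(f"context_len={T:5d} -> {mib:6.0f} MiB per layer (FP16 scores)")
```

Doubling the context quadruples this term, which is why halving context_len is often the quickest fix for out-of-memory errors.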

Section 10

Step 6 — Text Generation at Inference Time

Once trained, we generate text by feeding a prompt and sampling tokens one at a time. The quality of generation depends heavily on your sampling strategy.

🎯
Greedy Decoding
argmax at every step
Always pick the highest probability token. Fast and deterministic, but generates repetitive, dull text. Rarely used in production.
⚠ "I love love love love…"
🎲
Top-k Sampling
sample from top k tokens
Keep only the top-k most likely tokens, zero out the rest, then sample. k=50 gives diverse, coherent text. Widely used in GPT-2 and GPT-3.
✓ Diverse yet sensible
📈
Nucleus (Top-p) Sampling
sample from top p probability mass
Keep the smallest set of tokens whose cumulative probability exceeds p (e.g. 0.9). Adapts dynamically — tight when confident, wide when uncertain.
✓ Best for creative text
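Before the full PyTorch version, the three strategies can be compared on a hand-made distribution. This is a plain-Python sketch with made-up probabilities for the continuation of "The stock market…"; the helper names are illustrative.

```python
import random

def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens and renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"fell": 0.5, "rose": 0.3, "closed": 0.15, "barked": 0.05}

greedy = max(probs, key=probs.get)
print("greedy:", greedy)
print("top-k (k=2):", list(top_k_filter(probs, k=2)))
nucleus = top_p_filter(probs, p=0.9)
print("top-p (p=0.9):", list(nucleus))

# Sampling from the filtered distribution (result varies with the seed)
random.seed(0)
sampled = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print("sampled:", sampled)
```

Both filters drop the implausible "barked"; top-p keeps three candidates here because the first two cover only 0.8 of the probability mass.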
import torch
import torch.nn.functional as F
from tokenizers import Tokenizer

def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 200,
    temperature: float  = 0.8,
    top_k: int           = 50,
    top_p: float         = 0.9,
    device: str           = "cuda",
):
    model.eval()
    tokens = tokenizer.encode(prompt).ids
    x = torch.tensor([tokens], dtype=torch.long).to(device)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Crop context if needed
            x_ctx = x[:, -model.cfg.context_len:]
            logits, _ = model(x_ctx)
            logits = logits[:, -1, :]     # last position only

            # Temperature scaling
            logits = logits / temperature

            # Top-k filtering
            if top_k > 0:
                vals, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < vals[:, [-1]]] = float("-inf")

            # Top-p (nucleus) filtering
            if top_p < 1.0:
                sorted_logits, sorted_idx = torch.sort(logits, descending=True)
                cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                remove = cum_probs - F.softmax(sorted_logits, dim=-1) > top_p
                sorted_logits[remove] = float("-inf")
                logits.scatter_(1, sorted_idx, sorted_logits)

            # Sample next token
            probs     = F.softmax(logits, dim=-1)
            next_tok  = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_tok], dim=1)

            # Stop on EOS token
            if next_tok.item() == tokenizer.token_to_id("[EOS]"):
                break

    generated_ids = x[0].tolist()
    return tokenizer.decode(generated_ids)


# ── Load checkpoint and generate ──────────────────────────────
ckpt  = torch.load("checkpoint_epoch2.pt", map_location=device)
model.load_state_dict(ckpt["model_state"])
tok   = Tokenizer.from_file("my_tokenizer.json")

output = generate(model, tok, prompt="The history of artificial intelligence", device=device)
print(output)
SAMPLE OUTPUT (after 3 epochs, ~36M param model)
The history of artificial intelligence began in the mid-twentieth century, when mathematicians and engineers first proposed that machines could simulate human reasoning. Early efforts focused on symbolic logic and rule-based systems, but these approaches struggled with the ambiguity of natural language...

Section 11

Scaling Up — From Toy to Real Model

Scale Beats Clever Engineering — Every Time
Richard Sutton's famous 2019 essay observed that in AI, methods that leverage more computation always win over carefully hand-crafted features — and they win decisively, over decades. GPT-1 had 117M parameters. GPT-3 had 175B. The architecture barely changed. The data and compute scaled 1,000×.

You cannot match GPT-4 on a laptop. But you can build something useful. A 100M-parameter model trained for about 3 days on a single RTX 3090 can generate surprisingly good text.
| Model Size | Parameters | Training Tokens | GPU Time (A100) | Target Use Case |
| --- | --- | --- | --- | --- |
| 🐜 Nano | 10–50M | 1–5B | 2–8 hours | Learning, quick experiments |
| 🐦 Small | 100–300M | 15–50B | 1–5 days (1 GPU) | Domain-specific text generation |
| 🦅 Medium | 1–7B | 100–200B | 2–8 weeks (8 GPUs) | Competitive with GPT-2, good at reasoning |
| 🦒 Large | 13–70B | 1–2T | Months (100+ GPUs) | LLaMA-class, general purpose assistant |
| 🌐 Frontier | 100B+ | 10T+ | Years of GPU-time ($100M+) | GPT-4, Claude, Gemini class |

Section 12

Step 7 — Fine-Tuning with LoRA

Fully fine-tuning a large model means retraining all of its parameters, which is too expensive for most use cases. LoRA (Low-Rank Adaptation) freezes the original weights and adds small trainable low-rank matrices to selected layers, here the attention projections. With LoRA, you fine-tune roughly 0.1–1% of the parameters and can achieve results competitive with full fine-tuning.

🧾
The LoRA Idea in One Sentence

Instead of updating the full weight matrix W (shape d×d), decompose the update as ΔW = A × B where A has shape d×r and B has shape r×d, and r ≪ d. With r=8 on a 512×512 matrix, you train 8,192 parameters instead of 262,144.
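The arithmetic in that sentence can be reproduced directly. The sketch below counts parameters for a dense update versus the two low-rank factors; `lora_param_counts` is an illustrative helper, and the second call assumes the fused 512 → 1536 QKV projection used by the model in this tutorial.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (dense_update_params, lora_params) for one weight matrix."""
    dense = d_in * d_out
    lora = rank * d_in + d_out * rank     # two thin factors instead of one dense update
    return dense, lora

dense, lora = lora_param_counts(512, 512, rank=8)
print(f"512x512:  dense={dense:,}  lora={lora:,}")

qkv_dense, qkv_lora = lora_param_counts(512, 1536, rank=8)
print(f"512x1536: dense={qkv_dense:,}  lora={qkv_lora:,}")
```

The adapter size grows linearly with the rank, so doubling r to 16 doubles the trainable parameters while still staying far below the dense count.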

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps an existing nn.Linear with a LoRA adapter."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        self.rank   = rank
        self.scale  = alpha / rank
        d_in, d_out = linear.in_features, linear.out_features

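        # LoRA matrices: A is random, B is zero (so ΔW starts at 0).
        # Create them on the same device and dtype as the wrapped layer.
        w = linear.weight
        self.lora_A = nn.Parameter(torch.randn(rank, d_in, device=w.device, dtype=w.dtype) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank, device=w.device, dtype=w.dtype))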

        # Freeze the original weights
        for p in self.linear.parameters():
            p.requires_grad = False

    def forward(self, x):
        base   = self.linear(x)                     # original (frozen)
        delta  = (x @ self.lora_A.T) @ self.lora_B.T  # LoRA adapter
        return base + self.scale * delta


def inject_lora(model, rank=8, alpha=16):
    """Freeze the base model and wrap each fused QKV projection with a LoRA adapter."""
    # Freeze every existing parameter so that only the adapters are trained
    for p in model.parameters():
        p.requires_grad = False
    # Snapshot the module list first, because we replace submodules while looping
    for module in list(model.modules()):
        if isinstance(module, CausalSelfAttention) and not isinstance(module.qkv_proj, LoRALinear):
            module.qkv_proj = LoRALinear(module.qkv_proj, rank, alpha)
    return model

# ── Apply LoRA ────────────────────────────────────────────────
model = inject_lora(model, rank=8, alpha=16)

# Count trainable vs frozen params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,}  ({100*trainable/total:.2f}% of total)")
OUTPUT
Trainable: 98,304  (0.27% of total)

Section 13

Golden Rules — Things That Will Save You Days

🧰 LLM Training — Non-Negotiable Rules
1
Always use mixed precision (FP16/BF16). It roughly halves activation memory and often gives a large throughput gain with little or no accuracy cost on modern GPUs. Use torch.cuda.amp.autocast(), plus GradScaler when training in FP16 (BF16 does not need loss scaling).
2
Clip gradients. LLM training is unstable without it. Set max_norm=1.0. If loss spikes, lower it to 0.5. Monitor the global gradient norm in your logs — a norm consistently above 10 is a red flag.
3
Use a cosine LR schedule with warm-up. Start at 0, linearly warm up for 500–2000 steps, then cosine-decay to ~10% of peak LR. Flat or step-decay schedules reliably underperform.
4
Memory-map your training data. Never load the full dataset into RAM. Use numpy.memmap or HuggingFace datasets in streaming mode. A 1B-token dataset needs 2 GB as uint16 — fine on disk, fatal in RAM alongside the model.
5
Save checkpoints every epoch minimum. GPU jobs die. Colab sessions time out. Save both model.state_dict() and optimizer.state_dict() — without the optimiser state, resuming training will destabilise the loss curve.
6
Tie your token embeddings to the LM head. This reduces parameters, can help convergence, and is common in GPT-2-style models. One line: tok_emb.weight = lm_head.weight.
7
Set dropout to 0 at inference. Call model.eval() before any text generation. Forgetting this produces inconsistent, noisy outputs — a subtle bug that's hard to spot.
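Rule 3 can be sketched as a standalone function. This version assumes the same shape as the schedule in the training loop (linear warm-up, then cosine decay to 10% of the peak) but returns the learning rate itself; `lr_at` and its default values are illustrative, not taken from any library.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=500, total_steps=10_000, min_ratio=0.1):
    """Linear warm-up from 0 to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

for step in (0, 250, 500, 5_250, 10_000):
    print(f"step {step:6d}: lr = {lr_at(step):.2e}")
```

Printing a handful of values like this before launching a long run is a cheap way to catch an off-by-one in the warm-up or a schedule that never decays.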

Section 14

From-Scratch vs Fine-Tuning vs Prompt Engineering

| Approach | From Scratch | Fine-Tuning (Full) | LoRA / PEFT | Prompt Engineering |
| --- | --- | --- | --- | --- |
| Control | Total (you own everything) | High | Moderate | Low |
| Cost | Very High ($10k–$1M+) | High ($100–$10k) | Low ($5–$100) | Near zero |
| Data needed | Billions of tokens | 10k–1M examples | 500–50k examples | 0–10 examples |
| Time to first result | Days to months | Hours to days | Minutes to hours | Seconds |
| Best for | Proprietary domain, research | Domain specialisation | Task adaptation, tight budgets | Prototyping, off-the-shelf models |
🏆
The Practitioner's Decision Tree

Start with prompt engineering — it's free and might be enough. If you need customisation, use LoRA fine-tuning on an existing open model (LLaMA 3, Mistral). Only build from scratch if your data is proprietary, your domain is highly specialised, or you need complete control over the model weights for regulatory reasons.


Section 15

Common Errors & How to Fix Them

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Loss immediately NaN | Learning rate too high, bad data (inf/nan values) | Lower LR to 1e-4; run torch.isnan(loss).any() after each step; inspect your dataset |
| Loss stuck at ~10.4 | Model outputting near-uniform distribution (untrained) | Check weight initialisation; verify gradient flow; ensure LR warms up correctly |
| Loss drops then spikes | Exploding gradients, bad batch in data | Clip gradients; lower max_norm; scan training file for corrupt entries |
| CUDA out of memory | Batch too large, sequence too long, no gradient checkpointing | Halve batch_size; reduce context_len; enable torch.utils.checkpoint |
| Model generates gibberish | Undertrained, temperature too high, context window exceeded | Train more; lower temperature to 0.6; crop input to context_len |
| Training works but eval is random | Forgot to call model.eval() | Call model.eval() and wrap inference in torch.no_grad() |