Build Your Own LLM From Scratch in Python

Section 01

The Story That Explains Building an LLM

📖 Real World Analogy

Teaching a Child to Read — Then to Write

Imagine you hand a child every book ever written. You don't explain grammar. You don't teach vocabulary lists. You simply say: "Read all of this — and then predict what word comes next in any sentence."

At first the child guesses wildly, finishing "The dog chased the…" with words that make no sense. But after reading ten million sentences, the child builds deep intuitions. They know that "The stock market…" probably continues with fell, rose, or closed, and never with barked.

That is precisely how a Large Language Model is trained. It reads an enormous slice of human text. At each position, it predicts the next token. Every wrong prediction fires a correction signal through the network. After billions of corrections, the model does not just predict text — it understands language well enough to reason, translate, summarise, and code.

In this tutorial you will build a working LLM from scratch using Python. We start from raw text, write a tokeniser, implement a Transformer architecture, train it on real data, and finish with a model that can generate coherent language. Every line of code is explained. Every concept has a story.

💡

What This Tutorial Assumes

You have already completed the LLM Basics tutorial — you know what tokens, embeddings, attention, and the Transformer architecture are conceptually. Here we focus entirely on building, training, and running one using Python, PyTorch, and real data.

Section 02

The LLM Creation Pipeline — The Big Picture

Building an LLM is a six-stage journey. Each stage feeds the next. Skip one and the whole system breaks down.

📄 Data Collection & Cleaning

Gather raw text — books, code, web pages, Wikipedia. Clean it: remove HTML tags, duplicates, toxic content, and non-UTF-8 garbage. Quality beats quantity at this stage.

🔠 Tokenisation

Convert raw text into integer token IDs. Train a Byte-Pair Encoding (BPE) tokeniser on your corpus. This builds a vocabulary of 32k–100k subword tokens.

🧠 Model Architecture

Define the Transformer: token embeddings, positional encodings, multi-head attention, feed-forward layers, layer norms, and a final language model head.

🔥 Pre-training

Train on next-token prediction (causal language modelling). This is where 90% of compute goes. For a small model: 1–3 days on a single GPU.

🎯 Fine-tuning (Optional)

Take your pre-trained model and train it on a task-specific dataset (instructions, Q&A pairs) to align it with desired behaviour. RLHF/LoRA techniques live here.

🚀 Inference & Serving

Load your trained weights, pass a prompt through the model, and decode token-by-token using sampling strategies (greedy, top-k, nucleus). Serve via a REST API.

Section 03

Setting Up Your Environment

Every professional LLM experiment lives in an isolated environment. Here is how to set yours up in under five minutes.

⚙️ Required Stack — Python 3.10+

Core

PyTorch 2.x — The neural network engine. GPU acceleration via CUDA.

Tokenise

tokenizers: Hugging Face library for training BPE tokenisers. (tiktoken, from OpenAI, is an optional fast encoder for existing vocabularies and is not used below.)

Data

datasets — HuggingFace datasets for streaming large corpora without RAM limits.

Track

wandb / tensorboard — Experiment tracking. Log loss curves, learning rates, gradients.

Speed

flash-attn — 2–4× faster attention with 10× less memory. Install only on GPU machines.

# Create and activate a virtual environment
python -m venv llm_env
source llm_env/bin/activate          # Linux / macOS
# llm_env\Scripts\activate           # Windows

# Install the full stack
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tokenizers datasets transformers wandb tqdm numpy

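# Verify GPU is available (run the lines below in Python, not in the shell)
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():        # get_device_name raises without a GPU
    print("GPU:", torch.cuda.get_device_name(0))
print("PyTorch version:", torch.__version__)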

OUTPUT

CUDA available: True
GPU: NVIDIA GeForce RTX 3090
PyTorch version: 2.2.1+cu121

Section 04

Step 1 — Data Collection & Cleaning

📖 Story

The Chef's Secret: Garbage In, Garbage Out

A world-class chef can follow a perfect recipe, but if the ingredients are rotten, the dish will be inedible. LLMs work the same way. The Common Crawl portion of GPT-3's training data was filtered from roughly 45 TB of compressed text down to 570 GB, so the filtering step discarded far more data than it kept. The quality of your training data often matters more to the quality of your model's language than architecture or training time.

What Makes Good Training Data?

📚

Diversity

Many Domains

Mix books, code, scientific papers, news, and web text. A model trained only on Wikipedia will fail at poetry. A model trained only on code won't understand human emotion.

✨

Quality

Signal over Noise

Prioritise well-written, coherent text. Web scrapes contain spam, SEO filler, and gibberish. Filter aggressively. A smaller, cleaner dataset beats a huge dirty one every time.

📋

Volume

Tokens Matter

Chinchilla scaling laws suggest you need roughly 20 tokens per model parameter for optimal compute. A 1B parameter model needs ~20B tokens. Start small: 1M tokens for experiments.
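As a rough illustration of the 20-tokens-per-parameter rule of thumb quoted above, the arithmetic can be sketched in a few lines. The function name `chinchilla_tokens` and the example model sizes are hypothetical, chosen only for illustration, and the ratio of 20 is the approximate figure from the text rather than an exact constant.

```python
# Rule-of-thumb token budget: roughly 20 training tokens per model parameter.
# The ratio is the approximation quoted in the text, not an exact law.

def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Return the approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param

# Hypothetical model sizes, for illustration only
for label, n_params in [("50M", 50_000_000), ("1B", 1_000_000_000)]:
    budget = chinchilla_tokens(n_params)
    print(f"{label} parameters -> about {budget / 1e9:.0f}B tokens")
```

For quick experiments you can train on far fewer tokens than this budget; the rule only describes where compute is spent most efficiently.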

Below is the full data pipeline — from raw download to clean training-ready text:

from datasets import load_dataset
import re, unicodedata

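# ── 1. Load a public dataset (streaming to save RAM) ─────────
# Note: newer releases of `datasets` may no longer run script-based datasets;
# if this call fails, the "wikimedia/wikipedia" repository hosts equivalent dumps.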
dataset = load_dataset(
    "wikipedia",
    "20220301.en",
    split="train",
    streaming=True
)

# ── 2. Text cleaning function ─────────────────────────────────
def clean_text(text: str) -> str:
    # Normalise Unicode to NFKC (e.g. the ligature ﬁ → fi); accents such as é are kept
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs
    text = re.sub(r"https?://\S+", "", text)
    # Remove excessive whitespace
    text = re.sub(r"\s{3,}", "\n\n", text)
    # Remove very short lines (likely navigation noise)
    lines = [l for l in text.splitlines() if len(l.split()) > 5]
    return "\n".join(lines).strip()

# ── 3. Stream, clean, and write to disk ──────────────────────
with open("train.txt", "w", encoding="utf-8") as f:
    for i, example in enumerate(dataset):
        clean = clean_text(example["text"])
        if len(clean) > 200:          # Skip tiny stubs
            f.write(clean + "\n\n")
        if i + 1 >= 50_000: break  # stop after 50k articles (demo size)

print("Done. Dataset written to train.txt")

OUTPUT

Done. Dataset written to train.txt

File size: ~1.4 GB | ~380M tokens (estimated; not printed by the script)

Section 05

Step 2 — Building the BPE Tokeniser

A tokeniser converts raw text into integer IDs the model can process. We train a Byte-Pair Encoding (BPE) tokeniser — the same algorithm used by GPT-2, GPT-3, and LLaMA. BPE starts with individual characters and iteratively merges the most frequent pairs, building a vocabulary of subword units.

📖 How BPE Works

Merging the Most Popular Pairs

Start with the word "unbelievable" split into characters: u · n · b · e · l · i · e · v · a · b · l · e.

Count all adjacent pairs across your whole corpus. If "a b" is the most common pair, merge it into a single token "ab". Repeat until your vocabulary reaches the target size (e.g. 32,768). The result: common words become single tokens ("the" → 1 token), rare words are split into known subwords ("unbelievable" → un + believ + able).
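The merge loop described above can be sketched in plain Python. This is a toy, character-level illustration and not the implementation used by the `tokenizers` library: the corpus, the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`), and the lexicographic tie-break are all assumptions made for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merges from a whitespace-split toy corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an arbitrary choice so that this sketch is deterministic).
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy corpus: low x5, lower x2, newest x6, widest x3
toy_corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges, words = train_bpe(toy_corpus, num_merges=3)
print("Learned merges:", merges)
print("Segmented words:", list(words))
```

After a few merges the frequent suffix "est" becomes a single symbol, which is the same effect that turns common words into single tokens at full scale.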

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# ── Initialise a blank BPE tokeniser ─────────────────────────
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

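# ── Define the trainer and special tokens ─────────────────────
trainer = BpeTrainer(
    vocab_size=32768,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
    initial_alphabet=ByteLevel.alphabet(),  # seed all 256 byte symbols so no input is out-of-vocabulary
    show_progress=True
)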

# ── Train on our cleaned corpus ───────────────────────────────
tokenizer.train(files=["train.txt"], trainer=trainer)

# ── Save for reuse ────────────────────────────────────────────
tokenizer.save("my_tokenizer.json")
print(f"Vocab size: {tokenizer.get_vocab_size()}")

# ── Quick test ────────────────────────────────────────────────
sample = "Large language models learn from text."
encoded = tokenizer.encode(sample)
print("Tokens:", encoded.tokens)
print("IDs:   ", encoded.ids)

OUTPUT

Vocab size: 32768
Tokens: ['Large', 'Ġlanguage', 'Ġmodels', 'Ġlearn', 'Ġfrom', 'Ġtext', '.']
IDs:    [12045, 3303, 4981, 3537, 422, 2420, 13]

(Illustrative: the exact splits and IDs depend on the corpus you trained on.)

🔑

The Ġ Prefix Explained

The Ġ symbol is how byte-level BPE displays a leading space, so it marks the beginning of a new word. This is GPT-style byte-level BPE: it operates on raw bytes rather than Unicode characters, which makes it language-agnostic and, as long as all 256 byte symbols are in the vocabulary, immune to out-of-vocabulary words.
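To see where Ġ comes from, here is a minimal sketch of the byte-to-character table used by GPT-2-style byte-level BPE, written from memory of the public GPT-2 encoder. The assumption is that the `tokenizers` ByteLevel pre-tokenizer follows the same scheme; the function names `bytes_to_unicode` and `encode_visible` are used here only for illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(0x21, 0x7F))    # '!' .. '~'
        + list(range(0xA1, 0xAD))  # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100)) # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, control codes) are
            # shifted to unused code points starting at U+0100
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

table = bytes_to_unicode()

def encode_visible(text: str) -> str:
    """Show how a string looks after the byte-level mapping."""
    return "".join(table[b] for b in text.encode("utf-8"))

print(encode_visible(" language"))   # the space byte 0x20 becomes U+0120 (Ġ)
```

Because every byte has a visible stand-in, any UTF-8 string can be represented, whatever the language or script.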

Section 06

Step 3 — The Transformer Architecture in Code

We now implement the decoder-only Transformer used by GPT-style models. It has six key components.

📋

Token Embedding

Maps each integer token ID to a dense vector of size d_model. These vectors are learned during training.

nn.Embedding(vocab_size, d_model)

📍

Positional Encoding

Injects information about where each token sits in the sequence. Without this, "dog bites man" and "man bites dog" look identical to the model.

Learned or Sinusoidal

👀

Multi-Head Attention

Lets every token attend to itself and every earlier token. The "multi-head" part runs attention in parallel across multiple subspaces, catching different relationship types.

Causal (masked) attention

▶️

Feed-Forward Network

A two-layer MLP applied independently to each token position. Typically 4× the model dimension. This is where most factual knowledge is believed to be stored.

d_model → 4*d_model → d_model

📒

Layer Normalisation

Stabilises training by normalising activations. Modern models use Pre-LN (before attention/FFN) rather than Post-LN for better gradient flow.

nn.LayerNorm(d_model)

🎯

LM Head

A final linear layer that projects from d_model back to vocab_size. The softmax of this output gives a probability over the entire vocabulary for the next token.

nn.Linear(d_model, vocab_size)
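Of the components above, causal masking is the one that is easiest to get wrong, so here is a dependency-free sketch of what the mask does to a matrix of attention scores. It uses plain Python lists instead of tensors, and the helper names (`softmax`, `causal_attention_weights`) are assumptions for this illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask future positions with -inf, then normalise each row."""
    T = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        for i in range(T)
    ]
    return [softmax(row) for row in masked]

# Three tokens with identical raw scores: each position spreads its
# attention evenly over itself and the tokens before it.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in weights:
    print([round(w, 3) for w in row])
```

Row i has non-zero weights only in columns 0 to i, which is what prevents a token from looking at the future during training.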

Full Model Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ─────────────────────────────────────────────────────────────
# CONFIG: small model for experimentation (~36M parameters, ~19M excluding embeddings)
# ─────────────────────────────────────────────────────────────
class GPTConfig:
    vocab_size: int  = 32768
    context_len: int = 512      # sequence length
    d_model: int     = 512      # embedding dimension
    n_heads: int     = 8        # attention heads
    n_layers: int    = 6        # transformer blocks
    dropout: float  = 0.1

# ─────────────────────────────────────────────────────────────
# MULTI-HEAD CAUSAL SELF-ATTENTION
# ─────────────────────────────────────────────────────────────
class CausalSelfAttention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        assert cfg.d_model % cfg.n_heads == 0
        self.n_heads  = cfg.n_heads
        self.head_dim = cfg.d_model // cfg.n_heads
        self.d_model  = cfg.d_model

        # Fused Q, K, V projection
        self.qkv_proj = nn.Linear(cfg.d_model, 3 * cfg.d_model, bias=False)
        self.out_proj  = nn.Linear(cfg.d_model, cfg.d_model, bias=False)
        self.attn_drop = nn.Dropout(cfg.dropout)
        self.res_drop  = nn.Dropout(cfg.dropout)

        # Causal mask: lower-triangular ones; the zeros (future positions) become -inf in forward()
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(cfg.context_len, cfg.context_len))
                  .view(1, 1, cfg.context_len, cfg.context_len)
        )

    def forward(self, x):
        B, T, C = x.shape      # batch, sequence length, d_model

        # Compute Q, K, V in one shot then split
        q, k, v = self.qkv_proj(x).split(self.d_model, dim=2)

        # Reshape for multi-head: (B, n_heads, T, head_dim)
        def reshape(t):
            return t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = reshape(q), reshape(k), reshape(v)

        # Scaled dot-product attention
        scale = math.sqrt(self.head_dim)
        attn  = (q @ k.transpose(-2, -1)) / scale
        attn  = attn.masked_fill(self.mask[:,:,:T,:T] == 0, float("-inf"))
        attn  = F.softmax(attn, dim=-1)
        attn  = self.attn_drop(attn)

        # Weighted sum of values, merge heads
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.res_drop(self.out_proj(out))

# ─────────────────────────────────────────────────────────────
# FEED-FORWARD NETWORK (with GELU activation)
# ─────────────────────────────────────────────────────────────
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cfg.d_model, 4 * cfg.d_model),
            nn.GELU(),
            nn.Linear(4 * cfg.d_model, cfg.d_model),
            nn.Dropout(cfg.dropout),
        )
    def forward(self, x): return self.net(x)

# ─────────────────────────────────────────────────────────────
# TRANSFORMER BLOCK = Attention + FFN + Residuals + LayerNorm
# ─────────────────────────────────────────────────────────────
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.ln1  = nn.LayerNorm(cfg.d_model)
        self.attn = CausalSelfAttention(cfg)
        self.ln2  = nn.LayerNorm(cfg.d_model)
        self.ffn  = FeedForward(cfg)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # Pre-LN residual
        x = x + self.ffn(self.ln2(x))    # Pre-LN residual
        return x

# ─────────────────────────────────────────────────────────────
# FULL GPT MODEL
# ─────────────────────────────────────────────────────────────
class GPT(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.transformer = nn.ModuleDict({
            "tok_emb": nn.Embedding(cfg.vocab_size, cfg.d_model),
            "pos_emb": nn.Embedding(cfg.context_len, cfg.d_model),
            "drop"   : nn.Dropout(cfg.dropout),
            "blocks" : nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)]),
            "ln_f"   : nn.LayerNorm(cfg.d_model),
        })
        self.lm_head = nn.Linear(cfg.d_model, cfg.vocab_size, bias=False)

        # Weight tying: share token embeddings with lm_head
        self.transformer["tok_emb"].weight = self.lm_head.weight

        # Initialise weights (GPT-2 style)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.cfg.context_len

        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x   = self.transformer["drop"](
                  self.transformer["tok_emb"](idx) +
                  self.transformer["pos_emb"](pos)
              )
        for block in self.transformer["blocks"]:
            x = block(x)
        x = self.transformer["ln_f"](x)
        logits = self.lm_head(x)           # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.cfg.vocab_size),
                                   targets.view(-1))
        return logits, loss

# ── Count parameters ──────────────────────────────────────────
cfg   = GPTConfig()
model = GPT(cfg)
n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params/1e6:.2f}M")

OUTPUT

Model parameters: 35.94M

📚

Weight Tying — Free Accuracy

We share weights between the token embedding matrix and the final LM head. This trick (used in GPT-2) removes about 16.8M parameters (32,768 × 512) without hurting performance, because both layers are learning the same token semantics from opposite ends.
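A quick way to sanity-check a printed parameter count is to add up the layer shapes by hand. The sketch below does this for the layer layout defined in the model code above (bias-free attention projections, biased feed-forward layers, learned positional embeddings), with and without weight tying. The helper name `count_gpt_params` is hypothetical.

```python
def count_gpt_params(vocab_size=32768, context_len=512, d_model=512,
                     n_layers=6, tie_weights=True):
    """Count learnable parameters for the GPT layout used in this tutorial."""
    tok_emb = vocab_size * d_model
    pos_emb = context_len * d_model
    # Attention: fused QKV (d -> 3d) plus output projection (d -> d), no biases
    attn = 3 * d_model * d_model + d_model * d_model
    # Feed-forward: d -> 4d -> d, both layers with biases
    ffn = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * 2 * d_model
    block = attn + ffn + norms
    final_norm = 2 * d_model
    # With tying, the LM head reuses the token embedding matrix
    lm_head = 0 if tie_weights else vocab_size * d_model
    return tok_emb + pos_emb + n_layers * block + final_norm + lm_head

tied = count_gpt_params(tie_weights=True)
untied = count_gpt_params(tie_weights=False)
print(f"Tied:   {tied / 1e6:.2f}M")
print(f"Untied: {untied / 1e6:.2f}M")
print(f"Saved:  {(untied - tied) / 1e6:.2f}M")
```

Note that the number of attention heads does not appear: splitting d_model into more heads changes how the projections are used, not how many weights they contain.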

Section 07

Step 4 — The Data Loader

Before training, we need a fast data pipeline. We memory-map the training file and sample random context windows of length context_len. Each window becomes one training example: input = tokens[0:T], target = tokens[1:T+1].

⚠️

Why targets are shifted by 1

At each position t, the model must predict token t+1. So if input is [The, cat, sat], targets are [cat, sat, on]. This one-position shift is the entire training objective — called causal language modelling.
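The one-position shift can be shown on a handful of tokens without any tensors. This is a minimal sketch using a Python list of strings in place of token IDs; `make_example` is a hypothetical helper that mirrors the slicing described above.

```python
def make_example(tokens, start, context_len):
    """Return (input, target) windows, with the target shifted by one."""
    chunk = tokens[start : start + context_len + 1]  # one extra token
    return chunk[:-1], chunk[1:]

tokens = ["The", "cat", "sat", "on", "the", "mat"]
x, y = make_example(tokens, start=0, context_len=3)
print("input :", x)
print("target:", y)

# Each input position is paired with the token that follows it
for inp, tgt in zip(x, y):
    print(f"after {inp!r} predict {tgt!r}")
```

A single window of length T therefore yields T training signals, one per position, from one forward pass.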

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
from tokenizers import Tokenizer

class TextDataset(Dataset):
    def __init__(self, token_file: str, context_len: int):
        # Memory-map the binary token file (fast, no RAM copy)
        self.data = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.context_len = context_len

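    def __len__(self):
        # One example per start offset: windows overlap heavily, so a full
        # "epoch" revisits each token up to context_len times.
        return len(self.data) - self.context_len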

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.context_len + 1]
        x = torch.tensor(chunk[:-1].astype(np.int64))
        y = torch.tensor(chunk[1:].astype(np.int64))
        return x, y


# ── Tokenise and save as binary (done once) ───────────────────
def encode_to_binary(txt_path: str, out_path: str, tokenizer_path: str):
    tok = Tokenizer.from_file(tokenizer_path)
    with open(txt_path, "r", encoding="utf-8") as f:
        text = f.read()
    ids = tok.encode(text).ids
    arr = np.array(ids, dtype=np.uint16)
    arr.tofile(out_path)
    print(f"Saved {len(arr):,} tokens to {out_path}")

# Run once: encode_to_binary("train.txt", "train.bin", "my_tokenizer.json")

# ── Build the data loader ─────────────────────────────────────
train_ds = TextDataset("train.bin", context_len=512)
train_dl = DataLoader(
    train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    pin_memory=True,         # faster GPU transfer
)
print(f"Batches per epoch: {len(train_dl):,}")

Section 08

Step 5 — The Training Loop

The training loop is the heart of everything. At each step: forward pass → compute loss → backward pass → update weights. We add gradient clipping, a learning rate scheduler, and periodic checkpointing.

What is Cross-Entropy Loss?

Loss Formula

L = -log P(correct token)

For each position, the loss is the negative log probability the model assigns to the true next token. Perfect prediction → loss = 0. Random guessing on 32k vocab → loss ≈ 10.4.

Perplexity

PPL = e^L

A more interpretable metric. PPL = 10 means the model is as confused as if it had 10 equally likely choices at every step. GPT-2 (1.5B) achieves ~18 PPL on WikiText-103.
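The two formulas above are easy to check numerically. The short sketch below reproduces the random-guessing baseline for a 32,768-token vocabulary and the loss-to-perplexity conversion; the function names are chosen for illustration.

```python
import math

def cross_entropy(p_correct: float) -> float:
    """Loss for one position: negative log of the probability of the true token."""
    return -math.log(p_correct)

def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(loss)

vocab_size = 32768
uniform_loss = cross_entropy(1 / vocab_size)   # model assigns equal probability to every token
print(f"Random-guess loss: {uniform_loss:.3f}")
print(f"Random-guess PPL : {perplexity(uniform_loss):.0f}")

# A loss of ln(10) corresponds to 10 equally likely choices per step
print(f"PPL at loss {math.log(10):.3f}: {perplexity(math.log(10)):.1f}")
```

If the first logged loss of a fresh model is far from the random-guess value, that usually points to an initialisation or data problem rather than to learning.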

import torch, math
from torch.optim import AdamW

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = GPT(GPTConfig()).to(device)

# ── Optimiser: AdamW with decoupled weight decay ──────────────
optimizer = AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    eps=1e-8
)

# ── Cosine LR schedule with warm-up ──────────────────────────
n_epochs   = 3
total_steps = n_epochs * len(train_dl)
warmup      = 500

def lr_lambda(step):
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.1 + 0.5 * (1 - 0.1) * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# ── Mixed precision scaler (FP16 training; disabled on CPU) ──
use_amp = device == "cuda"
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

# ── Training loop ─────────────────────────────────────────────
model.train()
for epoch in range(n_epochs):
    for step, (x, y) in enumerate(train_dl):
        x, y = x.to(device), y.to(device)

        # Forward pass with automatic mixed precision
        with torch.cuda.amp.autocast(enabled=use_amp):
            _, loss = model(x, targets=y)

        # Backward pass
        optimizer.zero_grad()
        scaler.scale(loss).backward()

        # Gradient clipping (prevents exploding gradients)
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        # Log every 100 steps
        if step % 100 == 0:
            ppl = math.exp(loss.item())
            lr  = scheduler.get_last_lr()[0]
            print(f"Epoch {epoch} | Step {step:5d} | Loss {loss.item():.4f} | PPL {ppl:.1f} | LR {lr:.2e}")

    # Save checkpoint after each epoch
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, f"checkpoint_epoch{epoch}.pt")

📈

Reading the Loss Curve — What's Normal

Loss starts near ln(vocab_size) ≈ 10.4 (random guessing). It should drop sharply in the first 500 steps as the model learns basic structure, then gradually reduce over training. If loss plateaus above 4.0, check your learning rate. If loss suddenly spikes and stays high, you likely have a gradient explosion — lower the max_norm or reduce the learning rate.

Section 09

Hyperparameter Reference Table

These are the key knobs. Understanding what each one does prevents hours of debugging.

Hyperparameter	Our Demo Value	GPT-2 Small	Effect of Increasing	Typical Range
d_model	512	768	More capacity, more compute	256 – 8192
n_layers	6	12	Deeper reasoning, slower training	4 – 96
n_heads	8	12	More attention patterns (limited gain)	4 – 64
context_len	512	1024	RAM grows quadratically (O(T²))	512 – 128k
learning_rate	3e-4	2.5e-4	Unstable training above 1e-3	1e-5 – 1e-3
batch_size	32	512	Smoother gradients, needs LR scaling	16 – 4096
dropout	0.1	0.1	Reduces overfitting (set to 0 at inference)	0.0 – 0.3
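The quadratic cost of context_len in the table can be made concrete with a little arithmetic. This sketch estimates the size of the attention score matrix for one layer when it is fully materialised, as in the attention code of this tutorial; fused kernels such as flash attention avoid storing it, so treat the result as an upper bound. The function name and the 2-bytes-per-value (FP16) default are assumptions.

```python
def attention_score_bytes(batch_size, n_heads, context_len, bytes_per_value=2):
    """Memory for one layer's (B, heads, T, T) attention score tensor."""
    return batch_size * n_heads * context_len * context_len * bytes_per_value

demo = attention_score_bytes(batch_size=32, n_heads=8, context_len=512)
doubled = attention_score_bytes(batch_size=32, n_heads=8, context_len=1024)

print(f"T=512 : {demo / 2**20:.0f} MiB per layer")
print(f"T=1024: {doubled / 2**20:.0f} MiB per layer ({doubled // demo}x)")
```

Doubling the context length quadruples this term, which is why reducing context_len is often the quickest fix for out-of-memory errors.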

Section 10

Step 6 — Text Generation at Inference Time

Once trained, we generate text by feeding a prompt and sampling tokens one at a time. The quality of generation depends heavily on your sampling strategy.

🎯

Greedy Decoding

argmax at every step

Always pick the highest probability token. Fast and deterministic, but generates repetitive, dull text. Rarely used in production.

⚠ "I love love love love…"

🎲

Top-k Sampling

sample from top k tokens

Keep only the top-k most likely tokens, zero out the rest, then sample. k=50 gives diverse, coherent text. Widely used in GPT-2 and GPT-3.

✓ Diverse yet sensible

📈

Nucleus (Top-p) Sampling

sample from top p probability mass

Keep the smallest set of tokens whose cumulative probability exceeds p (e.g. 0.9). Adapts dynamically — tight when confident, wide when uncertain.

✓ Best for creative text
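The three strategies above differ only in which tokens survive before sampling. The following sketch applies each filter to a small, hand-made probability distribution using plain Python; the helper names and the example probabilities are hypothetical, and a real implementation would operate on logits tensors instead.

```python
def renormalise(kept):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}

def ranked_indices(probs):
    """Token indices ordered from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def greedy(probs):
    """Greedy decoding: always pick the most likely token."""
    return ranked_indices(probs)[0]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    return renormalise({i: probs[i] for i in ranked_indices(probs)[:k]})

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for i in ranked_indices(probs):
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(kept)

# Hypothetical next-token distribution over a 5-token vocabulary
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]

print("greedy      :", greedy(probs))
print("top-k (k=2) :", top_k_filter(probs, 2))
print("top-p (0.8) :", top_p_filter(probs, 0.8))
```

With a peaked distribution top-p keeps very few tokens, and with a flat one it keeps many, whereas top-k always keeps exactly k.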

import torch
import torch.nn.functional as F
from tokenizers import Tokenizer

def generate(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 200,
    temperature: float  = 0.8,
    top_k: int           = 50,
    top_p: float         = 0.9,
    device: str           = "cuda",
):
    model.eval()
    tokens = tokenizer.encode(prompt).ids
    x = torch.tensor([tokens], dtype=torch.long).to(device)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Crop context if needed
            x_ctx = x[:, -model.cfg.context_len:]
            logits, _ = model(x_ctx)
            logits = logits[:, -1, :]     # last position only

            # Temperature scaling
            logits = logits / temperature

            # Top-k filtering
            if top_k > 0:
                vals, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < vals[:, [-1]]] = float("-inf")

            # Top-p (nucleus) filtering
            if top_p < 1.0:
                sorted_logits, sorted_idx = torch.sort(logits, descending=True)
                cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                remove = cum_probs - F.softmax(sorted_logits, dim=-1) > top_p
                sorted_logits[remove] = float("-inf")
                logits.scatter_(1, sorted_idx, sorted_logits)

            # Sample next token
            probs     = F.softmax(logits, dim=-1)
            next_tok  = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_tok], dim=1)

            # Stop on EOS token
            if next_tok.item() == tokenizer.token_to_id("[EOS]"):
                break

    generated_ids = x[0].tolist()
    return tokenizer.decode(generated_ids)


# ── Load checkpoint and generate ──────────────────────────────
ckpt  = torch.load("checkpoint_epoch2.pt", map_location=device)
model.load_state_dict(ckpt["model_state"])
tok   = Tokenizer.from_file("my_tokenizer.json")

output = generate(model, tok, prompt="The history of artificial intelligence")
print(output)

SAMPLE OUTPUT (after 3 epochs, ~36M param model)

The history of artificial intelligence began in the mid-twentieth century, when mathematicians and engineers first proposed that machines could simulate human reasoning. Early efforts focused on symbolic logic and rule-based systems, but these approaches struggled with the ambiguity of natural language...

Section 11

Scaling Up — From Toy to Real Model

📖 The Bitter Lesson

Scale Beats Clever Engineering — Every Time

Richard Sutton's famous 2019 essay observed that, in AI, general methods that leverage more computation have ultimately beaten carefully hand-crafted approaches, and by a wide margin, over decades. GPT-1 had 117M parameters. GPT-3 had 175B. The architecture barely changed. The data and compute scaled 1,000×.

You cannot match GPT-4 on a laptop. But you can build something useful. A well-trained 100M parameter model running on a single RTX 3090 for 3 days can generate surprisingly good text.

Model Size	Parameters	Training Tokens	GPU Time (A100)	Target Use Case
🐜 Nano	10–50M	1–5B	2–8 hours	Learning, quick experiments
🐦 Small	100–300M	15–50B	1–5 days (1 GPU)	Domain-specific text generation
🦅 Medium	1–7B	100–200B	2–8 weeks (8 GPUs)	Competitive with GPT-2, good at reasoning
🦒 Large	13–70B	1–2T	Months (100+ GPUs)	LLaMA-class, general purpose assistant
🌐 Frontier	100B+	10T+	Years of GPU-time ($100M+)	GPT-4, Claude, Gemini class

Section 12

Step 7 — Fine-Tuning with LoRA

Fully fine-tuning a large model means retraining all parameters, which is too expensive for most use cases. LoRA (Low-Rank Adaptation) freezes the original weights and adds tiny trainable rank-decomposed matrices at each attention layer. With LoRA, you fine-tune just 0.1–1% of the parameters and can achieve results competitive with full fine-tuning.

🧾

The LoRA Idea in One Sentence

Instead of updating the full weight matrix W (shape d×d), decompose the update as ΔW = B × A, where B has shape d×r, A has shape r×d, and r ≪ d. With r=8 on a 512×512 matrix, you train 8,192 parameters instead of 262,144.
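The parameter arithmetic above generalises to any layer shape. As a small sketch, the helper below (a hypothetical name, `lora_param_count`) counts the weights in the two low-rank factors and compares them with the full matrix, first for the 512×512 case from the text and then for a fused 512→1536 projection.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in the two low-rank factors of one adapted layer."""
    return rank * d_in + d_out * rank

def full_param_count(d_in: int, d_out: int) -> int:
    """Weights in the full (bias-free) matrix that LoRA leaves frozen."""
    return d_in * d_out

# The square 512 x 512 case from the text, rank 8
square_lora = lora_param_count(512, 512, rank=8)
square_full = full_param_count(512, 512)
print(f"512x512 : {square_lora:,} trainable vs {square_full:,} frozen")

# A fused Q, K, V projection (512 -> 1536), rank 8
fused_lora = lora_param_count(512, 3 * 512, rank=8)
fused_full = full_param_count(512, 3 * 512)
print(f"512x1536: {fused_lora:,} trainable vs {fused_full:,} frozen")
```

The saving grows with the layer size because the adapter cost is linear in d while the full matrix is quadratic.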

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps an existing nn.Linear with a LoRA adapter."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        self.rank   = rank
        self.scale  = alpha / rank
        d_in, d_out = linear.in_features, linear.out_features

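        # LoRA matrices: A is random, B is zero (so ΔW starts at 0).
        # Created on the wrapped layer's device so a model already on the GPU keeps working.
        dev = linear.weight.device
        self.lora_A = nn.Parameter(torch.randn(rank, d_in, device=dev) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank, device=dev))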

        # Freeze the original weights
        for p in self.linear.parameters():
            p.requires_grad = False

    def forward(self, x):
        base   = self.linear(x)                     # original (frozen)
        delta  = (x @ self.lora_A.T) @ self.lora_B.T  # LoRA adapter
        return base + self.scale * delta


def inject_lora(model, rank=8, alpha=16):
    """Freeze the base model and wrap each fused QKV projection with LoRA."""
    # Freeze everything first; only the adapters created below stay trainable
    for p in model.parameters():
        p.requires_grad = False
    for module in list(model.modules()):
        if isinstance(module, CausalSelfAttention):
            # We inject LoRA only into the fused QKV projection
            module.qkv_proj = LoRALinear(module.qkv_proj, rank, alpha)
    return model

# ── Apply LoRA ────────────────────────────────────────────────
model = inject_lora(model, rank=8, alpha=16)

# Count trainable vs frozen params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,}  ({100*trainable/total:.2f}% of total)")

OUTPUT

Trainable: 98,304  (0.27% of total)

Section 13

Golden Rules — Things That Will Save You Days

🧰 LLM Training — Non-Negotiable Rules

Always use mixed precision (FP16/BF16). On modern GPUs it roughly halves activation memory and can come close to doubling throughput, usually with no measurable accuracy cost. Use torch.cuda.amp.autocast(), plus GradScaler when training in FP16.

Clip gradients. LLM training is unstable without it. Set max_norm=1.0. If loss spikes, lower it to 0.5. Monitor the global gradient norm in your logs — a norm consistently above 10 is a red flag.

Use a cosine LR schedule with warm-up. Start at 0, linearly warm up for 500–2000 steps, then cosine-decay to ~10% of peak LR. Flat or step-decay schedules reliably underperform.

Memory-map your training data. Never load the full dataset into RAM. Use numpy.memmap or HuggingFace datasets in streaming mode. A 1B-token dataset needs 2 GB as uint16 — fine on disk, fatal in RAM alongside the model.

Save checkpoints every epoch minimum. GPU jobs die. Colab sessions time out. Save both model.state_dict() and optimizer.state_dict() — without the optimiser state, resuming training will destabilise the loss curve.

Tie your token embeddings to the LM head. This reduces parameters, often helps convergence, and is standard in GPT-2-style models. One line: tok_emb.weight = lm_head.weight.

Set dropout to 0 at inference. Call model.eval() before any text generation. Forgetting this produces inconsistent, noisy outputs — a subtle bug that's hard to spot.
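The warm-up plus cosine schedule from the rules above is easy to inspect outside of a training run. This is a minimal sketch of the learning rate as a function of the step number; the function name `lr_at` and the total of 10,000 steps are hypothetical, while the peak rate of 3e-4, the 500 warm-up steps, and the 10% floor follow the values used in this tutorial.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=500, total_steps=10_000, min_ratio=0.1):
    """Linear warm-up from 0, then cosine decay to min_ratio * peak_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)

for step in (0, 250, 500, 5_250, 10_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Printing or plotting a few values like this before training is a cheap way to catch an off-by-one in the warm-up or a schedule that decays to zero too early.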

Section 14

From-Scratch vs Fine-Tuning vs Prompt Engineering

Approach	From Scratch	Fine-Tuning (Full)	LoRA / PEFT	Prompt Engineering
Control	Total — you own everything	High	Moderate	Low
Cost	Very High ($10k–$1M+)	High ($100–$10k)	Low ($5–$100)	Near zero
Data needed	Billions of tokens	10k–1M examples	500–50k examples	0–10 examples
Time to first result	Days to months	Hours to days	Minutes to hours	Seconds
Best for	Proprietary domain, research	Domain specialisation	Task adaptation, tight budgets	Prototyping, off-the-shelf models

🏆

The Practitioner's Decision Tree

Start with prompt engineering — it's free and might be enough. If you need customisation, use LoRA fine-tuning on an existing open model (LLaMA 3, Mistral). Only build from scratch if your data is proprietary, your domain is highly specialised, or you need complete control over the model weights for regulatory reasons.

Section 15

Common Errors & How to Fix Them

Symptom	Likely Cause	Fix
Loss immediately NaN	Learning rate too high, bad data (inf/nan values)	Lower LR to 1e-4; run `torch.isnan(loss).any()` after each step; inspect your dataset
Loss stuck at ~10.4	Model outputting near-uniform distribution (untrained)	Check weight initialisation; verify gradient flow; ensure LR warms up correctly
Loss drops then spikes	Exploding gradients, bad batch in data	Clip gradients; lower max_norm; scan training file for corrupt entries
CUDA out of memory	Batch too large, sequence too long, no gradient checkpointing	Halve batch_size; reduce context_len; enable `torch.utils.checkpoint`
Model generates gibberish	Undertrained, temperature too high, context window exceeded	Train more; lower temperature to 0.6; crop input to context_len
Training works but eval is random	Forgot to call `model.eval()`	Always call `model.eval()` and `torch.no_grad()` before inference