The Story That Explains Large Language Models
Now you ask this intern: "Complete this sentence: The capital of France is…"
They say "Paris" — not because they memorised a flashcard, but because they've seen that phrase completed correctly thousands of times and learned the statistical pattern.
That intern is a Large Language Model. It doesn't "know" things the way you do. It predicts the most probable next token given everything that came before — and from billions of such predictions, something remarkably intelligent emerges.
A Large Language Model (LLM) is a neural network with billions of parameters trained on massive text corpora to predict the next token in a sequence. From this single objective — next-token prediction — LLMs learn grammar, facts, reasoning, code, translation, and much more, without any of these being explicitly programmed.
Language is a compressed representation of human thought. A model that perfectly predicts language must, by necessity, learn the underlying concepts, logic, and world knowledge that generate that language. Next-token prediction is a proxy for understanding.
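To make the next-token objective concrete, here is a minimal sketch that counts which word follows which in a toy corpus and predicts the most frequent continuation. The corpus, the `train_bigram` and `predict_next` names, and the use of whole words as tokens are illustrative assumptions. A real LLM replaces the count table with a neural network conditioned on the full preceding context.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration, not real training data)
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of italy is rome",
    "paris is the capital of france",
]

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev_word):
    """Return (most frequent next word, its estimated probability)."""
    following = counts[prev_word]
    total = sum(following.values())
    word, freq = following.most_common(1)[0]
    return word, freq / total

counts = train_bigram(corpus)
word, prob = predict_next(counts, "is")
print(f"P({word!r} | 'is') = {prob:.2f}")
```

The prediction is purely statistical: "paris" wins only because it follows "is" more often than any other word in this corpus.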
A Brief History — From N-Grams to GPT
Tokens — The Atoms of Language
import tiktoken # OpenAI's tokeniser library
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
text = "Large Language Models are transformers trained on massive datasets."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)} tokens")
# Decode individual tokens to see the chunks
for tok_id in tokens:
    print(f" {tok_id:6d} → '{enc.decode([tok_id])}'")
The Transformer Architecture — The Engine of LLMs
Nearly every modern LLM is built on the Transformer architecture. Understanding it is non-negotiable for building LLMs. The original Transformer has two components: the Encoder (which reads and represents the input) and the Decoder (which generates the output). Most LLMs (GPT family, LLaMA, Claude) use the decoder-only variant.
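As a reference point, a typical 7B decoder-only configuration (these are the LLaMA-2 7B values) looks like this: d_model = 4096 (hidden size) · n_heads = 32 (attention heads) · n_layers = 32 (transformer blocks) · d_ff = 11008 (FFN width) · vocab_size ≈ 32,000 tokens. Total parameters: ~7 billion.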
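The ~7 billion figure can be roughly reproduced from those hyperparameters. The sketch below is back-of-the-envelope arithmetic under three assumptions that the text does not state: four square attention projections per layer, a gated (SwiGLU-style) FFN with three weight matrices, and separate input-embedding and output-projection matrices. Normalisation weights and biases are ignored.

```python
# Hyperparameters quoted above
d_model, n_layers, d_ff, vocab_size = 4096, 32, 11008, 32_000

# Assumption: attention uses four d_model x d_model projections (Q, K, V, output)
attn_params = 4 * d_model * d_model

# Assumption: gated (SwiGLU-style) FFN with three d_model x d_ff matrices
ffn_params = 3 * d_model * d_ff

# Assumption: separate input-embedding and output-projection matrices
embed_params = 2 * vocab_size * d_model

per_layer = attn_params + ffn_params
total = n_layers * per_layer + embed_params

print(f"Attention per layer: {attn_params / 1e6:.1f}M")
print(f"FFN per layer:       {ffn_params / 1e6:.1f}M")
print(f"Embeddings:          {embed_params / 1e6:.1f}M")
print(f"Total:               {total / 1e9:.2f}B")
```

Under these assumptions the FFN accounts for about two thirds of each block, and the total lands at roughly 6.74B.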
Self-Attention — The Heart of the Transformer
Attention is the mechanism that lets each token decide, dynamically, how much weight to give every other token in the context when building its own representation.
import numpy as np
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, seq_len, d_k)
    mask:    optional (seq_len, seq_len) array; 0 marks positions to hide
    Returns: output (batch, seq_len, d_k), weights (batch, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    # Step 1: dot product of Q and K^T, scaled by sqrt(d_k)
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
    # Step 2: causal mask (decoder only sees past tokens)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # mask future tokens
    # Step 3: softmax → attention weights (subtract the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Step 4: weighted sum of values
    output = np.matmul(weights, V)
    return output, weights
# Toy example: 3 tokens, d_k=4
seq_len, d_k = 3, 4
Q = np.random.randn(1, seq_len, d_k)
K = np.random.randn(1, seq_len, d_k)
V = np.random.randn(1, seq_len, d_k)
output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (3×3 matrix):")
print(np.round(weights[0], 3))
print(f"\nOutput shape: {output.shape}")
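The toy example above calls the function without a mask, so every token can attend to every other token. As a minimal, self-contained sketch of the decoder-style case, the block below builds a lower-triangular causal mask with `np.tril` and applies it before the softmax. The `softmax` helper and the fixed random seed are assumptions added for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((1, seq_len, d_k))
K = rng.standard_normal((1, seq_len, d_k))

# Lower-triangular mask: row i has ones in columns 0..i, zeros for future positions
causal_mask = np.tril(np.ones((seq_len, seq_len)))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
scores = np.where(causal_mask == 0, -1e9, scores)  # hide future tokens
weights = softmax(scores)

print(np.round(weights[0], 3))
```

Each row still sums to 1, but everything above the diagonal is zero: the first token can only attend to itself, the second to the first two, and so on.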
Building a Minimal GPT from Scratch in PyTorch
A character-level language model is the simplest possible GPT. It reads text, tokenises it into individual characters, and trains to predict the next character. The same basic decoder-only design, scaled up by many orders of magnitude, underlies GPT-style production models. Understanding this tiny version is understanding the core.
Step 1 — Data Preparation
import torch
import torch.nn as nn
from torch.nn import functional as F
# ── Config ─────────────────────────────────────────────────
batch_size = 32
block_size = 128 # context window (tokens)
n_embd = 192 # embedding dimension
n_head = 6 # attention heads
n_layer = 6 # transformer blocks
dropout = 0.1
learning_rate = 3e-4
max_iters = 3000
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# ── Load text data ─────────────────────────────────────────
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Character-level vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = { ch: i for i, ch in enumerate(chars) } # char → index
itos = { i: ch for i, ch in enumerate(chars) } # index → char
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
# Train / val split
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))  # random start offsets
    x = torch.stack([d[i : i+block_size] for i in ix])      # inputs
    y = torch.stack([d[i+1 : i+block_size+1] for i in ix])  # targets: inputs shifted by one
    return x.to(device), y.to(device)
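The one-position offset between `x` and `y` is the whole training signal, so it is worth seeing on a tiny input. This sketch uses a made-up string, a context window of 4, and a fixed start index instead of a random one (all assumptions for readability) to list the context/target pairs that a single window yields.

```python
text = "hello world"
block_size = 4

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = [stoi[c] for c in text]

start = 0  # fixed start index instead of a random one, for readability
x = data[start : start + block_size]          # inputs
y = data[start + 1 : start + block_size + 1]  # targets, shifted by one

# Each position t yields one training example: context text[:t+1] → next character
pairs = []
for t in range(block_size):
    context = text[start : start + t + 1]
    target = text[start + t + 1]
    pairs.append((context, target))
    print(f"context {context!r:8} → target {target!r}")
```

One window of length 4 therefore supplies four prediction problems at once, which is why the model's logits have shape (B, T, vocab_size) rather than (B, vocab_size).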
Step 2 — The Transformer Blocks
# ── Single Attention Head ───────────────────────────────────
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Attention scores: scaled dot product
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)
        return wei @ v  # (B, T, head_size)
# ── Multi-Head Attention ────────────────────────────────────
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))
# ── Feed-Forward Network ────────────────────────────────────
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)
# ── Transformer Block ───────────────────────────────────────
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual + attention
        x = x + self.ffwd(self.ln2(x))  # residual + FFN
        return x
Step 3 — The Full GPT Model
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)              # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)  # output projection
        self.apply(self._init_weights)
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B,T,n_embd)
        # Use the input's device so the model also works when moved after construction
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb                      # (B,T,n_embd)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                   # (B,T,vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop to context
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]        # last token only
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # sample
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
Step 4 — Training Loop
model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
print(f"Model parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
for step in range(max_iters):
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"Step {step:4d} | loss: {loss.item():.4f}")
# ── Generate text ───────────────────────────────────────────
model.eval()  # disable dropout while sampling
context = torch.zeros((1,1), dtype=torch.long, device=device)
with torch.no_grad():
    generated = decode(model.generate(context, max_new_tokens=200)[0].tolist())
print(generated)
Pre-Training — Teaching the Model Everything
Pre-training is the most compute-intensive phase. The model sees trillions of tokens and learns to predict the next one. No labels, no human annotation — just raw text and the self-supervised next-token prediction objective. This is where the model "learns the world."
A raw pre-trained model is a next-token predictor. Ask it "What is the capital of France?" and it might continue with "…What is the capital of Germany? What is the capital of Spain?" — because it's seen quiz-style text. It doesn't "answer" questions. That behaviour requires the next phase: instruction fine-tuning and alignment.
Fine-Tuning — Teaching the Model to Be Helpful
Fine-tuning adapts a pre-trained LLM to a specific task or to follow instructions. There are three main approaches, each with a different cost/performance profile.
LoRA Fine-Tuning with Hugging Face + PEFT
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset
# ── Load base model ─────────────────────────────────────────
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # QLoRA: 4-bit quantisation
    device_map="auto",
)
# ── LoRA config ─────────────────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # rank of the adapter matrices
    lora_alpha=32,       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → about 16.8M trainable parameters, roughly 0.25% of the ~6.7B total
#   (4 modules × 32 layers × 16 × (4096 + 4096); exact output depends on the PEFT version)
# ── Dataset ─────────────────────────────────────────────────
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
# ── Training ────────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir="./llama2-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    save_steps=500,
)
# Where dataset_text_field and max_seq_length are passed varies across TRL releases;
# check the SFTTrainer docs for the version you have installed.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
With this configuration LoRA trains about 17 million parameters out of 6.7 billion (roughly 0.25%), yet often comes close to full fine-tuning quality. The adapter adds two small matrices A and B to each target layer such that the update is ΔW = B·A, where rank(ΔW) ≤ r. The low-rank assumption works because the weight changes needed for adaptation tend to lie in a low-dimensional subspace.
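The update ΔW = B·A can be illustrated without any deep-learning framework. The following NumPy sketch uses toy dimensions and a hypothetical `lora_forward` helper (neither comes from the PEFT library). It shows that the adapted layer matches the base layer at initialisation, that the rank of ΔW is bounded by r, and how few parameters the adapter adds.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8   # toy sizes, assumed for illustration

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init so ΔW starts at 0

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * (x A^T) B^T; W itself is never updated."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
y0 = lora_forward(x, W, A, B, alpha, r)     # identical to the base layer at init

B = rng.standard_normal((d_out, r))         # stand-in for B after some training
delta_W = (alpha / r) * B @ A               # effective weight update
full_params = W.size
lora_params = A.size + B.size

print("rank(ΔW):", np.linalg.matrix_rank(delta_W))
print(f"trainable: {lora_params} vs full: {full_params}")
```

Because B starts at zero, training begins from exactly the pre-trained behaviour. After training, `delta_W` can be added to `W` once, so inference costs nothing extra.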
RLHF — Making the Model Safe and Helpful
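RLHF (Reinforcement Learning from Human Feedback) works in three stages. Humans rate model outputs ("this response is helpful, that one is harmful"). A reward model learns to score responses. Then the LLM is optimised with reinforcement learning to maximise that reward, with the aim of producing helpful, harmless, and honest outputs. The code below uses Direct Preference Optimisation (DPO), a simpler alternative that trains directly on chosen/rejected response pairs without a separate reward model or RL loop.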
from trl import DPOTrainer, DPOConfig
from datasets import Dataset
# DPO Dataset format: prompt, chosen response, rejected response
dpo_data = {
    "prompt": ["Explain photosynthesis.", "What is 2+2?"],
    "chosen": ["Photosynthesis is the process by which plants use sunlight...",
               "2+2 equals 4."],
    "rejected": ["Plants eat sunlight lol",
                 "I don't know, maybe 5?"],
}
dataset = Dataset.from_dict(dpo_data)
dpo_config = DPOConfig(
    beta=0.1,  # KL penalty coefficient; lower = more aggressive updates
    max_prompt_length=512,
    max_length=1024,
    output_dir="./dpo_model",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
)
# Assumes `model` and `tokenizer` are already loaded (e.g. from the SFT step) and
# `ref_model` is a frozen copy of the SFT model. Argument names vary across TRL
# releases, so check the DPOTrainer docs for the version you have installed.
dpo_trainer = DPOTrainer(
    model=model,          # your SFT-fine-tuned model
    ref_model=ref_model,  # frozen copy of SFT model (KL reference)
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
dpo_trainer.train()
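What the trainer minimises can be written in a few lines. The sketch below implements the standard DPO loss for a single preference pair in plain Python. The log-probability values are made up for illustration, and `dpo_loss` is a hypothetical helper, not a TRL function.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """
    Inputs are summed log-probabilities of whole responses under the
    policy (pi_*) and the frozen reference model (ref_*).
    loss = -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical log-probabilities, made up for illustration
before = dpo_loss(pi_chosen=-12.0, pi_rejected=-12.0,
                  ref_chosen=-12.0, ref_rejected=-12.0)
after = dpo_loss(pi_chosen=-10.0, pi_rejected=-15.0,
                 ref_chosen=-12.0, ref_rejected=-12.0)

print(f"policy == reference:     {before:.4f}")
print(f"policy prefers 'chosen': {after:.4f}")
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's. `beta` controls how strongly that margin is rewarded.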
Inference — How LLMs Generate Text
Inference is the process of generating tokens from a trained model. The model takes an input (the prompt) and auto-regressively generates one token at a time. The way tokens are sampled from the output distribution dramatically affects output quality.
import torch
import torch.nn.functional as F
def generate(model, tokenizer, prompt, max_new_tokens=200,
             temperature=0.8, top_p=0.9, top_k=50):
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            outputs = model(inputs)
            logits = outputs.logits[:, -1, :]  # last token's distribution
            # ── Temperature ──────────────────────────────
            logits = logits / temperature
            # ── Top-k filter ─────────────────────────────
            top_k_vals, _ = torch.topk(logits, top_k)
            threshold = top_k_vals[:, [-1]]
            logits = logits.masked_fill(logits < threshold, -float('inf'))
            # ── Top-p filter (nucleus sampling) ──────────
            probs = F.softmax(logits, dim=-1)
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            cumulative = torch.cumsum(sorted_probs, dim=-1)
            to_remove = cumulative - sorted_probs > top_p
            sorted_probs[to_remove] = 0
            sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
            # ── Sample ───────────────────────────────────
            next_token_idx = torch.multinomial(sorted_probs, num_samples=1)
            next_token = sorted_idx.gather(-1, next_token_idx)
            inputs = torch.cat([inputs, next_token], dim=1)
            if next_token.item() == tokenizer.eos_token_id:  # assumes batch size 1
                break
    return tokenizer.decode(inputs[0], skip_special_tokens=True)
output = generate(model, tokenizer, "Explain quantum entanglement simply:")
print(output)
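The effect of the temperature step is easiest to see on a fixed set of logits. This sketch uses four made-up logit values and a small `softmax` helper (both assumptions for illustration) to print the resulting distribution at three temperatures.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])  # made-up logits for four tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(f"T={temperature}: {np.round(probs, 3)}")

cold = softmax(logits / 0.5)  # sharper: mass concentrates on the top token
hot = softmax(logits / 2.0)   # flatter: mass spreads across tokens
```

Dividing by a temperature below 1 widens the gaps between logits, so sampling becomes more conservative. A temperature above 1 shrinks the gaps and makes unlikely tokens more probable.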
Using LLMs via API — The Practical Path
For most practitioners, training an LLM from scratch is not the goal. The goal is to use LLMs. The fastest path is through APIs.
OpenAI / Anthropic API
from openai import OpenAI
import anthropic
# ── OpenAI ──────────────────────────────────────────────────
openai_client = OpenAI(api_key="sk-...")
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful data scientist."},
        {"role": "user", "content": "Explain p-values in one paragraph."},
    ],
    max_tokens=300,
    temperature=0.7,
)
print(response.choices[0].message.content)
# ── Anthropic (Claude) ───────────────────────────────────────
claude = anthropic.Anthropic(api_key="sk-ant-...")
message = claude.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Write a Python function to detect outliers using IQR."}
    ],
)
print(message.content[0].text)
Running LLMs Locally with Ollama + LangChain
# First: curl -fsSL https://ollama.ai/install.sh | sh
# Then: ollama pull llama3.2
from langchain_ollama import OllamaLLM
from langchain.prompts import PromptTemplate
# Completely free, runs on your machine
llm = OllamaLLM(model="llama3.2", temperature=0.7)
prompt = PromptTemplate(
    input_variables=["topic"],
    template="You are an expert data scientist. Explain {topic} in simple terms with an example.",
)
chain = prompt | llm
result = chain.invoke({ "topic": "gradient descent" })
print(result)
RAG — Retrieval-Augmented Generation
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# ── Step 1: Load and chunk documents ───────────────────────
loader = PyPDFLoader("company_docs.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# ── Step 2: Embed chunks and store in vector DB ────────────
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embedding_model)
# ── Step 3: Retrieval QA chain ──────────────────────────────
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    chain_type="stuff",  # stuff all chunks into prompt
    retriever=retriever,
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "What is our refund policy?"})
print("Answer:", result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
Offline: Chunk documents → embed each chunk → store vectors in database.
Online: Embed user query → find top-k similar chunks → inject into LLM prompt → LLM answers grounded in retrieved context.
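The offline and online phases can be sketched end to end without any external service. In the block below, `embed` is a hypothetical stand-in for an embedding model: it returns a normalised bag-of-words vector, which is far cruder than a learned embedding but enough to show the flow. The chunks and the query are made up for illustration.

```python
import numpy as np

# Toy document chunks (assumed for illustration)
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is open Monday to Friday.",
    "Shipping takes three to five business days.",
]

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

vocab = sorted({w for c in chunks for w in tokenize(c)})

def embed(text):
    """Hypothetical stand-in for an embedding model: normalised bag-of-words vector."""
    words = tokenize(text)
    vec = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Offline: embed every chunk once and store the vectors
index = np.stack([embed(c) for c in chunks])

def retrieve(query, k=1):
    # Online: embed the query and rank chunks by cosine similarity
    sims = index @ embed(query)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

query = "How many days do refunds take?"
context = retrieve(query, k=1)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: {query}"
print(prompt)
```

A production system swaps `embed` for a real embedding model and `index` for a vector database, but the two phases stay the same.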
Evaluating LLMs — How Do You Know It's Good?
| Evaluation Method | What It Measures | How It Is Scored | Best For |
|---|---|---|---|
| Perplexity | How well the model predicts held-out text. Lower = better. | Automatic | Comparing model versions during training |
| MMLU | Massive Multitask Language Understanding: 57 academic subjects, 4-choice MCQ | Automatic | General knowledge and reasoning capability |
| HumanEval | Pass@k on 164 Python coding problems from docstrings | Automatic | Code generation ability |
| MT-Bench | Multi-turn conversation quality, scored by GPT-4 as judge | LLM-as-judge | Instruction-following, helpfulness |
| ROUGE / BLEU | N-gram overlap between generated and reference text | Automatic | Summarisation, translation |
| Human evaluation | Direct human preference ratings (A vs B) | Human raters (expensive) | Gold standard for safety and helpfulness |
from evaluate import load
import numpy as np
# ── Perplexity ─────────────────────────────────────────────
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(
    predictions=["The model generated this sentence.", "Another example output."],
    model_id="gpt2",
)
print(f"Mean Perplexity: {results['mean_perplexity']:.2f}")
# ── ROUGE for summarisation ─────────────────────────────────
rouge = load("rouge")
predictions = ["The cat sat on the mat near the window."]
references = ["The cat sat on the mat."]
scores = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {scores['rouge1']:.4f}")
print(f"ROUGE-L: {scores['rougeL']:.4f}")
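Perplexity itself is a short formula: the exponential of the mean negative log-likelihood per token. As a minimal sketch, the block below computes it by hand from four hypothetical per-token probabilities (made-up values, not output from a real model).

```python
import math

# Hypothetical probabilities a model assigned to each correct next token
token_probs = [0.50, 0.25, 0.10, 0.80]

nll = [-math.log(p) for p in token_probs]  # negative log-likelihood per token
mean_nll = sum(nll) / len(nll)             # cross-entropy in nats
perplexity = math.exp(mean_nll)

print(f"Mean NLL:   {mean_nll:.4f}")
print(f"Perplexity: {perplexity:.2f}")

# Sanity check: a uniform guess over 50 tokens has perplexity 50
uniform = math.exp(-math.log(1 / 50))
print(f"Uniform over 50 tokens: {uniform:.1f}")
```

A perplexity of about 3 can be read as the model being, on average, as uncertain as if it were choosing uniformly among about three tokens.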
LLM Architecture Comparison
| Model | Params | Architecture | Key Innovation | Open Source? |
|---|---|---|---|---|
| GPT-2 | 1.5B | Decoder-only | First large-scale LM to show emergent capabilities | Yes |
| GPT-3 | 175B | Decoder-only | Few-shot in-context learning at scale | No |
| BERT | 340M | Encoder-only | Bidirectional masking; NLU benchmark dominance | Yes |
| T5 | 11B | Encoder-Decoder | "Text-to-text" unified framing for all NLP tasks | Yes |
| LLaMA-3 | 8B–405B | Decoder-only | Open weights; GQA; RoPE; competitive with closed models | Yes |
| Mistral 7B | 7B | Decoder-only | Sliding window attention; beats LLaMA-2 13B at 7B size | Yes |
| GPT-4o | Undisclosed | Undisclosed (rumoured MoE decoder) | Multimodal (text, vision, audio); strong public benchmark scores | No |
| Claude 3.5+ | Undisclosed | Decoder-only | Constitutional AI; long context; coding excellence | No |
Prompt Engineering — Communicating with LLMs
You don't need to train an LLM to get dramatically better results. The right prompt can unlock capabilities the model already has. Prompt engineering is now a professional skill.
from openai import OpenAI
client = OpenAI()
# ── Chain-of-Thought example ────────────────────────────────
cot_prompt = """Solve the following problem. Think through it step by step.
Problem: A train travels 120 km at 60 km/h, then 80 km at 40 km/h.
What is the average speed for the whole journey?
Think step by step:"""
# ── Few-Shot example ────────────────────────────────────────
few_shot_prompt = """Classify the sentiment of each review.
Review: "The food was absolutely amazing!" → Positive
Review: "Terrible service, never coming back." → Negative
Review: "It was okay, nothing special." → Neutral
Review: "Best burger I've ever had, will definitely return!" → """
# ── Structured JSON output ───────────────────────────────────
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a JSON API. Always respond with valid JSON only. No explanation."},
        {"role": "user",
         "content": "Extract: name, age, city from: 'Alice Smith, 29, lives in London.'"},
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
# → {"name": "Alice Smith", "age": 29, "city": "London"}
Golden Rules for LLM Practitioners
Use response_format={"type": "json_object"} or use tools/function calling. Parsing free-form text in production is a maintenance nightmare.
Structured outputs are parseable and testable, and their shape is predictable even though the content can still vary between calls.
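Requesting JSON is only half of the job, because the reply still has to be checked before downstream code relies on it. The sketch below validates a reply against an assumed three-field schema using only the standard library. `parse_person` and `REQUIRED_FIELDS` are hypothetical names, and the reply strings are hard-coded stand-ins for model output.

```python
import json

REQUIRED_FIELDS = {"name": str, "age": int, "city": str}  # assumed schema

def parse_person(raw):
    """Parse a model reply and check it against the expected fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} should be {expected_type.__name__}")
    return data

reply = '{"name": "Alice Smith", "age": 29, "city": "London"}'
person = parse_person(reply)
print(person["name"], person["age"], person["city"])

# A reply with the wrong type for "age" is rejected instead of passed along
try:
    parse_person('{"name": "Bob", "age": "unknown", "city": "Paris"}')
    rejected = False
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```

Since `json.JSONDecodeError` is a subclass of `ValueError`, a single `except ValueError` covers both malformed JSON and schema violations, which keeps retry logic simple.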