
Parameter-Efficient Fine-Tuning

A comprehensive, story-driven tutorial on Parameter-Efficient Fine-Tuning (PEFT) — covering LoRA, Prefix Tuning, Adapters, and QLoRA with animated SVG diagrams, full Python implementations, and practical golden rules.

Section 01

The Story That Explains Parameter-Efficient Fine-Tuning

The Expert Chef and the New Restaurant
Imagine a world-class chef who spent 20 years mastering every cuisine on Earth. One day, a restaurant hires them to serve only Neapolitan pizza.

Now, you have two choices. Option A: Re-train the chef from scratch — 20 years of school again, focused only on pizza. Expensive, slow, and wasteful. Option B: Give the chef a small specialisation course — two weeks on Neapolitan dough, San Marzano tomatoes, and wood-fired ovens. The chef keeps all their knowledge, but gains the specific skill you need.

Large Language Models are that chef. Training them from scratch on your task costs hundreds of thousands of dollars and months of compute. Parameter-Efficient Fine-Tuning (PEFT) is the two-week specialisation course — you get the specialised output for a fraction of the cost.

A modern LLM like GPT-3 or LLaMA-3 has billions of parameters. Updating all of them on a task-specific dataset is called full fine-tuning. It runs for far fewer steps than pre-training, but every step needs the same per-parameter memory: weights, gradients, and optimiser states for the whole model. For most teams, that hardware is simply out of reach.

PEFT is a family of techniques that fine-tune only a tiny fraction of parameters — sometimes less than 0.1% of the model — while keeping the vast majority frozen. The result: near-full-fine-tuning performance at a fraction of the cost.

🧠
Why This Matters Today

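A 70-billion-parameter LLM in 16-bit precision needs ~140 GB of GPU memory just to store the weights. Add gradients and Adam optimiser states during full fine-tuning and, at roughly 4× the weight size, you need ~560 GB: eight A100 80 GB GPUs before counting activations. LoRA removes almost all of the gradient and optimiser memory, and QLoRA additionally stores the frozen weights in 4-bit (~35 GB for 70B), which brings the same model within reach of a single 48–80 GB GPU. Models in the 7–13B range fit on a consumer RTX 4090. PEFT is not a shortcut; it is what makes LLM customisation accessible.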
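The arithmetic above can be reproduced in a few lines of Python. This is a back-of-the-envelope sketch, not a profiler: the function names are illustrative, the 4× multiplier is the "weights + gradients + two Adam moment tensors" rule of thumb used in this tutorial, and activation memory is ignored.

```python
# Rough GPU-memory estimates for holding and fully fine-tuning an LLM.
# Assumptions: 1 GB = 1e9 bytes; activation memory is ignored.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Memory needed just to store the weights."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def full_finetune_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Rule of thumb: weights + gradients + two Adam moments = 4x the weights."""
    return 4 * weight_memory_gb(n_params, dtype)


if __name__ == "__main__":
    n = 70e9  # 70-billion-parameter model
    print(f"fp16 weights:        {weight_memory_gb(n, 'fp16'):.0f} GB")
    print(f"full fine-tuning:    {full_finetune_memory_gb(n, 'fp16'):.0f} GB")
    print(f"4-bit frozen weights: {weight_memory_gb(n, '4bit'):.0f} GB")
```

Swapping in a different parameter count or precision gives a quick feasibility check before renting hardware.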


Section 02

The Problem — Why Full Fine-Tuning Is Impractical

Before understanding PEFT, you need to feel the pain it solves. Here is what full fine-tuning actually requires:

💾
Memory Explosion
4× weight size
Training with Adam keeps four tensors per parameter: the weight, its gradient, and two moment estimates. A 7B model needs ~28 GB for fp32 weights alone, so roughly 112 GB before activations.
⏱️
Compute Cost
Days → Weeks
Full fine-tuning GPT-3 sized models can cost $10,000–$100,000+ on cloud GPUs even for relatively small datasets.
🗂️
Storage Overhead
One copy per task
Deploy 10 task-specific models? That's 10 full copies of a 7B model — 140 GB of storage, minimum.
⚠️
Catastrophic Forgetting

Full fine-tuning also risks catastrophic forgetting, where the model's general capabilities degrade as it over-specialises on a narrow training set. PEFT methods reduce this risk by keeping the pre-trained weights frozen, although the added parameters can still shift behaviour on out-of-domain inputs.

| Approach | Trainable Params | GPU Memory (7B model) | Task Checkpoints | Risk of Forgetting |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | ~80–120 GB | Full copy per task | High |
| LoRA | 0.1–1% | ~16–24 GB | ~10–50 MB per task | Very low |
| Prefix Tuning | ~0.1% | ~18 GB | ~5–20 MB per task | Minimal |
| Adapters | 0.5–3% | ~20–30 GB | ~20–80 MB per task | Very low |
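To make the storage column concrete, here is a small sketch comparing one full copy per task against one shared base model plus a small checkpoint per task. The helper names are illustrative, and the 14 GB (7B model in fp16) and 30 MB figures are assumptions taken from the ranges quoted in this section.

```python
# Storage needed to serve N task-specific models.
# Assumptions: 1 GB = 1000 MB; sizes are rough figures from this section.

def full_finetune_storage_gb(n_tasks: int, model_gb: float) -> float:
    """Every task keeps its own full copy of the model."""
    return n_tasks * model_gb


def peft_storage_gb(n_tasks: int, model_gb: float, adapter_mb: float) -> float:
    """One shared frozen base model plus one small adapter per task."""
    return model_gb + n_tasks * adapter_mb / 1000.0


if __name__ == "__main__":
    tasks, base_gb, adapter_mb = 10, 14.0, 30.0
    print(f"full copies: {full_finetune_storage_gb(tasks, base_gb):.1f} GB")
    print(f"base + adapters: {peft_storage_gb(tasks, base_gb, adapter_mb):.1f} GB")
```

Under these assumptions, ten tasks cost about 140 GB as full copies and about 14.3 GB with adapters.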

Section 03

LoRA — Low-Rank Adaptation

The Shortcut Note in the Margin
Imagine a university textbook — 1,000 pages of condensed knowledge. A professor wants to adapt it for a specific course, but can't reprint the book. Instead, they write a thin annotation booklet — 20 pages of margin notes, corrections, and additions. Students read both: the original textbook plus the annotation. The result is a fully customised course without touching the original manuscript.

LoRA works exactly like that. The pre-trained weight matrix is the textbook. The two small low-rank matrices are the annotation booklet. At inference, you simply add them together.

The Core Idea — Low-Rank Decomposition

LoRA, introduced by Hu et al. (2021), rests on a key insight: the weight updates during fine-tuning have a low intrinsic rank. This means the change matrix ΔW doesn't need to be full-sized — it can be expressed as the product of two much smaller matrices.

Full Fine-Tuning Update
W' = W₀ + ΔW
ΔW has the same shape as W₀. For a 4096×4096 weight: 16.7M parameters to train.
LoRA Reparameterisation
W' = W₀ + B·A
A ∈ ℝ^(r×d), B ∈ ℝ^(d×r), rank r ≪ d. With r=8: only 2×4096×8 = 65K parameters.
Scaling Factor
ΔW = (α/r) · B·A
α is a hyperparameter that scales the LoRA contribution. Controls influence without retuning learning rate.
Parameter Savings
Ratio = 2rd / d²
For d=4096 and r=8: ratio = 2×8/4096 ≈ 0.39%. A 256× reduction in trainable parameters.
[Diagram: LoRA architecture inside a transformer layer. Input x flows through the frozen W₀ (pre-trained, no gradients) and, in parallel, through the trainable low-rank path A (r × d) followed by B (d × r), with rank r ≪ d. The two outputs are summed: W₀·x + (α/r)·B·A·x.]

Animation: data flows through frozen W₀ (top path) and trainable low-rank B·A (bottom path). Their outputs are summed element-wise. Only A and B accumulate gradients.
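The two-path forward pass described above can be sketched without any framework. The class below is a toy, dependency-free illustration (the name LoRALinear and its methods are made up for this example, not the PEFT API). It follows the initialisation from the LoRA paper: A starts as small Gaussian noise and B starts at zero, so the layer initially behaves exactly like the frozen base layer.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Toy LoRA layer: y = W0 x + (alpha / r) * B (A x). Illustrative only."""

    def __init__(self, W0, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W0), len(W0[0])
        self.W0 = W0                                     # frozen, never updated
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(r)]                     # r x d_in, trainable
        self.B = [[0.0] * r for _ in range(d_out)]       # d_out x r, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W0, x)                        # frozen path
        update = matvec(self.B, matvec(self.A, x))       # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W0 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 2.0, 0.0, 0.0],
      [0.0, 0.0, 3.0, 0.0],
      [0.0, 0.0, 0.0, 4.0]]
layer = LoRALinear(W0, r=2, alpha=4)
x = [1.0, 1.0, 1.0, 1.0]
print(layer.forward(x))        # equals W0 x at initialisation because B is zero
print(layer.num_trainable())   # 2 * r * d trainable values
```

Because B is zero at the start, training begins from the pre-trained behaviour and only drifts as A and B receive gradient updates.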

What Gets LoRA Applied To?

In practice, LoRA is most often applied to the query (Q) and value (V) projection matrices inside every transformer attention layer. Some implementations also target the key (K), the output projection, and the feed-forward layers. The original paper reported that, for a fixed parameter budget, adapting Q and V together gave the best results on its benchmarks.

🔍 LoRA Targets Inside a Transformer Self-Attention Block
Q proj
Most commonly targeted. Query matrix shapes task-specific attention patterns. LoRA here redirects what the model "looks for".
V proj
Value matrix shapes what information gets passed forward. Second most impactful target.
K proj
Key matrix. Often skipped in cost-conscious implementations; adding it rarely hurts but increases parameters.
FFN
Feed-forward layers. QLoRA and DoRA often add LoRA here for domain-shift tasks (e.g. code → medicine).
Output
Output projection. Occasionally targeted in instruction-following fine-tunes for better format adherence.

LoRA Hyperparameters

| Parameter | Typical Range | What It Controls | Practical Advice |
|---|---|---|---|
| r (rank) | 4 – 64 | Expressiveness of the update; higher = more parameters | Start with r=8. Use r=16–32 for complex domain shifts. |
| lora_alpha | 8 – 128 | Scales the LoRA contribution: α/r multiplies B·A | Set alpha = 2×r as a safe default. Some use alpha = r. |
| lora_dropout | 0.0 – 0.1 | Dropout on the LoRA pathway for regularisation | 0.05 is a safe default; 0.0 works fine for large datasets. |
| target_modules | ["q_proj","v_proj"] | Which weight matrices receive LoRA adapters | Start q+v. Add k+o+ffn if task is hard domain shift. |
| bias | "none" / "lora_only" | Whether to train bias terms | "none" (frozen biases) is standard and saves memory. |
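The interaction between r and lora_alpha is easier to see numerically. The helper below is an illustrative sketch (lora_scale is a made-up name): standard LoRA multiplies B·A by α/r, while rank-stabilised LoRA (the use_rslora option in PEFT) uses α/√r instead.

```python
import math


def lora_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to B·A: alpha/r, or alpha/sqrt(r) for rank-stabilised LoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


if __name__ == "__main__":
    for r in (4, 8, 16, 32, 64):
        print(
            f"r={r:>2}  alpha=2r -> {lora_scale(2 * r, r):.2f}   "
            f"alpha=16 -> {lora_scale(16, r):.3f}   "
            f"alpha=16, rsLoRA -> {lora_scale(16, r, use_rslora=True):.2f}"
        )
```

With alpha = 2×r the multiplier stays at 2.0 for every rank, which is why that default lets you change r without retuning the learning rate. With a fixed alpha the update shrinks as r grows, and rsLoRA shrinks it more slowly.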

Full LoRA Implementation with HuggingFace PEFT

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer

# ── 1. Load base model (frozen) ───────────────────────────
model_id = "meta-llama/Llama-3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",       # auto-shard across available GPUs
)

# ── 2. Define LoRA configuration ──────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                        # rank: expressiveness of the low-rank update
    lora_alpha=32,               # scaling factor (alpha/r = 2.0 here)
    target_modules=[              # apply LoRA to these projection layers
        "q_proj", "k_proj",
        "v_proj", "o_proj"
    ],
    lora_dropout=0.05,           # light dropout for regularisation
    bias="none",                 # keep biases frozen
    use_rslora=False,            # rank-stabilised LoRA (set True for large r)
)

# ── 3. Wrap model with LoRA adapters ──────────────────────
model = get_peft_model(model, lora_config)

# Inspect how many parameters are actually trainable
model.print_trainable_parameters()

# ── 4. Load a sample instruction dataset ─────────────────
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_instruction)

# ── 5. Training arguments ─────────────────────────────────
training_args = TrainingArguments(
    output_dir="./lora-llama3-alpaca",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 16
    warmup_ratio=0.03,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=50,
    save_strategy="epoch",
    optim="adamw_8bit",              # bitsandbytes 8-bit AdamW: ~75% smaller optimiser state
    report_to="none",
)

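# ── 6. Supervised Fine-Tuning Trainer ────────────────────
# Argument names follow older TRL releases; newer ones move
# dataset_text_field and max_seq_length into SFTConfig and
# rename tokenizer= to processing_class=.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)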

trainer.train()

# ── 7. Save only the LoRA adapter weights (~30 MB) ───────
model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")
OUTPUT
trainable params: 10,485,760 || all params: 3,221,225,472 || trainable%: 0.3255%
  0%|          | 0/375 [00:00<?, ?it/s]
 26%|██▌       | 98/375 [04:12<11:53, 2.58s/it] loss=1.4821
 52%|█████▏    | 195/375 [08:26<07:45, 2.59s/it] loss=1.1034
 78%|███████▊  | 292/375 [12:39<03:34, 2.58s/it] loss=0.9876
100%|██████████| 375/375 [16:12<00:00, 2.59s/it] loss=0.9241
Training completed. Saved LoRA adapter (32 MB) to ./lora-adapter

Merging LoRA Back Into the Base Model (for deployment)

from peft import PeftModel

# Load base model + LoRA adapter, then merge into a single model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model_with_lora = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge: folds (α/r)·B·A into W₀ permanently, so inference has no extra cost
merged_model = model_with_lora.merge_and_unload()

# The merged model has the same architecture and latency as the base model
merged_model.save_pretrained("./merged-model")

# Inference
inputs = tokenizer(
    "### Instruction:\nExplain gradient descent in simple terms\n\n### Response:\n",
    return_tensors="pt"
).to(merged_model.device)    # follow the device chosen by device_map="auto"

with torch.no_grad():
    output = merged_model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,      # temperature only takes effect when sampling
        temperature=0.7,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
Zero Inference Overhead After Merging

After merge_and_unload(), the LoRA matrices are mathematically absorbed into W₀. The resulting model is computationally identical to a fully fine-tuned model — there is no extra latency at inference time. This is one of LoRA's biggest practical advantages over Adapters, which add layers that must be processed at runtime.
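Why merging costs nothing at inference follows from linearity: W₀x + (α/r)·B(Ax) equals (W₀ + (α/r)·BA)x. The dependency-free sketch below checks this on a tiny example. The helper names and the numbers are made up for illustration; PEFT's merge_and_unload() applies the same idea to real weight tensors.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x k) and Y (k x m)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def merge_lora(W0, A, B, alpha):
    """Return W0 + (alpha / r) * B A, where r is the number of rows of A."""
    scale = alpha / len(A)
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]


W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]          # frozen weight, 2 x 3
A = [[0.5, -1.0, 2.0]]          # r x d_in with r = 1
B = [[1.0], [-2.0]]             # d_out x r
alpha = 2.0
x = [1.0, 2.0, 3.0]

# Two-path forward pass (what runs during training)
scale = alpha / len(A)
two_path = [b + scale * u
            for b, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Single-matrix forward pass (what runs after merging)
merged = matvec(merge_lora(W0, A, B, alpha), x)

print(two_path, merged)   # both paths give the same output vector
```

After merging there is only one matrix multiply per layer, exactly as in the original model.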


Section 04

Prefix Tuning

The Stage Directions Before the Play
Imagine a theatre company that has a brilliant troupe of actors (the pre-trained model). They know every Shakespearean play by heart. But tonight, you need them to perform it as a noir detective thriller. You don't retrain the actors — you give them secret stage directions at the very start of the performance: costumes, mood, accents, lighting cues.

Those stage directions before the curtain rises — that is Prefix Tuning. You prepend a sequence of learned, trainable "virtual tokens" to every layer of the transformer. The model never sees them as real text, but they steer every attention calculation that follows, shaping the model's output from the inside out.

How Prefix Tuning Works

Introduced by Li and Liang (2021), Prefix Tuning prepends trainable continuous vectors (called the prefix) to the key (K) and value (V) matrices of every transformer attention layer. These are not real tokens from the vocabulary — they are free-floating vectors optimised end-to-end.

[Diagram: Prefix Tuning attention flow. In every transformer attention layer, trainable prefix vectors P₁…P₅ are concatenated with the frozen input tokens T₁…T₅, so K = [P_K ; T_K] and V = [P_V ; T_V], while Q = T_Q comes from the tokens only. Attention is softmax(QKᵀ/√d)·V, giving a task-steered output.]

Prefix vectors are prepended to K and V at every layer. Token queries attend to both prefix and real tokens. Only prefix vectors (blue) accumulate gradients — the base model is untouched.
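A single attention head with a prefix can be sketched in plain Python. This is a toy illustration with made-up names and numbers, not the PEFT implementation: the prefix contributes extra rows to K and V, the queries still come only from the real tokens, and the output keeps one vector per real token.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V))
                        for j in range(len(V[0]))])
        weights.append(w)
    return outputs, weights


def prefix_attention(Q, K, V, prefix_K, prefix_V):
    """Prepend trainable prefix keys/values; queries are left untouched."""
    return attention(Q, prefix_K + K, prefix_V + V)


# Two real tokens (frozen projections) and one trainable prefix vector
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
P_K = [[1.0, 1.0]]
P_V = [[0.5, 0.5]]

out, weights = prefix_attention(Q, K, V, P_K, P_V)
print(len(out), len(weights[0]))   # 2 output vectors, 3 attention slots each
```

Each query now spreads its attention over three positions (one prefix plus two tokens), which is how the prefix steers the output without changing any frozen weight.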

Prefix Tuning vs Prompt Tuning

| Property | 🔵 Prefix Tuning (Li & Liang 2021) | 🟢 Prompt Tuning (Lester et al. 2021) |
|---|---|---|
| Where inserted | Every layer's K, V | Input layer only |
| Prefix length | 10–200 tokens | 10–100 tokens |
| Params trained | ~0.1% | ~0.01% |
| Performance | Near full FT on NLG | Needs very large model (11B+) |
| Inference cost | Slightly higher (extra K/V) | Minimal overhead |
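The gap in trainable parameters between the two methods follows directly from where the vectors are inserted. The sketch below uses illustrative function names and assumes no re-parameterisation MLP: Prefix Tuning stores one key and one value vector per virtual token in every layer, while Prompt Tuning stores a single embedding per virtual token at the input.

```python
def prefix_tuning_params(n_layers: int, n_virtual_tokens: int, d_model: int) -> int:
    """One K vector and one V vector per virtual token, in every layer."""
    return n_layers * 2 * n_virtual_tokens * d_model


def prompt_tuning_params(n_virtual_tokens: int, d_model: int) -> int:
    """One input embedding per virtual token; deeper layers are untouched."""
    return n_virtual_tokens * d_model


if __name__ == "__main__":
    # Assumed shape: 12 layers, hidden size 768, 30 virtual tokens
    print(prefix_tuning_params(12, 30, 768))   # 552960
    print(prompt_tuning_params(30, 768))       # 23040
```

For the same number of virtual tokens, Prefix Tuning trains 2 × n_layers times as many values as Prompt Tuning.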

Prefix Tuning Implementation

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
import torch

# ── 1. Load a seq2seq model (T5-base for summarisation) ───
model_id = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# ── 2. Prefix Tuning config ───────────────────────────────
prefix_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=30,          # 30 virtual prefix tokens per layer
    encoder_hidden_size=768,          # match T5-base hidden dim
    prefix_projection=True,          # MLP re-parameterisation for stability
)

# ── 3. Apply prefix tuning ────────────────────────────────
model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()

# ── 4. Forward pass example ───────────────────────────────
# Prefix tokens are automatically prepended inside the PEFT wrapper
input_text = "summarize: The rapid advancement of large language models has transformed..."
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=60, num_beams=4)
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Summary: {summary}")

# ── 5. Training loop (simplified) ────────────────────────
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:2000]")

training_args = Seq2SeqTrainingArguments(
    output_dir="./prefix-t5-cnn",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    predict_with_generate=True,
    learning_rate=1e-3,           # prefix tuning uses a HIGHER lr than LoRA
    warmup_steps=100,
    bf16=True,                    # T5 tends to overflow in fp16; use bf16, or drop this flag for fp32
    save_strategy="epoch",
)
OUTPUT
trainable params: 368,640 || all params: 248,548,352 || trainable%: 0.1482%
💡
Use prefix_projection=True for Stability

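Directly optimising the prefix vectors can be unstable. The prefix_projection flag adds an MLP re-parameterisation: a learned embedding is fed through the MLP, and the MLP outputs are the actual prefix vectors injected into K and V. The MLP's weights are trained as well, so the trainable-parameter count during training is larger than the prefix alone, but only the resulting prefix vectors need to be kept afterwards. Li and Liang found that this makes optimisation noticeably more stable, so it is a sensible default.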


Section 05

Adapters — Bottleneck Layers Between Frozen Weights

The USB-C Dongle for Your Laptop
Your laptop is a powerful machine. You never open it up and solder new chips — that would void the warranty and is enormously risky. Instead, you plug in a small dongle that adds the functionality you need: HDMI out, extra USB, an Ethernet port. The dongle is tiny, cheap, and completely removable. Your laptop hardware never changes.

Adapter modules are the dongles for neural networks. Small two-layer bottleneck networks are inserted between each frozen transformer sub-layer. Each adapter learns task-specific transformations. Swap adapters to switch tasks — no retraining required. Your base model is always the same laptop.

Adapter Architecture — The Bottleneck Module

Houlsby et al. (2019) introduced the original Adapter architecture. Each adapter is a down-project → non-linearity → up-project bottleneck, with a residual connection:

Down Projection
h = W_down · x
Projects from d_model (e.g. 768) down to bottleneck dimension d_b (e.g. 64). Compresses information.
Non-linearity
h = ReLU(h)
Standard activation. Some variants use GELU or SiLU. Adds non-linearity to the bottleneck.
Up Projection
h = W_up · h
Projects back to d_model. Same dimension as input so the residual addition works.
Residual Output
output = x + h
Residual ensures that at initialisation (W_up≈0), the adapter is a near-identity. Safe to insert anywhere.
[Diagram: adapter modules inside a frozen transformer block. Left: Input → Self-Attention (frozen) → Adapter 1 (d → d_b → d plus skip, trainable) → LayerNorm → Feed-Forward (FFN) → Adapter 2 (optional). Right, adapter anatomy: x ∈ ℝ^d → W_down ∈ ℝ^(d_b×d) → ReLU → W_up ∈ ℝ^(d×d_b), with a skip connection, so output = x + W_up·ReLU(W_down·x). Parameter count for d=768, d_b=64: 2 × 768 × 64 = 98,304, versus 768 × 768 ≈ 590K for a full d×d matrix (6× fewer).]

Left: adapter modules inserted after attention and optionally after FFN in a frozen transformer block. Right: the internal bottleneck structure of a single adapter — down-project, activate, up-project, plus a residual skip connection that ensures identity behaviour at initialisation.
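The bottleneck module can be written out in a few lines of plain Python. This is a toy sketch with illustrative names and no bias terms, not the AdapterHub implementation. W_up is initialised to zero here so that the adapter starts as an exact identity, which is the limiting case of the near-identity initialisation described above.

```python
import random


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def relu(v):
    return [max(0.0, t) for t in v]


class BottleneckAdapter:
    """Toy adapter: output = x + W_up · ReLU(W_down · x). Illustrative only."""

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: d_bottleneck x d_model, small random values
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        # Up-projection: d_model x d_bottleneck, zeros so the adapter starts as identity
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def forward(self, x):
        h = relu(matvec(self.W_down, x))   # compress and activate
        h = matvec(self.W_up, h)           # expand back to d_model
        return [xi + hi for xi, hi in zip(x, h)]   # residual connection


def adapter_param_count(d_model: int, d_bottleneck: int) -> int:
    """Weights in W_down plus W_up (biases omitted)."""
    return 2 * d_model * d_bottleneck


adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
print(adapter.forward(x) == x)          # identity at initialisation
print(adapter_param_count(768, 64))     # 98304
```

Inserting such a module into a frozen network therefore leaves its outputs unchanged until training moves W_up away from zero.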

Adapter Implementation with AdapterHub

# HuggingFace PEFT does not ship a Houlsby-style bottleneck adapter at the time
# of writing, so this example uses the AdapterHub `adapters` library
# (the successor to adapter-transformers).
# pip install adapters
import numpy as np
from adapters import AdapterTrainer, AutoAdapterModel
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments

# ── Load model with adapter support ──────────────────────
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoAdapterModel.from_pretrained("roberta-base")

# ── Add a Houlsby-style bottleneck adapter ────────────────
# "double_seq_bn" is the adapters-library name for the Houlsby configuration.
# Its default reduction_factor=16 means bottleneck dim = hidden/16 = 768/16 = 48
model.add_adapter("sentiment-adapter", config="double_seq_bn")

# Add classification head on top
model.add_classification_head("sentiment-adapter", num_labels=2)

# Activate the adapter: freezes the base model, trains adapter and head only
model.train_adapter("sentiment-adapter")

# Inspect trainable parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")

# ── Dataset preparation ───────────────────────────────────
raw = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

def prepare(ds):
    ds = ds.map(tokenize, batched=True)
    ds = ds.rename_column("label", "labels")
    ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
    return ds

train_dataset = prepare(raw["train"].select(range(3000)))
eval_dataset = prepare(raw["validation"])   # needed for load_best_model_at_end

# ── Training ──────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir="./adapter-roberta-sst2",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=1e-4,           # adapters use higher lr than full fine-tuning
    fp16=True,
    logging_steps=50,
    eval_strategy="epoch",        # named evaluation_strategy in older transformers releases
    save_strategy="epoch",        # must match the eval strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

trainer = AdapterTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

# ── Save only the adapter (very small checkpoint) ─────────
model.save_adapter("./saved-adapter", "sentiment-adapter")
print("Adapter saved!")

# ── Load later: combine any adapter with the base model ───
model2 = AutoAdapterModel.from_pretrained("roberta-base")
model2.load_adapter("./saved-adapter", load_as="sentiment-adapter")
model2.set_active_adapters("sentiment-adapter")
OUTPUT
Trainable: 895,490 / 125,083,392 (0.72%)
Epoch 1/5: loss=0.4821 eval_accuracy=0.8820
Epoch 2/5: loss=0.3012 eval_accuracy=0.9150
Epoch 3/5: loss=0.2244 eval_accuracy=0.9280
Epoch 4/5: loss=0.1891 eval_accuracy=0.9310
Epoch 5/5: loss=0.1654 eval_accuracy=0.9350
Adapter saved! (checkpoint size: 3.4 MB)

Section 06

Head-to-Head — LoRA vs Prefix Tuning vs Adapters

| Property | LoRA | Prefix Tuning | Adapters |
|---|---|---|---|
| Where changes happen | Inside weight matrices (low-rank ΔW) | Prepended to K, V at every layer | New layers inserted into block |
| Trainable params | 0.1–1% | ~0.1% | 0.5–3% |
| Inference overhead | Zero (merge into W) | Small (extra K/V tokens) | Yes (extra layers processed) |
| Architecture change | No (merged away) | No permanent change | Yes (new module) |
| Memory during training | Lowest | Low | Slightly higher |
| Best for | Instruction tuning, chat, code | Seq2seq, summarisation, translation | Multi-task, modular task switching |
| Task modularity (swap adapters) | Swap unmerged adapters; re-merge only if merged | Swap prefix vectors | Native plug-and-play |
| Ecosystem / tooling | Excellent (HF PEFT, QLoRA, DoRA) | Good (HF PEFT) | Good (AdapterHub) |
🏆
The 2024 Community Consensus

LoRA and its variants (QLoRA, DoRA, LoRA+) dominate LLM fine-tuning in practice. LoRA keeps training memory low, merges away its inference overhead, and is supported by most major fine-tuning frameworks. Adapters remain strong for multi-task NLP pipelines where modular task-switching matters. Prefix Tuning does well on seq2seq generation tasks, particularly when the target task resembles what the model saw during pre-training.


Section 07

QLoRA — Quantised LoRA (The Game-Changer)

The Compressed Encyclopedia on a Thumb Drive
Imagine a 32-volume encyclopedia that normally needs an entire bookshelf (80 GB). QLoRA takes that encyclopedia and stores it in a compressed format on a thumb drive (about 20 GB), using 4-bit encoding instead of 16-bit. When you need to read a passage, it temporarily decompresses only that volume in a high-quality format. When you want to annotate (fine-tune), you use LoRA's annotation booklet system on top of the compressed encyclopedia.

The result: 65-billion-parameter models fine-tuned on a single 48 GB GPU, and 33B models on a 24 GB consumer card.

QLoRA (Dettmers et al. 2023) stacks three innovations on top of LoRA:

⚙️
NF4 Quantisation
4-bit NormalFloat
The base model is stored in 4-bit NormalFloat (NF4) format, which is information-theoretically optimal for normally distributed weights. Cuts GPU memory ~4×.
🔀
Double Quantisation
Quantise the quantisation
The quantisation constants themselves are quantised, saving an additional ~0.37 bits per parameter: roughly 3 GB on a 65B model, or about 0.3 GB on a 7B model.
Paged Optimisers
NVIDIA unified memory
Optimiser state is paged between GPU and CPU RAM during memory spikes (for example, a batch with unusually long sequences). This prevents OOM errors while costing little training speed on average.
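The ~0.37 bits-per-parameter figure for double quantisation can be re-derived with simple arithmetic. The sketch below uses illustrative function names and assumes the block sizes described in the QLoRA paper: one 32-bit constant per block of 64 weights, then those constants stored in 8 bits with one 32-bit second-level constant per 256 blocks.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Per-parameter overhead of one full-precision constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    first_level_bits: int = 8,
    second_block_size: int = 256,
    second_level_bits: int = 32,
) -> float:
    """Overhead when the block constants are themselves quantised."""
    return (first_level_bits / block_size
            + second_level_bits / (block_size * second_block_size))


def saved_gb(n_params: float) -> float:
    """Memory saved by double quantisation, in GB (1 GB = 1e9 bytes)."""
    saved_bits = constant_overhead_bits() - double_quant_overhead_bits()
    return n_params * saved_bits / 8 / 1e9


if __name__ == "__main__":
    print(f"single quantisation overhead: {constant_overhead_bits():.3f} bits/param")
    print(f"double quantisation overhead: {double_quant_overhead_bits():.3f} bits/param")
    print(f"saved on 65B: {saved_gb(65e9):.2f} GB, on 7B: {saved_gb(7e9):.2f} GB")
```

The overhead drops from 0.5 to about 0.127 bits per parameter, a saving of about 0.373 bits, which scales linearly with model size.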

QLoRA Implementation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# ── 1. BitsAndBytes 4-bit quantisation config ─────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,              # load base model in 4-bit
    bnb_4bit_use_double_quant=True,  # double-quantise the quant constants
    bnb_4bit_quant_type="nf4",       # NormalFloat-4 (optimal for Gaussian weights)
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to bfloat16 for compute
)

# ── 2. Load 8B model in 4-bit (only ~5–6 GB VRAM!) ────────
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# ── 3. Prepare model — cast LayerNorm to fp32 etc. ────────
model = prepare_model_for_kbit_training(model)

# ── 4. Apply LoRA on top of the 4-bit frozen base ─────────
lora_config = LoraConfig(
    r=64,                            # QLoRA paper setting; higher r than usual, base is compressed
    lora_alpha=16,                   # alpha/r = 0.25, also the paper's setting
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # include FFN for qlora
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Memory summary
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    print(f"GPU memory used: {allocated:.2f} GB")
OUTPUT
trainable params: 167,772,160 || all params: 8,198,049,792 || trainable%: 2.0465%
GPU memory used: 5.83 GB   <-- 8B model in 4-bit + LoRA adapters
(Compare: full fp16 would require ~16 GB; full fp32 = ~32 GB)
🎯
QLoRA Benchmark — What Does It Actually Cost?

The original QLoRA paper reported that Guanaco 65B, fine-tuned with QLoRA on a single 48 GB GPU for about 24 hours, reached 99.3% of ChatGPT's performance level on the Vicuna benchmark. Regular 16-bit fine-tuning of the same model needs more than 780 GB of GPU memory, which puts it out of reach of a single machine for most teams.


Section 08

Decision Guide — Which PEFT Method for Your Task?

01
Do you have < 24 GB VRAM and a model > 7B parameters?
QLoRA is usually your most practical option. Install bitsandbytes and use the NF4 config. The 16-bit weights alone exceed 24 GB once the model passes roughly 12B parameters, so full fine-tuning is out of the question and regular LoRA is tight or infeasible.
02
Is inference latency critical and must you avoid all overhead?
LoRA with merge_and_unload(). After training, merge the adapter into W₀. Inference is identical to a full fine-tuned model. Adapters add permanent layers; Prefix Tuning adds tokens — both increase latency slightly.
03
Do you need to serve many tasks simultaneously from one model?
Adapters, or unmerged LoRA. Both let you keep a single shared base model in memory and select a small per-task module per request. Only merged LoRA needs a separate full copy per task, so skip merge_and_unload() if you serve many tasks.
04
Is your task seq2seq generation (summarisation, translation, table-to-text)?
Prefix Tuning often outperforms LoRA here, especially when the task is close to what the model saw during pre-training. The continuous prefix steers the encoder-decoder dynamics well.
05
Are you training a chat / instruction / code model (< 14B params)?
Standard LoRA (r=16, alpha=32, q+k+v+o targets). This is a widely used community starting point for instruction tuning, and reported results often come close to full fine-tuning on this class of task.
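The five questions above can be condensed into a small helper. This is only a sketch of this guide's heuristics: choose_peft_method and its parameters are hypothetical names, the questions are checked in the order listed, and the first match wins. Real projects usually weigh several of these factors at once.

```python
def choose_peft_method(
    vram_gb: float,
    model_params_b: float,
    latency_critical: bool = False,
    many_tasks: bool = False,
    seq2seq: bool = False,
) -> str:
    """Walk the decision guide top to bottom and return the first match."""
    if vram_gb < 24 and model_params_b > 7:
        return "QLoRA"
    if latency_critical:
        return "LoRA + merge_and_unload()"
    if many_tasks:
        return "Adapters (swappable per-task modules)"
    if seq2seq:
        return "Prefix Tuning"
    return "LoRA (r=16, alpha=32)"


if __name__ == "__main__":
    print(choose_peft_method(vram_gb=16, model_params_b=13))
    print(choose_peft_method(vram_gb=80, model_params_b=7, latency_critical=True))
    print(choose_peft_method(vram_gb=80, model_params_b=3, seq2seq=True))
```

Treat the return value as a starting point to validate on held-out data, not as a final answer.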

Section 09

Golden Rules — Non-Negotiables for PEFT in Practice

⚡ PEFT Golden Rules — Never Violate These
1
Always call prepare_model_for_kbit_training(model) before applying LoRA to a quantised model. It freezes the base weights, upcasts the remaining non-quantised parameters (such as the normalisation layers) to fp32 for numerical stability, and enables gradient checkpointing by default. Skipping it can leave training unstable or cause gradient errors.
2
Set lora_alpha = 2 × r as your starting point. This keeps the effective LoRA learning rate stable across different rank choices. Some papers use alpha = r — both are defensible, but never set alpha much larger than 2r without careful learning rate adjustment.
3
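Consider optim="adamw_8bit" from bitsandbytes when optimiser memory matters. 8-bit Adam stores the optimiser state in 8 bits instead of 32, cutting optimiser memory by ~75% with no meaningful accuracy cost. With LoRA or QLoRA the optimiser only tracks the adapter parameters, so the absolute saving is small; multi-gigabyte reductions appear mainly in full fine-tuning, where the optimiser state covers every weight.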
4
Never report results on the same data you used to select rank (r). LoRA with very high r on small datasets will overfit. Hold out at least 10% as a validation set and tune r on that, then report on a separate test set that played no part in tuning. Common mistake: testing on the train split, seeing 99% accuracy, and shipping a model that scores 60% in production.
5
For Prefix Tuning, enable prefix_projection=True by default. Direct prefix optimisation without the MLP re-parameterisation tends to be unstable. The projection MLP adds training-time parameters (often more than the prefix itself), but only the resulting prefix vectors need to be kept for inference, and it makes training considerably more reliable.
6
When deploying adapters for multiple tasks on one shared base model, verify adapter isolation. If two tasks end up under the same adapter name (for example, through a wrong load_as argument), one can silently replace or be served in place of the other. Always name adapters explicitly per task.
7
Start with the smallest PEFT method that might work. Try r=4 before r=64. Try Prefix Tuning before LoRA. Try LoRA before QLoRA. Smaller methods are faster to iterate, cheaper to debug, and often match larger methods at the performance you need. Only scale up when a validation plateau proves you need more capacity.

Section 10

Quick Reference — Install, Import, Run

Environment Setup

# Install core PEFT dependencies
pip install transformers peft datasets trl accelerate
pip install bitsandbytes  # for QLoRA (NF4 quantisation + 8-bit Adam)
pip install adapters  # AdapterHub library for Houlsby-style adapters, successor to adapter-transformers (optional)
pip install sentencepiece protobuf  # needed by some tokenisers (LLaMA, T5)

Minimal LoRA Cheatsheet

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

# Standard config: a reasonable starting point for instruction tuning
cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none"
)
model = get_peft_model(model, cfg)    # wraps an already-loaded model; freezes base
model.print_trainable_parameters()     # always sanity-check this
model.save_pretrained("./lora-out")   # saves adapter only (~10-50 MB)

# Load and merge for deployment (model_id is the base checkpoint used for training)
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
peft_model = PeftModel.from_pretrained(base, "./lora-out")
merged = peft_model.merge_and_unload()  # same architecture and latency as the base model
🚀
What to Read Next

If you enjoyed this tutorial, the natural next steps are: DoRA (Weight-Decomposed LoRA, 2024) — a drop-in LoRA replacement that matches full fine-tuning more closely by decomposing weights into magnitude and direction; LoRA+ — which uses different learning rates for A and B matrices for faster convergence; and RLHF with PEFT — using LoRA adapters inside PPO reward-model training pipelines for safer and cheaper alignment tuning.