The Story That Explains Parameter-Efficient Fine-Tuning
Now, you have two choices. Option A: Re-train the chef from scratch — 20 years of school again, focused only on pizza. Expensive, slow, and wasteful. Option B: Give the chef a small specialisation course — two weeks on Neapolitan dough, San Marzano tomatoes, and wood-fired ovens. The chef keeps all their knowledge, but gains the specific skill you need.
Large Language Models are that chef. Training them from scratch on your task costs hundreds of thousands of dollars and months of compute. Parameter-Efficient Fine-Tuning (PEFT) is the two-week specialisation course — you get the specialised output for a fraction of the cost.
A modern LLM like GPT-3 or LLaMA-3 has billions of parameters. Fine-tuning all of them on a task-specific dataset is called full fine-tuning, and it needs roughly the same GPU memory footprint as pre-training (weights, gradients, and optimiser states for every parameter), even though it runs for far fewer steps. For most teams, that hardware is simply out of reach.
PEFT is a family of techniques that fine-tune only a tiny fraction of parameters — sometimes less than 0.1% of the model — while keeping the vast majority frozen. The result: near-full-fine-tuning performance at a fraction of the cost.
A 70-billion-parameter LLM in 16-bit precision needs ~140 GB of GPU memory just to store the weights. Add gradients and optimiser states during full fine-tuning and you need at least ~560 GB: eight A100 80 GB GPUs, before counting activations. With LoRA on a 4-bit quantised base (QLoRA, covered later), the same model fits on a single 48–80 GB GPU, and 7B-class models fit on a consumer RTX 4090. PEFT is not a shortcut; it is what makes LLM customisation accessible.
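The memory figures above follow from simple byte counting. The sketch below is a rough estimate only: it assumes 16-bit weights, 16-bit gradients, and two 16-bit Adam moment buffers per parameter (8 bytes per parameter in total), which is the accounting that reproduces the ~560 GB figure. The function name and its arguments are illustrative, and setups that keep fp32 master weights or fp32 optimiser states need considerably more.

```python
def training_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=4):
    """Rough memory estimate in GB (1 GB = 1e9 bytes), ignoring activations."""
    weights = n_params * bytes_weights / 1e9
    grads = n_params * bytes_grads / 1e9
    optimizer = n_params * bytes_optimizer / 1e9
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }

# Full fine-tuning of a 70B model under the assumptions above
full_ft = training_memory_gb(70e9)

# Frozen 4-bit base: 0.5 bytes per weight, no gradients or optimiser state for the base
frozen_4bit = training_memory_gb(70e9, bytes_weights=0.5, bytes_grads=0, bytes_optimizer=0)

print(full_ft)
print(frozen_4bit)
```

The frozen 4-bit figure excludes the LoRA parameters, their optimiser state, and activations, so the real footprint is somewhat higher.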
The Problem — Why Full Fine-Tuning Is Impractical
Before understanding PEFT, you need to feel the pain it solves. Here is what full fine-tuning actually requires:
Full fine-tuning also risks catastrophic forgetting, where the model's general capabilities degrade as it over-specialises on a narrow training set. PEFT methods reduce this risk by keeping the pre-trained weights frozen.
| Approach | Trainable Params | GPU Memory (7B model) | Task Checkpoints | Risk of Forgetting |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | ~80–120 GB | Full copy per task | High |
| LoRA | 0.1–1% | ~16–24 GB | ~10–50 MB per task | Very low |
| Prefix Tuning | ~0.1% | ~18 GB | ~5–20 MB per task | Minimal |
| Adapters | 0.5–3% | ~20–30 GB | ~20–80 MB per task | Very low |
LoRA — Low-Rank Adaptation
LoRA works exactly like that. The pre-trained weight matrix is the textbook. The two small low-rank matrices are the annotation booklet. At inference, you add the product of the two small matrices to the pre-trained weight.
The Core Idea — Low-Rank Decomposition
LoRA, introduced by Hu et al. (2021), rests on a key insight: the weight updates during fine-tuning have a low intrinsic rank. This means the change matrix ΔW doesn't need to be full-sized — it can be expressed as the product of two much smaller matrices.
Animation: data flows through frozen W₀ (top path) and trainable low-rank B·A (bottom path). Their outputs are summed element-wise. Only A and B accumulate gradients.
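Written out, the adapted layer computes W = W₀ + (α/r)·B·A, where B has shape d×r and A has shape r×k. The following is a minimal NumPy sketch of that forward pass, assuming toy dimensions and the initialisation described in the LoRA paper (A random, B zero). The variable and function names are illustrative and do not come from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8          # toy sizes; real layers are far larger

W0 = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init so the update starts at 0

def lora_forward(x):
    """h = W0 x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
full_params = W0.size            # parameters in the frozen matrix
lora_params = A.size + B.size    # parameters that would actually be trained
print(full_params, lora_params)
```

Because B starts at zero, the adapted layer reproduces the frozen layer exactly before training begins, and the trainable parameter count scales with r rather than with d_out × d_in.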
What Gets LoRA Applied To?
In practice, LoRA is usually applied to the query (Q) and value (V) projection matrices inside every transformer attention layer. Some implementations also target the key (K) and output projections and the feed-forward layers. The original paper found that, for a fixed parameter budget, adapting Q + V was sufficient for the tasks it tested.
LoRA Hyperparameters
| Parameter | Typical Range | What It Controls | Practical Advice |
|---|---|---|---|
| r (rank) | 4 – 64 | Expressiveness of the update; higher = more parameters | Start with r=8. Use r=16–32 for complex domain shifts. |
| lora_alpha | 8 – 128 | Scales the LoRA contribution: α/r multiplies B·A | Set alpha = 2×r as a safe default. Some use alpha = r. |
| lora_dropout | 0.0 – 0.1 | Dropout on the LoRA pathway for regularisation | 0.05 is a safe default; 0.0 works fine for large datasets. |
| target_modules | ["q_proj","v_proj"] | Which weight matrices receive LoRA adapters | Start q+v. Add k+o+ffn if task is hard domain shift. |
| bias | "none" / "lora_only" | Whether to train bias terms | "none" (frozen biases) is standard and saves memory. |
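The r and target_modules rows translate directly into a parameter budget. The sketch below counts LoRA parameters for a hypothetical decoder with 32 layers and a hidden size of 4096 (roughly a 7B-class shape), assuming every targeted projection is a square hidden × hidden matrix. That assumption does not hold for models with grouped-query attention, and the helper name is illustrative.

```python
def lora_param_count(hidden, n_layers, r, n_target_matrices):
    """Trainable LoRA parameters when each target is a square hidden x hidden matrix."""
    per_matrix = r * hidden + hidden * r   # A is (r x hidden), B is (hidden x r)
    return n_layers * n_target_matrices * per_matrix

hidden, n_layers = 4096, 32                             # hypothetical model shape
qv_r8 = lora_param_count(hidden, n_layers, r=8, n_target_matrices=2)     # q_proj + v_proj
qkvo_r8 = lora_param_count(hidden, n_layers, r=8, n_target_matrices=4)   # q, k, v, o
qv_r16 = lora_param_count(hidden, n_layers, r=16, n_target_matrices=2)

for name, n in [("q+v, r=8", qv_r8), ("q+k+v+o, r=8", qkvo_r8), ("q+v, r=16", qv_r16)]:
    print(f"{name}: {n:,} trainable parameters ({n / 7e9:.3%} of 7B)")
```

Doubling the rank or doubling the number of target matrices doubles the trainable parameter count, which is why the advice above starts small and widens only when the task demands it.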
Full LoRA Implementation with HuggingFace PEFT
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer
# ── 1. Load base model (frozen) ───────────────────────────
model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # auto-shard across available GPUs
)  # trust_remote_code is not needed: LLaMA is supported natively by transformers
# ── 2. Define LoRA configuration ──────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # rank: expressiveness of the low-rank update
    lora_alpha=32,      # scaling factor (alpha/r = 2.0 here)
    target_modules=[    # apply LoRA to these projection layers
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    lora_dropout=0.05,  # light dropout for regularisation
    bias="none",        # keep biases frozen
    use_rslora=False,   # rank-stabilised LoRA (set True for large r)
)
# ── 3. Wrap model with LoRA adapters ──────────────────────
model = get_peft_model(model, lora_config)
# Inspect how many parameters are actually trainable
model.print_trainable_parameters()
# ── 4. Load a sample instruction dataset ─────────────────
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}
dataset = dataset.map(format_instruction)
# ── 5. Training arguments ─────────────────────────────────
training_args = TrainingArguments(
    output_dir="./lora-llama3-alpaca",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 16 per device
    warmup_ratio=0.03,
    learning_rate=2e-4,
    fp16=True,           # on GPUs with bfloat16 support, bf16=True with a bfloat16 base model is usually more stable
    logging_steps=50,
    save_strategy="epoch",
    optim="adamw_8bit",  # bitsandbytes 8-bit AdamW: smaller optimiser state for the trainable parameters
    report_to="none",
)
# ── 6. Supervised Fine-Tuning Trainer ────────────────────
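# Note: these keyword arguments match older TRL releases. Newer releases take
# dataset_text_field and the sequence-length limit via SFTConfig and rename
# tokenizer to processing_class, so check the version you have installed.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)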
trainer.train()
# ── 7. Save only the LoRA adapter weights (~30 MB) ───────
model.save_pretrained("./lora-adapter")
tokenizer.save_pretrained("./lora-adapter")
Merging LoRA Back Into the Base Model (for deployment)
from peft import PeftModel
# Load base model + LoRA adapter, then merge into a single model
base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model_with_lora = PeftModel.from_pretrained(base_model, "./lora-adapter")
# Merge: folds the scaled B·A into W₀ permanently, so there is no inference overhead
merged_model = model_with_lora.merge_and_unload()
# The merged model has the base architecture, with the LoRA update baked into its weights
merged_model.save_pretrained("./merged-model")
# Inference
inputs = tokenizer(
    "### Instruction:\nExplain gradient descent in simple terms\n\n### Response:\n",
    return_tensors="pt"
).to(merged_model.device)
with torch.no_grad():
    # do_sample=True is required for temperature to take effect
    output = merged_model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
After merge_and_unload(), the LoRA matrices are mathematically absorbed into W₀. The resulting model has exactly the same architecture and inference cost as the base model, so there is no extra latency at inference time. This is one of LoRA's biggest practical advantages over Adapters, which add layers that must be processed at runtime.
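The merge is plain matrix arithmetic, which a few lines of NumPy can check. This sketch uses toy dimensions and random stand-ins for trained A and B matrices; the names are illustrative and unrelated to the PEFT internals.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 32, 4, 8
scale = alpha / r

W0 = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))    # stand-in for a trained LoRA A matrix
B = rng.normal(size=(d, r))    # stand-in for a trained LoRA B matrix
x = rng.normal(size=(d,))

# Unmerged: base path plus low-rank path, summed at the output
two_path = W0 @ x + scale * (B @ (A @ x))

# Merged: fold the low-rank update into a single matrix of the original shape
W_merged = W0 + scale * (B @ A)
one_path = W_merged @ x

print(np.allclose(two_path, one_path))
```

The merged matrix has the same shape as W₀, which is why the deployed model needs no extra layers or extra matrix multiplications.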
Prefix Tuning
Those stage directions before the curtain rises — that is Prefix Tuning. You prepend a sequence of learned, trainable "virtual tokens" to every layer of the transformer. The model never sees them as real text, but they steer every attention calculation that follows, shaping the model's output from the inside out.
How Prefix Tuning Works
Introduced by Li and Liang (2021), Prefix Tuning prepends trainable continuous vectors (called the prefix) to the key (K) and value (V) matrices of every transformer attention layer. These are not real tokens from the vocabulary — they are free-floating vectors optimised end-to-end.
Prefix vectors are prepended to K and V at every layer. Token queries attend to both prefix and real tokens. Only prefix vectors (blue) accumulate gradients — the base model is untouched.
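To make the mechanism concrete, here is a minimal single-head attention sketch in NumPy in which learned prefix keys and values are concatenated in front of the real ones. It assumes toy dimensions, omits masking and multiple heads, and uses random values where a real setup would use trained prefix vectors. All names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, prefix_len, d = 5, 3, 16
Q = rng.normal(size=(seq_len, d))        # queries from the real tokens (frozen model)
K = rng.normal(size=(seq_len, d))        # keys from the real tokens
V = rng.normal(size=(seq_len, d))        # values from the real tokens

P_k = rng.normal(size=(prefix_len, d))   # trainable prefix keys
P_v = rng.normal(size=(prefix_len, d))   # trainable prefix values

K_ext = np.concatenate([P_k, K], axis=0)   # prefix goes in front of the real keys
V_ext = np.concatenate([P_v, V], axis=0)
out, weights = attention(Q, K_ext, V_ext)

print(out.shape, weights.shape)
```

Each query now distributes its attention over prefix_len + seq_len positions, so the prefix influences every output without changing any of the model's own weights.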
Prefix Tuning vs Prompt Tuning
| Property | Prefix Tuning | Prompt Tuning |
|---|---|---|
| Where inserted | Every layer's K, V | Input layer only |
| Prefix / prompt length | 10–200 tokens | 10–100 tokens |
| Params trained | ~0.1% | ~0.01% |
| Performance | Near full FT on NLG | Needs very large model (11B+) |
| Inference cost | Slightly higher (extra K/V) | Minimal overhead |
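The gap in trainable parameters comes from where the vectors are inserted. The sketch below counts both for a hypothetical 12-layer stack with hidden size 768 and 30 virtual tokens. It counts a single stack of attention layers, so an encoder-decoder model would have more; the function names are illustrative.

```python
def prefix_tuning_params(n_layers, hidden, prefix_len):
    # one key vector and one value vector per prefix position, in every layer
    return n_layers * 2 * prefix_len * hidden

def prompt_tuning_params(hidden, prompt_len):
    # soft prompt embeddings at the input layer only
    return prompt_len * hidden

n_layers, hidden, length = 12, 768, 30   # hypothetical model shape
prefix = prefix_tuning_params(n_layers, hidden, length)
prompt = prompt_tuning_params(hidden, length)
print(prefix, prompt, prefix // prompt)
```

For the same number of virtual tokens, Prefix Tuning trains 2 × n_layers times as many parameters as Prompt Tuning, which matches the order-of-magnitude difference in the comparison above.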
Prefix Tuning Implementation
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
import torch
# ── 1. Load a seq2seq model (T5-base for summarisation) ───
model_id = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
# ── 2. Prefix Tuning config ───────────────────────────────
prefix_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=30,    # 30 virtual prefix tokens per layer
    encoder_hidden_size=768,  # match T5-base hidden dim
    prefix_projection=True,   # MLP re-parameterisation for stability
)
# ── 3. Apply prefix tuning ────────────────────────────────
model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()
# ── 4. Forward pass example ───────────────────────────────
# Prefix tokens are automatically prepended inside the PEFT wrapper
input_text = "summarize: The rapid advancement of large language models has transformed..."
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=60, num_beams=4)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Summary: {summary}")
# ── 5. Training loop (simplified) ────────────────────────
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:2000]")
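training_args = Seq2SeqTrainingArguments(
    output_dir="./prefix-t5-cnn",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    predict_with_generate=True,
    learning_rate=1e-3,  # prefix tuning uses a HIGHER lr than LoRA
    warmup_steps=100,
    fp16=True,           # T5 can overflow in fp16; prefer bf16=True on GPUs that support it
    save_strategy="epoch",
)
# The dataset still needs tokenising and a Seq2SeqTrainer before training can start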
prefix_projection=True for Stability
Directly optimising the prefix vectors can be unstable. The prefix_projection flag adds a small MLP re-parameterisation: the learned parameters are inputs to the MLP, and the MLP outputs are the actual prefix vectors injected into K and V. This regularises the training landscape significantly and is strongly recommended.
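The re-parameterisation can be sketched as a two-layer MLP that maps a small learned embedding per prefix position to one key and one value vector for every layer. The sizes below (12 layers, hidden size 768, MLP width 512) are hypothetical, and the code illustrates the idea rather than reproducing the PEFT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
prefix_len, hidden, mlp_hidden, n_layers = 30, 768, 512, 12   # hypothetical sizes
out_dim = n_layers * 2 * hidden   # one key and one value vector per layer

E = rng.normal(scale=0.02, size=(prefix_len, hidden))    # learned inputs to the MLP
W1 = rng.normal(scale=0.02, size=(hidden, mlp_hidden))
b1 = np.zeros(mlp_hidden)
W2 = rng.normal(scale=0.02, size=(mlp_hidden, out_dim))
b2 = np.zeros(out_dim)

def project_prefix():
    """Map the learned inputs to per-layer prefix keys and values."""
    h = np.tanh(E @ W1 + b1)
    flat = h @ W2 + b2
    return flat.reshape(prefix_len, n_layers, 2, hidden)   # axis 2: 0 = key, 1 = value

prefix = project_prefix()
train_params = E.size + W1.size + b1.size + W2.size + b2.size
kept_params = prefix.size   # what must be stored if the MLP is dropped after training
print(prefix.shape, train_params, kept_params)
```

Under these assumptions the MLP raises the training-time parameter count to about 9.9 million, while the prefix that is actually injected has about 0.55 million values. In the original method the MLP can be discarded once training finishes.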
Adapters — Bottleneck Layers Between Frozen Weights
Adapter modules are the dongles for neural networks. Small two-layer bottleneck networks are inserted between each frozen transformer sub-layer. Each adapter learns task-specific transformations. Swap adapters to switch tasks — no retraining required. Your base model is always the same laptop.
Adapter Architecture — The Bottleneck Module
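Houlsby et al. (2019) introduced the original Adapter architecture. Each adapter is a down-project → non-linearity → up-project bottleneck with a residual connection: h ← h + W_up · f(W_down · h), where W_down maps the hidden size d to a small bottleneck size m ≪ d and W_up maps back to d.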
Left: adapter modules inserted after attention and optionally after FFN in a frozen transformer block. Right: the internal bottleneck structure of a single adapter — down-project, activate, up-project, plus a residual skip connection that ensures identity behaviour at initialisation.
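A single bottleneck adapter fits in a few lines of NumPy. This sketch assumes a hidden size of 768 with a reduction factor of 16, a ReLU non-linearity, and an exactly-zero up-projection at initialisation; the original paper uses a near-identity initialisation, so the zero init here is a simplification. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, reduction_factor = 768, 16
bottleneck = hidden // reduction_factor      # bottleneck width

W_down = rng.normal(scale=0.02, size=(hidden, bottleneck))
b_down = np.zeros(bottleneck)
W_up = np.zeros((bottleneck, hidden))        # zero init: the adapter starts as the identity
b_up = np.zeros(hidden)

def adapter(h):
    """Residual bottleneck: h + up(relu(down(h)))."""
    z = np.maximum(h @ W_down + b_down, 0.0)   # down-project, then ReLU
    return h + z @ W_up + b_up                 # up-project, then residual skip

h = rng.normal(size=(4, hidden))               # four token representations
params = W_down.size + b_down.size + W_up.size + b_up.size
print(bottleneck, params)
```

With the up-projection at zero the adapter returns its input unchanged, so inserting it does not disturb the frozen model until training moves the adapter weights.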
Adapter Implementation with AdapterHub
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
# ── Uses AdapterHub's adapter-transformers ────────────────
# pip install adapter-transformers (a legacy fork that replaces the stock transformers
# package, so install it in its own environment; its successor is the `adapters` library)
import transformers.adapters as adapters
from transformers import RobertaAdapterModel, RobertaTokenizer
# ── Load model with adapter support ──────────────────────
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaAdapterModel.from_pretrained("roberta-base")
# ── Add a Houlsby-style bottleneck adapter ────────────────
# reduction_factor=16 means bottleneck dim = hidden/16 = 768/16 = 48
model.add_adapter("sentiment-adapter", config="houlsby")
# Add classification head on top
model.add_classification_head("sentiment-adapter", num_labels=2)
# Activate the adapter — freezes base model, trains adapter only
model.train_adapter("sentiment-adapter")
# Inspect trainable parameters
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
# ── Dataset preparation ───────────────────────────────────
dataset = load_dataset("sst2", split="train[:3000]")
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)
dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column("label", "labels")
# Hold out 10% for evaluation so the best checkpoint can be selected by accuracy
splits = dataset.train_test_split(test_size=0.1, seed=42)
splits.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
# ── Training ──────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir="./adapter-roberta-sst2",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=1e-4,  # adapters use a higher lr than full fine-tuning
    fp16=True,
    logging_steps=50,
    evaluation_strategy="epoch",  # must match save_strategy (named eval_strategy in newer transformers)
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
# ── Save only the adapter (very small checkpoint) ─────────
model.save_adapter("./saved-adapter", "sentiment-adapter")
print("Adapter saved!")
# ── Load later: combine any adapter with the base model ───
model2 = RobertaAdapterModel.from_pretrained("roberta-base")
model2.load_adapter("./saved-adapter", load_as="sentiment-adapter")
model2.set_active_adapters("sentiment-adapter")
Head-to-Head — LoRA vs Prefix Tuning vs Adapters
| Property | LoRA | Prefix Tuning | Adapters |
|---|---|---|---|
| Where changes happen | Inside weight matrices (low-rank ΔW) | Prepended to K,V at every layer | New layers inserted into block |
| Trainable params | 0.1–1% | ~0.1% | 0.5–3% |
| Inference overhead | Zero (merge into W) | Small (extra K/V tokens) | Yes (extra layers processed) |
| Architecture change | No (merged away) | No permanent change | Yes (new module) |
| Memory during training | Lowest | Low | Slightly higher |
| Best for | Instruction tuning, chat, code | Seq2seq, summarisation, translation | Multi-task, modular task switching |
| Task modularity (swap adapters) | Manual re-merge | Swap prefix vectors | Native plug-and-play |
| Ecosystem / tooling | Excellent (HF PEFT, QLoRA, DoRA) | Good (HF PEFT) | Good (AdapterHub) |
LoRA and its variants (QLoRA, DoRA, LoRA+) dominate LLM fine-tuning in practice. LoRA keeps training memory low, merges away its inference overhead, and is supported by the major fine-tuning frameworks. Adapters remain strong for multi-task NLP pipelines where truly modular task-switching matters. Prefix Tuning is a good fit for seq2seq generation tasks such as summarisation and translation.
QLoRA — Quantised LoRA (The Game-Changer)
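The result: 65-billion-parameter models fine-tuned on a single 48 GB GPU, and 33B models on a 24 GB consumer GPU.
QLoRA (Dettmers et al. 2023) stacks three innovations on top of LoRA: 4-bit NormalFloat (NF4) quantisation of the frozen base weights, double quantisation of the quantisation constants, and paged optimisers that absorb memory spikes.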
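The core idea behind 4-bit storage is block-wise quantisation: split the weights into small blocks, store one scale per block, and keep only a low-bit code per weight. The sketch below uses uniform 4-bit levels with absmax scaling as a simplified stand-in. Real NF4 uses 16 non-uniform levels matched to a normal distribution, and bitsandbytes packs two codes per byte; neither detail is reproduced here, and the function names are illustrative.

```python
import numpy as np

def quantize_blockwise(w, block_size=64, n_levels=7):
    """Uniform absmax quantisation to integer codes in [-n_levels, n_levels]."""
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)   # one scale per block
    absmax = np.where(absmax == 0, 1.0, absmax)          # avoid division by zero
    codes = np.round(blocks / absmax * n_levels).astype(np.int8)
    return codes, absmax

def dequantize_blockwise(codes, absmax, shape, n_levels=7):
    """Recover approximate weights from codes and per-block scales."""
    return (codes.astype(np.float32) / n_levels * absmax).reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(128, 128)).astype(np.float32)

codes, absmax = quantize_blockwise(W)
W_hat = dequantize_blockwise(codes, absmax, W.shape)
max_err = np.abs(W - W_hat).max()
print(codes.min(), codes.max(), max_err)
```

During QLoRA training the base weights stay in this compressed form and are dequantised on the fly for each matrix multiplication, while gradients flow only into the LoRA matrices.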
QLoRA Implementation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# ── 1. BitsAndBytes 4-bit quantisation config ─────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load base model in 4-bit
    bnb_4bit_use_double_quant=True,         # double-quantise the quant constants
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 (designed for normally distributed weights)
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to bfloat16 for compute
)
# ── 2. Load the 8B model in 4-bit (roughly 6 GB of VRAM) ──
model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# ── 3. Prepare model — cast LayerNorm to fp32 etc. ────────
model = prepare_model_for_kbit_training(model)
# ── 4. Apply LoRA on top of the 4-bit frozen base ─────────
lora_config = LoraConfig(
    r=64,           # QLoRA-paper setting; larger than the r=8–16 used earlier
    lora_alpha=16,  # alpha/r = 0.25, also the QLoRA-paper setting
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # include FFN for QLoRA
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Memory summary
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    print(f"GPU memory used: {allocated:.2f} GB")
The original QLoRA paper showed that a 65B parameter model fine-tuned with QLoRA on a single 48 GB GPU reached 99.3% of the performance of ChatGPT (GPT-3.5) on the Vicuna benchmark after about 24 hours of fine-tuning. Before QLoRA, regular 16-bit fine-tuning of a 65B model needed more than 780 GB of GPU memory.
Decision Guide — Which PEFT Method for Your Task?
Golden Rules — Non-Negotiables for PEFT in Practice
Always call prepare_model_for_kbit_training(model) before applying LoRA to a quantised model. Without it, LayerNorm and other non-quantised layers stay in half precision, which can make training unstable. It also enables gradient checkpointing for you.
Use lora_alpha = 2 × r as your starting point. Holding α/r fixed keeps the scale of the LoRA update comparable across different rank choices. Some papers use alpha = r; both are defensible, but avoid setting alpha much larger than 2r without careful learning rate adjustment.
Use optim="adamw_8bit" from bitsandbytes when optimiser memory matters. 8-bit Adam stores the optimiser state in 8 bits instead of 32, cutting optimiser memory by ~75% with no meaningful accuracy cost. The saving is large for full fine-tuning (tens of GB on a 7B model) but small for LoRA/QLoRA, where only the adapter parameters carry optimiser state.
For Prefix Tuning, set prefix_projection=True. Direct prefix optimisation without the MLP re-parameterisation tends to be unstable and sensitive to the learning rate and initialisation. The projection MLP adds training-time parameters (often millions), but in the original method it can be discarded after training, so the stored prefix stays small.
If two adapters are loaded under the same name (the load_as argument), one can silently overwrite the other. Always name adapters explicitly per task.
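How much an 8-bit optimiser saves depends on how many parameters are trainable. The sketch below compares Adam state size for full fine-tuning of a 7B model against a LoRA run with about 4.2 million trainable parameters. Both counts are illustrative assumptions, and the small per-block scaling constants that 8-bit optimisers also store are ignored.

```python
def adam_state_gb(n_trainable, bytes_per_state):
    """Adam keeps two moment estimates per trainable parameter."""
    return n_trainable * 2 * bytes_per_state / 1e9

full_ft_params = 7_000_000_000   # every weight is trainable
lora_params = 4_194_304          # hypothetical LoRA adapter size

full_fp32 = adam_state_gb(full_ft_params, 4)
full_8bit = adam_state_gb(full_ft_params, 1)
lora_fp32 = adam_state_gb(lora_params, 4)
lora_8bit = adam_state_gb(lora_params, 1)

print(f"full fine-tuning: {full_fp32:.1f} GB -> {full_8bit:.1f} GB")
print(f"LoRA:             {lora_fp32:.4f} GB -> {lora_8bit:.4f} GB")
```

The relative saving is 75% in both cases, but in absolute terms it only matters when the number of trainable parameters is large.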
Quick Reference — Install, Import, Run
Environment Setup
# Install core PEFT dependencies
pip install transformers peft datasets trl accelerate
pip install bitsandbytes # for QLoRA (NF4 quantisation + 8-bit Adam)
pip install adapter-transformers # optional, for Houlsby-style adapters; legacy fork that replaces transformers, so use a separate environment
pip install sentencepiece protobuf # needed by some tokenisers (LLaMA, T5)
Minimal LoRA Cheatsheet
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
# Assumes `model` (a loaded causal LM) and `model_id` are already defined
# Standard config: a reasonable starting point for most instruction-tuning tasks
cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
)
model = get_peft_model(model, cfg)     # wraps model; freezes base
model.print_trainable_parameters()     # always sanity-check this
model.save_pretrained("./lora-out")    # saves adapter only (~10-50 MB)
# Load and merge for deployment
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
peft_model = PeftModel.from_pretrained(base, "./lora-out")  # avoid naming this `peft`, which shadows the library
merged = peft_model.merge_and_unload()  # base architecture with the LoRA update folded in
If you enjoyed this tutorial, the natural next steps are: DoRA (Weight-Decomposed LoRA, 2024) — a drop-in LoRA replacement that matches full fine-tuning more closely by decomposing weights into magnitude and direction; LoRA+ — which uses different learning rates for A and B matrices for faster convergence; and RLHF with PEFT — using LoRA adapters inside PPO reward-model training pipelines for safer and cheaper alignment tuning.