The Story That Makes Benchmarks Make Sense
Language models are the same. When OpenAI, Google, and Meta each claim their model is "best at reasoning," what does that actually mean? Without a shared, standardised evaluation suite, every number is marketing. Benchmarks are the Olympic track — the same questions, the same rules, the same scoring for every model, so the world can compare them honestly.
This tutorial walks you through the three most important benchmark families: GLUE, SuperGLUE, and MMLU — what they test, why they were built, how they work, and what their scores actually tell you.
LLM evaluation is the systematic process of measuring how well a language model performs across a defined set of tasks. Because language is infinitely varied, evaluation is hard — and the field has converged on a handful of trusted benchmark suites that have become the common currency of AI research.
A model that scores 92% on a benchmark may have simply memorised the test answers during training — a phenomenon called data contamination. Real evaluation requires checking that the benchmark was not part of the model's training corpus, running multiple benchmarks (not just one), and testing on held-out data the model has never seen. No single benchmark score tells the whole story.
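One rough way to screen for contamination is to look for long word n-grams that a benchmark item shares with training text. The sketch below is illustrative only: the function names, the 8-word window, and the toy strings are all assumptions made for this example, not a procedure prescribed by any benchmark.

```python
import re

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item, training_text, n=8):
    """Fraction of the item's n-grams that also occur in the training text."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words: nothing to compare
    train_grams = word_ngrams(training_text, n)
    return len(item_grams & train_grams) / len(item_grams)

# Toy data (hypothetical): the first item appears verbatim in the "corpus".
corpus = "Forum post: the mitochondria is the powerhouse of the cell, as every textbook says."
seen_item = "The mitochondria is the powerhouse of the cell, as every textbook says."
unseen_item = "Which organelle is responsible for producing most of a cell's ATP supply?"

seen_score = overlap_ratio(seen_item, corpus)
unseen_score = overlap_ratio(unseen_item, corpus)
print(f"seen item overlap:   {seen_score:.2f}")
print(f"unseen item overlap: {unseen_score:.2f}")
```

A high ratio only flags an item for closer inspection. A paraphrased or translated copy of the test set would pass this check undetected.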
What Are We Actually Measuring?
Before diving into specific benchmarks, understand the landscape of NLP capabilities that evaluation suites are designed to probe. Language understanding involves multiple distinct skills — and different benchmarks target different slices of that landscape.
Each layer builds on the one below. GLUE probes Layer 1–2; SuperGLUE stresses Layer 2–3; MMLU targets Layer 2–3 at domain-expert depth.
GLUE — General Language Understanding Evaluation
Alex Wang and colleagues at NYU assembled GLUE — a single leaderboard covering 9 NLP tasks with a unified scoring system. For the first time, you could ask: "Is Model A better than Model B at general language understanding?" and get a defensible answer. The era of benchmark chasing had begun.
GLUE (Wang et al., 2018) aggregates nine NLP tasks into a single overall score, an unweighted average of the per-task scores, and provides a public leaderboard for fair model comparison. All nine tasks ask a model to read a sentence (or pair of sentences) and make a prediction: a class label for eight of them, and a 0–5 similarity score for STS-B.
The Nine GLUE Tasks
| Task | What It Tests | Example Input | Labels | Size |
|---|---|---|---|---|
| CoLA | Grammatical acceptability | "The horse raced past the barn fell." → Is this grammatical? | Acceptable / Unacceptable | 10k |
| SST-2 | Sentiment analysis (movie reviews) | "A deeply moving and beautifully shot film." → Sentiment? | Positive / Negative | 67k |
| MRPC | Sentence paraphrase detection | S1: "He said the bank robbed him." S2: "He claimed to have been robbed." → Same meaning? | Equivalent / Not Equivalent | 5.8k |
| STS-B | Semantic textual similarity (0–5 score) | "A man is playing piano" vs "A man is playing a guitar" → Similarity? | Score 0–5 | 8.6k |
| QQP | Question paraphrase (Quora pairs) | "How do I learn Python?" vs "Best way to learn Python?" → Duplicate? | Duplicate / Not Duplicate | 364k |
| MNLI | Multi-genre natural language inference | P: "The sky is blue." H: "The sky is not red." → Relationship? | Entails / Neutral / Contradicts | 393k |
| QNLI | Question–answer entailment | Q: "When did WWII end?" Sentence: "WWII ended in 1945." → Answers Q? | Entails / Not Entails | 105k |
| RTE | Recognising textual entailment | P: "Marie Curie won two Nobel Prizes." H: "Marie Curie won a Nobel Prize." → Entails? | Entails / Not Entails | 2.5k |
| WNLI | Winograd schema coreference | "The trophy didn't fit because it was too big." → What is 'it'? | Entails / Not Entails | 852 |
Each task uses its own primary metric (Accuracy, F1, Matthews Correlation, Pearson/Spearman correlation). The overall GLUE score is the simple average of all nine task scores (after normalising each to 0–100). The human baseline on GLUE is ~87.1. By 2020, the best models were scoring ~90+, effectively saturating the benchmark — which is exactly why SuperGLUE was created.
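The averaging described above can be sketched in a few lines. The per-task numbers here are hypothetical, not real leaderboard results. Collapsing a task's two metrics into one score with a plain mean is an assumption of this sketch.

```python
from statistics import mean

# Hypothetical per-task results on a 0-100 scale (not real leaderboard numbers).
# Tasks that report two metrics list both.
task_metrics = {
    "CoLA":  {"matthews_corr": 60.0},
    "SST-2": {"accuracy": 94.0},
    "MRPC":  {"f1": 90.0, "accuracy": 86.0},
    "STS-B": {"pearson": 89.0, "spearman": 88.0},
    "QQP":   {"f1": 72.0, "accuracy": 90.0},
    "MNLI":  {"accuracy": 86.0},
    "QNLI":  {"accuracy": 92.0},
    "RTE":   {"accuracy": 70.0},
    "WNLI":  {"accuracy": 65.0},
}

def task_score(metrics):
    """Collapse a task's metrics into one number (assumed: plain mean)."""
    return mean(metrics.values())

def glue_score(all_metrics):
    """Unweighted average of the per-task scores."""
    return mean(task_score(m) for m in all_metrics.values())

per_task = {name: task_score(m) for name, m in task_metrics.items()}
overall = glue_score(task_metrics)
print(per_task)
print(f"Overall GLUE-style score: {overall:.2f}")
```

Because the average is unweighted, the 852-example WNLI counts exactly as much as the 393k-example MNLI, so a weak result on a tiny task moves the headline score as much as one on a large task.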
GLUE Score Evolution (2018–2020)
GLUE was saturated within 18 months. Models crossed the human baseline (~87.1) by mid-2019, forcing researchers to build a harder challenge.
Running GLUE Evaluation in Python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
import numpy as np
import evaluate

# ── 1. Load a GLUE task (SST-2 sentiment) ──────────────────
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ── 2. Tokenise ─────────────────────────────────────────────
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

tokenised = dataset.map(tokenize, batched=True)

# ── 3. Load model with classification head ──────────────────
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# ── 4. Define GLUE metric ────────────────────────────────────
glue_metric = evaluate.load("glue", "sst2")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return glue_metric.compute(predictions=preds, references=labels)

# ── 5. Train & evaluate ──────────────────────────────────────
args = TrainingArguments(
    output_dir="./sst2-bert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",        # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    compute_metrics=compute_metrics
)
trainer.train()
results = trainer.evaluate()
print(f"SST-2 Accuracy: {results['eval_accuracy']:.4f}")
By late 2019, models such as ALBERT and T5 were scoring roughly 89 to 90 on GLUE, above the estimated human baseline of 87.1. Once machines outperform the human baseline, a benchmark stops being informative: GLUE could no longer distinguish frontier models from each other. It was time to build something harder.
SuperGLUE — The Harder Successor
That is exactly what happened to GLUE in 2019. Wang et al. responded by building SuperGLUE — eight tasks specifically selected because they were hard for the best models of the time, yet solvable by humans. The tasks require commonsense reasoning, multi-sentence understanding, and fine-grained logical inference — not just surface-level pattern matching.
The Eight SuperGLUE Tasks
| Task | Skill Tested | Example | Difficulty |
|---|---|---|---|
| BoolQ | Yes/No QA from a passage | "Is a bat a mammal?" [passage about bats] → True/False | Medium |
| CB | Commitment Bank — 3-way NLI on rare premises | Premise has embedded negation and conditionals | Medium–Hard |
| COPA | Causal commonsense reasoning | "The man turned on the fan. What happened as a result?" | Hard |
| MultiRC | Multi-sentence reading comprehension | Long passage → multiple questions, each with multiple correct answers | Very Hard |
| ReCoRD | Reading comprehension with cloze-style NER | News article with masked entity → choose correct entity from passage | Hard |
| RTE | Textual entailment (carried over from GLUE) | Two sentences → Entails / Not Entails | Medium |
| WiC | Word-in-Context sense disambiguation | "He had a bank account." vs "The bank of the river." → Same sense? | Hard |
| WSC | Winograd Schema coreference (full) | "The trophy didn't fit because it was too big." → What is 'it'? | Very Hard |
WSC and MultiRC remained below human performance even for top-tier models in 2020, proving SuperGLUE's staying power.
Evaluating on SuperGLUE — BoolQ
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate, numpy as np

# ── SuperGLUE: BoolQ task (question + passage → yes/no) ─────
dataset = load_dataset("super_glue", "boolq")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess(batch):
    # Encode question and passage as a sentence pair
    return tokenizer(
        batch["question"], batch["passage"],
        truncation=True, padding="max_length", max_length=256
    )

tokenised = dataset.map(preprocess, batched=True)
tokenised = tokenised.rename_column("label", "labels")
tokenised.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
metric = evaluate.load("super_glue", "boolq")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# SuperGLUE tasks are small, so train for more epochs at a lower learning rate
args = TrainingArguments(
    output_dir="./boolq-roberta",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    report_to="none"
)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenised["train"],
                  eval_dataset=tokenised["validation"],
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
GLUE tasks are mostly single-sentence or sentence-pair classification with clear lexical cues. SuperGLUE tasks require reading longer passages, resolving ambiguous pronouns, inferring causal chains, and handling negation and conditionals. The human baseline on SuperGLUE (~89.8) was not surpassed by models until 2021, giving the benchmark roughly two years of genuine usefulness.
MMLU — Massive Multitask Language Understanding
Dan Hendrycks (UC Berkeley) asked a different question: "If a model is supposed to be a general-purpose AI assistant, it needs actual knowledge — the kind you test in university exams." In 2020, he assembled 57 academic subjects covering everything from elementary mathematics to professional law and medicine, and called it MMLU. It became the most widely cited benchmark of the GPT-4 era.
What an MMLU Question Looks Like
GPT-4 approximate scores by category. STEM remains the hardest. Expert human performance ≈ 89%. Chance level = 25% (four choices).
MMLU Zero-Shot Evaluation in Python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, numpy as np
from tqdm import tqdm

# ── MMLU is evaluated zero/few-shot: no fine-tuning ─────────
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
subject = "clinical_knowledge"
dataset = load_dataset("cais/mmlu", subject, split="test")
CHOICES = ["A", "B", "C", "D"]

def build_prompt(row):
    # Zero-shot prompt: the question and its four options, with no solved examples
    q = row["question"]
    prompt = f"Question: {q}\n"
    for letter, choice in zip(CHOICES, row["choices"]):
        prompt += f"{letter}. {choice}\n"
    prompt += "Answer:"
    return prompt

def get_choice_logprob(prompt, choice):
    """Score a candidate answer letter by model likelihood (higher = more likely).

    outputs.loss is the mean negative log-likelihood over all tokens of
    prompt + choice. The prompt is identical for every choice, so the ranking
    is driven by the answer token, provided each letter tokenises to the same
    number of tokens.
    """
    full_text = prompt + f" {choice}"
    inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

correct = 0
for row in tqdm(dataset, desc=f"MMLU {subject}"):
    prompt = build_prompt(row)
    scores = [get_choice_logprob(prompt, c) for c in CHOICES]
    pred_idx = int(np.argmax(scores))
    if pred_idx == row["answer"]:  # "answer" is the index of the correct choice
        correct += 1

accuracy = correct / len(dataset)
print(f"\n{subject} accuracy : {accuracy:.4f}")
print(f"Random baseline : 0.2500")
print(f"Expert human : ~0.8900")
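The loop above scores each question zero-shot. In the few-shot setting, a handful of solved examples is placed in front of the target question. The sketch below shows one way to assemble such a prompt. The header wording, the helper names, and the toy questions are assumptions made for illustration, and the rows only mimic the `question` / `choices` / `answer` fields used above.

```python
CHOICES = ["A", "B", "C", "D"]

def format_question(row, include_answer):
    """Render one row as text; optionally append the gold answer letter."""
    lines = [f"Question: {row['question']}"]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, row["choices"])]
    answer = f" {CHOICES[row['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)

def build_few_shot_prompt(subject, shots, target):
    """Concatenate solved example rows followed by the unsolved target row."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [header]
    blocks += [format_question(row, include_answer=True) for row in shots]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)

# Toy rows (hypothetical, not taken from MMLU).
shots = [
    {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Which planet is closest to the Sun?",
     "choices": ["Venus", "Earth", "Mercury", "Mars"], "answer": 2},
]
target = {"question": "How many sides does a triangle have?",
          "choices": ["2", "3", "4", "5"], "answer": 1}

prompt = build_few_shot_prompt("general knowledge", shots, target)
print(prompt)
```

The solved examples should come from held-out development questions rather than the test split, otherwise the prompt itself leaks test answers.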
GLUE vs SuperGLUE vs MMLU — Direct Comparison
| Property | GLUE (2018) | SuperGLUE (2019) | MMLU (2020) |
|---|---|---|---|
| Primary skill tested | Linguistic understanding | Deep reasoning + NLI | World knowledge + reasoning |
| Number of tasks/subjects | 9 tasks | 8 tasks | 57 subjects |
| Format | Classification (2–3 labels) | Classification + QA | 4-choice MCQ |
| Human baseline | 87.1 | 89.8 | ~89% avg |
| Models surpassed human | 2019 — 1 year | 2021 — 2 years | 2023+ — still informative |
| Requires fine-tuning? | Yes — per task | Yes — per task | No — zero/few-shot |
| Still used in 2025? | Rarely (saturated) | Sometimes (older models) | Yes — standard benchmark |
Each benchmark served its purpose: GLUE established standardisation, SuperGLUE raised the reasoning bar, MMLU shifted focus to knowledge.
Full Multi-Benchmark Evaluation Pipeline
from datasets import load_dataset
from transformers import pipeline
import numpy as np

MODEL = "HuggingFaceH4/zephyr-7b-beta"
generator = pipeline(
    "text-generation", model=MODEL,
    device_map="auto", torch_dtype="auto",
    max_new_tokens=4, do_sample=False  # greedy decoding; no temperature needed
)

# ── GLUE SST-2 ───────────────────────────────────────────────
def eval_sst2(n=200):
    ds = load_dataset("glue", "sst2", split=f"validation[:{n}]")
    correct = 0
    for row in ds:
        prompt = (f"Classify as Positive or Negative.\nReview: {row['sentence']}\nSentiment:")
        # Keep only the newly generated text, not the echoed prompt
        out = generator(prompt)[0]["generated_text"][len(prompt):].strip().lower()
        pred = 1 if "positive" in out else 0
        correct += int(pred == row["label"])
    return correct / len(ds)

# ── SuperGLUE BoolQ ──────────────────────────────────────────
def eval_boolq(n=200):
    ds = load_dataset("super_glue", "boolq", split=f"validation[:{n}]")
    correct = 0
    for row in ds:
        # Passage is cut to 400 characters to keep prompts short
        prompt = (f"Passage: {row['passage'][:400]}\nQuestion: {row['question']}\nAnswer yes or no:")
        out = generator(prompt)[0]["generated_text"][len(prompt):].strip().lower()
        pred = 1 if "yes" in out else 0
        correct += int(pred == row["label"])
    return correct / len(ds)

# ── MMLU subset ──────────────────────────────────────────────
CHOICES = ["A", "B", "C", "D"]

def eval_mmlu(subject="high_school_biology", n=100):
    ds = load_dataset("cais/mmlu", subject, split=f"test[:{n}]")
    correct = 0
    for row in ds:
        opts = "\n".join(f"{c}. {t}" for c, t in zip(CHOICES, row["choices"]))
        prompt = f"Question: {row['question']}\n{opts}\nAnswer:"
        out = generator(prompt)[0]["generated_text"][len(prompt):].strip()[:1].upper()
        # An unparseable reply becomes -1 and is counted as wrong
        pred = CHOICES.index(out) if out in CHOICES else -1
        correct += int(pred == row["answer"])
    return correct / len(ds)

# ── Report ────────────────────────────────────────────────────
results = {
    "GLUE SST-2": eval_sst2(200),
    "SuperGLUE BoolQ": eval_boolq(200),
    "MMLU HS Biology": eval_mmlu("high_school_biology", 100),
    "MMLU Clinical Know.": eval_mmlu("clinical_knowledge", 100),
}
print("\n═══ Evaluation Report ═══")
for task, acc in results.items():
    bar = "█" * int(acc * 20) + "░" * (20 - int(acc * 20))
    print(f"{task:25s} {bar} {acc:.2%}")
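Because the pipeline above scores only 100 to 200 examples per task, each accuracy carries noticeable sampling noise. The sketch below computes a Wilson score interval for an accuracy estimate. The function name and the example counts are assumptions chosen for illustration.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total == 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical outcome: 170 of 200 sampled examples answered correctly.
low, high = wilson_interval(170, 200)
print(f"accuracy 85.0%, 95% interval roughly {low:.1%} to {high:.1%}")
```

With samples this small the interval spans several percentage points, so a gap of a point or two between two models on a 200-example subset is not good evidence that one is better.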
Critical Limitations — What Benchmarks Cannot Tell You
State-of-the-art evaluation now uses a portfolio of benchmarks: MMLU (knowledge), HumanEval / MBPP (coding), GSM8K / MATH (mathematics), HellaSwag / WinoGrande (common sense), MT-Bench / AlpacaEval (instruction following), TruthfulQA (factual accuracy), and BBH, short for BIG-Bench Hard (complex reasoning). No single benchmark is sufficient. Responsible model releases report scores across several of these categories rather than a single headline number.
Golden Rules — LLM Evaluation Practitioner Checklist
lm-evaluation-harness contamination checker.
EleutherAI/lm-evaluation-harness provides standardised, reproducible evaluation across 200+ tasks including GLUE, SuperGLUE, and MMLU. Use it instead of writing custom evaluation loops.
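As a rough illustration of what a harness run looks like, the sketch below only assembles a command line and prints it. The helper name is hypothetical. The flags and task names shown (`--model hf`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`, and `mmlu`, `boolq`, `sst2`) are assumed to match the installed harness version, so check them against its documentation before relying on them.

```python
import shlex

def build_lm_eval_command(model_id, tasks, num_fewshot=0, batch_size=8):
    """Assemble an lm_eval command line as a list of arguments (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]

cmd = build_lm_eval_command(
    "HuggingFaceH4/zephyr-7b-beta", ["mmlu", "boolq", "sst2"], num_fewshot=5
)
print(shlex.join(cmd))

# To actually run it (requires the harness installed and a suitable GPU):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when it is later passed to `subprocess.run`.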
You now understand the full arc of NLP benchmarking: why GLUE standardised evaluation, why SuperGLUE pushed reasoning harder, and why MMLU shifted the field toward real-world knowledge. You can run all three programmatically, interpret their scores honestly, and understand exactly what they do — and do not — tell you about a model's capabilities. That is the foundation of principled LLM evaluation.