
LLM Evaluation & Benchmarks: GLUE, SuperGLUE, and MMLU

A comprehensive, example-driven tutorial covering the three most important NLP benchmark families — GLUE, SuperGLUE, and MMLU. Includes origin stories, task-by-task breakdowns, animated SVG diagrams, working Python code with syntax highlighting, and a practitioner's golden rules checklist.

Section 01

The Story That Makes Benchmarks Make Sense

The Olympic Trials for AI — Why We Need Standardised Tests
Imagine two athletes who both claim to be "the fastest runner in the world." One trained in thin mountain air at altitude; the other on a sea-level track. Both ran fast in their own conditions, but you cannot compare them unless you put them on the same track, on the same day, running the same distance.

Language models are the same. When OpenAI, Google, and Meta each claim their model is "best at reasoning," what does that actually mean? Without a shared, standardised evaluation suite, every number is marketing. Benchmarks are the Olympic track — the same questions, the same rules, the same scoring for every model, so the world can compare them honestly.

This tutorial walks you through the three most important benchmark families: GLUE, SuperGLUE, and MMLU — what they test, why they were built, how they work, and what their scores actually tell you.

LLM evaluation is the systematic process of measuring how well a language model performs across a defined set of tasks. Because language is infinitely varied, evaluation is hard — and the field has converged on a handful of trusted benchmark suites that have become the common currency of AI research.

💡
Why Evaluation Is Harder Than It Looks

A model that scores 92% on a benchmark may have simply memorised the test answers during training — a phenomenon called data contamination. Real evaluation requires checking that the benchmark was not part of the model's training corpus, running multiple benchmarks (not just one), and testing on held-out data the model has never seen. No single benchmark score tells the whole story.
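One common way to look for contamination is to check whether long word n-grams from a benchmark item also appear in the training corpus. The sketch below is a minimal illustration of that idea, not a production tool: the function names, the toy documents, and the choice of 8-word n-grams are all assumptions made for this example, and a real corpus would need an index rather than an in-memory set.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_ngrams]
    return len(flagged) / len(benchmark_items), flagged


# Toy data (hypothetical): the first benchmark question leaked into a forum post.
training_docs = [
    "forum post: which artery is most likely occluded when st elevation appears in leads ii iii and avf",
    "an unrelated page about sourdough baking and hydration ratios for beginners at home",
]
benchmark_items = [
    "Which artery is most likely occluded when ST elevation appears in leads II III and aVF",
    "Which clause prevents states from discriminating against citizens of other states",
]

rate, flagged = contamination_rate(benchmark_items, training_docs, n=8)
print(f"Contaminated items: {len(flagged)}/{len(benchmark_items)} ({rate:.0%})")
```

Exact n-gram matching only catches verbatim leakage; paraphrased or translated test items would pass this check, so a clean result is weak evidence rather than proof.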


Section 02

What Are We Actually Measuring?

Before diving into specific benchmarks, understand the landscape of NLP capabilities that evaluation suites are designed to probe. Language understanding involves multiple distinct skills — and different benchmarks target different slices of that landscape.

🧠
Linguistic Understanding
Syntax · Semantics · Pragmatics
Does the model understand what a sentence means? Can it detect when two sentences contradict each other, or when one follows from the other? Can it resolve what a pronoun refers to?
🔍
World Knowledge
Facts · Reasoning · Common Sense
Does the model know that Paris is the capital of France? That water boils at 100°C? That if someone "kicked the bucket" they probably did not interact with a bucket? This requires storing and retrieving factual knowledge.
🤖
Reasoning & Inference
Logic · Multi-step · Causal
Given premises, can the model derive correct conclusions? Can it follow a chain of logical steps? Can it identify which piece of information is necessary to answer a question versus which is a distraction?
🌟 The Three Layers of Language Understanding
LAYER 1 (LINGUISTIC UNDERSTANDING): Tokenisation · POS Tagging · Syntactic Parsing · Coreference Resolution · Sentiment
LAYER 2 (FACTUAL & WORLD KNOWLEDGE): Entity Recognition · Factual QA · Common-Sense Inference · Named Entities
LAYER 3 (REASONING & INFERENCE): Natural Language Inference · Multi-hop QA · Math Reasoning · Causal Logic

Each layer builds on the one below. GLUE probes Layer 1–2; SuperGLUE stresses Layer 2–3; MMLU targets Layer 2–3 at domain-expert depth.


Section 03

GLUE — General Language Understanding Evaluation

The Benchmark That Started the Modern NLP Era
It was 2018. Pretrained models such as ELMo were setting new records on individual tasks, and BERT would follow later that year. But there was a problem: every team tested their model on different tasks with different train/test splits. Comparing models was like comparing apples and aeroplanes.

Alex Wang and colleagues at NYU assembled GLUE — a single leaderboard covering 9 NLP tasks with a unified scoring system. For the first time, you could ask: "Is Model A better than Model B at general language understanding?" and get a defensible answer. The era of benchmark chasing had begun.

GLUE (Wang et al., 2018) aggregates nine NLP tasks into a single overall score, an unweighted average of the per-task scores, and provides a public leaderboard for fair model comparison. Eight of the nine are classification tasks in which the model reads a sentence (or pair of sentences) and picks a label; STS-B instead asks for a similarity score from 0 to 5.

The Nine GLUE Tasks

Task | What It Tests | Example Input | Labels | Size
CoLA | Grammatical acceptability | "The horse raced past the barn fell." → Is this grammatical? | Acceptable / Unacceptable | 10k
SST-2 | Sentiment analysis (movie reviews) | "A deeply moving and beautifully shot film." → Sentiment? | Positive / Negative | 67k
MRPC | Sentence paraphrase detection | S1: "He said the bank robbed him." S2: "He claimed to have been robbed." → Same meaning? | Equivalent / Not Equivalent | 5.8k
STS-B | Semantic textual similarity (0–5 score) | "A man is playing piano" vs "A man is playing a guitar" → Similarity? | Score 0–5 | 8.6k
QQP | Question paraphrase (Quora pairs) | "How do I learn Python?" vs "Best way to learn Python?" → Duplicate? | Duplicate / Not Duplicate | 364k
MNLI | Multi-genre natural language inference | P: "The sky is blue." H: "The sky is not red." → Relationship? | Entails / Neutral / Contradicts | 393k
QNLI | Question–answer entailment | Q: "When did WWII end?" Sentence: "WWII ended in 1945." → Answers Q? | Entails / Not Entails | 105k
RTE | Recognising textual entailment | P: "Marie Curie won two Nobel Prizes." H: "Marie Curie won a Nobel Prize." → Entails? | Entails / Not Entails | 2.5k
WNLI | Winograd schema coreference | "The trophy didn't fit in the suitcase because it was too big." → What is 'it'? | Entails / Not Entails | 852
📊
How the GLUE Score Is Computed

Each task uses its own primary metric (Accuracy, F1, Matthews Correlation, Pearson/Spearman correlation). The overall GLUE score is the simple average of all nine task scores (after normalising each to 0–100). The human baseline on GLUE is ~87.1. By 2020, the best models were scoring ~90+, effectively saturating the benchmark — which is exactly why SuperGLUE was created.
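The averaging described above can be sketched in a few lines. The per-task numbers below are hypothetical, chosen only to make the arithmetic easy to follow, and the helper name `glue_score` is invented for this example. The sketch assumes that a task with two metrics (for example MRPC with F1 and accuracy) first averages them into one task score; the official leaderboard has further details, such as separate MNLI matched and mismatched sets, that are ignored here.

```python
# Hypothetical per-task results on a 0-1 scale (not real leaderboard numbers).
# Tasks that report two metrics list both.
task_metrics = {
    "CoLA":  {"matthews_corr": 0.60},
    "SST-2": {"accuracy": 0.93},
    "MRPC":  {"f1": 0.90, "accuracy": 0.86},
    "STS-B": {"pearson": 0.88, "spearman": 0.86},
    "QQP":   {"f1": 0.72, "accuracy": 0.90},
    "MNLI":  {"accuracy": 0.84},
    "QNLI":  {"accuracy": 0.91},
    "RTE":   {"accuracy": 0.70},
    "WNLI":  {"accuracy": 0.65},
}


def glue_score(task_metrics):
    """Unweighted mean of per-task scores, each rescaled to 0-100."""
    per_task = {
        task: 100 * sum(metrics.values()) / len(metrics)
        for task, metrics in task_metrics.items()
    }
    overall = sum(per_task.values()) / len(per_task)
    return overall, per_task


overall, per_task = glue_score(task_metrics)
for task, score in per_task.items():
    print(f"{task:6s} {score:5.1f}")
print(f"GLUE score: {overall:.1f}")
```

Because the average is unweighted, the 852-example WNLI counts exactly as much as the 393k-example MNLI, which is one reason small tasks can swing the headline number.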

GLUE Score Evolution (2018–2020)

📈 GLUE Leaderboard — Key Milestones
[Bar chart: GLUE score on a 60–100 axis, with the human baseline marked at ≈87.1]
BiLSTM Baseline: 70.0
GPT-1 (Jun 2018): 72.8
BERT-L (Nov 2018): 80.4
MT-DNN (Feb 2019): 82.7
XLNet (Jun 2019): 88.4
ALBERT (Sep 2019): 90.9

GLUE was saturated within 18 months. Models crossed the human baseline (~87.1) by mid-2019, forcing researchers to build a harder challenge.
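Saturation is easy to quantify: subtract each model's score from the human baseline and watch the headroom shrink to zero. The snippet below does this for the milestone figures quoted in the chart above. The helper name `headroom` is made up for this example, and the BiLSTM baseline carries no date because the chart does not give one.

```python
HUMAN_BASELINE = 87.1  # estimated human GLUE score quoted above

# (model, date, GLUE score) taken from the milestone chart.
milestones = [
    ("BiLSTM baseline", None, 70.0),
    ("GPT-1", "Jun 2018", 72.8),
    ("BERT-L", "Nov 2018", 80.4),
    ("MT-DNN", "Feb 2019", 82.7),
    ("XLNet", "Jun 2019", 88.4),
    ("ALBERT", "Sep 2019", 90.9),
]


def headroom(score, baseline=HUMAN_BASELINE):
    """Points left before a model matches the human baseline (negative = above it)."""
    return round(baseline - score, 1)


for name, date, score in milestones:
    gap = headroom(score)
    status = f"{gap:.1f} pts below human" if gap > 0 else f"{-gap:.1f} pts above human"
    print(f"{name:16s} {date or 'n/a':9s} {score:5.1f}  {status}")

# First listed entry whose score exceeds the human baseline.
first_above = next(m for m in milestones if m[2] > HUMAN_BASELINE)
print(f"First entry above the human baseline: {first_above[0]} ({first_above[1]})")
```

Once the headroom turns negative, further gains mostly reflect fitting the quirks of the test sets, which is the practical meaning of a saturated benchmark.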

Running GLUE Evaluation in Python

from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
import numpy as np
import evaluate

# ── 1. Load a GLUE task (SST-2 sentiment) ──────────────────
dataset   = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ── 2. Tokenise ─────────────────────────────────────────────
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

tokenised = dataset.map(tokenize, batched=True)

# ── 3. Load model with classification head ──────────────────
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# ── 4. Define GLUE metric ────────────────────────────────────
glue_metric = evaluate.load("glue", "sst2")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return glue_metric.compute(predictions=preds, references=labels)

# ── 5. Train & evaluate ──────────────────────────────────────
args = TrainingArguments(
    output_dir="./sst2-bert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
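    evaluation_strategy="epoch",   # called eval_strategy in newer transformers releases
    save_strategy="epoch",         # must match the evaluation strategy when load_best_model_at_end=True
    load_best_model_at_end=True,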
    metric_for_best_model="accuracy",
    report_to="none"
)
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    compute_metrics=compute_metrics
)
trainer.train()
results = trainer.evaluate()
print(f"SST-2 Accuracy: {results['eval_accuracy']:.4f}")
OUTPUT
Epoch 1/3 | Loss: 0.3241 | Accuracy: 0.9083
Epoch 2/3 | Loss: 0.1872 | Accuracy: 0.9221
Epoch 3/3 | Loss: 0.1354 | Accuracy: 0.9291
SST-2 Accuracy: 0.9291 ← Competitive with published BERT-base results
⚠️
The Saturation Problem

By late 2019, models like ALBERT scored 90.9 on GLUE — above the estimated human baseline of 87.1. A benchmark where machines outperform humans has stopped being informative. GLUE could no longer distinguish frontier models from each other. It was time to build something harder.


Section 04

SuperGLUE — The Harder Successor

When the Exam Is Too Easy, You Write a Harder One
Imagine you design a maths exam for secondary school students. You publish it, and within a year, every student is scoring 100%. Has maths become easy? No — your exam has become too easy to measure the difference between a solid student and a genius. You need new questions that require genuine reasoning, not pattern matching.

That is exactly what happened to GLUE in 2019. Wang et al. responded by building SuperGLUE — eight tasks specifically selected because they were hard for the best models of the time, yet solvable by humans. The tasks require commonsense reasoning, multi-sentence understanding, and fine-grained logical inference — not just surface-level pattern matching.

The Eight SuperGLUE Tasks

Task | Skill Tested | Example | Difficulty
BoolQ | Yes/No QA from a passage | "Is a bat a mammal?" [passage about bats] → True/False | Medium
CB | Commitment Bank (3-way NLI on rare premises) | Premise has embedded negation and conditionals | Medium–Hard
COPA | Causal commonsense reasoning | "The man turned on the fan. What happened as a result?" | Hard
MultiRC | Multi-sentence reading comprehension | Long passage → multiple questions, each with multiple correct answers | Very Hard
ReCoRD | Reading comprehension with cloze-style NER | News article with masked entity → choose correct entity from passage | Hard
RTE | Textual entailment (carried over from GLUE) | Two sentences → Entails / Not Entails | Medium
WiC | Word-in-Context sense disambiguation | "He had a bank account." vs "The bank of the river." → Same sense? | Hard
WSC | Winograd Schema coreference (full) | "The trophy didn't fit in the suitcase because it was too big." → What is 'it'? | Very Hard
🎯 SuperGLUE — Task Complexity Radar (Top Model vs Human)
[Radar chart: one axis per task (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC); two series: best model (~2020) and human baseline]

WSC and MultiRC remained below human performance even for top-tier models in 2020, proving SuperGLUE's staying power.

Evaluating on SuperGLUE — BoolQ

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate, numpy as np

# ── SuperGLUE: BoolQ task (question + passage → yes/no) ─────
dataset   = load_dataset("super_glue", "boolq")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess(batch):
    return tokenizer(
        batch["question"], batch["passage"],
        truncation=True, padding="max_length", max_length=256
    )

tokenised = dataset.map(preprocess, batched=True)
tokenised = tokenised.rename_column("label", "labels")
tokenised.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

model  = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
metric = evaluate.load("super_glue", "boolq")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# SuperGLUE tasks are small — need more epochs at lower LR
args = TrainingArguments(
    output_dir="./boolq-roberta",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    evaluation_strategy="epoch",
    report_to="none"
)
trainer = Trainer(model=model, args=args,
                   train_dataset=tokenised["train"],
                   eval_dataset=tokenised["validation"],
                   compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
OUTPUT
{'eval_accuracy': 0.8347, 'eval_loss': 0.4021}
RoBERTa-base BoolQ accuracy : 83.5%
Human BoolQ baseline : 89.0%
Gap from human : -5.5 points ← SuperGLUE stays challenging
🥇
GLUE vs SuperGLUE — Key Differences

GLUE tasks are mostly sentence-pair classification with clear lexical cues. SuperGLUE tasks require reading longer passages, resolving ambiguous pronouns, inferring causal chains, and handling negation and conditionals. The human baseline on SuperGLUE (~89.8) was not surpassed by models until early 2021, roughly two years after release, against about one year for GLUE.


Section 05

MMLU — Massive Multitask Language Understanding

From "Can You Read?" to "Do You Actually Know Things?"
GLUE and SuperGLUE test whether a model can understand language structure — inference, coreference, entailment. But they do not test whether a model actually knows things about the world. A model could score perfectly on SuperGLUE while knowing nothing about biology, law, history, or medicine.

Dan Hendrycks (UC Berkeley) asked a different question: "If a model is supposed to be a general-purpose AI assistant, it needs actual knowledge — the kind you test in university exams." In 2020, he assembled 57 academic subjects covering everything from elementary mathematics to professional law and medicine, and called it MMLU. It became the most widely cited benchmark of the GPT-4 era.
🧬
STEM
14 subjects
Abstract Algebra, Astronomy, Biology, Chemistry, Computer Science, Physics, Statistics, Electrical Eng…
⚖️
Humanities
12 subjects
History, Philosophy, World Religions, Jurisprudence, International Law, Moral Disputes, Prehistory…
💻
Social Sciences
12 subjects
Economics, Psychology, Sociology, Human Sexuality, Political Science, Geography, Security Studies…
🏥
Professional
19 subjects
Professional Medicine, Law, Clinical Knowledge, Medical Genetics, Nutrition, Global Facts, Accounting…

What an MMLU Question Looks Like

📑 MMLU — Sample Questions Across Domains
STEM
Q: "Which of the following best describes the time complexity of merge sort?" (A) O(n)   (B) O(n log n)   (C) O(n²)   (D) O(log n) → Answer: B
Medicine
Q: "A 45-year-old man has ST elevation in leads II, III, aVF. Which artery is most likely occluded?" (A) LAD   (B) LCx   (C) RCA   (D) LMCA → Answer: C
Law
Q: "Which clause prevents states from discriminating against citizens of other states?" (A) Privileges and Immunities   (B) Full Faith and Credit   (C) Commerce Clause   (D) Due Process → Answer: A
Philosophy
Q: "Kant's categorical imperative is an example of:" (A) Consequentialism   (B) Deontology   (C) Virtue Ethics   (D) Contractarianism → Answer: B
🌟 MMLU — GPT-4 Scores by Category (approx)
[Bar chart: accuracy axis from 25% (chance) to 75%]
STEM (14 subjects): 70.1%
Humanities (12 subjects): 73.2%
Social Sciences (12 subjects): 76.5%
Professional (19 subjects): 68.9%

GPT-4 approximate scores by category. Professional subjects score lowest in this chart, with STEM close behind. Expert human performance ≈ 89%. Chance level = 25% (four choices).
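How the category numbers roll up into one headline figure depends on the averaging scheme. The sketch below uses the approximate category scores and subject counts from the chart above and compares two options: treating each category equally, or weighting each category by its number of subjects. Variable names are invented for this example, and real MMLU reports may instead average over individual questions, which this sketch cannot reproduce without per-question data.

```python
# (category, number of subjects, approximate accuracy in %) from the chart above.
categories = [
    ("STEM", 14, 70.1),
    ("Humanities", 12, 73.2),
    ("Social Sciences", 12, 76.5),
    ("Professional", 19, 68.9),
]

CHANCE = 100 / 4  # four answer options per question, so guessing scores 25%

total_subjects = sum(n for _, n, _ in categories)

# Option 1: every category counts equally.
category_macro = sum(acc for _, _, acc in categories) / len(categories)

# Option 2: every subject counts equally, so larger categories weigh more.
subject_weighted = sum(n * acc for _, n, acc in categories) / total_subjects

print(f"Subjects covered       : {total_subjects}")
print(f"Category macro-average : {category_macro:.1f}%")
print(f"Subject-weighted mean  : {subject_weighted:.1f}%")
print(f"Points above chance    : {subject_weighted - CHANCE:.1f}")
```

The two schemes differ by about half a point here because the largest category is also the lowest-scoring one; when comparing published MMLU numbers it is worth checking which scheme each report used.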

MMLU Zero-Shot Evaluation in Python

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, numpy as np
from tqdm import tqdm

# ── MMLU is evaluated zero/few-shot — no fine-tuning ────────
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

subject = "clinical_knowledge"
dataset = load_dataset("cais/mmlu", subject, split="test")
CHOICES = ["A", "B", "C", "D"]

def build_prompt(row):
    # Zero-shot prompt: the question and its four options, with no solved examples
    q = row["question"]
    prompt = f"Question: {q}\n"
    for letter, choice in zip(CHOICES, row["choices"]):
        prompt += f"{letter}. {choice}\n"
    prompt += "Answer:"
    return prompt

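def get_choice_logprob(prompt, choice):
    """Mean token log-probability of prompt + choice (higher = more likely).

    The prompt is identical for all four choices, so ranking by this value
    matches ranking by the answer letter's own log-probability as long as
    each letter tokenises to the same number of tokens.
    """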
    full_text = prompt + f" {choice}"
    inputs    = tokenizer(full_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

correct = 0
for row in tqdm(dataset, desc=f"MMLU {subject}"):
    prompt   = build_prompt(row)
    scores   = [get_choice_logprob(prompt, c) for c in CHOICES]
    pred_idx = int(np.argmax(scores))
    if pred_idx == row["answer"]:
        correct += 1

accuracy = correct / len(dataset)
print(f"\n{subject} accuracy : {accuracy:.4f}")
print(f"Random baseline  : 0.2500")
print(f"Expert human     : ~0.8900")
OUTPUT
MMLU clinical_knowledge: 100%|████████| 265/265
clinical_knowledge accuracy : 0.6943
Random baseline : 0.2500 (4-choice = 25%)
Expert human : ~0.8900
Gap from human: -19.57% ← Significant room for improvement on professional domains
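The loop above is zero-shot. MMLU results are very often reported 5-shot, where a handful of solved examples from the subject's dev split precede the test question. Here is a minimal sketch of such a prompt builder, following the layout used in the original MMLU evaluation code (a one-line header, then questions with lettered options and an "Answer:" line). The function names and the sample rows are hypothetical; the rows only mimic the shape of `cais/mmlu` records.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(row, include_answer):
    """Render one question; append the gold letter only for in-context examples."""
    lines = [row["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, row["choices"])]
    answer = f" {CHOICES[row['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_few_shot_prompt(subject, dev_rows, test_row, k=5):
    """k solved examples from the dev split, then the unsolved test question."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject.replace('_', ' ')}.\n\n"
    )
    # `dev_rows` is a plain list here; with a datasets.Dataset, select rows first.
    shots = [format_example(r, include_answer=True) for r in dev_rows[:k]]
    return header + "\n\n".join(shots + [format_example(test_row, include_answer=False)])


# Hypothetical rows in the same shape as `cais/mmlu` records.
dev_rows = [
    {"question": "Which organ produces insulin?",
     "choices": ["Liver", "Pancreas", "Kidney", "Spleen"], "answer": 1},
    {"question": "Normal adult resting heart rate is closest to:",
     "choices": ["20 bpm", "70 bpm", "150 bpm", "220 bpm"], "answer": 1},
]
test_row = {"question": "Which vitamin deficiency causes scurvy?",
            "choices": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"], "answer": 2}

prompt = build_few_shot_prompt("clinical_knowledge", dev_rows, test_row, k=2)
print(prompt)
```

The prompt ends on a bare "Answer:" so that the next token the model produces is the answer letter, which is what the log-probability scoring above relies on.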

Section 06

GLUE vs SuperGLUE vs MMLU — Direct Comparison

Property | GLUE (2018) | SuperGLUE (2019) | MMLU (2020)
Primary skill tested | Linguistic understanding | Deep reasoning + NLI | World knowledge + reasoning
Number of tasks/subjects | 9 tasks | 8 tasks | 57 subjects
Format | Classification (2–3 labels) | Classification + QA | 4-choice MCQ
Human baseline | 87.1 | 89.8 | ~89% avg
Models surpassed human | 2019 (1 year) | 2021 (2 years) | 2023+ (still informative)
Requires fine-tuning? | Yes, per task | Yes, per task | No, zero/few-shot
Still used in 2025? | Rarely (saturated) | Sometimes (older models) | Yes, standard benchmark
🕐 Benchmark Evolution Timeline
[Timeline, 2017 to 2022+]
2018: GLUE published
2019: GLUE saturated; SuperGLUE published
2020: MMLU published
2021: SuperGLUE saturated
2022+: GPT-4 era; MMLU still active

Each benchmark served its purpose: GLUE established standardisation, SuperGLUE raised the reasoning bar, MMLU shifted focus to knowledge.


Section 07

Full Multi-Benchmark Evaluation Pipeline

from datasets import load_dataset
from transformers import pipeline
import numpy as np

MODEL     = "HuggingFaceH4/zephyr-7b-beta"
generator = pipeline(
    "text-generation", model=MODEL,
    device_map="auto", torch_dtype="auto",
    max_new_tokens=4, do_sample=False  # greedy decoding, so no temperature is needed
)

# ── GLUE SST-2 ───────────────────────────────────────────────
def eval_sst2(n=200):
    ds = load_dataset("glue", "sst2", split=f"validation[:{n}]")
    correct = 0
    for row in ds:
        prompt = (f"Classify as Positive or Negative.\nReview: {row['sentence']}\nSentiment:")
        out  = generator(prompt)[0]["generated_text"][len(prompt):].strip().lower()
        pred = 1 if "positive" in out else 0
        correct += int(pred == row["label"])
    return correct / len(ds)  # use the real row count in case the split has fewer than n rows

# ── SuperGLUE BoolQ ──────────────────────────────────────────
def eval_boolq(n=200):
    ds = load_dataset("super_glue", "boolq", split=f"validation[:{n}]")
    correct = 0
    for row in ds:
        prompt = (f"Passage: {row['passage'][:400]}\nQuestion: {row['question']}\nAnswer yes or no:")
        out  = generator(prompt)[0]["generated_text"][len(prompt):].strip().lower()
        pred = 1 if "yes" in out else 0
        correct += int(pred == row["label"])
    return correct / len(ds)  # use the real row count in case the split has fewer than n rows

# ── MMLU subset ──────────────────────────────────────────────
CHOICES = ["A", "B", "C", "D"]
def eval_mmlu(subject="high_school_biology", n=100):
    ds = load_dataset("cais/mmlu", subject, split=f"test[:{n}]")
    correct = 0
    for row in ds:
        opts   = "\n".join(f"{c}. {t}" for c,t in zip(CHOICES, row["choices"]))
        prompt = f"Question: {row['question']}\n{opts}\nAnswer:"
        out    = generator(prompt)[0]["generated_text"][len(prompt):].strip()[:1].upper()
        pred   = CHOICES.index(out) if out in CHOICES else -1
        correct += int(pred == row["answer"])
    return correct / len(ds)  # use the real row count in case the split has fewer than n rows

# ── Report ────────────────────────────────────────────────────
results = {
    "GLUE SST-2":          eval_sst2(200),
    "SuperGLUE BoolQ":     eval_boolq(200),
    "MMLU HS Biology":     eval_mmlu("high_school_biology", 100),
    "MMLU Clinical Know.": eval_mmlu("clinical_knowledge",   100),
}
print("\n═══ Evaluation Report ═══")
for task, acc in results.items():
    bar = "█" * int(acc * 20) + "░" * (20 - int(acc * 20))
    print(f"{task:25s}  {bar}  {acc:.2%}")
OUTPUT
═══ Evaluation Report ═══
GLUE SST-2                 ██████████████████░░  91.00%
SuperGLUE BoolQ            ████████████████░░░░  80.00%
MMLU HS Biology            ███████████████░░░░░  79.00%
MMLU Clinical Know.        ███████████████░░░░░  78.00%
Profile: Strong on surface tasks; weaker on knowledge-intensive domains.
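Generation-based scoring is only as good as its answer parsing. The MMLU function above takes the first character of the output, which miscounts replies such as "(B)" or "The answer is B". A slightly more forgiving parser might look like the sketch below. The patterns and the name `extract_choice` are assumptions for illustration, not part of any library, and they deliberately accept uppercase letters only so that a reply starting with the article "A" is not mistaken for option A.

```python
import re

CHOICES = ["A", "B", "C", "D"]

# Patterns are tried in order, from most to least explicit.
_PATTERNS = [
    re.compile(r"(?i:answer\s*(?:is|:)?)\s*\(?([ABCD])\b"),  # "The answer is B", "Answer: (B)"
    re.compile(r"^\s*\(?([ABCD])[).:]"),                      # "(B) ...", "B. ...", "B: ..."
    re.compile(r"^\s*([ABCD])\s*$"),                          # a bare "B"
]


def extract_choice(generated):
    """Return the index of the chosen option, or -1 if no letter can be parsed."""
    for pattern in _PATTERNS:
        match = pattern.search(generated)
        if match:
            return CHOICES.index(match.group(1))
    return -1


for reply in ["B", " (C) RCA", "The answer is D.", "A patient presents with", "I am not sure"]:
    print(f"{reply!r:28} -> {extract_choice(reply)}")
```

Unparseable replies return -1 and are counted as wrong, which matches the behaviour of the pipeline above; it is worth logging how often that happens, because a high rate points to a prompt problem rather than a knowledge gap.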

Section 08

Critical Limitations — What Benchmarks Cannot Tell You

😱
Data Contamination
Training Set Leakage
If a model's training data included MMLU questions and answers, its score is inflated. This is difficult to detect and widely suspected for frontier models. A model scoring 90% on MMLU may have seen a substantial share of those questions during training, and the headline number alone cannot tell you how large that share is.
⚠ Affects all benchmarks
🔄
Benchmark Gaming
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." Models fine-tuned specifically on benchmark tasks score higher but may not generalise. A high GLUE score does not mean the model is better at general English understanding.
⚠ Especially true for GLUE / SuperGLUE
🌍
Narrow Coverage
What Is Not Measured
GLUE/SuperGLUE/MMLU do not measure creativity, safety, instruction following, long-form generation, multi-turn conversation, or real-world task completion. A model can ace MMLU and still be a terrible assistant.
⚠ Always use multiple evaluation frameworks
🔎
The Modern Evaluation Stack (2024–2025)

State-of-the-art evaluation now uses a portfolio of benchmarks: MMLU (knowledge), HumanEval / MBPP (coding), GSM8K / MATH (mathematics), HellaSwag / WinoGrande (common sense), MT-Bench / AlpacaEval (instruction following), TruthfulQA (truthfulness), and BBH, short for BIG-Bench Hard (complex reasoning). No single benchmark is sufficient. Responsible model releases report scores across several of these categories.


Section 09

Golden Rules — LLM Evaluation Practitioner Checklist

★ Non-Negotiable Rules for Honest LLM Evaluation
1
Report the shot count and prompt template. A model evaluated zero-shot versus 5-shot on MMLU can differ by 10+ percentage points. Always specify: zero-shot, 5-shot, chain-of-thought, or fine-tuned.
2
Check for data contamination. Before publishing scores, verify that test splits were not in the training corpus. Use n-gram overlap analysis or the lm-evaluation-harness contamination checker.
3
Never compare models evaluated with different prompts. "GPT-4 scored 87% on MMLU with a system prompt; Llama scored 85% without one" is not a valid comparison. Prompt wording can shift accuracy by 5–15%.
4
Use a standard evaluation harness. EleutherAI/lm-evaluation-harness provides standardised, reproducible evaluation across 200+ tasks including GLUE, SuperGLUE, and MMLU. Prefer it to custom evaluation loops whenever you intend to compare your numbers with published ones.
5
Report standard deviation across multiple runs. Sampled decoding, the choice of few-shot examples, and fine-tuning seeds all introduce run-to-run variance. Run evaluation at least 3 times with different seeds and report mean ± std. Single-run scores mislead.
6
A saturated benchmark tells you nothing. If most models score above 85% on GLUE or SuperGLUE, those scores no longer distinguish good from great. Switch to MMLU, HumanEval, or BIG-Bench Hard for frontier model comparison.
7
Benchmark performance ≠ real-world usefulness. Supplement automated benchmarks with human evaluation, red-teaming, and actual user studies before deploying a model in production. MMLU cannot tell you if a model is safe, helpful, or honest in a real conversation.
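Rule 5 can be made concrete with the standard library. The sketch below reports mean ± std over several runs and also computes the binomial standard error of a single accuracy figure, which shows how much noise a small evaluation subset carries on its own. The run accuracies are hypothetical and the function names are invented for this example.

```python
import math
import statistics


def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracies from repeated runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


def binomial_standard_error(accuracy, n_questions):
    """Standard error of an accuracy estimate measured on n independent questions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)


# Hypothetical accuracies from three runs with different seeds.
runs = [0.694, 0.702, 0.687]
mean, std = summarize_runs(runs)
print(f"Accuracy over {len(runs)} runs: {mean:.3f} ± {std:.3f}")

# Sampling noise for a single score of 79% measured on 100 questions.
se = binomial_standard_error(0.79, 100)
print(f"Standard error at n=100: ±{se:.3f}")
```

With only 100 questions the standard error is about four percentage points, so a gap such as 79% versus 78% on two 100-question subsets is well within noise.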
🎉
What You Have Learned

You now understand the full arc of NLP benchmarking: why GLUE standardised evaluation, why SuperGLUE pushed reasoning harder, and why MMLU shifted the field toward real-world knowledge. You can run all three programmatically, interpret their scores honestly, and understand exactly what they do — and do not — tell you about a model's capabilities. That is the foundation of principled LLM evaluation.