GLUE, SuperGLUE & MMLU

Section 01

The Story That Makes Benchmarks Make Sense

📖 Real World Analogy

The Olympic Trials for AI — Why We Need Standardised Tests

Imagine two athletes both claim to be "the fastest runner in the world." One trained in thin mountain air at altitude; the other on a sea-level track. Both ran fast in their own conditions — but you cannot compare them unless you put them on the same track, on the same day, running the same distance.

Language models are the same. When OpenAI, Google, and Meta each claim their model is "best at reasoning," what does that actually mean? Without a shared, standardised evaluation suite, every number is marketing. Benchmarks are the Olympic track — the same questions, the same rules, the same scoring for every model, so the world can compare them honestly.

This tutorial walks you through the three most important benchmark families: GLUE, SuperGLUE, and MMLU — what they test, why they were built, how they work, and what their scores actually tell you.

LLM evaluation is the systematic process of measuring how well a language model performs across a defined set of tasks. Because language is infinitely varied, evaluation is hard — and the field has converged on a handful of trusted benchmark suites that have become the common currency of AI research.

💡

Why Evaluation Is Harder Than It Looks

A model that scores 92% on a benchmark may have simply memorised the test answers during training — a phenomenon called data contamination. Real evaluation requires checking that the benchmark was not part of the model's training corpus, running multiple benchmarks (not just one), and testing on held-out data the model has never seen. No single benchmark score tells the whole story.

Section 02

What Are We Actually Measuring?

Before diving into specific benchmarks, understand the landscape of NLP capabilities that evaluation suites are designed to probe. Language understanding involves multiple distinct skills — and different benchmarks target different slices of that landscape.

🧠

Linguistic Understanding

Syntax · Semantics · Pragmatics

Does the model understand what a sentence means? Can it detect when two sentences contradict each other, or when one follows from the other? Can it resolve what a pronoun refers to?

🔍

World Knowledge

Facts · Reasoning · Common Sense

Does the model know that Paris is the capital of France? That water boils at 100°C at sea-level pressure? That if someone "kicked the bucket" they probably did not interact with a bucket? This requires storing and retrieving factual knowledge.

🤖

Reasoning & Inference

Logic · Multi-step · Causal

Given premises, can the model derive correct conclusions? Can it follow a chain of logical steps? Can it identify which piece of information is necessary to answer a question versus which is a distraction?

🌟 The Three Layers of Language Understanding

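Each layer builds on the one below: linguistic understanding (Layer 1), world knowledge (Layer 2), and reasoning and inference (Layer 3). GLUE probes Layers 1–2; SuperGLUE stresses Layers 2–3; MMLU targets Layers 2–3 at domain-expert depth.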

Section 03

GLUE — General Language Understanding Evaluation

📖 Origin Story

The Benchmark That Started the Modern NLP Era

It was 2018. Pretrained models such as ELMo were breaking records on individual tasks, and BERT would follow later that year. But there was a problem: every team tested its model on different tasks with different train/test splits. Comparing models was like comparing apples and aeroplanes.

Alex Wang and colleagues at NYU assembled GLUE — a single leaderboard covering 9 NLP tasks with a unified scoring system. For the first time, you could ask: "Is Model A better than Model B at general language understanding?" and get a defensible answer. The era of benchmark chasing had begun.

GLUE (Wang et al., 2018) aggregates nine sentence-level NLP tasks into a single overall score, an unweighted average of the per-task scores, and provides a public leaderboard for fair model comparison. All nine tasks ask a model to read a sentence (or a pair of sentences) and either make a classification decision or, for STS-B, predict a similarity score.

The Nine GLUE Tasks

Task	What It Tests	Example Input	Labels	Size
CoLA	Grammatical acceptability	"The horse raced past the barn fell." → Is this grammatical?	Acceptable / Unacceptable	10k
SST-2	Sentiment analysis (movie reviews)	"A deeply moving and beautifully shot film." → Sentiment?	Positive / Negative	67k
MRPC	Sentence paraphrase detection	S1: "He said the bank robbed him." S2: "He claimed to have been robbed." → Same meaning?	Equivalent / Not Equivalent	5.8k
STS-B	Semantic textual similarity (0–5 score)	"A man is playing piano" vs "A man is playing a guitar" → Similarity?	Score 0–5	8.6k
QQP	Question paraphrase (Quora pairs)	"How do I learn Python?" vs "Best way to learn Python?" → Duplicate?	Duplicate / Not Duplicate	364k
MNLI	Multi-genre natural language inference	P: "The sky is blue." H: "The sky is not red." → Relationship?	Entails / Neutral / Contradicts	393k
QNLI	Question–answer entailment	Q: "When did WWII end?" Sentence: "WWII ended in 1945." → Answers Q?	Entails / Not Entails	105k
RTE	Recognising textual entailment	P: "Marie Curie won two Nobel Prizes." H: "Marie Curie won a Nobel Prize." → Entails?	Entails / Not Entails	2.5k
WNLI	Winograd schema coreference, recast as entailment	P: "The trophy didn't fit in the suitcase because it was too big." H: "The trophy was too big." → Entails?	Entails / Not Entails	852
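Not every task in the table is scored with plain accuracy. CoLA, for instance, is scored with the Matthews correlation coefficient, which stays at zero for a model that always predicts the majority label. The sketch below computes it from scratch on toy, made-up labels; the function name and the data are illustrative assumptions, not part of the GLUE tooling.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation for binary labels (1 = acceptable, 0 = unacceptable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy, made-up labels: 8 acceptable sentences, 2 unacceptable ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_acceptable = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_acceptable)) / len(y_true)
mcc = matthews_corrcoef(y_true, always_acceptable)
print(f"accuracy = {accuracy:.2f}, MCC = {mcc:.2f}")  # accuracy = 0.80, MCC = 0.00
```

The degenerate classifier reaches 80% accuracy on this skewed toy set yet scores 0 on Matthews correlation, which is why the metric suits imbalanced acceptability data.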

📊

How the GLUE Score Is Computed

Each task uses its own primary metric (Accuracy, F1, Matthews Correlation, Pearson/Spearman correlation). The overall GLUE score is the simple average of all nine task scores (after normalising each to 0–100). The human baseline on GLUE is ~87.1. By 2020, the best models were scoring ~90+, effectively saturating the benchmark — which is exactly why SuperGLUE was created.
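The averaging described above can be sketched in a few lines. The per-task numbers below are made up for illustration and are not real leaderboard results. Averaging multiple metrics within a task before averaging across tasks, and treating MNLI matched/mismatched as two metrics of one task, are assumptions of this sketch.

```python
from statistics import mean

# Hypothetical per-task scores on a 0-100 scale (made-up numbers, not real results).
task_scores = {
    "CoLA":  {"matthews_correlation": 60.0},
    "SST-2": {"accuracy": 94.0},
    "MRPC":  {"f1": 90.0, "accuracy": 86.0},
    "STS-B": {"pearson": 89.0, "spearman": 88.0},
    "QQP":   {"f1": 72.0, "accuracy": 89.0},
    "MNLI":  {"accuracy_matched": 86.0, "accuracy_mismatched": 85.0},
    "QNLI":  {"accuracy": 92.0},
    "RTE":   {"accuracy": 70.0},
    "WNLI":  {"accuracy": 65.0},
}

def glue_score(scores):
    """Average metrics within each task, then take the unweighted mean over tasks."""
    per_task = {task: mean(metrics.values()) for task, metrics in scores.items()}
    return mean(per_task.values()), per_task

overall, per_task = glue_score(task_scores)
for task, score in per_task.items():
    print(f"{task:6s} {score:5.1f}")
print(f"Overall {overall:.1f}")
```

Because the mean is unweighted, the 852-example WNLI counts exactly as much as the 393k-example MNLI, so small tasks can move the headline number noticeably.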

GLUE Score Evolution (2018–2020)

📈 GLUE Leaderboard — Key Milestones

GLUE was saturated within 18 months. Models crossed the human baseline (~87.1) by mid-2019, forcing researchers to build a harder challenge.

Running GLUE Evaluation in Python

from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
import numpy as np
import evaluate

# ── 1. Load a GLUE task (SST-2 sentiment) ──────────────────
dataset   = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ── 2. Tokenise ─────────────────────────────────────────────
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

tokenised = dataset.map(tokenize, batched=True)

# ── 3. Load model with classification head ──────────────────
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# ── 4. Define GLUE metric ────────────────────────────────────
glue_metric = evaluate.load("glue", "sst2")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return glue_metric.compute(predictions=preds, references=labels)

# ── 5. Train & evaluate ──────────────────────────────────────
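args = TrainingArguments(
    output_dir="./sst2-bert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",    # renamed to eval_strategy in recent transformers releases
    save_strategy="epoch",          # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)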
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    compute_metrics=compute_metrics
)
trainer.train()
results = trainer.evaluate()
print(f"SST-2 Accuracy: {results['eval_accuracy']:.4f}")

⚠️

The Saturation Problem

By late 2019, leaderboard entries such as ALBERT and T5 were scoring roughly 89 to 90 on GLUE, above the estimated human baseline of 87.1. A benchmark on which machines outperform the human baseline can no longer separate frontier models from one another. It was time to build something harder.

Section 04

SuperGLUE — The Harder Successor

📖 The Challenge

When the Exam Is Too Easy, You Write a Harder One

Imagine you design a maths exam for secondary school students. You publish it, and within a year, every student is scoring 100%. Has maths become easy? No — your exam has become too easy to measure the difference between a solid student and a genius. You need new questions that require genuine reasoning, not pattern matching.

That is exactly what happened to GLUE in 2019. Wang et al. responded by building SuperGLUE — eight tasks specifically selected because they were hard for the best models of the time, yet solvable by humans. The tasks require commonsense reasoning, multi-sentence understanding, and fine-grained logical inference — not just surface-level pattern matching.

The Eight SuperGLUE Tasks

Task	Skill Tested	Example	Difficulty
BoolQ	Yes/No QA from a passage	"Is a bat a mammal?" [passage about bats] → True/False	Medium
CB	Commitment Bank — 3-way NLI on rare premises	Premise has embedded negation and conditionals	Medium–Hard
COPA	Causal commonsense reasoning	"The man turned on the fan. What happened as a result?"	Hard
MultiRC	Multi-sentence reading comprehension	Long passage → multiple questions, each with multiple correct answers	Very Hard
ReCoRD	Reading comprehension with cloze-style NER	News article with masked entity → choose correct entity from passage	Hard
RTE	Textual entailment (carried over from GLUE)	Two sentences → Entails / Not Entails	Medium
WiC	Word-in-Context sense disambiguation	"He had a bank account." vs "The bank of the river." → Same sense?	Hard
WSC	Winograd Schema coreference (full)	"The trophy didn't fit because it was too big." → What is 'it'?	Very Hard

🎯 SuperGLUE — Task Complexity Radar (Top Model vs Human)

WSC and MultiRC remained below human performance even for top-tier models in 2020, proving SuperGLUE's staying power.

Evaluating on SuperGLUE — BoolQ

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate, numpy as np

# ── SuperGLUE: BoolQ task (question + passage → yes/no) ─────
dataset   = load_dataset("super_glue", "boolq")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def preprocess(batch):
    return tokenizer(
        batch["question"], batch["passage"],
        truncation=True, padding="max_length", max_length=256
    )

tokenised = dataset.map(preprocess, batched=True)
tokenised = tokenised.rename_column("label", "labels")
tokenised.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

model  = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
metric = evaluate.load("super_glue", "boolq")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# SuperGLUE tasks are small — need more epochs at lower LR
args = TrainingArguments(
    output_dir="./boolq-roberta",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    evaluation_strategy="epoch",    # renamed to eval_strategy in recent transformers releases
    report_to="none"
)
trainer = Trainer(model=model, args=args,
                   train_dataset=tokenised["train"],
                   eval_dataset=tokenised["validation"],
                   compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())

OUTPUT

{'eval_accuracy': 0.8347, 'eval_loss': 0.4021}
RoBERTa-base BoolQ accuracy : 83.5%
Human BoolQ baseline        : 89.0%
Gap from human              : -5.5 points ← SuperGLUE stays challenging
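An accuracy figure for a yes/no task is easier to read next to the score of a trivial classifier that always predicts the most frequent label. The sketch below computes that majority baseline. The labels are toy, made-up data, and `majority_baseline` is a hypothetical helper name; with the real dataset you would pass something like `dataset["validation"]["label"]` instead.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Toy labels standing in for a validation split (1 = yes, 0 = no); made-up data.
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
baseline = majority_baseline(toy_labels)   # 7 of the 10 toy labels are "yes"
model_acc = 0.8347                          # accuracy figure from the run above

print(f"majority baseline = {baseline:.2f}, lift over baseline = {model_acc - baseline:+.4f}")
```

If the real label distribution is skewed, the useful quantity is the lift over this baseline rather than the distance from 50%.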

🥇

GLUE vs SuperGLUE — Key Differences

GLUE tasks are mostly single-sentence or sentence-pair classification with clear lexical cues. SuperGLUE tasks require reading longer passages, resolving ambiguous pronouns, inferring causal chains, and handling negation and conditionals. The human baseline on SuperGLUE (~89.8) was first surpassed in early 2021, about a year and a half after the benchmark's release, so it stayed informative somewhat longer than GLUE did.

Section 05

MMLU — Massive Multitask Language Understanding

📖 The Paradigm Shift

From "Can You Read?" to "Do You Actually Know Things?"

GLUE and SuperGLUE test whether a model can understand language structure — inference, coreference, entailment. But they do not test whether a model actually knows things about the world. A model could score perfectly on SuperGLUE while knowing nothing about biology, law, history, or medicine.

Dan Hendrycks (UC Berkeley) and colleagues asked a different question: if a model is supposed to be a general-purpose AI assistant, does it have the kind of knowledge tested in university exams? In 2020, they assembled 57 subjects covering everything from elementary mathematics to professional law and medicine, and called it MMLU. It became one of the most widely cited benchmarks of the GPT-4 era.

🧬

STEM

18 subjects

Abstract Algebra, Astronomy, Biology, Chemistry, Computer Science, Physics, Statistics, Electrical Eng…

⚖️

Humanities

13 subjects

History, Philosophy, World Religions, Jurisprudence, International Law, Professional Law, Moral Disputes, Prehistory…

💻

Social Sciences

12 subjects

Economics, Psychology, Sociology, Human Sexuality, Political Science, Geography, Security Studies…

🏥

Other (business, health, misc.)

14 subjects

Professional Medicine, Clinical Knowledge, Medical Genetics, Nutrition, Global Facts, Accounting…

What an MMLU Question Looks Like

📑 MMLU — Sample Questions Across Domains

STEM

Q: "Which of the following best describes the time complexity of merge sort?" (A) O(n) (B) O(n log n) (C) O(n²) (D) O(log n) → Answer: B

Medicine

Q: "A 45-year-old man has ST elevation in leads II, III, aVF. Which artery is most likely occluded?" (A) LAD (B) LCx (C) RCA (D) LMCA → Answer: C

Law

Q: "Which clause prevents states from discriminating against citizens of other states?" (A) Privileges and Immunities (B) Full Faith and Credit (C) Commerce Clause (D) Due Process → Answer: A

Philosophy

Q: "Kant's categorical imperative is an example of:" (A) Consequentialism (B) Deontology (C) Virtue Ethics (D) Contractarianism → Answer: B

🌟 MMLU — GPT-4 Scores by Category (approx)

GPT-4 approximate scores by category. STEM remains the hardest. Expert human performance ≈ 89%. Chance level = 25% (four choices).
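To judge how far a subject-level score sits above the 25% chance level, it helps to attach a sampling error to the accuracy, since individual MMLU subjects contain only a few hundred questions at most. The sketch below uses a normal-approximation interval. The counts are hypothetical and `accuracy_interval` is an illustrative helper, not part of any benchmark toolkit.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)          # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

CHANCE = 0.25  # four answer choices

# Hypothetical counts: 70 correct out of 100 questions in one subject.
p, low, high = accuracy_interval(70, 100)
print(f"accuracy = {p:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
print("above chance" if low > CHANCE else "not distinguishable from chance")
```

With 100 questions the interval is roughly nine points wide on either side, so small differences between two models on a single subject are often within noise.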

MMLU Zero-Shot Evaluation in Python

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, numpy as np
from tqdm import tqdm

# ── MMLU is evaluated zero/few-shot — no fine-tuning ────────
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

subject = "clinical_knowledge"
dataset = load_dataset("cais/mmlu", subject, split="test")
CHOICES = ["A", "B", "C", "D"]

def build_prompt(row):
    # Zero-shot prompt: the question and its four options, with no solved examples
    q = row["question"]
    prompt = f"Question: {q}\n"
    for letter, choice in zip(CHOICES, row["choices"]):
        prompt += f"{letter}. {choice}\n"
    prompt += "Answer:"
    return prompt

def get_choice_logprob(prompt, choice):
    """Log-probability of the answer letter given the prompt (higher = more likely)."""
    inputs = tokenizer(prompt + f" {choice}", return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes the letter is the final token; the logits at position -2 predict it.
    log_probs = torch.log_softmax(logits[0, -2].float(), dim=-1)
    return log_probs[inputs["input_ids"][0, -1]].item()

correct = 0
for row in tqdm(dataset, desc=f"MMLU {subject}"):
    prompt   = build_prompt(row)
    scores   = [get_choice_logprob(prompt, c) for c in CHOICES]
    pred_idx = int(np.argmax(scores))
    if pred_idx == row["answer"]:
        correct += 1

accuracy = correct / len(dataset)
print(f"\n{subject} accuracy : {accuracy:.4f}")
print(f"Random baseline  : 0.2500")
print(f"Expert human     : ~0.8900")

OUTPUT

MMLU clinical_knowledge: 100%|████████| 265/265
clinical_knowledge accuracy : 0.6943
Random baseline  : 0.2500 (4-choice = 25%)
Expert human     : ~0.8900
Gap from human   : -19.57% ← Significant room for improvement on professional domains
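The script above scores each question zero-shot. MMLU results are also commonly reported few-shot, with solved examples placed before the test question. A minimal sketch of building such a prompt follows. The rows are toy examples made up for illustration, the helper names are hypothetical, and the template simply reuses the Question/options/Answer layout from the script above. With the real dataset, the demonstrations would typically come from a separate split, such as the `dev` split of `cais/mmlu`, rather than from the test questions.

```python
CHOICES = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    """Render one MMLU-style row; the answer letter is appended for demonstrations."""
    lines = [f"Question: {row['question']}"]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, row["choices"])]
    lines.append("Answer:" + (f" {CHOICES[row['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_few_shot_prompt(dev_rows, test_row, k=5):
    """Prefix the test question with up to k solved demonstrations."""
    shots = [format_example(r) for r in dev_rows[:k]]
    return "\n\n".join(shots + [format_example(test_row, include_answer=False)])

# Toy rows (made up for illustration, not taken from MMLU).
dev_rows = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Bonn", "Oslo"], "answer": 0},
]
test_row = {"question": "3 * 3 = ?", "choices": ["6", "8", "9", "12"], "answer": 2}

prompt = build_few_shot_prompt(dev_rows, test_row, k=2)
print(prompt)
```

Whatever value of `k` and template you choose, report them alongside the score, since both affect the result.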

Section 06

GLUE vs SuperGLUE vs MMLU — Direct Comparison

Property	GLUE (2018)	SuperGLUE (2019)	MMLU (2020)
Primary skill tested	Linguistic understanding	Deep reasoning + NLI	World knowledge + reasoning
Number of tasks/subjects	9 tasks	8 tasks	57 subjects
Format	Classification (2–3 labels)	Classification + QA	4-choice MCQ
Human baseline	87.1	89.8	~89% avg
Models surpassed human	2019 (about 1 year)	Early 2021 (about 1.5 years)	2023+ (still informative)
Requires fine-tuning?	Yes — per task	Yes — per task	No — zero/few-shot
Still used in 2025?	Rarely (saturated)	Sometimes (older models)	Yes — standard benchmark

🕐 Benchmark Evolution Timeline

Each benchmark served its purpose: GLUE established standardisation, SuperGLUE raised the reasoning bar, MMLU shifted focus to knowledge.

Section 07

Full Multi-Benchmark Evaluation Pipeline

from datasets import load_dataset
from transformers import pipeline
import numpy as np

MODEL     = "HuggingFaceH4/zephyr-7b-beta"
generator = pipeline(
    "text-generation", model=MODEL,
    device_map="auto", torch_dtype="auto",
    max_new_tokens=4, do_sample=False   # greedy decoding; temperature does not apply
)

# ── GLUE SST-2 ───────────────────────────────────────────────
def eval_sst2(n=200):
    ds = load_dataset("glue", "sst2", split=f"validation[:{n}]")
    correct = 0
    for row in ds:
        prompt = (f"Classify as Positive or Negative.\nReview: {row['sentence']}\nSentiment:")
        out  = generator(prompt)[0]["generated_text"][len(prompt):].strip().lower()
        pred = 1 if "positive" in out else 0
        correct += int(pred == row["label"])
    return correct / len(ds)   # divide by the rows actually evaluated

# ── SuperGLUE BoolQ ──────────────────────────────────────────
def eval_boolq(n=200):
    ds = load_dataset("super_glue", "boolq", split=f"validation[:{n}]")
    correct = 0
    for row in ds:
        prompt = (f"Passage: {row['passage'][:400]}\nQuestion: {row['question']}\nAnswer yes or no:")
        out  = generator(prompt)[0]["generated_text"][len(prompt):].strip().lower()
        pred = 1 if "yes" in out else 0
        correct += int(pred == row["label"])
    return correct / len(ds)   # divide by the rows actually evaluated

# ── MMLU subset ──────────────────────────────────────────────
CHOICES = ["A", "B", "C", "D"]
def eval_mmlu(subject="high_school_biology", n=100):
    ds = load_dataset("cais/mmlu", subject, split=f"test[:{n}]")
    correct = 0
    for row in ds:
        opts   = "\n".join(f"{c}. {t}" for c,t in zip(CHOICES, row["choices"]))
        prompt = f"Question: {row['question']}\n{opts}\nAnswer:"
        out    = generator(prompt)[0]["generated_text"][len(prompt):].strip()[:1].upper()
        pred   = CHOICES.index(out) if out in CHOICES else -1
        correct += int(pred == row["answer"])
    return correct / len(ds)   # divide by the rows actually evaluated

# ── Report ────────────────────────────────────────────────────
results = {
    "GLUE SST-2":          eval_sst2(200),
    "SuperGLUE BoolQ":     eval_boolq(200),
    "MMLU HS Biology":     eval_mmlu("high_school_biology", 100),
    "MMLU Clinical Know.": eval_mmlu("clinical_knowledge",   100),
}
print("\n═══ Evaluation Report ═══")
for task, acc in results.items():
    bar = "█" * int(acc * 20) + "░" * (20 - int(acc * 20))
    print(f"{task:25s}  {bar}  {acc:.2%}")

OUTPUT

═══ Evaluation Report ═══
GLUE SST-2                 ██████████████████░░  91.00%
SuperGLUE BoolQ            ████████████████░░░░  80.00%
MMLU HS Biology            ███████████████░░░░░  79.00%
MMLU Clinical Know.        ███████████████░░░░░  78.00%

Profile: Strong on surface tasks; weaker on knowledge-intensive domains.
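The pipeline above takes the first generated character as the MMLU answer, which miscounts outputs such as "(C)" or "The answer is D." A slightly more tolerant parser is sketched below. `parse_choice` is a hypothetical helper, and the regular expression is one reasonable choice rather than a standard.

```python
import re

CHOICES = ["A", "B", "C", "D"]
_LETTER = re.compile(r"\b([ABCD])\b")   # a standalone capital A-D

def parse_choice(generated: str) -> int:
    """Return the index of the first standalone A-D letter, or -1 if none is found."""
    match = _LETTER.search(generated.strip())
    return CHOICES.index(match.group(1)) if match else -1

samples = ["B", " (C) because", "The answer is D.", "not sure", ""]
for text in samples:
    print(repr(text), "->", parse_choice(text))
```

Returning -1 for unparseable outputs lets you report the parse-failure rate separately from wrong answers. The parser is still heuristic: a reply that opens with the capitalised article "A" would be read as choice A, so log-probability scoring remains the more robust option where the model exposes logits.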

Section 08

Critical Limitations — What Benchmarks Cannot Tell You

😱

Data Contamination

Training Set Leakage

If a model's training data included MMLU questions and answers, its score is inflated. This is difficult to detect and widely suspected for frontier models. A model scoring 90% on MMLU might have seen a substantial share of those questions during training, in which case the score reflects recall rather than knowledge or reasoning.

⚠ Affects all benchmarks
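One simple way to look for leakage is to check whether long word sequences from test questions also appear in the training corpus. The sketch below flags test items that share at least one word n-gram with any corpus document. The window size, the whitespace tokenisation, and the toy strings are all assumptions made for illustration; real checks run over far larger corpora and usually normalise text more carefully.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_items, corpus_docs, n=8):
    """Fraction of test items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(test_items), flagged

# Toy data, made up for illustration.
corpus_docs = [
    "forum post: which artery is most likely occluded in an inferior mi? answer: rca",
    "completely unrelated text about cooking pasta at home tonight",
]
test_items = [
    "Which artery is most likely occluded in this patient?",
    "What is the time complexity of merge sort?",
]

rate, flagged = contamination_rate(test_items, corpus_docs, n=5)
print(f"flagged {len(flagged)} of {len(test_items)} items ({rate:.0%})")
```

A hit does not prove the model memorised the answer, and a miss does not rule out paraphrased leakage, so treat the rate as a screening signal rather than a verdict.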

🔄

Benchmark Gaming

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." Models fine-tuned specifically on benchmark tasks score higher but may not generalise. A high GLUE score does not mean the model is better at general English understanding.

⚠ Especially true for GLUE / SuperGLUE

🌍

Narrow Coverage

What Is Not Measured

GLUE/SuperGLUE/MMLU do not measure creativity, safety, instruction following, long-form generation, multi-turn conversation, or real-world task completion. A model can ace MMLU and still be a terrible assistant.

⚠ Always use multiple evaluation frameworks

🔎

The Modern Evaluation Stack (2024–2025)

State-of-the-art evaluation now uses a portfolio of benchmarks: MMLU (knowledge), HumanEval / MBPP (coding), GSM8K / MATH (mathematics), HellaSwag / WinoGrande (common sense), MT-Bench / AlpacaEval (instruction following), TruthfulQA (truthfulness), and BIG-Bench Hard, or BBH (complex reasoning). No single benchmark is sufficient. Responsible model releases report scores across several of these categories.

Section 09

Golden Rules — LLM Evaluation Practitioner Checklist

★ Non-Negotiable Rules for Honest LLM Evaluation

Report the shot count and prompt template. A model evaluated zero-shot versus 5-shot on MMLU can differ by 10+ percentage points. Always specify: zero-shot, 5-shot, chain-of-thought, or fine-tuned.

Check for data contamination. Before publishing scores, verify that test splits were not in the training corpus. Use n-gram overlap analysis or the lm-evaluation-harness contamination checker.

Never compare models evaluated with different prompts. "GPT-4 scored 87% on MMLU with a system prompt; Llama scored 85% without one" is not a valid comparison. Prompt wording can shift accuracy by 5–15%.

Use a standard evaluation harness. EleutherAI/lm-evaluation-harness provides standardised, reproducible evaluation across 200+ tasks including GLUE, SuperGLUE, and MMLU. Prefer it to custom evaluation loops when you want numbers that are comparable with published results.

Report standard deviation across multiple runs. Sampling-based decoding, few-shot example selection, and fine-tuning seeds all introduce run-to-run variance. Run evaluation at least 3 times with different seeds and report mean ± std. Single-run scores mislead.

A saturated benchmark tells you nothing. If most models score above 85% on GLUE or SuperGLUE, those scores no longer distinguish good from great. Switch to MMLU, HumanEval, or BIG-Bench Hard for frontier model comparison.

Benchmark performance ≠ real-world usefulness. Supplement automated benchmarks with human evaluation, red-teaming, and actual user studies before deploying a model in production. MMLU cannot tell you if a model is safe, helpful, or honest in a real conversation.
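Two of the rules above, reporting the shot count and reporting spread across runs, are easy to combine in one small reporting step. The sketch below does that with made-up accuracies. The `run_config` fields and the `summarise_runs` helper are illustrative assumptions, not a standard format.

```python
from statistics import mean, stdev

def summarise_runs(accuracies):
    """Mean and sample standard deviation over repeated evaluation runs."""
    return mean(accuracies), stdev(accuracies)

# Hypothetical accuracies from three runs with different seeds (made-up numbers).
run_config = {"benchmark": "MMLU", "shots": 5, "prompt_template": "Question/options/Answer:"}
accuracies = [0.70, 0.72, 0.71]

avg, std = summarise_runs(accuracies)
print(
    f"{run_config['benchmark']} ({run_config['shots']}-shot, "
    f"template={run_config['prompt_template']!r}): "
    f"{avg:.3f} ± {std:.3f} over {len(accuracies)} runs"
)
```

Keeping the configuration next to the numbers makes it harder to compare scores that were produced under different prompts or shot counts by accident.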

🎉

What You Have Learned

You now understand the full arc of NLP benchmarking: why GLUE standardised evaluation, why SuperGLUE pushed reasoning harder, and why MMLU shifted the field toward real-world knowledge. You can run all three programmatically, interpret their scores honestly, and understand exactly what they do — and do not — tell you about a model's capabilities. That is the foundation of principled LLM evaluation.