Definition
MMLU (Massive Multitask Language Understanding) is a standardized AI exam of about 16,000 multiple-choice questions across 57 subjects that scores how broadly knowledgeable a model is.
At a glance
- Like a giant SAT for AI: ~16,000 questions across 57 subjects, from math and law to medicine and history[1].
- Score = percentage of questions answered correctly. Each question has four choices, so random guessing scores about 25%; top models now score roughly 85-90% or higher[2].
- Created in 2020 by researchers led by Dan Hendrycks to test knowledge that models were never specifically trained on[2].
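The scoring rule above is simple enough to sketch in a few lines. This is a minimal illustration, not the official evaluation harness: the function names and the five-question answer key are hypothetical, made up for the example.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def random_guess_baseline(num_choices: int = 4) -> float:
    """Expected score from uniform random guessing among num_choices options."""
    return 100.0 / num_choices


# Hypothetical five-question answer key and model outputs (illustrative only).
answer_key = ["A", "C", "B", "D", "A"]
model_output = ["A", "C", "D", "D", "A"]

score = accuracy_percent(model_output, answer_key)
baseline = random_guess_baseline()
print(f"score: {score:.1f}%  baseline: {baseline:.1f}%")
```

Here the model gets four of five questions right, so it scores 80%, well above the 25% guessing baseline. A score near 25% would mean the model is doing no better than chance.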
Why it matters
A higher MMLU score is widely read as shorthand for broad competence across many fields, so vendors quote it heavily (the dataset has 100M+ downloads)[1][4]. For buyers comparing models from vendors like OpenAI, Anthropic, and Google, it is a useful first filter on general knowledge[3].
What it does not tell you
MMLU tests only book knowledge. It says nothing about how well a model matches your brand voice, works with your own documents, avoids made-up answers, or holds up on cost and speed at scale. A model can ace it and still fumble your customer emails.
Bottom line
Treat MMLU as a quick report card for general knowledge, not the final word; the model that wins on your own tasks is the one worth paying for.
References
[1] MMLU. Wikipedia. en.wikipedia.org
[2] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. Measuring Massive Multitask Language Understanding. arXiv / ICLR 2021. arxiv.org
[3] What is MMLU? LLM Benchmark Explained and Why It Matters. DataCamp. www.datacamp.com
[4] MMLU Benchmark (Massive Multi-task Language Understanding). Klu. klu.ai