Definition
An AI benchmark is a standardized test used to score and compare how well AI models perform: a fixed set of questions or tasks, usually with known answers.
At a glance
- Every model takes the same test, and scores are often posted on a public leaderboard for easy comparison.
- MMLU, a popular benchmark, asks ~16,000 multiple-choice questions across 57 subjects like law, medicine, and math[1].
- High scores can mislead: models may have seen the answers during training (contamination), or vendors may cherry-pick favorable test conditions (gaming).
- Safest check: test a model on your own real tasks instead of relying on its leaderboard rank alone.
Two kinds
Some benchmarks have an answer key and mark each of a model's responses right or wrong, estimating overall ability much as a single exam estimates a student’s[2]. Others measure human preference: Chatbot Arena shows people two anonymous answers and asks which is better, then ranks models from millions of blind votes[3].
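The two scoring styles can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the Elo-style update is a common textbook way to turn pairwise votes into a ranking, used here as a stand-in rather than as a description of how Chatbot Arena actually computes its leaderboard.

```python
def accuracy(predictions, answer_key):
    """Answer-key scoring: fraction of responses that match the key."""
    if not answer_key or len(predictions) != len(answer_key):
        raise ValueError("need a non-empty key and one prediction per question")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


def elo_update(rating_a, rating_b, a_wins, k=32.0):
    """Preference scoring: adjust two ratings after one blind vote.

    The starting ratings and k are illustrative assumptions.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def rank_from_votes(votes, start=1000.0):
    """votes: iterable of (model_a, model_b, winner) tuples.

    Returns model names ordered from highest to lowest rating.
    """
    ratings = {}
    for a, b, winner in votes:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        ratings[a], ratings[b] = elo_update(ra, rb, winner == a)
    return sorted(ratings, key=ratings.get, reverse=True)
```

The first function needs an answer key and produces a single score per model. The other two need only a stream of "which answer was better" votes, so the result is a relative ranking rather than an absolute score.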
Why scores can mislead
Test questions often leak into training data, so a model may have memorized answers rather than reasoned them out[5]. Vendors also game results by reporting only their best runs or using prompting tricks[4]. Because leaderboard rank drives funding and press coverage, vendors have a strong incentive to look good, and inflated numbers are common.
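One rough way to look for leaked questions is to check whether long word sequences from a test question also appear in a sample of training text. The sketch below assumes you have such a sample, which is often not the case for commercial models. The function names and the default window of 8 words are illustrative choices, and real contamination checks are considerably more involved.

```python
def ngrams(text, n):
    """Set of all n-word sequences in text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also occur in corpus_text.

    A value near 1.0 suggests the question appears almost verbatim.
    Returns 0.0 when the question is shorter than n words.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & ngrams(corpus_text, n)
    return len(shared) / len(question_grams)
```

A high ratio is only a warning sign: paraphrased or translated copies of a question would slip past this check, and a low ratio does not prove a model never saw the question.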
Bottom line
Use benchmarks as a first filter to shortlist models, then judge finalists on your own work.
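The two-step approach above could look something like this in Python. Everything here is a hypothetical sketch: `leaderboard` is assumed to be a mapping from model name to public score, each model is assumed to be a plain function that takes a prompt and returns an answer, and `passes` is whatever check you consider a success on your own work.

```python
def shortlist(leaderboard, top_k):
    """Step 1: keep the top_k models by public benchmark score."""
    return sorted(leaderboard, key=leaderboard.get, reverse=True)[:top_k]


def evaluate_on_own_tasks(models, tasks, passes):
    """Step 2: score each finalist on your own tasks.

    models: dict of name -> callable(prompt) returning an answer
    tasks:  list of (prompt, expected) pairs drawn from real work
    passes: callable(answer, expected) returning True or False
    Returns a dict of name -> pass rate between 0.0 and 1.0.
    """
    if not tasks:
        raise ValueError("need at least one task")
    results = {}
    for name, ask in models.items():
        passed = sum(passes(ask(prompt), expected) for prompt, expected in tasks)
        results[name] = passed / len(tasks)
    return results
```

In practice the `passes` check is the hard part. Exact matching works for short factual answers, while open-ended tasks usually need a human reviewer or a more forgiving comparison.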