Definition

An AI benchmark is a standardized test, a fixed set of questions or tasks that usually come with known answers, used to score and compare how well AI models perform.

At a glance

Every model takes the same test, and scores are posted on a public leaderboard for easy comparison.
MMLU, a popular benchmark, asks ~16,000 multiple-choice questions across 57 subjects like law, medicine, and math^[1].
High scores can mislead: a model may have seen the answers during training (contamination), or a vendor may cherry-pick favorable test conditions (gaming).
Safest check: test a model on your own real tasks instead of relying on its leaderboard rank alone.

Two kinds

Some benchmarks have an answer key and mark each response right or wrong, estimating a model's overall ability much as a single exam estimates a student’s^[2]. Others measure human preference: Chatbot Arena shows people two anonymous answers, asks which is better, and ranks models from millions of such blind votes^[3].
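The two scoring styles can be sketched in a few lines of Python. This is an illustration only: the function names, the model names, and the sample data are hypothetical, and the basic Elo update used for the preference votes is an assumption chosen for simplicity. The source says only that models are ranked from blind votes, and a real leaderboard may use a different statistical model.

```python
def accuracy(predictions, answer_key):
    """Answer-key style: fraction of responses that match the key."""
    if not answer_key or len(predictions) != len(answer_key):
        raise ValueError("need equally sized, non-empty predictions and answer_key")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


def elo_ratings(votes, k=32.0, start=1000.0):
    """Preference style: rate models from (winner, loser) votes.

    Uses a basic Elo update (an assumption for illustration, not
    necessarily what any real leaderboard runs).
    """
    ratings = {}
    for winner, loser in votes:
        rw = ratings.setdefault(winner, start)
        rl = ratings.setdefault(loser, start)
        # Probability the winner was expected to win before this vote.
        expected = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        # Surprising wins move ratings more than expected ones.
        ratings[winner] = rw + k * (1.0 - expected)
        ratings[loser] = rl - k * (1.0 - expected)
    return ratings


# Hypothetical data: four multiple-choice answers, one blind vote.
print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # 0.75
print(elo_ratings([("model_a", "model_b")]))
# {'model_a': 1016.0, 'model_b': 984.0}
```

The first function needs a fixed key, so it suits multiple-choice tests. The second needs only pairwise judgments, which is why it fits open-ended answers that have no single correct response.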

Why scores can mislead

Test questions often leak into training data, so a model may have memorized answers rather than reasoned them out^[5]. Vendors also game results by reporting only their best runs or using prompting tricks^[4]. Since leaderboard rank drives funding and press, inflated numbers are common.
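Both failure modes can be illustrated with a rough sketch. The overlap check below is a simple word n-gram heuristic, assumed here for illustration rather than taken from the cited sources, and the helper names and sample numbers are hypothetical. A high overlap hints that a test question may appear in training text, but it proves nothing on its own.

```python
from statistics import mean


def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, corpus_text, n=8):
    """Share of the question's n-grams that also occur in corpus_text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    return len(question_grams & ngrams(corpus_text, n)) / len(question_grams)


def best_vs_mean(run_scores):
    """Return (best run, average run) to expose best-run-only reporting."""
    return max(run_scores), mean(run_scores)


# Hypothetical scores from five runs of the same model on the same test.
best, average = best_vs_mean([0.71, 0.74, 0.69, 0.78, 0.72])
print(f"best run: {best:.2f}, average run: {average:.2f}")
```

Reporting only the best run would advertise the higher figure, while the average is closer to what a user should expect from a single attempt.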

Bottom line

Use benchmarks as a first filter to shortlist models, then judge finalists on your own work.
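As a minimal sketch of that two-step process, assuming you already have public scores and a way to score each model on your own tasks (every name, threshold, and number below is hypothetical):

```python
def shortlist(leaderboard, min_score):
    """Step 1: keep models whose public benchmark score meets the threshold."""
    return [name for name, score in leaderboard.items() if score >= min_score]


def pick_finalist(candidates, own_task_score):
    """Step 2: choose the candidate that does best on your own evaluation.

    own_task_score is any callable mapping a model name to a score.
    Raises ValueError if candidates is empty.
    """
    return max(candidates, key=own_task_score)


# Hypothetical public leaderboard scores.
leaderboard = {"model_a": 0.82, "model_b": 0.64, "model_c": 0.79}
candidates = shortlist(leaderboard, min_score=0.75)

# Hypothetical results from running each shortlisted model on your own tasks.
own_results = {"model_a": 0.55, "model_c": 0.70}
print(pick_finalist(candidates, own_results.get))  # model_c
```

In this made-up example the leaderboard leader is not the final pick, which is the situation the advice above is meant to catch.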

References