Definition

An AI evaluation (eval) is a structured test that scores how well an AI does a defined job, turning fuzzy expectations into a number you can track.

At a glance

A graded exam for an AI: sample inputs, expected good answers, a way to score them.
The score that matters comes from a custom test built on your own real tasks, not a public leaderboard.^[5]
Scoring can be automated, done by a second AI acting as a judge, or done by humans, and most teams blend all three.
Rerun it after any change to confirm quality did not quietly drop.

Why it matters

An AI can look great in a demo yet fail quietly on the cases you care about. An eval gives you evidence instead of hope^[2]: collect real examples, define what a good answer looks like, and score the AI against them^[1]. When you switch vendors or upgrade a model, the number tells you whether quality moved before customers notice.
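The collect, define, score loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the example data, the `toy_model` stand-in, and the `run_eval` helper are all hypothetical names invented here, and exact matching is assumed as the scoring rule.

```python
from typing import Callable, Dict, List

# Hypothetical examples; in practice these come from your own real tasks.
EXAMPLES: List[Dict[str, str]] = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "2 + 2 = ?", "expected": "4"},
    {"input": "Currency of Japan?", "expected": "Yen"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]


def run_eval(model: Callable[[str], str], examples: List[Dict[str, str]]) -> float:
    """Return the fraction of examples the model answers correctly."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for ex in examples:
        answer = model(ex["input"])
        # Assumed scoring rule: case-insensitive exact match.
        if answer.strip().lower() == ex["expected"].strip().lower():
            correct += 1
    return correct / len(examples)


def toy_model(prompt: str) -> str:
    """Stand-in for a real AI call; it gets one answer wrong on purpose."""
    canned = {
        "Capital of France?": "Paris",
        "2 + 2 = ?": "4",
        "Currency of Japan?": "yen",
        "Largest planet?": "Saturn",
    }
    return canned.get(prompt, "")


score = run_eval(toy_model, EXAMPLES)
print(f"score: {score:.2f}")
```

Swapping `toy_model` for a function that calls your actual AI gives one number you can track over time.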

How it’s scored

Automated tests check for a clearly correct answer; they are fast and cheap but only work for clear-cut tasks. An AI judge rates answers against your written criteria and, given a good rubric, closely matches human raters^[4]. Human review is the gold standard for subjective quality but is slow, so it is mostly used to spot-check.
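As a rough sketch of how the three scoring methods might sit side by side, assume the Python below. Every function name is hypothetical, and `keyword_judge` is only a placeholder for a real AI judge: it counts rubric keywords instead of calling a model, so that the example runs on its own.

```python
import random
from typing import List, Sequence


def exact_match(answer: str, expected: str) -> float:
    """Automated check: 1.0 if the answer matches exactly (ignoring case and spaces)."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def keyword_judge(answer: str, rubric_keywords: Sequence[str]) -> float:
    """Placeholder for an AI judge: fraction of rubric keywords found in the answer.

    A real judge would send the answer and the written rubric to a second model.
    """
    if not rubric_keywords:
        raise ValueError("rubric needs at least one keyword")
    text = answer.lower()
    hits = sum(1 for kw in rubric_keywords if kw.lower() in text)
    return hits / len(rubric_keywords)


def pick_for_human_review(items: List[str], k: int, seed: int = 0) -> List[str]:
    """Human spot-check: draw a small random sample for people to read."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(items, min(k, len(items)))


answers = ["We refunded the customer and apologized.", "Not our problem."]
print(exact_match(" Paris ", "paris"))
print([keyword_judge(a, ["refund", "apolog"]) for a in answers])
print(pick_for_human_review(answers, 1))
```

The design choice illustrated here is to send each answer to the cheapest method that can score it reliably and to keep human time for a small sample.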

Benchmarks vs. your own test

Public benchmarks like MMLU compare models in general, but top models all cluster near 88 to 90 percent, so the gaps between them are mostly noise^[3]. A leaderboard can’t tell you how an AI handles your invoices or your customers.

Bottom line

Build a small test from your own real tasks and rerun it whenever something changes: that is the difference between hoping and knowing.
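Rerunning the test after a change comes down to comparing the new score with a saved baseline. The sketch below assumes scores are fractions between 0 and 1. The `check_regression` name and the two-point tolerance are illustrative choices, not a standard.

```python
def check_regression(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """Return True if the current score is within `tolerance` of the baseline or better."""
    return current >= baseline - tolerance


baseline_score = 0.90  # hypothetical score saved before the change
new_score = 0.85       # hypothetical score after a model upgrade

if check_regression(baseline_score, new_score):
    print("OK: quality held")
else:
    print("Regression: quality dropped, investigate before shipping")
```

A small tolerance keeps ordinary run-to-run wobble from raising false alarms. How large it should be depends on how many examples your test contains.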

References