Definition
Model evaluations are structured tests of an AI’s capabilities and risks that give policymakers evidence to write rules, set reporting duties, and decide whether a model is safe to release.
At a glance
- Evals probe specific dangers: misuse (cyber or bio attacks), biased or deceptive behavior, and whether safety guardrails hold up under attack.
- Government bodies (the UK AI Security Institute, the US Center for AI Standards and Innovation, or CAISI) run the tests, often before public release, and translate results into policy.
- The EU AI Act now legally requires “systemic risk” model providers to run evaluations and report serious incidents.
- US pre-release testing is voluntary today: major labs have agreed to it but can withdraw at any time.
How it works
An evaluation is a structured exam for a model. Testers measure dangerous capabilities, societal harms, and whether guardrails can be broken, using benchmark question sets, expert “red-teaming,” and “human uplift” studies that compare AI help against a plain web search[1]. Specialized AI Safety or Security Institutes turn these technical results into plain-language risk insights for lawmakers[5]. Increasingly, independent external evaluators do the testing, so firms aren’t grading their own homework[3].
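As a rough illustration of the "human uplift" comparison described above, the sketch below computes the gap in average task score between a group working with the model and a control group limited to web search. The function name, the scoring scale, and the numbers are all hypothetical; real studies use larger samples, expert grading, and statistical tests, and this is not any institute's actual protocol.

```python
from statistics import mean


def uplift(ai_assisted_scores, web_only_scores):
    """Mean task score of the AI-assisted group minus the web-search-only group.

    A positive value suggests the model helped participants beyond what
    a plain web search provides. Hypothetical helper for illustration.
    """
    if not ai_assisted_scores or not web_only_scores:
        raise ValueError("both groups need at least one score")
    return mean(ai_assisted_scores) - mean(web_only_scores)


# Made-up task scores on a 0-to-1 scale, one per participant.
ai_group = [0.5, 0.75, 0.5, 0.25]
web_group = [0.25, 0.5, 0.25, 0.0]

print(uplift(ai_group, web_group))
```

A larger positive gap on a dangerous task is the kind of result that would be flagged for policymakers; a gap near zero suggests the model adds little over existing tools.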
Why it matters for a business
If you build on or sell powerful AI, evals are a compliance reality. Under the EU AI Act, providers of general-purpose models presumed to pose systemic risk (cumulative training compute above 10^25 floating-point operations) must run evaluations, do adversarial testing, and report serious incidents[2]. US testing is voluntary now but may soon be formalized[4]. Expect vendors to show evaluation evidence, and treat third-party testing as a sign of a regulator-ready product.
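To get a feel for the 10^25 compute threshold, a back-of-envelope check can help. The sketch below relies on a widely used rule of thumb (training compute is roughly 6 × parameters × training tokens); that approximation, the function names, and the example model sizes are assumptions for illustration and are not how the Act or any regulator defines the calculation.

```python
# Threshold cited in the text: 10^25 floating-point operations of training compute.
EU_SYSTEMIC_RISK_THRESHOLD = 1e25


def estimated_training_flop(n_params, n_tokens):
    """Rough estimate: about 6 operations per parameter per training token.

    This is a common heuristic, assumed here; it is not a legal definition.
    """
    return 6 * n_params * n_tokens


def presumed_systemic_risk(n_params, n_tokens):
    """True if the rough estimate exceeds the threshold."""
    return estimated_training_flop(n_params, n_tokens) > EU_SYSTEMIC_RISK_THRESHOLD


# Hypothetical models, not real products.
print(presumed_systemic_risk(70e9, 15e12))   # about 6.3e24 operations
print(presumed_systemic_risk(400e9, 15e12))  # about 3.6e25 operations
```

Under these assumptions the smaller hypothetical model falls below the threshold and the larger one falls above it, which is why only a handful of the largest models are in scope.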
Bottom line
Powerful AI increasingly ships with a test report attached, and that report is what policy is built on.