Definition
A structured test of the most harm a powerful AI model could enable when pushed to its limit, used to decide whether the model is safe to release.
At a glance
- Measures the model’s maximum ability, not its average behavior — testers push it to do its worst.
- Focuses on high-stakes harms: CBRN (chemical, biological, radiological, and nuclear) weapons, offensive cyber, AI self-improvement, and persuasion.
- Acts as a release gate: cross a threshold and the model ships only once safeguards are proven.
- Now formal policy at Anthropic, OpenAI, and Google DeepMind.
How it works
Instead of asking how a model usually behaves, testers ask what harm a determined bad actor could extract from it. They give it tools, let it reason in steps, and sample many attempts to draw out its true ceiling[2]. A 2024 Google DeepMind study grouped the dangers into four areas: persuasion and deception, cyber-security, self-proliferation, and self-reasoning[1]. Industry frameworks add CBRN weapon uplift[4].
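To make the gap between average behavior and the ceiling concrete, here is a minimal sketch in Python. It assumes each attempt at a harmful task is graded as a simple success or failure, and it borrows the pass@k estimator from code-generation benchmarks as a stand-in for "sampling many attempts". The cited sources do not prescribe this metric, and the function names and the sample data are hypothetical.

```python
from math import comb


def mean_success_rate(outcomes):
    """Average behavior: the fraction of single attempts that succeeded."""
    return sum(outcomes) / len(outcomes)


def pass_at_k(outcomes, k):
    """Estimate the chance that at least one of k attempts succeeds.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k),
    where n is the number of graded attempts and c the number of successes.
    Assumption: attempts are independent samples graded pass/fail.
    """
    n = len(outcomes)
    c = sum(outcomes)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of attempts")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical data: 20 graded attempts, only 2 of which succeeded.
attempts = [True, True] + [False] * 18

average = mean_success_rate(attempts)  # what the model "usually" does
ceiling = pass_at_k(attempts, k=10)    # what a persistent actor might extract

print(f"average success rate: {average:.2f}")
print(f"pass@10 estimate:     {ceiling:.2f}")
```

With this made-up data, a model that succeeds on only 10% of single attempts still succeeds at least once in roughly three out of four batches of ten tries, which is why testers look at the ceiling rather than the average.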
How results are used
Each lab sets capability thresholds; Anthropic pairs its thresholds with tiers of required safeguards that it calls AI Safety Levels. If a model crosses a threshold, it is not released until stronger safeguards are shown to cut the risk[3]. The evaluation decides whether a model ships, ships with guardrails, or stays locked down.
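The release gate can be sketched as a small decision function. This is an illustration only, not any lab's actual procedure: the `DomainResult` fields, the 0-to-1 score scale, the threshold values, and the three `Decision` labels are all hypothetical, chosen to mirror the three outcomes described above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    SHIP_WITH_GUARDRAILS = "ship with guardrails"
    HOLD = "stay locked down"


@dataclass
class DomainResult:
    domain: str                # e.g. "cyber" (hypothetical label)
    score: float               # measured capability ceiling, assumed 0..1
    threshold: float           # hypothetical capability threshold
    safeguards_verified: bool  # were stronger safeguards shown to cut the risk?


def release_decision(results):
    """Return the gate outcome for a set of per-domain evaluation results."""
    crossed = [r for r in results if r.score >= r.threshold]
    if not crossed:
        # No threshold crossed: no extra safeguards are required.
        return Decision.SHIP
    if all(r.safeguards_verified for r in crossed):
        # Every crossed threshold has verified safeguards in place.
        return Decision.SHIP_WITH_GUARDRAILS
    # At least one crossed threshold lacks verified safeguards.
    return Decision.HOLD


# Hypothetical results for one model.
results = [
    DomainResult("cbrn", score=0.20, threshold=0.50, safeguards_verified=False),
    DomainResult("cyber", score=0.65, threshold=0.60, safeguards_verified=True),
]
decision = release_decision(results)
print(decision.value)
```

Here the model crosses only the hypothetical cyber threshold, and safeguards for that domain are verified, so the gate returns "ship with guardrails". Flipping `safeguards_verified` to `False` for that domain would return "stay locked down".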
Why it matters
This is the AI industry’s closest thing to a pre-market safety inspection. For a business, a vendor’s published safety framework and dangerous-capability testing are a practical signal that someone is managing risks that could otherwise land on you.
Bottom line
These tests probe an AI’s worst-case potential before launch. A vendor that publishes its results is giving you a quick sign that it checked the ceiling of risk first.
References
[1] Evaluating Frontier Models for Dangerous Capabilities. Mary Phuong, Matthew Aitchison, et al. (Google DeepMind). arXiv, arxiv.org
[2] Dangerous Capability Evaluations. AI Safety Atlas, ai-safety-atlas.com
[3] Anthropic's Responsible Scaling Policy. Anthropic, www.anthropic.com
[4] Frontier Capability Assessments. Frontier Model Forum, www.frontiermodelforum.org