Definition

A structured test of the most harm a powerful AI model could do if pushed to its limit, used to decide whether the model is safe to release.

At a glance

Measures the model’s maximum ability, not its average behavior — testers push it to do its worst.
Focuses on high-stakes harms: CBRN weapons, offensive cyber, AI self-improvement, and persuasion.
Acts as a release gate: cross a threshold and the model ships only once safeguards are proven.
Now formal policy at Anthropic, OpenAI, and Google DeepMind.

How it works

Instead of asking how a model usually behaves, testers ask what harm a determined bad actor could extract from it. They give it tools, let it reason in steps, and sample many attempts to draw out its true ceiling^[2]. A 2024 Google DeepMind study grouped the dangers into persuasion and deception, cyber-security, self-proliferation, and self-reasoning^[1]; industry frameworks add CBRN weapon uplift^[4].
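To make the gap between average behavior and the ceiling concrete, here is a minimal sketch of how repeated sampling changes the picture. It uses the pass@k estimator from code-generation benchmarks as a stand-in; the source does not say which statistic any lab actually uses, and the task counts below are made-up illustrative numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds,
    given that c of n sampled attempts succeeded.

    This is the standard unbiased pass@k estimator, used here only as
    an illustration of "sampling many attempts"; it is an assumption,
    not a method named in the text above.
    """
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical result: the model completed a harmful task in 3 of 100 tries.
n_samples, n_successes = 100, 3

average_rate = pass_at_k(n_samples, n_successes, 1)     # single-attempt rate
ceiling_at_10 = pass_at_k(n_samples, n_successes, 10)   # attacker gets 10 tries

print(f"single attempt: {average_rate:.2f}")
print(f"best of 10:     {ceiling_at_10:.2f}")
```

With these assumed numbers, a task that succeeds only 3% of the time on a single attempt succeeds roughly 27% of the time when an attacker can retry ten times, which is why ceiling-focused testing samples repeatedly rather than reporting a single-run average.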

How results are used

Each lab sets capability thresholds (Anthropic calls its tiers AI Safety Levels). Cross one, and the model is not released until stronger safeguards are shown to cut the risk^[3]. The evaluation decides whether a model ships, ships with guardrails, or stays locked down.
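The gating logic described above can be sketched as a small decision function. This is an illustrative assumption, not any lab's actual policy: the `Decision` values, the `DomainResult` fields, the 0-to-1 score scale, and the threshold numbers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    SHIP_WITH_SAFEGUARDS = "ship with safeguards"
    HOLD = "hold"


@dataclass(frozen=True)
class DomainResult:
    """Outcome of one evaluation domain (all fields hypothetical)."""
    domain: str
    ceiling_score: float       # elicited worst-case score on an assumed 0..1 scale
    threshold: float           # capability threshold for this domain
    safeguards_verified: bool  # whether safeguards were shown to cut the risk


def release_decision(results: list[DomainResult]) -> Decision:
    """Map evaluation results to one of the three outcomes in the text."""
    crossed = [r for r in results if r.ceiling_score >= r.threshold]
    if not crossed:
        # No threshold crossed: nothing in these domains blocks release.
        return Decision.SHIP
    if all(r.safeguards_verified for r in crossed):
        # Every crossed threshold has demonstrated safeguards behind it.
        return Decision.SHIP_WITH_SAFEGUARDS
    # At least one crossed threshold lacks verified safeguards.
    return Decision.HOLD


# Hypothetical results for two domains.
results = [
    DomainResult("cyber", ceiling_score=0.62, threshold=0.50, safeguards_verified=True),
    DomainResult("cbrn", ceiling_score=0.10, threshold=0.30, safeguards_verified=False),
]
print(release_decision(results).value)
```

In this sketch a single crossed threshold without verified safeguards is enough to hold the release, which mirrors the "not released until stronger safeguards are shown" rule; real frameworks attach far more process to each step.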

Why it matters

This is the AI industry’s closest thing to a pre-market safety inspection. For a business, a vendor’s published safety framework and dangerous-capability testing are a practical signal that someone is managing risks that could otherwise land on you.

Bottom line

These tests probe an AI's worst-case potential before launch. A published evaluation is a quick sign that your vendor checked the ceiling of risk first.

What are dangerous capability evaluations?

References