Definition

Scalable oversight is the problem of supervising AI systems on tasks where they are more capable or faster than the people meant to check their work.

At a glance

Named in the 2016 paper Concrete Problems in AI Safety^[1].
Today’s main training method, reinforcement learning from human feedback (RLHF), relies on a human judging which of two answers is better, so it breaks down once the AI outperforms the reviewer.
The shared fix: use AI to help humans supervise AI^[2].
In a 2024 study, AI debaters arguing opposite sides pushed judge accuracy to 76-88% versus a near-50% baseline^[3].

How it works

The common trick is to enlist AI in checking AI. In debate, two AIs argue opposing sides and a weaker judge picks the stronger case. Other methods split a task into checkable pieces (amplification), train AI to predict human judgments (reward modeling), or test whether a weak supervisor can still steer a stronger model (weak-to-strong generalization)^[5]. OpenAI and Anthropic ran dedicated teams on this^[4].
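The debate setup described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the protocol from any cited study: the names Argument, make_debater, weak_judge, and run_debate are hypothetical, the debaters are simple string matchers standing in for AI models, and the judge is assumed to see only the transcript plus a flag saying whether a quote was checked against the hidden passage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Argument:
    side: str             # the answer this argument defends
    text: str             # what the judge reads
    quote_verified: bool  # True if the argument quotes the hidden passage


# Hypothetical debater type: (question, assigned answer, transcript) -> Argument
Debater = Callable[[str, str, List[Argument]], Argument]


def make_debater(passage: str) -> Debater:
    """Build a toy debater that can read a passage the judge never sees."""
    def debater(question: str, answer: str, transcript: List[Argument]) -> Argument:
        # Toy strategy: quote the first sentence that mentions the assigned answer.
        for sentence in passage.split(". "):
            if answer in sentence:
                quote = sentence.rstrip(".")
                return Argument(answer, f'The passage says: "{quote}."', True)
        # No supporting quote exists, so the debater can only assert its answer.
        return Argument(answer, f"The answer is clearly {answer}.", False)
    return debater


def weak_judge(question: str, transcript: List[Argument]) -> str:
    """Pick the side with the most verified quotes (assumed judging rule)."""
    scores: Dict[str, int] = {}
    for arg in transcript:
        scores[arg.side] = scores.get(arg.side, 0) + int(arg.quote_verified)
    return max(scores, key=lambda side: scores[side])


def run_debate(question: str, answers: Tuple[str, str], debater: Debater,
               judge: Callable[[str, List[Argument]], str], rounds: int = 2):
    """Alternate arguments for each answer, then ask the judge for a verdict."""
    transcript: List[Argument] = []
    for _ in range(rounds):
        for answer in answers:
            transcript.append(debater(question, answer, transcript))
    return judge(question, transcript), transcript


# Illustrative data only; the passage and question are made up.
PASSAGE = "The bridge opened in 1932. It was repainted in 1975."
verdict, transcript = run_debate(
    "When did the bridge open?", ("1940", "1932"), make_debater(PASSAGE), weak_judge
)
print(verdict)  # prints 1932
```

The point of the sketch is the division of labor: the judge never reads the passage, yet still reaches the right answer because only one side can back its claim with checkable evidence. Real debate experiments replace the string matching with language models and the counting rule with a human or model judge.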

Why it matters

It answers a practical question: can you trust an AI tool whose output you cannot fully verify? Knowing the term helps you press vendors on how their systems are checked and treat unverifiable high-stakes outputs with caution.

Bottom line

Once AI beats the people reviewing it, “a human approved it” is no longer enough. Scalable oversight aims to keep humans in control by having AI help check AI.
