Definition
Scalable oversight is the set of methods for supervising AI systems that may be more capable, or simply faster, than the people meant to check their work.
At a glance
- Named in the 2016 paper Concrete Problems in AI Safety[1].
- Today’s main training method, reinforcement learning from human feedback (RLHF), needs a human to judge which answer is better, so it breaks down once the AI outperforms the reviewer.
- The shared fix: use AI to help humans supervise AI[2].
- In a 2024 study, AI debaters arguing opposite sides pushed judge accuracy to 76-88% versus a near-50% baseline[3].
How it works
The common trick is to enlist AI in checking AI. In debate, two AIs argue opposing sides and a weaker judge picks the stronger case. Other methods split a task into checkable pieces (amplification), train AI to predict human judgments (reward modeling), or test whether a weak supervisor can still steer a stronger model (weak-to-strong generalization)[5]. Both OpenAI and Anthropic have run dedicated teams on this[4].
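To make the debate idea concrete, here is a toy sketch in Python. It is an illustration under invented assumptions, not the protocol from the cited study: the task (summing a long list), the debaters `honest` and `make_liar`, and the function `judge_debate` are all hypothetical. The judge never adds up the list. It only asks both debaters to split their claim in half, follows the half where they disagree, and finally looks up a single number.

```python
from typing import Callable, List, Tuple

# A debater answers: "what is the sum of xs[lo:hi]?"
Debater = Callable[[List[int], int, int], int]


def honest(xs: List[int], lo: int, hi: int) -> int:
    """Hypothetical honest debater: always reports the true sum."""
    return sum(xs[lo:hi])


def make_liar(offset: int) -> Debater:
    """Hypothetical dishonest debater: inflates any range containing index 0.

    Its claims stay self-consistent (halves add up to the whole), so the
    lie can only be exposed by drilling down to a single element.
    """
    def liar(xs: List[int], lo: int, hi: int) -> int:
        return sum(xs[lo:hi]) + (offset if lo == 0 else 0)
    return liar


def judge_debate(xs: List[int], a: Debater, b: Debater) -> Tuple[str, int]:
    """Weak judge: returns (winner, rounds). Assumes xs is non-empty.

    The judge inspects exactly one element of xs, however long the list is.
    """
    lo, hi = 0, len(xs)
    claim_a, claim_b = a(xs, lo, hi), b(xs, lo, hi)
    if claim_a == claim_b:
        return "tie", 0
    rounds = 0
    while hi - lo > 1:
        rounds += 1
        mid = (lo + hi) // 2
        a_left, a_right = a(xs, lo, mid), a(xs, mid, hi)
        b_left, b_right = b(xs, lo, mid), b(xs, mid, hi)
        # A debater whose halves do not add up to its own claim loses at once.
        if a_left + a_right != claim_a:
            return "B", rounds
        if b_left + b_right != claim_b:
            return "A", rounds
        # Follow a half where the debaters still disagree.
        if a_left != b_left:
            hi, claim_a, claim_b = mid, a_left, b_left
        else:
            lo, claim_a, claim_b = mid, a_right, b_right
    truth = xs[lo]  # the judge's single direct check
    if claim_a == truth:
        return "A", rounds
    if claim_b == truth:
        return "B", rounds
    return "neither", rounds


if __name__ == "__main__":
    numbers = list(range(1, 1025))  # 1024 values
    print(judge_debate(numbers, honest, make_liar(7)))
```

In this toy, the judge settles a dispute over 1,024 numbers in ten rounds and one lookup. Real debate research deals with natural-language arguments and fallible judges, so the guarantee here is far stronger than anything shown in practice.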
Why it matters
It answers a practical question: can you trust an AI tool whose output you cannot fully verify? Knowing the term helps you press vendors on how their systems are checked and treat unverifiable high-stakes outputs with caution.
Bottom line
Once AI beats the people reviewing it, “a human approved it” is no longer enough. Scalable oversight aims to keep humans in control by having AI help check AI.