Definition
ARC-AGI is a benchmark of small colored-grid puzzles that tests whether an AI can figure out brand-new rules from a few examples instead of relying on memorized data.
At a glance
- Each puzzle shows a few input-output grids; the AI must infer the hidden rule and apply it - something most people do easily.
- It measures on-the-fly reasoning, not the fact-recall most AI benchmarks reward.
- ARC-AGI-2 (March 2025) is far harder for machines: average humans score ~60%, while top AI systems scored under 5% at launch.
- The annual ARC Prize offers $1M in total; the $700K grand prize requires a score above 85% and remains unclaimed.
What it tests
You see two or three examples of a grid being transformed, then you must produce the output for a fresh input. Because each puzzle uses a different hidden rule and supplies only a few examples[1], the benchmark rewards genuine reasoning over memorization, which makes it a closer proxy for general intelligence than tests an AI can ace by reading the whole internet[2].
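To make the "infer the rule, then apply it" loop concrete, here is a minimal Python sketch. It assumes the train/test layout used by the public ARC task files (lists of input/output grids of color indices). The toy task, the `CANDIDATES` table, and the helper names are hypothetical and invented for illustration. Real puzzles need a far richer space of rules, and this is not how competitive solvers work.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]  # each cell holds a color index

# Hypothetical toy task: two worked examples plus one fresh input to solve.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 5], [4, 0]]}],
}


def identity(grid: Grid) -> Grid:
    return [list(row) for row in grid]


def flip_horizontal(grid: Grid) -> Grid:
    # Mirror each row left-to-right.
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    # Reverse the order of the rows.
    return [list(row) for row in reversed(grid)]


def transpose(grid: Grid) -> Grid:
    # Swap rows and columns.
    return [list(col) for col in zip(*grid)]


# Assumed, deliberately tiny hypothesis space of candidate rules.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(train_pairs: List[dict]) -> Optional[str]:
    """Return the first candidate that reproduces every training output exactly."""
    for name, fn in CANDIDATES.items():
        if all(fn(pair["input"]) == pair["output"] for pair in train_pairs):
            return name
    return None


rule_name = infer_rule(task["train"])
prediction = CANDIDATES[rule_name](task["test"][0]["input"]) if rule_name else None
print(rule_name, prediction)
```

The sketch only works because the hidden rule happens to be in the hand-written candidate list. The difficulty of ARC-AGI is that each puzzle's rule is new, so no fixed list like this covers the benchmark.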
Why it matters
A big jump on ARC-AGI signals real progress: OpenAI’s o3 hit 75.7% (up to 87.5% with heavy compute) on ARC-AGI-1 in late 2024[3]. But the same model fell to roughly 3% on the harder ARC-AGI-2, a reality check that AI still struggles with truly novel problems and a useful reference point when judging vendor claims[4].
The scoreboard
The non-profit ARC Prize Foundation runs a yearly Kaggle contest with a strict compute cap to block brute force[5]. The best 2025 entry reached only ~24%, far short of the 85% threshold, so the $700K grand prize stays unclaimed.
Bottom line
Watch ARC-AGI scores as a grounded signal of whether AI can reason on the fly, and treat the unclaimed grand prize as evidence that human-level reasoning has not yet arrived.