Definition

ARC-AGI is a benchmark of small colored-grid puzzles that tests whether an AI can figure out brand-new rules from a few examples instead of relying on memorized data.

At a glance

Each puzzle shows a few input-output grid pairs; the AI must infer the hidden rule and apply it to a new input, something most people do easily.
It measures on-the-fly reasoning, not the fact-recall most AI benchmarks reward.
ARC-AGI-2 (March 2025) is far harder for machines: at launch, average humans scored ~60% while top AI systems scored under 5%.
The annual ARC Prize is worth $1M; its $700K grand prize unlocks only above 85% and remains unclaimed.

What it tests

You see two or three examples of a grid being transformed, then must produce the output for a fresh input. Each puzzle uses a different hidden rule and provides only a few examples^[1], so it rewards genuine reasoning over memorization, making it a closer proxy for general intelligence than tests an AI can ace by reading the whole internet^[2].
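To make the task format concrete, here is a minimal Python sketch of a single puzzle. Everything in it is a hypothetical illustration: the toy grids, the three candidate rules, and the helper infer_rule are invented for this example, and real ARC-AGI puzzles use far richer rules than a fixed list of candidates could cover. It only shows the shape of the problem: find a rule consistent with every example pair, then apply it to the fresh input.

```python
from typing import Callable, Optional

Grid = list[list[int]]  # each cell holds a color index

# Hypothetical toy puzzle: the hidden rule is "mirror each row left to right".
train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0, 0], [2, 2, 0]], [[0, 0, 1], [0, 2, 2]]),
    ([[3, 4], [0, 5]], [[4, 3], [5, 0]]),
]
test_input: Grid = [[7, 0, 8], [0, 9, 0]]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def flip_horizontal(grid: Grid) -> Grid:
    """Reverse every row."""
    return [row[::-1] for row in grid]


# Assumed, deliberately tiny hypothesis space (illustration only).
CANDIDATES: dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}


def infer_rule(pairs: list[tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate that reproduces every example output."""
    for name, rule in CANDIDATES.items():
        if all(rule(inp) == out for inp, out in pairs):
            return name
    return None


rule_name = infer_rule(train_pairs)
prediction = CANDIDATES[rule_name](test_input) if rule_name else None
print(rule_name, prediction)
```

The difficulty of the real benchmark comes from the fact that no such fixed list of candidate rules exists: each puzzle's rule has to be worked out from scratch.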

Why it matters

A big jump signals real progress: OpenAI’s o3 hit 75.7% (up to 87.5% with heavy compute) on ARC-AGI-1 in late 2024^[3]. But the same model fell to roughly 3% on the harder ARC-AGI-2, a reality check that AI still struggles with truly novel problems and a useful reference point when judging vendor claims^[4].

The scoreboard

The non-profit ARC Prize Foundation runs a yearly Kaggle contest with a strict compute cap to block brute force^[5]. The best 2025 entry reached only ~24%, well short of the 85% threshold, so the $700K grand prize stays unclaimed.
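The gap between a score and the prize threshold can be sketched in a few lines of Python. This is a simplified illustration, not the contest's official scoring code: it assumes all-or-nothing grading in which a puzzle counts as solved only when the predicted grid matches the solution exactly, and the grids, the score helper, and the constant name are hypothetical.

```python
Grid = list[list[int]]

# From the text: the grand prize unlocks only above 85%.
GRAND_PRIZE_THRESHOLD = 0.85


def score(predictions: list[Grid], solutions: list[Grid]) -> float:
    """Fraction of puzzles whose predicted grid equals the solution exactly."""
    solved = sum(1 for pred, sol in zip(predictions, solutions) if pred == sol)
    return solved / len(solutions)


# Hypothetical results on four toy puzzles; only the first one is solved.
solutions: list[Grid] = [[[1]], [[2, 2]], [[3], [3]], [[4, 0]]]
predictions: list[Grid] = [[[1]], [[2, 0]], [[3], [0]], [[0, 4]]]

accuracy = score(predictions, solutions)
unlocks_grand_prize = accuracy > GRAND_PRIZE_THRESHOLD
print(f"accuracy={accuracy:.0%}, grand prize unlocked: {unlocks_grand_prize}")
```

Under this kind of grading a nearly correct grid earns nothing, which is one reason scores stay low.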

Bottom line

Watch ARC-AGI scores as a grounded signal of whether AI can reason on the fly, and treat the unclaimed grand prize as evidence that human-level reasoning has not yet arrived.

References