Definition

Mechanistic interpretability is the field that reverse-engineers an AI’s internal wiring to find the specific concepts and reasoning steps behind its answers.

At a glance

AI models are ‘black boxes’: they give answers, but no one can directly read why.^[1]
This field opens the box, mapping ‘features’ (concepts the model fires on) and ‘circuits’ (the steps it reasons through).
Anthropic extracted up to ~34 million features from Claude 3 Sonnet, including one for the Golden Gate Bridge.^[2]
For business, it is a path to auditing AI for bias, deception, or unsafe behavior.

How it works

Models are trained, not programmed, so even their builders cannot point to where an answer comes from. A ‘feature’ is an internal pattern that represents a concept (a bridge, a bug, flattery); a ‘circuit’ is a chain of features that carries a reasoning step, for example from ‘capital of the state with Dallas’ to ‘Texas’ to ‘Austin.’^[3] A sparse autoencoder is the tool that untangles the model’s overlapping internal activity into readable features.^[2]
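
The sparse autoencoder step can be sketched in a few lines of Python. This is a toy illustration, not the method used in the cited work: the weights are hand-picked rather than trained, all names (`sae_forward`, `sae_loss`, `W_ENC`, and so on) are hypothetical, and the ReLU encoder with an L1 sparsity penalty is one common formulation assumed here. It shows a 2-number activation being expanded into 3 candidate features, only one of which fires.

```python
# Toy sparse autoencoder forward pass (hand-set weights, assumed for illustration).

def relu(values):
    """Keep positive values, zero out the rest."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into sparse features, then reconstruct it."""
    centered = [a - b for a, b in zip(activation, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    features = relu(pre)  # most entries end up exactly zero
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(activation, features, recon, l1_coeff):
    """Reconstruction error plus an L1 penalty that rewards sparsity."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    return mse + l1_coeff * sum(abs(f) for f in features)

# Hypothetical dictionary: 3 features living in a 2-dimensional activation space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one row per feature
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]    # one column per feature
B_DEC = [0.0, 0.0]

features, recon = sae_forward([2.0, 0.0], W_ENC, B_ENC, W_DEC, B_DEC)
print(features)  # [2.0, 0.0, 0.0] -> only feature 0 fires
print(recon)     # [2.0, 0.0]      -> the activation is recovered
```

In a real setting the weights would be learned by minimizing something like `sae_loss` over many activations, and a person (or another model) would then inspect which inputs make each feature fire in order to name it.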

Why it matters

Seeing internal concepts lets you check for bias or deception, debug failures systematically, and even steer behavior by adjusting features.^[4] This is still early research, but it is the clearest route to AI you can actually audit, which regulators and customers increasingly demand.^[5]
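
Steering by adjusting a feature can be pictured as nudging the model's internal activation along that feature's direction. The sketch below is a simplified illustration under that assumption; the function name `steer` and the numbers are hypothetical, and real systems apply this inside a running model rather than to a standalone list.

```python
# Toy feature steering: move an activation so one feature reaches a target strength.

def steer(activation, feature_direction, current_value, target_value):
    """Shift the activation along one feature's direction.

    current_value is how strongly the feature fires now;
    target_value is how strongly we want it to fire.
    """
    delta = target_value - current_value
    return [a + delta * d for a, d in zip(activation, feature_direction)]

# Hypothetical example: a feature that is currently off (0.0) is turned up to 3.0.
steered = steer([2.0, 0.0], [0.0, 1.0], current_value=0.0, target_value=3.0)
print(steered)  # [2.0, 3.0]
```

Setting `target_value` to zero instead would suppress the feature, which is the same idea run in the other direction.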

Bottom line

It is the effort to read an AI’s wiring instead of just trusting its output: the difference between hoping a model behaves and showing why it does.

References