Definition
Mechanistic interpretability is the field that reverse-engineers an AI’s internal wiring to find the specific concepts and reasoning steps behind its answers.
At a glance
- AI models are ‘black boxes’: they give answers, but no one can directly read why.[1]
- This field opens the box, mapping ‘features’ (concepts the model fires on) and ‘circuits’ (the steps it reasons through).
- Anthropic found ~34 million features in Claude 3 Sonnet, including a Golden Gate Bridge one.[2]
- For business, it is the path to auditing AI for bias, deception, or unsafe behavior.
How it works
Models are trained, not programmed, so even their builders cannot point to where an answer comes from. A ‘feature’ is an internal pattern for a concept (a bridge, a bug, flattery); a ‘circuit’ is the chain that reasons from ‘capital of the state with Dallas’ to ‘Texas’ to ‘Austin.’[3] A sparse autoencoder untangles these into readable features.[2]
Why it matters
Seeing internal concepts lets you check for bias or deception, debug failures systematically, and even steer behavior by adjusting features.[4] Still early research, but the clearest route to AI you can actually audit, as regulators and customers increasingly demand.[5]
Bottom line
It is the effort to read an AI’s wiring instead of just trusting its output, the difference between hoping a model behaves and showing why it does.