Definition
Mechanistic interpretability is the field that reverse-engineers an AI’s internal wiring to find the specific concepts and reasoning steps behind its answers.
At a glance
- AI models are ‘black boxes’: they give answers, but no one can directly read why.[1]
- This field opens the box, mapping ‘features’ (concepts the model fires on) and ‘circuits’ (the steps it reasons through).
- Anthropic extracted millions of features from Claude 3 Sonnet (its largest feature dictionary had ~34 million entries), including one for the Golden Gate Bridge.[2]
- For business, it is the path to auditing AI for bias, deception, or unsafe behavior.
How it works
Models are trained, not programmed, so even their builders cannot point to where an answer comes from. A ‘feature’ is an internal pattern for a concept (a bridge, a bug, flattery); a ‘circuit’ is a chain of features that carries the reasoning, for example from ‘capital of the state with Dallas’ to ‘Texas’ to ‘Austin.’[3] Because individual neurons mix many concepts, researchers train a sparse autoencoder to split the model’s activations into readable features.[2]
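As a rough illustration of the sparse autoencoder idea, the sketch below encodes one activation vector into a wider, mostly-zero feature vector and decodes it back. Everything here is an assumption made for illustration: the names (`encode`, `decode`, `loss`, `W_ENC`, `W_DEC`), the tiny sizes, and the hand-picked weights. Real dictionaries are learned from a model's activations and are far larger.

```python
# Toy sparse autoencoder (SAE) forward pass in plain Python.
# All weights are hand-picked assumptions, not learned values.

def relu(v):
    """Zero out negative entries."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Encoder: 2-dimensional activation space, 4 candidate features.
W_ENC = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.0],
]
B_ENC = [-0.1, -0.1, -0.1, -0.1]  # negative bias pushes weak features to zero

# Decoder: each column is the direction one feature writes back.
W_DEC = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]

def encode(x):
    """Map an activation to non-negative, mostly-zero feature values."""
    pre = [p + b for p, b in zip(matvec(W_ENC, x), B_ENC)]
    return relu(pre)

def decode(f):
    """Rebuild an approximate activation from feature values."""
    return matvec(W_DEC, f)

def loss(x, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that rewards sparsity."""
    f = encode(x)
    x_hat = decode(f)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity

activation = [2.0, 0.05]        # hypothetical internal activation
features = encode(activation)   # only the first feature stays active
reconstruction = decode(features)
```

With this input only the first feature is non-zero, which is the point of the design: each activation is explained by a few active features, so each feature is easier to inspect and label.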
Why it matters
Seeing internal concepts lets you check for bias or deception, debug failures systematically, and even steer behavior by adjusting features.[4] The field is still early research, but it is the clearest route to AI you can actually audit, which regulators and customers increasingly demand.[5]
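Steering by adjusting a feature can be pictured as nudging an activation along that feature's direction. The sketch below is a simplification under stated assumptions: it reads the feature by projecting onto a unit-length direction instead of running a sparse autoencoder, and the function `steer`, the vector `bridge_direction`, and all numbers are hypothetical.

```python
# Toy feature steering: shift an activation along one feature's direction.
# The direction and numbers are illustrative assumptions.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, target):
    """Shift `activation` so its reading along `direction` equals `target`.

    `direction` is assumed to be a unit vector, so the reading is a
    simple projection.
    """
    current = dot(activation, direction)
    delta = target - current
    return [a + delta * d for a, d in zip(activation, direction)]

bridge_direction = [0.6, 0.8]   # hypothetical unit-length feature direction
activation = [1.0, 0.5]         # hypothetical internal activation

before = dot(activation, bridge_direction)
steered = steer(activation, bridge_direction, target=5.0)
after = dot(steered, bridge_direction)
```

Here the feature reading rises from about 1.0 to about 5.0 while the part of the activation orthogonal to the direction is left unchanged, which is why this kind of edit can change one behavior without rewriting everything else.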
Bottom line
It is the effort to read an AI’s wiring instead of just trusting its output: the difference between hoping a model behaves and showing why it does.
References
- [1] Mechanistic interpretability. Wikipedia. en.wikipedia.org
- [2] Mapping the Mind of a Large Language Model (Scaling Monosemanticity). Anthropic. www.anthropic.com
- [3] Tracing the thoughts of a large language model. Anthropic. www.anthropic.com
- [4] Anthropic can now track the bizarre inner workings of a large language model. MIT Technology Review. www.technologyreview.com
- [5] Mechanistic Interpretability for AI Safety -- A Review. Leonard Bereska, Efstratios Gavves. arXiv. arxiv.org