What is mechanistic interpretability?

Published June 1, 2026 · 5 min read

Definition

Mechanistic interpretability is the field that reverse-engineers an AI’s internal wiring to find the specific concepts and reasoning steps behind its answers.

At a glance

AI models are ‘black boxes’: they give answers, but no one can directly read why.^[1]
This field opens the box, mapping ‘features’ (concepts the model fires on) and ‘circuits’ (the steps it reasons through).
Anthropic extracted roughly 34 million features from Claude 3 Sonnet, including one for the Golden Gate Bridge.^[2]
For businesses, it offers a path to auditing AI for bias, deception, or unsafe behavior.

How it works

Models are trained, not programmed, so even their builders cannot point to where an answer comes from. A ‘feature’ is an internal pattern of activity that represents a concept (a bridge, a bug, flattery); a ‘circuit’ is a chain of features that carries out a reasoning step, such as going from ‘capital of the state with Dallas’ to ‘Texas’ to ‘Austin.’^[3] A sparse autoencoder, a small auxiliary network trained on the model’s internal activations, untangles those activations into readable features.^[2]
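To make the sparse autoencoder idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not Anthropic's implementation: the weights are hand-picked rather than trained, the activation has only two dimensions, and the feature labels ("bridge", "bug", "flattery") and the function names `sae_forward` and `sae_loss` are hypothetical. It shows the two parts of the technique: encoding a dense activation into a larger set of mostly-zero features, and scoring the result by reconstruction error plus a sparsity penalty.

```python
# Toy sparse autoencoder (SAE) forward pass. Weights are invented for
# illustration; a real SAE learns them from millions of model activations.

def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode a dense activation into sparse features, then reconstruct it."""
    centered = [a - b for a, b in zip(activation, b_dec)]
    pre = [z + b for z, b in zip(matvec(w_enc, centered), b_enc)]
    features = relu(pre)
    # w_dec holds one direction per feature; the reconstruction is the
    # bias plus each direction scaled by its feature's activation.
    recon = list(b_dec)
    for f, direction in zip(features, w_dec):
        recon = [r + f * d for r, d in zip(recon, direction)]
    return features, recon

def sae_loss(activation, features, recon, l1_coeff=0.1):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    return mse + l1_coeff * sum(abs(f) for f in features)

# Three hypothetical features for a two-dimensional activation
# (more features than dimensions, as in a real SAE).
labels = ["bridge", "bug", "flattery"]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # tied to w_dec for simplicity
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

activation = [2.0, 0.0]
features, recon = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(activation, features, recon)
active = [name for name, f in zip(labels, features) if f > 0]
print(active, features, recon)
```

In this toy only the "bridge" feature fires, and the activation is reconstructed exactly from that single feature. That is the readable, one-concept-at-a-time picture a trained sparse autoencoder aims for.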

Why it matters

Seeing internal concepts lets you check for bias or deception, debug failures systematically, and even steer behavior by adjusting features.^[4] The field is still early-stage research, but it is the clearest route to AI you can actually audit, which regulators and customers increasingly demand.^[5]
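Steering by adjusting a feature can be pictured as simple vector arithmetic. The sketch below assumes one simplified form of steering: adding a multiple of a feature's direction to an internal activation. The three-number activation, the `bridge_direction` vector, and the function names are all hypothetical; in practice the edit is applied to a real model's activations while it runs, and the exact method varies.

```python
# Toy feature steering: push an activation along one feature's direction.
# The vectors are invented for illustration, not taken from a real model.

def steer(activation, feature_direction, strength):
    """Return a copy of the activation moved along the feature direction."""
    return [a + strength * d for a, d in zip(activation, feature_direction)]

def feature_reading(activation, feature_direction):
    """Dot product: how strongly the activation points along the feature."""
    return sum(a * d for a, d in zip(activation, feature_direction))

activation = [0.5, 1.0, -0.5]
bridge_direction = [1.0, 0.0, 0.0]  # hypothetical unit-length direction

steered = steer(activation, bridge_direction, strength=4.0)
before = feature_reading(activation, bridge_direction)
after = feature_reading(steered, bridge_direction)
print(before, after)
```

The reading along the chosen feature rises while the other components stay untouched, which is why steering is useful for testing what a feature actually does.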

Bottom line

It is the effort to read an AI’s wiring instead of just trusting its output: the difference between hoping a model behaves and showing why it does.

References

[1] Mechanistic interpretability. Wikipedia, en.wikipedia.org
[2] Mapping the Mind of a Large Language Model (Scaling Monosemanticity). Anthropic, www.anthropic.com
[3] Tracing the thoughts of a large language model. Anthropic, www.anthropic.com
[4] Anthropic can now track the bizarre inner workings of a large language model. MIT Technology Review, www.technologyreview.com
[5] Mechanistic Interpretability for AI Safety -- A Review. Leonard Bereska, Efstratios Gavves. arXiv, arxiv.org
