technicals

What is mechanistic interpretability?

June 1, 2026 · 5 min read

MECHANISTIC INTERPRETABILITYA microscope for the model's mind.Zoom in and the concepts behind an answer light up.THE MODELGolden Gate BridgeTexasflatteryANSWERthe replyLike a brain scan lighting up regions, it shows which concepts the model used to reach its answer.

Definition

Mechanistic interpretability is the field that reverse-engineers an AI’s internal wiring to find the specific concepts and reasoning steps behind its answers.

At a glance

How it works

Models are trained, not programmed, so even their builders cannot point to where an answer comes from. A ‘feature’ is an internal pattern for a concept (a bridge, a bug, flattery); a ‘circuit’ is the chain that reasons from ‘capital of the state with Dallas’ to ‘Texas’ to ‘Austin.’[3] A sparse autoencoder untangles these into readable features.[2]

Why it matters

Seeing internal concepts lets you check for bias or deception, debug failures systematically, and even steer behavior by adjusting features.[4] Still early research, but the clearest route to AI you can actually audit, as regulators and customers increasingly demand.[5]

Bottom line

It is the effort to read an AI’s wiring instead of just trusting its output, the difference between hoping a model behaves and showing why it does.

Connects to NeurosciencePhilosophy

References

  1. Mechanistic interpretability. Wikipedia en.wikipedia.org
  2. Mapping the Mind of a Large Language Model (Scaling Monosemanticity). Anthropic www.anthropic.com
  3. Tracing the thoughts of a large language model. Anthropic www.anthropic.com
  4. Anthropic can now track the bizarre inner workings of a large language model. MIT Technology Review www.technologyreview.com
  5. Mechanistic Interpretability for AI Safety -- A Review — Leonard Bereska, Efstratios Gavves. arXiv arxiv.org