
What is interpretability?

June 1, 2026 · 4 min read

[Figure: "An MRI for the model's mind." Don't just read what it says; scan which concepts light up inside. Example concepts shown: Golden Gate Bridge, deception, loan-risk. Each lit region is a concept the model is using, visible from the inside, not the output.]

Definition

Interpretability is the work of understanding how and why an AI model reaches its outputs by looking inside its internal workings.

Why it matters

When you hand decisions to AI, “the AI decided” won’t satisfy regulators, customers, or courts. Many credit and lending decisions legally require an explanation. Interpretability is what lets you answer “why did it do that?” It also lets you debug bad behavior, since you can’t fix reasoning you can’t inspect.

Interpretability vs. explainability

Explainability gives a human-readable reason (“denied mainly due to debt-to-income ratio”) without requiring anyone to understand the model’s internal math[5]. Interpretability goes deeper: it aims to understand how the model actually reaches its decisions. Explainability often suffices for day-to-day accountability; interpretability is what builds justified trust in complex systems, because it examines the reasoning itself rather than a summary of it.
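To make the contrast concrete, here is a minimal sketch of an explainability-style reason produced without looking inside the model. Everything in it is an assumption for illustration: `black_box_score` is a made-up stand-in for an opaque model, and the baseline applicant and feature names are hypothetical. The technique is a simple occlusion-style attribution (reset one input at a time to a baseline and watch the score move), which is one of several ways such reasons can be generated.

```python
from typing import Callable, Dict

Applicant = Dict[str, float]


def black_box_score(a: Applicant) -> float:
    """Hypothetical opaque model: a higher score means more likely to approve."""
    return (
        0.5
        + 0.004 * (a["credit_score"] - 650)
        - 1.5 * (a["debt_to_income"] - 0.30)
        + 0.02 * a["years_employed"]
    )


def explain(
    score_fn: Callable[[Applicant], float],
    applicant: Applicant,
    baseline: Applicant,
) -> Dict[str, float]:
    """Occlusion-style attribution: how much the score changes when a single
    input is reset to its baseline value. Only calls score_fn; never inspects it."""
    full = score_fn(applicant)
    effects = {}
    for name in applicant:
        probe = dict(applicant)
        probe[name] = baseline[name]
        effects[name] = full - score_fn(probe)
    return effects


# Hypothetical "typical applicant" used as the point of comparison.
baseline = {"credit_score": 650, "debt_to_income": 0.30, "years_employed": 5}
applicant = {"credit_score": 640, "debt_to_income": 0.55, "years_employed": 4}

effects = explain(black_box_score, applicant, baseline)
# The most negative effect is the input that hurt the score the most.
main_reason = min(effects, key=effects.get)
print(f"Denied mainly due to: {main_reason}")
```

The explanation only needed to call the model, not open it. That is enough for a reason code, but it says nothing about what the model computed internally to get there, which is the gap interpretability tries to close.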

How it works

Mechanistic interpretability treats a neural network like a program to be reverse-engineered[3]. In 2024, Anthropic used dictionary learning to find millions of internal “features” inside Claude, such as a Golden Gate Bridge concept, and showed it could turn them up or down to change the model’s behavior[4].
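The idea of reading features and turning them up or down can be sketched in a few lines. This is a toy under loud assumptions, not Anthropic's implementation: the three-entry `DICTIONARY` is hand-picked rather than learned, the directions are axis-aligned so each feature is trivially one coordinate, and the activation vector is invented. Real dictionary learning fits a very large set of directions to actual model activations.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical hand-picked dictionary: one unit direction per named feature.
# In real dictionary learning these directions are learned from data.
DICTIONARY: Dict[str, Vector] = {
    "golden_gate_bridge": [1.0, 0.0, 0.0],
    "deception": [0.0, 1.0, 0.0],
    "loan_risk": [0.0, 0.0, 1.0],
}


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def encode(activation: Vector) -> Dict[str, float]:
    """Feature strengths: projection onto each direction, floored at zero."""
    return {name: max(0.0, dot(activation, d)) for name, d in DICTIONARY.items()}


def decode(features: Dict[str, float]) -> Vector:
    """Rebuild an activation as a weighted sum of feature directions."""
    out = [0.0, 0.0, 0.0]
    for name, strength in features.items():
        for i, component in enumerate(DICTIONARY[name]):
            out[i] += strength * component
    return out


def steer(activation: Vector, feature: str, value: float) -> Vector:
    """Clamp one feature to a chosen strength, then rebuild the activation."""
    features = encode(activation)
    features[feature] = value
    return decode(features)


activation = [0.2, 0.0, 0.9]  # invented internal state
features = encode(activation)  # which concepts are "lit up"
steered = steer(activation, "golden_gate_bridge", 5.0)  # turn one concept up

print(features)
print(steered)
```

Reading `features` corresponds to the "scan" step: seeing which concepts are active. Writing the steered activation back into the model is the intervention step. In a real system that write-back happens inside the network's forward pass, which this sketch does not model.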

Bottom line

Interpretability is the difference between trusting AI because it sounds confident and trusting it because you can see why it decided.

Connects to: Neuroscience, Law

References

  1. What Is AI Interpretability? IBM. www.ibm.com
  2. The Urgency of Interpretability. Dario Amodei. www.darioamodei.com
  3. Mechanistic interpretability. Wikipedia. en.wikipedia.org
  4. Golden Gate Claude / Mapping the Mind of a Large Language Model. Anthropic. www.anthropic.com
  5. Interpretability vs. explainability in AI and machine learning. TechTarget. www.techtarget.com