
What is interpretability?

Published June 1, 2026 · 4 min read

[Figure: “An MRI for the model’s mind.” Don’t just read what it says; scan which concepts light up inside. Labeled regions: Golden Gate Bridge, deception, loan-risk. Each lit region is a concept the model is using, visible from the inside, not the output.]

Definition

Interpretability is the work of understanding how and why an AI model reaches its outputs by looking inside its internal workings.

At a glance

  • Modern AI is a black box: today’s systems are “grown” through training, so even their makers can’t say exactly why an output appeared[2].
  • Interpretability means understanding the internal mechanics; explainability just gives an after-the-fact reason[1].
  • For business it’s becoming a compliance and trust requirement—regulated decisions like lending often must be explainable[1].
  • The “MRI for AI” goal: scan a model for deception or hidden knowledge before deployment.

Why it matters

When you hand decisions to AI, “the AI decided” won’t satisfy regulators, customers, or courts. Many credit and lending decisions legally require an explanation; in the United States, for example, a lender that denies credit must tell the applicant the specific reasons. Interpretability is what lets you answer “why did it do that?” It also lets you debug bad behavior, since you can’t fix reasoning you can’t inspect.

Interpretability vs. explainability

Explainability gives a human-readable reason (“denied mainly due to debt-to-income ratio”) without requiring any understanding of the model’s internal math[5]. Interpretability goes deeper: it means actually understanding how the model computes its decisions. Explainability often suffices for day-to-day accountability; interpretability is what builds justified trust in complex systems.
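To make the “denied mainly due to debt-to-income ratio” example concrete, here is a minimal sketch of how such a reason could be produced. The weights, feature names, and the 0.5 threshold are all hypothetical, chosen only for illustration. Note that this toy model is linear, so the stated reason is also the true mechanism; for a large neural network the same kind of statement would only be an after-the-fact approximation, which is the gap interpretability tries to close.

```python
import math

# Hypothetical weights for a toy credit model; not taken from any real lender.
WEIGHTS = {"debt_to_income": -4.0, "years_employed": 0.3, "late_payments": -0.8}
BIAS = 1.0


def score(applicant):
    """Return the approval probability and each feature's contribution to the logit."""
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


def main_reason(contributions):
    """Pick the feature that pushed the score down the most."""
    return min(contributions, key=contributions.get)


applicant = {"debt_to_income": 0.6, "years_employed": 2, "late_payments": 1}
probability, contributions = score(applicant)
decision = "approved" if probability >= 0.5 else "denied"
reason = main_reason(contributions)
print(f"{decision}, mainly due to {reason}")
```

Because every contribution is a visible weight times an input, anyone can check the reason against the arithmetic. That check is exactly what is unavailable for a model whose internals cannot be read.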

How it works

Mechanistic interpretability treats a neural network as a program to be reverse-engineered[3]. In 2024 Anthropic used dictionary learning to find millions of internal “features” inside Claude 3 Sonnet, each corresponding to a concept such as the Golden Gate Bridge, and showed that turning a feature up or down changes the model’s behavior[4].
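The idea of reading features off an activation and turning one up can be sketched in a few lines of NumPy. This is a toy illustration, not Anthropic’s method: the dictionary here is a hand-written set of three orthonormal directions in a 4-dimensional space, whereas real dictionary learning trains a much larger dictionary on a model’s actual activations. The feature names, the `encode` and `steer` helpers, and all numbers are assumptions made for the example.

```python
import numpy as np

# Hypothetical feature dictionary: each row is one concept's direction
# in a 4-dimensional activation space (orthonormal here for simplicity).
feature_names = ["golden_gate_bridge", "deception", "loan_risk"]
dictionary = np.eye(4)[:3]


def encode(activation):
    """Read how strongly each feature is active (ReLU keeps strengths non-negative)."""
    return np.maximum(dictionary @ activation, 0.0)


def steer(activation, feature, new_strength):
    """Set one feature's strength by moving the activation along its direction."""
    idx = feature_names.index(feature)
    current = dictionary[idx] @ activation
    return activation + (new_strength - current) * dictionary[idx]


# A made-up activation vector standing in for one internal state of a model.
activation = np.array([0.2, 0.0, 1.5, 0.3])
codes = encode(activation)

# Turn the "golden_gate_bridge" feature up and read the features again.
steered = steer(activation, "golden_gate_bridge", 5.0)
steered_codes = encode(steered)

for name, before, after in zip(feature_names, codes, steered_codes):
    print(f"{name}: {before:.1f} -> {after:.1f}")
```

In this sketch only the chosen feature changes because the directions are orthogonal. With a learned, overcomplete dictionary the directions overlap, so steering one feature can shift others as well.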

Bottom line

Interpretability is the difference between trusting AI because it sounds confident and trusting it because you can see why it decided.

References

  1. What Is AI Interpretability? IBM. www.ibm.com
  2. The Urgency of Interpretability. Dario Amodei. www.darioamodei.com
  3. Mechanistic interpretability. Wikipedia. en.wikipedia.org
  4. Golden Gate Claude / Mapping the Mind of a Large Language Model. Anthropic. www.anthropic.com
  5. Interpretability vs. explainability in AI and machine learning. TechTarget. www.techtarget.com
