What is interpretability?

Published June 1, 2026 · 4 min read

Definition

Interpretability is the work of understanding how and why an AI model reaches its outputs by looking inside its internal workings.

At a glance

Modern AI is a black box: today’s systems are “grown” through training, so even their makers can’t say exactly why an output appeared^[2].
Interpretability means understanding the internal mechanics; explainability just gives an after-the-fact reason^[1].
For business it’s becoming a compliance and trust requirement—regulated decisions like lending often must be explainable^[1].
The “MRI for AI” goal: scan a model for problems such as deception or hidden knowledge before it is deployed^[2].

Why it matters

When you hand decisions to AI, “the AI decided” won’t satisfy regulators, customers, or courts. Many credit and lending decisions legally require an explanation. Interpretability is what lets you answer “why did it do that?”—and lets you debug bad behavior, since you can’t fix reasoning you can’t inspect.

Interpretability vs. explainability

Explainability gives a human-readable reason (“denied mainly due to debt-to-income ratio”) without requiring anyone to understand the model’s internal math^[5]. Interpretability goes deeper: it aims to understand the computation that actually produces the decision. Explainability is often enough for day-to-day accountability; interpretability is what lets you check that the stated reason matches what the model really did.
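
To make the explainability side concrete, here is a minimal Python sketch of one simple after-the-fact technique: swap each input for a baseline value, one at a time, and report the input whose change moves the score most. Everything in it is an assumption made for illustration: the scoring function, the feature names, and the baseline applicant are hypothetical, and this is not presented as the method any lender or regulator uses. The point is that the explainer only calls the model and never looks inside it.

```python
from typing import Callable, Dict, Tuple

Applicant = Dict[str, float]

def black_box_score(applicant: Applicant) -> float:
    """Hypothetical stand-in for an opaque credit model (higher is better)."""
    return (
        0.5 * applicant["credit_history_years"] / 10
        - 2.0 * applicant["debt_to_income"]
        + 0.3 * applicant["income"] / 100_000
    )

def main_reason(
    score_fn: Callable[[Applicant], float],
    applicant: Applicant,
    baseline: Applicant,
) -> Tuple[str, Dict[str, float]]:
    """Find the input whose replacement by its baseline value raises the score most."""
    original = score_fn(applicant)
    gains: Dict[str, float] = {}
    for name in applicant:
        probe = dict(applicant)        # copy so the original stays untouched
        probe[name] = baseline[name]   # change exactly one input
        gains[name] = score_fn(probe) - original
    return max(gains, key=gains.get), gains

# Made-up applicant and made-up "typical approved" baseline.
applicant = {"credit_history_years": 8, "debt_to_income": 0.55, "income": 60_000}
baseline = {"credit_history_years": 10, "debt_to_income": 0.30, "income": 70_000}

reason, gains = main_reason(black_box_score, applicant, baseline)
print(f"Main factor: {reason}")
```

The output is a reason a customer can read, but it says nothing about how the model computes its score internally. That gap is what interpretability tries to close.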

How it works

Mechanistic interpretability treats a neural network like a program to be reverse-engineered^[3]. In 2024 Anthropic used dictionary learning to extract millions of internal “features” from a Claude model, including one for the Golden Gate Bridge, and showed that turning a feature up or down changes the model’s behavior^[4].
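
The idea of reading a feature and turning it up can be sketched in a few lines of Python. This is a toy under loud assumptions, not Anthropic's implementation: the three feature directions are hand-picked rather than learned, the 4-number activation vector is made up, and the feature names are hypothetical. It shows only the last two steps, measuring how strongly each feature is present in an activation and nudging the activation along one feature's direction.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy dictionary: each feature is a direction in a 4-dimensional
# activation space. Real dictionaries are learned from a model's activations
# and are vastly larger; these are hand-picked for illustration.
FEATURES: Dict[str, Vector] = {
    "golden_gate_bridge": [1.0, 0.0, 0.0, 0.0],
    "python_code":        [0.0, 1.0, 0.0, 0.0],
    "sarcasm":            [0.0, 0.0, 1.0, 0.0],
}

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def read_features(activation: Vector) -> Dict[str, float]:
    """Project the activation onto each feature direction, clipping negatives to zero."""
    return {name: max(dot(direction, activation), 0.0)
            for name, direction in FEATURES.items()}

def steer(activation: Vector, feature: str, amount: float) -> Vector:
    """Return a new activation pushed along one feature's direction."""
    return [a + amount * d for a, d in zip(activation, FEATURES[feature])]

activation = [0.2, 1.5, -0.4, 0.3]  # made-up activation vector
before = read_features(activation)
after = read_features(steer(activation, "golden_gate_bridge", 5.0))

print("before:", before)
print("after: ", after)
```

In this toy, steering raises only the chosen feature because the directions do not overlap. In a real model, features overlap and the edited activation flows through later layers, which is how turning a feature up ends up changing what the model says.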

Bottom line

Interpretability is the difference between trusting AI because it sounds confident and trusting it because you can see why it decided.

References

[1] What Is AI Interpretability? IBM. www.ibm.com
[2] The Urgency of Interpretability. Dario Amodei. www.darioamodei.com
[3] Mechanistic interpretability. Wikipedia. en.wikipedia.org
[4] Golden Gate Claude / Mapping the Mind of a Large Language Model. Anthropic. www.anthropic.com
[5] Interpretability vs. explainability in AI and machine learning. TechTarget. www.techtarget.com
