Definition
Interpretability is the work of understanding how and why an AI model reaches its outputs by examining its internal workings.
At a glance
- Modern AI is a black box: today’s systems are “grown” through training, so even their makers can’t say exactly why an output appeared[2].
- Interpretability means understanding the internal mechanics; explainability just gives an after-the-fact reason[1].
- For business it’s becoming a compliance and trust requirement—regulated decisions like lending often must be explainable[1].
- The “MRI for AI” goal: scan a model for deception or hidden knowledge before deployment[2].
Why it matters
When you hand decisions to AI, “the AI decided” won’t satisfy regulators, customers, or courts. Many credit and lending decisions legally require an explanation. Interpretability is what lets you answer “why did it do that?”—and lets you debug bad behavior, since you can’t fix reasoning you can’t inspect.
Interpretability vs. explainability
Explainability gives a human-readable reason (“denied mainly due to debt-to-income ratio”) without requiring anyone to understand the model’s internal computation[5]. Interpretability goes deeper: it aims to understand how the model actually reaches its decisions. Explainability often suffices for day-to-day accountability; interpretability is what supports trust in complex systems, because the stated reason can be checked against what the model actually computed.
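To make the distinction concrete, here is a minimal sketch of an after-the-fact explanation. It treats the model as a sealed box and only asks how the score moves when each input is swapped for a baseline value. Everything in it is an assumption made for illustration: the function black_box_score, its coefficients, the input names, and the baseline applicant are all invented, and real explainability tools are more sophisticated than this.

```python
def black_box_score(applicant):
    """Hypothetical stand-in for an opaque model; explain() never reads this body."""
    return (0.5
            - 0.8 * applicant["debt_to_income"]
            + 0.3 * applicant["years_employed"] / 10
            + 0.2 * applicant["credit_history_years"] / 20)


def explain(model, applicant, baseline):
    """Return how much each input moves the score relative to a baseline value."""
    full = model(applicant)
    effects = {}
    for name in applicant:
        probe = dict(applicant)        # copy, then replace a single input
        probe[name] = baseline[name]
        effects[name] = full - model(probe)
    return effects


# Invented example data: one applicant and a "typical" baseline applicant.
applicant = {"debt_to_income": 0.6, "years_employed": 2, "credit_history_years": 4}
baseline = {"debt_to_income": 0.3, "years_employed": 5, "credit_history_years": 10}

effects = explain(black_box_score, applicant, baseline)
main_reason = min(effects, key=effects.get)   # the input that lowers the score most
print(f"denied mainly due to {main_reason}")
```

The explanation comes entirely from inputs and outputs. It says which input mattered most for this applicant, but nothing about how the model computes its score, which is the gap interpretability tries to close.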
How it works
Mechanistic interpretability treats a neural network like a program to be reverse-engineered[3]. In 2024, Anthropic used dictionary learning to find millions of internal “features” inside Claude 3 Sonnet, such as a Golden Gate Bridge concept, and showed that turning a feature up or down changes the model’s behavior[4].
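The idea of reading a feature and turning it up can be sketched in a few lines. This is a toy under loud assumptions, not Anthropic's method: the dictionary here is written by hand rather than learned, it has three perpendicular directions in a three-number activation instead of millions of features in a large model, and the names python_code and sarcasm are invented placeholders. Only the Golden Gate Bridge concept comes from the work described above.

```python
# Hypothetical toy dictionary: each feature is a unit-length direction in a
# 3-dimensional activation space. A real dictionary is learned from model
# activations and holds far more features than there are dimensions.
DICTIONARY = {
    "golden_gate_bridge": [1.0, 0.0, 0.0],
    "python_code": [0.0, 1.0, 0.0],   # invented placeholder feature
    "sarcasm": [0.0, 0.0, 1.0],       # invented placeholder feature
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def read_features(activation):
    """Measure how strongly each feature is present by projecting onto its direction."""
    return {name: dot(activation, d) for name, d in DICTIONARY.items()}


def steer(activation, feature, amount):
    """Return a new activation with `amount` of one feature direction added."""
    direction = DICTIONARY[feature]
    return [a + amount * d for a, d in zip(activation, direction)]


activation = [0.2, 0.9, 0.1]   # made-up internal state of the model
before = read_features(activation)
after = read_features(steer(activation, "golden_gate_bridge", 5.0))
print("before:", before)
print("after: ", after)
```

Because the toy directions are perpendicular, steering one feature leaves the others untouched. In a learned, overcomplete dictionary the directions overlap, so turning one feature up can shift others as well.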
Bottom line
Interpretability is the difference between trusting AI because it sounds confident and trusting it because you can see why it decided.
References
- [1] What Is AI Interpretability? IBM. www.ibm.com
- [2] The Urgency of Interpretability. Dario Amodei. www.darioamodei.com
- [3] Mechanistic interpretability. Wikipedia. en.wikipedia.org
- [4] Golden Gate Claude / Mapping the Mind of a Large Language Model. Anthropic. www.anthropic.com
- [5] Interpretability vs. explainability in AI and machine learning. TechTarget. www.techtarget.com