What is distillation?

Published June 1, 2026 · 4 min read

Definition

Distillation (short for knowledge distillation) trains a smaller AI model to reproduce the outputs of a larger one, so it does similar work at lower cost and higher speed.

At a glance

A smaller “student” model is trained to imitate the outputs of a big “teacher” model^[1].
The student keeps most of the teacher’s quality while running faster and costing less.
DistilBERT: 40% smaller and 60% faster than BERT, while retaining about 97% of its language-understanding performance^[4].
Named and popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015; now a standard technique^[3].
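
To make "imitate" concrete, here is a minimal sketch of one common way to score how closely a student matches its teacher on a single example. It assumes both models produce raw scores (logits) over the same set of answers, softens both with a temperature, and measures the gap with KL divergence scaled by the temperature squared, in the spirit of the soft-target approach in ^[3]. The function names, the example logits, and the temperature of 2.0 are illustrative assumptions, not taken from any particular library or paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the softened
    student distribution, scaled by temperature squared (illustrative choice)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return kl * temperature ** 2

# Hypothetical logits over three possible answers.
teacher = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.1]  # roughly agrees with the teacher
far_student = [0.1, 3.0, 1.0]    # prefers a different answer

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

Training would adjust the student's weights to push this loss down across many examples; the student that roughly agrees with the teacher gets the lower loss of the two here. Real training setups typically combine a term like this with an ordinary loss on the true labels.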

Why it matters

Big models need expensive hardware to run, and providers bill for every request. A distilled model does similar work at lower cost and with less delay, and the smallest ones can run on a laptop. The tradeoff is usually a small quality drop on the hardest tasks.

Where you see it

Vendors often sell distilled “mini,” “lite,” or “flash” versions of their top models; DeepSeek built competitive models this way^[2]. A cheaper provider tier often, though not always, means a distilled model.

Bottom line

Distillation gives you most of a big model’s quality at a small model’s price.

References

[1] What is knowledge distillation? IBM. www.ibm.com
[2] How Distillation Makes AI Models Smaller and Cheaper. Quanta Magazine. www.quantamagazine.org
[3] Distilling the Knowledge in a Neural Network. Geoffrey Hinton, Oriol Vinyals, Jeff Dean. arXiv. arxiv.org
[4] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. arXiv. arxiv.org
