Sapiens
Technicals

What is distillation?

Published June 1, 2026 · 4 min read

[Figure: Distillation. The master teaches the apprentice, then the apprentice cooks nearly every meal. The master (big, slow, costly) passes the recipe to the apprentice (small, fast, cheap), who serves the customers. A small model learns to mimic a big one: most of the quality, a fraction of the cost.]

Definition

Distillation trains a smaller, cheaper AI model to copy a larger one, so it does similar work at lower cost and higher speed.

At a glance

  • A big “teacher” model trains a smaller “student” to imitate its answers[1].
  • The student keeps most of the quality at far lower cost and higher speed.
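  • DistilBERT is 40% smaller and 60% faster than BERT while retaining about 97% of its language-understanding performance[4].
  • Named and formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in a 2015 paper[3]; now standard practice.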
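
For readers who want to see what "imitate its answers" means in practice, here is a minimal sketch of a distillation loss in plain Python, in the spirit of the soft-target approach in [3]. It is an illustration under assumptions, not any vendor's training code: the function names, the example scores, and the `alpha` weighting between the two terms are all hypothetical choices made for this sketch. The idea is that the student is scored on how closely its softened probabilities match the teacher's, plus how well it predicts the correct answer.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far distribution q is from distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft (teacher-matching) term and a hard (true-label) term.

    temperature and alpha are illustrative defaults, not recommended values.
    """
    # Soft term: match the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Scaling by temperature squared keeps the soft term's gradients
    # comparable in size as the temperature changes.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard term: ordinary cross-entropy against the correct label.
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard

# Made-up scores for a three-way choice; class 0 is the correct answer.
teacher = [4.0, 1.0, 0.2]
good_student = [3.5, 1.2, 0.1]  # roughly agrees with the teacher
poor_student = [0.1, 3.0, 0.5]  # disagrees with the teacher

loss_good = distillation_loss(good_student, teacher, true_label=0)
loss_poor = distillation_loss(poor_student, teacher, true_label=0)
print(loss_good < loss_poor)
```

Training repeatedly nudges the student's weights to lower this loss, so a student that agrees with the teacher ends up with a smaller loss than one that does not. Real systems compute the same quantity over large batches with a deep-learning framework rather than Python lists.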

Why it matters

Big models need costly servers, and providers typically bill for every request. A distilled model does similar work cheaper and faster, and some are small enough to run on a laptop. The tradeoff is a small drop in quality, most noticeable on the hardest tasks.

Where you see it

Vendors sell distilled “mini,” “lite,” or “flash” versions of top models; DeepSeek built competitive models this way[2]. A cheaper provider tier often, though not always, points to a distilled model.

Bottom line

Distillation gives you most of a big model’s quality at a small model’s price.

References

  1. What is knowledge distillation? IBM. www.ibm.com
  2. How Distillation Makes AI Models Smaller and Cheaper. Quanta Magazine. www.quantamagazine.org
  3. Distilling the Knowledge in a Neural Network. Geoffrey Hinton, Oriol Vinyals, Jeff Dean. arXiv. arxiv.org
  4. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. arXiv. arxiv.org
