Definition
Distillation trains a smaller, cheaper AI model to copy a larger one, so it does similar work at lower cost and higher speed.
At a glance
- A big “teacher” model trains a smaller “student” to imitate its answers[1].
- The student keeps most of the quality at far lower cost and higher speed.
- DistilBERT: 40% smaller, 60% faster, ~97% of its teacher’s ability[4].
- Introduced by Geoffrey Hinton’s team in 2015; now standard[3].
Why it matters
Big models need costly servers and charge per request. A distilled model does similar work cheaper and faster, even on a laptop. The tradeoff: a small quality drop on the hardest tasks.
Where you see it
Vendors sell distilled “mini,” “lite,” or “flash” versions of top models; DeepSeek built competitive models this way[2]. A cheaper provider tier usually means a distilled model.
Bottom line
Distillation gives you most of a big model’s quality at a small model’s price.