Definition

Distillation trains a smaller, cheaper AI model to copy a larger one, so it does similar work at lower cost and higher speed.

At a glance

A big “teacher” model trains a smaller “student” to imitate its answers^[1].
The student keeps most of the quality at far lower cost and higher speed.
DistilBERT: 40% smaller, 60% faster, ~97% of its teacher’s ability^[4].
Introduced by Geoffrey Hinton’s team in 2015; now standard^[3].

Why it matters

Big models need costly servers and charge per request. A distilled model does similar work cheaper and faster, even on a laptop. The tradeoff: a small quality drop on the hardest tasks.

Where you see it

Vendors sell distilled “mini,” “lite,” or “flash” versions of top models; DeepSeek built competitive models this way^[2]. A cheaper provider tier usually means a distilled model.

Bottom line

Distillation gives you most of a big model’s quality at a small model’s price.

What is distillation?

At a glance

Why it matters

Where you see it

Bottom line

References