Definition
Model parallelism splits one large AI model into pieces spread across several chips, so it can run even when it is too big to fit on any single chip.
At a glance
- The biggest AI models won’t fit in one chip’s memory, so the model itself is divided across several chips that work together[2].
- Data parallelism (the simpler cousin) copies the whole model onto each chip; model parallelism splits the model when no chip can hold it[3].
- Two common splits: by layer (pipeline, like an assembly line) or within a layer (tensor, slicing one calculation across chips)[4].
- The cost is coordination: chips constantly pass results to each other, so slow or low-bandwidth links between them slow everything down.
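The "won't fit" point can be made concrete with rough arithmetic. The sketch below is illustrative only: the parameter count, bytes per parameter, and chip memory size are hypothetical numbers chosen for the example, not figures from the sources cited here, and the function names are made up for this illustration.

```python
import math

def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to store the model's weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_chips_for_weights(num_params, bytes_per_param, chip_memory_gb):
    """Smallest number of chips whose combined memory can hold the weights."""
    needed_gb = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed_gb / chip_memory_gb)

# Hypothetical example: 70 billion parameters stored at 2 bytes each,
# on chips assumed to have 80 GB of memory apiece.
params = 70_000_000_000
needed = weight_memory_gb(params, bytes_per_param=2)
chips = min_chips_for_weights(params, bytes_per_param=2, chip_memory_gb=80)
print(f"weights need about {needed:.0f} GB, so at least {chips} chips")
```

This counts the weights alone. Training also has to hold gradients, optimizer state, and intermediate results, so the real chip count is usually higher than this lower bound.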
How it works
Pipeline parallelism divides the model by layers, like factory stations: chip one runs the first layers, then hands its output to chip two, which runs the next[1]. Tensor parallelism instead slices one heavy calculation sideways so several chips compute pieces of it at once, then combine the results. Big setups often mix both.
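The two splits described above can be sketched as a toy example. This is a minimal illustration in plain Python, not how any real framework does it: the "chips" are ordinary functions and list slices, the tiny two-layer model and its weights are invented for the example, and all names are hypothetical. It shows only that both splits reproduce the single-chip answer.

```python
def matvec(weights, x):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(values):
    """Replace negative numbers with zero."""
    return [max(0.0, v) for v in values]

# Hypothetical two-layer model: 3 inputs -> 4 hidden units -> 2 outputs.
W1 = [[1.0, 0.0, -1.0],
      [0.5, 0.5, 0.5],
      [-1.0, 2.0, 0.0],
      [0.0, 1.0, 1.0]]
W2 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 2.0, -0.5, 1.0]]

def single_chip(x):
    """Reference: the whole model runs in one place."""
    return matvec(W2, relu(matvec(W1, x)))

# Pipeline parallelism: each stage owns whole layers and hands off its output.
def stage_one(x):   # would live on chip one
    return relu(matvec(W1, x))

def stage_two(h):   # would live on chip two
    return matvec(W2, h)

def pipeline(x):
    return stage_two(stage_one(x))

# Tensor parallelism: one layer's weight rows are divided among the chips.
def tensor_parallel_layer(weights, x, num_chips):
    rows_per_chip = (len(weights) + num_chips - 1) // num_chips
    shards = [weights[i:i + rows_per_chip]
              for i in range(0, len(weights), rows_per_chip)]
    partials = [matvec(shard, x) for shard in shards]  # one piece per chip
    return [v for part in partials for v in part]      # combine the pieces

def tensor_parallel(x, num_chips=2):
    h = relu(tensor_parallel_layer(W1, x, num_chips))
    return tensor_parallel_layer(W2, h, num_chips)

x = [1.0, 2.0, 3.0]
reference = single_chip(x)
assert pipeline(x) == reference
assert tensor_parallel(x) == reference
```

In a real cluster, the hand-off between stages and the combine step are network transfers between chips, which is where the communication cost shows up.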
What it means for a business
Running or training a frontier model isn’t a one-computer purchase but a tightly wired cluster of chips. You gain access to far more capable models; the trade-off is added complexity and communication overhead.
Bottom line
When a model outgrows a single chip, model parallelism splits it across many. That is the quiet reason frontier AI demands clusters, not laptops.