Kevin Champlin

Mixture of Experts (MoE)

Also called: mixture of experts, MoE, sparse model

A Mixture of Experts (MoE) model splits its parameters across many "expert" sub-networks, but only a small subset of them activates for each token. This lets you build huge models (trillions of parameters in total) whose per-token compute is comparable to that of a much smaller dense model.

A dense transformer activates every parameter on every forward pass. A 70B-parameter dense model does 70B parameters' worth of work to generate one token. That is expensive at scale.

MoE rearranges the math. The feed-forward layer in a transformer block is split into N separate "expert" networks (often 8 to 64, though some recent models use hundreds). A small "router" learns, for each token, which few experts (classically 1 or 2) to send it to. The other experts sit idle for that token. So the model has, say, 200B total parameters but only activates ~30B for any given token. You get the capacity without paying the compute cost for all of it on every token.
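The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any particular model implements it: the router is a single dot product per expert, the experts are stand-in functions that just scale the token, and the names `route_top_k`, `moe_layer`, and `make_expert` are invented for this example.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_weights, experts, k=2):
    # Router: one logit per expert (assumed here to be a plain dot product).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    # Only the chosen experts run; the others are never called for this token.
    output = [0.0] * len(token)
    for idx, weight in chosen:
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, [idx for idx, _ in chosen]

def make_expert(scale):
    # Hypothetical stand-in for a feed-forward expert network.
    return lambda token: [scale * x for x in token]

random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(i + 1) for i in range(num_experts)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, used = moe_layer(token, router_weights, experts, k=2)
print(f"experts used: {used} of {num_experts}")
```

With `k=2` and 8 experts, six of the eight expert functions are never evaluated for this token, which is where the compute saving comes from.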

The savings are real: GPT-5, Gemini 3 Pro, and most leading frontier models in 2026 are believed to be MoE. When parameter counts are reported, "active parameters" is the figure that tracks per-token cost; the total is a much larger number that is misleading on its own.

The trade-offs:

  • Memory: total parameters still need to live in GPU memory, even though only a fraction activate per token. MoE saves compute, not memory. Bigger MoE models need more GPU memory, usually spread across several GPUs.
  • Routing instability: the router can collapse onto a few experts (load imbalance) or thrash across experts unpredictably. Training tricks like load-balancing losses help keep it stable.
  • Specialization: experts can drift toward narrow specialties (one becomes the "code expert," another the "math expert"). Sometimes useful, sometimes wasteful.
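To make the load-balancing idea concrete, here is a minimal sketch of an auxiliary loss in the style of the one used in the Switch Transformer: the number of experts times the sum, over experts, of (fraction of tokens sent to that expert) times (mean router probability for that expert). The function name and the toy inputs are assumptions for illustration, and top-1 routing is assumed.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """router_probs: per-token list of router probabilities over experts.
    assignments: per-token index of the expert each token was sent to (top-1)."""
    num_tokens = len(assignments)
    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / num_tokens
    # p[i]: mean router probability given to expert i across tokens.
    p = [sum(tok[i] for tok in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced toy batch: 4 tokens spread evenly over 4 experts.
balanced = load_balance_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Collapsed toy batch: every token goes to expert 0.
collapsed = load_balance_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)

print(f"balanced: {balanced:.2f}, collapsed: {collapsed:.2f}")
```

The loss is lowest when tokens are spread evenly and grows as routing collapses onto one expert. In training, a term like this is typically scaled by a small coefficient and added to the main loss.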

MoE is part of why parameter-count comparisons across labs are misleading. Saying "Gemini Pro has 1T parameters" and "Llama 405B has 405B parameters" is not comparing the same thing if Gemini is MoE and Llama is dense. The active parameter count is the closer apples-to-apples number for per-token compute, and labs do not always publish it.
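The gap between total and active parameters is simple arithmetic. The sketch below uses a made-up configuration (the layer count, expert sizes, and shared-parameter figure are all hypothetical, not taken from any published model) and assumes 2 bytes per parameter for the weight-memory estimate.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts,
                     experts_per_token, num_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared_params covers everything that runs for every token
    (attention, embeddings, router), lumped together for simplicity.
    """
    total = shared_params + num_layers * num_experts * params_per_expert
    active = shared_params + num_layers * experts_per_token * params_per_expert
    return total, active

def weight_memory_gb(param_count, bytes_per_param=2):
    # Memory needed just to hold the weights, ignoring activations and caches.
    return param_count * bytes_per_param / 1e9

# Hypothetical configuration, chosen only to give round numbers.
total, active = moe_param_counts(
    shared_params=8_000_000_000,
    params_per_expert=100_000_000,
    num_experts=32,
    experts_per_token=2,
    num_layers=60,
)

print(f"total:  {total / 1e9:.0f}B parameters")   # 200B
print(f"active: {active / 1e9:.0f}B parameters")  # 20B
print(f"weights in memory: {weight_memory_gb(total):.0f} GB")
```

In this toy configuration the model does roughly the per-token work of a 20B dense model, but all 200B parameters still have to be held in memory, which is the compute-versus-memory trade-off described above.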
