Mixture of experts (MoE) is a model architecture in which the single feed-forward network (FFN) in each transformer layer is replaced by several smaller FFNs called experts. For any given input token, only a fixed subset (the top-k) of the experts is actually used. A lightweight router (a set of learned routing weights) decides which experts should process each token, and the outputs of the active experts are recombined as a weighted sum before being passed to the next layer of the model.
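The routing step can be sketched as follows. This is a minimal single-token illustration, not any particular model's implementation; the dimensions, the one-hidden-layer expert FFNs, and the renormalization of the top-k routing weights are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Hypothetical parameters: a linear router plus one small FFN per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_w1 = rng.normal(size=(n_experts, d_model, 16))
expert_w2 = rng.normal(size=(n_experts, 16, d_model))

def moe_forward(x):
    """Route one token x (shape [d_model]) through its top-k experts."""
    logits = x @ router_w                       # one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax routing weights
    chosen = np.argsort(probs)[-top_k:]         # indices of the top-k experts
    gate = probs[chosen] / probs[chosen].sum()  # renormalize over chosen experts
    out = np.zeros(d_model)
    for g, e in zip(gate, chosen):
        h = np.maximum(x @ expert_w1[e], 0.0)   # expert FFN: linear, ReLU, linear
        out += g * (h @ expert_w2[e])           # weighted combination of experts
    return out

y = moe_forward(rng.normal(size=d_model))
```

Note that only `top_k` of the `n_experts` FFNs run for this token, which is why MoE models can have far more parameters than they use per forward pass.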

In practice, the expert router can treat experts in different ways:

  • Shared experts get every token.
  • Routed experts get only a fraction of the tokens.
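The distinction above can be shown in a toy layer. This sketch assumes one shared expert, top-1 routing among the routed experts, and single-matrix "experts" purely for brevity; real models use full FFNs and larger top-k.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_routed = 4, 3

def expert(w, x):
    return np.maximum(x @ w, 0.0)  # toy one-matrix "expert"

shared_w = rng.normal(size=(d_model, d_model))            # shared: sees every token
routed_w = rng.normal(size=(n_routed, d_model, d_model))  # routed: see a fraction
router_w = rng.normal(size=(d_model, n_routed))

def layer(x):
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    e = int(np.argmax(probs))  # top-1: this token's single routed expert
    # Shared expert output is always added; routed output is gated by the router.
    return expert(shared_w, x) + probs[e] * expert(routed_w[e], x)

y = layer(rng.normal(size=d_model))
```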

MoE models introduce a new form of parallelism: expert parallelism, in which experts are distributed across devices and each token is sent to whichever device holds its assigned experts.
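The dispatch side of expert parallelism can be sketched as a bucketing step. This is an assumed toy setup (round-robin expert placement, top-1 routing, plain Python lists standing in for the all-to-all communication that real frameworks perform):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_experts, n_devices = 6, 4, 2

# Hypothetical layout: experts sharded round-robin across devices.
expert_to_device = {e: e % n_devices for e in range(n_experts)}

# Pretend the router already picked one expert per token (top-1 for simplicity).
assignments = rng.integers(0, n_experts, size=n_tokens)

# Dispatch: bucket each token index by the device that owns its expert.
# In a real system this bucketing drives an all-to-all exchange between devices.
buckets = {d: [] for d in range(n_devices)}
for t, e in enumerate(assignments):
    buckets[expert_to_device[e]].append(t)
```

After each device runs its local experts on the tokens it received, a second exchange returns the outputs to the tokens' original positions.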