Mixture of experts (MoE) is a model architecture in which the typical feed-forward network (FFN) in each transformer layer is replaced by multiple smaller FFNs called experts. For any given input token, only a fixed-size subset (the top-k) of the experts is actually used. A lightweight router (a set of learned routing weights) decides which experts should process each token, and the outputs of the active experts are combined in a weighted sum before being passed to the next layer of the model.
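A minimal sketch of this routing-and-combining step, using NumPy and a single matrix multiply to stand in for each expert FFN (the function and variable names here are illustrative, not from any specific implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, expert_ws, k=2):
    """Route each token to its top-k experts and combine their outputs.

    tokens:    (n_tokens, d_model)
    router_w:  (d_model, n_experts) learned routing weights
    expert_ws: list of (d_model, d_model) matrices, one per expert
               (real experts are small FFNs; one matmul stands in here)
    """
    probs = softmax(tokens @ router_w)              # (n_tokens, n_experts)
    topk = np.argsort(probs, axis=-1)[:, -k:]       # indices of top-k experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        weights = probs[t, topk[t]]
        weights = weights / weights.sum()           # renormalize over the top-k
        for w, e in zip(weights, topk[t]):
            out[t] += w * (tokens[t] @ expert_ws[e])
    return out, topk
```

Because only k of the experts run per token, the compute cost scales with k rather than with the total number of experts, even though all expert parameters are held in memory.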
In practice, the expert router can treat experts in different ways:
- Shared experts process every token.
- Routed experts process only the tokens the router assigns to them.
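The two kinds of experts above can be sketched as a dispatch plan: shared experts receive every token index, while each routed expert receives only the tokens whose top-k router scores select it (the layout, with shared experts numbered first, is an arbitrary choice for illustration):

```python
import numpy as np

def dispatch(router_logits, n_shared, k=1):
    """Map each expert to the indices of the tokens it will process.

    Experts 0..n_shared-1 are shared (every token); the remaining
    experts are routed (only tokens whose top-k scores pick them).
    """
    n_tokens, n_routed = router_logits.shape
    assignment = {e: list(range(n_tokens)) for e in range(n_shared)}
    topk = np.argsort(router_logits, axis=-1)[:, -k:]
    for e in range(n_routed):
        assignment[n_shared + e] = [t for t in range(n_tokens) if e in topk[t]]
    return assignment
```

Note the consequence for load: the shared experts see a fixed, predictable amount of work, while the per-expert load of the routed experts depends on what the router learns.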
MoE models introduce a new form of parallelism: expert parallelism, in which the experts of each layer are sharded across devices and tokens are exchanged between devices so that each token reaches the device holding its assigned expert.
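A toy sketch of the communication plan this implies, grouping tokens by the device that hosts their assigned expert (the placement rule, expert `e` on device `e % n_devices`, is an arbitrary assumption for illustration):

```python
def plan_all_to_all(token_experts, n_devices):
    """Group tokens by the device hosting their assigned expert.

    token_experts: list where entry t is the expert id chosen for token t.
    Returns {device: [(token_index, expert_id), ...]}, i.e. the send lists
    for the all-to-all exchange; results travel the reverse route afterwards.
    """
    per_device = {d: [] for d in range(n_devices)}
    for tok, e in enumerate(token_experts):
        per_device[e % n_devices].append((tok, e))  # assumed placement rule
    return per_device
```

In a real system this exchange is an all-to-all collective over the expert-parallel group, and its cost is one reason routers are often trained with load-balancing objectives.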