Mixture of experts (MoE) is a model architecture in which the typical feed-forward network (FFN) in each transformer layer is replaced by multiple smaller FFNs called experts. For any given input token, only a fixed-size subset (the top-k) of the experts is actually used. A lightweight router (a set of learned routing weights) decides which experts should process each token, and the outputs of the active experts are combined in a weighted sum before being passed to the next layer of the model.
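A minimal sketch of this routing-and-combining step, using NumPy and a single matrix multiply to stand in for each expert FFN (the function and variable names here are illustrative, not from any specific implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, expert_ws, k=2):
    """Route each token to its top-k experts and combine their outputs.

    tokens:    (n_tokens, d_model)
    router_w:  (d_model, n_experts) learned routing weights
    expert_ws: list of (d_model, d_model) matrices, one per expert
               (real experts are small FFNs; one matmul stands in here)
    """
    probs = softmax(tokens @ router_w)              # (n_tokens, n_experts)
    topk = np.argsort(probs, axis=-1)[:, -k:]       # indices of top-k experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        weights = probs[t, topk[t]]
        weights = weights / weights.sum()           # renormalize over the top-k
        for w, e in zip(weights, topk[t]):
            out[t] += w * (tokens[t] @ expert_ws[e])
    return out, topk
```

Because only k of the experts run per token, the compute cost scales with k rather than with the total number of experts, even though all expert parameters are held in memory.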
In practice, the expert router can treat experts in different ways:
- Shared experts process every token.
- Routed experts process only the tokens the router assigns to them.
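The two kinds of experts above can be sketched as a dispatch plan: shared experts receive every token index, while each routed expert receives only the tokens whose top-k router scores select it (the layout, with shared experts numbered first, is an arbitrary choice for illustration):

```python
import numpy as np

def dispatch(router_logits, n_shared, k=1):
    """Map each expert to the indices of the tokens it will process.

    Experts 0..n_shared-1 are shared (every token); the remaining
    experts are routed (only tokens whose top-k scores pick them).
    """
    n_tokens, n_routed = router_logits.shape
    assignment = {e: list(range(n_tokens)) for e in range(n_shared)}
    topk = np.argsort(router_logits, axis=-1)[:, -k:]
    for e in range(n_routed):
        assignment[n_shared + e] = [t for t in range(n_tokens) if e in topk[t]]
    return assignment
```

Note the consequence for load: the shared experts see a fixed, predictable amount of work, while the per-expert load of the routed experts depends on what the router learns.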
MoE models introduce a new form of parallelism: expert parallelism, in which the experts of each layer are sharded across devices and tokens are exchanged between devices so that each token reaches the device holding its assigned expert.
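A toy sketch of the communication plan this implies, grouping tokens by the device that hosts their assigned expert (the placement rule, expert `e` on device `e % n_devices`, is an arbitrary assumption for illustration):

```python
def plan_all_to_all(token_experts, n_devices):
    """Group tokens by the device hosting their assigned expert.

    token_experts: list where entry t is the expert id chosen for token t.
    Returns {device: [(token_index, expert_id), ...]}, i.e. the send lists
    for the all-to-all exchange; results travel the reverse route afterwards.
    """
    per_device = {d: [] for d in range(n_devices)}
    for tok, e in enumerate(token_experts):
        per_device[e % n_devices].append((tok, e))  # assumed placement rule
    return per_device
```

In a real system this exchange is an all-to-all collective over the expert-parallel group, and its cost is one reason routers are often trained with load-balancing objectives.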