sequence parallelism

Sequence parallelism can mean one of two things:

NVIDIA’s definition is narrowly scoped to an optimization within tensor parallelism that shards activations along the sequence dimension between tensor-parallel regions.¹
DeepSpeed’s definition is a form of context parallelism whereby the input sequence is sharded across model replicas.² However, unlike attention > Ring attention, DeepSpeed Ulysses performs a transposition that converts parallelism along the sequence dimension into parallelism across heads. This allows each head to compute its attention without any communication.

Glenn's Digital Garden