Sequence parallelism can mean one of two things:
- NVIDIA’s definition is narrowly scoped to an optimization within tensor parallelism that shards activations along the sequence dimension between tensor-parallel regions.1
- DeepSpeed’s definition is a form of context parallelism whereby the input sequence is sharded across model replicas.2 However, unlike Ring Attention, DeepSpeed Ulysses performs an all-to-all transposition that converts parallelism along the sequence dimension into parallelism across heads. Each device then holds the full sequence for its own subset of heads, so each head can compute its attention without any further communication.
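
The sequence-to-head transposition in the second definition can be illustrated with a small single-process simulation. This is only a sketch, not DeepSpeed's implementation: the "devices" are entries in a Python list, the all-to-all is emulated with concatenate and split, attention is non-causal, and the helper names (`seq_to_head_shards`, `head_to_seq_shards`, `ulysses_attention`) are hypothetical. It assumes the number of heads and the sequence length are both divisible by the number of devices.

```python
import numpy as np

def attention(q, k, v):
    """Plain (non-causal) softmax attention; q, k, v are [seq, heads, dim]."""
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

def seq_to_head_shards(shards):
    """Emulated all-to-all: P shards of [N/P, H, d] -> P shards of [N, H/P, d]."""
    full = np.concatenate(shards, axis=0)      # gather the sequence dimension
    return np.split(full, len(shards), axis=1)  # scatter the head dimension

def head_to_seq_shards(shards):
    """Inverse all-to-all: P shards of [N, H/P, d] -> P shards of [N/P, H, d]."""
    full = np.concatenate(shards, axis=1)
    return np.split(full, len(shards), axis=0)

def ulysses_attention(q_shards, k_shards, v_shards):
    # 1. Transpose from sequence-sharded to head-sharded layout.
    q_h = seq_to_head_shards(q_shards)
    k_h = seq_to_head_shards(k_shards)
    v_h = seq_to_head_shards(v_shards)
    # 2. Each "device" sees the full sequence for its heads: purely local attention.
    out_h = [attention(q, k, v) for q, k, v in zip(q_h, k_h, v_h)]
    # 3. Transpose back to the sequence-sharded layout.
    return head_to_seq_shards(out_h)

rng = np.random.default_rng(0)
P, N, H, D = 4, 16, 8, 5  # devices, sequence length, heads, head dim
q, k, v = (rng.standard_normal((N, H, D)) for _ in range(3))

out_shards = ulysses_attention(*(np.split(x, P, axis=0) for x in (q, k, v)))
reference = attention(q, k, v)
matches = np.allclose(np.concatenate(out_shards, axis=0), reference)
```

Because attention is independent across heads, step 2 needs no communication; the only exchanges are the transpositions before and after it. In a real multi-device setting those would be collective all-to-all operations rather than the concatenate-and-split used here.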