Sequence parallelism can mean one of two things:

  1. NVIDIA’s definition is narrowly scoped to an optimization within tensor parallelism that shards activations along the sequence dimension between tensor-parallel regions.1
  2. DeepSpeed’s definition is a form of context parallelism whereby the input sequence is sharded across model replicas.2 However, unlike attention > Ring attention, DeepSpeed Ulysses performs a transposition that converts parallelism along the sequence dimension into parallelism across heads. This allows each head to compute its attention without any communication.

Footnotes

  1. [2205.05198] Reducing Activation Recomputation in Large Transformer Models

  2. [2309.14509] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models