These are notes and references I've been accumulating to understand how AI workflows (both training and inferencing) work to help inform the architectural decisions that go into designing supercomputers specifically for AI.
There are three ways in which training a model can be divided across GPU nodes:
- Data parallelism
- simplest way to train at scale (thousands of GPUs)
- partition the training dataset (the batch) and give each GPU node its own subset of the training dataset (a minibatch)
- each GPU node holds the entire model
- communication happens after each epoch
- scales very well since multiple copies of the model are training in parallel, but may increase the time to train a model (convergence time) since training data may be less randomized as a result of partitioning
- Pipeline parallelism (aka layer parallelism)
- break up model into layers, then distribute whole layers across GPU nodes
- requires moderate rewriting the training code to include communication within each epoch
- scales well for models with lots of big layers
- Operator parallelism (aka tensor parallelism, tensor slicing)
- break layers of a neural network up and distribute them across GPU nodes
- requires significant rewriting the training code to include communication within each epoch
- does not scale well due to high communication requirements
These parallelization approaches can be used at the same time. For example, training a large language model across multiple DGX nodes likely involves tensor parallelism within the DGX node (since it has NVLink which makes the communication fast), pipeline parallelism across 16 DGX nodes, and data parallelism to accelerate training by scaling to a thousand DGX nodes.
Implementing these levels of parallelism concurrently is complicated, and a set of frameworks and further refinements to them have popped up. Huggingface has a page on parallelism that explains some of the more sophiciated combinations such as ZeRO.
When performing data-parallel training, each concurrent instance of the model uses a different subset of inputs (the batch is partitioned into minibatches and each model instance gets a minibatch). Then,
- During forward propagation, there is no communication since the model on each concurrent instance (and its weights) are identical. The only difference is in the input data (the minibatch).
- During backpropagation, there is a nonblocking AllReduce that happens after each layer's gradients have been calculated.
- There is a barrier at the end of backpropagation to ensure that all gradients have been added up appropriately.
- After the backpropagation has completed, the optimizer is applied. In the simplest case (like plain old stochastic gradient descent), the sum of all gradients are then used to update weights on each instance. Since the weights on each model instance were identical at step 1 and AllReduce ensures our gradients are all identical, there is no communication needed to update the weights.
I found an article by Simon Boehm on Data-Parallel Distributed Training of Deep Learning Models really helpful in understanding how this works.
Running the optimizer (step 4 above) can be a can of worms though, since many optimizers (like Adam) are stateful. They maintain quantities (like momentum) that persist across epochs to help them converge faster, and these quantities do need to be synchronized across all nodes. The communication pattern of these stateful optimizers can vary though.
There are ways to perform asynchronous data-parallel tranining where not all replicas of the model synchronize their weights after each pass, but extra consideration must be taken to ensure the model still converges.
The ZeRO-DP paper (2020) states that a trillion-parameter model using a stateful optimizer (like Adam) requires 16 TiB of GPU memory at 16-bit precision. This implies around 16 bytes (128 bits) per parameter with 8× that for other quantities like optimizer states and gradients. This paper also enumerates what contributes to this 8× and breaks this down using Adam as an example. In brief, the 16 bytes per parameter is composed of
- a 16-bit weight
- a 16-bit gradient
- a 32-bit copy of the weight for the optimizer reduction
- a 32-bit momentum (one part of the optimizer state)
- a 32-bit variance (the other part of the optimizer state)
Mixed precision is used to minimize numerical instabilities (things like floating point underflow and overflow) that can result from performing multiply-accumulate operations found throughout training. For example, NVIDIA's Tensor Cores can take two 16-bit arrays, multiply them together using 32-bit precision, then add a 32-bit array to the result.
According to Microsoft DeepSpeed introduction (2020), a 40 GB GPU can hold a model containing 1.2 billion parameters which corresponds to 32 bytes (256 bits) per parameter. This number probably includes what the ZeRO-DP paper refers to as residual memory consumption - things that don't strictly scale with the number of weights but otherwise consume practically usable memory.
A lot of research goes into reducing the memory footprint of models since a smaller footprint allows you to train a model on fewer GPUs. For example, checkpointing activations is a technique that allows you to trade GPU memory consumption for GPU computation; you can checkpoint activations and recompute using these checkpoints to fit more parameters into memory.
Architectural Requirements for Deep Learning Workloads in HPC Environments by Ibrahim et al establishes a nice method for calculating how much storage bandwidth is required to keep a GPU fully utilized when training different models. They evaluate relatively small models that are most relevant to scientific research and demonstrate:
- If you can establish how many flops are required to pass one sample through a model (forward and backwards) during training, you can use the average size of a sample to calculate a MiB per FLOP ratio for a model. This is equivalent to MiB/s per FLOPS and you can multiply it by the FLOPS capability of a GPU (or a whole system) to get an order-of-magnitude estimate of the bandwidth required per GPU to train a specific model.
- They show that CosmoFlow is a computationally inexpensive model and requires 65 GB/s per petaFLOP/s. They found that, in practice, CosmoFlow can only utilize 35-50 TFLOP/s per NVIDIA V100 GPU, so training this model on a single V100 requires 2.275 - 3.250 GB/s, and an 8-way V100 node would require 18.2 GB/s - 26 GB/s. By comparison, a typical NFS client cannot achieve more than 3 GB/s over TCP.
- By comparison, ResNet-50 being trained on ImageNet is the least I/O-intensive and only requires, at most, 573 MB/s per V100 or 3.9 GB/s per 8-way V100 node.
I've created a simple tool that illustrates how to do this arithmetic and calculates the GB/s required to train a model on different GPUs:
It's a very loose model and estimates the upper bound of bandwidth required by assuming that each GPU has enough memory bandwidth, PCIe bandwidth, power, cooling, etc to train at the full rated performance on their respective spec sheets. As shown in this Ibrahim paper (they find that CosmoFlow trains at 35-50 TFLOPS of the theoretical 130 TFLOPS), this is never the case.
Exascale deep learning for climate analytics by Kurth et al directly calculated their required storage bandwidth for a modified "Tiramisu" network for climate dataset segmentation and classification at 189 MB/s per V100 GPU, or 1.14 GB/s for a 6-way Power9 GPU node, or 1.16 TB/s for the full scale of the training job they ran. Their FLOP/s/sample was 4.188 on V100 GPUs, and they achieved 20.93 TFLOP/s/GPU during training.
The datasets involved in AI include
- model checkpoints
- model weights (a subset of a checkpoint)
- raw training data
- preprocessed training data
Model checkpoints and model weights are described in the memory requirements section above.
Raw training data are text, images, audio files, that have not been wedged into a shape that the model training framework can accept.
Preprocessed training data is almost always smaller than raw training data and is ready to consume by the model training process. For large language models, this means tokenized text.
According to OpenAI, a token in a typical English-language dataset is about four bytes. The Pile paper has both tokens and words for different datasets and found an average 3.41 bytes per token. Assuming OpenAI's 4 bytes/token for all but The Pile dataset, we can estimate the size (in GB) of various LLM training datasets:
|Dataset||Training tokens||Training Bytes|
|LLaMa-2 70B dataset||2.0 trillion||8 TB (7.3 TiB)|
|OPT-175 dataset||180 billion||720 GB (670 GiB)|
|GPT-3 dataset||300 billion||1.2 TB (1.1 TiB)|
|The Pile dataset||260 billion||890 GB (830 GiB)|
|ROOTS/BLOOM dataset||341 billion||1.6 TB (1.5 TiB)|