
LLM Parallel Training (6): Activation Optimization


Prior Knowledge

Activation

An activation is a temporary tensor that is computed during the forward pass (fp) and reused during the backward pass (bp). Caching these tensors after fp speeds up bp; the downside is that activations consume a large amount of GPU memory. Let's take one transformer layer as an example and analyze the activations it keeps.


The straightforward terms are skipped here; the analysis focuses on the few that are easy to get wrong.

  1. \(QK^T\): the Q output and the K output must be cached, \(2sbh + 2sbh\)
  2. \(softmax\): there are \(a\) heads, each producing an \(s \times s\) attention matrix per sequence, with \(b\) sequences in the batch; the output is fp16 (2 bytes), so \(2 \cdot a \cdot s \cdot (s \cdot b) = 2as^2b\)
  3. \(dropout\): a mask array must be cached to mark which positions pass gradients in bp; each position only needs a bool (1 byte), so \(a \cdot s \cdot (s \cdot b) = as^2b\)

If none of this temporary activation memory is released, each transformer layer occupies \(sbh(34+5\frac{as}{h})\) bytes (without any parallel acceleration).
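
As a sanity check, the formula is easy to evaluate directly. Below is a minimal sketch (plain Python, my own variable names) plugging in the GPT-3 sizes used later in this post, with an assumed micro-batch size b = 1:

```python
# Per-layer activation memory of one transformer layer, in bytes:
# sbh * (34 + 5as/h), with no parallelism and no recomputation.
def activation_bytes(s, b, h, a):
    return s * b * h * (34 + 5 * a * s / h)

s, b, h, a = 2048, 1, 12288, 96                 # GPT-3 sizes; b = 1 is an assumption
total = activation_bytes(s, b, h, a)
attn_part = 5 * a * s ** 2 * b                  # softmax + dropout terms: 5as^2b
print(f"per-layer activations: {total / 2**30:.2f} GiB")
print(f"softmax/dropout share: {attn_part / total:.0%}")  # ~70%
```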


Selective Activation Recomputation

The recomputed part: take GPT-3 as an example, with a = 96, s = 2048 and h = 12288, so 5as/h = 80, i.e. roughly 70% of the total activation memory (80 out of 114). This portion is mainly the attention internals: the softmax output, the dropout mask, and the dropout output. These activations are not inputs to any matrix multiplication and cost very little to recompute. By releasing them after the forward computation and recomputing them during bp, we recover about 70% of the activation memory at a small computational cost.
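
A minimal way to express "recompute only the attention internals" in PyTorch is to wrap just the softmax/dropout core in torch.utils.checkpoint, so its outputs are dropped after fp and recomputed during bp while Q, K and V stay cached. This is a sketch of the idea, not Megatron's actual implementation:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def attention_core(q, k, v, dropout_p=0.1):
    # The cheap-to-recompute part: scores, softmax, dropout, attention-times-V.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    probs = F.dropout(F.softmax(scores, dim=-1), p=dropout_p)
    return probs @ v

def attention_with_selective_recompute(q, k, v):
    # Q, K, V remain cached (they feed matmuls and are costly to recompute);
    # only the softmax/dropout activations are dropped and recomputed in bp.
    return checkpoint(attention_core, q, k, v, use_reentrant=False)
```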

The sharded-storage part: the intermediate outputs of the linear layers, e.g. Q, K and V, sit right after a matrix multiplication with W, so they are expensive to recompute and are better cached for reuse in the bp computation. These activations are stored sharded under tensor parallelism and sequence parallelism, and gathered by collective communication when bp needs them, minimizing memory consumption.

The activations sharded t ways by tensor parallelism total 24sbh and consist of:

  • attention part: the Q and K outputs cached for \(QK^T\), 4sbh; the V output cached for the attention-times-V matmul, 2sbh; and the input of the output projection linear, 2sbh. 8sbh in total.
  • MLP part: two 8sbh activations (the first linear's output / GeLU input and the GeLU output / second linear's input); because TP computes the first linear column-parallel and the second row-parallel, both are split into t parts. 16sbh in total.

The activations sharded t ways by sequence parallelism total 10sbh and consist of the following (the sketch after this list puts the TP and SP numbers together):

  1. Attention part: layerNorm input and output, 4sbh; dropout mask, sbh
  2. MLP part: layerNorm input and output, 4sbh; dropout mask, sbh
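
Putting the 24sbh / 10sbh split together with the recomputation section above, a small calculator (my own helper, using the same symbols as the formulas) shows how each technique shrinks the per-layer activations:

```python
# Per-layer activation bytes under different parallel settings,
# following the 10sbh + 24sbh/t split above (t = tensor-parallel size).
def act_bytes(s, b, h, a, t=1, seq_parallel=False, selective_recompute=False):
    attn_quadratic = 0 if selective_recompute else 5 * a * s / h
    sharded = (24 + attn_quadratic) / t              # tensor-parallel part
    unsharded = 10 / t if seq_parallel else 10       # sequence-parallel part
    return s * b * h * (sharded + unsharded)

s, b, h, a, t = 2048, 1, 12288, 96, 8                # GPT-3 sizes, assumed t = 8
for tag, kw in [("no parallelism", dict()),
                ("TP", dict(t=t)),
                ("TP + SP", dict(t=t, seq_parallel=True)),
                ("TP + SP + selective recompute",
                 dict(t=t, seq_parallel=True, selective_recompute=True))]:
    print(f"{tag:32s} {act_bytes(s, b, h, a, **kw) / 2**30:.2f} GiB")
```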

Sequence Parallelism

Sequence parallelism covers the parts of the transformer layer that tensor parallelism does not slice; their computation is independent along the sequence dimension (s), so the activations there are sliced along s, with the number of slices chosen equal to the tensor-parallel degree.

Why must the slicing degree equal the TP degree? Recall how results are aggregated after TP: after the row-parallel linear and its allReduce, the partial results from each card are combined so that every card holds the complete input. If we then wanted to slice this complete input with a degree different from TP, we would have to add a reduceScatter after the allReduce at \(\bar{g}\), and then an allGather at \(g\).

If instead the slicing degree equals the TP degree, the allReduce at \(\bar{g}\) can be dropped; this is equivalent to splitting the allReduce into two operations, a reduceScatter at \(\bar{g}\) and an allGather at \(g\). This shards the layerNorm activations while keeping the total communication volume unchanged.

(figure: sequence parallelism in a transformer layer, with the \(g\) / \(\bar{g}\) communication operators)
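
The "split allReduce into two operations" point can be written down directly with torch.distributed: a reduce-scatter along the sequence dimension followed by an all-gather reproduces the all-reduce result with the same total communication volume. A sketch, assuming an already initialized process group of size t:

```python
import torch
import torch.distributed as dist

def g_bar_then_g(x_partial: torch.Tensor):
    """x_partial: this rank's partial result of shape (s, b, h).
    g-bar: reduce-scatter along s, so each rank keeps s/t rows, enough to run
    layerNorm/dropout sequence-parallel. g: all-gather restores the full summed
    (s, b, h) tensor before the next tensor-parallel region."""
    t = dist.get_world_size()
    s = x_partial.shape[0]
    shard = torch.empty(s // t, *x_partial.shape[1:],
                        dtype=x_partial.dtype, device=x_partial.device)
    dist.reduce_scatter_tensor(shard, x_partial)   # the g-bar operator
    full = torch.empty_like(x_partial)
    dist.all_gather_into_tensor(full, shard)       # the g operator
    return shard, full   # `full` equals all_reduce(x_partial)
```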

ZeRO-R

Activation partitioning & checkpointing

The paper mentions that model parallelism leaves redundant copies of activations; I take this to mean redundant copies of the TP inputs. The paper says the activations are partitioned across the model-parallel ranks. I suspect the implementation works out the same as Megatron's sequence parallelism, but I'll mark this to confirm when I read the DeepSpeed code more closely.

It also mentions a way of using host memory to store activation checkpoints, which I imagine works roughly like the following steps (sketched in code after the list):

  1. The early layers' fp and bp are far apart in time, so right after their fp it makes sense to memcpyAsync their activation checkpoints to host memory.
  2. The last few layers' activations, close to the loss, stay in GPU memory and are freed as soon as bp has consumed them; the part that was offloaded to the CPU is brought back by memcpyAsync.
  3. If training and the activation copies share the same stream, no extra synchronization is needed; correctness follows from the ordering within the stream.
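
A rough sketch of that flow in PyTorch (illustrative names; unlike step 3 above it uses a dedicated copy stream for overlap, so explicit stream waits are added):

```python
import torch

copy_stream = torch.cuda.Stream()

def offload_checkpoint(act: torch.Tensor) -> torch.Tensor:
    """After an early layer's fp: async-copy its checkpoint to pinned host
    memory on a side stream so the compute stream is not blocked."""
    host = torch.empty(act.shape, dtype=act.dtype, pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        host.copy_(act, non_blocking=True)
    act.record_stream(copy_stream)   # keep `act` alive until the copy finishes
    return host

def prefetch_checkpoint(host: torch.Tensor) -> torch.Tensor:
    """Before bp reaches this layer: async-copy the checkpoint back to the GPU."""
    with torch.cuda.stream(copy_stream):
        dev = host.to("cuda", non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)  # make it visible to bp
    return dev
```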

Constant-size buffers

Collective communication operations such as all-reduce are most efficient when a large batch of data is moved at once. The downside is that fusing everything allocates a large temporary buffer, causing big swings in GPU memory usage, which becomes a problem at large model scale.

So in this part ZeRO fixes a buffer_size: once the fused data exceeds buffer_size, it is communicated in batches. The CPU copies of activation checkpoints mentioned above should be batched in the same way.
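
A toy version of the constant-size buffer, assuming torch.distributed is already initialized: small tensors are packed into one preallocated buffer and flushed with a single all-reduce whenever the next tensor would overflow it, so temporary communication memory never grows with the model:

```python
import torch
import torch.distributed as dist

class FixedCommBuffer:
    """Fuse many small all-reduces into a constant-size, preallocated buffer."""
    def __init__(self, buffer_size: int, dtype=torch.float16, device="cuda"):
        self.buf = torch.empty(buffer_size, dtype=dtype, device=device)
        self.pending = []   # (tensor, offset, numel) waiting in the buffer
        self.used = 0

    def add(self, t: torch.Tensor):
        n = t.numel()       # tensors larger than the buffer would need chunking (omitted)
        if self.used + n > self.buf.numel():
            self.flush()
        self.buf[self.used:self.used + n].copy_(t.reshape(-1))
        self.pending.append((t, self.used, n))
        self.used += n

    def flush(self):
        if not self.pending:
            return
        dist.all_reduce(self.buf[:self.used])          # one big collective
        for t, off, n in self.pending:
            t.copy_(self.buf[off:off + n].view_as(t))  # write reduced values back
        self.pending, self.used = [], 0
```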

In fact, a similar approach came up when writing the sparse copy in flux-gpu: copy in batches to avoid both a single oversized transfer and many fragmented transfers of small pieces of data.

GPU memory defragmentation

Why fragmentation happens: during fp only some of the activations (the checkpoints) are kept for bp, while the activations that will be recomputed in bp are released. As a result some memory lives long and some lives briefly, and the interleaving fragments the heap. This causes two problems: 1. the allocator becomes slow at finding a block that satisfies a request, and 2. a large contiguous block may be impossible to allocate even when enough total memory is free.

The paper says the activation checkpoints and grads are preallocated in contiguous blocks of memory. In fact, in a large model most of the w/grad/activation sizes are fixed at runtime, so they can all be allocated on the first iteration and reused afterwards, avoiding any memory allocation during the network computation; with no allocator in the loop, there is no fragmentation problem.
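
A sketch of the "allocate once, reuse every step" idea; the helper and the sizes below are illustrative, not ZeRO's actual code:

```python
import math
import torch

def carve(shapes, dtype=torch.float16, device="cuda"):
    """Allocate one contiguous block up front and hand out fixed-size views,
    so no allocator calls (and hence no fragmentation) happen during training."""
    total = sum(math.prod(shape) for shape in shapes.values())
    pool = torch.empty(total, dtype=dtype, device=device)
    views, offset = {}, 0
    for name, shape in shapes.items():
        n = math.prod(shape)
        views[name] = pool[offset:offset + n].view(shape)
        offset += n
    return pool, views

# Example: a gradient buffer and a recomputation buffer for one layer,
# carved once and reused every iteration (sizes are GPT-3-like, assumed).
pool, bufs = carve({"grad_w1": (12288, 4 * 12288), "act": (2048, 1, 12288)})
```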

ZeRO-Offload

ZeRO-Offload mainly targets the case where the model is far larger than GPU memory. At first glance it feels very familiar: it looks a lot like a parameter server deployed in local host memory. I see two differences: 1. training is synchronous, so while the CPU updates optimizer_state the GPU can only wait; 2. host memory holds the full set of parameters and there is no multi-machine communication (which would make the already slow CPU path even worse).

Computation strategy

  1. Make sure the CPU's compute load is much smaller than the GPU's, so the CPU never becomes the compute bottleneck, while maximizing the GPU memory saved. (optimizer_state takes the most memory, only needs to be touched once per batch, and does not participate in the backward gradient computation the way the fp16 w does, so it is the natural candidate to offload.)
  2. Minimize the communication between CPU and GPU (quantize and dequantize at communication time; a small sketch follows this list).
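
For point 2, the simplest illustration of "quantize at communication time" is dropping fp32 to fp16 before crossing PCIe and converting back on arrival, halving the traffic. A hedged sketch, not DeepSpeed's exact scheme:

```python
import torch

def offload_to_cpu(g32: torch.Tensor) -> torch.Tensor:
    # Quantize to fp16 before the PCIe copy so CPU<->GPU traffic is halved.
    return g32.to(torch.float16).cpu()

def fetch_to_gpu(w16_cpu: torch.Tensor) -> torch.Tensor:
    # Copy the CPU-updated fp16 value back and dequantize on the GPU.
    return w16_cpu.cuda().to(torch.float32)
```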

Scheduling strategy

Offload builds on the ZeRO-2 scheme, i.e. each card holds only its own partition of the gradients and optimizer state (including the master weights it offloads). I suspect the most important reason for choosing ZeRO-2 is that all cards can copy their own partitions at the same time with no redundant data on the wire, which avoids being dragged down by PCIe bandwidth.

The figure below shows the data flow on a single card (the swap arrow is drawn the wrong way in the paper; it should be CPU->GPU). There are two main places where communication runs asynchronously with computation (a hook-based sketch of the first follows the figure):

  1. g offload: during bp on the GPU, as soon as each layer's g has been computed it is asynchronously copied to host memory.
  2. p swap: after the CPU has updated a batch of w, it quantizes it chunk by chunk and asynchronously copies the chunks back to GPU memory.
(figure: single-card data flow in ZeRO-Offload)
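
One way to get the "copy each layer's g to the host as soon as bp produces it" overlap in PyTorch is a gradient hook that launches a non-blocking copy into a pinned buffer on a side stream. A sketch, not DeepSpeed's implementation:

```python
import torch

offload_stream = torch.cuda.Stream()

def attach_grad_offload(param: torch.Tensor) -> torch.Tensor:
    """Register a hook so that, as soon as bp produces this parameter's grad,
    an async device-to-host copy into a pinned fp16 buffer is launched."""
    host_grad = torch.empty(param.shape, dtype=torch.float16, pin_memory=True)

    def hook(grad: torch.Tensor) -> torch.Tensor:
        offload_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(offload_stream):
            host_grad.copy_(grad.to(torch.float16), non_blocking=True)
        grad.record_stream(offload_stream)
        return grad

    param.register_hook(hook)
    return host_grad   # the CPU must sync offload_stream before reading this
```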

Once the fp16 w is back in GPU memory, the rest of the computation is no different from ordinary ZeRO-2.

There is also a training mode that fully hides the CPU work, similar to the asynchronous CPU training used for recommendation models, except that the n-batch-delayed asynchronous update of the dense part is replaced by a fixed one-batch delay.


ZeRO-Offload++

In the first version of offload, all optimizer updates run on the CPU, as described above; while the CPU computes, the GPU can only wait, so how to use the GPU during that waiting window is the big question. offload++ offers a neat idea: set a ratio for where os_w is stored, e.g. 40% of os_w lives in host memory and is updated by the CPU, while the remaining 60% is updated by the GPU. The steps are as follows.

  1. As bp finishes the first 40% of the network, copy those g's to host memory.
  2. The CPU starts updating incrementally as the g's come down, computing the os_w updates for the portion that belongs to it.
  3. Once bp reaches the portion that belongs to the GPU update, the remaining 60% of the gradients are scattered to the cards that hold the corresponding os_w, and os_w is updated directly in GPU memory.
  4. When the CPU finishes, its quantized fp16_w is copied back to GPU memory and merged with the fp16_w already there for the next round of computation.
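
The ratio split can be thought of as partitioning this rank's optimizer state by parameter count: the first cpu_ratio fraction is held and updated on the CPU, the rest on the GPU. A toy partitioner under that assumption (names are mine):

```python
import torch

def split_os_by_ratio(params, cpu_ratio=0.4):
    """Assign each parameter's optimizer state to the CPU or the GPU so that
    roughly `cpu_ratio` of all elements is updated by the CPU."""
    total = sum(p.numel() for p in params)
    cpu_side, gpu_side, seen = [], [], 0
    for p in params:
        (cpu_side if seen < cpu_ratio * total else gpu_side).append(p)
        seen += p.numel()
    return cpu_side, gpu_side   # e.g. CPU Adam for one group, fused GPU Adam for the other
```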

The ratio is set manually; the idea is to keep as much os_w on the GPU as the GPU memory can hold and put whatever doesn't fit into host memory. It's a great idea; I'll take a closer look at the code later.

References

Megatron-LM paper: ./pdf/2205.05198

ZeRO-R paper: ./abs/1910.02054

ZeRO-Offload: /system/files/

ZeRO-Offload++ blog: ./microsoft/DeepSpeed/tree/master/blogs/deepspeed-offloadpp

Megatron paper walkthrough: ./Megatron-3-Reducing-Activation-Recomputation-in-Large-Transformer-Models-b4d8bfacd33c449383aa9c61123ab578#7c3cc0cb24c444b898c4e435d12bbd4f