LLM Parallel Training 7 - Hybrid Parallelism Summary

summarize

According to the previous articles in this series, the main parallel acceleration techniques used in pre-training large models have been analyzed in a series of breakdowns. However, in the real world, training is often a mixture of parallelism. How do we configure these parallel strategies to minimize communication bottlenecks and maximize GPU utilization in the training framework? There are so many variables, so let's take the simplest 3D parallelism as an example.

Hardware level: number of cards/inter-card bandwidth/network card bandwidth on a single machine, network topology construction for inter-machine communication.
Parallelization strategies are: number of tensor parallels/pipeline parallels/data parallels.
The training hyperparameters are: batch_size / AttnHeads / seq_len / hidden_size

If the parameters are adjusted by brainstorming, there will be a very large search space, and it is difficult to find the most optimal method for computational efficiency, so it is necessary to determine the approximate range of each parameter through theoretical analysis. Finally, we need to find a better solution through a finite number of attempts. In this chapter, we refer to nvidia's GTC presentation on tuning practice to analyze the theoretical tuning of GPT-3 training attempts.

Scenario Analysis of Parallel Methods

Marker remarks in back text.

\((p, t, d)\) : 3D parallel dimension.\(p\) represents the number of streaming parallels.\(t\) representing the number of tensor parallels.\(d\) Represents the number of data parallels
\(n\): Total number of GPUs. Requirements\(p\cdot t \cdot d = n\).
\(B\): Global batch size.
\(b\): Microbatch size.
\(b^{'}\): the size of the batch to be processed by a pipeline, equals to\(B/d\).
\(m = \frac{1}{b} \cdot \frac{B}{d}\): The number of microbatches in each pipeline for a batch.
\(s\): seq_len.
\(h\): hidden_size = emb_size * attention_head_size
\(a\): attention_head_size

Tensor Parallelism (TP)

TP overhead

paradigm	Normal	ColParallel	ratio
flops	(n multiplications + n additions) * n^2 = 2n^3	2n^3/t	1/t
Bandwidth	(n^2) [read or write of nn matrix] 2(fp16)) * 3(read X, read A, write Y) = 6n^2	2n^2 + 4n^2/t (A,Y Tangent)	(1+2/t)/3
Intensity(flops/bandwidth)	n/3	n/(2+p)	3/(2+t)

When the parallelism\(t\)As the intensity grows, it can be seen that the intensity is also in a growing trend. There is a trade-off between communication and computation cost. Since TP needs to perform an active AllReduce at the end of the process, it will lead to high communication cost in multi-computer communication. Therefore, TP is generally only considered for single-card communication. There are two main usage scenarios for TP in LLM.

MLP first columns and then rows, before and after this piece will usually be combined with SP to split AllReduce into allGather and reduceScatter.

Multiple heads at attention are sliced and paralleled. Each head computes independently of the other, so it can be sliced and diced.

Parallel flow lines (PP)

The main purpose of the pipeline is to slice the data of a batch into multiple mirco-batches, and do asynchronous parallelism between micro-batches. Because the communication content only contains the output of the sliced stage, and it is a point-to-point communication, there is no need for multi-point aggregate communication. The communication data size is small, so it is more suitable for the scenario of communication between multiple machines. In LLM, a transformLayer is generally used as a stage, and a pipeline is constructed between multiple stages, as shown in the following figure.

mixed parallelism

Once the network structure is determined, the TP and PP can be estimated to a more reasonable interval, and finally the DP value can be estimated according to the calculation of the video memory capacity.

Strategy analysis of TP and PP

Data Parallelism\(d=1\)When, the\(p * t = n\), with the following formula.

Streamline bubble_time.\(\frac{(p-1)}{m}=\frac{n/t-1}{m}\), Increasing TP parallelism reduces the bubble ratio, but increases the amount of communication within a single machine, a microbatch within tp requires 4 allReduce (2 each for fp/bp).
AllReduce traffic on a single machine.\(2bsh(\frac{t-1}{t})\), (layer activation is\(bsh\), allReduce communication is 2x the amount of data)
The amount of communication between individual micro-batch machines in pipeline parallelism is.\(C_{inter} = 2bsh\) (fp/bp once each)

Let there be a pipeline with\(l^{stage}\)For 1F1B non-interleaved scheduling, the amount of internal communication within a single stage is.

\[C_{inra} = l^{stage}\cdot4\cdot2bsh(\frac{t-1}{t}) = l^{stage}\cdot4\cdot2bsh(1-\frac{1}{t}) \]

So the relationship between inter-machine and intra-machine traffic is.

\[C_{intra}= l^{stage}\cdot4\cdot(1-\frac{1}{t}) \cdot C_{inter} \]

Since the inter-machine communication rate is much smaller (IB 200GB/s) than the inter-card communication rate (NVLink 600GB/s), we should try to reduce the inter-machine communication rate if we want to optimize the throughput.

[!TIP]

i.e.Make t as large as possible without causing TP to generate inter-machine communication. If this doesn't fit the model, then use pipelined parallelism to slice the model。

micro-batch setup

With all other parameters fixed. Adjust only the number of micro_batch, the execution time of a single batch.\((\frac{b^{'}}{b}+(p-1))\cdot(t_{f} + t_{b})\) , if b is increased, the number of individual pipelines will decrease but the execution time will be longer, and the computation time and b are non-linearly related. Moreover, after adjusting the micro-batch, the communication time will also change, so the mirco-batch adjustment needs experimental attempts to find the optimal solution. In his paper, megatron tries to set the mirco-batch of gpt training to 4.

Strategic analysis of DP

Easy to analyze the setup\(t=1, d * p = n\), the pipeline bubble percentage in this case is\(\frac{p-1}{m} = \frac{n/d - 1}{B/b/d} = \frac{b(n - d)}{B}\)

PP and DP relationship: For d monotonically decreasing, it can also be seen from the figure below that the training speed is faster when the number of pipeline parallelism is smaller and the data parallelism is larger. Therefore, we can increase the DP parallelism as much as possible while the PP can satisfy the memory usage.

Relationship to Batch_size: bubble is inversely proportional to B, the larger the B the higher the throughput. However, too large B and data parallelism can lead to model non-convergence. B needs to be adjusted without affecting the effectiveness of the model.

DP and TP relationshipsIn TP, each batch needs 4 allReduce, while DP only needs one allReduce for the gradient, and if W is small in TP, it will also affect the efficiency of matrix multiplication. The following figure shows that the smaller the TP parallelism, the larger the DP parallelism, the higher the throughput. The tuning strategy is to increase the DP as much as possible as long as the TP meets the memory requirement to increase the throughput.

[!TIP]

If the model is large, you need to combine model parallelism and streaming parallelism first, the\(M=t \cdot p\) A combination of these is used to satisfy the memory requirements of the model and the data associated with the model, but keeping M as small as possible. Data parallelism is then used to scale up the training (scaling up data parallelism, scaling up Global batch size)

GPT-3 example analysis

As an example, the following superparameter GPT-3 training is used.

Video Memory Analysis

ModelMemory

A single card stores model parameters in 4 main parts (due to pipeline parallelism, a single card usually stores only 1-2 transformLayers): attention parameter / FC parameter / token_emb / Positional encoding

in order to\(N_p\)Representing a full set of parameters, the number of parameters contained in a single card is as follows.

\[\frac{N_p}{n} = h * \frac{h}{t}*3(QKV parameter) + h * \frac{h}{t}(multihead spliced through fc) + h * \frac{h}{t}*4*2(fc1+fc2 parameter) \\ + \frac{v}{t} *h(token) + s*\frac{h}{t}(positional) \approx 1.73B\approx \frac{175B}{p*t} (number of single DP parameters on a single card) \]

In the mixed-precision training, the total data volume was expanded by 1 copy of fp16 w and grad, 1 copy of fp32 optimizer_state (\(w+grad+momentum+variance\))

\[N_{storage} = 2Bytes * N_p + 2Bytes * N_p + (4+4+4+4)Bytes*N_p = 20N_P = 27.4GB \]

Activation

In the nvidia share, it looks like activation only stores the emb activation before crossing the token and the activation before entering the fc, the rest is all recalculated during bp... Since I'm not using SP either, there's a redundancy of TP parallels for each card activation.

\[M_{act}^{emb} = 2𝐵𝑦𝑡𝑒𝑠 * 𝑏 * 𝑠 * \frac{v}{t} \]

\[M_{transformer}^{emb} = 2𝐵𝑦𝑡𝑒𝑠 * 𝑏 * 𝑠 * h * \frac{n}{p} \]

Extra

Contains temporary video memory to be allocated during fp & temporary video memory needed for communication & video memory fragmentation due to allocator

This has already been analyzed in the previous chapter - Activation Optimization, so I'll ignore it here.

There are three places where memory usage may spike: 1. fp completion, where the main change is the presence of a large number of extra. 2. bp completion, where there is relatively little consumption due to releasing a large amount of act and then recalculating. 3. updating the optimizer_state, where a lot of temporary memory is used for gradient allReduce, so there is a memory spike. 3. when updating optimizer_state, this is because a lot of temporary memory is used for gradient allReduce, so there is a memory spike.

Communications analysis

BW: bus bandwidth (length of data for a single communication)

TP: fp recalculation for each mlp and attention at fp /bp / bp One allReduce for each of the three phases.

\[T_{tp} = \frac{2bsh}{BW} * \frac{2(t-1)}{t}[number of allReduce communications * length of a single communication] * (3+3)[number of mlp and attentions\ allReduce] * \frac{n}{p}[number of layers] * \frac{B}{bd}[ minibatch count] \\\\ \]

DP: Need to do an allReduce on each copy of the data when the optimizer is updated.

\[T_{dp} = \frac{N_p}{BW} * \frac{2(d-1)}{d} \]

PP (1F1B interleaved): point-to-point for inter-machine communication and allGather for intra-machine communication (TODO: don't really see the point here).

\[T_{p p}=\underbrace{\left(2 \frac{B}{b d}+2(p-1)\right) \times \frac{\frac{\text { message size }}{t}}{B W_{\text {inter }}}}_{\text {P2P }}+\underbrace{\left(\frac{B}{b d}+(p-1)\right) \times \frac{\text { message size }}{B W_{\text {intra }}} \times \frac{t-1}{t}}_{\text {Allgather }} \]

It can also be seen from practical experiments that TP accounts for a major portion of the communication costs.

summarize

Tuning experience with 3d parallelism.

If the model is large, you need to combine model parallelism and streaming parallelism first, the\(M=t \cdot p\) A combination of these is used to satisfy the memory requirements of the model and the data associated with the model, but keeping M as small as possible. Data parallelism is then used to scale up the training (scaling up data parallelism, scaling up Global batch size)
Make t as large as possible without causing TP to generate inter-machine communication. If this doesn't fit the model, then use pipelined parallelism to slice and dice the model.

Essays in megatron-LM complexity analysis./pdf/2104.04473

nvidia GTC Presentation./gtc/2020/video/s21496

[nvidia GTC GPT-3 tuning analysis](Link./s/190TFeOI9SALaaH9CVMWH7Q?pwd=chux Extract code: chux)

megatron code reading blog./rossiXYZ/p/