0x00 Abstract

NVIDIA Megatron is a distributed training framework based on PyTorch for training large Transformer language models. It combines data parallelism, tensor parallelism, and pipeline parallelism, and is well worth studying for the mechanisms behind training models such as GPT-3.

This series will consist of 6–7 articles that study Megatron through its papers and source code.

This article translates and analyzes the two Megatron papers together with the official GTC slides, in the hope that readers come away with a basic understanding of how Megatron works.

0x01 Introduction

1.1 The Problem

In the field of NLP, larger models bring more accurate and more powerful semantic understanding and reasoning. With the availability of large-scale compute and ever-growing datasets, model parameter counts have increased exponentially. Training such large models is challenging for several reasons:

  • (a) GPU memory challenges. Even the largest GPU's main memory cannot hold the parameters of these models. For example, a 175B-parameter GPT-3 needs 175B × 4 bytes = 700 GB just for the model parameters, another 700 GB for the gradients, and 1400 GB for the optimizer states, for a total of 2.8 TB (a back-of-the-envelope calculation follows this list).
  • (b) Compute challenges. Even if we could fit the model into a single GPU (for example, by swapping parameters between host and device memory), the sheer number of operations required would lead to unrealistically long training times (training a 175-billion-parameter GPT-3 on a single NVIDIA V100 GPU would take about 288 years). See the floating-point operations appendix of the paper 2104.04473 for the calculation.
  • (c) Communication challenges. Different parallel strategies correspond to different communication patterns and volumes.
    • Data parallelism: communication occurs in the all-reduce of gradients during backpropagation, and the volume is the size of the model held on each GPU.
    • Model parallelism: we expand on this below.
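To make the memory numbers concrete, here is a minimal back-of-the-envelope sketch (assuming 4-byte values for parameters and gradients and two 4-byte Adam states per parameter, as in the list above):

```python
# Rough memory estimate for a 175B-parameter model (assumed byte counts as above).
params = 175e9

param_bytes = params * 4          # ~700 GB of parameters
grad_bytes = params * 4           # ~700 GB of gradients
optim_bytes = params * (4 + 4)    # ~1.4 TB of Adam states (momentum + variance)

total = param_bytes + grad_bytes + optim_bytes
print(f"total ≈ {total / 1e12:.1f} TB")   # ≈ 2.8 TB
```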

Training such models therefore requires parallelization for acceleration. Hardware accelerators are scaled out for deep neural network training in two modes: data parallelism and model parallelism.

1.2 Data Parallelism

In data parallelism, the model is replicated on every worker, so each worker holds a complete copy of the model. The input dataset is sharded: each training mini-batch is split across the workers, and the workers periodically aggregate their gradients to ensure that they all see a consistent version of the weights. For large models that do not fit on a single worker, data parallelism can also be applied to smaller shards within the model.

Data-parallel scaling generally works well, but it has two limitations:

  • (a) Beyond a certain point, the per-GPU batch size becomes too small, which reduces GPU utilization and increases communication cost;
  • (b) The maximum number of devices that can be used equals the batch size, which limits the number of accelerators usable for training.

1.3 Model parallelism

To overcome this limitation of data parallelism, memory-management techniques such as activation checkpointing are used, and model parallelism is used to partition the model so that the weights and their associated optimizer state do not all have to reside on the same processor at the same time.

Model parallelism distributes the memory and computation of one model across multiple workers, solving the problem that a single model does not fit on one card: the model is simply placed on multiple devices.

Model parallelism comes in two flavors, which differ in how the model is partitioned: pipeline parallelism and tensor parallelism.

  • Pipeline model parallelism places different layers of the model on different devices, for example the first few layers on one device, the middle layers on another, and the last few layers on a third.
  • Tensor parallelism is intra-layer partitioning: a single layer is split and placed on different devices. It can also be understood as distributing a matrix operation across devices, for example splitting one matrix multiplication into several smaller matrix multiplications that run on different devices.

As shown in the figure below, inter-layer parallelism (pipeline parallelism) is at the top: the model is cut vertically, the first three layers are assigned to the first GPU and the last three layers to the second GPU. Below that is tensor parallelism: the model is cut horizontally, and each tensor is split into two pieces that are placed on different GPUs.

Or, viewed another way, the two kinds of partitioning can be applied at the same time: they are orthogonal and complementary.

GTC 2020: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

1.3.1 Communication

Let’s look at the model parallel communication.

  • Tensor parallelism: communication takes place during the forward and backward computation of every layer. The communication type is all-reduce; not only is the data volume per communication large, the communications are also frequent.
  • Pipeline parallelism: communication happens at the adjacent cut points between pipeline stages. The communication type is P2P; the amount of data per communication is small, but it is relatively frequent. Moreover, because of the nature of pipelining, GPU idle time arises, which is called the pipeline bubble.

For example, in the figure below, the original pipeline is shown at the top, model parallelism at the bottom, and the bubble positions are marked in the middle.

Because tensor parallelism usually stays within one machine, it is accelerated by NVLink; pipeline parallelism usually spans machines connected through InfiniBand switches.

Figure from Megatron paper.

1.3.2 Tensor parallelism

Some attempts have been made at tensor (intra-layer) model parallelism, in which the matrix multiplications inside each Transformer layer are split across multiple GPUs. While this approach works well on NVIDIA DGX A100 servers (with eight 80 GB A100 GPUs) for models with up to about 20 billion parameters, it becomes problematic for larger models, which have to be split across multiple multi-GPU servers. That leads to two problems:

  • (a) The all-reduce communication required by tensor parallelism has to go over the links between servers, which are slower than the high-bandwidth NVLink inside a multi-GPU server;
  • (b) A high degree of model parallelism produces many small matrix multiplications (GEMMs), which may reduce GPU utilization.

1.3.3 Pipeline parallelism

Pipeline model parallelism is another technique that supports the training of large models. In pipeline parallelism, the layers of a model are sharded across multiple GPUs. A batch is split into smaller microbatches, and execution is pipelined across these microbatches.

With pipeline parallelism, the layers of a model are spread across multiple devices. When it is applied to models that repeat the same Transformer block, each device can be assigned an equal number of Transformer layers. Megatron does not consider more asymmetric model architectures, for which the assignment of layers to pipeline stages is harder. In pipeline model parallelism, training runs one set of operations on one device and then passes the output to the next device in the pipeline, which runs a different set of operations.

Naive pipelining has the problem that an input may see weight updates in its backward pass that it did not see in its forward pass. Pipelining schemes therefore need to ensure that inputs see consistent versions of the weights in forward and backward propagation, so that weight updates have well-defined synchronous semantics.

The layers of the model can be assigned to workers in various ways, and different schedules can be used for the forward and backward computation of the inputs. The layer assignment and the scheduling strategy lead to different performance trade-offs. Regardless of the schedule, in order to preserve strict optimizer semantics, the optimizer step needs to be synchronized across devices, which requires a pipeline flush at the end of every batch: in-flight microbatches are allowed to complete while no new microbatches are injected. Megatron introduces such periodic pipeline flushes.

At the beginning and end of every batch, devices sit idle. We call this idle time the pipeline bubble, and we want it to be as small as possible. Depending on the number of microbatches injected into the pipeline, up to 50% of the time can be spent flushing the pipeline. The larger the ratio of the number of microbatches to the pipeline depth (size), the smaller the fraction of time spent in the flush. Therefore, achieving high efficiency usually requires a large batch size.

Some approaches use a parameter server together with pipeline parallelism, but these suffer from inconsistency problems. TensorFlow's GPipe framework overcomes this inconsistency by using synchronous gradient descent. However, this approach needs additional logic to handle the pipelining of communication and computation, suffers from pipeline bubbles that reduce efficiency, or requires changes to the optimizer itself that can affect accuracy.

Some asynchronous and bounded-staleness approaches, such as PipeMare, PipeDream, and PipeDream-2BW, do away with flushes entirely, but this relaxes the weight-update semantics. Megatron leaves these options to future work.

1.4 Combining Techniques

Users can apply a variety of techniques to train large models, each with different trade-offs, and the techniques can be combined. However, combining them leads to complex interactions and is a serious challenge for the overall system: it requires not only reasonable model partitioning (based on the characteristics of the algorithm) but also a hardware/software co-designed system architecture, and careful reasoning is needed to achieve good performance. The following question is therefore particularly important:

How should parallel techniques be combined to maximize training throughput for large models at a given batch size, while preserving strict optimizer semantics?

The Megatron-LM developers showed how a technique they call PTD-P, which combines pipeline, tensor, and data parallelism, can train large language models on thousands of GPUs with good computational performance (52% of peak device throughput). PTD-P combines pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism to train models with a trillion parameters, with graceful scalability, in an optimized cluster environment that has high-bandwidth links between GPUs on the same server and across servers.

Achieving this scale throughput requires innovation and careful design in several areas:

  • Efficient kernel implementations, which make most of the computation compute-bound rather than memory-bound.
  • Intelligent partitioning of the computation graph across devices, to reduce the number of bytes sent over the network while also limiting device idle time.
  • Domain-specific communication optimizations and the use of high-speed hardware (state-of-the-art GPUs with high-bandwidth links between GPUs within and across servers).

1.5 Guidelines

The Megatron developers studied how the various combinations affect throughput and distilled the following guidelines for distributed training:

  • The different parallelism modes interact in complex ways: the parallelization strategy affects communication volume, the computational efficiency of the kernels, and the idle time workers spend waiting at pipeline flushes (pipeline bubbles). For example, tensor model parallelism is efficient within a multi-GPU server, but large models must also employ pipeline model parallelism.

  • The schedule used for pipeline parallelism affects the communication volume, the pipeline bubble size, and the memory used to store activations. Megatron proposes a new interleaved schedule that improves throughput by up to 10% over the previously proposed schedule, at the cost of a slightly higher memory footprint.

  • The values of hyperparameters such as the microbatch size affect the memory footprint, the arithmetic efficiency of the kernels executed on the workers, and the pipeline bubble size.

  • Distributed training is communication-intensive. Using slower inter-node links or more communication-intensive partitionings hurts performance.

0x02 Tensor Model Parallelism

2.1 Principle

We use a GEMM, $Y = XA$, to see how model parallelism is done: $X$ is the input, $A$ is the weight, and $Y$ is the output. Mathematically, a linear layer can be computed by partitioning the matrices into blocks and combining the partial results; the non-linear layers are not themselves partitioned.

2.1.1 Row Parallelism

Let's start with row parallelism, which splits $A$ into two parts by row. To keep the multiplication valid, we also split $X$ into two parts by column, so that the last dimension of $X_1$ equals the first dimension of $A_1$. Mathematically:


$$XA = \begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} = X_1 A_1 + X_2 A_2 = Y_1 + Y_2 = Y$$

So $X_1 A_1$ can be computed on the first GPU and $X_2 A_2$ on the second GPU, and the results are then added up.

Let's walk through the computation as the slides illustrate it step by step. First, the dot product of a row of $X_1$ (the horizontal red arrow) with a column of $A_1$ (the vertical arrow) produces one green element of $Y$. Repeating this for the remaining rows and columns fills in a full row of output, and eventually the whole of $Y_1$ on the first GPU. The second GPU produces the blue $Y_2$ in the same way. Finally, $Y_1$ and $Y_2$ are added to obtain the final output $Y$. The sketch below checks this numerically.
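A minimal PyTorch sketch of row parallelism on a single process, where the two "GPUs" are simply two tensor slices (the shapes are illustrative):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 6)          # input
A = torch.randn(6, 8)          # weight

# Row parallelism: split A by rows, and X by columns to match.
X1, X2 = X[:, :3], X[:, 3:]
A1, A2 = A[:3, :], A[3:, :]

Y1 = X1 @ A1                   # partial result on "GPU 1"
Y2 = X2 @ A2                   # partial result on "GPU 2"
Y = Y1 + Y2                    # the all-reduce (sum) of the partial results

assert torch.allclose(Y, X @ A, atol=1e-5)
```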

2.1.2 Column Parallelism

Next let's look at column parallelism, which splits $A$ by column. In this case every GPU keeps the full input $X$, computes its own slice $XA_i$ of the output, and the slices are concatenated to form $Y$.

The final calculation result is as follows (a numerical check appears after the figure caption):

GTC 2020: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
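A matching single-process sketch of column parallelism (again, the two "GPUs" are just tensor slices):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 6)
A = torch.randn(6, 8)

# Column parallelism: split A by columns; every "GPU" keeps the full X.
A1, A2 = A[:, :4], A[:, 4:]

Y1 = X @ A1                            # computed on "GPU 1"
Y2 = X @ A2                            # computed on "GPU 2"
Y = torch.cat([Y1, Y2], dim=-1)        # gather (concatenate) the partial outputs

assert torch.allclose(Y, X @ A, atol=1e-5)
```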

2.2 Model-Parallel Transformers

This section looks at model parallelism applied to the Transformer, specifically intra-layer partitioning, i.e. tensor model parallelism.

2.2.1 The Transformer

Since Google published the Attention paper ("Attention Is All You Need") in 2017, model architectures in recent years have been built on top of the Transformer. The number of layers in a model is the number of Transformer blocks it contains, so the computation of a language model is dominated by the Transformer. The Transformer, in turn, is essentially a large amount of matrix computation and is well suited to parallel execution on GPUs.

A Transformer layer consists of Masked Multi-Head Self-Attention and a Feed-Forward block. The Feed-Forward part is an MLP composed of fully connected layers, each of which is a matrix multiplication followed by a GeLU activation or Dropout.

Megatron's feed-forward block is a two-layer MLP: the first layer maps from H to 4H and the second layer maps from 4H back to H. The Transformer therefore has the architecture shown below, where the purple blocks correspond to the fully connected layers, each blue block represents one Transformer layer, and the red ×L means the blue block is replicated L times.

2.2.2 Partitioning the Transformer

Distributed tensor computation is an orthogonal and more general approach that partitions tensor operations across multiple devices to accelerate computation or increase model size. FlexFlow is a deep learning framework for this kind of parallelization that provides a way to choose the best parallelization strategy. More recently, Mesh-TensorFlow introduced a language for specifying a general class of distributed tensor computations in TensorFlow: the user specifies the parallel dimensions in the language, and the computation graph is compiled with the appropriate collective primitives. Megatron takes similar insights from Mesh-TensorFlow and exploits the computational parallelism of the Transformer's attention heads to parallelize the Transformer model. However, Megatron does not build a framework or compiler for model parallelism; instead, it makes a few targeted modifications to an existing PyTorch Transformer implementation. Megatron's approach is simple: it does not require a new compiler or any code rewriting, and can be implemented entirely by inserting a few simple primitives.

Megatron parallelizes the Transformer by partitioning both the Masked Multi-Head Self-Attention block and the Feed-Forward block. Exploiting the structure of Transformer networks, it creates a simple model-parallel implementation by adding a few synchronization primitives.

2.2.3 Partitioning the MLP

Let's start with the MLP block. Its first part is a GEMM, followed by a GeLU:


$$Y = \mathrm{GeLU}(XA)$$

One option for parallelizing the GEMM is to split the weight matrix $A$ along its rows and the input $X$ along its columns:


$$X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}, \quad A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$$

With this partitioning, the result becomes $Y = \mathrm{GeLU}(X_1A_1 + X_2A_2)$. Each of the two terms in parentheses can be computed on a separate GPU, and the sum is then formed by an all-reduce operation. Since GeLU is a nonlinear function, $\mathrm{GeLU}(X_1A_1 + X_2A_2) \neq \mathrm{GeLU}(X_1A_1) + \mathrm{GeLU}(X_2A_2)$, so this scheme requires a synchronization point before the GeLU, at which the GPUs exchange information.
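A quick numerical check of this inequality (a sketch; any other nonlinear activation behaves the same way):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X1A1 = torch.randn(2, 3)   # stand-ins for the partial products X1·A1 and X2·A2
X2A2 = torch.randn(2, 3)

lhs = F.gelu(X1A1 + X2A2)
rhs = F.gelu(X1A1) + F.gelu(X2A2)
print(torch.allclose(lhs, rhs))   # False: GeLU does not distribute over the sum
```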

Another option is to split $A$ along its columns, $A = [A_1, A_2]$. This partitioning allows the GeLU nonlinearity to be applied independently to the GEMM output of each partition:


$$\begin{bmatrix} Y_1 & Y_2 \end{bmatrix} = \begin{bmatrix} \mathrm{GeLU}(XA_1) & \mathrm{GeLU}(XA_2) \end{bmatrix}$$

This approach is better because it removes the synchronization point; the outputs of the two GeLUs are simply kept separate. We therefore split the first GEMM in this column-parallel fashion and split the second GEMM along its rows, so that it can consume the output of the GeLU layer directly without any communication (such as an all-reduce), as shown in the figure.

In the figure, the first GEMM is followed by the GeLU operation and the second GEMM is followed by Dropout. The logic is as follows (a numerical sketch follows the list):

  1. The entire MLP input $X$ is made available on every GPU via $f$.
  2. For the first fully connected layer:
    1. Using column partitioning, the weight matrix is split across the two GPUs, giving $A_1$ and $A_2$.
    2. A matrix multiplication on each GPU produces its share of the first layer's output, $Y_1$ and $Y_2$.
  3. For the second fully connected layer:
    1. The weight matrix is split across the two GPUs by rows, giving $B_1$ and $B_2$.
    2. The previous outputs $Y_1$ and $Y_2$ already have the right shapes, so each can be multiplied directly with its corresponding part of $B$ ($B_1$, $B_2$) without any communication, giving $Z_1$ and $Z_2$, one on each GPU.
  4. $Z_1$ and $Z_2$ are summed by an all-reduce through $g$ (which is a synchronization point), and Dropout then produces the final output $Z$.
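The whole flow can be checked with a single-process PyTorch sketch in which the two "GPUs" are emulated by tensor slices (the shapes and names are illustrative, not Megatron's API):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = 8
X = torch.randn(2, H)            # full input, replicated on every "GPU" by f
A = torch.randn(H, 4 * H)        # first FC weight  (H -> 4H)
B = torch.randn(4 * H, H)        # second FC weight (4H -> H)

# Column-split A and row-split B across two "GPUs".
A1, A2 = A[:, : 2 * H], A[:, 2 * H :]
B1, B2 = B[: 2 * H, :], B[2 * H :, :]

# Per-GPU computation: GeLU is applied independently to each partition.
Y1, Y2 = F.gelu(X @ A1), F.gelu(X @ A2)
Z1, Z2 = Y1 @ B1, Y2 @ B2

Z = Z1 + Z2                      # g: all-reduce (sum) before Dropout
assert torch.allclose(Z, F.gelu(X @ A) @ B, atol=1e-4)
```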

The output of the second GEMM is then reduced across the GPUs before being passed to the Dropout layer. This approach splits both GEMMs of the MLP block across the GPUs and needs only one all-reduce in the forward pass (the $g$ operator) and one all-reduce in the backward pass (the $f$ operator). The two operators are conjugates of each other and can be implemented in PyTorch in just a few lines of code. As an example, the implementation of the $f$ operator looks like this:
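A minimal sketch of the two conjugate operators, close to the pseudo-code in the Megatron paper (the process-group argument and other details of the real implementation are omitted here):

```python
import torch
import torch.distributed as dist

class _F(torch.autograd.Function):
    """f operator: identity in the forward pass, all-reduce in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output)   # sum gradients over the tensor-parallel group
        return grad_output

class _G(torch.autograd.Function):
    """g operator: all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)             # sum partial outputs over the tensor-parallel group
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```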

Implementation of the $f$ operator. $g$ is the conjugate of $f$: it uses the identity in its backward function and all-reduce in its forward function.

2.2.4 Self-Attention

As shown in the figure below.

  • First, for the self-attention block, Megatron exploits the parallelism inherent in multi-head attention: the GEMMs associated with the key (K), query (Q), and value (V) projections are partitioned column-wise, so that the matrix multiplications corresponding to each attention head are performed locally on one GPU. This splits both the parameters and the workload of the attention heads across the GPUs, and each GPU ends up with a partial output.
  • Second, for the subsequent fully connected layer, since each GPU holds only part of the output, the weight matrix $B$ is split by rows and multiplied directly with the local inputs $Y_1$ and $Y_2$; the final result $Z$ is then obtained by the all-reduce in $g$ followed by Dropout (a small shape sketch follows the figure caption below).

Figure: Transformer block with model parallelism. $f$ and $g$ are conjugates: $f$ uses the identity operator in forward propagation and all-reduce in backward propagation, while $g$ uses all-reduce in forward propagation and the identity operator in backward propagation.
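To picture the head partitioning, here is a small shape sketch (the sizes are illustrative assumptions): with 16 attention heads and a tensor-parallel size of 2, each rank keeps the Q/K/V projection columns for 8 heads and computes the attention for those heads entirely locally.

```python
import torch

hidden, n_heads, tp_size = 512, 16, 2
head_dim = hidden // n_heads                 # 32
heads_per_rank = n_heads // tp_size          # 8 heads per GPU

# Separate Q, K, V projection weights, each [hidden, hidden].
Wq = torch.randn(hidden, hidden)
Wk = torch.randn(hidden, hidden)
Wv = torch.randn(hidden, hidden)

# Column-parallel split: rank r keeps only the columns belonging to its heads,
# so the attention for those heads needs no communication with other ranks.
rank = 0
cols = slice(rank * heads_per_rank * head_dim, (rank + 1) * heads_per_rank * head_dim)
Wq_local, Wk_local, Wv_local = Wq[:, cols], Wk[:, cols], Wv[:, cols]

print(Wq_local.shape)   # torch.Size([512, 256]) -> 8 heads x 32 dims each
```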

2.2.5 Communication

The GEMM of the subsequent linear layer (after the self-attention layer) is parallelized along its rows and takes the output of the parallel attention layer directly, without requiring communication between GPUs. This approach for both the MLP and the self-attention layer fuses the two GEMM groups, eliminates the intermediate synchronization point, and leads to better scalability. It allows us to execute all the GEMMs in a Transformer layer with only two all-reduces in the forward path and two in the backward path (see the figure below).

Figure: Communication operations in a Transformer layer. A single model-parallel Transformer layer performs four communication operations in total during forward and backward propagation.

The Transformer language model's output embedding has dimension hidden size ($H$) times vocabulary size ($v$). Since the vocabulary of modern language models is on the order of tens of thousands of tokens (GPT-2, for example, uses a vocabulary of 50,257), parallelizing the output-embedding GEMM is very beneficial. However, because the Transformer language model shares weights between the output embedding layer and the input embedding layer, both need to be modified.

We parallelize the input embedding weight matrix $E_{H \times v}$ along the vocabulary dimension, $E = [E_1, E_2]$ (by column). Since each partition now contains only part of the embedding table, an all-reduce (the $g$ operator) is required after the input embedding. For the output embedding, one approach is to perform the parallel GEMM $[Y_1, Y_2] = [XE_1, XE_2]$ to obtain the logits, then perform an all-gather $Y = \text{all-gather}([Y_1, Y_2])$ and feed the result into the cross-entropy loss. In this case, however, the all-gather transfers $b \times s \times v$ elements ($b$ is the batch size and $s$ the sequence length), which is huge because of the vocabulary size. To reduce the communication volume, we fuse the output of the parallel GEMM $[Y_1, Y_2]$ with the cross-entropy loss, which reduces the dimension to $b \times s$.

2.2.6 Summary

Our model-parallel approach is designed to reduce communication and keep the GPUs busy with computation. Rather than having one GPU compute the Dropout, Layer Normalization, or residual connections and broadcast the result to the other GPUs, we choose to replicate these computations on every GPU.

Model parallelism is orthogonal to data parallelism, so we can use both at the same time to train large models. The figure below shows how GPUs are grouped for mixed model and data parallelism.

  • One model instance occupies 8 GPUs, and the model is replicated 64 times, for a total of 512 GPUs.
  • Model parallelism. Multiple GPUs within the same server form a model-parallel group, e.g. GPUs 1 through 8 in the figure, holding one model instance distributed across those GPUs. The remaining GPUs, which may reside in the same server or in other servers, run the other model-parallel groups. GPUs within each model-parallel group perform all-reduces among themselves.
  • Data parallelism. GPUs in the same position of each model-parallel group (e.g. GPUs 1, 9, ..., 505) form a data-parallel group; that is, processes holding the same model parameters belong to the same data-parallel group. For data parallelism, each all-reduce is performed across one GPU from every model-parallel group.
  • All communication is implemented through PyTorch calls to NCCL.

During backpropagation, we run multiple gradient all-reduce operations in parallel, one per data-parallel group, to reduce the weight gradients within each group. The total number of GPUs required is the product of the model-parallel and data-parallel group sizes (the sketch after the figure caption reproduces this grouping).

Figure: GPU grouping for mixed model and data parallelism, with 8-way model parallelism and 64-way data parallelism.
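A few lines of rank bookkeeping reproduce the grouping (a sketch only, not Megatron's initialization code; ranks are numbered from 0 here, whereas the figure numbers GPUs from 1):

```python
world_size = 512
model_parallel_size = 8
data_parallel_size = world_size // model_parallel_size   # 64

# Model-parallel groups: consecutive ranks within a server, e.g. [0..7], [8..15], ...
model_parallel_groups = [
    list(range(i, i + model_parallel_size))
    for i in range(0, world_size, model_parallel_size)
]

# Data-parallel groups: ranks holding the same model shard, e.g. [0, 8, 16, ..., 504].
data_parallel_groups = [
    list(range(i, world_size, model_parallel_size))
    for i in range(model_parallel_size)
]

print(model_parallel_groups[0])      # [0, 1, 2, 3, 4, 5, 6, 7]
print(data_parallel_groups[0][:4])   # [0, 8, 16, 24]
```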

0x03 Parallel Configuration

Let’s look at how to mix and match parallelism.

3.1 Symbol Description

The notation used in the rest of this article is as follows:

  • $(p, t, d)$: the pipeline-parallel, tensor-parallel, and data-parallel sizes, with $p \cdot t \cdot d = n$, the total number of GPUs.
  • $B$: global batch size; $b$: microbatch size; $m = B/(b \cdot d)$: number of microbatches per pipeline.
  • $s$: sequence length; $h$: hidden size.

3.2 Tensor and Pipeline Model Parallelism

Both tensor and pipeline model parallelism can be used to partition the model parameters across multiple GPUs. As mentioned above, with periodic flushes, pipeline parallelism produces a pipeline bubble of size $(p-1)/m$. Let us assume $d = 1$ (data-parallel size), so $t \cdot p = n$. In this case, the pipeline bubble size is:


$$\frac{p-1}{m} = \frac{n/t - 1}{m}$$

If we fix $B$, $b$, and $d$ (so that $m = B/(b \cdot d)$ is also fixed), the pipeline bubble shrinks as $t$ increases.
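A worked example of the bubble fraction under these assumptions (the numbers are illustrative):

```python
def bubble_fraction(n, t, m):
    """Pipeline bubble size (p - 1) / m with p = n / t and data-parallel size d = 1."""
    p = n // t
    return (p - 1) / m

# 32 GPUs and m = 128 microbatches (B, b, d held fixed):
print(bubble_fraction(n=32, t=4, m=128))   # p = 8 -> 7/128 ~ 0.055
print(bubble_fraction(n=32, t=8, m=128))   # p = 4 -> 3/128 ~ 0.023 (larger t, smaller bubble)
```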

The communication volume between GPUs is also affected by $p$ and $t$. Pipeline model parallelism uses cheaper point-to-point communication, whereas tensor model parallelism uses the more bandwidth-hungry all-reduce (two all-reduce operations in each forward and backward pass).

  • With pipeline parallelism, the total amount of communication per microbatch between each pair of consecutive devices (in the forward or backward pass) is $bsh$, where $s$ is the sequence length and $h$ is the hidden size.
  • With tensor model parallelism, tensors of total size $bsh$ must be all-reduced twice among the $t$ model replicas in both the forward and the backward pass of every layer.

Tensor model parallelism therefore increases the communication volume between devices, so when $t$ exceeds the number of GPUs in a single node, performing tensor model parallelism over the slower inter-node links is not economical.

Thus:

Conclusion #1: When considering the different forms of model parallelism, with $g$-GPU servers the tensor-model-parallel size should generally be kept within $g$, and pipeline parallelism should then be used to scale up to larger models across servers.

3.3 Data and Model Parallelism

Then consider data parallelism and model parallelism.

3.3.1 Pipeline Model Parallelism.

Let $t = 1$ (tensor-model-parallel size). Then the number of microbatches per pipeline is $m = B/(d \cdot b) = b'/d$, where $b' := B/b$. With $n$ GPUs, the number of pipeline stages is $p = n/(t \cdot d) = n/d$, and the pipeline bubble size is:


$$\frac{p-1}{m} = \frac{n/d - 1}{b'/d} = \frac{n-d}{b'}$$

As $d$ becomes larger, $n - d$ becomes smaller, so the pipeline bubble shrinks. Since the memory footprint of training may exceed the capacity of a single accelerator, it is not always possible to increase $d$ all the way to $n$. On the other hand, the all-reduce communication required by data parallelism does not increase significantly with a higher data-parallel degree.

We can also analyze the effect of increasing the batch size $B$. For a given parallel configuration, as $B$ increases, $b' = B/b$ increases, so $(n-d)/b'$ decreases and throughput goes up. The all-reduce required by data parallelism also becomes less frequent, further improving throughput.
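A small worked example of the bubble size $(n-d)/b'$ (illustrative numbers, with $b' = B/b$ as defined above):

```python
def bubble_fraction(n, d, b_prime):
    """Pipeline bubble size (n - d) / b' with t = 1, p = n / d, m = b' / d."""
    return (n - d) / b_prime

# 32 GPUs and b' = B/b = 128 microbatches in the batch:
for d in (2, 4, 8, 16):
    print(d, bubble_fraction(n=32, d=d, b_prime=128))
# Larger d (more data parallelism) shrinks the bubble; a larger B (larger b') shrinks it too.
```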

3.3.2 Data and Tensor Model Parallelism.

With tensor model parallelism, an all-reduce must be performed for every microbatch, which can be very expensive across multiple GPU servers, whereas data parallelism needs only one all-reduce per batch. Moreover, with tensor model parallelism each model-parallel rank performs only a subset of the computation of every layer, so for layers that are not large enough, modern GPUs may not execute these sub-matrix computations at peak efficiency.

Conclusion #2: When using data and model parallelism, the total model-parallel size should be $M = t \cdot p$ such that the model parameters and intermediate metadata fit in GPU memory; data parallelism can then be used to scale training out to more GPUs.

3.4 Microbatch Size

The choice of the microbatch size $b$ also affects the throughput of model training. For example, on a single GPU, a larger microbatch size can increase per-GPU throughput by up to 1.3×. Given a fixed parallel configuration $(p, t, d)$ and batch size $B$, we want to determine the optimal microbatch size $b$.

The data-parallel communication volume is the same regardless of the microbatch size. Let $t_f(b)$ and $t_b(b)$ be functions that map the microbatch size to the forward and backward computation time of a single microbatch. Ignoring communication cost, the total time to compute a batch is (here $b'$ is defined as $B/d$):


$$\left( \frac{b'}{b} + p - 1 \right) \cdot \left( t_f(b) + t_b(b) \right)$$

Thus, the microbatch size affects both the arithmetic intensity of the operations and the pipeline bubble size (through $m$).
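A small sketch of this trade-off with a made-up (illustrative) per-microbatch timing model shows that the total batch time has a sweet spot in $b$:

```python
def batch_time(b, b_prime, p, t_f, t_b):
    """(b'/b + p - 1) * (t_f(b) + t_b(b)), ignoring communication; b' = B/d."""
    m = b_prime // b                       # number of microbatches
    return (m + p - 1) * (t_f(b) + t_b(b))

# Illustrative timing model: fixed per-microbatch overhead plus a term linear in b.
t_f = lambda b: 1.0 + 0.5 * b
t_b = lambda b: 2.0 + 1.0 * b

for b in (1, 2, 4, 8):
    print(b, batch_time(b, b_prime=64, p=8, t_f=t_f, t_b=t_b))
# Larger b amortizes the overhead but adds bubble cost; here the minimum is at b = 4.
```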

Conclusion #3: The optimal microbatch size $b$ depends on the throughput and memory-footprint characteristics of the model, as well as on the pipeline depth $p$, the data-parallel size $d$, and the batch size $B$.

3.5 Comparison

Let’s look at a comparison of the various parallel mechanisms.

3.5.1 Tensor versus Pipeline Parallelism.

We observe that tensor model parallelism works best within a node (a DGX A100 server), since this limits the amount of communication that has to cross nodes. Pipeline model parallelism, by contrast, uses cheaper point-to-point communication and can run across nodes without bottlenecking the entire computation. However, pipeline parallelism spends a significant amount of time in the pipeline bubble, so the total number of pipeline stages should be limited such that the number of microbatches in the pipeline is a reasonable multiple of the pipeline depth. Peak performance is therefore reached when the tensor-parallel size equals the number of GPUs in a single node (8 for DGX A100 nodes). This result shows that neither tensor model parallelism alone (as in Megatron V1) nor pipeline model parallelism alone (as in PipeDream) can match the performance of the two techniques used together.

3.5.2 Pipeline versus Data Parallelism.

Experiments show that, for each batch size, throughput decreases as the pipeline-parallel size increases. Pipeline model parallelism should therefore be used mainly to support training of large models that do not fit on a single worker, while data parallelism should be used to scale training out.

3.5.3 Tensor versus Data Parallelism.

Let's look at how data parallelism and tensor model parallelism affect performance. Data-parallel communication happens only once per batch and is relatively infrequent, whereas tensor model parallelism requires all-reduce communication for every microbatch in the batch. This tensor-model-parallel communication dominates the end-to-end training time, especially when it has to cross GPU nodes. In addition, as the tensor-model-parallel size increases, each GPU performs smaller matrix multiplications, which lowers per-GPU utilization.

Note that although data parallelism scales efficiently, we cannot rely on data parallelism alone for large models with a limited training batch size, because (a) the memory capacity is insufficient, and (b) data parallelism has scaling limits (for example, GPT-3 is trained with a batch size of 1536, so data parallelism alone could use at most 1536 GPUs; however, roughly 10,000 GPUs are used to train this model).

0x04 Conclusion

Megatron uses PTD-P (pipeline parallelism across nodes, tensor parallelism within a node, and data parallelism) to achieve very high aggregate throughput (502 petaFLOP/s) while training large models with a trillion parameters.

  • Tensor model parallelism is applied to the Transformer layers within a node, so it runs efficiently on HGX-based systems.
  • Pipeline model parallelism is applied to the Transformer layers across nodes, making effective use of clusters with multiple network cards per node.
  • Data parallelism is layered on top of the other two, allowing training to scale out to more GPUs and run faster.

0xEE Personal Information

★★★★★ Thoughts on life and technology ★★★★★

WeChat official account: Rosie's Thoughts

0xFF References

Megatron Papers and Code Analysis (2)

Megatron Papers and Code Analysis (1)

Megatron-LM

Megatron learning summary

GTC 2020: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

developer.nvidia.com/gtc/2020/sl…

NVIDIA Megatron: A Distributed Training Framework for Large Transformer Language Models