• DeepSpeed: Extreme-scale Model training for everyone
  • DeepSpeed Team; Rangan Majumder, Vice President; Junhua Wang, VP, Distinguished Engineer
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: zhuzilin
  • Proofreader: samyu2000, luochen1992, lsvih

DeepSpeed: Extreme-scale model training for everyone

We launched DeepSpeed in February of this year. This open-source deep learning training optimization library includes a new memory optimization technology, ZeRO (Zero Redundancy Optimizer), which greatly advances large model training in terms of scale, speed, cost, and usability. DeepSpeed has helped researchers develop the Turing Natural Language Generation model (Turing-NLG), which at the time of publication was the world's largest language model (with 17 billion parameters) and achieved the best accuracy. In May we released ZeRO-2, which supports training models with 200 billion parameters up to 10 times faster than the state of the art, along with a series of compute, I/O, and convergence optimizations that enabled the fastest BERT training. Since then, we have continued to innovate at a rapid pace, pushing the boundaries of speed and scale in deep learning training.

Today, we are excited to share new advances that not only push deep learning training to the extreme, but also make the technology available to far more people, from data scientists training on supercomputers to those training on low-end clusters or even a single GPU. Specifically, DeepSpeed adds four new system technologies that further our AI at Scale initiative and drive innovation in Microsoft's AI products and platforms. These technologies provide extremely efficient use of compute, memory, and communication and enable training models with billions to trillions of parameters. They also support extremely long input sequences and can run on a single GPU, on high-end clusters with thousands of GPUs, or on low-end clusters with slow Ethernet.

  • Training trillion-parameter models with 3D parallelism: DeepSpeed implements a flexible combination of three parallel approaches: ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism. 3D parallelism adapts to the needs of different workloads to support extremely large models with over a trillion parameters, while achieving near-perfect memory-scaling and throughput-scaling efficiency. In addition, its improved communication efficiency enables users to train multi-billion-parameter models 2-7 times faster on conventional clusters with limited network bandwidth.
  • ZeRO-Offload enables training models 10 times larger on a single GPU: To train large models using both CPU and GPU memory, we extended ZeRO-2. Users can run models with up to 13 billion parameters on a machine with a single Nvidia V100 GPU without running out of memory, scaling the trainable model size to 10 times that of existing methods while maintaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens a window for many deep learning practitioners to explore bigger and better models.
  • Powering 10x longer sequences with 6x faster execution through DeepSpeed Sparse Attention: DeepSpeed provides a sparse attention kernel, an instrumental technology that supports long-sequence model inputs, including text, image, and speech. It handles input sequences an order of magnitude longer than classic dense Transformers and achieves up to a sixfold improvement in execution speed while maintaining comparable accuracy. It is also 1.5-3 times faster than the latest sparse implementations. In addition, our sparse kernels flexibly support sparse formats, enabling users to innovate with customized sparse structures.
  • 1-bit Adam reduces communication volume by up to 5 times: Adam is an effective (and perhaps the most widely used) optimizer for training large-scale deep learning models. However, it is generally incompatible with communication-efficient optimization algorithms, so communication overhead can become a bottleneck when scaling across distributed devices. We introduce a new 1-bit Adam algorithm and its efficient implementation. The algorithm reduces communication volume by up to 5 times while achieving a convergence rate similar to Adam's. We observe up to 3.5 times faster distributed training in communication-constrained scenarios, which allows the algorithm to scale to different types of GPU clusters and network environments.

This post will delve deeper into these four technologies. We’ve published these exciting optimization techniques in the open source project DeepSpeed.

3D parallelism: scaling to trillion-parameter models

With the rapid growth of compute on modern GPU clusters, training a powerful trillion-parameter model with amazing capabilities is no longer out of reach and may be possible in the near future. DeepSpeed combines three powerful technologies to train trillion-scale models and to scale to thousands of GPUs: data-parallel training, model-parallel training, and pipeline-parallel training. The symbiosis of the three allows deep learning training to scale far beyond what each strategy can offer alone. 3D parallelism simultaneously addresses the two fundamental challenges of training trillion-parameter models: memory efficiency and computational efficiency. As a result, DeepSpeed can scale to fit the most massive models in memory without sacrificing speed.

Understand the memory and computational efficiency challenges of training huge models

**Memory efficiency:** The memory required to train a trillion-parameter model far exceeds what a single GPU provides. Mixed-precision training with the Adam optimizer requires about 16 TB of memory just to store the model states (parameters, gradients, and optimizer states). For comparison, the most advanced Nvidia A100 GPU has only 40 GB of memory; 400 such GPUs would be needed just to store the model states.
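
As a rough check on these numbers (a back-of-the-envelope estimate assuming the usual mixed-precision layout of 2 bytes of FP16 parameters, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer states per parameter):

\[
10^{12}\ \text{parameters} \times (2+2+12)\ \tfrac{\text{bytes}}{\text{parameter}} = 16\ \text{TB},
\qquad
\frac{16\ \text{TB}}{40\ \text{GB per GPU}} = 400\ \text{GPUs}.
\]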

The activations consume additional memory that grows with the batch size. A trillion-parameter model trained with a batch size of only 1 generates more than 1 TB of activation memory. Activation checkpointing, which trades recomputation for memory, can reduce this to about 20 GB, but that is still too much for training.
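
A minimal PyTorch sketch of activation checkpointing (the recomputation technique mentioned above, shown as an illustration rather than DeepSpeed's implementation; newer PyTorch versions also accept a `use_reentrant` argument):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    # Wraps a sub-module so that its internal activations are discarded after
    # the forward pass and recomputed during backward, trading extra compute
    # for a large reduction in activation memory.
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return checkpoint(self.block, x)
```

Wrapping each Transformer layer this way is roughly what the "checkpoint activation" setting referred to throughout this post does.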

Both the model states and the activation memory must be efficiently partitioned across multiple GPU devices before such a large model can even begin training without running out of memory.

**Computational efficiency:** Training a trillion-parameter model end to end is estimated to require approximately 5,000 zettaflops (that is, 5 followed by 24 zeros; this estimate is based on OpenAI's work on scaling laws). Training such a model would therefore require 4,000 A100 GPUs running at 50% computational efficiency for about 100 days.
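
As a sanity check on this estimate (assuming the commonly quoted A100 mixed-precision peak of 312 teraflops, an assumption not stated in the original):

\[
4000\ \text{GPUs} \times 312\ \text{Tflops} \times 50\% \times 100\ \text{days} \times 86{,}400\ \tfrac{\text{s}}{\text{day}} \approx 5.4 \times 10^{24}\ \text{flops} \approx 5000\ \text{zettaflops}.
\]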

Although large supercomputing GPU clusters can have more than 4,000 GPUs, achieving high computational efficiency at this scale remains challenging because of batch size constraints. Computational efficiency increases as the ratio of computation time to communication time increases, and that ratio is proportional to the batch size. However, there is an upper limit on the batch size a model can be trained with, beyond which convergence deteriorates significantly.

In fact, one of the largest models trained to date, GPT-3, used a batch size of about 1,500. With around 4,000 GPUs, even a generous batch size of 4,000 would leave a batch size of only 1 per GPU, which limits scalability.

Understand the trade-offs between data parallelism, model parallelism, and pipeline parallelism

Data parallelism is a widely used technique in deep learning. In this approach, each batch of input training data is divided equally among the data-parallel workers. After backpropagation, the gradients must be communicated and reduced to ensure that the optimizer applies the same update on every worker. Data parallelism has several distinct advantages, including high computational efficiency and little implementation effort. However, its batch size grows with the number of workers, and we often cannot keep increasing the batch size without affecting convergence.

  • **Memory efficiency:** Data parallelism replicates the model and optimizer across all workers, so it is not memory efficient. DeepSpeed developed ZeRO, a collection of optimizations that improve the memory efficiency of data parallelism. This work relies on ZeRO stage 1, which partitions the optimizer states across workers to reduce redundancy.

  • **Computational efficiency:** As the degree of parallelism increases, the amount of computation performed by each worker remains constant, so data parallelism can scale almost linearly at small scale. However, the cost of reducing gradients across workers grows with the model size, so computational efficiency becomes limited when the model is large or the communication bandwidth is low. Gradient accumulation is a common strategy for amortizing this communication cost: it further increases the effective batch size by running forward and backward passes on multiple micro-batches and accumulating gradients locally before the gradient reduction and optimizer update, as sketched below.
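
A minimal sketch of gradient accumulation in plain PyTorch (illustrative only, not DeepSpeed's training engine; `micro_batches` and `loss_fn` are placeholder names):

```python
import torch

def train_step(model, optimizer, micro_batches, loss_fn):
    # Accumulate gradients over several micro-batches, then take a single
    # optimizer step, so any gradient reduction happens once per step rather
    # than once per micro-batch.
    optimizer.zero_grad()
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets)
        (loss / len(micro_batches)).backward()  # gradients sum into param.grad
    # In data-parallel training, the gradient all-reduce would occur here,
    # amortized over the whole accumulated batch.
    optimizer.step()
```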

Model parallelism is a broad class of techniques that partitions the individual layers of a model across multiple workers. By its very nature, the computation and communication of model parallelism are specific to the model structure, so implementing it can take substantial effort. DeepSpeed leverages Nvidia's Megatron-LM to provide large-scale model parallelism for Transformer-based language models. Model parallelism reduces memory usage in proportion to the number of workers and is the most memory efficient of the three types of parallelism, but it comes at the cost of the lowest computational efficiency.

  • **Memory efficiency:** Model parallelism reduces memory usage in proportion to the number of workers. Crucially, it is the only approach that reduces the activation memory of individual network layers. DeepSpeed further improves memory efficiency by partitioning the activation memory across model-parallel workers.

  • **Computational efficiency:** Because activation values must be communicated in every forward and backward pass, model parallelism has low computational efficiency. It requires high communication bandwidth and does not scale well beyond a node, where communication bandwidth is limited. In addition, each model-parallel worker performs less computation between communication stages, which further hurts computational efficiency. Model parallelism is often used together with data parallelism to trade off between memory and computational efficiency.

A pipeline-parallel training engine is also included in this DeepSpeed release! Pipeline parallelism divides the layers of a model into stages that can be processed in parallel. When a stage completes the forward pass of a micro-batch, the activations are communicated to the next stage in the pipeline. Similarly, when the next stage completes its backward pass, the gradients are communicated backward through the pipeline. Multiple micro-batches must be kept in flight simultaneously to ensure that the pipeline stages compute in parallel. Several approaches, such as PipeDream, have been developed to trade off memory, computational efficiency, and convergence behavior. The DeepSpeed approach extracts parallelism through gradient accumulation and maintains the same convergence as conventional data-parallel and model-parallel training at the same total batch size.

  • **Memory efficiency:** Pipeline parallelism reduces memory in proportion to the number of pipeline stages, allowing the model size to scale linearly with the number of workers. However, it does not reduce the memory footprint of each layer's activations. In addition, each worker must store the activations of every micro-batch in flight. As a result, the first stage of the pipeline holds roughly the same amount of activation memory as the total activation memory of a single micro-batch. A trillion-parameter model would need about 19 GB of activation memory for one micro-batch, nearly half the memory of the newly released Nvidia A100 GPU.

  • **Computational efficiency:** Pipeline parallelism has the lowest communication volume of the three, since it only communicates the activation values at stage boundaries, in proportion to their size. However, it cannot scale indefinitely. As with model parallelism, increasing the pipeline depth reduces the amount of computation per pipeline stage, which lowers the ratio of computation to communication. Achieving good computational efficiency also requires the computational load of every stage to be perfectly balanced.

In addition, pipeline parallelism incurs bubble overhead at the beginning and end of each batch, while the pipeline is being filled and drained. Training with gradient accumulation steps (and thus effective batch sizes) that are 4x or 8x the number of pipeline stages achieves 81% and 90% scaling efficiency, respectively, relative to a single pipeline stage.
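
These scaling numbers are consistent with the usual bubble-overhead approximation (shown here as a rule of thumb, not necessarily the exact accounting used above): with \(p\) pipeline stages and \(m\) micro-batches per batch, the fraction of non-idle time is roughly

\[
\text{pipeline efficiency} \approx \frac{m}{m + p - 1},
\]

which gives about 80% for \(m = 4p\) and about 89% for \(m = 8p\), close to the 81% and 90% reported above.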

Achieving high memory efficiency and high computational efficiency simultaneously with 3D parallelism

Data, model, and pipeline parallelism all play specific roles in improving memory and computational efficiency. Figure 1 illustrates our 3D strategy.

**Memory efficiency:** The layers of the model are first divided into pipeline stages, and the layers of each stage are further partitioned through model parallelism. This 2D combination simultaneously reduces the memory consumed by the model, the optimizer, and the activations. However, we cannot keep partitioning the model indefinitely without introducing communication overhead, which limits computational efficiency.

**Computational efficiency:** To scale the number of workers beyond what model and pipeline parallelism can support without sacrificing computational efficiency, we use ZeRO-powered data parallelism (ZeRO-DP). ZeRO-DP not only further improves memory efficiency by partitioning the optimizer states, but also scales to an arbitrary number of GPUs with minimal communication overhead by exploiting a topology-aware mapping.

Topology-aware 3D mapping (Figure 2): Each dimension of 3D parallelism is carefully mapped onto the workers to maximize computational efficiency by exploiting two key architectural properties.

  1. **Optimizing intra-node and inter-node communication bandwidth:** Model parallelism has the largest communication overhead of the three strategies, so we prioritize placing model-parallel groups within a node to take advantage of the larger intra-node bandwidth. Here we apply tensor-slicing model parallelism based on Nvidia Megatron-LM. When the model-parallel group does not occupy all the workers in a node, we place data-parallel groups within the node as well; otherwise, data parallelism runs across nodes. Pipeline parallelism has the lowest communication volume, so we can schedule pipeline stages across nodes without being limited by the communication bandwidth.
  2. **Increasing effective bandwidth through communication parallelism:** The gradient volume that each data-parallel group needs to communicate decreases linearly with the degree of pipeline and model parallelism, so the total communication volume is lower than with pure data parallelism. Furthermore, each data-parallel group communicates independently within a small, localized subset of workers, and the groups can communicate in parallel with one another. As a result, the effective bandwidth for data-parallel communication is amplified by the combination of reduced communication volume and increased locality and parallelism, as sketched just below this list.
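
A minimal sketch of such a topology-aware mapping (an illustration, not DeepSpeed's actual implementation) for the 32-worker example in Figure 1, with 4-way model parallelism, 2-way data parallelism, and 4 pipeline stages:

```python
# Map a flat worker rank to (pipeline stage, data-parallel replica, model-parallel slot).
# The model-parallel index varies fastest, so each 4-way model-parallel group occupies
# consecutive ranks (one node in Figure 2) and gets the high intra-node bandwidth,
# while pipeline stages span nodes.
def rank_to_3d(rank, model=4, data=2, pipe=4):
    m = rank % model
    d = (rank // model) % data
    p = rank // (model * data)
    return p, d, m

for r in range(32):
    p, d, m = rank_to_3d(r)
    print(f"rank {r:2d} -> pipe {p}, data {d}, model {m}")
```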

Figure 1: An example of 3D parallelism with 32 workers. The layers of the neural network are divided among four pipeline stages. The layers within each pipeline stage are further partitioned among four model-parallel workers. Finally, each pipeline stage is replicated across two data-parallel instances, and ZeRO partitions the optimizer states between the two replicas.

Figure 2: The mapping of the workers in Figure 1 to the GPUs of a system with eight nodes (four GPUs per node). GPUs of the same color are on the same node.

Learn more about how 3D parallelism trains trillion-parameter models

Using 8-way model parallelism, 64-way pipeline parallelism, and 8-way data parallelism, a trillion-parameter model can be scaled and trained on 4,096 Nvidia A100 GPUs.

By combining model parallelism and pipeline parallelism, 3D parallelism achieves excellent memory efficiency and high computational efficiency across multiple nodes. Model parallelism provides memory efficiency for the activations and model states within a node, while pipeline parallelism stores model states efficiently across nodes without the computational efficiency loss that further model parallelism would incur. For the trillion-parameter example with a micro-batch size of 1, the model states consume 30 GB of GPU memory and the partitioned activations consume 2.5 GB after activation checkpointing with the 3D parallelism described above. With a total memory footprint of 32.5 GB, such a model fits and can be trained on Nvidia A100 GPUs with 40 GB of memory.

Combining model parallelism with pipeline parallelism also lets pipeline parallelism achieve high computational efficiency with minimal bubble overhead even at very small batch sizes. With 8-way model parallelism, using a micro-batch of 1 per model results in an effective micro-batch of 1/8 per GPU. Therefore, using gradient accumulation steps that are 8 times the degree of pipeline parallelism, pipeline parallelism can reach 90% computational efficiency with an aggregate per-GPU batch size of only 1. Combined with data parallelism, this yields a total effective batch size of 4,096 on 4,096 GPUs while still achieving 90% pipeline efficiency.
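
One consistent reading of these numbers (an illustrative reconstruction of the configuration, not a statement from the original):

\[
\underbrace{1}_{\text{micro-batch size}} \times \underbrace{512}_{\text{grad. accumulation}\,=\,8 \times 64\ \text{stages}} \times \underbrace{8}_{\text{data parallel}} = 4096\ \text{samples per step},
\]

that is, exactly one sample per GPU across the 4,096 GPUs.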

But how does data parallelism affect computational efficiency? Doesn’t data parallelism require a large batch per GPU to be efficient?

Model parallelism can reduce the effective batch size per GPU to less than 1, which lets pipeline parallelism hide the pipeline bubble overhead even with small batch sizes. Note that by using pipeline parallelism across nodes, the data-parallel nodes at each stage of the pipeline can communicate independently of one another and in parallel with the other pipeline stages. In fact, in the fully connected network topology common in high-end GPU clusters, this has an important implication for the effective communication bandwidth available to data-parallel training. Because each node in a pipeline stage can communicate with its corresponding data-parallel nodes in parallel, the effective communication bandwidth is proportional to the number of pipeline stages. With 64 pipeline stages, the effective bandwidth becomes 64 times the bandwidth to and from a single node. This large effective bandwidth enables data parallelism to scale efficiently even at small batch sizes, where the computation-to-communication ratio is very low.

Training trillion-parameter models with linear scalability

DeepSpeed can train a language model with one trillion parameters using as few as 800 Nvidia V100 GPUs (Figure 3). We show the model size and training throughput and observe that both memory and computational efficiency scale linearly with the model size. Across the various configurations, about 1.4 billion parameters can be trained per GPU, which is the largest model size a single GPU can support without running out of memory, indicating near-perfect memory scaling. We also achieve close to perfect linear scaling of computational efficiency, with a throughput of 47 teraflops per V100 GPU. This is impressive scaling and throughput for the hardware in question.

Figure 3: Model size (in billions of parameters) and training throughput (in petaflops) as a function of the number of GPUs. DeepSpeed can train a model with a trillion parameters using 800 Nvidia V100 Tensor Core GPUs with 32 GB of memory. Each configuration uses 16-way model parallelism provided by NVIDIA Megatron-LM, with the remaining GPUs used for pipeline parallelism. The trillion-parameter model has 298 Transformer layers with a hidden size of 17,408 and is trained with a sequence length of 2,048 and a batch size of 2,048. For smaller models, we scale down the number of Transformer layers and the batch size with the number of GPUs.

How 3D parallelism accelerates training of GPT-3 scale models

Figure 4: System performance of 2D and 3D parallelism when training a GPT-3 scale model with 180 billion parameters on 800 GPUs. The model has 100 Transformer layers, a hidden size of 12,288, and 96 attention heads. The training batch size is 2,048 and the sequence length is 2,048. ZeRO-1 is enabled alongside data parallelism. P, M, and D denote the pipeline, model, and data parallel dimensions, respectively.

In Figure 4, we use the latest GPT-3 model architecture with over 175 billion parameters as a benchmark for 3D parallelism:

  • We first evaluate the 2D configurations (C1-C3). C1 and C2 use only pipeline and model parallelism; they can train the model, but they deliver low throughput and GPU utilization because the model is over-decomposed. C3 attempts to use only pipeline and data parallelism, but it cannot fit the model in memory without Megatron's model parallelism to reduce the activation memory.
  • The 3D configurations (C4-C10) are arranged by increasing degree of pipeline parallelism; the configurations in the middle, which balance the three forms of parallelism, achieve the best performance by attaining memory, computation, and communication efficiency all at once.
  • The best 3D configuration achieves 49 teraflops per GPU, more than 40% of the theoretical hardware peak.
See how hybrid parallelism can train GPT-2 on a low bandwidth cluster 7 times faster

We train a GPT-2 model with 1.5 billion parameters and show the communication advantages of hybrid parallelism in Figure 5. To highlight the communication stages of training, the model is trained on a four-node cluster with low inter-node bandwidth:

  • Model parallelism has no advantage in this case because the model is small and the bandwidth within the node is low.
  • Pipeline parallelism communicates an order of magnitude less than the data-parallel and model-parallel configurations, and it trains 7 times faster at small batch sizes.
  • Data parallelism uses gradient accumulation to amortize the communication cost as the batch size grows, but pipeline-parallel configurations still achieve twice the performance of data parallelism at larger batch sizes.
  • The hybrid pipeline and data parallel configuration avoids the gradient communication bottleneck by restricting data-parallel groups to the GPUs within a node, so gradient communication benefits from the faster intra-node bandwidth.

Figure 5: Throughput versus batch size when training GPT-2 (1.5 billion parameters) with a sequence length of 1,024. Training uses four nodes, each equipped with four V100 GPUs with 16 GB of memory. The GPUs are connected with 50 Gbps intra-node bandwidth and 4 Gbps inter-node bandwidth. DP denotes data parallelism with ZeRO-1 enabled. All methods scale the batch size by increasing the number of gradient accumulation steps.

ZeRO-Offload: training models 10 times larger on a single GPU

ZeRO-Offload increases the maximum model size that can be trained efficiently with few GPU resources by exploiting the compute and memory resources of both the GPU and the host CPU. It allows us to train models with up to 13 billion parameters on a single V100, 10 times larger than the state of the art, while sustaining a high training throughput of over 30 teraflops per GPU.

By enabling a single GPU to train models with billions of parameters, ZeRO-Offload makes large-model training accessible to deep learning practitioners with limited hardware resources.
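
A hedged configuration sketch for turning this on (the exact keys are version dependent and should be treated as assumptions to check against the DeepSpeed documentation):

```python
import json

# Illustrative DeepSpeed configuration enabling ZeRO stage 2 with CPU offload of
# optimizer states and gradients; the batch size and fp16 settings are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,           # partition optimizer states and gradients (ZeRO-2)
        "cpu_offload": True,  # keep them in host memory (ZeRO-Offload)
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# The resulting file is passed to the DeepSpeed engine/launcher in the usual way
# (for example, via a --deepspeed_config argument).
```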

Figure 6: The largest model sizes that can be trained on a single GPU with default PyTorch and with ZeRO-Offload.

The core technology behind ZeRO-Offload is to offload the optimizer states and gradients to CPU memory, building on ZeRO-2. This approach allows ZeRO-Offload to minimize the computational efficiency lost to copying data to the CPU while achieving the same, and sometimes even better, efficiency than ZeRO-2. The figure below shows the architecture of ZeRO-Offload:

Figure 7: Overview of ZeRO-Offload.

See how ZeRO-Offload trains multi-billion-parameter models on a single GPU

Training models with billions of parameters, such as GPT and T5, requires many GPUs just to hold the model and its states. Large-model training has mostly worked around this memory limitation through model parallelism across GPUs. Recently we released ZeRO, an efficient memory optimizer that partitions the model states (optimizer states, gradients, and model parameters) across data-parallel GPUs, allowing multi-billion-parameter models to be trained without model parallelism. However, ZeRO still requires a large number of data-parallel GPUs to hold the partitioned model states, so only a few have the resources to train such models.

ZeRO-Offload makes large-model training possible on a single GPU, democratizing it. To train models with billions of parameters without using multiple GPUs, ZeRO-Offload inherits ZeRO-2's partitioning of optimizer states and gradients. Unlike ZeRO-2, however, instead of keeping a partition of the optimizer states and gradients on each GPU, ZeRO-Offload moves both to host memory. The optimizer states are kept in CPU memory for the entire training process. The gradients are computed on the GPU and averaged with reduce-scatter during the backward pass; each data-parallel process then offloads its averaged gradient partition to the CPU (g offload in Figure 7) and discards the portions it is not responsible for.

Once the gradients are on the CPU, the partitioned optimizer states are updated in parallel on the CPU (p update in Figure 7). After the update, the partitioned parameters are moved back to the GPU and combined with an all-gather operation (g swap in Figure 7). ZeRO-Offload also improves training efficiency by using separate CUDA streams to overlap communication (such as g offload and g swap) with computation (such as the backward pass and p update).
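
A conceptual sketch of this kind of overlap in PyTorch (illustrative only, not ZeRO-Offload's implementation): a dedicated CUDA stream performs the device-to-host gradient copy into pinned memory while the default stream keeps computing.

```python
import torch

offload_stream = torch.cuda.Stream()

def offload_grad(grad):
    # Pinned host memory is required for the device-to-host copy to run
    # asynchronously with respect to compute on the default stream.
    cpu_buf = torch.empty(grad.shape, dtype=grad.dtype, device="cpu", pin_memory=True)
    offload_stream.wait_stream(torch.cuda.current_stream())  # gradient must be ready
    with torch.cuda.stream(offload_stream):
        cpu_buf.copy_(grad, non_blocking=True)  # overlaps with subsequent GPU work
    return cpu_buf

# Usage sketch: call offload_grad(p.grad) as each gradient becomes available
# during backward, and synchronize offload_stream before the CPU-side optimizer
# update reads the buffers.
```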

Advantages of ZeRO-Offload in model size, training speed, and scalability

**10x model scale:** Figure 6 shows that on a single 32 GB V100 GPU, PyTorch can train a model with up to 1.3 billion parameters, while ZeRO-Offload can train a model with up to 13 billion parameters, 10 times larger. This is because ZeRO-Offload keeps the optimizer states, which consume most of the GPU memory, in host memory throughout training, and it also moves the computed gradients to the CPU during the backward pass. The GPU memory saved can therefore be used to train larger models.

**Efficient training throughput:** As Figure 8 shows, when training a 10-billion-parameter model, ZeRO-Offload delivers more than 30 teraflops of throughput per GPU even when only a single GPU is used, and its throughput increases nearly linearly with the number of GPUs.

ZeRO-Offload complements ZeRO-2 nicely, enabling efficient training of large models on a small number of GPUs. By using CPU memory to reduce the GPU memory a model requires, ZeRO-Offload makes it feasible to train large models on 1 to 16 GPUs. On 32 GPUs, ZeRO-Offload performs slightly better than ZeRO-2; the gain comes from the additional GPU memory savings of ZeRO-Offload, which allow training with larger batch sizes and improve GPU computational efficiency despite the overhead of copying to the CPU. With more GPUs (such as 64 and 128), ZeRO-2 outperforms ZeRO-Offload, since both can now run similar batch sizes while ZeRO-2 has no overhead of moving data to the CPU, and optimizer updates are much faster on the GPU than on the CPU. In summary, ZeRO-Offload complements ZeRO-2 and extends the optimization range of the ZeRO family, offering large-model training optimizations from a single device up to thousands of devices.

Figure 8: Comparison of the training throughput of ZeRO-Offload and ZeRO-2 when training a 10-billion-parameter GPT-2 model on 128 GPUs.

DeepSpeed Sparse Attention: powering 10x longer sequences with 6x faster execution

Attention-based deep learning models such as Transformers are highly effective at capturing relationships between tokens in an input sequence, even when the distance between them is long. As a result, they are often used with text, image, and speech inputs, where the sequence length can reach thousands of tokens. However, although the attention module effectively captures dependencies within long sequences, in practice support for long sequence inputs is limited by compute and memory requirements, which grow quadratically with the sequence length \(n\).

To address this limitation, DeepSpeed provides a sparse attention kernel, an instrumental technology that reduces the compute and memory requirements of attention by orders of magnitude through block-sparse computation. The suite not only alleviates the memory bottleneck of attention, but also performs the sparse computation very efficiently. Its APIs can easily be integrated into any Transformer-based model. In addition to providing a variety of sparse structures, it can flexibly handle any user-defined block-sparse structure.

More specifically, sparse attention (SA) can be designed to compute local attention between nearby tokens, or global attention via summary tokens that are themselves computed with local attention. In addition, SA supports random attention and any combination of local, global, and random attention, as shown by the blue, orange, and green blocks in Figure 10. As a result, SA reduces the memory footprint to \(O(wn)\), where \(1 < w \leq n\) is a parameter whose value depends on the attention structure.
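
A small sketch of how such a block-level layout might be constructed (an illustration of the idea in Figure 10, not DeepSpeed's API; the block size and window parameters are made-up defaults):

```python
import torch

def block_sparse_layout(seq_len, block=64, local_window=2, num_global=1, num_random=2):
    # Boolean matrix over blocks: True means attention between that pair of
    # query/key blocks is computed; False means the block is skipped entirely.
    n = seq_len // block
    layout = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):                                     # local (sliding-window) blocks
        layout[i, max(0, i - local_window):i + local_window + 1] = True
    layout[:, :num_global] = True                          # global blocks are attended by all
    layout[:num_global, :] = True                          # and attend to all
    rand = torch.randint(0, n, (n, num_random))            # a few random blocks per row
    layout[torch.arange(n).unsqueeze(1), rand] = True
    return layout

print(block_sparse_layout(4096).float().mean())  # fraction of block pairs actually computed
```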

Figure 10: Variable sparse structure

**Efficient implementation on GPUs:** Although a basic implementation of sparse attention saves memory, it can be computationally even worse than dense computation, mainly because the sparse data leads to scattered memory accesses. Developing efficient sparse kernels is generally challenging, especially on GPUs. DeepSpeed provides efficient sparse attention kernels developed in Triton. The kernels are structured in a block-sparse paradigm that enables aligned memory accesses, reduces GPU thread divergence, and balances the workload across processors.

System performance: As shown in Figure 11, SA supports 10x longer sequences and up to 6.3x faster computation. The left figure shows the maximum sequence length that can be run with BERT-Base and BERT-Large. Our experiments cover three settings: dense, dense with activation checkpointing, and sparse (SA) with activation checkpointing. Compared with the dense settings, SA supports sequences 10x and 16x longer for BERT-Base and BERT-Large, respectively. In addition, compared with the dense mode, SA reduces the total computation and improves training speed: the improvement grows with the sequence length, reaching up to 6.3x for BERT-Base and 5.3x for BERT-Large.

Figure 11: Maximum sequence length supported by the BERT models (left); time to train BERT-Base (center) and BERT-Large (right) with different sequence lengths on a single Nvidia V100 GPU.

See how SA achieves accuracy comparable to or better than full dense attention

Related work on sparse attention (Sparse Transformer, Longformer, BigBird) has shown accuracy comparable to or better than full attention, which matches our experience. In addition to lower memory overhead and faster computation, we also observe that SA reaches higher accuracy and converges faster in production models. The figure below illustrates the accuracy of training a BERT-based production model for long-document comprehension (sequence length 2,048). The experiment covers three settings: dense training from scratch, SA training from scratch, and SA training continued from a dense checkpoint pretrained with a sequence length of 512. We observe that for pretraining from scratch, SA converges faster and reaches better accuracy than the dense setting. Moreover, continuing pretraining from the dense checkpoint with SA performs even better, in both time and accuracy.

Figure 12: Accuracy of long text comprehension applications

See how SA compares to the latest LongFormer

We compared SA with Longformer, a recent sparse structure and implementation. In our experiments, SA uses the "Fixed" sparsity pattern. The two implementations achieve comparable accuracy. In terms of system performance, SA outperforms Longformer in both training and inference:

  • 1.5x faster execution when pretraining an MLM on Wikitext103
  • 3x faster BERT-Base inference (batch size 1, sequence length 2,048)

Flexibility in handling any block-sparse structure: The DeepSpeed sparse attention suite does not target any specific sparse structure, so it effectively supports model researchers in exploring arbitrary block-sparse structures. Currently, we have added popular sparse structures such as Fixed (from OpenAI Sparse Transformer), [BigBird](arxiv.org/pdf/2007.14…) (from Google), and BSLongformer (a block-sparse implementation of AI2's Longformer). We also provide a template with a "variable" structure, shown in Figure 10, which can be used to easily customize any block-sparse structure with random, local, or global attention patterns.

1-bit Adam: 5x less communication and 3.4x faster training

Scaling the training of large models such as BERT and GPT-3 requires careful optimization spanning model design, architecture, and system capabilities. From a system perspective, communication efficiency has become a major bottleneck, especially on commodity systems that use standard TCP and have limited network bandwidth.

Communication compression is an important technique for reducing training time on such systems. One of the most effective ways to compress communication is error-compensated compression, which provides robust convergence even with 1-bit compression. However, state-of-the-art error-compensation techniques only work with simple optimizers that depend linearly on the gradient, such as stochastic gradient descent (SGD) and momentum SGD. They cannot be combined with nonlinear optimizers such as Adam, which delivers the best convergence rate and accuracy for many tasks, including training BERT-like models.

For a powerful optimizer like Adam, the nonlinear dependence on the gradient (through the variance term) makes it challenging to develop error-compensated compression techniques, which limits the practical value of advanced communication compression.

Understand the background of classical compression techniques

One common method of communication compression is 1-bit compression.
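
A representative 1-bit compressor is the scaled sign operator below (shown as an illustration of the idea; the exact operator used in the referenced work may differ). For \(x \in \mathbb{R}^d\):

\[
C(x) = \frac{\lVert x \rVert_1}{d}\,\mathrm{sign}(x).
\]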

In this compression, each number is represented with a single bit, reducing the data volume by 32x. The problem is that such a naive approach significantly degrades convergence and has little practical value. Recent studies show that by using error-compensated compression, we can expect nearly the same convergence rate as with uncompressed communication.

The idea of error compensation can be summarized as: 1) compress, 2) remember the compression error, and 3) add the compression error back in the next iteration. For SGD, error-compensated compression yields the following update.
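
A standard way to write this error-compensated update (a reconstruction consistent with the description, with learning rate \(\gamma\), gradient \(g_t\), and error memory \(e_t\)):

\[
x_{t+1} = x_t - \gamma\, C(g_t + e_t), \qquad e_{t+1} = (g_t + e_t) - C(g_t + e_t).
\]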

\(C(\cdot)\) is the 1-bit compression operator. The advantage of this error compensation is that the historical compression errors \(e_t\) and \(e_{t-1}\) eventually cancel each other out, as the following form of the update shows.
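
Substituting the definition of \(e_{t+1}\) above, each step effectively applies the uncompressed gradient up to a telescoping error term:

\[
x_{t+1} = x_t - \gamma\,(g_t + e_t - e_{t+1}),
\]

so that when the updates are summed over many iterations, consecutive error terms cancel.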

This strategy has been proven to work for all optimization algorithms that depend linearly on the gradient, such as SGD and momentum SGD.

Understand the challenges of applying error compensation to Adam

We provide an overview of Adam’s algorithm below. Update rules are as follows:
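
With momentum \(m_t\), variance \(v_t\), step size \(\gamma\), and a small constant \(\eta\) (bias-correction terms omitted for brevity; this is the standard statement of the algorithm):

\[
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, \qquad
x_{t+1} = x_t - \gamma\,\frac{m_t}{\sqrt{v_t} + \eta}.
\]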

As the formula above shows, the variance term \(v_t\) depends nonlinearly on the gradient \(g_t\). If we apply ordinary error compensation to Adam, we find (see Figure 13) that Adam does not converge.

Figure 13: Error compensation compression does not apply to Adam due to nonlinear dependence on gradients

Compressed communication with 1-bit Adam

To compress communication when using the Adam optimizer, we developed 1-bit Adam, which removes the nonlinear gradient dependence through preprocessing. We observed that the magnitude of the changes in the nonlinear term, the variance (\(v_t\)), decreases significantly after a few training epochs, and that keeping \(v_t\) constant afterward does not change the convergence rate. The proposed 1-bit Adam optimizer therefore consists of two parts (as shown in Figure 14): a warm-up phase, which is essentially the vanilla Adam algorithm, and a compression phase, in which the variance term is kept constant and the remaining linear term (the momentum) is compressed into a 1-bit representation.

The compression phase of the algorithm is controlled by a threshold parameter (see Figure 14): when we detect that the change in the variance falls below a certain threshold, we switch to the compression phase. Our study shows that the warm-up phase requires only 15-20% of the total training steps.

Learn more about the underlying mechanics of 1-bit Adam

The weights of 1-bit Adam are updated as follows for the \(i\)-th worker during the compression phase.
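
A hedged sketch of that update, consistent with the description above (each worker compresses its error-compensated momentum, the compressed values are averaged across the \(n\) workers, and the variance frozen at the end of warm-up, \(v_{\text{warmup}}\), is used in the step):

\[
m_t^{(i)} = \beta_1 m_{t-1} + (1-\beta_1)\,g_t^{(i)}, \qquad
\hat{m}_t^{(i)} = C\big(m_t^{(i)} + e_{t-1}^{(i)}\big), \qquad
e_t^{(i)} = m_t^{(i)} + e_{t-1}^{(i)} - \hat{m}_t^{(i)},
\]

\[
m_t = \frac{1}{n}\sum_{i=1}^{n} \hat{m}_t^{(i)}, \qquad
x_{t+1} = x_t - \gamma\,\frac{m_t}{\sqrt{v_{\text{warmup}}} + \eta}.
\]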

Figure 14: Comparison of the flow of distributed training using classical Adam algorithm and 1-bit compressed Adam algorithm

System challenges of 1-bit Adam

Besides the algorithmic challenge, there are two system challenges in applying 1-bit Adam in a training system. First, we need an efficient kernel that converts the momentum into a 1-bit representation. Second, we need an efficient communication scheme to exchange the compressed momentum across GPUs. The goal of compression is to reduce the overall training time so that bandwidth-constrained commodity systems can be used to train large models. We address these challenges in DeepSpeed and fully optimize the 1-bit Adam implementation for training on communication-constrained systems.

Advantages of 1-bit Adam on communication-constrained systems

1-bit Adam offers the same convergence as Adam while reducing communication volume by up to 5 times. It delivers up to 3.5x higher throughput for BERT-Large pretraining and up to 2.7x higher throughput for SQuAD fine-tuning. The end-to-end throughput improvements come from the 6.6x (Figure 15, left) and 6.2x (Figure 15, right) speedups observed during the compression phase. It is worth noting that our 1-bit Adam optimizer scales very well on a 40 Gb Ethernet system, where its performance is comparable to Adam's on a 40 Gb InfiniBand QDR system. We note that the effective bandwidth of the 40 Gb Ethernet is 4.1 Gbps based on the iPerf benchmark, whereas InfiniBand provides near-peak bandwidth of 32 Gbps based on the InfiniBand perftest microbenchmark.

Figure 15: Scalability of 1-bit Adam for BERT-Large pretraining (left) and SQuAD fine-tuning (right) on NVIDIA V100 GPUs. The batch size is 16 per GPU for BERT pretraining and 3 per GPU for SQuAD fine-tuning.

In-depth study of the evaluation results of 1-bit Adam

Same convergence as Adam: One major question about 1-bit Adam is its convergence speed. We find that 1-bit Adam achieves the same convergence speed and comparable accuracy when using the same number of training samples, as shown in Figure 16.

Figure 16: Using the same number of training samples, 1-bit Adam converges just like Adam.

Table 1 shows the detailed results for BERT-Base and BERT-Large. We see that the performance of 1-bit Adam is comparable to the original model under both the uncompressed and compressed settings, and in some cases better.

Table 1: Verify the correctness of 1-bit Adam on various test tasks

Up to 5x less communication: 1-bit Adam provides the same convergence as Adam while reducing communication volume by 16x during the compression phase (for 16-bit (FP16) training). For BERT pretraining, we observed that the warm-up phase accounts for only 15% of the end-to-end training time, yielding an overall 5x reduction in communication.

The ratio of original Adam to 1-bit Adam traffic can be expressed as follows:

1 / (WarmUp + (1-WarmUp)/16)
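
Plugging in a 15% warm-up fraction reproduces the roughly 5x end-to-end reduction quoted above:

\[
\frac{1}{0.15 + 0.85/16} \approx \frac{1}{0.203} \approx 4.9.
\]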

1-bit Adam makes BERT-Large training 3.5x faster: We present results for training BERT-Large on two systems with limited bandwidth: 1) 40 Gbps Ethernet (Figure 17, left) and 2) 40 Gbps InfiniBand QDR (Figure 17, right). During the compression phase, we observe a 6.6x higher throughput over Ethernet and a 2x higher throughput over InfiniBand, which translate into end-to-end speedups (including both the warm-up and compression phases) of 3.5x and 2.7x, respectively. 1-bit Adam benefits mainly from the reduction in communication volume (thanks to the compressed momentum exchange) and from our custom allreduce operation, which is implemented with efficient non-blocking 1-bit gather followed by an allgather operation.

It is worth noting that BERT pretraining can also reduce communication by increasing the total batch size with the LAMB optimizer instead of Adam. However, 1-bit Adam avoids this demanding hyperparameter tuning; in our experience, tuning large-batch training is often harder. In addition, 1-bit Adam also suits workloads whose critical batch size is small and which cannot converge well with large batches, such as many fine-tuning tasks.

Figure 17: Performance of BERT-Large training with 1-bit Adam over 40 Gbps Ethernet (left) and InfiniBand (right) during the compression phase

1-bit Adam speeds up SQuAD fine-tuning by 2.7x: 1-bit Adam offers scalability not only for large-scale training tasks but also for tasks such as SQuAD fine-tuning. As Figure 18 shows, 1-bit Adam scales well on both Ethernet-based and InfiniBand-based systems and delivers up to 6.2x higher throughput (during the compression phase) on the Ethernet-based system, resulting in a 2.7x end-to-end speedup (25% of the run in the warm-up phase and 75% in the compression phase). For SQuAD fine-tuning, we observe that the best F1 score is achieved with a total batch size of 96; larger batch sizes lower the convergence rate and require additional hyperparameter tuning. Therefore, to scale to 32 GPUs, we run with a small batch size of 3-4 per GPU, which makes fine-tuning communication intensive and hard to scale. 1-bit Adam addresses this scalability problem well, reducing communication volume by 3.4x without enlarging the batch size and thereby achieving the 2.7x end-to-end speedup.

Figure 18: Performance of 1-bit Adam for SQuAD fine-tuning over 40 Gbps Ethernet (left) and InfiniBand (right) during the compression phase.


Visit the DeepSpeed website and GitHub repository for the code, tutorials, and documentation for these new technologies! We have also integrated some of these technologies into ONNX Runtime.

About our wonderful collaborators:

  • We thank our academic collaborator, Philippe Tillet from Harvard University, who worked with us to develop the sparse attention kernels using the Triton compiler.
  • ZeRO-Offload was developed with Jie Ren, an intern from UC Merced. We would also like to thank Dong Li from UC Merced, as well as Bharadwaj Pudipeddi and Maral Mesmakhouroshahi from Microsoft for their L2L work, for discussions on the topic.
  • 1-bit Adam was developed with Hanlin Tang, an intern from the University of Rochester.
  • We would also like to thank Nvidia for their close collaboration, especially the Megatron-LM team.

About the DeepSpeed team

We are a group of researchers and engineers passionate about large-scale system performance optimization: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Reza Yazdani Aminabadi, Elton Zheng, Arash Ashari, Jing Zhao, Minjia Zhang, Niranjan Uma Naresh, Shaden Smith, Ammar Ahmad Awan, Conglong Li, Yuxiong He (team lead). Recently we have been focusing on deep learning systems, optimizing their training speed, convergence speed, and development speed!
