Introduction: How was the technical framework behind EPL designed? How can developers use EPL? What are EPL’s future plans? Let’s take a closer look today.

Authors: Wang Lin, Sayin Ocean | Source: Alibaba Tech official account

Takeaway

Recently, the Alibaba Cloud machine learning platform PAI and the DAMO Academy Intelligent Computing Laboratory released a "low-carbon version" of the large model M6-10T. Its parameter count jumped from one trillion to ten trillion, far exceeding the trillion-parameter models previously released in the industry and making it the world's largest AI pre-training model. At the same time, it achieved industry-leading low-carbon efficiency: using 512 GPUs, a ten-trillion-parameter model was trained to a usable level within 10 days. Compared with the previously released large model GPT-3, M6 reaches the same parameter scale with only 1% of the energy consumption.

The M6 model was trained with EPL (Easy Parallel Library, formerly known as Whale), a distributed training framework developed by the Alibaba Cloud machine learning platform PAI. By uniformly abstracting and encapsulating different parallelization strategies, EPL supports multiple parallel strategies within a single framework and applies comprehensive optimizations to GPU memory, computation, and communication, providing an easy-to-use and efficient distributed training experience.


What is EPL

EPL (Easy Parallel Library) is a distributed deep learning training framework developed in-house that unifies multiple parallelization strategies and is flexible and easy to use.

1. Project Background

In recent years, with the rise of deep learning, model parameter counts have grown rapidly. OpenAI data show that:

  • Before 2012, the compute used to train models doubled every two years, in line with Moore's Law;
  • After 2012, it doubled every 3.4 months, far outpacing the development of hardware.

In the past year, models with tens of billions and then hundreds of billions of parameters have been released one after another, and Google, NVIDIA, Alibaba, and the Zhiyuan Research Institute (BAAI) have even released trillion-parameter models. As parameter counts grow, model quality gradually improves, but this also poses greater challenges for training frameworks. Some distributed training frameworks, such as Horovod, TensorFlow Estimator, and PyTorch DDP, support data parallelism; GPipe, PipeDream, and PipeMare support pipeline parallelism; and Mesh TensorFlow, FlexFlow, OneFlow, and MindSpore support operator splitting. However, several challenges remain when training an extremely large model:

  • Ease of use:

    • High barrier to entry: implementing a distributed version of a model is difficult and costly for users, and designing an efficient distributed parallel strategy requires domain-expert experience;

    • Difficulty finding the optimal strategy: as researchers design ever more flexible models and ever more parallel acceleration methods, it is hard for users to find the most suitable parallel strategy without automatic strategy exploration;

    • High migration cost: different models suit different hybrid parallel strategies, but switching parallel strategies may require switching frameworks, which makes migration expensive.

  • Cost effectiveness:

    • Training trillion-parameter models takes enormous resources: NVIDIA used 3072 A100 GPUs and Google used 2048 TPU v3 chips, so the resource cost is very high;

    • Reducing cost and improving efficiency requires combining a variety of techniques and methods to lower resource requirements and speed up training.

To address these challenges in distributed training, the Alibaba Cloud machine learning PAI team developed the distributed training framework EPL, which uniformly abstracts and encapsulates different parallelization strategies and supports multiple parallel strategies within one framework. EPL also provides a simple, easy-to-use interface: users only need to add a few lines of annotations to configure a parallel strategy, with no changes to the model code. Behind the scenes, EPL transparently applies comprehensive optimizations to GPU memory, computation, and communication to deliver efficient distributed training.

2. Main Features

  • Unified support for multiple parallel strategies: data parallelism, pipeline parallelism, operator splitting, and expert parallelism, as well as combinations and nesting of them, are all supported within one distributed training framework;

  • Flexible, easy-to-use interface: users only need to add a few lines of code to use EPL's rich distributed parallel strategies, without modifying the model code;

  • Automatic parallel-strategy exploration: automatic shard-strategy exploration for operator splitting, and automatic model-partition strategy exploration for pipeline parallelism;

  • Better distributed performance: multi-dimensional GPU memory optimization and computation optimization, combined with scheduling and communication optimization informed by model structure and network topology, deliver efficient distributed training.

3. Open Source Address: see the end of the article.

Main technical features of EPL

EPL enables every algorithm engineer to easily train large-scale distributed model tasks through rich parallelization strategies, an easy-to-use interface, multi-dimensional GPU memory optimization, and optimized computation and communication acceleration.

  • Rich parallelization strategies: EPL provides multiple parallelization strategies and combinations of them, including data parallelism, pipeline parallelism, operator splitting, and nested combinations of these strategies. The rich choice of strategies lets every model structure find its most suitable distributed training mode.
  • Ease of use: the model programming and training interfaces are based on TensorFlow; users only need to add simple annotations to an existing single-GPU model to apply different distributed strategies. EPL designed two simple strategy interfaces (replicate/split) to express distributed strategies and hybrid parallelism. This annotation style lets users implement and switch distributed strategies with only a few lines of code, without learning a new model programming interface, greatly lowering the barrier to using a distributed framework.
  • GPU memory optimization: EPL provides multi-dimensional memory optimization techniques, including gradient checkpointing, ZeRO-style data-parallel memory optimization, and CPU offload, helping users train larger models with fewer resources.
  • Communication optimization: EPL deeply optimizes the distributed communication library, with hardware topology awareness, communication thread pools, gradient grouping and fusion, mixed-precision communication, gradient compression, and other techniques.

1. Technical Architecture

The EPL framework is mainly divided into the following modules:

  • Interface layer: the user's model programming interface is based on TensorFlow, and EPL provides an easy-to-use interface for expressing parallelization strategies, allowing users to combine various hybrid parallel strategies;

  • Intermediate representation layer: translates the user model and parallel strategy into an internal representation, using abstractions such as TaskGraph, VirtualDevices, and strategy expressions to represent the various parallel strategies;

  • Parallelization engine layer: based on the intermediate representation, EPL explores strategies over the computation graph, optimizes GPU memory, computation, and communication, and automatically generates the distributed computation graph;

  • Runtime execution engine: converts the distributed execution graph into a TFGraph and then calls the TF Runtime to execute it.

2. Expressing Parallel Strategies

EPL partitions the model into multiple TaskGraphs via strategy annotations and parallelizes it on that basis. EPL has two types of strategies: replicate and split. These two parallelization interfaces can express a wide range of parallelization strategies, for example:

1. Data parallelism: each model replica computes on a single GPU. If the user requests 8 GPUs, this becomes a data-parallel task with a parallel degree of 8, as in the sketch below.
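A minimal sketch of the data-parallel annotation, based on the replicate interface described above; model_fn is a hypothetical placeholder for the user's unchanged single-GPU model code, and exact call signatures may differ slightly from the released API:

    import epl

    epl.init()
    # Each model replica occupies one GPU; with 8 GPUs this automatically
    # becomes 8-way data parallelism.
    epl.set_default_strategy(epl.replicate(device_count=1))

    model_fn()  # hypothetical placeholder: the existing single-GPU model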

2. Pipeline parallelism: in the sketch below, the model is split into two TaskGraphs, "stage_0" and "stage_1", and the user can set the number of pipeline micro-batches with the pipeline.num_micro_batch parameter. Here, "stage_0" and "stage_1" together form one model replica, which requires 2 GPUs. If the user requests 8 GPUs, EPL automatically nests a layer of data parallelism of degree 4 outside the pipeline (four pipeline replicas executing in parallel).
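A sketch of the two-stage pipeline annotation; stage0_fn and stage1_fn are hypothetical placeholders for the two halves of the model, and the micro-batch count of 4 is illustrative:

    import epl

    # pipeline.num_micro_batch sets how many micro-batches are interleaved
    # through the pipeline stages.
    config = epl.Config({"pipeline.num_micro_batch": 4})
    epl.init(config)

    with epl.replicate(device_count=1, name="stage_0"):
        stage0_fn()  # hypothetical: first half of the model
    with epl.replicate(device_count=1, name="stage_1"):
        stage1_fn()  # hypothetical: second half of the model
    # One replica spans 2 GPUs; on 8 GPUs, EPL automatically nests 4-way
    # data parallelism outside the pipeline.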

3. Operator splitting: in the sketch below, EPL splits the model defined under the split scope and places the shards on different GPUs for parallel computation.
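A sketch of the split annotation; large_layer_fn is a hypothetical placeholder for the part of the model to shard:

    import epl

    epl.init()
    # Operators defined under the split scope are sharded across 8 GPUs
    # and computed in parallel.
    with epl.split(device_count=8):
        large_layer_fn()  # hypothetical: a layer too large for one GPU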

4. EPL also supports combining and nesting the above parallel strategies to form a variety of hybrid parallel strategies. For more examples, see the documentation and examples in the open-source repository.

3. GPU Memory Optimization

As models grow larger, GPU memory often becomes the bottleneck when training large models. EPL provides multi-dimensional memory optimization techniques that greatly reduce training memory consumption.

  • Recomputation (gradient checkpointing): in a normal DNN forward pass, the activations that are produced must be kept in GPU memory until the backward pass, because they are needed for gradient computation. Their size depends on model structure and batch size and is usually substantial. Gradient checkpointing (GC) trades time for space: only selected activations (checkpoints) are kept during the forward pass, and the released activations are recomputed during backpropagation. The key question in GC is which checkpoints to select so as to save memory while preserving performance, without affecting convergence. EPL provides automatic checkpoint selection, so users can enable GC optimization with a single switch.

  • ZeRO: in data-parallel scenarios, each GPU stores a copy of the model, optimizer states, and so on. This information is identical on every GPU, which introduces a great deal of redundancy, and as models grow it easily exceeds the memory of a single GPU. Following an idea similar to DeepSpeed ZeRO, EPL can shard the optimizer states and gradients across GPUs in a distributed job, reducing each GPU's persistent memory consumption.

  • Automatic Mixed Precision (AMP): conventional AMP maintains an FP16 weight buffer, which is a non-trivial overhead for models with many parameters. EPL provides a memory-optimized AMP variant that casts weights to FP16 only at the point of use, saving GPU memory.

  • Offload: extends the storage available for training from GPU memory to host memory or even disk, allowing large models to be trained with limited resources.

EPL also supports combining these memory optimization techniques for maximal savings. The Alibaba Cloud machine learning PAI team enabled GC + ZeRO + memory-optimized AMP on the T5 model and reduced GPU memory consumption by a factor of 2.6 while keeping performance unchanged.
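Combining these options might look like the sketch below. The configuration keys follow EPL's "feature.option" naming style but are assumptions here; consult the open-source documentation for the exact key names and values:

    import epl

    # Assumed configuration keys (illustrative only; check the EPL docs).
    config = epl.Config({
        "gradient_checkpoint.type": "auto",  # automatic checkpoint selection (GC)
        "zero.level": "v1",                  # ZeRO-style optimizer-state sharding
        "amp.level": "O1",                   # memory-optimized mixed precision
        "offload.level": "v1",               # offload persistent state to host memory
    })
    epl.init(config)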

4. Application Scenarios

EPL is suitable for models in a variety of scenarios and already supports image, recommendation, speech, video, natural language, multimodal, and other business scenarios within Alibaba. It also supports models of different sizes and has completed training of the M6 model at a maximum scale of ten trillion parameters. The M6 and BERT models are discussed below as examples.

1. Pre-training the Trillion/Ten-Trillion-Parameter M6 Model

Training a model with trillions to tens of trillions of parameters requires enormous compute. To reduce the compute requirement, EPL implements the Mixture-of-Experts (MoE) structure. MoE's key characteristic is sparse activation: a gating network (router) selects the top-k experts for each input (k is usually 1 or 2), greatly reducing the required compute.
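To make the sparse-activation idea concrete, here is a schematic top-k gating layer in TensorFlow 1.x style. This is not EPL or M6 code, only an illustration: for readability, every expert is computed densely and the non-selected outputs are masked out, whereas a real MoE implementation dispatches each token only to its chosen experts:

    import tensorflow as tf

    def moe_layer(x, num_experts=4, k=2, hidden=1024):
        """Schematic Mixture-of-Experts layer with top-k gating."""
        dim = x.shape[-1].value
        # Router: score each token against every expert and normalize.
        logits = tf.layers.dense(x, num_experts, use_bias=False, name="router")
        gates = tf.nn.softmax(logits)                  # [batch, num_experts]
        # Keep only the top-k gate values per token; zero out the rest.
        topk_vals, _ = tf.nn.top_k(gates, k=k)
        kth_best = topk_vals[:, -1:]                   # k-th largest gate value
        sparse_gates = tf.where(gates >= kth_best, gates, tf.zeros_like(gates))
        # Each expert is a small feed-forward network; outputs are mixed
        # according to the (sparse) gate weights.
        outputs = []
        for i in range(num_experts):
            h = tf.layers.dense(x, hidden, activation=tf.nn.relu,
                                name="expert_%d_ffn" % i)
            outputs.append(tf.layers.dense(h, dim, name="expert_%d_out" % i))
        stacked = tf.stack(outputs, axis=1)            # [batch, num_experts, dim]
        return tf.reduce_sum(stacked * sparse_gates[:, :, None], axis=1)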

EPL supports Expert Parallelism (EP), which splits the experts across multiple devices, reducing the GPU memory and compute requirements of each device. Meanwhile, data parallelism helps increase training concurrency. We therefore adopt a hybrid strategy of data parallelism + expert parallelism to train the M6 model: expert parallelism in the MoE layers and data parallelism in the other layers.

EPL provides a simple, easy-to-use interface for hybrid-parallel model training: only a few lines of annotations are needed to configure the parallel strategy, with no modification of the model itself. For example, for M6's data parallelism + expert parallelism strategy, annotations like the sketch below are all that is needed:
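A sketch of these annotations, consistent with the replicate/split interfaces introduced earlier; other_layers_fn, moe_fn, and total_gpu_num are hypothetical placeholders:

    import epl

    epl.init()
    # Default strategy: replicate -> data parallelism for the non-MoE layers.
    epl.set_default_strategy(epl.replicate(device_count=1))

    other_layers_fn()  # hypothetical: attention/FFN layers, data parallel

    # MoE layers: experts are sharded across all GPUs -> expert parallelism.
    with epl.split(device_count=total_gpu_num):
        moe_fn()  # hypothetical: the MoE layer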

At the same time, to save training resources and improve training efficiency, we applied EPL's memory optimization and compute/communication acceleration techniques: automatic gradient checkpointing to reduce activation memory, CPU offload to optimize the memory usage of optimizer states and weights, the DP+EP hybrid parallel strategy to lower compute requirements, and mixed precision and compiler optimizations to improve training efficiency.

With the EPL framework, pre-training of the trillion-parameter M6 model was completed on 480 V100 GPUs in 3 days, a first in the industry. Compared with models of the same scale previously trained in the industry, this run used only 480 V100 32GB GPUs to successfully train the trillion-parameter M6, saving over 80% of compute resources and improving training efficiency by nearly 11 times. Going further, 512 GPUs sufficed to train the ten-trillion-parameter model to a usable level within 10 days.

2. Accelerating BERT Large Training with Pipeline Parallelism

Consider the structure of the BERT Large model.

Because BERT Large consumes a lot of GPU memory, the batch size per NVIDIA V100 16GB GPU is usually only about 2-8 (the exact value depends on embedding size, sequence length, and so on). Too small a batch size causes large fluctuations in convergence and poor final accuracy. Moreover, when training in pure data-parallel mode, the communication-to-computation ratio is high and the distributed speedup is not ideal.

Analyzing the BERT Large model, its encoder consists of 24 layers with a repeating structure, so training can be accelerated with pipeline parallelism. Here we place encoder layers 1~8, encoder layers 9~16, and encoder layers 17~24 of BERT Large on different GPUs. A sketch of the corresponding stage annotations follows:
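A sketch of the three-stage annotation using the pipeline interface shown earlier; encoder_layers is a hypothetical helper that builds the given range of BERT encoder layers, and the micro-batch count of 5 matches the example discussed below:

    import epl

    config = epl.Config({"pipeline.num_micro_batch": 5})
    epl.init(config)

    with epl.replicate(device_count=1, name="stage_0"):
        x = encoder_layers(x, start=1, end=8)    # encoder layers 1-8
    with epl.replicate(device_count=1, name="stage_1"):
        x = encoder_layers(x, start=9, end=16)   # encoder layers 9-16
    with epl.replicate(device_count=1, name="stage_2"):
        x = encoder_layers(x, start=17, end=24)  # encoder layers 17-24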

In this way, each GPU's memory overhead during training is reduced, so the batch size can be increased to speed up convergence. In addition, for models too large to fit in a single GPU's memory, layer-wise model splitting enables distributed training. Stages can be defined with the epl.replicate interface, and pipelined execution scheduling improves parallelization performance.

In the example above, the pipeline micro-batch number is 5. Looking at the execution timeline after pipeline-parallel optimization, multiple GPUs compute in parallel at the same time; when the 5 micro-batches finish, each GPU accumulates gradients locally and then applies the update. Compared with pure model parallelism, interleaved pipeline execution improves GPU utilization. EPL also adopts a backward-first scheduling strategy to improve pipeline-parallel performance and reduce GPU idle time and memory overhead.

For scaling further, EPL also supports nesting data parallelism outside the pipeline to increase training throughput, and it automatically infers the degree of the nested data parallelism. The latest test results show that at a 32-GPU scale, optimizing the BERT Large model with EPL's pipeline + data parallelism improves training speed by 66% over pure data parallelism.

Roadmap

We decided to build an open source ecosystem based on the following considerations:

  • EPL originated from the internal business needs of Alibaba Cloud. In the process of supporting large-scale and diverse internal business scenarios, we have accumulated a great deal of experience. While EPL itself keeps improving as business needs iterate, we also want to open it to the community, give our accumulated experience and understanding back, and have more and better exchanges and cooperation with developers of deep learning training frameworks and deep learning practitioners, contributing our technical strength to the industry.

  • Through open-source work, we hope to receive more user feedback from real business scenarios, helping us continuously improve and iterate and informing the direction of our subsequent work.

  • At the same time, we hope to attract like-minded individuals, companies, and organizations to participate in building the open-source project and jointly improve the deep learning ecosystem.

Going forward, we plan to publish a release every two months. EPL's near-term roadmap is as follows:

  • Continuous performance optimization and stability improvements;
  • General operator-splitting functionality;
  • A basic version of automatic splitting-strategy exploration;
  • Automatic pipeline-parallel strategy exploration.

In addition, over the medium and long term, we will continue to invest in exploratory directions such as software-hardware co-optimization and automatic strategy exploration:

  • Fully automatic model-parallel strategy exploration;

  • Efficient strategy-exploration algorithms and accurate CostModel evaluation;

  • Parallel strategy exploration in eager mode;

  • Support, adaptation, and co-optimization for more new hardware;

  • Efficient operator optimization and integration, extreme GPU memory optimization, and software-hardware co-designed communication optimization.

We welcome feedback, improvement suggestions, and technical discussions from all angles, and we welcome and look forward to the participation of anyone interested in building the open-source community.


This article is original content from Alibaba Cloud and may not be reproduced without permission.