| Author: Deng Zheye | Engineer, Megvii MegEngine Infrastructure Team

In recent years, many research institutions and researchers in the deep learning field have improved model performance by increasing the number of model parameters, achieving remarkable results that have repeatedly impressed the industry. This has, in effect, made "scaling up the model" almost the single metric everyone chases. Over the years, the number of parameters in state-of-the-art models has grown by hundreds of times, while the amount of memory on a single GPU has barely increased. As a result, training large models often depends on a huge number of GPU cards, and many researchers with excellent ideas and strong enthusiasm cannot continue their deep learning research simply for lack of funding; the important research results of recent years have been nearly monopolized by a few leading institutions. In the long run, this trend may not be healthy for deep learning as a science. As the developers of a deep learning training framework, we not only help users harness more GPU cards in a single training job (i.e., distributed training), but also use various techniques to improve the efficiency of GPU memory usage on each card and reduce researchers' hardware costs. Common ways to improve GPU memory efficiency include:

  • Letting operators whose lifetimes do not overlap share GPU memory;

  • Reducing the GPU memory footprint through extra data transfer;

  • Reducing the GPU memory footprint through extra computation.

Most existing methods require the computational graph to be static. As more and more frameworks support dynamic graph mode, making the best use of limited GPU memory during dynamic graph training has become an important measure of a deep learning framework. In the recent v1.4 release, MegEngine introduced DTR [1] together with further engineering optimizations, providing a way to trade extra computation for GPU memory, so that a GPU with a small amount of memory can train large models and enjoy the benefits of larger batch sizes. On a 2080Ti, the maximum batch size of networks such as ResNet-50 and ShuffleNet can be more than three times the original. This article focuses, from an engineering perspective, on how MegEngine uses DTR to optimize GPU memory for dynamic graphs.
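For reference, enabling DTR in user code is a one-liner. The snippet below is a minimal sketch based on what the `megengine.dtr` module looks like in recent releases; the exact attribute names and their availability are assumptions to verify against the official documentation for your version.

```python
import megengine as mge

# Assumed public API of the megengine.dtr module: set an eviction threshold,
# then enable DTR before building the model and running the training loop.
mge.dtr.eviction_threshold = "5GB"   # start evicting once usage exceeds ~5 GB
mge.dtr.enable()

# ... build the model, optimizer and training loop as usual; tensors will be
# released and recomputed transparently when memory pressure is high.
```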

I. Background

1.1 Computational graphs

In deep learning, a neural network model can essentially be represented as a computational graph. Its training process can be divided into three stages: forward propagation, backward propagation, and parameter update.

Taking y = w*x + b as an example, the forward computation first multiplies the input x by the parameter w to get the intermediate result p, and then adds p and the parameter b to get the final output y on the right.

Backward propagation needs the derivatives of y with respect to w and b. The derivative of y with respect to p is 1, and the derivative of p with respect to w is x; by the chain rule, the derivative of y with respect to w is x.

Note that intermediate results from forward propagation are used during backward propagation. Therefore, when the network is large, GPU memory capacity severely limits the batch size.
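To make this concrete, here is a minimal NumPy sketch (framework-agnostic, not MegEngine code) of the same computation; the backward pass reuses values saved during the forward pass, which is exactly what occupies GPU memory in a real network:

```python
import numpy as np

# Forward pass of y = w * x + b, keeping the intermediate result p.
x = np.array([2.0])   # input
w = np.array([3.0])   # parameter
b = np.array([1.0])   # parameter

p = w * x             # intermediate result
y = p + b             # output

# Backward pass via the chain rule: dy/dp = 1, dp/dw = x, dy/db = 1.
dy_dp = np.ones_like(p)
dy_dw = dy_dp * x     # the forward value x must still be available here
dy_db = dy_dp * 1.0

print(y, dy_dw, dy_db)   # [7.] [2.] [1.]
```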

1.2 Static graph memory optimization

For training large networks, GPU memory optimization on static graphs falls into three directions:

  1. Static memory allocation. With the whole computational graph available, the lifetime of every tensor and every operator can be analyzed, and operators whose lifetimes do not overlap can share GPU memory.
  2. Gradient checkpointing (trading computation for memory). Select some tensors as gradient checkpoints and release the remaining intermediate results early; during backward propagation, if a needed forward result is no longer in GPU memory, start from the nearest gradient checkpoint and recompute the released tensors (see the sketch after this list).
  3. Memory swapping (trading bandwidth for memory). Swap temporarily unused data from the GPU to the CPU, and swap it back when it is needed again.
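As an illustration of gradient checkpointing (a framework-agnostic sketch, not MegEngine's implementation), the forward pass keeps only every k-th activation, and any other activation is recomputed from the nearest earlier checkpoint when the backward pass needs it:

```python
import numpy as np

def layer(x, i):
    # Stand-in for an arbitrary differentiable layer.
    return np.tanh(x + i)

def forward_with_checkpoints(x, num_layers, ckpt_every):
    """Run the forward pass, keeping only every `ckpt_every`-th activation."""
    checkpoints = {0: x}
    h = x
    for i in range(num_layers):
        h = layer(h, i)
        if (i + 1) % ckpt_every == 0:
            checkpoints[i + 1] = h   # keep this activation as a checkpoint
        # all other activations are dropped and recomputed on demand
    return h, checkpoints

def activation_at(step, checkpoints):
    """Recompute the activation after `step` layers from the nearest checkpoint."""
    start = max(s for s in checkpoints if s <= step)
    h = checkpoints[start]
    for i in range(start, step):
        h = layer(h, i)              # extra computation replaces stored memory
    return h

out, ckpts = forward_with_checkpoints(np.ones(4), num_layers=8, ckpt_every=4)
print(activation_at(6, ckpts))       # rebuilt from the checkpoint stored at step 4
```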

II. Dynamic graph memory optimization and the DTR strategy

2.1 Dynamic graph memory optimization

Compared with static graphs, the most obvious change for dynamic graph memory optimization is that the full computational graph is not available in advance. Since the lifetime of each tensor is unknown, static memory allocation is no longer possible; gradient checkpointing is still applicable, and an optimal set of checkpoints can still be searched for; memory swapping also remains available. Therefore, dynamic graph memory optimization has two directions:

  1. Trading computation for memory, i.e., Sublinear-style memory optimization applied to dynamic graphs;
  2. Trading bandwidth for memory, i.e., swapping tensors between GPU and CPU.

Taking tensors from ResNet-50 (the output tensors of convolution, BatchNorm, and ReLU layers) and comparing the cost of recomputing them with the cost of swapping them, we find that the swap time is generally about two orders of magnitude larger than the recompute time. The time needed to move data between CPU and GPU depends on PCIe bandwidth, and when eight 2080Ti cards train at the same time, the swap bandwidth available to each card is only about 3 GB/s. It is therefore clear that the main direction for dynamic graph optimization is still trading computation for memory.
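As a rough back-of-the-envelope check (illustrative numbers, not measurements from this article), swapping even a moderately sized tensor at 3 GB/s already costs tens of milliseconds one way, which is consistent with swapping being far slower than recomputing typical operators:

```python
# One-way transfer time for a hypothetical tensor at ~3 GB/s per-card bandwidth.
tensor_mb = 100                       # assumed tensor size in MB
bandwidth_gb_s = 3                    # PCIe bandwidth available per card
swap_ms = tensor_mb / (bandwidth_gb_s * 1024) * 1000
print(f"~{swap_ms:.1f} ms one way")   # ~32.6 ms
```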

To achieve this, MegEngine's implementation proceeds in three steps:

  1. Infrastructure: record the computation path that produces each tensor, so that the framework can release and restore tensors;
  2. User-provided policy: the user explicitly calls an interface to release tensors; the framework itself does not need to provide any policy;
  3. Framework-found policy: the framework automatically finds and executes a policy, without user intervention, so that memory optimization is completely transparent to the user.

Before understanding how the framework releases and restores tensors, we need to look at where tensors come from. During training, every tensor has only two possible origins:

  1. It is loaded from external data, e.g., the input data;
  2. It is the output of some operator, e.g., the output of a convolution layer.

For tensors produced by operators, we can record a compute path for each tensor, with the following fields (a code sketch follows the list):

  • Every tensor has a producer; if the producer is empty, the tensor was loaded from external data, otherwise it is a compute path, where:
    • op is the operator that produced this tensor;
    • inputs are the input tensors of that operator;
    • outputs are the output tensors produced by that operator;
    • compute_time is the actual running time of that operator;
  • users stores all compute paths that depend on this tensor as an input;
  • ref_cnt is the number of tensors that depend on this tensor as an input.
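Put together, this bookkeeping can be sketched roughly as follows (simplified Python with assumed names, not MegEngine's actual C++ data structures):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ComputePath:
    op: object                        # the operator that produced the outputs
    inputs: List["TensorInfo"]        # input tensors of the operator
    outputs: List["TensorInfo"]       # output tensors produced by the operator
    compute_time: float               # measured running time of the operator

@dataclass
class TensorInfo:
    producer: Optional[ComputePath] = None   # None => loaded from external data
    users: List[ComputePath] = field(default_factory=list)  # paths that read this tensor
    ref_cnt: int = 0                  # number of tensors depending on this one
    data: Optional[object] = None     # device memory; None once the tensor is evicted
```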

Here is an example of how the computation history can be used to release and restore tensors:

First define two tensors a and b in MegEngine and compute c = a + b. Each gray rectangle in the figure represents GPU memory, and suppose the memory can only hold three tensors. At this point there is just enough space for c, and its compute path is recorded (the yellow box in the figure). Next compute d = a * b. Since there is no room for d, c must first be released: its compute path is kept on the host, but the memory it occupies is freed, leaving space for d (the first green box in the figure). If the user then calls print(c), the framework finds that c is no longer in GPU memory and must restore it immediately. Before restoring, if memory is full, d must be released first; c is then recomputed from its compute path and returned to the user (the gray box in the figure). If the user goes on to call print(d), c is released and d is restored in the same way (the last green box in the figure).

As this example shows, the user is never aware of whether the tensors they are using are in GPU memory. When the user accesses a tensor that has been temporarily released, the framework restores it on the spot; from the user's point of view, the tensor has always been in GPU memory.

2.2 DTR strategy

To let the framework find the policy automatically, MegEngine v1.4 introduced DTR, the technique from the paper Dynamic Tensor Rematerialization [1], which is a fully dynamic heuristic strategy. Its core idea: whenever GPU memory usage exceeds a threshold, dynamically select some tensors and release them until usage falls below the threshold. Candidate tensors are evaluated by three criteria:

  1. The lower the cost of recomputing the tensor, the better;
  2. The more GPU memory it occupies, the better;
  3. The longer it has stayed in GPU memory, the better.

The DTR paper also points out that, apart from the recomputation itself, the main extra overhead is finding the best tensor to release. Since the time each tensor has stayed in memory is constantly changing, the best candidate can only be computed on the fly, at the moment an eviction is needed.

To reduce this overhead, the paper proposes two runtime optimizations (a sketch follows the list):

  1. Ignore small tensors: a tensor whose size is less than 1% of the average size in the candidate set is not added to the candidate set;
  2. Each time a tensor needs to be released, only sqrt(N) tensors are randomly sampled from the candidate set and traversed.
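A simplified sketch of one eviction step with these two optimizations (attribute names follow the TensorInfo sketch above and are assumptions, not MegEngine internals):

```python
import math
import random

def heuristic(t, now):
    """Smaller is better to evict: cheap to recompute, large, and long unused."""
    staleness = max(now - t.last_access_time, 1e-9)
    return t.producer.compute_time / (t.nbytes * staleness)

def find_best_tensor(candidates, now):
    # Optimization 1: ignore tensors far smaller than the average candidate size.
    avg = sum(t.nbytes for t in candidates) / len(candidates)
    pool = [t for t in candidates if t.nbytes >= 0.01 * avg]
    # Optimization 2: only examine sqrt(N) randomly sampled candidates.
    k = max(1, int(math.sqrt(len(pool))))
    sampled = random.sample(pool, k)
    return min(sampled, key=lambda t: heuristic(t, now))
```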

III. Engineering implementation in MegEngine

3.1 Core of the dynamic graph: the Tensor Interpreter

Before getting to the DTR implementation, we need to introduce the Tensor Interpreter, the core of MegEngine's dynamic graph. It translates Python code into the following four basic operations, which are then interpreted and executed in sequence (a conceptual sketch follows the list):

  • Put: load external data from the host into GPU memory and obtain a tensor;
  • ApplyOp: execute an operator; takes the op and its input tensors, and returns the output tensors;
  • Del: delete a tensor and free the GPU memory it occupies;
  • GetValue: get the value of a tensor, loading its data from GPU memory back to the host.
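Conceptually (a simplified sketch, not the real interpreter), these four operations can be viewed as commands that Python-level tensor operations are translated into and that the interpreter executes in order:

```python
from enum import Enum, auto

class CmdKind(Enum):
    PUT = auto()        # host data -> device tensor
    APPLY_OP = auto()   # run an operator on input tensors
    DEL = auto()        # drop a tensor and free its device memory
    GET_VALUE = auto()  # device tensor -> host data

class TensorInterpreter:
    def __init__(self):
        self.commands = []

    def send(self, kind: CmdKind, **kwargs):
        # Each Python-level tensor operation becomes one or more commands,
        # which are interpreted and executed in sequence.
        self.commands.append((kind, kwargs))
```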

3.2 The underlying implementation of releasing and restoring tensors

As mentioned earlier, the user does not know whether the tensor being accessed is currently in GPU memory; the framework guarantees that whenever the user needs its contents, it can be restored immediately even if it is not resident.

When the framework wants to free a tensor's GPU memory, simply resetting its data pointer releases the underlying memory. To be able to recover the tensor later, some information must be kept in its tensorInfo: for drop, the computation history must be recorded; for swap, the data must first be copied to the CPU and kept as a host tensor. Later, when the user accesses the tensor, the framework checks its tensorInfo; if it is no longer in GPU memory, the framework reconstructs its contents from the computation history or the host tensor and returns it to the user.

3.3 Operator execution after introducing DTR

The figure above shows the pseudocode at the core of DTR. Previously, the ApplyOp method only executed the part highlighted in yellow, i.e., applying the operator op to the inputs.

Now that DTR has been introduced, those inputs may no longer be in GPU memory. They must therefore be pinned first, so that the input tensors are not released while this operator is running. AutoEvict() is then called to keep memory usage under control: it checks the current usage, and as long as it exceeds the threshold it keeps calling FindBestTensor(), which uses the heuristic evaluation function to find the best tensor to release.

After AutoEvict() returns, memory usage is below the threshold. The next step is to check that every input tensor is in GPU memory, and to call Regenerate() on any that is not. Regenerate(x) recomputes x: it reads x's computation history, the op and inputs, and calls ApplyOp recursively to restore x.
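Since the pseudocode figure is not reproduced here, the following Python-style sketch (assumed names, simplified from the description above) captures the control flow of ApplyOp, AutoEvict, and Regenerate:

```python
# Illustrative stubs for framework state; not real implementations.
def current_memory_usage():
    return 0                            # pretend usage is always below the threshold

def find_best_tensor():
    raise NotImplementedError("heuristic selection, see Section 2.2")

def apply_op(op, inputs, memory_budget):
    for x in inputs:
        x.pinned = True                 # inputs must not be evicted while op runs
    auto_evict(memory_budget)
    for x in inputs:
        if x.data is None:              # evicted earlier: restore it first
            regenerate(x, memory_budget)
    outputs = op.execute([x.data for x in inputs])   # the original "yellow" code path
    for x in inputs:
        x.pinned = False
    return outputs

def auto_evict(memory_budget):
    # Keep releasing the heuristically best tensor until usage drops below the threshold.
    while current_memory_usage() > memory_budget:
        victim = find_best_tensor()
        victim.data = None              # free device memory, keep the compute path

def regenerate(x, memory_budget):
    path = x.producer                   # recorded computation history: op and inputs
    recomputed = apply_op(path.op, path.inputs, memory_budget)
    x.data = recomputed[path.outputs.index(x)]
```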

3.4 Tensor deletion

When a tensor is no longer used by either the user or the framework, it can be deleted to free GPU memory. MegEngine controls tensor deletion with reference counting: when the reference count drops to zero, a Del command is automatically sent to the interpreter. The problem is that deleting the tensor immediately does save memory right away, but it severely limits the overall eviction strategy.

For example, suppose a 9 MB tensor goes through a convolution operator to produce a 25 MB tensor, which then goes through an Elemwise operator to produce another 25 MB tensor, and then through a BatchNorm operator and another Elemwise operator, each producing a further 25 MB tensor.

Note that because the Elemwise operators here are additions, their inputs (the two red tensors in the figure) are not needed when computing gradients. So the backward pass does not have to hold onto the two red tensors, and in fact they are released as soon as the forward computation finishes. The advantage is that memory is saved immediately; but with DTR, once the red tensors are deleted, the green tensors can never be released, because the red tensors are gone and a released green tensor could never be recomputed. The solution is to replace delete with release during the forward pass, a so-called "fake delete": keep the tensorInfo, and only free the GPU memory it manages. Then only the 9 MB tensor needs to be kept, and the four 25 MB tensors can all be released and restored at any time later.

The figure shows the pseudocode for tensor deletion in MegEngine. When the interpreter receives a Del command, it calls Free() on the tensorInfo and performs either a real or a fake delete, depending on whether the current phase is the forward computation. The fake delete is simple: mark the tensorInfo as deleted and release the GPU memory it manages. The real delete first calls RemoveDep() to examine all tensors that depend on this tensor as an input; if any of them is not in GPU memory, Regenerate() must be called on it now, because once the current tensor is really deleted, those tensors can no longer be restored.

After that, the tensorInfo of this tensor can actually be released. Then the tensors in x's computation history are checked recursively: if a tensor's ref_cnt reaches 0 and it has been marked as deleted, a real delete can be performed on it as well.
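Putting the deletion logic together, a rough sketch (assumed names, reusing the TensorInfo fields sketched earlier; regenerate() stands for the recompute routine from the previous section) looks like this:

```python
def free(x, is_forward):
    if is_forward:
        pseudo_delete(x)     # fake delete: memory is freed but metadata survives
    else:
        real_delete(x)

def pseudo_delete(x):
    x.deleted = True         # mark as deleted; keep tensorInfo and its compute path
    x.data = None            # release the GPU memory it manages

def real_delete(x):
    # Dependents can no longer be recomputed once x is truly gone,
    # so any evicted dependent must be regenerated right now (RemoveDep).
    for path in x.users:
        for out in path.outputs:
            if out.data is None:
                regenerate(out)
    # The tensorInfo itself can now be dropped; then walk x's computation history
    # and truly delete any input whose ref_cnt reaches zero and is marked deleted.
    if x.producer is not None:
        for inp in x.producer.inputs:
            inp.ref_cnt -= 1
            if inp.ref_cnt == 0 and inp.deleted:
                real_delete(inp)
```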

3.5 Comparison of training time

The figure below compares training ResNet-1202 with MegEngine's DTR implementation against the PyTorch implementation from the original paper. Note that different graphics cards were used in the two experiments, so MegEngine being somewhat faster is not the point; what stands out is memory management: on an 11 GB card, training with batch size 100 still runs. Beyond the maximum batch size of 140 reported in the paper, we also tried larger batch sizes, which run as well.

The following compares training time with different memory optimizations enabled in MegEngine. The baseline is dynamic graph mode without any memory optimization. The first two are common models, ResNet-50 and ShuffleNet: the maximum batch size with DTR enabled exceeds both static graph Sublinear optimization and the baseline, and at the same batch size the training time is on par with Sublinear.

The two models above are static in structure, so they can be compared against Sublinear memory optimization on static graphs. The SPOS network below is special: it is a large network with multiple trainable paths from input to output. During training, each iteration randomly samples one path to update, so the statements executed may differ from iteration to iteration. For this kind of network, a dynamic graph implementation is more natural, so only dynamic graph DTR results are compared against the baseline. Whether on a single card or eight cards, the maximum batch size of the baseline dynamic graph is 100, while with DTR enabled it reaches 250 or even larger.

3.6 Fragmentation problems and optimization methods

While implementing DTR we found that as the batch size grows, each tensor occupies more GPU memory, fewer tensors fit in memory at once, and the number of recomputations increases. Tensors are created and released more and more frequently, which leads to serious fragmentation. For example, there may be 1 GB of free GPU memory in total, but scattered across many small free blocks; if a single 1 GB allocation request cannot be satisfied, a defragmentation pass is triggered, which has a significant impact on performance.

For this problem, we propose three possible optimizations:

  • In-place parameter updates

Previously MegEngine had very few in-place operations, so if the model itself is large, every parameter update moves huge tensors to new addresses, which can produce more fragments. The fix is to turn on the INPLACE_UPDATE environment variable and update parameters in place, which reduces some of the fragmentation.

  • An improved evaluation function

We made a small improvement to DTR's heuristic evaluation function by introducing fragmentation-related information: for the tensor to be released, we want not only the memory it occupies to be as large as possible, but also the total size of the free blocks adjacent to both ends of its memory to be as large as possible.


f(t) = \frac{\text{compute time}^{\alpha}}{\text{time in memory}^{\beta} \cdot (\text{memory occupied} + \text{adjacent free block size})^{\gamma}} \cdot \delta^{\text{recompute count}}

In addition, we introduce a penalty term on the number of recomputations, hoping that the recomputation counts of different operators stay as even as possible. Hyperparameters are attached to the four attributes in the function, so that by tuning them the heuristic strategy can emphasize different attributes.
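Written out, the improved evaluation function might look like the following (attribute names and default hyperparameter values are assumptions for illustration; the tensor with the smallest score is released):

```python
def evict_score(t, now, alpha=1.0, beta=1.0, gamma=1.0, delta=1.1):
    """Improved DTR heuristic with fragmentation and recomputation-count terms."""
    compute_time = t.producer.compute_time           # cost of recomputing t
    staleness = max(now - t.last_access_time, 1e-9)  # time t has stayed unused
    reclaimable = t.nbytes + t.adjacent_free_bytes   # own memory plus adjacent free blocks
    # delta > 1 penalizes tensors that have already been recomputed many times.
    return (compute_time ** alpha) / (staleness ** beta * reclaimable ** gamma) \
        * (delta ** t.recompute_count)
```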

  • Static planning strategy

A more effective approach is to treat a network whose execution sequence is identical in every iteration as a static graph. Once the network is static, static memory allocation can be applied, which completely avoids defragmentation, significantly lowers the peak memory usage, and allows the batch size to be pushed further.

For example, on ResNet-50 with batch size 400, the peak of dynamically allocated memory is 9595 MB while the peak of statically allocated memory is 8549 MB, about 10% lower. Once memory is statically allocated, fragmentation no longer occurs.

IV. Future work

To fully exploit recomputation, that is, to train larger models with batch sizes as large as possible, one promising approach is to make the model static. Although dynamic graphs are very easy to write, they suffer badly from fragmentation when the batch size is large; on static graphs, all the benefits of static graph optimization become available, such as static memory allocation and graph optimization techniques.

More broadly, we hope to abstract a set of GPU memory optimization strategies applicable to both static and dynamic graphs, as shown in the following figure:

Both Sublinear optimization on static graphs and DTR optimization on dynamic graphs can be viewed as a sequence-to-sequence transformation of the execution sequence, inserting drop and recompute statements into it. The difference is that a static graph computes the optimal release-and-recompute sequence after the whole execution sequence is known, whereas a dynamic graph inserts drop and recompute statements on the fly while interpreting the sequence. There are two ways to execute the sequence: interpretation by the dynamic graph Imperative Runtime, and compilation and execution by the static graph Computing Graph. A Profile records runtime information such as how long each operator runs and how long each tensor stays in GPU memory, and the user can then adjust the execution sequence according to the profiling results. This lets users customize policies for different models without understanding the underlying execution logic or modifying the framework's source code.

References:

[1] Kirisame M, Lyubomirsky S, Haan A, et al. Dynamic Tensor Rematerialization [J]. arXiv preprint arXiv:2006.09616, 2020.