Abstract: A deep learning compiler acts as a common component and bridge between frameworks and hardware. Its ultimate goal is that code is developed once and optimal code is generated automatically for any device.
This article was shared from Huawei Cloud Community “Introduction to Deep Learning Compiler” by Luchangli.
Over the past decade or so, deep learning has developed rapidly, and many deep learning frameworks have emerged in the industry. At the same time, because deep learning has a wide range of application scenarios and a huge demand for computing power, its algorithms have to run on many kinds of general-purpose and dedicated hardware, such as CPUs, GPUs, TPUs, and NPUs. This leads to a combinatorial explosion between frameworks and hardware, as shown in Figure 1. For example, to support GPU computing, TensorFlow has to provide a GPU version of every one of its operators. To support D chips, it has to provide a D chip version of every operator as well. This process is undoubtedly time-consuming and labor-intensive.
At the same time, there are now many networks, such as YOLO, BERT, and GPT. These networks are built from operators of different types, shapes, and connection patterns, and they ultimately run on different kinds and models of hardware. This makes it expensive to hand-write an optimal operator implementation for every scenario. Two examples are shown in Figure 2. The first is operator fusion, a common performance optimization. Without fusion, each operator reads its input from memory into cache before computing and writes its result from cache back to memory afterwards. With fusion, the memory reads and writes between operators are avoided, which improves performance. The traditional approach is to hand-write a fused operator for each connection pattern, but it is almost impossible to enumerate every connection pattern between operators of different types across different networks. The second example is operator tuning. An operator implementation has many parameters that affect performance, but with traditional hand-written operators it is hard to express and maintain these parameters, and hard to tune them for the best performance on each shape and hardware target.
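The effect of fusion on memory traffic can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two operators (a scale followed by ReLU), the function names, and the traffic counter that charges one unit per element read or written are all hypothetical, and a real compiler fuses operators at the IR level rather than in Python.

```python
def relu(x):
    """ReLU on a single element."""
    return x if x > 0.0 else 0.0


def unfused_scale_relu(data, scale):
    """Two separate operators; the intermediate result goes through memory."""
    traffic = 0
    tmp = []
    for x in data:                 # operator 1: scale
        tmp.append(x * scale)
        traffic += 2               # read one input element, write one intermediate
    out = []
    for t in tmp:                  # operator 2: ReLU
        out.append(relu(t))
        traffic += 2               # read one intermediate, write one output element
    return out, traffic


def fused_scale_relu(data, scale):
    """One fused operator; the intermediate value never leaves the loop body."""
    traffic = 0
    out = []
    for x in data:
        out.append(relu(x * scale))
        traffic += 2               # read one input element, write one output element
    return out, traffic
```

Under this toy cost model, both versions return the same values, but the fused one moves half as many elements because the intermediate tensor is never written out and read back.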
Deep learning compilers are designed to solve these problems. They serve as a common component and bridge between frameworks and hardware, with the ultimate goal that code is developed once and optimal code is generated automatically for any device. For example, an operator developed for CPUs can be reused almost unchanged on GPUs and D chips, which significantly reduces cost.
Here is a brief overview of the components and capabilities of a deep learning compiler, as shown in Figure 3. First, the front end imports the computation graph from different frameworks and represents it in a high-level IR data structure. A series of graph optimizations is then applied at this level, such as constant folding, operator fusion, and equivalent substitution. In an equivalent substitution, part of the original graph is replaced by a different computation that produces the same result but may perform better. Next, for each operator in the computation graph, a domain-specific language (DSL) is used to describe the operator's computation and to optimize it, for example with tiling, multi-core parallelism, and double buffering. An operator's computation is usually implemented as a nest of loops; matrix multiplication, for example, is a triple loop. The deep learning compiler can readily transform these loops and tune the parameters of the transformations to obtain the best operator implementation for each shape and hardware target. Finally, device-specific code is generated from the low-level IR.
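To make the loop transformation concrete, here is a minimal sketch of matrix multiplication as a plain triple loop and as a tiled loop nest. The function names and the default tile size of 2 are assumptions chosen for illustration; a compiler would apply this transformation to its IR and pick the tile size per shape and device.

```python
def matmul_naive(a, b):
    """Matrix multiplication as the plain triple loop."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with each loop split into a tile loop and an inner loop."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # loops over tiles
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # loops inside one tile; min() handles ragged edges
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c
```

Both functions compute the same result; only the order in which elements are visited changes. On real hardware, the tiled order keeps a small block of each matrix in cache while it is reused, which is why the tile size becomes a parameter worth tuning.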
Finally, let's look at some existing compiler projects in the industry. Currently, the open source, framework-independent project with the most complete ecosystem is TVM, which has been adopted by many companies. The TVM workflow is shown in Figure 3A. TVM can import models from various frameworks, such as TensorFlow PB, ONNX, and TorchScript, and represents them uniformly in its high-level IR, called Relay. For each operator in the IR, the Tensor Expression DSL is used to describe the computation and its schedule. This DSL uses Einstein notation to describe the operator's compute, which typically corresponds to a nest of for loops. Then, following the idea from Halide, a schedule applies transformations to this loop nest, such as loop fusion, splitting, and reordering. Finally, the result is lowered to low-level IR, device-specific code is generated, and inference is run.
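The separation of compute and schedule can be illustrated with a toy in plain Python. This is not TVM's Tensor Expression API; every name below (`matmul_compute`, `default_schedule`, `reordered_schedule`, `split_schedule`, `execute`) is hypothetical. The compute says what each output element is, and the schedule only decides the order in which output elements are visited.

```python
def matmul_compute(a, b):
    """Compute: defines what C[i][j] is and says nothing about loop order."""
    k = len(b)
    return lambda i, j: sum(a[i][p] * b[p][j] for p in range(k))


def default_schedule(n, m):
    """Plain loop nest: i outer, j inner."""
    for i in range(n):
        for j in range(m):
            yield i, j


def reordered_schedule(n, m):
    """Loop reordering: j becomes the outer loop."""
    for j in range(m):
        for i in range(n):
            yield i, j


def split_schedule(factor):
    """Loop splitting: j is split into (j_outer, j_inner) with the given factor."""
    def schedule(n, m):
        for j_outer in range(0, m, factor):
            for i in range(n):
                for j in range(j_outer, min(j_outer + factor, m)):
                    yield i, j
    return schedule


def execute(compute, n, m, schedule):
    """Evaluate the compute rule in the iteration order chosen by the schedule."""
    out = [[0] * m for _ in range(n)]
    order = []
    for i, j in schedule(n, m):
        out[i][j] = compute(i, j)
        order.append((i, j))
    return out, order
```

Swapping the schedule changes the visiting order but never the result, which is the property that lets a compiler explore many schedules for one compute definition.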
Here is a brief introduction to how TVM generates optimal operator code. An operator is described by its compute, and the corresponding nest of for loops is then transformed by a schedule. Operator generation and tuning in TVM has gone through three generations. The first generation is TVM/AutoTVM, which requires users to write both the compute and the schedule of an operator. The difference between AutoTVM and plain TVM is that in AutoTVM the schedule can define tunable parameters, which are then tuned with a search algorithm such as a genetic algorithm. For example, when a loop is split into two, the split point can be tuned. The second generation is AutoScheduler (Ansor), which only requires users to write the operator's compute; Ansor generates the schedule automatically according to a set of rules. Writing a schedule requires familiarity with both TVM's expression mechanism and the underlying hardware, so schedule development is often very difficult. Ansor therefore significantly reduces developers' workload and the difficulty of development. Its disadvantage is long tuning time, typically about one hour per operator. Taking convolutional networks as an example, Ansor can exceed the performance of TensorFlow operators in some scenarios but still falls short of TensorRT. The third generation, Meta Schedule (AutoTensorIR), is in its infancy and is expected to improve tuning speed and performance. It is not yet available.
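The idea of tuning a schedule parameter can be sketched as a small measure-and-pick loop. This is a simplified stand-in, not AutoTVM: it uses exhaustive search over a handful of candidates instead of a genetic algorithm, and the names `matmul_tiled` and `tune_tile` are hypothetical. Timings in pure Python do not reflect real cache behavior, so the sketch shows only the shape of the tuning loop.

```python
import time


def matmul_tiled(a, b, tile):
    """Tiled matrix multiplication; `tile` is the tunable split factor."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c


def tune_tile(a, b, candidates, repeats=3):
    """Measure every candidate tile size and return the fastest one."""
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):          # keep the best of several runs
            start = time.perf_counter()
            matmul_tiled(a, b, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings
```

A real tuner faces a search space far too large to enumerate, which is why it relies on a search strategy and, as noted above, can take a long time per operator.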
Production uses of TVM include TBE, the operator development tool for Huawei D chips, which adds D chip code generation support on top of TVM. TVM follows Halide's compute + schedule route; other compilers follow the polyhedral route, such as Tensor Comprehensions, Tiramisu, and AKG, developed by Huawei. Like Ansor, this approach only requires users to write the operator's compute, not a schedule, so it is also user-friendly. AKG is used in MindSpore's graph-kernel fusion. Other deep learning compilers include TensorFlow's XLA and TensorRT, which you may have used.
In summary, deep learning compilers have many advantages: they make it easy to support new hardware, avoid repeated development, replace manual optimization with a series of automatic optimizations, and achieve the best cost-performance ratio. Deep learning compilers also still have shortcomings and are developing rapidly. For example, tuning takes a long time, and complex operators cannot yet be generated effectively. Within a model, the proportion of operators generated by the deep learning compiler, as opposed to operators called from libraries, is low, so continued investment and optimization are still required.