Abstract: A deep learning compiler serves as a common component and bridge between frameworks and hardware. The ultimate goal is that an operator only needs to be developed once, and optimal code can be generated automatically for any device.
This article is shared from the HUAWEI Cloud community post "Deep Learning Compiler Introduction"; original author: luchangli.
Over the past decade, deep learning has developed rapidly, and many deep learning frameworks have emerged in the industry. At the same time, because deep learning has a wide range of application scenarios and an enormous demand for computing power, deep learning algorithms must run on a variety of general-purpose and special-purpose hardware, such as different types of CPUs, GPUs, TPUs, and NPUs. This leads to a combinatorial explosion between frameworks and hardware, as shown in Figure 1. For example, for TensorFlow to support GPUs, a GPU version of every TensorFlow operator must be developed; to support the D chip, a D chip version of each operator is needed as well. This process is extremely time-consuming and labor-intensive.
At the same time, there are now many algorithm networks, such as YOLO, BERT, and GPT. These networks are composed of operators of different types, different shapes, and different connection patterns, and they ultimately run on hardware of different kinds and models. As a result, manually developing and implementing the optimal operator for every scenario is very costly. Here are two examples. As shown in Figure 2, operator fusion is a common performance optimization. Before fusion, each operator must read its input from memory into the cache before computing, and write its result back from the cache to memory afterwards. After fusion, the memory reads and writes between operators are eliminated, improving performance. The traditional approach is to manually develop fused operators based on the connection patterns between operators, but it is practically impossible to enumerate all the ways different operator types connect across different networks. Another example is operator tuning. Many parameters in an operator's implementation affect performance, but with traditional manual operator development it is difficult to express and maintain these parameters, let alone tune them to achieve optimal performance across different shapes and hardware.
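To make the fusion idea concrete, here is a minimal plain-Python/NumPy sketch (illustrative only, not real compiler output; the function names are made up): the unfused version materializes an intermediate tensor in memory between the two operators, while the fused version computes both operations per element in a single pass, so the intermediate never leaves registers/cache.

```python
import numpy as np

def unfused(x):
    # Two separate operators: the intermediate tensor t is written to
    # memory by the first op and read back by the second.
    t = np.exp(x)        # op 1: exp, result stored as an intermediate
    return t + 1.0       # op 2: add, reads the intermediate back

def fused(x):
    # A fused kernel applies both ops while each element is still
    # "hot", avoiding the intermediate tensor's memory round trip.
    out = np.empty_like(x)
    for i in range(x.size):
        v = x.flat[i]
        out.flat[i] = np.exp(v) + 1.0  # exp and add fused per element
    return out
```

(In NumPy this is only a conceptual model, since Python loops are slow; a real compiler emits the fused loop as native code.)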
The deep learning compiler was born to solve this series of problems. It serves as a common component and bridge between frameworks and hardware, with the ultimate goal that an operator only needs to be developed once and optimal code can be generated automatically for any device. For example, an operator developed for the CPU can be reused almost unchanged on the GPU and the D chip, significantly reducing cost.
Here is a brief introduction to the components and functions of a deep learning compiler, as shown in Figure 3. First, the front end imports the computation graph from different frameworks and represents it with a high-level IR data structure, then performs a series of graph optimizations at this level, such as constant folding, operator fusion, and equivalent replacement. In an equivalent replacement, part of the graph is rewritten into a different but mathematically equivalent form: the result stays the same, but the performance may be better. Next, for each operator in the graph, a domain-specific language (DSL) is used to describe the operator's computation and to optimize it, for example with tiling, multi-core parallelization, and double buffering. Because an operator's computation is usually expressed as nested loops (matrix multiplication, for instance, is a triply nested loop), the deep learning compiler can conveniently apply various transformations to these loops and tune the transformation parameters, obtaining the best operator implementation for different shapes and hardware. Finally, device-specific code is generated from the low-level IR.
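As an illustration of the loop transformations mentioned above, here is a plain-Python sketch (not actual compiler IR): matrix multiplication written as a triply nested loop, and the same computation after a tiling transformation, in which each loop is split by a tile size T and the outer loops are reordered to iterate over blocks. The tile size is exactly the kind of parameter a compiler schedule can tune.

```python
import numpy as np

def matmul_naive(A, B):
    # Matrix multiply as a triply nested loop -- the form the
    # compiler's DSL describes before any scheduling is applied.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_tiled(A, B, T=2):
    # The same computation after a tiling (split + reorder)
    # transformation: iterate block by block for better cache reuse.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            for k0 in range(0, K, T):
                for i in range(i0, min(i0 + T, M)):
                    for j in range(j0, min(j0 + T, N)):
                        for k in range(k0, min(k0 + T, K)):
                            C[i, j] += A[i, k] * B[k, j]
    return C
```

Both variants compute the same result; only the iteration order (and thus the memory-access pattern) differs.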
Finally, let us look at the existing compiler projects in the industry. At present, TVM is the most complete open-source, framework-independent project, and it has been adopted by many companies. The TVM pipeline is shown in Figure 3a. TVM can import models from various frameworks, such as TensorFlow pb, ONNX, and TorchScript models, and represents them uniformly in its high-level IR, called Relay. Each operator in the IR is then described and scheduled with the tensor expression DSL. This DSL uses Einstein notation to describe the operator's compute, which generally expands into nested for loops. Then, following the Halide idea, a schedule applies various transformations to these loops, such as loop fusion, splitting, and reordering. Finally, the program is lowered to low-level IR, device-specific code is generated, and inference is run.
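As a sketch of the Einstein-notation style of compute description (shown here with NumPy's einsum rather than TVM's actual tensor expression API): the repeated index k is implicitly summed over, so a single expression compactly describes the nested-loop computation of a matrix multiply.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

# Einstein notation: C[i, j] = sum_k A[i, k] * B[k, j].
# The repeated index k is implicitly reduced; the compiler expands
# this one-line description into the corresponding nested for loops.
C = np.einsum('ik,kj->ij', A, B)
```

A compiler takes such a declarative description as the starting point, then applies schedule transformations to the loops it implies.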
Here is a brief look at how TVM generates optimal operator code. As introduced above, an operator is described by its compute, and the nested for loops corresponding to the compute are then transformed by a schedule. TVM's operator generation and tuning has gone through three generations of development. The first generation, TVM/AutoTVM, requires the user to write both the operator's compute and its schedule. The difference with AutoTVM is that the schedule may contain tunable parameters, which are then optimized with, for example, genetic algorithms: if a loop is split into two segments, the split point can be tuned. The second generation, AutoScheduler (Ansor), only requires the user to write the operator's compute; Ansor applies schedule transformations automatically according to a set of rules. Since schedule development requires familiarity with both TVM's expression mechanism and the underlying hardware, it is often very difficult, so Ansor significantly reduces the developer's workload and the difficulty of development. Its drawback is that tuning takes a long time, often around an hour for a single operator. Taking convolutional networks as an example, Ansor can exceed the performance of TensorFlow operators in some scenarios, but still lags behind TensorRT implementations. The third generation, Meta Schedule (AutoTensorIR), is only in its infancy; it is expected to improve both tuning speed and performance, but it is not yet available, so we will wait and see.
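The parameter-tuning idea can be sketched in a few lines of plain Python (a toy stand-in, not AutoTVM's real search, which uses cost models and strategies such as genetic algorithms): enumerate candidate values of a schedule parameter, here a tile size for a split loop, time each variant, and keep the fastest.

```python
import timeit
import numpy as np

def tiled_sum_of_squares(x, tile):
    # Toy kernel whose single loop has been split into chunks of size
    # `tile`; the tile size is the tunable schedule parameter.
    total = 0.0
    for start in range(0, len(x), tile):
        chunk = x[start:start + tile]
        total += float(np.dot(chunk, chunk))
    return total

def autotune(x, candidates):
    # Grid search over candidate tile sizes, keeping the fastest --
    # a miniature stand-in for an auto-tuner's parameter search.
    best, best_time = None, float('inf')
    for tile in candidates:
        t = timeit.timeit(lambda: tiled_sum_of_squares(x, tile), number=3)
        if t < best_time:
            best, best_time = tile, t
    return best

x = np.random.rand(4096)
best_tile = autotune(x, [64, 256, 1024])
```

Real tuners face far larger search spaces (split factors, loop orders, vectorization, thread bindings), which is why the search itself can take hours per operator.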
Work built on TVM includes Huawei's D chip TBE operator development tool, which adds D chip code generation on top of TVM. TVM follows Halide's compute + schedule approach; other compilers take the polyhedral route, such as Tensor Comprehensions, Tiramisu, and Huawei's self-developed AKG. Like Ansor, these only require the user to write the operator's compute, without developing a schedule, so they are also developer-friendly. Among them, AKG has been used in MindSpore's graph-kernel fusion. Other deep learning compilers include TensorFlow's XLA and TensorRT, which you may have used.
In short, deep learning compilers have many advantages: they make it easy to support new hardware, avoid duplicated development, and replace manual optimization with a series of automatic optimizations, achieving excellent cost-effectiveness. They also still have shortcomings and remain in a state of rapid development: tuning takes a long time, complex operators cannot yet be generated effectively, and in a given model the proportion of compiler-generated operators that outperform library calls is still relatively low. Continued investment and optimization are required.