Author: MegEngine architects, Megvii

One, Background

For deep learning frameworks, the training/inference time of a network is something users care deeply about. In real production environments, user-designed networks vary greatly; even for the same kind of mathematical computation, the parameters differ. Without targeted optimization, a framework loses its competitiveness entirely. Therefore, for a given kind of mathematical computation, developers implement a variety of efficient algorithms, each suited to different parameters, to guarantee the performance of the network. This leaves developers with a new problem: once the parameters are determined, how to obtain the fastest algorithm for performing the computation.

Most frameworks rely on prior experience to select the algorithm, and MegEngine has likewise distilled excellent empirical rules that select an algorithm automatically during computation. But experience cannot guarantee that the fastest algorithm is chosen, and in many real-world scenarios users expect the ultimate performance from their network. To this end, MegEngine designed a dedicated process that automatically selects the fastest algorithm for each computation, thereby ensuring the shortest running time for the whole network. At the same time, the computation parameters, the corresponding algorithm information, and the device information can be recorded into memory or into a file, so that when the user runs the network again, the best-performing algorithm can be obtained directly. This performance-tuning process is called Fast Run, and it allows MegEngine users to obtain the best performance when running different networks.

Two, Fast Run overview

At present, almost all mainstream frameworks use the concept of an operator to abstract mathematical computation, such as the convolution operator, the matrix multiplication operator, and so on. MegEngine also uses the operator concept. In addition, at the lowest level we developed a computing library called MegDNN to perform the actual mathematical computation; MegDNN provides only mathematical computing power. MegDNN's top layer is likewise organized around operators, and MegDNN operators are encapsulated separately for different backends. A MegDNN operator may have multiple algorithms that implement it: MegDNN abstracts each of them into an Algorithm object, and a single Algorithm object can complete the computation of that operator.

Take the convolution operator as an example. On ARM, MegEngine implements the very general IM2COL algorithm, the Winograd algorithm with excellent performance under specific conditions, and the Direct convolution algorithm with high performance for small convolution kernels. On CUDA, there are algorithms that call cuDNN library functions, and so on. The relation from MegEngine operator to MegDNN operator to algorithm is shown in the figure below:



A MegEngine operator may hold one or more MegDNN operators to perform a computation, and a MegDNN operator needs to select one of its several Algorithm objects to perform its computation. For maximum computational performance, the fastest algorithm must be selected for each MegDNN operator before the network computation starts.
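To make this relation concrete, here is a minimal, hypothetical C++ sketch of what a MegDNN-style operator and its Algorithm objects could look like. The names (TensorLayout, Algorithm, ConvolutionOperator, get_all_algorithms, get_workspace_in_bytes, exec, set_execution_algorithm) are illustrative assumptions and do not reproduce MegDNN's actual headers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: one operator, many interchangeable algorithms.
struct TensorLayout {
    std::vector<std::size_t> shape;
};

class Algorithm {
public:
    virtual ~Algorithm() = default;
    virtual std::string name() const = 0;
    // Extra scratch memory the algorithm needs for the given layouts.
    virtual std::size_t get_workspace_in_bytes(const TensorLayout& src,
                                               const TensorLayout& filter,
                                               const TensorLayout& dst) const = 0;
    // Perform the actual computation (tensors and workspace omitted here).
    virtual void exec() const = 0;
};

class ConvolutionOperator {
public:
    // All algorithms that can handle the given layouts (im2col, winograd, direct, ...).
    std::vector<Algorithm*> get_all_algorithms(const TensorLayout& src,
                                               const TensorLayout& filter,
                                               const TensorLayout& dst) {
        (void)src; (void)filter; (void)dst;
        return {};  // the real operator would enumerate its registered algorithms
    }
    // The algorithm that Fast Run (or a heuristic) decides this operator should use.
    void set_execution_algorithm(Algorithm* algo) { m_algo = algo; }

private:
    Algorithm* m_algo = nullptr;
};
```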

The idea of Fast Run is very straightforward: before the network computation, every feasible algorithm of each MegDNN operator is profiled once, the performance data are recorded, and the fastest algorithm is assigned to that MegDNN operator. The premise of Fast Run is that an algorithm's running time is stable, so that comparing the profiling data of the different algorithms is meaningful.
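The core of this idea can be sketched as follows, reusing the hypothetical interface from the snippet above. The timing method (a single timed run with std::chrono) is a simplification assumed for illustration, not MegEngine's actual profiling code.

```cpp
#include <chrono>
#include <limits>

// Hypothetical sketch: profile every available algorithm once and keep the fastest.
Algorithm* profile_and_select(ConvolutionOperator& op,
                              const TensorLayout& src,
                              const TensorLayout& filter,
                              const TensorLayout& dst) {
    Algorithm* best = nullptr;
    double best_ms = std::numeric_limits<double>::max();
    for (Algorithm* algo : op.get_all_algorithms(src, filter, dst)) {
        auto t0 = std::chrono::steady_clock::now();
        algo->exec();  // run once on prepared inputs and workspace
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        // The real Fast Run also records (algorithm, time, workspace) into a cache here.
        if (ms < best_ms) {
            best_ms = ms;
            best = algo;
        }
    }
    op.set_execution_algorithm(best);
    return best;
}
```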

Finally, the point at which Fast Run executes has to be determined. MegEngine has unified memory management: before computation starts, every MegEngine operator must request enough memory for its computation from the memory-planning unit. This memory includes the memory needed by its internal MegDNN operators, and the memory a MegDNN operator needs is completely determined by its algorithm. This requires that, at that point, each MegDNN operator has already decided which algorithm it will use. Naturally, MegEngine executes the Fast Run process before the memory-planning interface is invoked. In this way, when the Fast Run process completes, every MegDNN operator has been set to its best-performing algorithm.
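Continuing the hypothetical sketch above, the following shows why the ordering matters: the memory planner can only ask for the workspace size after an algorithm has been fixed.

```cpp
#include <cstddef>

// Hypothetical sketch: the workspace size depends on which algorithm was chosen,
// e.g. im2col needs a large packed buffer while direct convolution may need none,
// so memory planning must run after Fast Run has set the algorithm.
std::size_t plan_workspace(const Algorithm& chosen, const TensorLayout& src,
                           const TensorLayout& filter, const TensorLayout& dst) {
    return chosen.get_workspace_in_bytes(src, filter, dst);
}
```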

The cost of Fast Run is obvious: it significantly increases the time taken by the first execution of the network. The Fast Run process is as follows:



Fast Run can be used in the following two ways; the difference between them lies in the Cache file written in the figure above:

1. Offline Fast Run. This is divided into two steps, completed in different processes. The first step is to run the whole network computation once; during this run, Fast Run writes the performance data of every algorithm into a dedicated data structure, the data are finally serialized into a Cache file, and the process exits. This step is called "searching". The second step is to load the same network and read the Cache file through MegEngine's interface. As you can see, an offline Fast Run can even be carried out across different devices.
2. Online Fast Run. The online Fast Run is completed within a single process. The first half is the same as the offline Fast Run; after it finishes, the performance data of every algorithm are held in an in-memory data structure and the process does not exit. Afterwards, different input data can be fed into the network, and by then every MegDNN operator has already been set to its best-performing algorithm. You can also initialize another network and read the algorithms from the current data structure, just as in the second half of an offline Fast Run.

In general, Fast Run provides a search-and-record capability: it selects the best-performing algorithm under the current parameters for every MegDNN operator in the network. Since Fast Run performs the same operation on every MegDNN operator, it can be used for both forward inference and back propagation. At present, MegEngine supports Fast Run on three backends, CUDA, CPU, and ROCm, which are widely used by MegEngine users during training and deployment.
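The two modes differ only in whether the profiled results are persisted to a file. Below is a hedged C++ sketch of such a cache; the class name PersistentCache matches the one discussed later in this article, but the member functions (put, get, dump_to_file, load_from_file) and the string-based interface are assumptions made for illustration.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch of a persistent algorithm cache.
// Kept in memory only            -> online Fast Run.
// Dumped to / loaded from a file -> offline Fast Run ("search" first, deploy later).
class PersistentCache {
public:
    void put(const std::string& category, const std::string& key,
             const std::string& profiled_result) {
        std::lock_guard<std::mutex> lock(m_mutex);  // the real caches are thread-safe
        m_data[category][key] = profiled_result;
    }

    bool get(const std::string& category, const std::string& key,
             std::string& result) const {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto cat = m_data.find(category);
        if (cat == m_data.end()) return false;
        auto it = cat->second.find(key);
        if (it == cat->second.end()) return false;
        result = it->second;
        return true;
    }

    // Offline mode: serialize to / restore from a cache file (format is illustrative,
    // definitions omitted in this sketch).
    void dump_to_file(const std::string& path) const;
    void load_from_file(const std::string& path);

private:
    mutable std::mutex m_mutex;
    std::map<std::string, std::map<std::string, std::string>> m_data;
};
```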

Three, Fast Run principle

In Fast Run, profiling a MegDNN operator and setting its algorithm goes through four steps. The process is shown below:



Some details need to be noted in this process:

1. Recursive search: operator nesting is common in MegDNN. For example, in the Convolution operator, the IM2COL algorithm uses MegDNN's Matmul operator to perform the matrix multiplication, so the performance of Convolution is directly affected by the performance of Matmul. It follows that, before profiling a Convolution operator, the performance data of the Matmul operator it uses must already be known. To solve this operator-nesting problem, Fast Run searches recursively when profiling. As shown in the dotted box in the figure above, after a MegDNN operator obtains all of its available algorithms, it calls each algorithm's interface to ask whether that algorithm depends on sub-operators and saves the results. If the results are not empty, the sub-operators are profiled first. After that, when the top-level operator is profiled, the sub-operators it uses already have their best algorithms stored in the Cache (a sketch of this recursion is given after this list).

2. Fast Run performance data storage: Fast Run's access to performance data relies on the Cache. MegEngine offers two kinds of PersistentCache, which differ in where the data are stored (in memory or in a file). The structure of the Cache looks like this:

In MegEngine, the PersistentCache object is a singleton, and both caches are thread-safe. The Cache maintains a mapping from Category information to a collection, where the Category records backend information. A Category is a string obtained by combining backend information with the operator type. Backend information is distinguished by device: for example, the backend information of CUDA is composed of the device name, the NVIDIA driver version, and the CUDA runtime library version, whereas for the CPU backend only the device name is recorded. Only three backends, CUDA, CPU, and ROCm, have corresponding Category generation in MegEngine, which is also why MegEngine currently supports Fast Run only on these three backends. The operator type consists of the operator name and the Cache version information.

A Category maps to a collection that maintains a mapping from the information of a single MegDNN operator to the profiling results of all of its available algorithms. The key of this mapping consists of the dimensions of all the MegDNN operator's input Tensors plus all of the operator's parameters (which together completely determine whether an algorithm is usable). The value is a sorted array holding the time, the extra workspace required, and so on for each profiled algorithm. The sorting is in ascending order of elapsed time, and each algorithm in the sequence must use less memory than the one before it, so the sequence never contains an algorithm that is both slower and uses more memory than another. Fast Run results from different backends can coexist in one Cache, as long as they have different Categories (a sketch of this structure is also given below).
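As a rough illustration of the recursive search described in point 1, here is a hedged sketch that reuses the hypothetical interface from the earlier snippets. The helper get_sub_operators and the overall control flow are assumptions, not MegDNN's real API.

```cpp
#include <vector>

// Hypothetical sketch of recursive profiling: profile the sub-operators an
// algorithm depends on before profiling the operator itself.
struct SubOperatorDesc {
    ConvolutionOperator* op;   // in MegDNN this could be a Matmul operator, etc.
    TensorLayout src, filter, dst;
};

// Assumed helper: ask an algorithm which sub-operators it needs
// (e.g. the im2col convolution algorithm depends on a Matmul operator).
std::vector<SubOperatorDesc> get_sub_operators(Algorithm* algo,
                                               const TensorLayout& src,
                                               const TensorLayout& filter,
                                               const TensorLayout& dst);

void profile_recursively(ConvolutionOperator& op,
                         const TensorLayout& src,
                         const TensorLayout& filter,
                         const TensorLayout& dst) {
    for (Algorithm* algo : op.get_all_algorithms(src, filter, dst)) {
        for (const SubOperatorDesc& sub : get_sub_operators(algo, src, filter, dst)) {
            // Depth-first: make sure every sub-operator is searched first.
            profile_recursively(*sub.op, sub.src, sub.filter, sub.dst);
        }
    }
    // By now every sub-operator has its best algorithm recorded in the Cache,
    // so timing this operator's algorithms reflects realistic sub-operator speed.
    profile_and_select(op, src, filter, dst);
}
```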
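And here is a hedged sketch of the Cache layout described in point 2: a Category string maps to a per-operator table whose keys encode input shapes and operator parameters, and whose values are the sorted profiling results. All type and field names are illustrative, not MegDNN's actual definitions.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Category: backend info (device name, driver/runtime versions on CUDA; device
// name only on CPU) combined with the operator type (operator name + Fast Run
// version), flattened into a string.
using Category = std::string;

// Key: all input tensor shapes plus all operator parameters, serialized.
using OperatorKey = std::string;

struct AlgoProfileResult {
    std::string algo_name;
    double time_ms;               // measured execution time
    std::size_t workspace_bytes;  // extra memory the algorithm needs
};

// Kept sorted by ascending time; an entry that is both slower and needs more
// workspace than an earlier entry is dropped.
using ProfileResults = std::vector<AlgoProfileResult>;

using AlgoCache = std::map<Category, std::map<OperatorKey, ProfileResults>>;
```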

On some common models, the inference performance with Fast Run turned off versus turned on is as follows:



Judging from the use of Fast Run in real projects, it can significantly reduce network running time in most scenarios.

Four, Using Fast Run

MegEngine has a large number of configurable parameters, many of which are engineering solutions that have been extensively validated in industrial practice. Some of these parameters are closely related to the use of Fast Run, and their usage is detailed here.

4.1 Enabling Fast Run

To use Fast Run at the source-code level, you can refer to the executable load_and_run that ships with MegEngine. If you only want to test a model with load_and_run, you need the following two parameters:

  1. --full-run / --fast-run: the user must choose one of these two modes for searching. The difference between them is the size of the set of available algorithms generated for each MegDNN operator during profiling. --full-run profiles all available algorithms of a MegDNN operator, including the naive algorithm (every MegDNN operator has at least one algorithm that is guaranteed to work under any parameters but runs slowly). --fast-run excludes the naive algorithms. If you want to reduce the time cost of profiling, choose --fast-run; note that in this mode, if the network contains an operator with unusual parameters, that operator may end up with no algorithm available (the optimized algorithms are unusable and the naive one has been excluded), in which case MegEngine reports "no algorithm available" and exits.
  2. --fast-run-algo-policy: specifies the path to the Cache file. The performance data in the file are read into memory and held by the globally unique PersistentCache object. All performance data in the PersistentCache are written back to this file before the process exits.

Both parameters can be used separately or together:

  1. Using --full-run/--fast-run alone, the profiling data are kept only in memory.
  2. When they are used together, the performance data from the file are first read into memory. If the file is empty, the performance data are written back to the file after all MegDNN operators have finished searching. If the file is not empty, a MegDNN operator that can retrieve its performance data from the Cache is not searched, while the operators whose data cannot be retrieved are searched. This enables breakpoint searching, which MegEngine calls "continued search": if the program exits abnormally during a Fast Run, "continued search" lets the next Fast Run carry on from where it stopped. "Continued search" also allows the performance data of multiple models to be merged into a single Cache file. If all MegDNN operators can retrieve their performance data from the Cache, no search happens at all and the network runs with the best performance (see the sketch after this list).
  3. Using --fast-run-algo-policy alone, the performance data in the file are read into memory first. For operators that have no record in the Cache, the MegDNN operator's algorithm falls back to the heuristic choice, and performance may not be optimal.
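As a hedged illustration of the "continued search" behaviour described in point 2, the sketch below combines the hypothetical profile_and_select and PersistentCache from the earlier snippets: an operator is profiled only if the Cache has no record for it yet.

```cpp
// Hypothetical sketch of "continued search": reuse cached results when they
// exist, and only profile (and record) operators that are still missing.
void search_or_reuse(ConvolutionOperator& op, PersistentCache& cache,
                     const std::string& category, const std::string& key,
                     const TensorLayout& src, const TensorLayout& filter,
                     const TensorLayout& dst) {
    std::string cached;
    if (cache.get(category, key, cached)) {
        // A previous (possibly interrupted) search already covered this operator:
        // reuse its result and skip profiling.
        return;
    }
    Algorithm* best = profile_and_select(op, src, filter, dst);
    if (best != nullptr) {
        cache.put(category, key, best->name());
    }
}
```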

When using Fast Run, you can add --verbose; the program will then print detailed Fast Run debug information, including each MegDNN operator's name, the sizes of its inputs and outputs, the name of the algorithm that was set, and so on. If performance is not as expected, for example when the loaded model does not match the Cache file, a "second search" often occurs, giving the illusion that the network takes a long time to execute. In that case it is highly recommended to use the --verbose parameter to check whether the program is behaving as expected.

4.2 Algorithm attributes

Some algorithms in MegDNN have special attributes that affect how an algorithm is assigned to a MegDNN operator. The attributes currently used are:

  1. Reproducible: an algorithm with the Reproducible attribute guarantees that its computed results are bit-aligned across runs. Fast Run supports the Reproducible attribute when reading algorithm information from the Cache: with --reproducible set, Fast Run selects the best-performing algorithm among those in the Cache that carry the Reproducible attribute. The profiling stage does not distinguish whether an algorithm is reproducible, so the Cache holds both reproducible and non-reproducible algorithms and therefore retains a degree of general applicability.
  2. Naive: only the most naive algorithms in MegDNN carry the Naive attribute. The difference between --full-run and --fast-run is that --fast-run uses this attribute to filter out the slow naive algorithms (a sketch of this attribute-based filtering follows this list).
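A hedged sketch of what such attribute-based filtering could look like; the flag values and the filtering helper are illustrative assumptions, reusing the hypothetical Algorithm type from the earlier snippets.

```cpp
#include <cstdint>
#include <vector>

// Illustrative attribute bit flags.
constexpr uint32_t ATTR_REPRODUCIBLE = 1u << 0;  // results are bit-exact across runs
constexpr uint32_t ATTR_NAIVE        = 1u << 1;  // slow reference implementation

struct AlgoCandidate {
    Algorithm* algo;
    uint32_t attributes;
};

// Keep only candidates that have every `required` attribute and none of the
// `forbidden` ones, e.g. required = ATTR_REPRODUCIBLE for --reproducible,
// forbidden = ATTR_NAIVE for --fast-run.
std::vector<AlgoCandidate> filter_by_attribute(const std::vector<AlgoCandidate>& all,
                                               uint32_t required, uint32_t forbidden) {
    std::vector<AlgoCandidate> out;
    for (const AlgoCandidate& c : all) {
        if ((c.attributes & required) == required && (c.attributes & forbidden) == 0) {
            out.push_back(c);
        }
    }
    return out;
}
```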

4.3 Weight preprocessing

In some algorithms, auxiliary conversion of data is required during computation. Among these, the conversion of the weights can be done once ahead of time to save running time. For example, the Winograd algorithm can transform its weights before the convolution computation, saving a considerable part of the runtime overhead. MegEngine provides the weight_preprocess option in GraphCommonOptimizeOptions to support converting the weights in advance. Once weight_preprocess is set, the performance data of algorithms that can pre-convert their weights no longer include the weight-conversion time. Simply put, setting weight_preprocess during the search phase affects the algorithms' performance data, and the order of the performance data in the Cache may change. If the Cache was searched with weight preprocessing enabled, weight preprocessing also needs to be enabled at deployment time for better performance; otherwise there is a risk of performance degradation. Fast Run and weight preprocessing do not require each other and can be used separately, although in general the two can be combined for better performance.
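A rough sketch of how the time recorded by the search changes when weight preprocessing is enabled; the option name weight_preprocess comes from the article, while the structure and helper below are purely illustrative.

```cpp
// Hypothetical sketch: with weight_preprocess enabled, the one-time weight
// transform (e.g. the Winograd filter transform) is done ahead of time, so it
// is excluded from the execution time that Fast Run records for the algorithm.
struct ProfiledTime {
    double transform_ms;  // one-time weight conversion
    double compute_ms;    // per-run computation
};

double recorded_time_ms(const ProfiledTime& t, bool weight_preprocess) {
    // Without preprocessing the transform is paid on every run and is therefore
    // included; with preprocessing only the per-run computation is timed.
    return weight_preprocess ? t.compute_ms : t.transform_ms + t.compute_ms;
}
```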

4.4 Fast Run version

The Fast Run version information is represented as a string inside the Cache's Category. The Cache is compatible across versions: different versions of MegEngine can store their search results in the same Cache, where they simply appear under different Categories. However, users still need to pay attention to the Fast Run version. In general, if a MegDNN algorithm is deleted or its attributes are changed, the Fast Run version information changes. If the Fast Run version information has changed, the search needs to be performed again.

Appendix:

GitHub: MegEngine (Tianyuan)

Website: MegEngine - Deep Learning, Easy Development

Welcome to join the MegEngine technical exchange QQ group: 1029741705