Introduction to EMLL

With the rapid development of artificial intelligence, the demands on computing performance keep rising. Traditionally, most computation happens in the cloud: images, audio and other data are sent over the network to a cloud data center for processing, and the results are sent back. With the exponential growth of data, however, cloud-based computing has shown many shortcomings, such as limited real-time responsiveness, dependence on network conditions and data-security concerns, so on-device (edge-side) inference is becoming more and more important.

Against this background, the NetEase Youdao AI team independently designed and developed a high-performance edge-side machine learning computing library, EMLL (Edge ML Library), which has recently been open-sourced.

EMLL is designed to accelerate edge-side AI inference. It is a high-performance machine learning computing library for edge processors and supports data types such as FP32, FP16 and INT8. It has been used in the NMT, ASR and OCR engines of smart hardware products such as the NetEase Youdao Dictionary Pen, Translator Wang and Super Dictionary, greatly improving computing performance and user experience.

Open source address: github.com/netease-you…

I. Edge-side AI

Edge-side AI has the following advantages:

  • Low latency
  • Ensuring data privacy
  • Network independent

Edge-side AI also faces challenges:

  • The computing power of edge processors is limited, far below that of cloud servers; meeting the performance requirements of increasingly complex edge-side AI workloads is crucial
  • Memory capacity and bandwidth are limited and have a critical impact on performance

ARM processors dominate smart devices and are the mainstream platform for deploying edge-side AI. NPUs, DSPs and GPUs can provide higher computing power and have their own edge-side application scenarios, but their software ecosystems are less mature and still need time to develop.

The most time-consuming computations in edge-side AI are fully connected (FC) layers and convolutions, and the underlying core computation of both is matrix multiplication. The performance of the underlying computing library therefore plays a decisive role in whether edge-side AI can be deployed at all.
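To make this concrete, here is a minimal, naive sketch (illustrative only, not EMLL code) showing that an FC layer is exactly such a matrix multiplication; convolutions are commonly lowered to the same C(M, N) = A(M, K) * B(K, N) shape, e.g. via im2col:

```c
#include <stddef.h>

/* A fully connected layer y = W * x over a batch is a GEMM: C(M,N) = A(M,K) * B(K,N),
 * where A is the weight matrix (M output features, K input features) and B holds N
 * input vectors as columns. This naive triple loop is what the optimizations described
 * later replace. */
static void naive_sgemm(const float *A, const float *B, float *C,
                        size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < K; ++p)
                acc += A[i * K + p] * B[p * N + j];   /* row-major A and B */
            C[i * N + j] = acc;
        }
}
```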

II. Third-party BLAS libraries on ARM

Eigen

A C++ template library for linear algebra; matrix operations can be written directly with operators.

OpenBLAS

An open-source, high-performance BLAS library maintained by the Institute of Software, Chinese Academy of Sciences. It is based on Kazushige Goto's GotoBLAS and supports the Fortran BLAS and CBLAS interfaces.

ARM Compute Library

ARM's official compute library supports common AI operators. Matrix multiplication is wrapped as a model-inference layer and must be initialized before it can be called.

Table 1. Matrix-multiplication characteristics of the ARM BLAS libraries:

These libraries perform well for matrix multiplication on conventional (roughly square) matrix shapes, but poorly on flat (highly non-square) matrices. The underlying computation of edge-side AI is dominated by flat-matrix multiplication, so the poor performance of the third-party libraries on these shapes leaves hardware capability unused and hinders the deployment of AI on edge platforms.

Table 2. GEMM computing efficiency of the third-party libraries on a quad-core ARM Cortex-A53:

Note: C(M, N) = A(M, K) * B(K, N). Each value above is the better of full row-major and full column-major storage, and each test repeats the multiplication 128 times on the same matrices. Computing efficiency is the measured GEMM FLOPS divided by the theoretical peak FLOPS of the hardware.
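For reference, a minimal sketch of how such an efficiency figure is obtained (the function name and the numbers in main are illustrative assumptions, not measured data):

```c
#include <stdio.h>

/* Illustrative only: how a GEMM efficiency figure like those in Table 2 can be
 * derived. 'seconds' is the measured wall time of one C(M,N) = A(M,K) * B(K,N)
 * multiplication; 'peak_flops' is the chip's theoretical peak
 * (cores * frequency * FLOPs per cycle). */
static double gemm_efficiency(double M, double N, double K,
                              double seconds, double peak_flops) {
    double flop = 2.0 * M * N * K;       /* one multiply + one add per (i, j, k) */
    double achieved = flop / seconds;    /* achieved FLOPS */
    return achieved / peak_flops;        /* fraction of theoretical peak */
}

int main(void) {
    /* Hypothetical numbers for illustration, not measured data. */
    printf("efficiency = %.2f\n", gemm_efficiency(1000, 32, 1000, 0.004, 2.0e10));
    return 0;
}
```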

III. Features of EMLL

High performance

The matrix multiplication implemented in EMLL is specifically optimized for the flat matrix shapes that are common in edge-side AI, and is tuned for common ARM processors. For the Cortex-A7/A35/A53/A55/A76 cores, the library applies assembly-level optimizations based on their pipeline characteristics.

In most cases EMLL shows a significant performance improvement over Eigen and the ARM Compute Library, especially for the flat matrix multiplications common in edge-side AI. The figure below shows single-precision matrix-multiplication performance for some matrix sizes typical of edge-side AI.

Figure 1. EMLL matrix-multiplication performance

Ease of use

EMLL's function interfaces are designed to keep parameters simple and direct. The matrix-multiplication interface drops the rarely used LD* (leading dimension) parameters; matrices and vectors are passed as a pointer plus integer dimensions. The library does not depend on any third-party computing library.
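As a rough illustration of the contrast (the simplified signature simple_sgemm below is hypothetical and only conveys the pointer-plus-dimensions idea; see the repository for EMLL's actual function names and parameter order):

```c
#include <cblas.h>   /* CBLAS interface, available e.g. from OpenBLAS */

/* Hypothetical simplified interface in the spirit of EMLL's design; the real
 * EMLL function names and parameter order may differ -- see the repository. */
void simple_sgemm(const float *A, const float *B, float *C,
                  unsigned M, unsigned N, unsigned K);

void compare_interfaces(const float *A, const float *B, float *C,
                        unsigned M, unsigned N, unsigned K) {
    /* Standard BLAS: layout, transpose flags, alpha/beta and the lda/ldb/ldc
     * leading-dimension arguments must all be supplied. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)M, (int)N, (int)K, 1.0f, A, (int)K, B, (int)N,
                0.0f, C, (int)N);

    /* Pointer-plus-dimensions style: matrices are passed as plain pointers
     * together with the integer dimensions M, N, K. */
    simple_sgemm(A, B, C, M, N, K);
}
```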

Extensibility

For the matrix-multiplication and quantization functions, EMLL extracts the architecture-independent code into generic macros, which significantly reduces the amount of code needed to support a new CPU architecture.
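A minimal sketch of this idea (illustrative only, not EMLL's actual macros): the accumulation loop is written once as an architecture-independent macro, while each architecture supplies only a small set of primitive definitions:

```c
#include <stdio.h>

/* Architecture-specific part: a scalar fallback here; a NEON port would define
 * VEC_T as float32x4_t and VEC_FMA via vfmaq_f32, etc. */
#define VEC_T          float
#define VEC_FMA(c, a, b) ((c) + (a) * (b))

/* Architecture-independent part: the dot-product loop, written once. */
#define DEFINE_DOT(NAME)                                        \
    static VEC_T NAME(const VEC_T *a, const VEC_T *b, int n) {  \
        VEC_T acc = 0;                                          \
        for (int i = 0; i < n; ++i)                             \
            acc = VEC_FMA(acc, a[i], b[i]);                     \
        return acc;                                             \
    }

DEFINE_DOT(dot_product)

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    printf("%f\n", dot_product(a, b, 4));   /* prints 20.0 */
    return 0;
}
```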

IV. EMLL performance optimization methods

To optimize the performance of a computing library on edge devices, both memory-access efficiency and computational efficiency must be considered. The following uses (dense) matrix multiplication as an example to describe the optimization methods used by EMLL.

Blocking

Matrix multiplication requires frequent memory accesses. When the matrices are large, the CPU cache cannot hold all of their contents, cache misses occur frequently, and efficiency drops. EMLL therefore decomposes the problem, cutting the large matrices into smaller blocks; this is the blocking method. After blocking, each subtask computes only the contribution of one small block to the result and accesses only that block's memory region intensively, which greatly improves the cache hit rate. For the multiplication of two large matrices, EMLL follows existing optimization work [4] and makes full use of the multi-level CPU cache through multi-level blocking, mainly using the following two partitioning schemes:

Figure 2. Blocking scheme

L1–L3 indicate the CPU cache level used by each matrix block
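A minimal sketch of cache blocking with a single block level (block sizes and loop order are illustrative; EMLL and [4] use multiple levels matched to the L1/L2/L3 capacities):

```c
#include <stddef.h>

/* One-level blocked GEMM, illustrative only. C is assumed to be zero-initialized
 * (or to hold values to accumulate into). MC/NC/KC are placeholder block sizes;
 * a real implementation chooses them from the cache capacities. */
enum { MC = 64, NC = 64, KC = 64 };

static void blocked_sgemm(const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K) {
    for (size_t jc = 0; jc < N; jc += NC)
        for (size_t pc = 0; pc < K; pc += KC)
            for (size_t ic = 0; ic < M; ic += MC) {
                size_t nb = (jc + NC <= N) ? NC : N - jc;
                size_t kb = (pc + KC <= K) ? KC : K - pc;
                size_t mb = (ic + MC <= M) ? MC : M - ic;
                /* Multiply one MC x KC block of A by one KC x NC block of B.
                 * All three blocks fit in cache while this inner kernel runs. */
                for (size_t i = 0; i < mb; ++i)
                    for (size_t j = 0; j < nb; ++j) {
                        float acc = C[(ic + i) * N + (jc + j)];
                        for (size_t p = 0; p < kb; ++p)
                            acc += A[(ic + i) * K + (pc + p)]
                                 * B[(pc + p) * N + (jc + j)];
                        C[(ic + i) * N + (jc + j)] = acc;
                    }
            }
}
```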

The CPU's registers can be regarded as the "fastest cache". To make full use of them, EMLL splits the blocks above even further: the block on the left is divided into minimal m×k tiles A1, and the block on the right into minimal k×n tiles B1. Multiplying one pair of minimal tiles with a plain triple loop requires 2×m×n×k element accesses; without registers, these are all memory accesses. With registers, the two small tiles only need to be loaded into registers once before the multiplication begins, after which no further loads are issued, so the memory accesses drop to (m + n)×k.

In summary, large-scale blocking improves the utilization of the CPU caches at every level, while small-scale blocking uses CPU registers to reduce the number of memory accesses; both bring significant performance benefits.
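A minimal sketch of register blocking with a 4×4 tile (illustrative, not EMLL's actual kernel): the 16 accumulators stay in registers for the whole k loop, so each A and B element is loaded only once, giving (m + n)×k loads instead of 2×m×n×k:

```c
#include <stddef.h>

/* 4x4 register-blocked micro-kernel, illustrative only. lda/ldb/ldc are the
 * internal row strides of the blocks being multiplied. */
static void micro_kernel_4x4(const float *A, const float *B, float *C,
                             size_t K, size_t lda, size_t ldb, size_t ldc) {
    float c[4][4] = {{0}};                 /* accumulators kept in registers */
    for (size_t p = 0; p < K; ++p) {
        float a0 = A[0 * lda + p], a1 = A[1 * lda + p],
              a2 = A[2 * lda + p], a3 = A[3 * lda + p];   /* 4 loads from A */
        float b0 = B[p * ldb + 0], b1 = B[p * ldb + 1],
              b2 = B[p * ldb + 2], b3 = B[p * ldb + 3];   /* 4 loads from B */
        c[0][0] += a0 * b0; c[0][1] += a0 * b1; c[0][2] += a0 * b2; c[0][3] += a0 * b3;
        c[1][0] += a1 * b0; c[1][1] += a1 * b1; c[1][2] += a1 * b2; c[1][3] += a1 * b3;
        c[2][0] += a2 * b0; c[2][1] += a2 * b1; c[2][2] += a2 * b2; c[2][3] += a2 * b3;
        c[3][0] += a3 * b0; c[3][1] += a3 * b1; c[3][2] += a3 * b2; c[3][3] += a3 * b3;
    }
    for (int i = 0; i < 4; ++i)            /* write the tile back once at the end */
        for (int j = 0; j < 4; ++j)
            C[i * ldc + j] += c[i][j];
}
```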

Rearrangement (packing)

As mentioned above, to make full use of registers, the reads of a sub-matrix block are split into smaller m×k or k×n tiles (1 < m, n, k < 20), which are read one tile at a time during the computation. A matrix is normally stored in memory in row-major or column-major order; with either layout, reading these small tiles involves many strided (skipping) accesses, which hurt performance for three reasons:

  • They waste cache bandwidth: the L2/L3 caches exchange data with L1 in units of cache lines. With strided access only a small part of each transferred cache line is actually used, so transfer bandwidth is wasted.
  • They cannot exploit vectorized load units: many SIMD-capable CPUs have vectorized load units that load several elements at contiguous addresses with a single instruction. Strided access cannot take advantage of this.
  • They increase the cost of page-table lookups: memory accesses involve virtual-to-physical address translation, which requires page-table lookups. A page table covers a limited address range, so if the access stride is too large, new page-table entries must be looked up frequently.

In the multiplication of two sub-matrix blocks, each block is usually read multiple times, in the same order each time: a block of B is re-read whenever the block of A it is multiplied with has more than m rows, and a block of A is re-read whenever the block of B has more than n columns. Following the existing optimization work [4], EMLL rearranges the elements of the two sub-matrix blocks before the computation according to the order in which they will be read (i.e. tile by tile as described above), so that all accesses to the two blocks during the computation become sequential; this is the rearrangement (packing) optimization. Packing adds some overhead before the computation, but the repeated, now-sequential accesses during the computation more than pay for it, so overall performance improves.
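A minimal sketch of packing a block of B into sequentially readable tiles (the tile width and layout are illustrative assumptions, not EMLL's actual packed format):

```c
#include <stddef.h>

/* Pack a K x N block of row-major B into tiles of width NR, so the micro-kernel
 * can read it with stride 1 instead of jumping through B with stride ldb. */
enum { NR = 4 };

static void pack_B_block(const float *B, float *packed,
                         size_t K, size_t N, size_t ldb) {
    size_t idx = 0;
    for (size_t j = 0; j < N; j += NR)            /* one column tile at a time */
        for (size_t p = 0; p < K; ++p)            /* walk down the tile */
            for (size_t jj = 0; jj < NR; ++jj)    /* NR consecutive columns */
                packed[idx++] = (j + jj < N) ? B[p * ldb + j + jj] : 0.0f;
    /* Edge tiles are zero-padded so the kernel always sees full-width tiles. */
}
```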

For matrices of particular shapes, the cost of packing can exceed its benefit, so packing must be applied selectively [5]. When the number of rows M of source matrix A is small, the sub-blocks of B are re-read far fewer times, so the benefit of packing B shrinks and can even fall below its cost; this situation is very common in edge-side AI inference. EMLL therefore checks M: when M is below a threshold, matrix B is no longer packed, and the computation order is adjusted so that all elements of B are read sequentially in a single pass. Similarly, when the number of columns N of source matrix B is sufficiently small, EMLL no longer packs matrix A, adjusting the computation order so that the elements of A are read sequentially in one pass. With this special handling of special shapes, EMLL significantly outperforms open-source libraries such as Eigen and OpenBLAS on these sizes.

Assembly optimization

To improve computational efficiency, mainstream CPUs support the Single Instruction Multiple Data (SIMD) processing mode, in which one instruction performs the same operation on multiple data elements. Using a SIMD instruction set increases data throughput without increasing instruction throughput. On the ARM platform, SIMD operations are provided by the NEON instruction set.

Take the multiplication and accumulation of the smallest matrix tiles with m = n = 4 and k = 1. With scalar computation, 16 multiplications and 16 additions are needed. The NEON instruction set provides broadcast-mode fused multiply-add operations, with which the same work takes only four instructions, as sketched below. Most other values of m, n and k can also be accelerated with NEON instructions. NEON instructions can be written explicitly in assembly, or invoked through the compiler's intrinsics, which are more readable but less predictable in performance.
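A minimal sketch of this 4×4, k = 1 update with AArch64 NEON intrinsics (illustrative, not EMLL's actual assembly kernel):

```c
#include <arm_neon.h>

/* Rank-1 update of a 4x4 accumulator tile: c[i][j] += a[i] * b[j].
 * Scalar code needs 16 multiplies and 16 adds; with broadcast-mode fused
 * multiply-add (FMLA by element), four instructions do the same work. */
static inline void tile_4x4_rank1_update(float32x4_t *c0, float32x4_t *c1,
                                         float32x4_t *c2, float32x4_t *c3,
                                         const float *a, const float *b) {
    float32x4_t va = vld1q_f32(a);   /* a[0..3]: one column of the A tile */
    float32x4_t vb = vld1q_f32(b);   /* b[0..3]: one row of the B tile    */
    *c0 = vfmaq_laneq_f32(*c0, vb, va, 0);  /* row 0: c0 += b * a[0] */
    *c1 = vfmaq_laneq_f32(*c1, vb, va, 1);  /* row 1: c1 += b * a[1] */
    *c2 = vfmaq_laneq_f32(*c2, vb, va, 2);  /* row 2: c2 += b * a[2] */
    *c3 = vfmaq_laneq_f32(*c3, vb, va, 3);  /* row 3: c3 += b * a[3] */
}
```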

To save cost and power, low- and mid-range processors usually omit out-of-order execution and execute instructions strictly in program order, for example ARM's Cortex-A7, A35, A53 and A55. Some of these cores can issue two adjacent instructions in the same cycle while still executing in order. On such processors, data dependencies or execution-unit conflicts between instructions mean that instruction order has a significant impact on performance, so achieving peak performance requires reordering instructions at the assembly level. Two instructions with a data dependency (for example, an arithmetic instruction whose input is the result of a load) should be placed as far apart as possible, to avoid pipeline stalls while waiting for the dependency.

V. EMLL functionality

Supported computation functions

Table 3. Supported computation functions:

Supported Architectures

armv7a, armv8a

Supported edge-side operating systems

Linux, Android

VI. Application cases

The NetEase Youdao Dictionary Pen is a smart learning hardware product developed by NetEase Youdao. With fast, accurate word lookup and rich, authoritative content, it has become a flagship example of applying AI technology to learning. With its "multi-line scan and translate" feature, the Dictionary Pen supports translating whole paragraphs.

The NetEase Youdao Super Dictionary builds an efficient, intelligent English-learning system with strong on-device capabilities, providing features such as learning English from photos, word lookup and translation, vocabulary memorization, listening practice, dialogue translation and a voice assistant.

The NetEase Youdao Translator Wang supports translation between 43 languages and covers 191 countries and regions. It supports online translation for 21 languages and on-device photo translation for 7 languages, instantly interpreting signs and menus.

The NetEase Youdao Dictionary Pen, Super Dictionary and Translator Wang all embed NetEase Youdao's self-developed, industry-leading AI technologies, including neural machine translation (NMT), optical character recognition (OCR), speech recognition (ASR) and speech synthesis (TTS), and support offline use.

NetEase Youdao's self-developed edge-side machine learning computing library has been used in the Dictionary Pen, Super Dictionary, Translator Wang and other smart hardware products, bringing the following benefits:

  • End-to-end performance is 1.3 to 2.43 times faster than with the Eigen library, significantly reducing the latency of the edge-side inference engine. Beyond the gains on Youdao smart hardware, we also ran a performance test on a phone with a Snapdragon 855, where end-to-end performance improved by 25%–55% compared with Eigen; the effect is clear.
  • With EMLL in the edge-side inference engine, larger AI models can be deployed while preserving quality and real-time performance. For example, the BLEU score of edge-side NMT improves by 2 points, and the accuracy of edge-side ASR improves by 4.73%.
  • EMLL guarantees real-time performance on lower-end chips such as the Cortex-A7, which is not achievable with the Eigen library. With EMLL, latency drops sharply and real-time performance is ensured, giving smart hardware a wider choice of chips, reducing cost and improving market competitiveness.

Table 4. Test platforms:

Figure 3. End-to-end speedups of NMT, ASR and OCR on different platforms with EMLL versus Eigen

EMLL, a high-performance edge-side machine learning computing library, has been applied in several NetEase Youdao smart hardware products with remarkable results, greatly improving performance and delivering a better experience to users.

Going forward, NetEase Youdao will continue to maintain and optimize EMLL to help more enterprises, research institutions and other partners improve their edge-side AI computing capabilities. Every developer is welcome to use it and share feedback.

References

[1] Eigen: eigen.tuxfamily.org/

[2] OpenBLAS: github.com/xianyi/Open…

[3] ARM Compute Library: github.com/ARM-softwar…

[4] Goto K., et al. Anatomy of High-Performance Matrix Multiplication[J]. ACM Trans. Math. Softw., 2008, 34(3).

[5] Frison G., et al. The BLAS API of BLASFEO: optimizing performance for small matrices[J]. ACM Trans. Math. Softw., 2020, 46(2), 15:1-15:36.

Open source address: github.com/netease-you…