
Introduction

As artificial intelligence technology continues to advance, the demands on computing performance keep rising. Traditionally, most computation happens in the cloud: image, audio, and other data are sent over the network to a cloud data center for processing, and the results are sent back. With data volumes growing exponentially, however, relying purely on cloud-side computing shows clear shortcomings in areas such as real-time processing, network dependence, and data security. On-device inference is therefore becoming increasingly important.

Against this background, the NetEase Youdao AI team designed and developed EMLL (Edge ML Library), a high-performance on-device machine learning computing library, which was recently open sourced.

EMLL is designed to accelerate on-device AI inference. It provides a high-performance machine learning computing library for on-device processors and supports data types such as fp32, fp16, and int8. It is already deployed in the NMT, ASR, and OCR engines of NetEase Youdao's intelligent hardware products, including the Youdao Dictionary Pen, Translator King, and Super Dictionary, where it significantly improves computing performance and the user experience.

Open source address: https://github.com/netease-youdao/EMLL

1. On-Device AI

On-device AI has the following advantages:

  • Low latency
  • Better data privacy
  • No dependence on the network

On-device AI also faces challenges:

  • Processor computing power is limited, far below that of the cloud, so meeting the performance requirements of increasingly complex on-device AI is crucial
  • Memory capacity and bandwidth are limited, which strongly constrains performance

ARM processors dominate smart devices and are the mainstream platform for deploying on-device AI. NPUs, DSPs, and GPUs can provide higher computing power and do have on-device AI use cases, but their software ecosystems are weaker and will take time to mature.

The most time-consuming operations in on-device AI are fully connected (FC) layers and convolutions, and both reduce to matrix multiplication at the lowest level. The performance of the underlying computing library therefore plays a decisive role in whether on-device AI is feasible.
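To make the mapping concrete, here is a minimal sketch in plain C (illustrative only; the function name and layout assumptions are not taken from EMLL): a fully connected layer with batch size M, input dimension K, and output dimension N is exactly the matrix product C(M, N) = A(M, K) * B(K, N).

```c
#include <stddef.h>

/* Illustrative only: a fully connected layer written as a plain GEMM,
 * C(M,N) = A(M,K) * B(K,N).  A holds M input rows (the batch), B holds
 * the K x N weight matrix, C receives the M output rows.  Row-major
 * layout is assumed; bias and activation are omitted for brevity. */
static void fc_as_gemm(const float *A, const float *B, float *C,
                       size_t M, size_t N, size_t K)
{
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}
```

In on-device inference, M (the batch or sequence dimension) is often small while K and N are large, which produces exactly the flat matrices discussed below.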

2. Third-Party BLAS Libraries on ARM

Eigen

A C++ template library for linear algebra; matrix operations can be expressed directly with operator notation.

OpenBLAS

An open-source high-performance BLAS library maintained by the Institute of Computing Technology, Chinese Academy of Sciences. It is based on Kazushige Goto's GotoBLAS and supports both the Fortran BLAS and CBLAS interfaces.

ARM Compute Library

The computing library officially released by ARM supports common AI operations. Its matrix multiplication is wrapped as a model-inference-style layer and must be initialized before it can be called.

Table 1: Matrix multiplication features of each ARM third-party library

These libraries are well optimized for matrix multiplication at conventional sizes and perform well there, but their performance on flat (highly non-square) matrices is poor. The low-level computation in on-device AI consists mainly of flat matrix multiplications, so third-party libraries perform poorly there, leave much of the hardware's potential unused, and hinder the deployment of AI applications on device-side platforms.

Table 2: GEMM computing efficiency of third-party libraries on a quad-core ARM Cortex-A53:

Note: C(M, N) = A(M, K) * B(K, N); each value above is the better of the all-row-major and all-column-major results. The test repeats the same matrix multiplication 128 times, and computing efficiency is the measured GEMM FLOPS divided by the hardware's theoretical peak FLOPS.
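For reference, a minimal sketch of how such an efficiency figure can be computed (the matrix shape, timing, and peak-FLOPS numbers below are placeholders, not the values behind the table):

```c
#include <stdio.h>

/* GEMM efficiency = measured FLOPS / theoretical peak FLOPS.
 * A GEMM C(M,N) = A(M,K) * B(K,N) performs 2*M*N*K floating-point
 * operations (one multiply plus one add per inner-product term).
 * All numbers below are placeholders for illustration. */
int main(void)
{
    const double M = 9, N = 1024, K = 1024;   /* example "flat" shape           */
    const double repeats = 128;               /* repeated runs, as in the table */
    const double elapsed_s = 0.0123;          /* placeholder measured wall time */

    /* Placeholder peak: cores x clock x FLOPs per cycle per core. */
    const double peak_flops = 4 * 1.4e9 * 8;

    const double measured_flops = 2.0 * M * N * K * repeats / elapsed_s;
    printf("efficiency = %.1f%%\n", 100.0 * measured_flops / peak_flops);
    return 0;
}
```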

3. EMLL Features

High performance

EMLL's matrix multiplication routines are specially optimized for the flat matrices common in on-device AI and are tuned for mainstream ARM processors. For Cortex-A7/A35/A53/A55/A76, the library applies assembly-level optimizations based on each core's pipeline characteristics.

In most cases EMLL significantly outperforms third-party libraries such as Eigen and the ARM Compute Library, especially for the flat matrix multiplications common in on-device AI. The figure below shows single-precision matrix multiplication performance for some typical on-device AI matrix sizes.
Figure 1: EMLL matrix multiplication performance

Ease of use

EMLL's function interfaces aim for concise, direct parameter design. The matrix multiplication interface drops the rarely used LD* (leading dimension) parameters, and matrices and vectors are passed simply as pointers plus integer dimensions. The library does not depend on any third-party computing library.
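As a rough illustration of the difference in calling style (the simplified prototype below is hypothetical and written only to convey the idea; consult the repository headers for EMLL's actual function signatures):

```c
#include <stdint.h>

/* For comparison, a standard CBLAS call carries order/transpose flags,
 * alpha/beta scaling factors and leading dimensions lda/ldb/ldc:
 *
 *   cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
 *               M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
 *
 * A hypothetical EMLL-style prototype (illustrative only; see the
 * repository headers for the real interface) keeps just the matrix
 * pointers and the integer dimensions: */
int gemm_f32(const float *A, const float *B, float *C,
             uint32_t M, uint32_t N, uint32_t K);
```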

Extensibility

For the matrix multiplication and quantization functions, EMLL factors out the architecture-independent code into general-purpose macros, which greatly reduce the amount of code needed to support a new CPU architecture.

4. EMLL Performance Optimization Methods

Optimizing a computing library for on-device hardware must consider both memory access efficiency and computational efficiency. The following uses (dense) matrix multiplication as an example to introduce the optimization methods adopted by EMLL.

Blocking

Matrix multiplication requires frequent memory accesses. When the matrices are large, the CPU cache cannot hold all of their contents, so memory accesses frequently miss the cache and program efficiency drops. EMLL therefore decomposes the matrix multiplication problem, splitting large matrices into smaller blocks; this is the blocking method. After splitting, each subtask computes only the contribution of one small block to the result and accesses only that block's region intensively, which greatly improves the cache hit rate. For the multiplication of two large matrices, EMLL follows existing optimization work [4] and makes full use of the CPU's multi-level caches through multi-level blocking, mainly using the following two partitioning schemes:
Figure 2

L1-L3 indicate which level of CPU cache holds each matrix block.
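A minimal sketch of the blocking idea in plain C (the tile sizes and loop order are placeholders; EMLL's actual blocking factors are tuned per architecture and per cache level):

```c
#include <stddef.h>

/* Cache-blocked GEMM skeleton: the M/N/K loops are tiled so that the
 * sub-blocks of A and B touched by the inner loops fit in the CPU
 * caches.  C must be zero-initialized by the caller; tile sizes are
 * placeholders, not EMLL's actual blocking factors. */
enum { MB = 64, NB = 128, KB = 256 };

static void gemm_blocked(const float *A, const float *B, float *C,
                         size_t M, size_t N, size_t K)
{
    for (size_t k0 = 0; k0 < K; k0 += KB)
        for (size_t m0 = 0; m0 < M; m0 += MB)
            for (size_t n0 = 0; n0 < N; n0 += NB) {
                size_t kmax = (k0 + KB < K) ? k0 + KB : K;
                size_t mmax = (m0 + MB < M) ? m0 + MB : M;
                size_t nmax = (n0 + NB < N) ? n0 + NB : N;
                /* Each pass multiplies one block of A by one block of B;
                 * the data reused by the inner loops now stays in cache. */
                for (size_t m = m0; m < mmax; ++m)
                    for (size_t n = n0; n < nmax; ++n) {
                        float acc = C[m * N + n];
                        for (size_t k = k0; k < kmax; ++k)
                            acc += A[m * K + k] * B[k * N + n];
                        C[m * N + n] = acc;
                    }
            }
}
```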

The CPU's registers can be regarded as the "fastest cache". To make full use of them, EMLL splits further on top of the blocking above: the left-hand block is divided into minimal m×k tiles a1, and the right-hand block into minimal k×n tiles b1. Multiplying one such pair of minimal tiles with a direct triple loop requires 2×m×n×k element accesses; without registers, all of these are memory accesses. With registers, the two small tiles are loaded into registers before the multiplication starts, subsequent multiplications no longer touch memory, and the number of memory accesses drops to (m + n) × k.

In summary, coarse-grained blocking improves the utilization of all levels of CPU cache, while fine-grained blocking uses CPU registers to reduce the number of memory accesses; both bring clear performance benefits.
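A hedged sketch of the register-level idea for m = n = 4, written in scalar C for clarity (EMLL's real micro-kernels are written with NEON and assembly, as described later):

```c
/* 4x4 register tile (scalar sketch): the sixteen accumulators stay in
 * registers for the whole k loop.  Each iteration loads 4 elements of
 * the packed A panel and 4 of the packed B panel, i.e. (m + n) * k = 8k
 * loads in total, versus 2 * m * n * k = 32k memory accesses for the
 * naive triple loop.  a and b are assumed to be packed contiguously. */
static void kernel_4x4(const float *a, const float *b, float *c,
                       int k, int ldc)
{
    float c00 = 0, c01 = 0, c02 = 0, c03 = 0;
    float c10 = 0, c11 = 0, c12 = 0, c13 = 0;
    float c20 = 0, c21 = 0, c22 = 0, c23 = 0;
    float c30 = 0, c31 = 0, c32 = 0, c33 = 0;

    for (int p = 0; p < k; ++p) {
        float a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];  /* column of A */
        float b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];  /* row of B    */
        c00 += a0 * b0; c01 += a0 * b1; c02 += a0 * b2; c03 += a0 * b3;
        c10 += a1 * b0; c11 += a1 * b1; c12 += a1 * b2; c13 += a1 * b3;
        c20 += a2 * b0; c21 += a2 * b1; c22 += a2 * b2; c23 += a2 * b3;
        c30 += a3 * b0; c31 += a3 * b1; c32 += a3 * b2; c33 += a3 * b3;
        a += 4; b += 4;
    }
    c[0 * ldc + 0] += c00; c[0 * ldc + 1] += c01; c[0 * ldc + 2] += c02; c[0 * ldc + 3] += c03;
    c[1 * ldc + 0] += c10; c[1 * ldc + 1] += c11; c[1 * ldc + 2] += c12; c[1 * ldc + 3] += c13;
    c[2 * ldc + 0] += c20; c[2 * ldc + 1] += c21; c[2 * ldc + 2] += c22; c[2 * ldc + 3] += c23;
    c[3 * ldc + 0] += c30; c[3 * ldc + 1] += c31; c[3 * ldc + 2] += c32; c[3 * ldc + 3] += c33;
}
```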

Rearrangement

As mentioned above, to make full use of registers, each sub-matrix block is read as smaller m×k or k×n tiles (1 < m, n, k < 20), and these tiles are read one after another during the computation. Matrices are normally stored in memory in row-major or column-major order; with either layout, reading by small tiles produces many strided (jumping) accesses. Strided access hurts performance for three reasons:

  • Extra cache bandwidth is consumed: data moves between the L2/L3 caches and L1 in whole cache lines. With strided access to data held in L2/L3, only part of each cache line is used, wasting transfer bandwidth.
  • Vectorized load units cannot be fully used: many SIMD-capable CPUs provide vectorized load units that load several consecutive elements with one instruction; strided access cannot exploit this.
  • Page-table lookup overhead increases: memory accesses involve virtual-to-physical address translation via page tables, and each page table entry covers a limited address range. If the stride is too large, new entries must be looked up frequently.

When two sub-matrix blocks are multiplied, each block is typically read multiple times, and every pass can use the same order. The block of B is read multiple times whenever the block of A multiplying it has more than m rows; the block of A is read multiple times whenever the block of B has more than n columns. Following existing optimization work [4], EMLL rearranges the two blocks before the computation starts into exactly the order in which they will be read (i.e. tile by tile, as described in the previous paragraph), so that all accesses to both blocks during the computation become sequential; this is the rearrangement (packing) optimization. Rearranging elements beforehand has a cost, but the repeated, now-sequential accesses during the computation more than pay for it, yielding an overall performance gain.
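A minimal sketch of such a packing step for a panel of B (the micro-tile width of 4 is a placeholder; in practice it matches the micro-kernel's n):

```c
#include <stddef.h>

/* Pack a K x N panel of row-major B into a contiguous buffer ordered
 * exactly as the micro-kernel reads it: for each group of NR columns,
 * the K rows are laid out one after another.  After packing, every
 * repeated pass over the panel is a purely sequential read.  The
 * destination must hold K * ((N + NR - 1) / NR) * NR floats. */
static void pack_b_panel(const float *B, float *packed,
                         size_t K, size_t N, size_t ldb)
{
    const size_t NR = 4;                  /* placeholder micro-tile width */
    for (size_t n0 = 0; n0 < N; n0 += NR)
        for (size_t k = 0; k < K; ++k)
            for (size_t j = 0; j < NR; ++j) {
                size_t n = n0 + j;
                *packed++ = (n < N) ? B[k * ldb + n] : 0.0f;  /* zero-pad edge */
            }
}
```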

For matrices of particular shapes, the cost of rearrangement may exceed its benefit, so packing must be applied selectively [5]. When the number of rows M of source matrix A is small and source matrix B is large, each sub-block of B is re-read far fewer times, so the benefit of packing B shrinks sharply and can even fall below its cost. This situation is very common in on-device AI inference. EMLL therefore checks M: when M is below a threshold, it does not pack matrix B but instead adjusts the computation order so that all elements of B are read sequentially. Likewise, when the number of columns N of source matrix B is clearly small, EMLL skips packing matrix A, adjusts the computation order, and reads the elements of A in a single pass. Thanks to this special handling, EMLL significantly outperforms open-source libraries such as Eigen and OpenBLAS at these sizes.
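A hedged sketch of that decision logic (the threshold and path names are illustrative, not EMLL's actual heuristic):

```c
#include <stddef.h>

/* Illustrative dispatch only: when M is small, packing B is not
 * amortized, so a code path that streams B sequentially in its original
 * layout is chosen instead (and symmetrically for small N).  The
 * threshold value is a placeholder, not EMLL's actual heuristic. */
enum gemm_path { PACK_BOTH, SKIP_PACK_B, SKIP_PACK_A };

static enum gemm_path choose_path(size_t M, size_t N)
{
    const size_t small = 8;                 /* placeholder threshold */
    if (M < small) return SKIP_PACK_B;      /* few rows of A: blocks of B reused rarely */
    if (N < small) return SKIP_PACK_A;      /* few cols of B: blocks of A reused rarely */
    return PACK_BOTH;
}
```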

Assembly optimization

To improve computational efficiency, mainstream CPUs support Single Instruction Multiple Data (SIMD) processing, in which one instruction performs the same operation on multiple data elements. Using SIMD instructions raises data throughput without increasing instruction throughput. On ARM, the NEON instruction set provides SIMD operations.

Consider multiplying the minimal matrix tiles with m = n = 4 and k = 1 and accumulating the result. Scalar code needs 16 multiplications and 16 additions. The NEON instruction set provides fused multiply-add with broadcast, so the same work takes only 4 instructions, as shown below. Most other values of m, n, and k can also be accelerated with NEON instructions. NEON instructions can be invoked explicitly in assembly or through compiler-provided intrinsics; the latter is more readable, but its performance is less predictable.
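A hedged sketch using AArch64 NEON intrinsics from arm_neon.h (the intrinsics themselves are standard; the surrounding structure is illustrative rather than EMLL's actual kernel):

```c
#include <arm_neon.h>

/* 4x4 rank-1 update C += a * b^T with broadcast fused multiply-adds
 * (AArch64 NEON).  a holds a 4x1 column of A, b a 1x4 row of B, and
 * c0..c3 are the four columns of the 4x4 accumulator tile, kept in
 * registers.  Scalar code would need 16 multiplies and 16 additions;
 * here four vfmaq_laneq_f32 instructions do the same work, each one
 * broadcasting one lane of b across a whole column update. */
static inline void rank1_update_4x4(float32x4_t a, float32x4_t b,
                                    float32x4_t *c0, float32x4_t *c1,
                                    float32x4_t *c2, float32x4_t *c3)
{
    *c0 = vfmaq_laneq_f32(*c0, a, b, 0);  /* column 0 += a * b[0] */
    *c1 = vfmaq_laneq_f32(*c1, a, b, 1);  /* column 1 += a * b[1] */
    *c2 = vfmaq_laneq_f32(*c2, a, b, 2);  /* column 2 += a * b[2] */
    *c3 = vfmaq_laneq_f32(*c3, a, b, 3);  /* column 3 += a * b[3] */
}
```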

To save cost and power, the processors used in low-end on-device platforms usually omit out-of-order execution and execute instructions strictly in program order, e.g. ARM's Cortex-A7, A35, A53, and A55. Some of these cores can issue two adjacent instructions simultaneously while still executing in order. On such processors, if instructions have data dependencies or compete for execution units, instruction order has a significant impact on performance, and pursuing peak performance requires reordering the relevant instructions at the assembly level. Two instructions with a data dependency (for example, an arithmetic instruction whose input is the result of a load) should be placed as far apart as possible, so the pipeline does not sit idle waiting for the dependency.
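The same idea can be sketched in C with NEON intrinsics (illustrative only; EMLL performs this scheduling at the assembly level, where instruction order and register allocation are fully controlled):

```c
#include <arm_neon.h>

/* Software-pipelined inner loop for in-order cores (sketch).  Each
 * iteration issues the load for the NEXT vector of x before the fused
 * multiply-add that consumes the PREVIOUS load, so the FMA never waits
 * directly on the load it depends on.  Assumes AArch64, k >= 1, and
 * that x holds 4*k floats; acc accumulates sum_p a[p] * x[4p..4p+3]. */
static void dot_block_pipelined(const float *a, const float *x,
                                float32x4_t *acc, int k)
{
    float32x4_t xv = vld1q_f32(x);                      /* preload first vector */
    for (int p = 0; p < k - 1; ++p) {
        float32x4_t xnext = vld1q_f32(x + 4 * (p + 1)); /* load ahead           */
        *acc = vfmaq_n_f32(*acc, xv, a[p]);             /* consume earlier load */
        xv = xnext;
    }
    *acc = vfmaq_n_f32(*acc, xv, a[k - 1]);             /* drain the pipeline   */
}
```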

5. EMLL Functionality

Supported calculation functions

Table 3: Supported computation functions:

Supported architectures

armv7a, armv8a

Supported on-device operating systems

Linux, Android

6. Application Cases

The NetEase Youdao Dictionary Pen is a smart learning hardware product refined by NetEase Youdao. With efficient, accurate word lookup and rich, authoritative content, it has become a showcase for applying AI technology to learning. It offers a "multi-line scan translation" feature and supports whole-passage translation.

The NetEase Youdao Super Dictionary builds an efficient, intelligent English learning system with strong on-device functionality, offering features such as photo-based English learning, word lookup and translation, vocabulary memorization, listening practice, dialogue translation, and a voice assistant.

The NetEase Youdao Translator King supports translation among 43 languages, covering travel to 191 countries and regions; it offers online translation in 21 languages, on-device photo translation in 7 languages, and instant translation of signs and menus.

The NetEase Youdao Dictionary Pen, Super Dictionary, and Translator King all embed industry-leading AI technologies developed in-house by NetEase Youdao, such as neural machine translation (NMT), optical character recognition (OCR), automatic speech recognition (ASR), and text-to-speech (TTS), and they support offline use.

NetEase Youdao's self-developed on-device machine learning computing library is used in intelligent hardware products such as the Youdao Dictionary Pen, Super Dictionary, and Translator King, bringing the following benefits:

  • End-to-end performance is 1.3x to 2.43x that of the Eigen library, a significant gain that greatly reduces the latency of the on-device inference engines. Beyond the improvement on Youdao's smart hardware, a performance test on a Snapdragon 855 phone showed end-to-end performance 25%-55% higher than with Eigen.
  • With EMLL, the on-device inference engines can run larger AI models while remaining real-time, improving quality; for example, on-device NMT quality (BLEU) rises by 2 points and on-device ASR accuracy by 4.73%.
  • EMLL maintains real-time performance on lower-end chips. For example, on a Cortex-A7 the Eigen library cannot reach real time, while EMLL greatly reduces latency and keeps processing real-time. This gives smart hardware more chip options, reducing cost and improving market competitiveness.

Table 4: Test platforms:

Figure 3: End-to-end speedup of on-device NMT, ASR, and OCR using EMLL versus Eigen on different platforms

EMLL, a high-performance on-device machine learning computing library, has been put to practical use in many of NetEase Youdao's smart hardware products with significant results, greatly improving performance and delivering a better product experience to users.

Going forward, NetEase Youdao will continue to maintain and optimize EMLL to help more companies, research institutions, and other partners improve their on-device AI computing capabilities. Developers are welcome to try it out and share feedback.

References
[1] Eigen: http://eigen.tuxfamily.org/
[2] OpenBLAS: https://github.com/xianyi/OpenBLAS
[3] ARM Compute Library: https://github.com/ARM-software/ComputeLibrary
[4] Goto K., et al. Anatomy of High-Performance Matrix Multiplication. ACM Trans. Math. Softw., 2008, 34(3), 12:1-12:25.
[5] Frison G., et al. The BLAS API of BLASFEO: Optimizing Performance for Small Matrices. ACM Trans. Math. Softw., 2020, 46(2), 15:1-15:36.

Open source address: https://github.com/netease-youdao/EMLL

