Time-consumption analysis and optimization exploration for CANN AICPU operators

Abstract: This article uses GreaterEqual as the test operator. Its calculation logic is simple (output = input1 >= input2), so the computation itself can be reduced to a minimum and the operator's total time becomes dominated by data operations and operator dispatch.

This article is shared from the Huawei Cloud community article "CANN AICPU operator time-consuming analysis and optimization exploration", by DavilSu.

1. Analysis purpose

In actual CANN operator development, it is common for an operator to be functionally correct yet perform far worse than its TensorFlow counterpart. To investigate this problem, this article uses GreaterEqual as the test operator. Its calculation logic is simple (output = input1 >= input2), so the computation itself can be reduced to a minimum and the operator's total time becomes dominated by data operations and operator dispatch.
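For reference, a minimal sketch of the GreaterEqual semantics, for same-shape inputs with no broadcasting or parallelism (illustrative only; this is not the actual CANN kernel interface):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Tight, branch-free element-wise kernel: output[i] = (input1[i] >= input2[i]).
template <typename T>
void GreaterEqualKernel(const T* input1, const T* input2, std::uint8_t* output,
                        std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    output[i] = input1[i] >= input2[i];
  }
}

// Convenience wrapper over std::vector, assuming equal-length inputs.
template <typename T>
std::vector<std::uint8_t> GreaterEqual(const std::vector<T>& a,
                                       const std::vector<T>& b) {
  std::vector<std::uint8_t> out(a.size());
  GreaterEqualKernel(a.data(), b.data(), out.data(), a.size());
  return out;
}
```

Because each output element depends only on the corresponding input pair, the per-element cost is tiny, and overall operator time is dominated by data movement and dispatch overhead.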

2. Test code and platform introduction

The test platform is an Ascend server provided by OpenLab, equipped with an Ascend 910A; the CANN Toolkit version is 5.0.2alpha005.

The self-developed test code is modified with reference to commit cac625f243dfe7b04dbb2a82059cd0e4349f77d1, which optimizes broadcast-operation performance. The self-developed operator sets parallel thresholds of 8K elements when broadcasting is involved and 32K elements when it is not.

The TensorFlow benchmark for GreaterEqual is taken from TensorFlow 1.15, and the canndev benchmark operator is at commit d660e086717b94b8cfb3f35a8e08046ca0461772. That version of the operator uses the Eigen library's broadcast operation to work around the insufficient performance of the canndev repository's Bcast, but it does not enable parallel computing for acceleration.

For the test data, I set up two batches: one involving broadcast operations and one not. The broadcast batch is further split into two cases: the Tensor to be broadcast has exactly one element, or more than one. I tested all 8 data types supported by the TensorFlow benchmark operator: int8, int16, int32, int64, uint8, float16, float32, and float64. Each data type was tested at 14 data sizes: 128B, 256B, 1K, 2K, 4K, 8K, 16K, 32K, 64K, 128K, 256K, 1M, 2M, and 8M. The detailed correspondence between data size and shape is as follows:
[Tables: correspondence between data size and tensor shape]

3. Single-threaded performance analysis

This part tests the single-threaded data-processing performance gap between the CANN operator and the TensorFlow operator. To avoid the influence of broadcasting on the results, this test uses the data batches that do not involve broadcast operations.
Figure 1: Single-threaded time-consumption ratio

It can be seen that for small data sizes below 2K, the CANN operator has a certain performance advantage over TensorFlow, but as the data volume grows, CANN's performance degrades markedly. The uint8 type degrades most severely, by as much as 6.57x. For float16, which is not a standard C++ type, both implementations substitute the Eigen library's half type, and their results are close in performance.
Figure 2: Time taken to compute 1K of data

I also tested the per-1K-data computation time when CANN and TF process 16K-8M of data on a single core.

It can be seen that as the space occupied by a data type grows, the time consumed increases proportionally. Strangely, CANN's int8 and uint8 take about as long as int16; this also shows up in the time-consumption ratios, where int8 and uint8 degrade far more than the other data types. A likely explanation is that int8 and uint8 are widened to 16 bits before being computed. CANN's behavior on float32 and float64 is also odd: the time consumed fluctuates considerably as the data volume grows. These cases are analyzed and optimized in the vectorization and performance analysis section.

4. Performance comparison between the self-developed operator and the operator implemented in the main repository

The GreaterEqual operator in canndev's main repository uses the Eigen library's broadcast operation to work around the poor broadcast performance of the canndev repository's Bcast, but it does not enable parallel computing for acceleration. The self-developed operator uses the Bcast class from the canndev repository for broadcasting, specializes the code paths according to whether broadcasting is actually needed, and sets parallel thresholds for different data sizes.
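As a rough illustration (not the actual CANN implementation, which uses the framework's parallel-compute API rather than std::thread), the threshold-gated dispatch for the no-broadcast specialization might look like this, with the article's 32K-element threshold:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Below this element count, thread startup/communication cost would dominate.
constexpr std::size_t kThresholdNoBcast = 32 * 1024;

template <typename T>
void ComputeRange(const T* x, const T* y, std::uint8_t* out,
                  std::size_t lo, std::size_t hi) {
  for (std::size_t i = lo; i < hi; ++i) out[i] = x[i] >= y[i];
}

// No-broadcast specialization: single-threaded under the threshold,
// chunked across worker threads above it.
template <typename T>
void GreaterEqualNoBcast(const T* x, const T* y, std::uint8_t* out,
                         std::size_t n) {
  if (n < kThresholdNoBcast) {
    ComputeRange(x, y, out, 0, n);
    return;
  }
  unsigned workers = std::max(2u, std::thread::hardware_concurrency());
  std::size_t chunk = (n + workers - 1) / workers;
  std::vector<std::thread> pool;
  for (unsigned w = 0; w < workers; ++w) {
    std::size_t lo = w * chunk;
    std::size_t hi = std::min(n, lo + chunk);
    if (lo >= hi) break;
    pool.emplace_back(ComputeRange<T>, x, y, out, lo, hi);
  }
  for (auto& t : pool) t.join();
}

// Self-check: exercises both the serial path (small n) and the parallel path.
bool SelfTestDispatch() {
  std::vector<int> sx{1, 5, 7}, sy{2, 5, 9};
  std::vector<std::uint8_t> sout(3);
  GreaterEqualNoBcast(sx.data(), sy.data(), sout.data(), 3);
  if (sout != std::vector<std::uint8_t>{0, 1, 0}) return false;

  const std::size_t n = 40000;  // above the 32K threshold
  std::vector<int> x(n), y(n, 5);
  for (std::size_t i = 0; i < n; ++i) x[i] = static_cast<int>(i % 10);
  std::vector<std::uint8_t> out(n);
  GreaterEqualNoBcast(x.data(), y.data(), out.data(), n);
  for (std::size_t i = 0; i < n; ++i)
    if (out[i] != (x[i] >= y[i])) return false;
  return true;
}
```

The point of the threshold check is that for small inputs the fixed cost of launching and joining workers exceeds the computation itself, so staying single-threaded is faster.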

This part tests the two data batches (with and without broadcast operations) separately, aiming to compare the broadcast method provided by canndev against the one provided by Eigen, and to measure the performance advantage of the self-developed operator.
Figure 3: Time-consumption ratio without broadcast operations

Figure 4: Time-consumption ratio with broadcast operations

The results show that when broadcasting is not involved, the self-developed operator outperforms the existing one. For small data volumes it manipulates pointers directly, skipping the checks performed by Eigen's broadcast machinery that the existing operator goes through, which gives it a clear edge; for large data volumes, multithreading makes it far faster than the existing operator.

With broadcasting enabled, however, the parallel threshold is set at 8K, so below that the self-developed operator processes data single-threaded just like the existing one, and it is apparent that CANN's current Bcast performs worse than the broadcast implemented by Eigen. Once the data volume exceeds 8K, the advantage of multi-threaded processing makes the self-developed operator far faster than the existing one.

The broadcast operation implemented by TensorFlow has a still larger performance advantage over both Eigen's and CANN's: single-threaded, it leads Eigen's broadcast by 8-26x, and leads CANN's by even more.

5. Parallel threshold comparison

Since the reference operator is the Less operator, which has been optimized for broadcasting, I set up a control group using the same thresholds as the Less operator (2K with broadcast operations, 7K without) to verify whether its parallel thresholds are reasonable. To avoid the influence of broadcasting on the results, this test uses the data batches that do not involve broadcast operations.

The test results are as follows:
Figure 5: Time-consumption ratio of the Less operator's thresholds vs. the self-developed operator's thresholds

It can be seen that the Less operator's parallel threshold is set unreasonably: there is an obvious spike in time consumption at the 8K data size, where the cost is dominated by parallel communication rather than computation. The self-developed operator's curve is comparatively flat; its threshold was determined by bisection, and repeated tests show that the parallel speedup ratio at the critical point is close to 1.
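The bisection idea can be sketched as follows. The cost model here is synthetic (a fixed parallel startup overhead plus a per-element cost, with hypothetical constants); in practice each probe would be a real timed run of the operator:

```cpp
#include <cstddef>

// Synthetic cost models: serial cost is linear in n; parallel cost pays a
// fixed startup/communication overhead, then divides the work by the thread
// count. The constants are illustrative, not measured values.
double SerialCost(std::size_t n) { return 1.0 * n; }
double ParallelCost(std::size_t n) {
  const double kStartupOverhead = 20000.0;
  const int kThreads = 4;
  return kStartupOverhead + 1.0 * n / kThreads;
}

// Bisection for the smallest size at which going parallel pays off,
// assuming a monotone crossover: parallel loses at lo and wins at hi.
std::size_t FindThreshold(std::size_t lo, std::size_t hi) {
  while (hi - lo > 1) {
    std::size_t mid = lo + (hi - lo) / 2;
    if (ParallelCost(mid) < SerialCost(mid)) hi = mid; else lo = mid;
  }
  return hi;  // first size where the parallel version is faster
}
```

Setting the threshold at this crossover makes the speedup ratio at the critical point approach 1, which is exactly the flat-curve behavior observed for the self-developed operator.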

6. Vectorized code and performance analysis

While doing the single-threaded performance analysis, I noticed a very strange phenomenon: int8 and int16 take nearly the same time (Figure 2). When a processor handles data, the time consumed depends on whether the data is fixed-point or floating-point, on its bit width, and on which instructions are used to process it. For the same element count, int16 should take longer than int8. Observing the TensorFlow operator's execution times, its int8 and uint8 are indeed faster than its int16.

Modern processors generally support SIMD (Single Instruction, Multiple Data): by packing data into vector registers, one arithmetic instruction performs multiple element computations, achieving data-level parallelism (DLP) and accelerating data-intensive workloads. The GreaterEqual operator's computation contains no branch-selection structure and its logic is simple and repetitive, making it well suited to SIMD acceleration.

Consulting the documentation shows that the AICPU in the Ascend 910 processor is a 16-core TaiShan core, and a system query confirms that it supports the AArch64 instruction set, including the NEON extension.

I first tried embedding assembly code in the C++ implementation for manual vectorization, and performance did improve greatly. But although manual vectorization can in theory achieve the highest degree of vectorization, the SIMD extension instruction sets provided by different processors differ, and the characteristics of different applications vary widely; SIMD code is hard to read, poorly portable, and difficult to keep optimizing. Considering that the operator code may later need to run on CPUs of different architectures such as x86-64 and ARM, I ultimately chose to let the compiler automatically generate vector code for the target processor's SIMD extension. With automatic vectorization, programmers need not care about the underlying SIMD components and instruction set; they only need to express the parallelism present in the program clearly, which largely solves the portability problem of high-performance code.

Searching the canndev main repository, keywords related to vectorization appear only in TFPlugin, and the compilation options in CMakeLists.txt enable only O2 optimization. Since AICPU code is compiled with GCC, the GCC documentation shows that O2 adds the following options on top of those in O1:
[Table 3: optimization options enabled by -O2 beyond -O1]

Table 3 contains no vectorization option, so we add -ftree-vectorize (which enables both -ftree-loop-vectorize and -ftree-slp-vectorize) to CMakeLists.txt to turn on automatic vectorization. The results are as follows:
Figure 6: Single-threaded per-1K-data computation time after vectorization

Observing Figure 6, the single-threaded vectorized code performs much better. We can also see that for fixed-point or floating-point types of the same signedness, computation time grows proportionally as the bit width doubles, which matches the fixed length of the SIMD unit's vector registers: NEON's vector registers are 128-bit. The parallel threshold should therefore not be designed by element count but determined by the total size of the element data.
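That observation suggests deriving per-type element thresholds from a single byte-based threshold. A sketch (the 128 KiB figure below is an assumption, chosen so that 4-byte types land on the article's 32K-element threshold):

```cpp
#include <cstddef>
#include <cstdint>

// One byte-based threshold for all types: since the SIMD register width is
// fixed (128-bit for NEON), the crossover to profitable parallelism tracks
// total bytes processed, not element count.
constexpr std::size_t kParallelThresholdBytes = 128 * 1024;  // assumed value

// Per-type element threshold derived from the byte threshold.
template <typename T>
constexpr std::size_t ElementThreshold() {
  return kParallelThresholdBytes / sizeof(T);
}
```

With this scheme an int8 tensor parallelizes only above 128K elements while a float64 tensor parallelizes above 16K, matching the intuition that both represent the same amount of SIMD work.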
Figure 7: FP16 time-consumption ratio with vs. without a temporary conversion buffer

I tried converting the half data in the Tensor to float and storing it in a temporarily allocated float array, but performance deteriorated. The reason is that the cost of the element-wise type conversion and assignment far exceeds the performance gained from vectorization.
Figure 8: Single-threaded time-consumption ratio with vs. without vectorization
Figure 9: Multi-threaded time-consumption ratio with vs. without vectorization

Figure 9 shows that after vectorization, all C++ native data types outperform the corresponding TensorFlow operators.

Looking at Figure 10, vectorization effectively improves operator performance, but some data types are actually slower than the unoptimized version at the 128K data size. This is because the vectorized version's parallel threshold is set by total data size; a more fine-grained, per-data-type threshold could be applied here.
Figure 10: Time-consumption ratio with vs. without vectorization for broadcast operations (the Tensor to be broadcast has one element)

I also tested the special case of single-element broadcasting after vectorization. Since the broadcast machinery is not invoked and the single element's pointer is dereferenced directly, the compiler vectorizes this case correctly, so performance improves significantly as well.

Unfortunately, when a genuine broadcast is required, accessing the elements of a Tensor must go through the Bcast class's GetBroadcastXIndex and GetBroadcastYIndex methods to compute the address offset after broadcasting. This involves fairly complex computation that the compiler cannot vectorize, and the cost of allocating temporary space and copying values far exceeds the gain from vectorization, so how to optimize this process remains to be studied.
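To see why, consider a stride-style index mapping of the kind such methods must perform (an illustrative stand-in, not the actual CANN Bcast code): the per-element div/mod chain is exactly the sort of work the compiler cannot auto-vectorize.

```cpp
#include <cstddef>
#include <vector>

// Map a flat output index to the flat index of a (possibly broadcast) input,
// given the output shape and the input's strides, where a broadcast
// dimension carries stride 0. Walking dimensions from innermost to
// outermost requires a div/mod per dimension for every element.
std::size_t BroadcastIndex(std::size_t out_index,
                           const std::vector<std::size_t>& out_shape,
                           const std::vector<std::size_t>& in_strides) {
  std::size_t in_index = 0;
  for (std::size_t d = out_shape.size(); d-- > 0;) {
    std::size_t coord = out_index % out_shape[d];
    out_index /= out_shape[d];
    in_index += coord * in_strides[d];  // stride 0 => dimension is broadcast
  }
  return in_index;
}
```

Compare this with the no-broadcast case, where the input index equals the output index and the loop body is a single contiguous load that vectorizes trivially.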
[Figure 11: disassembly after enabling -ftree-vectorize]

Figure 11 shows that with -ftree-vectorize enabled, the compiler not only performs automatic SIMD optimization but also unrolls loops, which reduces loop overhead, provides instruction-level parallelism, and improves instruction-pipeline scheduling.

For the float16 data type, reading the source of Eigen 3.3.9 shows that when the computing device is the CPU, most calculations (operator/ excepted) convert to float, compute, and finally convert the result back to the half type. The code snippet is as follows:
Figure 12: operator>= definition for the half data type in the Eigen library

This implementation involves two data-type conversions, and because it does not use the ARM-native data type, it cannot be SIMD-optimized and hinders loop unrolling, so its actual efficiency is far lower than that of the other native types.

Consulting the ARM architecture's official documentation, I found that Armv8.2-A includes half-precision floating-point instructions, which avoid the conversion to single precision and thus yield higher-performance code. This means the AICPU can use the __fp16 data type for native half-precision computation. However, GCC's current FP16 support lags behind Clang's: it can only optimize operators such as Add that map almost directly onto instruction-set instructions, and for the GreaterEqual operator GCC <= 11.1 still converts to float before comparing, whereas Clang >= 9.0.0 generates the corresponding half-precision SIMD code.

But __fp16 is an Arm C language extension. On the x86-64 platform, FP16 has native support only for storage, and computation requires conversion to float; GCC 7.3 fails to compile it, while Clang succeeds. To keep the code portable, this data type is not recommended.

Is there a highly portable, high-performance solution? While reading Eigen's changelog, I found that in Eigen 3.4-rc1, released on 2021-04-19, Eigen::half is implemented with the ARM-native __fp16, and the release also improves vectorization support across all backends and the scheduling of the NEON instruction set in matrix computation.
Figure 14: Eigen update log
Figure 15: Definition of Eigen::half in Eigen 3.4.0 Half.h when the architecture is ARM64
[Figure 16: disassembly of the Eigen::half SIMD code]

Observing the disassembly in Figure 16, the compiler successfully emits fp16 SIMD instructions. The code generated for Eigen::half is essentially the same as for __fp16, and it is far more efficient than the version without SIMD and without native fp16: it not only eliminates the two type conversions but also processes more data per loop iteration (SIMD computes 8 fp16 values at a time, whereas without SIMD, even with loop unrolling, one iteration computes only 4 values with a much larger instruction count).

Since I am personally more familiar with the source code of PyTorch than with TensorFlow's, I chose PyTorch as the comparison object. PyTorch performs some manual SIMD optimization: for example, the directory aten/src/ATen/cpu/vec defines the Vectorized class and a series of common computation functions, which largely avoids embedding SIMD intrinsics directly in implementation files and degrading readability. At the same time, a series of macro definitions detect the target CPU architecture and enable that architecture's SIMD functions, further improving real vectorization performance on top of automatic vectorization.
Figure 17: Files in the PyTorch aten/src/ATen/cpu/vec/vec256 directory

7. Limitations of vectorization

Of course, enabling vectorization is not a cure-all; it has certain limitations.

  1. The vector registers of existing SIMD extensions have a fixed length. If the register is long while the loop iteration count, or the number of isomorphic statements in a basic block, is small, the program cannot be vectorized.
  2. SIMD is highly sensitive to the contiguity and alignment of data addresses. When a memory access does not fall on an aligned boundary, extra shift and merge operations are needed to assemble vector data that satisfies the SIMD extension unit, so unaligned accesses add both extra memory operations and extra special operations. Since a Tensor's logical data addresses are aligned, this has little impact on element-wise operators.
  3. Some programs have too few loop iterations, or too few parallelizable statements in a basic block, to fill the vector registers, so vectorization remains incomplete.
  4. Manual vectorization, by embedding handwritten assembly or compiler intrinsics in the operator implementation, can in theory achieve the highest degree of vectorization, but differences between processors' SIMD extension instruction sets drastically reduce portability and make continued optimization difficult; automatic vectorization, meanwhile, still has its own limitations.
  5. Loop unrolling causes a certain degree of code bloat.
  6. ARM's NEON floating-point extension does not fully implement IEEE 754-compliant floating-point arithmetic; in particular, denormal values are treated as 0. To guarantee accuracy, unless the -funsafe-math-optimizations option is enabled, GCC will not auto-vectorize NEON code involving unsafe floating-point computation, which further limits ARM SIMD performance.

8. Summary and optimization suggestions

Summary

  1. With the current compilation options of the canndev source repository, every data type shows a large performance gap versus TensorFlow at data sizes above 4K, and int8 and uint8 are abnormally slow, possibly because they are computed as 16-bit values. For float16, both canndev and TensorFlow use the Eigen library's half; the gap there is the smallest of all data types, but the ratio is still as high as 1.3x.
  2. The GreaterEqual operator in the canndev source repository currently enables neither multi-core execution nor a specialization for the no-broadcast case, so without broadcasting it performs far worse than the self-developed operator. When broadcasting of non-single-element Tensors is involved, the Eigen library's broadcast outperforms canndev's Bcast, so at small data volumes the repository's GreaterEqual beats the self-developed operator; but as data volume grows and multi-core kicks in, the self-developed operator overtakes it.
  3. The self-developed operator was designed with reference to the Less operator in the source repository; the two operators' computation logic is essentially identical. But the Less operator's parallel threshold is too low, causing an obvious time-consumption peak at the 8K data size for all data types; moving the threshold back resolves it.
  4. The main canndev repository's compilation options currently do not enable automatic vectorization. After it is turned on, code that vectorizes correctly improves greatly in performance, and with -funsafe-math-optimizations left disabled, the computation accuracy showed no significant change.
  5. We explored the vectorization of the operator code at the assembly-instruction level. The half type in Eigen < 3.4 is not implemented with ARM's native __fp16, so it cannot be vectorized; Eigen 3.4-rc1 and later implement it on top of __fp16, so SIMD instructions are emitted correctly and performance improves greatly.

Optimization suggestions

  1. Optimize the Less operator's parallel threshold so that the parallel speedup ratio at the critical data volume is as close to 1 as possible.
  2. Turn on the compiler's automatic vectorization option -ftree-vectorize to make full use of the CPU's computation capacity per clock cycle.
  3. Upgrade Eigen to version 3.4 or later, and when cross-compiling specify the corresponding ARM architecture with fp16 support enabled, e.g. -march=armv8.2-a+fp16. This enables native fp16 support on the ARM platform, letting the compiler perform SIMD optimization and loop unrolling and effectively improving Eigen::half performance on the ARM architecture.
  4. Optimize the implementation logic of Bcast. The current version relies on operator developers to manually determine whether broadcasting is required and to hand-implement three special cases (no broadcast needed, X is a single element, Y is a single element), flooding operator implementations with redundant code. Operations such as deciding whether broadcasting is needed should be abstracted away, with elements accessed through a unified interface.
  5. Optimize the element-index computation for cases that genuinely need broadcasting. The current Bcast in the repository performs far below TensorFlow's broadcast and trails Eigen's, and the current implementation of the GetBroadcastXIndex method is unfriendly to compiler optimization.
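Suggestion 4 might take a shape like the following sketch (interface names are hypothetical; std::function is used only to keep the sketch short — a production version would use a templated functor or a stride parameter so the accessor inlines and the loop stays vectorizable):

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Unified element accessor: hides whether an input is dense (same shape as
// the output) or a broadcast scalar, so operator code needs one loop, not
// three hand-written special cases.
template <typename T>
class ElementAccessor {
 public:
  // Same-shape input: identity indexing.
  static ElementAccessor Dense(const T* data) {
    return ElementAccessor([data](std::size_t i) { return data[i]; });
  }
  // Single-element input broadcast to every output position.
  static ElementAccessor Scalar(const T* data) {
    return ElementAccessor([data](std::size_t) { return data[0]; });
  }
  T operator()(std::size_t i) const { return get_(i); }

 private:
  explicit ElementAccessor(std::function<T(std::size_t)> get)
      : get_(std::move(get)) {}
  std::function<T(std::size_t)> get_;
};

// One loop serves every broadcast configuration.
template <typename T>
void GreaterEqualUnified(const ElementAccessor<T>& x,
                         const ElementAccessor<T>& y,
                         unsigned char* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = x(i) >= y(i);
}

// Minimal self-check: {1, 2, 3} >= scalar 2 -> {0, 1, 1}.
bool SelfTest() {
  const int xs[] = {1, 2, 3};
  const int s = 2;
  unsigned char out[3];
  GreaterEqualUnified(ElementAccessor<int>::Dense(xs),
                      ElementAccessor<int>::Scalar(&s), out, 3);
  return out[0] == 0 && out[1] == 1 && out[2] == 1;
}
```

The payoff is that adding a new broadcast pattern means adding one factory function, not duplicating the operator's compute loop.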

9. Conclusion

This article is merely one CANN operator developer's exploration of time-consumption analysis and optimization for AICPU operators. The analysis and optimization ideas are still rough; corrections from Huawei's experts are welcome, and I hope to have the opportunity to discuss optimization approaches with the relevant experts.

