Abstract: CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture designed specifically for AI scenarios, with the goal of improving developer efficiency and unleashing the full computing power of Ascend AI processors.

1. Introduction

In 2016, AlphaGo defeated the world's top Go players, dramatically igniting the battle of man versus machine;

by 2020, GPT-3 could write novels, write scripts, and churn out code, a veritable all-round generalist;

in 2021, the Pangu large model came closest to human-level comprehension of Chinese, with excellent generalization ability...

In recent years, the field of artificial intelligence has, as if supercharged, constantly refreshed human cognition and overturned human imagination...

Like a human mastering a new skill, training a sufficiently smart AI model often requires massive amounts of data. Take GPT-3 as an example: its parameter count reaches 175 billion, its training corpus amounts to 45 TB, and a single training run takes months. Demand for computing power has become a stumbling block on the AI track!

At the same time, as artificial intelligence applications mature, the demand for processing unstructured data such as text, images, audio, and video has grown exponentially, and data processing has gradually shifted from general-purpose computing to heterogeneous computing.

Enter Huawei's Ascend AI foundational hardware and software platform. Its Ascend AI processors plus the CANN heterogeneous computing architecture, with their inherent massive computing power and heterogeneous computing capability, form a powerful software-hardware combination that is becoming a catalyst for the rapid adoption of AI across industries.

CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture designed specifically for AI scenarios, with the goal of improving developer efficiency and unleashing the full computing power of Ascend AI processors. It supports mainstream frontend frameworks and shields users from the hardware differences across the Ascend chip series. With full-scenario coverage, a low entry barrier, and high performance, it meets developers' AI needs across the board.

2. Compatible with mainstream frontend frameworks for fast algorithm porting

Today, tools for building AI algorithm models are quite mature. Besides Huawei's open-source MindSpore, the market offers a range of deep learning frameworks for model building, including Google's TensorFlow, Facebook's PyTorch, and Caffe.

Through its plugin adaptation layer, CANN can readily take on AI models developed in different frameworks, converting models defined in each framework into a standardized Ascend IR (Intermediate Representation) graph format and shielding developers from framework differences.

In this way, developers need only minimal changes to port their algorithms, experience the surging computing power of the Ascend AI processor, and greatly reduce the cost of switching platforms. Sounds good, doesn't it?
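
To make "minimal changes" concrete, here is a rough sketch of what porting a TensorFlow 1.x sess.run script to Ascend can look like with the npu_bridge adapter that ships with the CANN toolkit. Treat it as an illustration under those assumptions rather than a complete migration guide; the model code itself stays untouched.

```python
import tensorflow as tf
from npu_bridge.estimator import npu_ops  # Ascend TF adapter (assumed installed)
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# The only addition: hook CANN's NpuOptimizer into the session config so the
# graph is converted to Ascend IR and executed on the Ascend device.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run on the Ascend AI processor
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    pass  # build and run the original model code here, unchanged
```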

3. Simple, easy-to-use development interfaces: even beginners can play with AI

Relying on artificial intelligence for intelligent transformation has become almost a required course in every industry. CANN adheres to a minimalist development philosophy and provides AscendCL (Ascend Computing Language), a simple, easy-to-use set of programming interfaces that shields developers from the differences between underlying processors. Master one set of APIs, and you can program the entire Ascend AI processor family.

At the same time, applications stay backward compatible as future CANN versions are released, with no compromise in runtime efficiency!

Simple AI application development interface

Artificial intelligence carries humanity's vision of a better future. When you face the daily soul-searching question of "what kind of trash is this, and which bin does it go in," an AI garbage-sorting application can rescue you from despair.

AscendCL provides a set of C-language APIs for developing deep neural network inference applications, with capabilities such as runtime resource management, model loading and execution, and image preprocessing, letting developers easily unlock image classification, object recognition, and other kinds of AI applications. It can be invoked indirectly through mainstream open-source frameworks, or you can directly call the AscendCL programming interfaces that CANN exposes.

Here are the five steps to building the AI garbage classification application (a code sketch follows the list):

  1. Runtime resource initialization: initialize the internal resources of the system.
  2. Model loading and output memory preparation: convert the open-source model into the .om format supported by CANN and load it into memory; query the model's basic information and allocate the model's output memory in preparation for inference.
  3. Data preprocessing: preprocess the image data that was read in, then construct the model's input data.
  4. Model inference: run inference using the constructed input data.
  5. Result parsing: parse the model's inference result from its output.
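
The sketch below walks through those five steps using pyACL, the Python binding of AscendCL, since Python keeps the example short; the C API follows the same flow. The model path, device ID, and the zero-filled stand-in image are assumptions for illustration, and error checking is elided.

```python
import acl
import numpy as np

ACL_MEM_MALLOC_HUGE_FIRST = 0  # memory allocation policy
ACL_MEMCPY_HOST_TO_DEVICE = 1  # memcpy direction

DEVICE_ID = 0
MODEL_PATH = "./garbage_classify.om"  # hypothetical .om produced by the ATC tool

# Step 1: runtime resource initialization
ret = acl.init()
ret = acl.rt.set_device(DEVICE_ID)
context, ret = acl.rt.create_context(DEVICE_ID)

# Step 2: load the model, query its description, allocate output memory
model_id, ret = acl.mdl.load_from_file(MODEL_PATH)
model_desc = acl.mdl.create_desc()
ret = acl.mdl.get_desc(model_desc, model_id)
outputs = acl.mdl.create_dataset()
for i in range(acl.mdl.get_num_outputs(model_desc)):
    size = acl.mdl.get_output_size_by_index(model_desc, i)
    dev_ptr, ret = acl.rt.malloc(size, ACL_MEM_MALLOC_HUGE_FIRST)
    _, ret = acl.mdl.add_dataset_buffer(outputs, acl.create_data_buffer(dev_ptr, size))

# Step 3: preprocessing (stand-in: a zero-filled buffer of the right input size)
in_size = acl.mdl.get_input_size_by_index(model_desc, 0)
host_img = np.zeros(in_size, dtype=np.uint8)
in_ptr, ret = acl.rt.malloc(in_size, ACL_MEM_MALLOC_HUGE_FIRST)
ret = acl.rt.memcpy(in_ptr, in_size, acl.util.numpy_to_ptr(host_img),
                    in_size, ACL_MEMCPY_HOST_TO_DEVICE)
inputs = acl.mdl.create_dataset()
_, ret = acl.mdl.add_dataset_buffer(inputs, acl.create_data_buffer(in_ptr, in_size))

# Step 4: model inference
ret = acl.mdl.execute(model_id, inputs, outputs)

# Step 5: parse the result (copy outputs back to host, take the argmax, ...)
# ...then release datasets/buffers and tear everything down:
ret = acl.mdl.unload(model_id)
ret = acl.rt.destroy_context(context)
ret = acl.rt.reset_device(DEVICE_ID)
ret = acl.finalize()
```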

Flexible operator development interface

When your AI model contains operators that CANN does not yet support, or you want to modify an existing operator to improve performance, you can use CANN's open custom-operator development interfaces to build exactly the operators you need.

For AI developers at different levels, CANN provides an efficient mode (TBE-DSL) and a professional mode (TBE-TIK) of operator development, flexibly meeting the needs of developers of different skill levels.

Of the two, TBE-DSL has the lower barrier to entry: it handles data tiling and scheduling automatically, so developers only need to focus on the operator's own computation logic and can write high-performance operators without understanding the hardware details.
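
As a taste of the DSL mode, here is a minimal sketch of an element-wise Add operator, modeled on the standard TBE-DSL sample. It assumes the te package from the CANN toolkit; the function and kernel names are illustrative.

```python
import te.lang.cce
from te import tvm

def add_custom(input_x, input_y, output_z, kernel_name="add_custom"):
    shape = input_x.get("shape")
    dtype = input_x.get("dtype").lower()

    # Declare the input tensors.
    data_x = tvm.placeholder(shape, name="data_x", dtype=dtype)
    data_y = tvm.placeholder(shape, name="data_y", dtype=dtype)

    # The computation logic is all the developer writes...
    res = te.lang.cce.vadd(data_x, data_y)

    # ...while tiling and scheduling are generated automatically.
    with tvm.target.cce():
        schedule = te.lang.cce.auto_schedule(res)

    config = {"name": kernel_name, "tensor_list": [data_x, data_y, res]}
    te.lang.cce.cce_build_code(schedule, config)
```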

TBE-TIK is relatively harder. Unlike TBE-DSL, which offers only high-level abstract programming, it provides instruction-level programming and tuning capability: developers issue instruction-level calls by hand, which taps the hardware's full potential and enables more efficient, more complex operators.
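
For contrast, a minimal TIK sketch (again assuming the te package; names are illustrative) that moves 128 float16 values from global memory through the on-chip unified buffer and back, the skeleton onto which explicit vector instructions would be added:

```python
from te import tik

def copy_custom(kernel_name="copy_custom"):
    tik_instance = tik.Tik()
    data_a = tik_instance.Tensor("float16", (128,), name="data_a", scope=tik.scope_gm)
    data_b = tik_instance.Tensor("float16", (128,), name="data_b", scope=tik.scope_gm)
    data_ub = tik_instance.Tensor("float16", (128,), name="data_ub", scope=tik.scope_ubuf)

    # Hand-written data movement: one burst group of 8 x 32-byte bursts
    # (128 float16 values), global memory -> unified buffer -> global memory.
    tik_instance.data_move(data_ub, data_a, 0, 1, 8, 0, 0)
    tik_instance.data_move(data_b, data_ub, 0, 1, 8, 0, 0)

    tik_instance.BuildCCE(kernel_name=kernel_name, inputs=[data_a], outputs=[data_b])
    return tik_instance
```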

Convenient IR composition interface

In addition, developers can set the deep learning framework aside entirely: using the standardized Ascend IR (Intermediate Representation) interfaces, they can call CANN's operator libraries directly and construct high-performance models that execute on the Ascend AI processor.

4. 1200+ high-performance operators, building the source of surging computing power

A model built on a deep learning framework is actually composed of individual computing units. We call these computing units operators (Operator, or Op for short); each operator corresponds to a specific piece of computation logic.

Accelerating operators on hardware is the foundation and core of accelerating neural networks. CANN currently provides 1200+ deeply optimized, hardware-friendly operators. It is this rich set of high-performance operators that builds the surging source of computing power and lets your neural network accelerate "in an instant."

  • NN (Neural Network) operator library : covers the computation types of commonly used deep learning algorithms and adapts to the TensorFlow, PyTorch, MindSpore, and ONNX frameworks; it accounts for the largest share of all operators in CANN. Users only need to pay attention to the details of their algorithm; in most cases there is no need to develop and debug operators themselves.
  • BLAS (Basic Linear Algebra Subprograms) operator library : BLAS is the collection of basic linear algebra routines, a numerical library for basic vector and matrix operations. CANN supports general matrix multiplication as well as basic Max, Min, Sum, and multiply-add operations.
  • DVPP (Digital Vision Pre-Processing) operator library : provides high-performance video encoding/decoding, image encoding/decoding, and preprocessing capabilities such as image cropping and scaling.
  • AIPP (AI Pre-Processing) operator library : mainly implements image resizing, color-space conversion (image format conversion), and mean subtraction / scaling-factor multiplication (image normalization), fusing these with the model inference pipeline to satisfy the model's input requirements.
  • HCCL (Huawei Collective Communication Library) operator library : mainly provides broadcast, allreduce, reducescatter, allgather, and other collective communication functions for single-node multi-card and multi-node multi-card setups, delivering efficient data transfer for distributed training.

5. High-performance graph compiler, giving neural networks superpowers

The hardest thing in the world is waiting: waiting for the traffic light, waiting for winter and summer vacation, waiting for takeout, waiting for the right person...

The same holds in the field of artificial intelligence. As neural network structures evolve rapidly, manual optimization runs into bottlenecks more and more often when solving AI model performance problems. CANN's graph compiler is like a magician: it compiles and optimizes the highly abstract computation graph according to the hardware characteristics of the Ascend AI processor so that it executes efficiently.

What "god operations" does a magician have?

Automatic operator fusion : fuses automatically along operator, subgraph, scope, and other dimensions, effectively reducing the number of compute nodes and greatly shortening computation time.

Buffer fusion : targets the high data throughput and memory-bound nature of neural network computing; by reducing the number of data transfers and raising the utilization of the Ascend AI processor's on-chip buffer, it improves computing efficiency.

Let's compare before and after buffer fusion:

Before fusion, when operator 1 finishes computing on the Ascend AI processor, its data is moved from the on-chip buffer out to external storage; operator 2 then fetches the data from external storage as its input and moves it back into the buffer for computation. After fusion, operator 1's result stays in the buffer, and operator 2 reads it from the buffer directly, effectively reducing the number of data moves and improving computing performance.

Whole-graph sinking : the Ascend AI processor integrates rich on-chip computing resources such as the AI Core, AI CPU, DVPP, and AIPP. Thanks to this rich soil, CANN can sink not only the compute portion of a graph onto the Ascend AI processor for acceleration, but also the control flow, DVPP, and communication portions. Especially in training scenarios, this ability to execute a logically complex computation graph entirely on the AI processor, as a closed loop, effectively reduces interaction time with the host CPU and improves computing performance.

Heterogeneous scheduling : when the computation graph contains multiple types of compute tasks, CANN makes full use of the Ascend AI processor's rich heterogeneous computing resources, dispatching tasks to the appropriate computing units to achieve parallel computation, raising the utilization of each computing unit and ultimately improving the overall efficiency of the computing task.

6. Automatic mixed precision, effectively balancing performance gains and accuracy

As the name implies, automatic mixed precision is a technique that automatically mixes half-precision and single-precision arithmetic to accelerate model execution. It plays an indispensable role in large-scale model training scenarios.

Single precision (FP32, float32) is the floating-point type commonly used in computers, while half precision (FP16, float16) is a newer floating-point type that occupies 2 bytes (16 bits) of storage and suits scenarios with lower precision requirements.

Using FP16 obviously introduces some loss of computational precision, but in deep learning training, not every computation demands high precision. Letting the precision-insensitive operators in the computation graph compute in FP16 accelerates them and markedly reduces memory usage, achieving a balance between performance and accuracy (see the sketch below).
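
On Ascend, mixed precision can be switched on through the graph compiler's precision_mode option; "allow_mix_precision" is the documented mode that lets CANN downcast precision-insensitive operators to FP16 on its own. A rough sketch for TensorFlow 1.x, reusing the NpuOptimizer hook from the porting example above (assumed adapter, illustrative only):

```python
import tensorflow as tf
from npu_bridge.estimator import npu_ops  # Ascend TF adapter (assumed installed)

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Let the graph compiler decide, operator by operator, what can run in FP16.
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")

with tf.Session(config=config) as sess:
    pass  # train as usual; precision-insensitive operators now run in FP16
```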

7. Exascale clusters, opening the era of AI supercomputing

As the problems that mainstream deep learning models solve grow increasingly complex, and as model complexity itself keeps climbing, the field of artificial intelligence needs ever stronger computing power to meet the training needs of future networks.

The "Pengcheng Cloud Brain II" based on the basic hardware and software of Ascend AI has broken the computing power ceiling of 100 P-level FLOPS (equivalent calculations per second) in the industry today, allowing E-level FLOPS (equivalent calculations per second) ) The computing power scene has entered the stage of history.

It integrates thousands of Ascend AI processors, delivering a total computing power of 256 to 1024 PFLOPS, that is, 2.56 × 10^17 to 1.024 × 10^18 floating-point operations per second.

How to schedule thousands of Ascend AI processors efficiently is the hard problem such large-scale clusters face.

CANN integrates HCCL (Huawei Collective Communication Library), providing a high-performance collective communication solution for data-parallel and model-parallel multi-node, multi-card training on Ascend AI processors (a usage sketch follows the list):

  1. Within a server, cards are interconnected by a high-speed HCCS mesh; between servers, non-blocking RDMA networking is used. This two-level topology, combined with topology-adaptive communication algorithms, makes full use of link bandwidth and splits inter-server data transfers in parallel across independent network planes, greatly improving the linearity of model training on ultra-large clusters.
  2. An integrated high-speed communication engine and a dedicated hardware scheduling engine greatly reduce communication scheduling overhead, schedule communication tasks and compute tasks in a unified, coordinated manner, and precisely control system jitter.
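
From the developer's side, HCCL mostly stays under the hood. Here is a rough sketch of kicking off data-parallel training on multiple Ascend cards with MindSpore (assuming a MindSpore build of the same era and a cluster environment prepared in advance); HCCL then carries the gradient allreduce automatically:

```python
from mindspore import context
from mindspore.communication.management import init, get_rank
from mindspore.context import ParallelMode

# Target the Ascend backend and bring up HCCL collective communication.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()

# Declare data parallelism; gradients are averaged across cards via HCCL.
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
print("HCCL is up, this process is rank", get_rank())
# ...build the network and call Model.train() as usual.
```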

If "Pengcheng Yunnao II" is compared to a large symphony orchestra, then CANN is an excellent conductor, working hand in hand with the rising AI processor to open a new chapter in the AI supercomputing era.

8. A few words at the end

CANN has kept striving for breakthroughs since its release in 2018. Bringing developers a minimalist experience and unleashing the ultimate performance of AI hardware have become the two legs on which CANN walks through the field of artificial intelligence.

We believe it will stay committed to the AI track, joining hands with everyone who wants to change the world, to change it together and build the future together!

At the end of 2021, CANN will usher in a new and more powerful version 5.0. What surprises will it bring? Let us wait and see!


