Abstract: In December 2021, CANN 5.0 will be officially released. Through software-hardware co-optimization, this version doubles training performance and demonstrates "China speed" in the AI field.
This article is shared from the Huawei Cloud community post "A First Look at CANN 5.0's Hard-Core Technology", by kourei.
Introduction
In September 2018, CANN 1.0, Huawei's Ascend AI enabling platform, was born;
In August 2020, CANN 3.0 was released. As a heterogeneous computing architecture built specifically for AI scenarios, it bridges upper-level deep learning frameworks and the underlying AI hardware platform, delivers industry-leading development efficiency and performance, and supports users' artificial intelligence computing needs across the board.
Over the past year, CANN has partnered with 200+ universities and research institutes to keep advancing AI research;
With the support of the CANN architecture, the Pangu AI model, with hundreds of billions of parameters, has delivered unprecedented commercial value;
The number of developers in the Ascend community has grown from 100,000 to 400,000, and the ecosystem is thriving...
In December 2021, CANN 5.0 will officially meet you. Through software-hardware co-optimization, this version doubles training performance and shows "China speed" in the AI field!
Let's drop a few teasers first so everyone can take a quick look!
Core technologies deliver extreme performance
Compared with version 3.0, CANN 5.0 achieves a 30% to 140% performance improvement in typical inference scenarios, and doubles performance in large-scale cluster training and common model training.
What are the key technologies behind the significant performance improvement of CANN 5.0?
Automatic task pipelining
If data loading takes too long at the start of a computation, it blocks the launch of the subsequent compute pipelines, which is as unacceptable as a phone that will not power on until its battery has charged to 20%.
CANN 5.0 runs compute instructions and data loading in parallel across multiple pipelines. Users can split the data to be loaded into segments: as soon as one segment has finished loading, the corresponding computation starts immediately while the remaining data keeps loading; when the next segment is ready and a pipeline is idle, its computation is launched in turn. This fully exploits the multi-pipeline parallelism of the Ascend AI processor and keeps the pipelines seamlessly connected.
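To make the idea concrete, here is a minimal, framework-agnostic Python sketch of segmented loading overlapped with computation. The segment loader and compute function are hypothetical placeholders, not CANN APIs; a real pipeline would issue these on device queues.

```python
# Sketch: overlap segmented data loading with computation (double buffering).
from concurrent.futures import ThreadPoolExecutor

def load_segment(idx):
    # Stand-in for loading one segment of the input from host storage.
    return [idx] * 1024

def compute_segment(segment):
    # Stand-in for the compute kernel launched on one loaded segment.
    return sum(segment)

def pipelined_run(num_segments):
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        next_load = loader.submit(load_segment, 0)            # preload segment 0
        for i in range(num_segments):
            segment = next_load.result()                      # wait only for this segment
            if i + 1 < num_segments:
                next_load = loader.submit(load_segment, i + 1)  # load next in background
            results.append(compute_segment(segment))          # compute overlaps the next load
    return results

print(pipelined_run(4))
```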
Deep operator fusion
As network structures grow more complex, the performance overhead of moving data between on-chip and external memory, and of issuing multiple instructions for multiple operators, can no longer be ignored.
Building on 3.0, CANN 5.0 recognizes more fusion scenarios, reduces the number of compute nodes through automatic multi-operator fusion, and effectively cuts memory copies. With flexible, customizable fusion rules, the operators in the computation graph can be fused to the greatest possible extent, winning developers additional compute performance.
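As a rough illustration of why fusion helps (not CANN's actual fusion pass), the sketch below contrasts an unfused chain of element-wise operators, which materializes intermediate buffers, with the same math expressed as a single fused node:

```python
import numpy as np

x = np.random.rand(1_000_000).astype(np.float32)
scale, bias = np.float32(0.5), np.float32(1.0)

# Unfused: three separate "operators", each reading and writing a full
# intermediate buffer in memory, like three separate graph nodes.
def scale_bias_relu_unfused(x):
    t1 = x * scale
    t2 = t1 + bias
    return np.maximum(t2, 0)

# Fused: the same chain expressed as one node. NumPy still creates temporaries
# internally, but a genuinely fused kernel would keep each element in
# registers / on-chip memory and write the result to memory only once.
def scale_bias_relu_fused(x):
    return np.maximum(x * scale + bias, 0)

assert np.allclose(scale_bias_relu_unfused(x), scale_bias_relu_fused(x))
```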
Adaptive gradient segmentation
Large-scale cluster training typically requires thousands of iterations, and each iteration includes layer-by-layer forward and backward computation.
Most synchronous update algorithms require that gradients be synchronized across compute nodes to complete the weight update before the next forward pass begins, leaving a waiting gap between iterations known as the communication tail.
CANN 5.0 uses an intelligent gradient segmentation algorithm to automatically search for the optimal way to split the gradient parameters, choosing the right communication timing and volume for each gradient transfer. This maximizes the overlap of computation and communication, minimizes the communication tail, and drives cluster training toward optimal performance.
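The general idea of overlapping gradient communication with the backward pass can be sketched as follows. The bucket size and the start_allreduce() call are illustrative placeholders, not CANN's actual segmentation algorithm, which searches for the split and timing automatically.

```python
# Sketch: as the backward pass produces gradients layer by layer (last layer
# first), each filled "bucket" is sent for allreduce immediately, so the
# communication overlaps the remaining backward computation.

BUCKET_SIZE = 2  # layer gradients per communication bucket (hypothetical)

def start_allreduce(bucket):
    # Stand-in for an asynchronous collective; returns a handle to wait on.
    print(f"allreduce launched for layers {bucket}")
    return lambda: print(f"allreduce finished for layers {bucket}")

def backward_with_overlap(num_layers):
    handles, bucket = [], []
    for layer in reversed(range(num_layers)):   # backward pass: last layer first
        # ... compute the gradient of `layer` here ...
        bucket.append(layer)
        if len(bucket) == BUCKET_SIZE:          # bucket full: communicate now
            handles.append(start_allreduce(bucket))
            bucket = []
    if bucket:                                  # flush the final partial bucket
        handles.append(start_allreduce(bucket))
    for wait in handles:                        # only the tail is actually waited on
        wait()

backward_with_overlap(num_layers=5)
```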
AutoTune intelligent compute tuning
Just as the same beauty filter cannot flatter every face, applying one simple data tiling strategy to every network often leaves the compute units underutilized, and performance falls short of expectations.
CANN 5.0 uses intelligent data tiling to tailor an optimal tiling strategy for each network, keeping individual compute units fully loaded, making full use of hardware resources, and bringing considerable performance gains.
Meanwhile, to address the time cost of tuning, CANN 5.0 presets a large number of model optimization rules, which greatly shortens tuning time and gives users a smooth tuning experience.
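To give a feel for what a tiling search does, here is a toy sketch. The capacity, candidate tile sizes, and cost model are all illustrative and far simpler than the real AutoTune, which searches a much richer space per network and per operator.

```python
# Toy tiling search: pick, from a few candidate tile sizes, the one that keeps
# a hypothetical compute unit busiest while minimizing the number of launches.

TOTAL_ELEMENTS = 10_000
TILE_CAPACITY = 512            # elements the compute unit can hold per step

def cost(tile_size):
    full, rem = divmod(TOTAL_ELEMENTS, tile_size)
    steps = full + (1 if rem else 0)                     # launches needed to cover the data
    utilization = TOTAL_ELEMENTS / (steps * tile_size)   # useful fraction of launched work
    return (-utilization, steps)                         # prefer high utilization, then few launches

candidates = [t for t in (64, 100, 256, 500, 512) if t <= TILE_CAPACITY]
best = min(candidates, key=cost)
print(f"best tile size: {best}")
```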
Lowering the barrier for developers
Beyond the performance surprises, CANN 5.0 further simplifies code development and debugging on top of 3.0, helping developers work on AI more efficiently.
• Supports automatic model migration: no manual code changes are needed; migrate a model with one click and immediately enjoy the surging compute power of the Ascend 910 AI processor.
• Supports mixed programming: call operator functions directly in the application, with compilation, loading, and execution completed automatically.
• Supports automatic generation of operator test code, with one-click execution and result verification (a generic sketch of such a test follows below).
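As an illustration of what an operator test typically covers (a generic sketch, not the code CANN actually generates), the example below runs a stand-in operator implementation against a higher-precision NumPy reference and compares results within a tolerance:

```python
import numpy as np

def my_add_operator(a, b):
    # Stand-in for the operator under test (e.g. a kernel launched on the NPU).
    return a + b

def test_operator(shape=(8, 16), dtype=np.float16, rtol=1e-3, atol=1e-3):
    rng = np.random.default_rng(0)
    a = rng.standard_normal(shape).astype(dtype)
    b = rng.standard_normal(shape).astype(dtype)
    actual = my_add_operator(a, b)
    golden = a.astype(np.float32) + b.astype(np.float32)   # reference in higher precision
    np.testing.assert_allclose(actual.astype(np.float32), golden, rtol=rtol, atol=atol)
    print("operator test passed")

test_operator()
```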
Enabling super-large models, accelerating innovation
Support for large parameter models
In the past two years, the industry has produced many large models, such as GPT-3 with as many as 175 billion parameters. Storing a single model of that scale requires around 3 TB of space, and its demand for compute power is even more staggering.
To make such a model "fit", while letting users keep their original code unchanged, CANN 5.0 works at the AI compiler level to parallelize model training across every dimension: optimizer states, gradients, weights, and more.
By parallelizing the model at these different levels, a model that previously could not fit is deployed across the cluster in a distributed fashion and trained with high compute utilization. Taking the 8.3-billion-parameter Megatron model as an example, per-card memory demand drops from roughly 180 GB to under 16 GB, so the super-large model finally "fits".
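A back-of-the-envelope estimate shows the mechanism. The byte counts below assume FP16 weights and gradients plus FP32 Adam optimizer states, and ignore activations and workspace memory, so the totals do not match the quoted figures exactly; this is not CANN's actual partitioning scheme.

```python
# Rough per-card memory for training with and without sharding the weights,
# gradients, and optimizer states across devices.
BYTES_PER_PARAM = 2 + 2 + 12     # FP16 weights + FP16 grads + FP32 Adam states
PARAMS = 8.3e9                   # ~8.3B-parameter model (Megatron example)
GIB = 1024 ** 3

def per_card_gib(num_devices, sharded):
    total = PARAMS * BYTES_PER_PARAM
    return total / (num_devices if sharded else 1) / GIB

print(f"unsharded: {per_card_gib(8, sharded=False):.0f} GiB per card")
print(f"sharded over 8 cards: {per_card_gib(8, sharded=True):.1f} GiB per card")
```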
Support for ultra-large image computation
In addition, some application scenarios face the challenge of extremely large input data.
For example, remote sensing applications often need to locate a ship in a vast ocean or an aircraft in a vast sky. As observation technology advances, the spatial resolution of these remote sensing images keeps rising: an average sample can reach CHW dimensions of 4 × 30000 × 30000 or even higher, and a single sample is often 2-3 GB. Computing on such ultra-large images has become a chokepoint for the remote sensing industry.
CANN 5.0 has helped Wuhan University build LuojiaNet, the world's first framework dedicated to remote sensing, to solve the "large-format, multi-channel" processing problem of remote sensing imagery. Experiments show that the FCN8S model achieves a significant accuracy improvement when processing remote sensing datasets (with image resolution of 30,000 × 30,000). Several key technologies are hidden behind this:
- The image is too large and device memory is not enough. What to do?
Make full use of the cluster: automatically tile the image according to the data volume and the cluster size, and distribute the tiles to the compute nodes (see the sketch after this list).
- Features span tile boundaries, get lost, or are distorted at the edges. What to do?
Before convolving the current tile, overlap (halo) data from the adjacent tiles is automatically computed to provide context for the current tile, preserving accuracy across the whole image.
- How can the overlap data be exchanged efficiently?
Rely on the efficient alltoallv operator to send and receive data between adjacent nodes, achieving non-blocking communication.
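Putting these ideas together, here is a simplified single-process NumPy sketch of tiling with halo regions. The halo width, tile grid, and box-filter kernel are illustrative; in the real distributed setting each tile lives on a different node and the halos are exchanged with an alltoallv-style collective rather than sliced out of a shared array.

```python
import numpy as np

HALO = 1                                  # one pixel of context for a 3x3 kernel
image = np.random.rand(8, 8).astype(np.float32)
kernel = np.ones((3, 3), dtype=np.float32) / 9.0   # simple box filter

def conv2d_valid(x, k):
    # Plain "valid" 2-D convolution used both per-tile and as the reference.
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

padded = np.pad(image, HALO)              # zero padding at the true image border
tiles_per_side = 2
tile = image.shape[0] // tiles_per_side
rows = []
for ti in range(tiles_per_side):
    row = []
    for tj in range(tiles_per_side):
        r0, c0 = ti * tile, tj * tile     # tile origin in the unpadded image
        # Tile plus halo context from neighbouring tiles (or zero padding at edges).
        tile_with_halo = padded[r0:r0 + tile + 2 * HALO, c0:c0 + tile + 2 * HALO]
        row.append(conv2d_valid(tile_with_halo, kernel))
    rows.append(np.hstack(row))
stitched = np.vstack(rows)

# The tiled result matches convolving the whole padded image at once.
assert np.allclose(stitched, conv2d_valid(padded, kernel), atol=1e-5)
print("tiled convolution matches full-image convolution")
```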
Relying on automatic decomposition and parallelization, CANN 5.0 makes working with super-large models as simple as working with ordinary ones. With the help of CANN 5.0, the AI industry can be expected to keep accelerating innovation and usher in a new period of explosive growth.
ModelZoo fully supports mainstream models in the industry
ModelZoo is the curated model library provided by Ascend, and the models it hosts run directly and efficiently on Ascend AI processors. CANN 5.0 now fully supports 400+ mainstream industry models across TensorFlow, PyTorch, and ONNX, and operator coverage has been greatly improved.
Developers can head over to the community to try it out.
Building the ecosystem together
As a foundational software platform for artificial intelligence, CANN keeps making breakthroughs in basic capabilities and key technologies, but going further is only possible by working with everyone. Over the past year, CANN's developer ecosystem has grown across the board:
To date, activity in the Ascend community has tripled compared with last year. The community has gathered 400,000 developers, including 3,000 core developers, and plans to grow to one million developers and 10,000 core developers in 2022. CANN has cooperated with more than 200 university research teams, and the Smart Intelligence project has contributed 200+ models and 500+ operators.
Gathered together, we are a flame; ecosystem building is the driving force behind the sustainable development of the AI industry. Through openness, cooperation, and win-win collaboration, CANN will continue to work with partners to support the AI industry in an all-round, multi-dimensional way and help artificial intelligence flourish!
Click Follow to be the first to learn about Huawei Cloud's latest technologies~