
1 The development trend of AI computing power

1.1 Artificial Intelligence Theory: Deep Learning

The development of artificial intelligence has not been smooth sailing. From its early stages to the current deep learning era, data, algorithms and computing power have been the three basic elements of artificial intelligence, jointly pushing it toward higher levels of perception and cognition.

1.2 Representatives of the third artificial intelligence wave

As mentioned above, the current prosperity of artificial intelligence is inseparable from the joint development of data, algorithms and computing power. At the algorithm level, the three giants of deep learning, Geoffrey Hinton, Yann LeCun and Yoshua Bengio, made landmark contributions to the field: they reshaped AI around neural networks.

At the data level, Fei-Fei Li created ImageNet, the world's largest image recognition database, in 2007, which made people realize the importance of data to deep learning. It was the ImageNet recognition competition that gave birth to classic deep learning models such as AlexNet, VGGNet, GoogLeNet and ResNet.

After its previous boom, artificial intelligence fell into a slump. One of the core reasons was that the available computing power could not support complex algorithms, while simple algorithms performed poorly. At the computing power level, the GPUs launched by NVIDIA, founded by Jensen Huang, eased the training bottleneck of deep learning algorithms and unlocked the new potential of artificial intelligence.

1.3 Computing power is productivity

In the age of intelligence, computing power is productivity. Productivity is humanity's ability to transform nature and to create value. Comparing the world's most valuable companies ten years ago and today reveals an interesting pattern.

Ten years ago, most of the world's highest-valued companies were energy and financial companies; the only IT company among them was Microsoft. At that time, Windows was booming and Office dominated the market: it was the era of the personal PC.

Today, the most valuable companies in the world are almost all information technology and service companies. But the interesting part is not that: these top companies also happen to be the world's largest purchasers of servers. Amazon alone bought 13% of the world's cloud servers in 2017. It is massive computing power that creates value for these companies.

What is true for businesses is also true for countries. Computing power is to the age of intelligence what electricity was to the age of electrification: both are fundamental forms of productivity.

We can therefore gauge a country's economic development through its computing power, just as electricity consumption in the Keqiang Index measures the development of an industry. Statistics show an obvious positive linear correlation between national GDP and server shipments, and between GDP and server purchases.

The US and China are not only far ahead of Japan and Germany in GDP; their server counts per trillion of GDP are also much higher, as is the contribution of the digital economy to their economies.

The situation across China's provinces is similar. Server shipments per trillion of GDP in Beijing, Shanghai, Guangdong and Zhejiang are much larger than in other provinces; accordingly, these regions are converting old growth drivers into new ones faster, and the quality of their development leads the rest. We can therefore say that computing power has become an important indicator of the level of social and economic development.

In the face of exponentially increasing computing demand, computing technologies, products and industries also face new challenges, reflected in three aspects. The first is the challenge of diversification, that is, the growing complexity of computing scenarios and the diversity of computing architectures. The second is the challenge of gigantic scale, that is, the pressure that huge models, huge computing power requirements and massive applications place on existing computer architectures.

The last is the challenge of ecosystem building. Simply put, intelligent computing today is in a stage where many players coexist with self-contained, fragmented ecosystems, while the upstream and downstream of the industrial chain remain disconnected.

The first challenge is diversification.

The most critical task of computing is to support the business, so different business types inevitably require different computing systems. For example, traditional scientific computing such as seismic wave simulation demands high numerical precision, up to 64-bit; AI training can use 16-bit floating-point types, which trade precision for a large numerical range; AI inference, which prioritizes speed and low energy consumption, can run at even lower precision, such as 4-bit, or even 2-bit and 1-bit integer types.

In other words, AI applications have introduced new computing types, and the span from inference to training keeps widening. At the same time, data volumes have grown from the GB level to the TB and PB levels, and data types have shifted from structured to semi-structured and unstructured, becoming more complex and diverse.

Different numerical precisions place different requirements on the instruction set and architecture of the computing chip. This is one of the increasingly important reasons why the general-purpose CPUs we have long relied on can no longer meet the requirements of these diversified computing scenarios.
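To make the precision spectrum concrete, here is a minimal sketch (assuming NumPy is available; the element count is purely illustrative) comparing the memory footprint of the same tensor stored at the precisions mentioned above:

```python
import numpy as np

# One billion elements, roughly the scale of a mid-sized model's weights.
n = 1_000_000_000

# FP64 for scientific computing, FP16 for AI training, INT8 as a stand-in for
# low-precision inference formats. NumPy has no 4/2/1-bit types; those need
# packed representations and specialized hardware support.
for dtype in (np.float64, np.float32, np.float16, np.int8):
    gib = n * np.dtype(dtype).itemsize / 2**30
    print(f"{np.dtype(dtype).name:>8}: {gib:6.2f} GiB")
```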

The second challenge is gigantic scale, which first manifests itself in the huge number of model parameters and the huge amount of training data.
Taking natural language processing as an example, since the rise of pre-trained models based on self-supervised learning, model accuracy has increased significantly with model size and training data volume.

In 2020, the parameter count of the GPT-3 model exceeded the 100 billion mark for the first time, reaching 175 billion. At the current pace of development, model parameter counts will exceed one trillion by 2023; for comparison, the human brain has about 125 trillion synapses.

Huge models require huge amounts of memory. A current GPU carries about 40 GB of onboard high-speed memory; for a huge model with trillions of parameters, on the order of 10,000 GPUs are needed just to spread the parameters evenly across GPU memory.

Considering the additional storage required for training, it actually takes at least 20,000 GPUs to start training. The architecture of existing AI chips is no longer sufficient to support the parameter storage requirements of huge models.

At the same time, huge models depend on being fed massive data. Current AI algorithms essentially achieve qualitative change through quantitative change; it is difficult to jump directly from one qualitative leap to the next. The latest huge models, for example, require training data on the scale of trillions of words. Massive data in turn requires massive storage, and serving high-performance reads to tens of thousands of AI chips simultaneously in a super-large-scale cluster is a great challenge for the storage system.

The second manifestation of gigantic scale is the exponential growth of computing power demand.

Since the rise of deep learning in 2011, demand for computing power has grown exponentially, doubling roughly every 3.4 months. A petaflop/s-day (PD) is the amount of floating-point computation performed in one day at a sustained rate of one petaflop per second, and is a common unit for measuring training compute. Training huge models requires huge computing power: GPT-3 consumed about 3640 PD in 2020, and the requirements of huge models are expected to reach one million PD in 2023.
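As a quick check of the unit, a minimal arithmetic sketch converting PD into raw floating-point operations:

```python
# Sanity check of the petaflop/s-day (PD) unit used above.
PFLOPS = 1e15                      # 1 petaflop per second
SECONDS_PER_DAY = 86_400

pd_in_flops = PFLOPS * SECONDS_PER_DAY      # 1 PD = 8.64e19 floating-point ops
gpt3_flops = 3640 * pd_in_flops             # ~3.1e23 FLOPs for GPT-3 pre-training

print(f"1 PD  = {pd_in_flops:.3e} FLOPs")
print(f"GPT-3 = {gpt3_flops:.3e} FLOPs")
```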

On the fastest supercomputer in the world today, completing one million PD of computation would take about two years. Different fields also require different huge models: GPT-3 mainly handles English-language tasks, and meeting the accuracy requirements of other languages and scenarios means training additional huge models, which further increases the demand for computing power.

Such a huge demand for computing power has brought huge challenges to computing technology and products. Solving such challenges requires innovation in various domains, from architecture to system software.

Finally, let's look at the ecosystem challenges facing intelligent computing: the AI technology chain and industrial chain are disconnected. Many people ask questions like: artificial intelligence sounds great, but how do I combine it with my business and my customers' application scenarios? I want to use AI for intelligent transformation, but I have nobody who understands the algorithms and models, and I lack an easy-to-use AI development platform. And with so many algorithms and models available, how do I find the optimal combination for my application?

People who do understand these things are concentrated in research institutions and leading technology companies. The best AI talent gathers there, but it lacks an in-depth understanding of the demand scenarios and business rules of traditional industries, and cannot access the critical business data needed to train models, so the technology goes unused.

Survey reports by consulting agencies such as Accenture also show that more than 70% of technical research institutions and technology companies lack demand scenarios, domain knowledge and data, and more than 70% of industry users lack technical talents, AI platforms and practical capabilities.

2 Introduction to AI acceleration technology

2.1 AI Architecture

Usually, what users see of the AI architecture is a request form for resources such as an XX-core CPU, XX GPU cards and XX GB of memory, which correspond to the computing, storage and network resources of the AI architecture. The actual AI architecture includes computing nodes, management nodes, storage nodes, the computing network, the management network, clients and so on.

How should computing resources be planned? The principle is to meet the demand at the lowest cost while leaving room for scalability. For example, if there are two or more distinct types of computing workload, each at a significant scale, the cluster should also provide two or more corresponding types of computing node; if one requirement is expected to scale much more than the others, the number of node types can be reduced to simplify future expansion.

2.2 AI acceleration technology

AI workloads demand enormous amounts of computation, and how they are accelerated directly affects production efficiency and cost. The following sections introduce some of the latest AI acceleration technologies.

2.2.1 Computing

(1) Heterogeneous computing

Before GPUs were used for AI computing, CPUs handled the computing tasks. However, as AI computing demand increased sharply, CPU computing efficiency could no longer keep up, which gave rise to the "CPU+GPU" heterogeneous computing architecture, as shown in the upper right corner of the figure below.

As shown in the lower right corner of the figure below, GPU computing efficiency is several times to dozens of times that of the CPU. Why is the difference so large? The main reason is the huge architectural difference between the two. As shown in the lower left corner of the figure, a GPU has far more computing units than a CPU, so it is better suited to large-scale parallel computing.

The control and cache units occupy a much larger share of die area in the CPU than in the GPU, so the CPU is better suited to complex computations that cannot be highly parallelized (branch-heavy code with many if statements, for example).
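A minimal sketch (assuming PyTorch and a CUDA-capable GPU are available; the matrix size is arbitrary) of offloading a matrix multiplication to the GPU, which is the basic pattern behind CPU+GPU heterogeneous computing:

```python
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    """Time a single n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()          # make sure setup has finished
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()          # wait for the GPU kernel to complete
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f} s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f} s")
```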

(2) NVLINK communication

As the scale of AI computing grows, large-scale AI training requires multiple cards, or even multiple nodes, to work on a single task at the same time. A key question is how to support high-speed communication between GPUs within a node so that they can collaborate as one huge accelerator.

PCIe is a widely adopted standard, but its bandwidth is limited. As shown in the upper left corner of the figure below, the theoretical bandwidth of PCIe Gen3 is 32 GB/s and that of PCIe Gen4 is 64 GB/s, with measured bandwidths of roughly 24 GB/s and 48 GB/s respectively.

In AI training, the parameters, that is, the weights, must be synchronized before each round of computation can complete, and the larger the model, the larger the parameter volume, so the peer-to-peer (P2P) communication capability between GPUs has a significant impact on computational efficiency. As shown in the upper right corner of the figure below, on the same 8-card V100 system, the NVLINK 2.0 architecture improves performance by 26% over the PCIe architecture, and the NVLINK 2.0 Next architecture (fully interconnected, with 300 GB/s of P2P bandwidth between any two cards) improves it by 67% over PCIe.

NVLINK is a high-speed GPU interconnect technology developed by NVIDIA and has now reached its third generation (NVLINK 3.0). As shown in the lower part of the figure below, from NVLINK 1.0 (P100) to NVLINK 2.0 (V100) to NVLINK 3.0 (A100), bandwidth has grown from 160 GB/s to 300 GB/s to 600 GB/s. With NVLINK 1.0 and 2.0, P2P communication is not fully interconnected: the bandwidth between some pairs of GPUs does not reach the maximum, and some pairs even have to communicate over PCIe, so P2P bandwidth within a node is uneven.

NVLINK 3.0 achieves fully interconnected P2P communication, with 600 GB/s of bandwidth between any two cards, which greatly improves multi-card computing efficiency within the node.
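A minimal sketch (assuming PyTorch and at least two CUDA GPUs; whether the copy travels over NVLINK or PCIe depends on the machine's topology) that checks peer access and performs a direct GPU-to-GPU copy:

```python
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

# Can GPU 0 read/write GPU 1's memory directly (NVLINK or PCIe P2P)?
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

x = torch.randn(1024, 1024, device="cuda:0")
y = x.to("cuda:1")                 # device-to-device copy; over NVLINK when available
torch.cuda.synchronize()
print("copied tensor lives on", y.device)
```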

(3) Tensor Core

The Tensor Cores in the V100 are programmable matrix multiply-and-accumulate units that deliver up to 125 Tensor TFLOPS for training and inference. The V100 contains 640 Tensor Cores, each providing a 4x4x4 matrix processing array that performs the operation D = A*B + C, where A, B, C and D are 4x4 matrices, as shown in the upper part of the figure below. The multiplication inputs A and B are FP16 matrices, while the accumulation matrices C and D can be FP16 or FP32.

Each Tensor Core performs 64 fused multiply-add (FMA) operations per clock cycle, which is where the 125 TFLOPS of compute for training and inference comes from. Developers can therefore train with mixed precision (FP16 computation with FP32 accumulation) to obtain around 3x the performance of the previous generation while still converging to the network's expected accuracy.

Tensor Cores deliver several times the GEMM performance of previous hardware, as shown in the lower right corner of the figure below, which compares GP100 (Pascal) and GV100 (Volta).
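A minimal sketch (assuming PyTorch and a Volta-or-newer GPU; the matrix size is arbitrary) of the kind of half-precision GEMM that is eligible for Tensor Core execution:

```python
import torch

n = 4096
# FP16 operands, matching the A and B inputs of the D = A*B + C operation above.
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

# Half-precision GEMM; on Volta and later, cuBLAS routes this to Tensor Cores,
# which accumulate internally at higher precision.
c = a @ b
torch.cuda.synchronize()
print(c.dtype, c.shape)
```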

(4) Diverse computing power

With the development of AI, various types of chips have emerged: CPU, GPU, ASIC and FPGA. As shown in the upper part of the figure below, we can compare them along two dimensions: in versatility, CPU > GPU > FPGA > ASIC, while the performance ordering is exactly the opposite. Different AI tasks place different requirements on chips. Training, for example, needs to support a wide range of frameworks, models and algorithm libraries and therefore demands high versatility; NVIDIA GPUs, with their complete ecosystem, occupy a dominant position here.

Inference, by contrast, only needs to support one or a few frameworks, models and algorithm libraries; because it sits close to the business, performance and cost matter more, so ASIC chips can be more cost-effective than NVIDIA GPUs in some scenarios. The IDC market statistics in the lower half of the figure show that although NVIDIA GPUs still dominate the inference market, other chips are keeping pace there, while in the training market progress by other chips remains slow.


(5) Low precision

If 32-bit floating-point numbers are compressed to 16 bits, some representational precision is lost, but parameter storage space and computation (the number of FPU operations) improve greatly.

This is the rationale for mixed-precision training. The master copy of the weights is stored in FP32; forward and backward passes are computed in FP16; and during the weight update, the increment (gradient multiplied by the learning rate) is applied to the FP32 master weights, as shown in the upper part of the figure below.

As the figure shows, in some scenarios low precision not only improves performance; the savings can also be reinvested in inference to run more complex models and thereby improve inference accuracy.
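A minimal sketch (assuming PyTorch with CUDA; the model and data are toy placeholders) of the mixed-precision recipe described above, with FP32 master weights, FP16 compute and loss scaling:

```python
import torch
from torch import nn

device = "cuda"
model = nn.Linear(1024, 1024).to(device)        # weights kept in FP32 (master copy)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()            # loss scaling protects small FP16 gradients

x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 1024, device=device)

for _ in range(10):
    opt.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)   # forward in FP16
    scaler.scale(loss).backward()               # backward in FP16
    scaler.step(opt)                            # update applied to FP32 master weights
    scaler.update()
```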

2.2.2 Network

(1) GDR

GDR (GPUDirect RDMA) allows the GPU in computer 1 to directly access the GPU memory of computer 2, as shown in the upper half of the figure below. Before looking at GDR, it helps to understand DMA and RDMA.

DMA (Direct Memory Access) is an important technique for offloading work from the CPU: data exchange between device memory and system memory, which originally required the CPU's participation, is instead handled by the DMA controller.
RDMA can be simply understood as follows: with the appropriate hardware and network technology, the network card of server 1 can directly read and write the memory of server 2, achieving high bandwidth, low latency and low CPU overhead.

Current RDMA implementations fall into two transport categories: InfiniBand and Ethernet. On Ethernet, depending on how the protocol stack is integrated, RDMA is further divided into iWARP and RoCE (RoCEv1 and RoCEv2).

GPUDirect RDMA, then, means the GPU of computer 1 can directly access the GPU memory of computer 2. Before this technology, data had to be moved from GPU memory to system memory, transferred to computer 2 via RDMA, and then moved from system memory into GPU memory on the receiving side.

GPUDirect RDMA removes these extra data copies and further reduces GPU communication latency.

(2) SHARP

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a collective communication network offload technology.

AI training involves many collective communication operations, which, because every node participates, often have a huge impact on the parallel efficiency of the application.

To address this, NVIDIA Mellanox introduced SHARP starting with its EDR InfiniBand switches, integrating a computation engine into the switch chip. The engine supports 16-bit, 32-bit and 64-bit fixed-point and floating-point arithmetic; reduction operations such as sum, minimum, maximum, AND, OR and XOR; and collective operations such as Barrier, Reduce and All-Reduce.

In a cluster built from multiple switches, Mellanox defines a complete SHARP offload mechanism: an Aggregation Manager constructs a logical SHARP tree over the physical topology, and the switches in the SHARP tree process collective communication operations in parallel and in a distributed fashion.

When hosts need to perform a global operation such as allreduce, they all submit their data to the switches they are connected to. Each first-level switch processes the data with its built-in engine and passes the result up to the next level of the SHARP tree; that switch in turn aggregates the results received from the switches below it and forwards its result further up the tree.

When the data reaches the root of the SHARP tree, the root switch performs the final computation and sends the result back to all host nodes. SHARP thus greatly reduces collective communication latency, reduces network congestion, and improves the scalability of the cluster (as shown in the upper part of the figure below).
SHARP's effect is even more pronounced for complex models on multi-layer network topologies. As shown in the lower half of the figure below, as the cluster grows, latency stays essentially flat with SHARP enabled, whereas without SHARP it grows linearly; the resulting difference in end-to-end performance is correspondingly large.
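For reference, a minimal sketch (assuming a PyTorch NCCL job launched with torchrun; whether SHARP offload actually engages depends on the InfiniBand fabric and the NCCL/HPC-X plugins installed, and the NCCL_COLLNET_ENABLE variable is given here as an assumption about a commonly used switch) of the allreduce collective that SHARP accelerates:

```python
import os
import torch
import torch.distributed as dist

# Hint NCCL to use in-network aggregation (SHARP/CollNet) where the fabric
# supports it; treat this variable as an assumption about your NCCL build.
os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")

dist.init_process_group(backend="nccl")          # env vars provided by torchrun
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

grad = torch.randn(64 * 1024 * 1024, device="cuda")   # a 64M-element "gradient"
dist.all_reduce(grad, op=dist.ReduceOp.SUM)            # the collective SHARP can offload
dist.destroy_process_group()
```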

(3) IB (InfiniBand)

The InfiniBand architecture is a software-defined networking architecture designed for large-scale data centers, aiming to provide the most efficient data center interconnect. InfiniBand natively supports technologies such as SDN, overlays and virtualization, and is an open-standard, high-bandwidth, low-latency, highly reliable interconnect. Compared with RoCE, IB has many advantages, as shown in the upper part of the figure below.

There is, of course, a heated debate over whether to use IB or RoCE for the AI training network. NVIDIA mainly promotes IB: besides listing various functional advantages, it points out that the AI clusters deployed by Internet companies such as Alibaba, Baidu, JD.com and Tencent in the past two years mostly use IB, though it offers little convincing quantitative data. Alibaba, for its part, has a dedicated RoCE optimization team and has achieved performance similar to IB, and benchmark features such as SHARP cited by NVIDIA only deliver roughly 3%-5% improvement for real users (although the effect is now estimated to be more significant for large models and network architectures of three or more tiers).

In general, the conclusion at this stage is that IB is the better choice: its optimizations are already built into the ecosystem (NCCL/CUDA/...), so little tuning is required from users, whereas RoCE requires dedicated optimization effort. IB does cost more, but as the lower part of the figure below shows, it brings a larger performance improvement.

In a cloud context, however, IB adds a second network architecture alongside Ethernet, which increases the complexity of operation, maintenance and management. The IB-versus-RoCE question therefore deserves further analysis, with more quantitative data and more first-principles study, to build a deeper understanding of the network.

(4) Multiple network cards

As mentioned earlier, NVLINK 3.0 provides 600 GB/s of communication bandwidth and PCIe 4.0 reaches a measured 48 GB/s, while the computing network is usually at most 100 Gb/s (12.5 GB/s). In multi-node, multi-card large-model training, inter-node parameter communication therefore becomes a bottleneck. The answer is a multi-NIC strategy: two nodes are connected not by one network cable but by several. As the figure below shows, the performance improvement from multiple network cards is obvious, and since the network accounts for only about 10% of the cost of the whole computing system, a performance gain of more than 10% makes it a cost-effective investment. A rough bandwidth comparison is sketched below.
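A minimal sketch (pure arithmetic from the figures quoted above; the NIC counts are illustrative) of the bandwidth gap that motivates multiple network cards:

```python
# Bandwidths quoted above, converted to GB/s.
nvlink3 = 600.0                      # intra-node GPU-to-GPU bandwidth, GB/s
pcie4_measured = 48.0                # GPU-to-host bandwidth, GB/s
nic_100g = 100 / 8                   # one 100 Gb/s NIC = 12.5 GB/s

for nics in (1, 2, 4, 8):
    inter_node = nics * nic_100g
    print(f"{nics} x 100G NIC = {inter_node:5.1f} GB/s "
          f"({nvlink3 / inter_node:5.1f}x slower than NVLINK 3.0)")
```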

2.2.3 Storage

(1) GDS

GDS (GPUDirect Storage) is another GPUDirect technology from NVIDIA. As datasets and models keep growing, applications spend more and more time loading data, which hurts application performance; in end-to-end architectures in particular, slow I/O increasingly leaves ever-faster GPUs starved for data.

The standard path for moving data from an NVMe disk to GPU memory goes through a bounce buffer in system memory, that is, an extra copy of the data. GPUDirect Storage avoids the bounce buffer to eliminate this extra copy, using a DMA engine to place data from local or remote storage directly into GPU memory.

In other words, a direct data path is established between NVMe or NVMe-over-Fabrics storage and GPU memory, which relieves the CPU I/O bottleneck and increases I/O bandwidth and the volume of data transferred.

GPUDirect Storage thus greatly improves the speed at which GPUs can load large datasets; as NVIDIA puts it, its main function is to move data into GPU memory by direct memory access through this new path.
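A minimal sketch (assuming the RAPIDS KvikIO library, a Python binding for cuFile/GDS, together with CuPy; whether the transfer actually bypasses the bounce buffer depends on driver, filesystem and hardware support, and KvikIO falls back to a POSIX compatibility mode otherwise) of reading and writing GPU memory directly against storage:

```python
import cupy
import kvikio

# A GPU-resident array to round-trip through storage.
a = cupy.arange(1 << 20, dtype=cupy.float32)

f = kvikio.CuFile("gds_demo.bin", "w")
f.write(a)                      # GPU memory -> storage
f.close()

b = cupy.empty_like(a)
f = kvikio.CuFile("gds_demo.bin", "r")
f.read(b)                       # storage -> GPU memory
f.close()

assert bool((a == b).all())
```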

That said, GDS has not yet seen many production deployments. First, the file system must be adapted and certified by NVIDIA before GDS can be supported, which limits the technology's spread. Second, GDS is mainly a single-node technology, and NVMe chiefly serves the intermediate case where memory is insufficient and shared-storage bandwidth is low, so the applicable scenarios are narrow and the industry's enthusiasm for adaptation is modest. Even so, GDS gives the AI architecture one more acceleration option.

(2) Burst Buffer

Burst Buffer technology aggregates the local SSDs of compute nodes into a temporary cache file system. It can improve application reliability through faster checkpoint/restart; speed up I/O for small-block transfers and analysis; provide fast scratch space for out-of-core applications; and create staging areas for compute jobs whose large file inputs need fast persistent storage during the computation.

Burst Buffer has already been widely used in HPC; for example, machines in the top 10 of the HPC TOP500 list have adopted it. In AI architectures, some users are now trying similar technology to provide a very large cache for large-scale training.

2.2.4 Parallel Technology

In large-scale AI training, parallelism is a crucial technology. Distributing a deep learning model across multiple computing devices is the way to train large, complex models, and its importance keeps growing as demands on training speed and frequency rise.

Data parallelism (DP) is the most widely used strategy, but when a single GPU's memory cannot hold the model, the model itself must be split: it is divided into N parts and loaded onto N different GPUs. Depending on how the split is made, model parallelism is divided into tensor-slicing model parallelism (intra-layer) and pipeline model parallelism (inter-layer).

Models such as GPT-3, and the huge models trained with frameworks like DeepSpeed, need to combine multiple forms of parallelism before the entire model fits into memory, as sketched below.
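A minimal single-process sketch (NumPy only; the "GPUs" are simulated by separate arrays and the split is purely illustrative) of the tensor-slicing idea: a linear layer's weight matrix is split column-wise across N devices and the partial outputs are concatenated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))          # a batch of activations
w = rng.standard_normal((512, 1024))       # full weight matrix of one linear layer

# Tensor-slicing (intra-layer) parallelism: each "GPU" holds one column block
# of W and computes its slice of the output independently.
n_gpus = 4
w_shards = np.split(w, n_gpus, axis=1)
partial_outputs = [x @ shard for shard in w_shards]   # would run on different devices
y_parallel = np.concatenate(partial_outputs, axis=1)  # gather the output slices

assert np.allclose(y_parallel, x @ w)                 # identical to the unsplit computation
```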

The GPT-3 model places very heavy demands on both computing and I/O, so the main acceleration technologies discussed above, such as NVLINK, Tensor Cores, IB, multiple network cards, GDR and parallelism, must be combined in order to complete large-model training efficiently.

2.3 Summary

The AI acceleration technologies described above all work in one of two directions: computing or I/O. Heterogeneous computing improves raw compute, while NVLINK, IB, GDR, GDS, Burst Buffer, multiple network cards and the rest improve I/O bandwidth and latency.

I/O bandwidth forms a staircase: from GPU cache (7 TB/s) to GPU memory (1.6 TB/s), CPU memory (90 GB/s), cache (24 GB/s), NVMe disk (6 GB/s), distributed storage (5 GB/s per client, with aggregate scale reaching tens or hundreds of GB/s) and cold storage (2 GB/s). The direction of I/O acceleration in AI architecture is to gradually close the gaps between these steps, while algorithms should exploit the architecture's characteristics to make the most of the fastest tiers.

3 GPT-3 model pre-training computing architecture analysis

The following takes GPT-3 model pre-training as an example to conduct a simple architecture analysis.

3.1 GPT-3 Model Computational Feature Analysis

When designing the AI architecture, we must first understand the computing characteristics of GPT-3, that is, what kind of computing and I/O capability its pre-training demands at the extreme.

The analysis is generally done in two ways: theoretical analysis and actual testing. It shows that GPT-3 needs close to 100 GB/s of I/O, which requires a 4x HDR 200 network, that is, four network cards per node, on an InfiniBand fabric.

Next comes the computing requirement, evaluated against the A100's 312 TFLOPS: GPT-2 needs about 10 petaflop/s-days, roughly one day of training on 64 A100 GPUs, while GPT-3 needs about 3640 petaflop/s-days, roughly one year of training on 64 A100 GPUs. The following table lists the training resources used by several recently released large models.
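A minimal sketch (pure arithmetic from the numbers above; the 50% utilization figure is an assumption, since sustained throughput is well below peak) checking these estimates:

```python
A100_PEAK = 312e12                     # FP16 Tensor Core peak, FLOP/s
N_GPUS = 64
PD = 1e15 * 86_400                     # one petaflop/s-day in FLOPs
UTILIZATION = 0.5                      # assumed sustained fraction of peak

def training_days(pd_required: float) -> float:
    sustained = N_GPUS * A100_PEAK * UTILIZATION
    return pd_required * PD / sustained / 86_400

print(f"GPT-2 (10 PD):   {training_days(10):7.1f} days")    # about a day
print(f"GPT-3 (3640 PD): {training_days(3640):7.1f} days")  # about a year
```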

3.2 Analysis of GPT-3 Model Pre-training Computing Architecture

As analyzed in the previous section, the computing part of the architecture uses the latest A100 GPUs, the I/O part uses a 4x HDR 200 IB network, and NVLINK provides 600 GB/s high-speed interconnection between GPUs.

NVLINK A100 Server Topology

The following figure is the corresponding network topology:

Large model training platform architecture (140 nodes)

4 Conclusion

Computing power is one of the three essential elements of artificial intelligence. AI acceleration technology is developing rapidly along the computing and I/O directions, continuously improving the efficiency of AI computing tasks, and we hope this article helps deepen the understanding of AI architecture.

Of course, beyond configuring the right hardware, AI acceleration also requires the cooperation of platform, framework and algorithm engineers to make the most of the latest AI architecture.

About the Author

Jason OPPO Senior AI Architect

Graduated from the Institute of Geology and Geophysics, Chinese Academy of Sciences; previously a senior AI architect at Inspur, providing AI computing architecture selection and optimization for AI customers.
