Abstract: The era of Chinese pre-trained language models with hundreds of billions of parameters has arrived.

This article is shared from the Huawei Cloud community post "With the open-source MindSpore framework, how was the first Chinese pre-trained language model with 100 billion parameters and TB-level memory 'refined'?", original author: chengxiaoli.

The era of Chinese large-scale pre-trained language models with hundreds of billions of parameters has arrived.

Recently, the field of Chinese large-scale pre-trained language models has been lively: "Enlightenment · Wenyuan" with 2.6 billion parameters, PLUG with 27 billion parameters, and the hundred-billion-parameter "Pangu" NLP large model released by Huawei Cloud yesterday. Pre-trained language models have grown to the point where simply loading them requires terabytes of memory or device memory.

Intuitively, we can expect "Pangu" to perform better, but its compute demand is also larger, and it is harder to train.

However, "Pangu" is actually such an exploration: the open source framework MindSpore, the basic software and hardware platform of Shengteng, and the super-large-scale Chinese pre-training model mean that the infrastructure is already complete.

This work was jointly completed by Huawei and technical teams at Peking University. With the Ascend base software and hardware platform and the automatic parallelism of the MindSpore framework, they trained the largest Chinese pre-trained model to date.

So how is a model as large as Pangu actually trained? Next, let us walk through the key technologies behind "Pangu" in detail.

A model with hundreds of billions of parameters and TB-level memory

Take Pangu with 200 billion parameters as an example. If the weights are kept in the standard FP32 format during training, the weights alone occupy roughly 750 GB (200 billion parameters × 4 bytes ≈ 800 GB, or about 750 GiB), and the memory overhead grows several-fold during training. These parameters cannot simply sit on disk or be loaded into host memory; they need to be moved into the HBM (High Bandwidth Memory) of the Ascend Atlas training servers so that those servers can train the model.
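As a rough back-of-the-envelope check (an illustrative sketch, not a figure from the article), the weight memory and the additional training-time overhead can be estimated directly from the parameter count:

```python
# Rough estimate of the Pangu weight memory footprint (illustrative only).
params = 200e9          # 200 billion parameters
bytes_fp32 = 4          # FP32 uses 4 bytes per parameter

weights_gb = params * bytes_fp32 / 1e9        # ~800 GB (about 750 GiB)
print(f"FP32 weights: ~{weights_gb:.0f} GB")

# During training, gradients and Adam-style optimizer states (momentum, variance)
# add several more copies of this footprint, which is why the total memory
# overhead grows several-fold beyond the raw weight size.
grad_gb = weights_gb                          # one gradient per parameter
adam_states_gb = 2 * weights_gb               # two moment buffers per parameter
print(f"Weights + gradients + optimizer states: ~{weights_gb + grad_gb + adam_states_gb:.0f} GB")
```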

A large model also means large amounts of data, and all of it needs to be high quality. To meet this requirement, the R&D team crawled 80 TB of text from the Internet and cleaned it down to a 1 TB Chinese dataset.

Models and data of this size can no longer even be loaded on ordinary servers, let alone trained there. Fortunately, the R&D team will provide an API, so that an ordinary algorithm engineer can simply call the interface to try out the model.

It can be said that Pangu is the industry's first Chinese pre-trained model at the hundred-billion-parameter scale, with the largest version reaching 200 billion parameters.

Ultra-large-scale automatic parallelism: a boon for algorithm engineers

First, consider a question: have you thought about how you would train such a large model?

Even given enough computing power, how would you do it? The most commonly used distributed training method is data parallelism, but data parallelism alone is definitely not enough, because no single device can hold 800 GB of parameters. What about model parallelism? That raises new questions: how do we split such a huge "Pangu", and how should gradients and data flow between devices (NPUs, GPUs, etc.)?

Obviously, training such a large model is far more complicated than it seems. It requires a great deal of engineering work, while guaranteeing that this work has little or no effect on the final convergence of the model.

Does Pangu really rely on manual parallel optimization?

If you write the distributed training logic by hand, you need to account for many complex factors such as the amount and type of computation, cluster bandwidth, topology, and number of samples, then design a well-performing parallel partitioning strategy and write a large amount of partitioning and inter-node communication code. If the system environment changes, you have to redesign and rewrite it all, which is daunting even to think about.

With TensorFlow or similar frameworks, built-in distributed strategies such as the MirroredStrategy family are simply not enough at this scale, so it seems that writing your own parallel strategy is unavoidable. In reality, however, Pangu was trained through software-hardware co-design: the MindSpore computing framework, the CANN heterogeneous computing architecture, and the full set of infrastructure of the Ascend base software and hardware platform. Among these, the vital automatic parallelism capability is provided by MindSpore.

Fusing five parallel dimensions: powerful automatic parallelism

MindSpore's automatic parallelism provides five dimensions of parallelism: data parallelism, operator-level model parallelism, pipeline model parallelism, optimizer model parallelism, and recomputation, and it combines all five organically at the graph compilation stage. The combination of these five parallel modes constitutes Pangu's parallel strategy.
image.png
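As a rough illustration of how these dimensions are switched on in a training script, here is a minimal sketch based on MindSpore's documented auto-parallel context interface; the exact argument names can vary across MindSpore versions, and the values below are placeholders rather than Pangu's real configuration:

```python
from mindspore import context
from mindspore.context import ParallelMode

# Minimal sketch: enable semi-automatic parallelism so that per-operator shard
# strategies, pipeline stages, and optimizer sharding are combined by the
# framework at graph-compilation time. In a real multi-device job,
# mindspore.communication.init() would also be called after this setup.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,  # honor operator-level strategies
    device_num=8,                                   # total number of devices (placeholder)
    pipeline_stages=2,                              # pipeline model parallelism
    enable_parallel_optimizer=True,                 # optimizer (state) sharding
    gradients_mean=True,                            # average gradients for data parallelism
)
```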

a. Data parallel

Data parallelism is the most basic and most widely used parallel method. The training data (a mini-batch) is split, and each device gets one shard while holding a complete copy of the model. During training, after each device computes its gradients, the gradients are synchronized across devices before the model parameters are updated.
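Conceptually (a framework-free NumPy sketch, not MindSpore code), data parallelism works like this:

```python
import numpy as np

# Conceptual sketch of data parallelism: every "device" holds the full weights,
# computes gradients on its own shard of the mini-batch, and the gradients are
# averaged (an all-reduce in a real cluster) before the shared update.
rng = np.random.default_rng(0)
num_devices, batch, dim = 4, 32, 16
w = rng.normal(size=(dim, 1))                 # full model copy on every device
x = rng.normal(size=(batch, dim))
y = rng.normal(size=(batch, 1))

shards_x = np.split(x, num_devices)           # each device gets one shard of the batch
shards_y = np.split(y, num_devices)

local_grads = []
for xs, ys in zip(shards_x, shards_y):        # per-device gradient of an MSE loss
    err = xs @ w - ys
    local_grads.append(2 * xs.T @ err / len(xs))

grad = np.mean(local_grads, axis=0)           # gradient synchronization (all-reduce mean)
w -= 0.01 * grad                              # identical update applied on every device
```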

b. Operator-level model parallelism

Operator-level model parallelism partitions the tensors involved in each operator of the model network. MindSpore models each operator independently, and each operator can have its own partitioning strategy.

Take the matrix multiplication operator MatMul(x, w) as an example, where x is the training data and w is a model parameter, both two-dimensional matrices. The parallel strategy ((4, 1), (1, 1)) means that x is split into 4 slices along its rows while w is not split. With 4 devices in total, each device holds one slice of x and a complete copy of w.
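In MindSpore this strategy is attached directly to the operator. A minimal sketch, assuming semi-auto-parallel mode is enabled as in the earlier configuration snippet (the shard interface is documented, but details may differ across versions):

```python
from mindspore import nn, ops

class ShardedMatMul(nn.Cell):
    """Minimal sketch: MatMul(x, w) with x split 4-way along rows, w not split."""
    def __init__(self):
        super().__init__()
        # ((4, 1), (1, 1)): first input sliced into 4 along its rows,
        # second input (the weight) kept whole on every device.
        self.matmul = ops.MatMul().shard(((4, 1), (1, 1)))

    def construct(self, x, w):
        return self.matmul(x, w)
```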

c. Pipeline model parallel

Pipeline model parallelism splits the model by layer into multiple stages and maps each stage onto one or more devices. To improve device utilization, the mini-batch is further divided into multiple micro-batches, so that different devices can process different micro-batches at the same time.

One pipeline-parallel approach (GPipe) requires the backward pass to wait until the forward pass of all micro-batches on all devices has finished. Because the backward pass depends on the forward outputs, the activation memory accumulated on each card during the forward pass is proportional to the number of micro-batches, which limits how many micro-batches can be used. In MindSpore's pipeline parallelism, the backward pass is moved forward: as soon as a micro-batch's forward computation finishes, its backward computation starts, which effectively shortens the lifetime of activations and improves overall parallel efficiency.
image.png
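The difference between the two schedules can be shown with a toy count of live activations on one stage (an illustrative pseudo-schedule, not MindSpore's actual scheduler): running all forwards before any backward keeps one activation per micro-batch alive, while starting backwards early keeps only a few.

```python
# Toy comparison of peak activation count on one pipeline stage (illustrative only).
micro_batches = 8

# GPipe-style: run all forwards first, then all backwards.
live, peak_gpipe = 0, 0
for _ in range(micro_batches):       # forward passes accumulate activations
    live += 1
    peak_gpipe = max(peak_gpipe, live)
for _ in range(micro_batches):       # backwards release them only afterwards
    live -= 1

# "Backward advanced" schedule: after a short warm-up, each forward is followed
# by a backward, so activations are released as soon as they are consumed.
warmup = 2                           # depends on the stage's position in the pipeline
live, peak_early = 0, 0
for step in range(micro_batches + warmup):
    if step < micro_batches:
        live += 1                    # forward of one micro-batch
    if step >= warmup:
        live -= 1                    # backward of an earlier micro-batch
    peak_early = max(peak_early, live)

print(peak_gpipe, peak_early)        # e.g. 8 vs. 3: far fewer activations held at once
```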

d. Optimizer model parallelism

Optimizer model parallelism splits the parameters and gradients maintained by the optimizer across multiple devices. Take the Adam optimizer as an example: it keeps "momentum" state tensors the same size as the weights. Under plain data parallelism, every card holds the complete "momentum" and updates it redundantly, wasting both memory and computation. With optimizer parallelism, each card stores only its own slice of the weights and "momentum", which reduces the static memory per card and improves computing efficiency.
image.png
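A NumPy-level sketch of the idea (not MindSpore internals): instead of every card keeping full Adam moment buffers, each card keeps and updates only its own slice, and the full weights are re-assembled afterwards.

```python
import numpy as np

# Conceptual sketch of optimizer parallelism with Adam-style state (illustrative;
# bias correction omitted for brevity).
num_devices, dim = 4, 16
rng = np.random.default_rng(0)
w = rng.normal(size=dim)                     # full weights (replicated for data parallelism)
grad = rng.normal(size=dim)                  # synchronized gradient, identical on every card

# Plain data parallelism: every card stores m and v for ALL `dim` parameters.
# Optimizer parallelism: card k stores m, v (and the updated weights) only for its slice.
slices = np.array_split(np.arange(dim), num_devices)

def adam_slice(idx, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Update only the parameters owned by this card."""
    m[:] = b1 * m + (1 - b1) * grad[idx]
    v[:] = b2 * v + (1 - b2) * grad[idx] ** 2
    return w[idx] - lr * m / (np.sqrt(v) + eps)

updated_slices = []
for idx in slices:                           # each "card" updates its own shard
    m_k, v_k = np.zeros(len(idx)), np.zeros(len(idx))
    updated_slices.append(adam_slice(idx, m_k, v_k))

w = np.concatenate(updated_slices)           # an all-gather re-assembles the full weights
```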

e. Recomputation

Recomputation (rematerialization) addresses the problem that forward operator outputs accumulate in memory and drive up the memory peak. Some forward outputs are discarded and recomputed when they are needed in the backward phase, which effectively reduces peak memory usage during training. As shown in the figure below, the first memory peak is eliminated by recomputation, and the second can be eliminated by the optimizer parallelism described above.
image.png
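In MindSpore, recomputation can be requested per layer. A minimal sketch: the `recompute()` call follows MindSpore's documented Cell interface, while the block itself is a placeholder rather than Pangu's real layer.

```python
from mindspore import nn

class Block(nn.Cell):
    """Placeholder transformer-style block used only to illustrate recomputation."""
    def __init__(self, dim):
        super().__init__()
        self.dense = nn.Dense(dim, dim)
        self.act = nn.GELU()

    def construct(self, x):
        return x + self.act(self.dense(x))

block = Block(1024)
# Ask the framework to discard this block's forward activations and recompute
# them during the backward pass, trading extra compute for lower peak memory.
block.recompute()
```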

Having these five dimensions of parallelism is one thing; deciding how to combine them for Pangu and how to map the resulting model fragments onto devices is still a hard problem. MindSpore's automatic parallelism combines these five dimensions organically to deliver highly efficient distributed training for large models.

Figure (b) below shows a typical tree-shaped hardware topology: bandwidth decreases as tree depth increases, and some traffic conflicts can occur. To exploit this structure, MindSpore aims to maximize the compute-to-communication ratio: the parallel method with the heaviest communication (operator-level parallelism) is placed across the cards inside a server; the one with less communication (pipeline parallelism) is placed between servers within the same rack; and data parallelism (together with optimizer parallelism) is placed between racks, because its communication can overlap with computation and its bandwidth requirement is low.
image.png

In the 200-billion-parameter Pangu model, MindSpore divides the 64 layers into 16 stages of 4 layers each. Within each layer, tensors are partitioned with operator-level parallelism.

As shown in the figure below, the Q, K, and V parameter matrices are each cut into 8 parts by column, the input tensor is cut into 16 parts by row, and the output tensor is therefore cut into 128 parts (8 × 16); a rough sketch of this partitioning follows the figure. Recomputation is configured per layer, so the extra computation it introduces never exceeds the cost of one layer. In total, MindSpore used 2048 Ascend processors to train Pangu.
image.png
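Expressed as operator strategies, the Q/K/V partitioning described above would look roughly like the following. This is a hedged reconstruction from the numbers in the text, not Pangu's actual source; note that 2048 devices across 16 pipeline stages gives 128 devices per stage, which is consistent with the 16 × 8 split.

```python
from mindspore import nn, ops

class QKVProjection(nn.Cell):
    """Hedged reconstruction of the partitioning described above (not Pangu's source).

    With 16 pipeline stages configured in the auto-parallel context (as sketched
    earlier), each attention layer cuts the Q/K/V weights 8-way by column and the
    input activation 16-way by row, so every output is cut into 8 * 16 = 128 parts.
    """
    def __init__(self):
        super().__init__()
        # ((16, 1), (1, 8)): input rows split 16-way, weight columns split 8-way.
        self.q_matmul = ops.MatMul().shard(((16, 1), (1, 8)))
        self.k_matmul = ops.MatMul().shard(((16, 1), (1, 8)))
        self.v_matmul = ops.MatMul().shard(((16, 1), (1, 8)))

    def construct(self, x, wq, wk, wv):
        return self.q_matmul(x, wq), self.k_matmul(x, wk), self.v_matmul(x, wv)
```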

MindSpore hides the details of this complex parallel implementation, making the process as simple as writing a single-device model script. Starting from such a script, users can achieve multi-dimensional hybrid parallelism with only a little extra configuration. The figure below shows a simplified version of the Pangu script, where the bold red text marks the MindSpore parallel strategy; remove the bold red text and what remains is a single-device script.
image.png

Graph-kernel cross-layer joint optimization: unlocking the hardware's full performance

In addition to large-scale automatic parallelism across nodes, within a single card MindSpore applies collaborative optimization across the graph layer and the operator layer to further unleash Ascend's computing power.

In a typical neural network, different operators carry very different amounts and kinds of computation. For example, LayerNorm is composed of 11 basic operators, while Add is a single basic operator. Operators defined from the user's perspective usually cannot fully exploit the hardware: large, complex operators are hard to tile and compile into high-performance kernels, which lowers device utilization, while operators with too little computation cannot hide their data-movement overhead behind computation and can leave compute units idle, which also lowers device utilization.
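To see why operator granularity matters, here is LayerNorm written out as the kind of basic operators a framework would otherwise execute one by one (a NumPy sketch; the exact count of basic operators depends on the framework's decomposition):

```python
import numpy as np

def layer_norm_basic_ops(x, gamma, beta, eps=1e-5):
    """LayerNorm expressed as a chain of small reduce/element-wise operators.
    Each line roughly corresponds to one basic operator that, if executed
    separately, launches its own kernel and its own memory traffic."""
    mean = x.mean(axis=-1, keepdims=True)        # ReduceMean
    centered = x - mean                          # Sub
    sq = centered * centered                     # Mul (square)
    var = sq.mean(axis=-1, keepdims=True)        # ReduceMean
    var_eps = var + eps                          # Add
    std = np.sqrt(var_eps)                       # Sqrt
    inv_std = 1.0 / std                          # Reciprocal
    normed = centered * inv_std                  # Mul
    scaled = normed * gamma                      # Mul
    return scaled + beta                         # Add

x = np.random.randn(4, 8).astype(np.float32)
out = layer_norm_basic_ops(x, np.ones(8, np.float32), np.zeros(8, np.float32))
```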

To improve hardware utilization, MindSpore uses graph-kernel fusion: through joint optimization of the graph layer and the operator layer, it splits and regroups the operators defined for user convenience into high-performance operators defined from the hardware-execution perspective, making full use of hardware resources and improving the execution performance of the whole network. The optimization process is shown in the figure below:
image.png

Take the LayerNorm operator as an example: through operator splitting and regrouping, its 11 small operators are turned into 1 single operator and 2 fused operators. The regrouped operators compile into higher-performance kernels, which greatly reduces the overall network running time.
image.png
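In practice this optimization requires no changes to the model script; in MindSpore it is typically switched on with a context flag. A minimal sketch (flag placement can vary across MindSpore versions):

```python
from mindspore import context

# Minimal sketch: ask MindSpore's graph-kernel fusion pass to split and re-fuse
# composite operators (such as LayerNorm) into higher-performance fused kernels.
# The model script itself stays unchanged.
context.set_context(mode=context.GRAPH_MODE,
                    device_target="Ascend",
                    enable_graph_kernel=True)
```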

In the Pangu model, graph-kernel fusion helped reduce overall training time by more than 20%. It also delivers good performance gains on other NLP and CV tasks.

Summary: a full demonstration of what training a large model takes

Even with enough computing power, training a super-large model remains extremely complicated and far harder than it looks. For most algorithm engineers, a model with hundreds of millions of parameters already counts as large for a given task, yet it does not feel hard to train, because every deep learning framework exposes a data-parallel interface that can be called directly.

But once the model grows to tens of billions, hundreds of billions, or even trillions of parameters, the complexity of the parallel and optimization strategies rises sharply, and hand-writing and tuning that code piece by piece becomes too hard for algorithm engineers. Through automatic parallelism, MindSpore decouples the computation logic from the parallelization logic, turning single-card serial code into distributed parallel code automatically, so that algorithm scientists can focus their energy on the model itself.

To absorb more knowledge from pre-training, models such as GPT-3 and Pangu will keep getting larger; after all, we have not yet seen the limit of what large-model pre-training can achieve. Such models will place ever greater demands on infrastructure, and their parallel and optimization strategies will become even more complex. Only with good enough infrastructure can large-scale pre-training deliver better results, play a bigger role in scenarios such as knowledge question answering, knowledge retrieval, knowledge reasoning, and reading comprehension, and realize commercial value in intelligent customer service, marketing, copywriting generation, and more.

Large-scale computing clusters and software-hardware co-optimization were fully on display in this round of Pangu training. As the development team put it, "Building a 100-billion-parameter model on the MindSpore and Ascend base software and hardware platform is itself an exploration. Distributed training of large models, hyperparameter tuning, dataset composition, the suitability of the model structure: there are too many unknowns. Now the Pangu model works very well, taking a new first place on the CLUE leaderboard. It means that, for the first time, domestic software-hardware co-optimization and ultra-large-scale distributed training have produced exciting results, and that we have a sufficiently strong infrastructure of our own."

Of course, as noted above, Pangu is just one exploration of ultra-large-scale distributed training and ultra-large-scale Chinese pre-trained models. In the future, more researchers will be needed to devote themselves to research on general intelligence and large-scale distributed computing.
