Abstract: Recently, increasing model size has become the main way to improve model performance. In particular, self-supervised pre-trained language models in NLP keep growing, from the 175 billion parameters of GPT-3 to the 1.6 trillion parameters of Switch Transformer, yet another order-of-magnitude jump.
This article is shared from the HUAWEI CLOUD community and takes you through the key technologies behind the trillion-parameter super-large models supported by MindSpore. Original author: HWCloudAI.
Preface
Recently, increasing model scale has become the main means of improving model performance. In particular, self-supervised pre-trained language models in NLP keep growing, from the 175 billion parameters of GPT-3 to the 1.6 trillion parameters of Switch Transformer, another order-of-magnitude increase.
This order-of-magnitude growth in model scale has delivered a certain degree of performance improvement, and even produced some unexpected "magical" effects (as with GPT-3), but the computational overhead behind it has become the biggest problem: training GPT-3, for example, took tens of thousands of GPUs and several weeks. How to use ultra-large-scale parameters to improve model expressiveness and performance while keeping the growth in computation small has become one of the most important challenges. This article focuses on the dynamic neural network techniques represented by MoE. The brain is a typical low-energy, high-efficiency computing model, and sparse activation is its most important feature.

Besides the computational-efficiency challenge of giant models in training and inference (especially training), another, bigger challenge lies in the training optimization algorithm itself (not discussed here): backpropagation is currently the most usable way to optimize deep networks, but a more ideal optimization algorithm would be highly parallel, optimize asymmetrically, and complete global optimization through continual local optimization across space and time.

We first lay out the basic concepts of conditional computation:
1. In a traditional neural network, the feed-forward processing of every sample in an input batch activates every parameter of the network.
2. Conditional computation, in its loosest definition, refers to a class of algorithms that activate only certain parts of a network. In a concrete realization, the conditional selection mode may activate different parts of the network independently per sample in the input batch, per region of the data space (such as different areas or channels of an image), per temporal segment of the input (such as different sliding windows of a time series or different frames of a video), per target task, or per a non-learnable, fixed random assignment to different subnetworks.
3. For different inputs (raw inputs or outputs of preceding layers), parts of the subsequent network are selectively executed according to certain conditions. Under this umbrella there are several approximate or related techniques: dynamic neural networks, conditional computation, conditional activation, sparse activation, selective execution, mixture of experts (MoE), dynamic routing, and so on; strongly related models include Switch Transformer.
Classification of conditional computation (broad sense)
1. By whether routing is learnable: learnable-routing conditional computation versus non-learnable-routing conditional computation.
2. By whether inactive parts really skip computation: hard conditional computation versus soft conditional computation. In the hard mode, tensors are selected and split so that, whatever the conditional selection mode, data that does not need to activate a network part never participates in that part's computation at all. In the soft mode, the relevant data may merely be set to zero to remove its effect, while the nominally inactive network parts still actually execute their computation.
Main advantages of conditional computation
1. Effective computation, lower energy consumption: with partial activation, taking per-sample conditional activation as an example, a single sample only passes through, and thus only computes, part of the entire SuperNet.
2. Larger network, stronger expressiveness: because routing fans out from one place to many, the inputs of each layer are routed to different subnetworks and computed independently, so the representations of different inputs at each layer are relatively independent and do not interfere with each other. Expressive power is stronger and the network can be larger, though per-parameter expression efficiency drops.
Network and computation forms of conditional computation
The network and computation forms of conditional computation are quite flexible. Some construction forms are as follows (specific models and paper citations are omitted here, see: http://intellabs.github.io/dis):
1. For tasks such as CV, use multiple independent CNNs as expert networks, route independently per task, and combine the tails into one large network.
2. Use more complex forms such as cascading to combine different expert networks at different levels.
3. Use methods such as decision trees to transform the data and realize routing.
4. Choose routes through a learnable network. The loss for learning the routing policy can be constructed in many ways: directly using the main loss of the task (e.g., classification), constructing auxiliary losses from the importance and load of the different experts, and so on.
Routing strategies for conditional computation
1. Non-learnable / hard mode: compute the route through some deterministic policy, such as LSH (locality-sensitive hashing).
2. Learnable mode: compute the route through a learnable network, which can be large or small. The simplest learnable routing is a single weight matrix: G(x) = P(X·W), where G(x) is the routing gate function, X is the input, W is the learnable routing weight trained through the loss function, and P is some selection function (such as topk or sort). In actual implementations, X·W may also serve as part of the input to the subsequent network, i.e., it is not used only by G(x) to select a route. A more typical form is G(x) = P(N(X·W)), where N is a normalization function such as softmax.
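A minimal NumPy sketch of this learnable-routing form, with hypothetical shapes (5 samples, 3 experts); the k parameter also previews the redundancy strategies in the next section (k=1 non-redundant, k>=2 redundant):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gate(X, W, k=1):
    """Route each sample to its top-k experts.
    X: [batch, d_model] inputs; W: [d_model, n_experts] learnable routing weight.
    Returns the selected expert indices and their normalized gate scores."""
    logits = X @ W                                   # X·W
    probs = softmax(logits)                          # N(.): normalization
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]    # P(.): top-k selection
    topk_val = np.take_along_axis(probs, topk_idx, axis=-1)
    return topk_idx, topk_val

X = np.random.randn(5, 8)    # 5 samples, feature dim 8 (illustrative values)
W = np.random.randn(8, 3)    # 3 experts
idx, val = gate(X, W, k=2)   # k=1 -> non-redundant; k>=2 -> redundant routing
```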
Redundancy strategies for conditional computation
By redundancy strategy, conditional computation can be divided into non-redundant and redundant conditional computation:
1. Non-redundant conditional computation can be realized through the choice of the P(.) function, e.g., topk with k=1;
2. Redundant conditional computation can be realized in several forms: through the P(.) function, e.g., topk with k=n for n>=2; or in a hard-redundancy mode, where the entire network supports duplicated inputs and multi-path computation.
Challenges of conditional computation
1. Influence of the routing algorithm on model quality. Whether the product of the input and the routing weights (X·W) is used only for route selection or also directly as part of the input to the subsequent network units, the routing algorithm determines how the input information flows, and therefore has a great impact on the overall quality of the model.

2. Stability of the routing/gate. The routing/gate weights are randomly initialized and keep being adjusted during training, while the layers before and after them are also changing; the same sample may thus be assigned to different subsequent network units at different stages of training. If this dynamic change is too drastic, it seriously affects the stability and convergence speed of the whole training process.

3. Importance of experts to samples, and load balance.
In the training phase, the importance of each expert (its correlation with the samples in a batch) and the load balance of samples being evenly assigned to the different experts are two indicators that are both related and conflicting. Each generally needs its own loss term, optimized as an auxiliary loss. arXiv:1701.06538, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", discusses this in detail.
About conditional computation / dynamic neural networks
For more on conditional computation / dynamic neural networks, see "Dynamic Neural Networks: A Survey", arXiv:2102.04906 (http://arxiv.org/abs/2102.04906), which classifies dynamic-network techniques by instance level, temporal level, and spatial level:
- Instance-wise Dynamic NN: per-instance dynamics; each sample independently activates different networks and parameters (MoE belongs to this direction). Dynamic Architecture: Dynamic Depth, Dynamic Width, Dynamic Routing/MoE; Dynamic Parameter: Parameter Adjustment, Parameter Prediction, Dynamic Feature(s).
- Spatial-wise Dynamic NN: spatially dynamic; different spatial positions (e.g., of an image) activate different subsequent networks and parameters (CNNs, etc.): Pixel Level, Region Level, Resolution Level.
- Temporal-wise Dynamic NN: temporally dynamic; sequential data is split along time to activate different subsequent networks and parameters (video frames, text sequences, time series, streams, ...).
The above is the overall classification of Dynamic NN in this review paper.
From the perspective of supporting ultra-large-scale networks with dynamic-network techniques, the classification here mainly weighs high expressive power against low computational cost, and classifies the techniques along two dimensions:
1. By whether only part of the network is activated during the feed-forward computation:
Hard-Dynamic: during the feed-forward pass, part of the network is definitely not activated and does not participate in the computation at all.
Soft-Dynamic: during the feed-forward pass, parts of the network lose their expressive effect after passing through gates/routes such as softmax, e.g., by having tensor elements zeroed out, but they still participate in the computation.
2. By the input on which the dynamic-activation decision is made:
- Sample level: (at the input layer) the subsequent activation of the dynamic network is decided per sample instance.
- Sub-sample level: (at the input layer) different subsequent network units are activated at the temporal/spatial level within a sample. In general, deep networks can be selectively activated and executed not only at the input layer but also in the middle layers.
Among these, the intelligent platform supports per-sample Hard-Dynamic dynamic neural networks, which naturally yield coarse-grained sparse activation of the network structure and can achieve high energy efficiency for training and inference of ultra-large models.
Compared with static-structure networks, dynamic neural networks have been studied extensively in terms of efficiency, expressiveness, generalization, robustness, and interpretability. From the perspective of an intelligent platform supporting ultra-large-scale networks that improve model performance at the lowest possible computational cost, efficiency and representation matter most:
1. Efficiency: in a static network, "pulling one hair moves the whole body": every input sample requires the entire network and all its parameters to respond. For an ultra-large network, this makes leading energy efficiency too hard to achieve.
2. Representation: more parameters bring greater representational capacity; but under MoE-like structures, the features of each layer of the deep network are reused less across experts, so the expression efficiency of each individual parameter is lower.
Implementation strategy
To realize ultra-large-parameter versions of various models with dynamically routed sparse activation, each model needs to be studied and implemented separately.
Take Switch Transformer as an example: its parameter expansion lies in the FFN part of the Transformer. Its MoE expansion is shown below:
(Image source: Switch Transformer paper)
As can be seen, the main change in MoE is to add routing-related logic before and after the expert subnetworks. This article mainly introduces the implementation on the platform. Dynamic-routing conditional computation involves four main steps: routing calculation, data dispatch, independent computation, and result combination.
1. Routing calculation (Gate): based on the input (either the input of the whole network or the output of the previous network unit/layer), the routing unit computes, for sample-wise routing within a batch, the subsequent network route assigned to each sample (the experts of the mixture-of-experts in MoE).
2. Data dispatch (Dispatch): from the overall input tensor, gather and merge, according to the sample-expert relationship computed by routing, the tensor that each expert needs to process. In a fixed expert-batch design, the number of samples assigned to each expert per training batch must be balanced against the maximum capacity of each expert per round; because sample input is random, a reasonably uniform distribution is hard to guarantee, so sub-batches below the fixed batch size should be padded, and samples exceeding the maximum capacity can be handled by methods such as delayed resampling. To keep the input-output correspondence (input/X to label/Y) and the backpropagation derivation correct, the implementation must maintain the index mapping between the original batch and each expert's sub-batch, for use during gradient computation and result combination.
3. Independent computation (Expert): invoke each expert concurrently (or logically sequentially) to process its corresponding sub-batch. This is also one of the concurrency APIs an intelligent platform must support.
4. Result combination (Combine): merge the result tensor of each expert back into a tensor for the whole batch and, using the dispatch indices, restore the original input order.
In mainstream deep-learning platforms, two main implementation strategies can be adopted (see the sketch after the two strategies below):
tensor to zero: for data that needs to be assigned to different subsequent network units (expert subnetworks, etc.), copy one tensor per target expert and zero out the rows that should not be input to that expert. This method is simple to implement, uses only full-tensor operations, and places no special requirements on the platform as long as the zeroing logic is correct; it is suitable for algorithm research, where it only needs to reflect that the preceding data is dynamically routed to different subsequent network units in order to analyze the algorithm's effect. With zeroing, however, the tensor processed by each expert keeps the full batch dimension, so no computation or memory is saved.
tensor sorting: for data that needs to be assigned to different subsequent network units (expert subnetworks, etc.), extract one tensor per target expert, keeping only the rows meant for that expert, and maintain the sample-level index correspondence before and after the transformation. In a distribution-friendly implementation where expert subnetworks are partitioned onto different computing nodes as units, each expert network is best implemented by inheriting from the platform's subnet-level object, e.g., mindspore.nn.Cell in MindSpore. For details, refer to the technical-points section below.
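As a concrete illustration of the first strategy, here is a NumPy sketch of "tensor to zero" under assumed shapes (5 samples, 3 experts, k=1 routing); the values and names are illustrative only:

```python
import numpy as np

batch_x = np.random.randn(5, 4).astype(np.float32)  # full input batch
assign = np.array([0, 1, 2, 0, 1])                  # expert chosen per sample (k=1)
for eid in range(3):
    mask = (assign == eid).astype(np.float32)[:, None]  # [batch, 1] 0/1 mask
    expert_in = batch_x * mask  # full-size tensor per expert: rows belonging to
                                # other experts are zeroed, so the logic is
                                # correct but no computation or memory is saved
```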
Core code
Core code: routing calculation, data dispatch, independent computation, result combination.
The reference code is implemented in MindSpore. (Note: import mindspore as ms)
The core logic of mixture of experts: for input X, run the routing network (the simplest X·W suffices), then topk (if the variant algorithm needs gate weights, apply softmax first; otherwise it can be skipped), and then use tensor operations (selection can be done along the batch dimension) to produce the input tensor of each subnetwork/expert.
To facilitate debugging, very small, non-random, deterministic values are used to construct the input and the routing weights, and the routing network is a simple X·W.
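A sketch of such a deterministic setup (the concrete values are illustrative assumptions, not the article's original data):

```python
import numpy as np
import mindspore as ms

# 5 samples of only 3 distinct patterns, intended to be routed to 3 experts
data_inputs = ms.Tensor(np.array(
    [[1., 0., 0., 0.],   # pattern A
     [0., 1., 0., 0.],   # pattern B
     [0., 0., 1., 1.],   # pattern C
     [1., 0., 0., 0.],   # pattern A again
     [0., 1., 0., 0.]],  # pattern B again
    dtype=np.float32))

# hand-written gate weights, shape [d_model=4, n_experts=3]
gate_weights = ms.Tensor(np.array(
    [[1., 0., 0.],
     [0., 1., 0.],
     [0., 0., 1.],
     [0., 0., 1.]], dtype=np.float32))
```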
1. Routing calculation
When the 5 input samples above (of only 3 distinct patterns, intended for 3 experts) are matrix-multiplied with the gate weights, the expert to assign each sample to can be computed directly. You can use matmul, or equivalently gates_weighted = einsum('bd,de->be', [data_inputs, gate_weights]). The first round of matrix multiplication yields the per-expert score of every sample.
To multiply input and weight, in Python you can use @, matmul, or einsum with its Einstein-summation shorthand. For a plain matrix multiplication, einsum is actually split into several operators during graph compilation, so its performance is not as good; but when the input and weight exceed two dimensions and batch dimensions must stay fixed for the routing calculation, programming with einsum is very simple.
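Continuing the sketch above, three equivalent ways to compute the routing scores on a 2-D input (assuming a recent MindSpore where tensors support these operators; einsum shown in NumPy form for clarity):

```python
# 1) python @ operator on MindSpore tensors
gates_weighted = data_inputs @ gate_weights                  # shape [5, 3]
# 2) functional matmul
gates_weighted = ms.ops.matmul(data_inputs, gate_weights)
# 3) einsum form; convenient once extra fixed batch dimensions appear
gates_np = np.einsum('bd,de->be',
                     data_inputs.asnumpy(), gate_weights.asnumpy())
# each row now holds one sample's score for each of the 3 experts
```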
2. Dispatch
The main logic of conditional computation is to compute the top-k experts for each sample based on the output of the routing network, which can be realized with a topk function. Since the top-k scores serve as input information (carrying the routing weights) for the subsequent network units, the routing output generally needs to be normalized with softmax.
On-demand calculation 1: the normalized weights over all N experts (see point 2 in the technical-points section below); same shape as gates_weighted, normalized along dim=-1.
Then select the top-k experts for each sample in the batch. The top-k can be taken either from the softmax-ed scores or directly from gates_weighted (when the variant does not require softmax first). The result gives, per sample, the gate weight of each selected expert and the index of each selected expert.
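A sketch of this selection using MindSpore's Softmax and TopK primitives (k=2 is an assumed setting, continuing the running example):

```python
softmax = ms.ops.Softmax(axis=-1)
topk = ms.ops.TopK(sorted=True)

gates_softmaxed = softmax(gates_weighted)               # normalize over all experts
expert_gates, expert_index = topk(gates_softmaxed, 2)   # k=2 experts per sample
# expert_index: [batch, k] expert ids per sample
# expert_gates: [batch, k] corresponding gate weights
```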
On-demand calculation 2: the normalized weights among the top-k experts.
How do we extract, from the original input and according to the assigned indices, the tensor each expert processes? Mainstream intelligent platforms currently have no dedicated operator for this, but the same effect can be achieved by combining other operators. In MindSpore, an operator can be implemented in the underlying C++, or by inheriting Cell in Python and implementing bprop, then organizing the original gate tensor into the target output according to the indices. Here we implement a Dispatch class.
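A minimal sketch of such a Dispatch cell (an assumed shape of the class, not the article's exact code); Gather already propagates gradients to the gathered rows, so an explicit bprop is only needed for more elaborate index logic:

```python
class Dispatch(ms.nn.Cell):
    """Gather the sub-batch of samples routed to one expert."""
    def __init__(self):
        super().__init__()
        self.gather = ms.ops.Gather()

    def construct(self, x, part_index):
        # x: [batch, d]; part_index: int32 indices of this expert's samples
        return self.gather(x, part_index, 0)   # gather along the batch axis
```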
3. Independent calculation
Directly invoke the subsequent expert networks in parallel. The parallelism can be supported by the platform, identified by a special function or annotation, or optimized into parallel execution when the platform compiles the graph. (In network models without dynamically routed conditional computation, there is generally no similar optimization.)
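A sketch of the sequential form under the running example; ms.nn.Dense stands in for a real expert subnetwork, and all names here are assumptions:

```python
experts = [ms.nn.Dense(4, 4) for _ in range(3)]  # 3 stand-in expert subnets
idx_np = expert_index.asnumpy()                  # [batch, k] from the top-k step

expert_outputs, expert_samples = [], []
dispatch = Dispatch()
for eid, expert in enumerate(experts):
    rows = np.where((idx_np == eid).any(axis=1))[0]   # samples routed to eid
    part_index = ms.Tensor(rows.astype(np.int32))
    sub_batch = dispatch(data_inputs, part_index)     # 2. dispatch
    expert_outputs.append(expert(sub_batch))          # 3. independent compute
    expert_samples.append(part_index)
```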
4. Combine
The combination logic is relatively simple: first concatenate along the batch dimension, then construct a correctly shaped zeros tensor and use index_add with the dispatch indices to merge the results of the expert networks back in the original input order, as the output of the MoE module.
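A sketch of the combine step for the running example; NumPy's add.at plays the role of index_add here (MindSpore also offers an IndexAdd primitive for an in-graph version), and a full implementation would additionally weight each expert's rows by expert_gates:

```python
concat = ms.ops.Concat(axis=0)
all_out = concat(tuple(expert_outputs)).asnumpy()   # stacked expert results
all_idx = np.concatenate([p.asnumpy() for p in expert_samples])

combined = np.zeros((5, 4), dtype=np.float32)       # zeros tensor, full batch
np.add.at(combined, all_idx, all_out)               # index_add back to the
                                                    # original sample order
# combined is now the MoE output, rows in the original input order
```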
This completes the full computation process of the MoE module.
Code framework
We extend the basic tensor-operation logic of dynamic-routing conditional computation above into a complete training code framework (a skeletal sketch of the Route cell follows the list):
- class Dispatch(ms.nn.Cell): implements the dispatch logic of routing
- class Combine(ms.nn.Cell): implements the assembly logic of routing
- class Route(ms.nn.Cell): completes the entire dynamic-routing logic; can be implemented as a fairly general class
- class Expert(ms.nn.Cell): the expert network customized by the platform user
- class Network(ms.nn.Cell): the large network customized by the platform user
- class MSELoss(ms.nn.Cell): implements the MSE loss and the auxiliary-loss logic
- class OutputLossGraph(ms.nn.Cell): outputs inference result and loss, for single-step execution in PyNative mode
- class Dataset: dataset class that only guarantees a reasonable correspondence between input shapes X and Y, for illustration only
- def train(…): the training entry
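A skeletal sketch of the Route cell wiring the four steps together (the names, shapes, and CellList layout are assumptions for illustration, not the article's exact framework code):

```python
class Route(ms.nn.Cell):
    def __init__(self, d_model, experts, k=1):
        super().__init__()
        self.gate_weights = ms.Parameter(ms.Tensor(
            np.random.randn(d_model, len(experts)).astype(np.float32)))
        self.experts = ms.nn.CellList(experts)
        self.k = k
        self.softmax = ms.ops.Softmax(axis=-1)
        self.topk = ms.ops.TopK(sorted=True)

    def construct(self, x):
        logits = ms.ops.matmul(x, self.gate_weights)      # 1. routing calculation
        gates, index = self.topk(self.softmax(logits), self.k)
        # 2.-4.: dispatch each sub-batch, run the experts, combine the results
        # (see the step-by-step sketches in the previous section)
        ...
```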
Technical points of implementing conditional computation
1. Dynamic routing
- Non-learnable routing
For example, routing with LSH (locality-sensitive hashing): placing LSH at the front of the learnable network distributes samples while avoiding the problem of differentiating through LSH; if an LSH module is inserted in the middle of the network, the deterministic algorithm needs gradient estimation to pass partial gradients through.
- Learnable routing
The simple approach is to define gate_weights as a learnable Parameter. For 2-D tensors, complete the weighted routing calculation with Python's @ or matmul; for higher-dimensional tensors whose batch dimensions must stay fixed, complete it with an einsum expression of the form einsum('bd,de->be').
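A sketch of the learnable definition (shapes assumed; the 3-D einsum subscripts are one possible generalization of the 2-D form):

```python
gate_weights = ms.Parameter(ms.Tensor(
    np.random.randn(8, 3).astype(np.float32)), name='gate_weights')

x2d = ms.Tensor(np.random.randn(5, 8).astype(np.float32))
y2d = ms.ops.matmul(x2d, gate_weights)               # 2-D case: @ or matmul

x3d = np.random.randn(2, 5, 8).astype(np.float32)    # extra fixed batch dim
y3d = np.einsum('bsd,de->bse', x3d, gate_weights.asnumpy())  # einsum form
```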
2. The relationship between topk and softmax
Between the two gate implementations G_1(x) = softmax(topk(X·W)) and G_2(x) = topk(softmax(X·W)), placing softmax before or after topk leaves the choice of the top-k experts unchanged. But when G(x) is needed as part of the subsequent network's input, i.e., the routing-weight information feeds the subsequent network, consider: if normalized weights over all N experts are needed, put softmax before top-k; otherwise, put softmax after top-k to compute normalized weights among the top-k experts.
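A self-contained NumPy check of this point (the logits are hypothetical): both orders select the same experts, but the normalization base differs:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])
k = 2
idx = np.argsort(-logits, -1)[:, :k]                 # same top-k either way

g1 = softmax(np.take_along_axis(logits, idx, -1))    # G_1: normalized among top-k
g2 = np.take_along_axis(softmax(logits), idx, -1)    # G_2: normalized over all N
print(idx, g1, g2)   # g1 sums to 1.0; g2 sums to less than 1.0
```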
3. How to balance experts within a batch
Sum the routing weights of each expert over the batch, i.e., accumulate the weights of the one or more experts assigned to each sample, to obtain each expert's importance; count, per expert, the samples whose routing weight for it is non-zero to obtain the load. Then use coefficient_of_variation(importance) + coefficient_of_variation(load) as auxiliary_loss to be optimized, balancing importance and load. The coefficient of variation measures the dispersion of dimensionless data; the more dispersed the values, the worse the balance here, so the loss should be minimized.
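A NumPy sketch of this auxiliary loss (the gate-matrix values are assumptions; rows are samples, columns are experts, and zeros mark unselected experts after top-k):

```python
import numpy as np

def coefficient_of_variation(x, eps=1e-10):
    return np.std(x) / (np.mean(x) + eps)

gates = np.array([[0.7, 0.3, 0.0],   # per-sample routing weights after top-k
                  [0.0, 0.6, 0.4],
                  [0.8, 0.0, 0.2]])

importance = gates.sum(axis=0)       # summed routing weight per expert
load = (gates > 0).sum(axis=0)       # number of samples routed to each expert
auxiliary_loss = (coefficient_of_variation(importance)
                  + coefficient_of_variation(load))
```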
In a model with multiple MoE layers, such as Transformer, the auxiliary_loss terms of the several groups are combined into one auxiliary_loss and then added to the dominant loss.