Alibaba Cloud takes the top spot in FewCLUE! Knowledge-enhanced pre-training + small-sample learning in practice
Introduction: On July 8, CLUE, the authoritative evaluation benchmark for Chinese language understanding, released the latest results of its Chinese small-sample learning leaderboard. Alibaba Cloud ranked first in total score on both tracks and first in the final defense.
Author | Tong Run, Gui Yu, Xiong Xi
Source | Alibaba Technical Official Account
On July 8, CLUE, the authoritative evaluation benchmark for Chinese language understanding, released the latest results of its Chinese small-sample learning leaderboard. The Alibaba Cloud computing platform PAI team and the Dharma Academy intelligent dialogue and service technology team worked together and ranked first in total score on both tracks (large models and unrestricted models), and first in the final defense.
Since its establishment, CLUE has released a number of NLP evaluation benchmarks, including classification, reading comprehension, and natural language inference leaderboards, which have had a profound impact in academia and industry. Among them, FewCLUE is a newly launched Chinese small-sample (few-shot) learning benchmark that evaluates whether machine learning models can master specific natural language processing tasks from very few samples. With this benchmark, researchers can measure the generalization and accuracy of trained models more precisely. For example, user intent recognition in an intelligent customer service scenario needs only a few dozen manually labeled samples to reach 90% accuracy.
Although large-scale pre-trained models have achieved strong results on major tasks, they still need a lot of labeled data for each specific task. Because collecting and labeling training data is expensive, small-sample learning techniques are needed: they use far less data than classic deep learning algorithms while approaching or even exceeding their accuracy. For this evaluation, the Alibaba Cloud PAI team and Dharma Academy jointly proposed a combined large-model + small-sample solution. On top of large-scale general pre-training, it combines knowledge-based pre-training with Fuzzy-PET small-sample learning and achieved excellent results. On one small-sample learning task, its accuracy even surpasses that of humans.
II. Competition task analysis & modeling approach
The overall characteristics of the competition data set are as follows:
- Small sample: Both the training set and the test set contain 16 samples (16-shot) per category, which tests the robustness of the algorithm in small-sample scenarios
- Generalization: The tasks differ markedly from one another, so the model needs good generalization ability
- Unlabeled data: Most tasks provide a considerable amount of unlabeled data, which makes continued pre-training and self-training worth trying
Based on the interpretation of the competition questions, we designed a three-stage modeling method:
- Pre-training from scratch on general-domain data: With the acceleration strategies and pre-training toolkit provided by PAI-Rapidformer, we pre-trained Chinese models at the 300 million and 1.5 billion parameter scale from scratch. The pre-training process uses a knowledge-based pre-training algorithm (see 3.2 for details).
- Multi-task continued pre-training: The goal is to further strengthen performance on the sentence-pair matching tasks (OCNLI, BUSTM, CSL). We convert classification tasks into textual entailment tasks and use the resulting entailment data for continued pre-training. For example: [CLS]I like the movie[SEP]This indicates positive user sentiment[EOS]
- Small-sample fine-tuning for each task: We chose PET (Pattern-Exploiting Training) as the core downstream fine-tuning method and developed the Fuzzy-PET algorithm, which reduces the fluctuation caused by manually choosing PET label words and improves results on the tasks. We also apply self-training, a semi-supervised method, in the downstream fine-tuning stage (see 3.3 for details).
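The conversion of a classification sample into textual entailment inputs, described in the second stage above, might look like the following minimal sketch. The function name `to_entailment_inputs`, the hypothesis wording for the negative class, and the choice to emit one non-entailed pair per wrong label are assumptions made for illustration, not the team's actual pipeline. Only the `[CLS]...[SEP]...[EOS]` layout comes from the example in the text.

```python
def to_entailment_inputs(text, gold_label, label_hypotheses):
    """Turn one classification sample into premise/hypothesis pairs.

    label_hypotheses maps each class label to a natural-language
    hypothesis (hypothetical wording, chosen for this sketch).
    """
    pairs = []
    for label, hypothesis in label_hypotheses.items():
        pairs.append({
            # Same layout as the example in the text above.
            "input": f"[CLS]{text}[SEP]{hypothesis}[EOS]",
            # The pair counts as entailed only for the gold label.
            "entailed": label == gold_label,
        })
    return pairs


hypotheses = {
    "positive": "This indicates positive user sentiment",
    "negative": "This indicates negative user sentiment",
}
pairs = to_entailment_inputs("I like the movie", "positive", hypotheses)
for p in pairs:
    print(p["entailed"], p["input"])
```

Casting every task into the same premise/hypothesis form is what lets a single entailment objective be shared across otherwise different classification tasks.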
III. Core technologies
1. PyTorch large model training acceleration
Since launching PAI-EasyTransfer, a framework for NLP and transfer learning, in 2020, the PAI team has developed a PyTorch version of EasyTransfer named EasyTexMiner. The models used in the competition were pre-trained with EasyTexMiner's high-performance distributed pre-training, which combines the strengths of Microsoft's DeepSpeed and NVIDIA's Megatron. The overall block diagram is as follows:
EasyTexMiner's distributed training incorporates the following core technologies:
1) Activation Checkpoint
Several checkpoints are placed at intermediate points of the neural network, and all intermediate results other than the checkpoints are discarded. When backpropagation needs an intermediate result to compute a derivative, it is recomputed starting from the nearest checkpoint. This saves GPU memory while avoiding the cost of recomputing from the very beginning of the network.
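The idea can be sketched in plain Python on a toy chain of scalar layers. This is an illustration under assumed names (`grad_full`, `grad_checkpointed`, the `tanh` layer) and is not EasyTexMiner code: it keeps only every `every`-th activation in the forward pass and recomputes the rest segment by segment during the backward pass.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return w * (1.0 - y * y)


def grad_full(weights, x0):
    """Baseline: keep every activation, then backpropagate."""
    acts = [x0]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= layer_grad(weights[i], acts[i])
    return g, len(acts)


def grad_checkpointed(weights, x0, every):
    """Keep only every `every`-th activation; recompute the others."""
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(weights):
        x = layer(w, x)
        if (i + 1) % every == 0 and (i + 1) < len(weights):
            checkpoints[i + 1] = x
    g = 1.0
    peak = len(checkpoints)
    end = len(weights)
    for start in sorted(checkpoints, reverse=True):
        # Recompute the inputs of layers start..end-1 from the checkpoint.
        seg = [checkpoints[start]]
        for i in range(start, end - 1):
            seg.append(layer(weights[i], seg[-1]))
        peak = max(peak, len(checkpoints) + len(seg) - 1)
        for i in reversed(range(start, end)):
            g *= layer_grad(weights[i], seg[i - start])
        end = start
    return g, peak


weights = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.6, -1.2]
g_full, stored_full = grad_full(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, every=4)
print(g_full, g_ckpt, stored_full, stored_ckpt)
```

Both functions return the same gradient with respect to the input, but the checkpointed version holds fewer activations at any one time, at the price of one extra forward pass per segment.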
2) Gradient Accumulation
Taking batch_size=16 as an example: compute the average gradient of 16 samples at a time and accumulate it in a buffer. After 4 such passes, divide the accumulated gradient by 4 and then perform the parameter update. The effect is equivalent to batch_size=64. This is an effective way to increase the batch size. With this strategy the batch size of each step can be made very large, and combined with the LAMB optimizer this improves convergence speed.
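The equivalence described above can be checked numerically with a small sketch. The one-parameter least-squares model and the random data are assumptions chosen only to keep the example self-contained.

```python
import random


def mean_grad(w, batch):
    # Gradient of the mean of 0.5 * (w * x - y)^2 with respect to w.
    return sum((w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(64)]
w = 0.5

micro_batch = 16
steps = len(data) // micro_batch          # 4 accumulation steps

accumulated = 0.0
for k in range(steps):
    chunk = data[k * micro_batch:(k + 1) * micro_batch]
    accumulated += mean_grad(w, chunk)    # add to the gradient buffer
accumulated /= steps                      # divide the total by 4

full_batch = mean_grad(w, data)           # one pass with batch_size=64
print(accumulated, full_batch)
```

The two values agree up to floating-point rounding, so the parameter update is the same while only 16 samples need to be held in memory at a time.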
3) Mixed Precision Training
The main benefits of using mixed precision training are as follows:
- Reduce GPU memory usage. Since FP16 takes only half the memory of FP32, it can save roughly half of the GPU memory used during training.
- Speed up training and inference. In addition to saving memory, FP16 also shortens model training time. The specific principle is shown in the figure below. The core idea is to keep an FP32 master copy of the weights to avoid rounding errors when parameters are updated during backpropagation. In addition, loss scaling is used to mitigate the underflow of small gradient values in FP16.
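Both safeguards mentioned above, the FP32 master copy and loss scaling, can be illustrated with the standard library alone. In this sketch the helper `to_fp16` and the numeric values are assumptions chosen for illustration, and ordinary Python floats stand in for the higher-precision copy.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# 1) Why a higher-precision master copy of the weights is kept:
#    near 1.0 the FP16 grid is too coarse to register a 1e-4 update.
weight, update = 1.0, 1e-4
fp16_only = to_fp16(to_fp16(weight) + to_fp16(update))  # update is rounded away
master = weight + update                                # update is preserved

# 2) Why the loss (and therefore every gradient) is scaled:
#    a very small gradient underflows to zero in FP16.
tiny_grad = 1e-8
scale = 65536.0
unscaled = to_fp16(tiny_grad)                    # becomes 0.0
recovered = to_fp16(tiny_grad * scale) / scale   # survives, then unscaled

print(fp16_only, master, unscaled, recovered)
```

In a real training loop the unscaling happens on the higher-precision gradients just before the optimizer step, and the scale factor is typically adjusted dynamically.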
4) Just-in-time (JIT) compilation
When PyTorch performs a series of element-wise Tensor operations, the underlying kernels read and write memory repeatedly but perform only a small amount of computation, so most of the time is spent on memory access rather than on calculation. For example, a multiplication or addition kernel on a Tensor with N elements performs N arithmetic operations but needs 2N reads and N writes. Kernels with little computation and many memory accesses are called memory-bound. Kernel fusion avoids this repeated reading and writing and reduces kernel launch overhead. For memory-bound kernels, its core principle is to exploit locality of memory access and automatically merge multiple element-wise kernels into one, so that intermediate results are not written back to memory and memory bandwidth is used more efficiently. Because the kernels are merged into one, the kernel launch overhead is also paid only once.
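The memory-traffic argument can be made concrete with a small counting model. This is not a JIT compiler: the functions below (hypothetical names) simply compute `a * x + b` element by element and tally the reads, writes, and kernel launches that an unfused and a fused implementation would incur.

```python
def unfused_mul_add(a, x, b):
    """Two separate element-wise kernels with an intermediate buffer."""
    reads = writes = 0
    tmp = []
    for ai, xi in zip(a, x):      # kernel 1: tmp = a * x
        tmp.append(ai * xi)
        reads += 2
        writes += 1
    out = []
    for ti, bi in zip(tmp, b):    # kernel 2: out = tmp + b
        out.append(ti + bi)
        reads += 2
        writes += 1
    return out, {"reads": reads, "writes": writes, "launches": 2}


def fused_mul_add(a, x, b):
    """One fused kernel: the intermediate product never leaves registers."""
    reads = writes = 0
    out = []
    for ai, xi, bi in zip(a, x, b):
        out.append(ai * xi + bi)
        reads += 3
        writes += 1
    return out, {"reads": reads, "writes": writes, "launches": 1}


a, x, b = [1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]
out_unfused, cost_unfused = unfused_mul_add(a, x, b)
out_fused, cost_fused = fused_mul_add(a, x, b)
print(cost_unfused, cost_fused)
```

For N elements the unfused version costs 4N reads and 2N writes, while the fused version costs 3N reads and N writes with a single launch, and the saving grows with every additional element-wise operation that is fused in.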
5) 3D parallelism
The 3D parallel strategy combines data parallelism, model parallelism, and pipeline parallelism to train models with tens to hundreds of billions of parameters quickly. This technology was first developed by the DeepSpeed team and accelerates the training of large models.
6) CPU Offload
Backpropagation is computed on the CPU rather than on the GPU, and the intermediate variables it uses are stored in host memory. This saves GPU memory and trades time for space, so that larger models can fit.
7) ZeRO memory optimizer
ZeRO (The Zero Redundancy Optimizer) is a new memory optimization technology for large-scale distributed deep learning. ZeRO has three main optimization stages:
- Optimizer state partitioning (P_os): 4x memory reduction, with the same communication volume as data parallelism;
- Add gradient partitioning (P_os+g): 8x memory reduction, with the same communication volume as data parallelism;
- Add parameter partitioning (P_os+g+p): memory reduction scales linearly with the degree of data parallelism.
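The 4x and 8x figures follow from a simple per-parameter byte count. The sketch below assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes for FP16 weights, 2 bytes for FP16 gradients, 12 bytes of optimizer state per parameter). The function name and the example sizes are hypothetical, and activations and buffers are ignored.

```python
def zero_memory_gb(num_params, dp_degree, stage, k=12):
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam: 2 bytes/param of FP16 weights,
    2 bytes/param of FP16 gradients, and k bytes/param of optimizer
    state (FP32 weights, momentum, variance).
    """
    p = num_params
    if stage == 0:      # plain data parallelism, everything replicated
        total = (2 + 2 + k) * p
    elif stage == 1:    # P_os: optimizer state partitioned
        total = (2 + 2) * p + k * p / dp_degree
    elif stage == 2:    # P_os+g: gradients partitioned as well
        total = 2 * p + (2 + k) * p / dp_degree
    elif stage == 3:    # P_os+g+p: parameters partitioned as well
        total = (2 + 2 + k) * p / dp_degree
    else:
        raise ValueError("stage must be 0, 1, 2 or 3")
    return total / 1e9


params, gpus = 1.2e9, 64   # hypothetical model size and GPU count
memory = {s: zero_memory_gb(params, gpus, s) for s in range(4)}
for stage, gb in memory.items():
    print(f"stage {stage}: {gb:.3f} GB per GPU")
```

With 64 GPUs the reduction for the first two stages is already close to the asymptotic 4x and 8x, and the third stage divides the full footprint by the number of data-parallel GPUs.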
Throughput performance evaluation
This work uses the latest Alibaba Cloud EFLOPS AI cluster system with NVIDIA A100 GPUs and 100Gbps Mellanox CX6-DX network cards. Combined with ACCL, a system-wide topology-aware high-performance distributed communication library, and the multi-rail network capabilities of the EFLOPS cluster, it achieves congestion-free communication and greatly accelerates model training. As shown below:
To evaluate the scalability of model-parallel training, we use a model that is larger than BERT-Large and does not fit on a single card. The configuration is num-layers=24, hidden-size=2048, num-attention-heads=32, for a total of about 1.2B parameters. We measured throughput on 8, 16, 32, and 64 cards. According to the indicators in the figure below, throughput increases almost linearly with the number of cards.
2. KGBERT, a pre-training algorithm incorporating knowledge
On the basis of the general pre-training model, we consider pre-training that incorporates knowledge to improve the effect of the pre-training model.
Data and knowledge: Through cooperation with the NLP data team of Dharma Academy, we have obtained large-scale, high-quality and diverse data and knowledge.
- Large-scale: 500 million Chinese knowledge-graph entries and 200 million Sentence-SPO pairs obtained through distant supervision;
- High quality: The original corpus is large and messy, with a lot of redundancy and noise. Using the DSGAN knowledge denoising algorithm, we selected hundreds of millions of high-quality Sentence-SPO pairs for model training;
- Diversity: In addition to general domains, the FewCLUE data set also covers vertical industries such as e-commerce, tourism, education, and finance, where data and knowledge are relatively scarce. For this reason, we built an efficient knowledge production system that automatically extracts triples from documents and web pages of all kinds of vertical industries, which greatly enriches the knowledge.
Model and pre-training tasks
To use knowledge efficiently, we designed multi-granularity semantic understanding pre-training tasks based on an aligned corpus of "Sentence, positive SPO, negative SPO":
- Mention Detection: Enhance the model's understanding of core entity mentions;
- Sentence-SPO joint Mask: Feed large-scale text and its corresponding SPO knowledge into the pre-training model together for joint pre-training, which promotes information sharing between structured knowledge and unstructured text and improves the model's semantic understanding;
- SPO Margin Magnify: A contrastive learning pre-training task that widens the semantic gap between SPOs related to the sentence and irrelevant SPOs, giving the model stronger semantic discrimination.
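The article does not give the exact objective for SPO Margin Magnify. One common way to widen the gap between a related and an unrelated SPO is a hinge-style margin loss on similarity scores, sketched below as an assumption. The function names, the cosine similarity, the margin value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def spo_margin_loss(sentence, pos_spo, neg_spo, margin=0.5):
    """Zero once the positive SPO beats the negative SPO by `margin`."""
    return max(0.0, margin - cosine(sentence, pos_spo) + cosine(sentence, neg_spo))


sentence = [1.0, 0.0]   # toy sentence embedding
well_separated = spo_margin_loss(sentence, pos_spo=[1.0, 0.0], neg_spo=[0.0, 1.0])
confused = spo_margin_loss(sentence, pos_spo=[0.0, 1.0], neg_spo=[1.0, 0.0])
print(well_separated, confused)
```

The loss is zero when the sentence is already closer to its positive SPO by at least the margin, and grows when the negative SPO scores higher, which pushes the two apart during training.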
Technological innovation: knowledge screening and integration mechanism
In NLP tasks, a common practice is to model only the current natural-language input, which uses just the local, literal information in that input. This clearly differs from how humans understand language: we draw on knowledge we have already learned to strengthen our understanding, and without such knowledge, for example when encountering an unfamiliar field, it is hard to fully grasp the meaning. Because common NLP practice uses only the input and no external knowledge, its level of understanding is limited.
In reality, knowledge is vast and complex, so it must be sampled selectively to reduce the introduction of irrelevant knowledge and maximize its benefit.
We designed a novel gating mechanism: the sentence is encoded first, subgraph information is then aggregated through a GCN, and the gate controls how much information flows in. In the pre-training stage, we designed an objective function that maximizes knowledge gain so that the model learns the most valuable information.
Knowledge screening based on the gating mechanism effectively captures high-gain triples for integration and improves accuracy on government-affairs and financial attribute recognition tasks by 2%. This mechanism has been validated on public academic data sets, where it achieved state-of-the-art results, and the related work has been published at SIGIR 2021.
3. Small sample learning algorithm
Building on the knowledge-enhanced pre-trained language model, the computing platform PAI and the Dharma Academy team jointly developed Fuzzy-PET, a multi-task small-sample learning algorithm. The FewCLUE leaderboard contains a series of tasks in different categories. If the model can learn transferable cross-task knowledge before small-sample fine-tuning on a specific task, it starts that fine-tuning from better initial parameters. Drawing on the PAI team's experience with Meta-Learning algorithms, we introduced the unlabeled data of multiple FewCLUE tasks into the continued pre-training stage of the knowledge-enhanced model. During this stage the model automatically learns the background knowledge of these tasks from their data, which benefits small-sample learning on each specific task. The related Meta-Learning algorithms have been published at EMNLP 2020 and ACL 2021.
In the learning phase for specific small-sample tasks, we improved the Pattern-Exploiting Training (PET) algorithm and introduced the Fuzzy Verbalizer Mapping mechanism. For example, in the classic PET algorithm, for the FewCLUE task OCNLI we designed the following template: The relationship between "Actually, I think you don't understand ball" and "You don't understand basketball." is MASK.
For the token predicted at the masked position (the Verbalizer), if the prediction is "relevant", we map it to the category label "entailment"; if it is "irrelevant", we map it to "neutral"; if it is "opposite", we map it to "contradiction". By manually mapping Verbalizers to category labels, PET models text classification tasks. In the Fuzzy Verbalizer Mapping mechanism, we assume that several Verbalizers may map to the same category label, which further improves the model's generalization during small-sample learning. Continuing the example, we design three sets of label words: relevant / irrelevant / opposite; implied / neutral / contradictory; and contained / neutral / reverse. During training, each sample uses multiple sets of label words as input. During inference, the predicted probabilities of all candidate words are computed and summed for each category, and the category with the highest total probability is selected. In the example above, if the summed probability of "relevant", "implied", and "contained" exceeds both the sum for "irrelevant", "neutral", "neutral" and the sum for "opposite", "contradictory", "reverse", the prediction is "entailment".
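The inference rule described above, summing the probabilities of all label words per category and taking the maximum, can be sketched as follows. The English label words are stand-ins for the Chinese ones ("impartial" stands in for the second neutral-like word), and the probabilities are made-up values rather than real model outputs.

```python
LABEL_WORDS = {
    "entailment": ["relevant", "implied", "contained"],
    "neutral": ["irrelevant", "neutral", "impartial"],
    "contradiction": ["opposite", "contradictory", "reverse"],
}


def fuzzy_predict(mask_probs, label_words=LABEL_WORDS):
    """Sum the masked-token probabilities of every label word per category."""
    scores = {
        label: sum(mask_probs.get(word, 0.0) for word in words)
        for label, words in label_words.items()
    }
    return max(scores, key=scores.get), scores


def single_verbalizer_predict(mask_probs, label_words=LABEL_WORDS):
    """Classic PET for comparison: only the first label word per category."""
    scores = {label: mask_probs.get(words[0], 0.0)
              for label, words in label_words.items()}
    return max(scores, key=scores.get), scores


# Hypothetical probabilities predicted for the MASK position.
mask_probs = {
    "relevant": 0.20, "implied": 0.15, "contained": 0.10,
    "irrelevant": 0.25, "neutral": 0.05, "impartial": 0.05,
    "opposite": 0.10, "contradictory": 0.05, "reverse": 0.05,
}
fuzzy_label, fuzzy_scores = fuzzy_predict(mask_probs)
single_label, _ = single_verbalizer_predict(mask_probs)
print(fuzzy_label, single_label)
```

In this made-up case a single label word would favor "neutral" because "irrelevant" is the most likely individual token, while the summed scores favor "entailment". This is the kind of sensitivity to label-word choice that the fuzzy mapping is meant to smooth out.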
This mechanism improves prediction accuracy on multiple FewCLUE tasks and, to a certain extent, reduces the fluctuation caused by manually choosing different label words. In addition, we introduce unlabeled data for self-training in the small-sample learning stage, that is, we use the existing model to label the unlabeled data and iteratively optimize the model.
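A self-training loop of this kind can be sketched with a toy model. Everything below is an assumption for illustration: the nearest-centroid classifier on one-dimensional features, the confidence measure, the 0.8 threshold, and the data. The article does not describe the actual selection criterion.

```python
def fit(labeled):
    """Toy model: one centroid (mean feature value) per class."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict(model, x):
    """Return (label, confidence); confidence approaches 1 when x is
    much nearer to the best centroid than to the runner-up."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    d_best = abs(x - model[ranked[0]])
    d_next = abs(x - model[ranked[1]])
    if d_best + d_next == 0:
        return ranked[0], 0.5
    return ranked[0], 1.0 - d_best / (d_best + d_next)


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit(labeled)
        scored = [(x, *predict(model, x)) for x in pool]
        confident = [(x, y) for x, y, c in scored if c >= threshold]
        if not confident:
            break                       # nothing new to learn from
        labeled.extend(confident)       # adopt confident pseudo-labels
        pool = [x for x, _, c in scored if c < threshold]
    return fit(labeled), labeled, pool


seed = [(0.0, "neg"), (1.0, "pos")]
unlabeled = [0.1, 0.9, 0.45, -0.2, 1.3]
model, final_labeled, remaining = self_train(seed, unlabeled)
print(model, len(final_labeled), remaining)
```

Only predictions above the confidence threshold are adopted as pseudo-labels, so ambiguous samples such as 0.45 stay unlabeled instead of reinforcing a possible mistake.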
IV. Business applications & products
It is worth mentioning that, built on the machine learning platform PAI, this technology has already been deployed in real business scenarios and performs well. These technologies have strengthened the KBQA capabilities of Dharma Academy's Cloud Xiaomi, giving it fast cold start and accurate question answering, and it has been deployed in government affairs, finance, and general-purpose business scenarios. In actual projects, a fast cold start and accurate question answering can be achieved with only a small number of samples (20 items). These technologies are also expected to give machine learning algorithms on Alibaba Cloud the ability to learn from small samples, so that a small amount of data annotation greatly improves downstream task performance. This means Alibaba Cloud models can be deployed quickly and at low cost, and can empower enterprise businesses efficiently and flexibly.
Based on PAI, Alibaba Cloud hopes to build large-scale, end-to-end AI capabilities, from the underlying chips to distributed systems and on to the scaling of upper-level algorithms and data, and to build coordinated AI engineering capabilities that serve all industries. At present, the PAI platform supports accelerated training with hundreds of billions of features and trillions of samples. It has more than 200 mature built-in algorithms and more than 50 high-quality pre-trained deep learning models for AI fields such as image vision, audio and video, and text, which comprehensively improves the efficiency of enterprise AI engineering. On top of these platform capabilities, PAI also provides mature industry solutions and has become the preferred service for many companies. It is in mature commercial use in scenarios such as intelligent recommendation, user growth, on-device super-resolution, and autonomous driving.