Recently, FSL++, the small-sample (few-shot) learning model from the Semantic Understanding Team of the NLP Center in Meituan's Search and NLP Department, topped the FewCLUE list, the authoritative evaluation benchmark for Chinese small-sample language understanding. It also took first place on the single task of natural language inference (OCNLI). With very few samples (just over 100 per category), it exceeded human accuracy on news classification (TNEWS) and scientific literature subject classification (CSLDCP).
1 Overview
CLUE (Chinese Language Understanding Evaluation) [1] is an authoritative evaluation list for Chinese language understanding. It covers text classification, inter-sentence relationship, reading comprehension and many other semantic analysis and semantic understanding sub-tasks, and it has had a considerable impact in both academia and industry.
FewCLUE [2,3] is a sub-list of CLUE dedicated to evaluating Chinese small-sample learning. It aims to build on the general and powerful generalization capabilities of pre-trained language models to explore the best models and practices for small-sample learning in Chinese. Some FewCLUE datasets have only slightly more than 100 labeled samples, so they measure how well a model generalizes from very few labeled samples. The list has attracted the participation of many enterprises and research institutes. Not long ago, FSL++, the small-sample learning model of the Semantic Understanding Team of the NLP Center in the Meituan Platform Search and NLP Department, took first place on the FewCLUE list and reached the SOTA level.
2 Method introduction
Although large-scale pre-trained models have achieved very good results on a wide range of tasks, they still require a lot of labeled data for each specific task. Meituan's businesses contain rich NLP scenarios, which often carry high manual labeling costs. In the early stage of a business, or when new requirements need to be launched quickly, labeled samples are often insufficient, and the traditional deep learning recipe of Pretrain (pre-training) + Fine-Tune (fine-tuning) often fails to meet the target metrics. It is therefore necessary to study model training in small-sample scenarios.
This article proposes FSL++, a joint training scheme for large models with small samples. It integrates model optimization strategies such as model structure optimization, large-scale pre-training, sample enhancement, ensemble learning and self-training. FSL++ has achieved excellent results on the FewCLUE list: its performance exceeds human level on some tasks, while there is still room for improvement on others (such as CLUEWSC).
After the release of FewCLUE, NetEase Fuxi used its self-developed EET model [4], enhanced the model's semantic understanding through secondary training, and then added templates for multi-task learning. IDEA Research Institute's Erlangshen model [5] builds on BERT, uses more advanced pre-training techniques to train a large model, and uses a Masked Language Model (MLM) objective with a dynamic masking strategy as an auxiliary task when fine-tuning on downstream tasks. These methods all use Prompt Learning as the basic task architecture. Compared with these self-developed large models, our method mainly adds model optimization strategies such as sample enhancement, ensemble learning and self-training on top of the Prompt Learning framework, which improves the task performance and robustness of the model. The method can also be applied to various pre-trained models, which makes it more flexible and convenient.
The overall model structure of FSL++ is shown in Figure 2 below. The FewCLUE dataset provides 160 pieces of labeled data and nearly 20,000 pieces of unlabeled data for each task. In this FewCLUE practice, we first constructed multi-template Prompt Learning in the Fine-Tune stage and applied enhancement strategies such as adversarial training, contrastive learning and Mixup to the labeled data. Because these strategies rely on different enhancement principles, the resulting models can be expected to differ significantly from one another, and such models tend to benefit more from ensemble learning. After training with the data augmentation strategies, we therefore have multiple weakly supervised models, which we use to predict on the unlabeled data and obtain pseudo-label distributions. We then integrate the pseudo-label distributions predicted by the different data augmentation models into a single overall pseudo-label distribution for the unlabeled data, reconstruct the multi-template Prompt Learning, apply the data augmentation strategies again, and choose the best strategy. Our experiments currently perform only one iteration. Multiple iterations are possible, but the improvement becomes less and less noticeable as the number of iterations increases.
2.1 Augmented pre-training
Pretrained language models are trained on huge unlabeled corpora. For example, RoBERTa [6] is trained on over 160GB of text, including encyclopedias, news articles, literary works, and web content. The representations learned by these models achieve excellent performance on tasks containing datasets of various sizes from multiple sources.
The FSL++ model uses RoBERTa-large as the base model and adopts two continued pre-training methods: Domain-Adaptive Pretraining (DAPT) [7], which incorporates domain knowledge, and Task-Adaptive Pretraining (TAPT) [7], which incorporates task knowledge. DAPT continues training the pre-trained language model on a large amount of unlabeled in-domain text, and then fine-tunes it on the dataset of the specified task.
Continued pre-training on the target text domain can improve the performance of the language model, especially on downstream tasks related to that domain. Moreover, the more relevant the pre-training text is to the task domain, the greater the improvement. In this practice, we finally use the RoBERTa Large model pre-trained on CLUE Vocab [8], a 100G corpus covering fields such as entertainment programs, sports, health, international affairs, movies and celebrities. TAPT continues pre-training the pre-trained model on a small amount of unlabeled text that is directly related to the task. For TAPT, the pre-training data we use is the unlabeled data that the FewCLUE list provides for each task.
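To make the continued pre-training step more concrete, here is a minimal sketch of how one MLM training example could be built from unlabeled task text. It assumes that DAPT/TAPT reuse the standard masked-language-model objective, which this article does not spell out; the 15% masking rate and the 80/10/10 replacement split follow the convention from BERT [9], and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Build one MLM example: returns (masked_tokens, labels).

    labels[i] is the original token at positions chosen for prediction
    and None everywhere else. Rates are BERT-style assumptions.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)                   # position is not scored
            masked.append(tok)
    return masked, labels
```

Calling the function afresh in every epoch gives a different mask each time, which is the idea behind dynamic masking.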
In addition, for inter-sentence relation tasks such as the Chinese natural language inference task OCNLI and the Chinese dialogue short-text matching task BUSTM, we initialize from model parameters pre-trained on other inter-sentence relation datasets, such as the Chinese natural language inference dataset CMNLI and the Chinese short-text similarity dataset LCQMC. Compared with starting directly from the original model, this also improves results to some extent.
2.2 Model structure
FewCLUE contains multiple task forms, and we choose a suitable model structure for each. In text classification tasks and machine reading comprehension (MRC) tasks, the category words themselves carry information, so these tasks are better modeled in Masked Language Model (MLM) form. Inter-sentence relation tasks judge the relationship between two sentences, which is closer to the Next Sentence Prediction (NSP) [9] task form. Therefore, we choose the PET [10] model for classification and reading comprehension tasks, and the EFL [11] model for inter-sentence relation tasks. The EFL method can construct negative samples through global sampling and learn a more robust classifier.
2.2.1 Prompt Learning
The main goal of Prompt Learning is to minimize the gap between the pre-training objective and the downstream fine-tuning objective. Existing pre-training tasks usually include the MLM loss, but downstream tasks typically do not use MLM and instead introduce a new classifier, which makes the pre-training task and the downstream task inconsistent. Prompt Learning does not introduce additional classifiers or other parameters. Instead, it uses a template (Template: language fragments spliced onto the input so that the task takes MLM form) and a label-word mapping (Verbalizer: for each label, a corresponding word in the vocabulary is chosen as the MLM prediction target), so that the model can be applied to downstream tasks with only a small number of samples.
Take the e-commerce review sentiment analysis task EPRSTMT shown in Figure 3 as an example. Given the text "This movie is really good, it's worth watching a second time!", traditional text classification attaches a classifier to the Embedding at the [CLS] position and maps it to a binary label (0: negative, 1: positive). This approach has to train a new classifier in a small-sample scenario, so it is difficult to obtain good results. The Prompt Learning approach instead creates a template such as "This is a [MASK] comment.", splices the template with the original text, lets the language model predict the word at the [MASK] position during training, and then maps that word to the corresponding category (good: positive, bad: negative).
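The template and Verbalizer from this example can be sketched in a few lines. This is only an illustration: `mask_word_scores` is a hypothetical stand-in for the scores a masked language model would assign to candidate words at the [MASK] position, and placing the template before the text is an assumption, since the article only says the two are spliced.

```python
TEMPLATE = "This is a [MASK] comment."
VERBALIZER = {"positive": "good", "negative": "bad"}  # label -> label word

def build_prompt(text, template=TEMPLATE):
    """Splice the template and the original text into one MLM input."""
    return f"{template} {text}"

def classify(mask_word_scores, verbalizer=VERBALIZER):
    """Return the label whose label word scores highest at [MASK].

    mask_word_scores: dict mapping candidate words to model scores
    (hypothetical; a real system would read these from the MLM head).
    """
    return max(
        verbalizer,
        key=lambda label: mask_word_scores.get(verbalizer[label], float("-inf")),
    )

prompt = build_prompt("This movie is really good, it's worth watching a second time!")
label = classify({"good": 2.1, "bad": -0.3})  # scores are made up for illustration
```

No new classifier head appears anywhere: the only learnable component is the language model that fills in [MASK].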
Because there is not enough data, it is sometimes difficult to determine which template and label-word mapping perform best. A multi-template, multi-label-word design can therefore be adopted: either design multiple templates and take the integration of their results as the final result, or design a one-to-many label-word mapping so that one label corresponds to multiple words. Following the example above, the template combinations below can be designed (left: multiple templates for the same sentence, right: multi-label-word mapping).
task example
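The two designs can be combined in a short sketch. How the results are integrated is not specified in the article, so the choices here (summing the [MASK] probabilities of all words mapped to a label, then averaging label distributions across templates) are assumptions, and all function names are hypothetical.

```python
def label_probs(mask_word_probs, verbalizer):
    """One-to-many Verbalizer: sum the [MASK] probabilities of every word
    mapped to a label, then renormalise over labels."""
    raw = {
        label: sum(mask_word_probs.get(w, 0.0) for w in words)
        for label, words in verbalizer.items()
    }
    total = sum(raw.values())
    if total == 0.0:  # no label word received any mass: fall back to uniform
        return {label: 1.0 / len(raw) for label in raw}
    return {label: v / total for label, v in raw.items()}

def ensemble_templates(per_template_probs):
    """Average the label distributions produced by several templates."""
    n = len(per_template_probs)
    labels = per_template_probs[0].keys()
    return {label: sum(p[label] for p in per_template_probs) / n for label in labels}

verbalizer = {"positive": ["good", "great"], "negative": ["bad", "poor"]}
# Hypothetical [MASK] probabilities from two different templates.
t1 = label_probs({"good": 0.3, "great": 0.3, "bad": 0.1, "poor": 0.1}, verbalizer)
t2 = label_probs({"good": 0.2, "great": 0.2, "bad": 0.2, "poor": 0.2}, verbalizer)
final = ensemble_templates([t1, t2])
```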
2.2.2 EFL
The EFL model concatenates the two sentences and feeds the embedding at the [CLS] position of the output layer into a classifier to make the prediction. During EFL training, negative samples are constructed in addition to the samples of the training set: in each batch, sentences from other examples are randomly selected as negatives, which serves as a form of data enhancement. Although the EFL model needs to train a new classifier, many public textual entailment and inter-sentence relation datasets are available, such as CMNLI and LCQMC. The model can first be trained on these datasets, and the learned parameters are then transferred to the few-shot scenario and further fine-tuned on the FewCLUE task dataset.
task example
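The in-batch negative construction described for EFL might look like the sketch below. The details are assumptions: only pairs labeled as matching (label 1) receive a sampled negative, one negative is drawn per positive, and a sentence taken from another example is presumed not to match, which can introduce some label noise. The function name is hypothetical.

```python
import random

def add_in_batch_negatives(batch, rng=None):
    """batch: list of (sent_a, sent_b, label) with label 1 = related, 0 = unrelated.

    For each positive pair, pair sent_a with the sent_b of a different,
    randomly chosen example in the same batch and label the result 0.
    """
    rng = rng or random.Random()
    augmented = list(batch)
    if len(batch) < 2:
        return augmented  # nothing to sample from
    for i, (sent_a, _, label) in enumerate(batch):
        if label != 1:
            continue
        j = rng.choice([k for k in range(len(batch)) if k != i])
        augmented.append((sent_a, batch[j][1], 0))
    return augmented

batch = [
    ("How do I reset my password?", "I forgot my password, how to reset it?", 1),
    ("Is delivery free?", "What time do you open?", 0),
    ("Where is my order?", "I want to track my package", 1),
]
augmented = add_in_batch_negatives(batch, rng=random.Random(0))
```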
2.3 Data Augmentation
Data enhancement methods fall mainly into sample enhancement and Embedding enhancement. In NLP, the purpose of data augmentation is to expand the text data without changing its semantics. The main methods include simple text replacement and using language models to generate similar sentences. We tried methods such as EDA to expand the text data, but changing a single word can flip the meaning of an entire sentence, and the replaced text carries a lot of noise, so it is difficult to generate enough good augmented data with simple rule-based sample changes. Embedding enhancement no longer operates on the input text but at the Embedding level: adding perturbation or interpolation to the Embedding can improve the robustness of the model.
Therefore, in this practice, we mainly use Embedding enhancement. The data augmentation strategies we use are Mixup [12], Manifold-Mixup [13], adversarial training (AT) [14] and the contrastive learning method R-Drop [15]. For a detailed introduction to these strategies, see our previous technical blog post, Small sample learning and its application in the Meituan scene [16].
Mixup enhances the generalization ability of the model by linearly interpolating pairs of inputs to construct new combined samples and combined labels. On various supervised and semi-supervised tasks, using Mixup can greatly improve generalization. Mixup can be regarded as a regularization operation: it requires the combined features generated by the model to satisfy a linear constraint, and uses this constraint to regularize the model. Intuitively, when the input of the model is a linear combination of two other inputs, the output should also be the same linear combination of the outputs obtained by feeding the two inputs into the model separately, which in effect asks the model to behave approximately like a linear system.
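A minimal sketch of the interpolation, written over plain Python lists so it stays self-contained. Drawing the mixing coefficient from a Beta(alpha, alpha) distribution follows the Mixup paper [12]; the value alpha=0.2 and the function name are assumptions, and in FSL++ the vectors would be sentence Embeddings rather than toy lists.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Interpolate two (embedding, one-hot label) pairs with the same lambda.

    Returns (mixed_x, mixed_y, lam), where lam ~ Beta(alpha, alpha).
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y, lam

x, y, lam = mixup([1.0, 0.0, 2.0], [1.0, 0.0],   # sample 1: positive
                  [0.0, 4.0, 2.0], [0.0, 1.0],   # sample 2: negative
                  rng=random.Random(0))
```

The mixed label is a soft distribution, so the training loss has to accept soft targets (for example, cross-entropy against `mixed_y`).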
Manifold Mixup generalizes the Mixup operation above to hidden features. Because features carry higher-order semantic information, interpolating in feature space may yield more meaningful samples. In models such as BERT [9] and RoBERTa [6], a layer index k is selected at random and Mixup interpolation is applied to the feature representation of that layer. Ordinary Mixup interpolates at the output of the Embedding layer, whereas Manifold Mixup applies the same interpolation at a randomly chosen Transformer layer inside the language model.
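The difference from ordinary Mixup is only where the interpolation happens, which the toy sketch below illustrates. The "layers" are stand-in functions on lists rather than real Transformer layers, and the function name is hypothetical; k = 0 corresponds to mixing at the input (Embedding) level.

```python
import random

def manifold_mixup_forward(layers, x1, x2, lam, k):
    """Run both inputs through layers[:k], interpolate the hidden states,
    then run the mixed state through the remaining layers[k:]."""
    h1, h2 = x1, x2
    for layer in layers[:k]:
        h1, h2 = layer(h1), layer(h2)
    h = [lam * a + (1 - lam) * b for a, b in zip(h1, h2)]
    for layer in layers[k:]:
        h = layer(h)
    return h

# Toy stand-ins for network layers.
double = lambda v: [2 * x for x in v]
relu = lambda v: [max(0, x) for x in v]
layers = [double, relu]

k = random.randrange(len(layers) + 1)  # the mixing layer is drawn at random per batch
out = manifold_mixup_forward(layers, [1, -1], [-1, 1], lam=0.5, k=k)
```

With these toy layers, mixing at k = 0 gives `[0, 0]` while mixing after both layers (k = 2) gives `[1, 1]`, which shows that the choice of layer changes the training signal.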
Adversarial examples add tiny perturbations to the input samples that significantly increase the model's loss. Adversarial training trains a model that handles both original samples and adversarial samples correctly. The basic principle is to construct adversarial samples by adding perturbations and feed them to the model during training, which improves the robustness of the model when it encounters adversarial samples and also improves its performance and generalization ability. Adversarial examples need to have two characteristics:
- The added perturbation is tiny relative to the original input.
- The perturbation can make the model make mistakes.
Adversarial training therefore serves two purposes: improving the robustness of the model against malicious attacks and improving its generalization ability.
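The two characteristics can be seen in a small sketch. The article does not say which adversarial training variant FSL++ uses; the code below assumes an FGM-style perturbation, r = epsilon * g / ||g||, where g is the gradient of the loss with respect to the input Embedding. A toy logistic scorer stands in for the language model so the gradient can be written by hand, and all names are hypothetical.

```python
import math

def logistic_loss(w, x, y):
    """Binary cross-entropy of a linear scorer w.x for a label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_grad_wrt_input(w, x, y):
    """Analytic gradient of logistic_loss with respect to x: (p - y) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def fgm_perturbation(grad, epsilon=0.1):
    """Scale the gradient to L2 norm epsilon (a tiny step that raises the loss)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

w, x, y = [1.0, -2.0], [0.5, 0.1], 1
r = fgm_perturbation(loss_grad_wrt_input(w, x, y), epsilon=0.1)
x_adv = [xi + ri for xi, ri in zip(x, r)]
clean_loss = logistic_loss(w, x, y)
adv_loss = logistic_loss(w, x_adv, y)  # training would minimise clean_loss + adv_loss
```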
R-Drop applies Dropout twice to the same sentence and forces the output probabilities of the different sub-models produced by Dropout to remain consistent. Dropout works well, but it introduces an inconsistency between training and inference. To alleviate this, R-Drop regularizes Dropout: it constrains the output distributions of the two sub-models by adding a KL divergence loss between them, so that the two distributions produced for the same sample within a batch are as close as possible. Specifically, for each training sample, R-Drop minimizes the KL divergence between the output probabilities of the sub-models generated by different Dropout masks. As a training idea, R-Drop can be used in most supervised or semi-supervised training and is highly versatile.
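The loss can be sketched for a single training sample as follows. The two probability vectors stand for the outputs of two forward passes with different Dropout masks; summing the two negative log-likelihood terms and using a symmetric KL term follows our reading of the R-Drop paper [15], while the weight `alpha=1.0`, the smoothing constant and the function names are assumptions.

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def r_drop_loss(p1, p2, target, alpha=1.0):
    """Cross-entropy of both passes plus a symmetric KL consistency term.

    p1, p2: class probabilities from two Dropout passes over the same input
    (the target class must have non-zero probability in both).
    """
    nll = -math.log(p1[target]) - math.log(p2[target])
    kl = 0.5 * (kl_div(p1, p2) + kl_div(p2, p1))
    return nll + alpha * kl

consistent = r_drop_loss([0.7, 0.3], [0.7, 0.3], target=0)
inconsistent = r_drop_loss([0.7, 0.3], [0.5, 0.5], target=0)
```

When the two passes agree, the KL term vanishes and only the cross-entropy remains; any disagreement between the passes adds a penalty.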
To summarize the three data enhancement strategies we use: Mixup linearly interpolates two samples at the output of the language model's Embedding layer (or, for Manifold Mixup, at the output of one of the Transformer layers inside the language model); adversarial training adds tiny perturbations to the samples; and contrastive learning applies Dropout twice to the same sentence to form a positive pair and then uses KL divergence to keep the two sub-models consistent. All three strategies improve generalization by operating at the Embedding level. The models obtained through different strategies have different preferences, which sets up the conditions for the ensemble learning in the next step.
2.4 Ensemble learning & self-training
Ensemble learning combines multiple weakly supervised models to obtain a better and more comprehensive strongly supervised model. The underlying idea is that even if one weak classifier makes a wrong prediction, the other weak classifiers can correct it. If the models being combined differ significantly from one another, ensemble learning usually gives a better result.
Self-training uses a small amount of labeled data and a large amount of unlabeled data to train the model jointly. First, the trained classifier predicts labels for all unlabeled data, and the predictions with higher confidence are selected as pseudo-labeled data. The pseudo-labeled data is then combined with the manually labeled training data to retrain the classifier.
Ensemble learning + self-training is a scheme that makes use of both multiple models and unlabeled data. The general steps of ensemble learning are: train several different weakly supervised models, use each model to predict the label probability distribution of the unlabeled data, and compute the weighted sum of these distributions to obtain the pseudo-label probability distribution of the unlabeled data. Self-training here refers to training one model that combines the others. The general steps are: train multiple Teacher models, let the Student model learn the Soft Prediction of the high-confidence samples in the pseudo-label probability distribution, and use the Student model as the final strong learner.
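The steps above can be sketched as follows. The Teacher predictions are hypothetical numbers, the equal default weights and the 0.8 confidence threshold are assumptions, and the Student training itself is omitted: the sketch only produces the soft targets the Student would learn from.

```python
def ensemble_distribution(model_probs, weights=None):
    """Weighted sum of the class distributions predicted by several models
    for one sample (weights default to uniform and are normalised)."""
    n = len(model_probs)
    weights = weights or [1.0] * n
    total = sum(weights)
    num_classes = len(model_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / total
        for c in range(num_classes)
    ]

def soft_pseudo_labels(unlabeled_preds, threshold=0.8):
    """Keep the ensembled soft distribution of samples whose top class
    probability reaches the threshold; these become Student targets."""
    kept = {}
    for sample_id, model_probs in unlabeled_preds.items():
        dist = ensemble_distribution(model_probs)
        if max(dist) >= threshold:
            kept[sample_id] = dist
    return kept

# Hypothetical predictions of two Teacher models on two unlabeled samples.
teacher_preds = {
    "a": [[0.9, 0.1], [0.8, 0.2]],
    "b": [[0.6, 0.4], [0.4, 0.6]],
}
targets = soft_pseudo_labels(teacher_preds)
```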
In this FewCLUE practice, we first constructed multi-template Prompt Learning in the Fine-Tune stage, and adopted adversarial training, contrastive learning, Mixup and other enhancement strategies for labeled data. Since these data enhancement strategies use different enhancement principles, it can be considered that the differences between these models are significant, and they will have better results after ensemble learning.
After training with the data augmentation strategies, we have multiple weakly supervised models, which we use to predict on the unlabeled data and obtain pseudo-label distributions. We then integrate the pseudo-label distributions predicted by the different data augmentation models into a single overall pseudo-label distribution for the unlabeled data. When screening the pseudo-labeled data, we do not necessarily choose the samples with the highest confidence: if every data enhancement model gives a sample very high confidence, the sample is probably easy to learn and does not necessarily add much value.
We integrate the confidence given by the multiple data enhancement models and try to select samples that have high confidence but are not easy to learn (for example, samples on which the models' predictions are not all consistent). The multi-template Prompt Learning is then reconstructed from the combined set of labeled and pseudo-labeled data, the data augmentation strategies are applied again, and the best strategy is selected. Our experiments currently perform only one iteration. Multiple iterations are possible, but the gains shrink and stop being significant as the number of iterations increases.
3 Experimental results
3.1 Dataset introduction
The FewCLUE list provides 9 tasks: 4 text classification tasks, 2 inter-sentence relation tasks and 3 reading comprehension tasks. The text classification tasks are e-commerce review sentiment analysis, scientific literature classification, news classification and App description topic classification. They cover short-text binary classification, short-text multi-class classification and long-text multi-class classification. Some of these tasks have more than 100 categories and suffer from category imbalance. The inter-sentence relation tasks are natural language inference and short-text matching. The reading comprehension tasks are idiom cloze (choosing the idiom that fills the blank), keyword discrimination for abstracts, and pronoun disambiguation. Each task generally provides 160 pieces of labeled data and about 20,000 pieces of unlabeled data. Because the long-text classification task has many categories and is especially difficult, more labeled data is provided for it. The detailed task data is shown in Table 4:
3.2 Experimental comparison
Table 5 compares the experimental results for different models and parameter counts. In the RoBERTa Base experiments, the PET/EFL models outperform traditional direct Fine-Tune by 2-28PP. To explore the effect of a large model in the small-sample scenario, we then ran PET/EFL on RoBERTa Large; compared with RoBERTa Base, the large model improves results by 0.5-13PP. To make better use of domain knowledge, we further experimented with RoBERTa Large Clue, a pre-trained model enhanced on the CLUE dataset; incorporating domain knowledge improves results by a further 0.1-9PP. Based on this, all subsequent experiments use RoBERTa Large Clue.
Table 6 shows the experimental results of data augmentation and ensemble learning on the PET/EFL models. Even on the large model, the data augmentation strategies bring an improvement of 0.8-9PP, and ensemble learning & self-training improve performance by a further 0.4-4PP.
In the ensemble learning + self-training step, we tried several screening strategies:
- Select the samples with the highest confidence. The improvement from this strategy is within 1PP. Many of the pseudo-labeled samples with the highest confidence are samples on which multiple models agree with high confidence. Such samples are relatively easy to learn, so the benefit of adding them is limited.
- Select high-confidence but controversial samples (at least one model's prediction differs from the others, but the overall confidence of the models exceeds threshold 1). This strategy avoids samples that are particularly easy to learn, and the threshold keeps too much dirty data from being introduced. It brings an improvement of 0-3PP.
- Combine the two strategies above. If the models' predictions for a sample are consistent, we select the sample when its confidence is below threshold 2; if at least one model disagrees with the others, we select the sample when its confidence is above threshold 3. In this way, selecting samples with higher confidence keeps the pseudo-labels credible, and selecting more controversial samples ensures that the chosen pseudo-labeled samples are harder to learn. This brings an improvement of 0.4-4PP.
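The combined (third) strategy can be sketched as below. The article does not give the values of threshold 2 and threshold 3, nor how the overall confidence is computed, so the defaults `agree_max_conf=0.95` and `disagree_min_conf=0.6`, the use of the mean probability as the overall confidence, and the function name are all assumptions made for illustration.

```python
def select_pseudo_labels(unlabeled_preds, agree_max_conf=0.95, disagree_min_conf=0.6):
    """Pick pseudo-labeled samples that are credible but not trivially easy.

    unlabeled_preds: {sample_id: [class distribution from each model]}
    Returns {sample_id: pseudo_label}.
    """
    selected = {}
    for sample_id, model_probs in unlabeled_preds.items():
        n = len(model_probs)
        num_classes = len(model_probs[0])
        # Each model's own predicted class, used to detect disagreement.
        votes = {max(range(num_classes), key=p.__getitem__) for p in model_probs}
        mean = [sum(p[c] for p in model_probs) / n for c in range(num_classes)]
        conf = max(mean)
        label = mean.index(conf)
        if len(votes) == 1:
            keep = conf < agree_max_conf      # all agree: drop the too-easy samples
        else:
            keep = conf > disagree_min_conf   # controversial: require enough confidence
        if keep:
            selected[sample_id] = label
    return selected

preds = {
    "easy":      [[0.99, 0.01], [0.98, 0.02]],  # unanimous and very confident
    "medium":    [[0.80, 0.20], [0.70, 0.30]],  # unanimous, moderately confident
    "contested": [[0.90, 0.10], [0.45, 0.55]],  # models disagree, still confident overall
    "noisy":     [[0.55, 0.45], [0.40, 0.60]],  # models disagree, low confidence
}
chosen = select_pseudo_labels(preds)
```

In this toy run only "medium" and "contested" are kept: "easy" is dropped as too easy to learn, and "noisy" is dropped as likely dirty data.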
4 Application of small sample learning strategy in Meituan scene
Meituan's businesses contain rich NLP scenarios, and some of their tasks can be cast as text classification tasks or inter-sentence relation tasks. The small sample learning strategies described above have been applied to various Meituan Dianping scenarios, where they make it possible to train better models when data resources are scarce. In addition, these strategies have been widely used in the NLP algorithm capabilities of Meituan's internal natural language processing (NLP) platform and have brought significant benefits in many business scenarios. Engineers within Meituan can try out the NLP Center's capabilities through this platform.
Text classification task
Classification of medical aesthetics topics: Notes on Meituan and Dianping are divided into 8 categories by subject: curiosity, shop exploration, evaluation, real cases, treatment process, pitfall avoidance, effect comparison, and popular science. When a user clicks on a topic, the corresponding notes are returned; the feature is live on the encyclopedia page and the plan page of the medical beauty channel in the Meituan and Dianping Apps. With 2,989 pieces of training data, small sample learning increases accuracy by 1.8PP to 89.24%.
Travel guide identification: Travel guides are mined from UGC and notes to supply travel guide content. The model is used in the guide module of the refined scenic-spot search to recall notes that describe travel guides. With 384 pieces of training data, small sample learning improves accuracy by 2PP to 87%.
Xuecheng text classification: Xuecheng (Meituan's internal knowledge base) holds a large number of user texts, which have been grouped into 17 categories. The existing model was trained on 700 pieces of data; small sample learning improves its accuracy by 2.5PP to 84%.
Project screening: The current review list pages of businesses such as LE Life Services/Beauty make it inconvenient for users to quickly find decision-making information, so more structured classification labels are needed. Small sample learning is used in these two businesses; with 300-500 pieces of data, accuracy reaches 95%+ (gains of 1.5-4PP across several datasets).
Inter-sentence relation task
Medical beauty effect marking: Notes on Meituan and Dianping are recalled according to their effect, such as moisturizing, whitening, face-lifting and wrinkle removal. The feature is live on the medical beauty channel page, and there are 110 effect types to mark. With only 2,909 pieces of training data, small sample learning achieves an accuracy of 91.88% (an increase of 2.8PP).
Medical beauty brand marking: Upstream brand companies want to promote and market their products, and content marketing is one of the current mainstream and effective marketing methods. Brand marking recalls, for each brand (such as "Yifuquan" and "Shuweike"), the notes that describe that brand in detail. There are 103 brands in total, and the feature has been launched in the Medical Beauty Brand Pavilion. With only 1,676 pieces of training data, small sample learning reaches an accuracy of 88.59% (an increase of 2.9PP).
5 Summary
In this list submission, we built a RoBERTa-based semantic understanding model, and improved the effect of the model by enhancing pre-training, PET/EFL model, data augmentation, and ensemble learning & self-training. The model can complete text classification, inter-sentence relation inference tasks and several reading comprehension tasks.
By participating in this evaluation, we gained a deeper understanding of algorithms and research on natural language understanding in small-sample scenarios, and we thoroughly tested how well cutting-edge algorithms work in practice on Chinese, which lays the groundwork for further algorithm research and deployment. In addition, the task scenarios in this dataset are very similar to the business scenarios of the Meituan Search and NLP Department. Many strategies of this model have also been applied directly to actual business, directly empowering it.
Authors of this article
Luo Ying, Xu Jun, Xie Rui, and Wu Wei are all from Meituan Search and NLP Department/NLP Center.
References
- [1] FewCLUE Github project address
- [2] FewCLUE list address
- [3] CLUE Github project address
- [4] https://github.com/NetEase-FuXi/EET
- [5] https://github.com/IDEA-CCNL/Fengshenbang-LM
- [6] Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
- [7] Gururangan, Suchin, et al. "Don't stop pretraining: adapt language models to domains and tasks." arXiv preprint arXiv:2004.10964 (2020).
- [8] Xu, Liang, Xuanwei Zhang, and Qianqian Dong. "CLUECorpus2020: A large-scale Chinese corpus for pre-training language model." arXiv preprint arXiv:2003.01355 (2020).
- [9] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
- [10] Schick, Timo, and Hinrich Schütze. "It's not just size that matters: Small language models are also few-shot learners." arXiv preprint arXiv:2009.07118 (2020).
- [11] Wang, Sinong, et al. "Entailment as few-shot learner." arXiv preprint arXiv:2104.14690 (2021).
- [12] Zhang, Hongyi, et al. "mixup: Beyond empirical risk minimization." arXiv preprint arXiv:1710.09412 (2017).
- [13] Verma, Vikas, et al. "Manifold mixup: Better representations by interpolating hidden states." International Conference on Machine Learning. PMLR, 2019.
- [14] Verma, Vikas, et al. "Manifold mixup: Better representations by interpolating hidden states." International Conference on Machine Learning. PMLR, 2019.
- [15] Wu, Lijun, et al. "R-drop: regularized dropout for neural networks." Advances in Neural Information Processing Systems 34 (2021).
- [16] Small sample learning and its application in the Meituan scene
This article is produced by Meituan technical team, and the copyright belongs to Meituan. Welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, please indicate "The content is reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial activities, please send an email to tech@meituan.com to apply for authorization.