
Editor's note: EMNLP (Conference on Empirical Methods in Natural Language Processing) is a top international academic conference in computational linguistics and natural language processing. This year's EMNLP was held online from November 7 to 11. Microsoft Research Asia had a number of papers accepted at the conference, and today we have selected six of them for a brief introduction. Interested readers are welcome to read the original papers to learn about cutting-edge progress in natural language processing.

CAST: Enhancing code summarization with hierarchical splitting and reconstruction of abstract syntax trees


Paper link:
https://arxiv.org/abs/2108.12987

Code link:
https://github.com/DeepSoftwareAnalytics/CAST

The task of code summarization is to understand a real code snippet and automatically generate natural-language sentences that describe what the code does. Because a summary concisely describes the function of the code and is highly readable, a good summary helps developers understand, reuse, and maintain code more easily, greatly improving productivity. In practice, however, summaries are often missing, wrong, or out of date, and writing code summaries by hand requires professional background knowledge and is time-consuming and labor-intensive. Automatically generating summaries for code is therefore particularly important.

In recent years, many researchers have applied a variety of techniques to model abstract syntax trees (ASTs), which are rich in syntactic and structural information, in order to generate better code summaries. However, because programs are complex, ASTs are generally large and deep and therefore hard to model, and existing methods suffer from limitations such as breaking the tree structure, high training cost, and loss of information. To address this, researchers from Microsoft Research Asia proposed CAST, which hierarchically splits and reassembles ASTs. The core idea is to split an AST hierarchically into subtrees of appropriate granularity so that each subtree carries relatively complete semantics; after the subtrees are modeled independently, their representations are re-aggregated according to their relative positions before splitting. This makes the AST structure easier to model, allows the code semantics to be reconstructed more completely, and makes the generated summaries more comprehensive.
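To make the splitting idea concrete, here is a minimal sketch in Python that breaks an AST into block-level subtrees using the standard ast module. The granularity rule used here (split at function and control-flow blocks) is a simplification for illustration only; CAST defines its own splitting algorithm.

# Hierarchically collect block-level subtrees, so each subtree covers a
# relatively self-contained piece of program semantics.
import ast

BLOCK_NODES = (ast.FunctionDef, ast.For, ast.While, ast.If, ast.Try, ast.With)

def split_into_subtrees(tree):
    """Return block-level AST subtrees of a parsed program."""
    return [node for node in ast.walk(tree) if isinstance(node, BLOCK_NODES)]

code = """
def demo(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""

for subtree in split_into_subtrees(ast.parse(code)):
    print(type(subtree).__name__, "starting at line", subtree.lineno)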

(Figure 1: Schematic diagram of the code and its abstract syntax tree (AST), sub-trees and structure trees)

The CAST model adopts the commonly used sequence-to-sequence architecture and mainly contains three modules: an AST encoder, a code encoder, and a summary decoder. The AST encoder models the semantic and structural information of the code; the code encoder models lexical and semantic information at the level of variable names; the decoder combines the representations produced by the two encoders with a copy mechanism to generate the code summary.

(Figure 2: Schematic diagram of the architecture of the CAST model)

The researchers ran automatic evaluations on two public datasets with four metrics, together with human evaluation experiments. Extensive results demonstrate the effectiveness of the method.

(Table 1: Experimental results on two datasets under four automatic metrics)

(Table 2: Human evaluation results; the number outside the parentheses is the mean score (out of 4), and the number inside is the variance)

Discovering representation sprachbunds and their effect on multilingual pre-training


Paper link:
https://arxiv.org/abs/2109.00271

As an important branch of modern natural language processing (NLP), multilingual NLP aims to free existing NLP techniques from the limitation of language, so that a single model can handle tasks in hundreds of languages at the same time. A core challenge for today's multilingual pre-trained models is that many existing datasets have training data only in English, while many low-resource languages with fewer speakers have only test data. If a multilingual model is fine-tuned on English data and then tested in other languages, there is a large gap between those results and the results in English. Facing this challenge, researchers at Microsoft Research Asia drew inspiration from the linguistic notion of a sprachbund and designed a new paradigm for multilingual NLP, as shown in the figure below.

(Figure 3: The process of representation sprachbund discovery and pre-training)

Sprachbund is a German linguistic term describing the phenomenon in which languages without genetic kinship, through long-term coexistence in the same area, come to share common areal features in their linguistic structure. The researchers believe that the large differences among the languages seen during pre-training are responsible for the poor performance of cross-lingual models. They therefore proposed the concept of a representation sprachbund: a set of languages with similar embedding representations. The researchers extracted language embeddings with a cross-lingual pre-trained model, clustered the languages into multiple representation sprachbunds, and pre-trained a separate model for each representation sprachbund of languages with similar embeddings. Experiments on multiple cross-lingual benchmarks such as XGLUE and XTREME show that this method achieves significant improvements over the baseline model.
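To make the pipeline concrete, here is a minimal sketch of the clustering step, assuming language embedding vectors have already been extracted from a cross-lingual pre-trained model. The embeddings below are random placeholders, and the exact extraction and clustering procedure in the paper may differ.

import numpy as np
from sklearn.cluster import KMeans

languages = ["en", "de", "fr", "hi", "ur", "zh", "ja", "ko"]
# Placeholder embeddings; in practice these come from a multilingual encoder.
rng = np.random.default_rng(0)
lang_embeddings = rng.normal(size=(len(languages), 768))

# Cluster languages with similar representations into "representation sprachbunds".
n_groups = 2
labels = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(lang_embeddings)

for g in range(n_groups):
    members = [lang for lang, label in zip(languages, labels) if label == g]
    print(f"representation sprachbund {g}: {members}")
# Each group would then get its own continued pre-training run on data
# from the languages in that group.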

(Table 3: Results on cross-lingual benchmark tasks)

Another important contribution of this paper is to explore the relationship between the distribution of language embeddings obtained from the pre-trained model and linguistic theory, including language families, sprachbunds, and grammatical features.

(Figure 4: Visualization of language embedding representations)

Through this visual analysis, the paper reveals the rich linguistic properties of language embeddings. The researchers hope to further explore the connection between deep-learning-based multilingual NLP and classical linguistic theory in future work.

Efficient-FedRec: An efficient privacy-preserving framework for news recommendation


Paper link:
https://arxiv.org/abs/2109.05446

Privacy protection is becoming increasingly important for AI systems. Since recommendation systems (such as news recommendation) need large amounts of user behavior data for model training and inference, meeting users' data-privacy requirements is urgent. Federated learning is a model-training framework that enables privacy protection: it unites a large number of users for collaborative model training while keeping each user's data on the local device.

FedRec [1] is a privacy-preserving news recommendation method based on federated learning, shown in Figure 5. In this method, each client computes gradients of the news recommendation model from the user's locally stored behavior data and uploads them to the server; the server aggregates the gradients, updates the global news recommendation model, and redistributes it to the clients. Since the models used by recommendation systems keep growing, and model training in this method is carried out mainly on the client (such as a mobile phone), it places a heavy computational burden on the client. In addition, the client and the server have to exchange the parameters of the entire model over many rounds, which makes the communication overhead very large.

(Figure 5: The FedRec [1] framework)

To solve this problem, researchers from Microsoft Research Asia proposed Efficient-FedRec, an efficient privacy-preserving framework for news recommendation. Instead of the common federated-learning practice of training the entire model on the client, they propose dividing and balancing the computation so that the client and the server participate in model training at the same time, as shown in Figure 6.

(Figure 6: Efficient-FedRec framework)

Specifically, the researchers split the news recommendation model into a user model and a news model. The user model captures user interests from user behavior; it is usually lightweight, but it involves privacy-sensitive behavior data. The news model captures the semantic content of news from the news text; it is usually large, but the news text it processes is not privacy-sensitive. The researchers therefore train the lightweight, privacy-sensitive user model on the client and the heavyweight, privacy-insensitive news model on the server, which significantly reduces both the computation overhead on the client and the communication overhead between the client and the server.

Each round of Efficient-FedRec training consists of the following four steps (a minimal sketch of one round follows the list):

(1) The server randomly selects a subset of users and sends them the global user model together with the representations of the news they have interacted with.

(2) Each client trains on its own local private data and computes gradients of the user model and of the news representations.

(3) The server aggregates the gradients of the user model and of the news representations.

(4) The server updates the user model directly with the aggregated user-model gradients, and uses the news-representation gradients to compute gradients of the news model and update it. The updated news model is then used to compute new news representations.
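The following is a minimal sketch of one such round under simplified assumptions: toy linear models, dot-product click scoring, two clients, and no Secure Aggregation. The names user_model and news_model and all sizes are placeholders for illustration, not the paper's architecture.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins; the real framework uses full news-recommendation models.
user_model = nn.Linear(8, 8)        # maps clicked-news representations to a user vector
news_model = nn.Linear(16, 8)       # maps news features to news representations (server only)
news_features = torch.randn(5, 16)  # raw features of 5 news articles (not privacy-sensitive)

user_opt = torch.optim.SGD(user_model.parameters(), lr=0.1)
news_opt = torch.optim.SGD(news_model.parameters(), lr=0.1)

def client_step(global_user_state, news_reprs, clicked, label):
    """Step 2: a client trains locally and returns only gradients."""
    local_user = nn.Linear(8, 8)
    local_user.load_state_dict(global_user_state)
    reprs = news_reprs.clone().requires_grad_(True)
    user_vec = local_user(reprs[clicked].mean(dim=0))   # user interest from clicked news
    score = (user_vec * reprs[label]).sum()             # dot-product click score
    loss = -torch.log(torch.sigmoid(score))
    loss.backward()
    user_grads = [p.grad.clone() for p in local_user.parameters()]
    return user_grads, reprs.grad.clone()

# Step 1: the server sends the user model and the current news representations.
with torch.no_grad():
    sent_reprs = news_model(news_features)

# Steps 2-3: clients compute gradients; the server averages them
# (Secure Aggregation is omitted in this sketch).
grads = [client_step(user_model.state_dict(), sent_reprs, clicked=[0, 1], label=2),
         client_step(user_model.state_dict(), sent_reprs, clicked=[2, 3], label=4)]
avg_user_grads = [torch.stack(g).mean(dim=0) for g in zip(*(ug for ug, _ in grads))]
avg_repr_grad = torch.stack([rg for _, rg in grads]).mean(dim=0)

# Step 4: the server updates the user model directly, and backpropagates the
# news-representation gradient through the news model to update it.
user_opt.zero_grad()
for p, g in zip(user_model.parameters(), avg_user_grads):
    p.grad = g
user_opt.step()

news_opt.zero_grad()
news_model(news_features).backward(avg_repr_grad)
news_opt.step()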

In addition, in order not to expose individual interaction histories, the researchers use Secure Aggregation to compute the union of the news that different users have interacted with; each user then requests the representations of this union rather than only of their own clicked news, which protects their interaction history. Secure Aggregation is also used to aggregate the gradients from different users, protecting the private information contained in each user's local gradients.

The researchers ran experiments on the MIND and Adressa datasets. The results in Table 4 show that Efficient-FedRec achieves performance similar to news recommendation methods that rely on centralized data storage.

(Table 4: Performance comparison of different methods on MIND and Adressa data sets)

Figure 7 further compares the computation and communication costs of Efficient-FedRec with those of other privacy-preserving news recommendation methods. The results show that Efficient-FedRec effectively reduces the computation and communication burden on the client.

(Figure 7: Comparison of the computation and communication overhead of different privacy-preserving methods on the MIND dataset)

Using a weak decoder to assist pre-training for retrieval tasks


Paper link:
https://arxiv.org/abs/2102.09206

In recent years, dense retrieval has received growing attention in scenarios such as search, recommendation, and question answering. For first-stage retrieval in these scenarios, the dense retrieval model usually adopts a dual-tower structure: encoder models first map the user side (query, browsing history, or question) and the corpus side (documents or passages) into a learned representation space, and then a simple similarity function (such as dot product or cosine similarity) is used for efficient retrieval. However, previous studies have shown that commonly used pre-trained language models are not very effective at encoding text in dense retrieval scenarios, especially when the text sequences are mostly longer than 128 tokens.
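As a side note, the dual-tower scoring described above can be sketched in a few lines, assuming the query-side and document-side vectors have already been produced by the two encoders (random placeholders are used here).

import numpy as np

rng = np.random.default_rng(0)
query_vec = rng.normal(size=128)         # output of the query-side encoder
doc_vecs = rng.normal(size=(1000, 128))  # outputs of the document-side encoder

# With a dot-product similarity, first-stage retrieval reduces to one
# matrix-vector product followed by a top-k selection.
scores = doc_vecs @ query_vec
top_k = np.argsort(-scores)[:10]
print("top-10 document ids:", top_k)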

To address this, the paper proposes SEED-Encoder, a pre-trained language model that produces high-quality text representations for large-scale dense retrieval. SEED-Encoder adopts an autoencoder structure: an encoder generates a text representation, and a decoder reconstructs the original text from that representation, pushing the encoder to produce a more informative representation.

However, theoretical analysis and experiments show that, because a powerful decoder can itself learn the patterns of the language, a lower reconstruction loss does not necessarily mean a better representation. Specifically, by decomposing the expected reconstruction loss of the decoder into the KL divergence between the word distribution predicted by the decoder and the true distribution, plus the conditional entropy of the current word given the previously visible text, the researchers at Microsoft Research Asia found that when the decoder's fitting ability is strong enough, or when the current word depends strongly on the previously visible text, the decoder's reconstruction loss can be very small even if the text representation produced by the encoder carries no information.
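One way to write this decomposition (the notation below is ours, not copied from the paper): let h be the text representation produced by the encoder, x_{<t} the text already visible to the decoder, and P the true data distribution. Then, for each position t,

\mathbb{E}_{x_t \sim P(\cdot \mid x_{<t})}\!\left[-\log P_{\mathrm{dec}}(x_t \mid x_{<t}, h)\right]
  = \mathrm{KL}\!\left(P(\cdot \mid x_{<t}) \,\middle\|\, P_{\mathrm{dec}}(\cdot \mid x_{<t}, h)\right)
  + H\!\left(P(\cdot \mid x_{<t})\right)

If the decoder is expressive enough to drive the KL term close to zero regardless of h, or if the entropy term is already small because x_t is nearly determined by x_{<t}, the reconstruction loss can be low even when h is uninformative, which is exactly the failure mode described above.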

Therefore, the paper proposes limiting the decoder's parameters and the visible range of its attention, constructing a weak decoder to solve this problem. During pre-training, in addition to the MLM loss, the training objective includes the reconstruction loss of the weak decoder conditioned on the text representation produced by the encoder. Since the weak decoder has few parameters, it adds little overhead to pre-training; and because only the encoder is kept for downstream tasks, the fine-tuning overhead is the same as for other pre-trained language models such as BERT.
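The restricted visible range of attention can be illustrated with a small sketch that builds a banded causal attention mask for the decoder; the window size and the True-means-masked convention follow PyTorch's Transformer modules and are illustrative assumptions rather than the paper's exact configuration.

import torch

def weak_decoder_mask(seq_len, window=2):
    """Causal mask that also hides everything more than `window` tokens back,
    forcing the decoder to rely on the encoder's representation rather than
    on a long visible context."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    visible = (j <= i) & (j >= i - window)
    return ~visible  # True marks positions the decoder may NOT attend to

print(weak_decoder_mask(5, window=2))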

(Figure 8: Model framework)

Extensive experiments show that, compared with other pre-trained language models, SEED-Encoder brings significant improvements; it also reduces the number of training rounds needed for fine-tuning on downstream tasks, demonstrating the effectiveness of the proposed method.

(Table 5: Experimental results)

(Figure 9: Comparison of the convergence process of BERT and SEED-Encoder)

mT6: A multilingual pre-trained text-to-text Transformer with translation pairs


Paper link:
https://arxiv.org/abs/2104.08692

The multilingual text-to-text Transformer (mT5) performs well across the tasks of multilingual natural language understanding benchmarks. It inherits the design of T5, unifying natural language processing tasks as text-to-text problems, and demonstrates strong cross-lingual transfer that improves results on multilingual NLP tasks. However, how to use translated sentence pairs to improve mT5 still needs further study.

In this paper, researchers from Microsoft Research Asia proposed mT6, a multilingual text-to-text Transformer pre-trained with translation pairs. They proposed three text-to-text cross-lingual pre-training tasks: machine translation (MT), translation pair span corruption (TPSC), and translation span corruption (TSC). Unlike the traditional cloze task, in TPSC and TSC the model learns to fill in blanks from a bilingual context, which encourages it to learn general cross-lingual text representations.

(Figure 10: The three cross-lingual pre-training tasks proposed in the paper)

Compared with mT5, mT6 also uses a different training objective. The researchers proposed partially non-autoregressive decoding (shown in Figure 11), which divides the original target text into several groups: a word predicted during decoding depends on the input text and only on the target words within the same group, rather than on all previously generated target text.
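A minimal sketch of the grouping idea follows, assuming the target sequence is simply cut into contiguous, equal-sized groups and each token sees only the earlier tokens of its own group; the sentinel tokens and group boundaries in mT6 itself are defined by its span-corruption setup, so this is illustrative only.

def split_into_groups(target_tokens, n_groups):
    """Partition the target into contiguous groups for partially
    non-autoregressive decoding."""
    size = (len(target_tokens) + n_groups - 1) // n_groups
    return [target_tokens[i:i + size] for i in range(0, len(target_tokens), size)]

target = ["<X>", "the", "cat", "<Y>", "sat", "down", "<Z>", "quietly"]
groups = split_into_groups(target, n_groups=3)

# Tokens in one group are predicted without seeing any other group,
# so the groups can be decoded independently of each other.
for group in groups:
    visible_contexts = [group[:i] for i in range(len(group))]
    print("group:", group, "| per-token visible target context:", visible_contexts)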

(Figure 11: Partially non-autoregressive decoding)

mT6 surpasses mT5 on the six tasks of the XTREME multilingual understanding benchmark, and the combination of SC+TSC with partially non-autoregressive decoding achieves the best results, as shown in Table 6.

(Table 6: Performance of mT6 on the XTREME multilingual understanding benchmark)

As shown in Table 7, on the Gigaword multilingual text summarization task, mT6 surpasses mT5 in three languages and shows better results in low-resource scenarios.

(Table 7: The performance of mT6 on the Gigaword multilingual text summarization task)

Zero-resource cross-lingual machine translation with a multilingual pre-trained encoder

Paper link:
https://arxiv.org/abs/2104.08757

Multilingual pre-trained encoders (MPEs) such as XLM-R have shown excellent zero-resource cross-lingual transfer ability on many natural language understanding tasks. However, how to use an MPE to achieve zero-resource cross-lingual transfer on machine translation still needs further study. In this paper, researchers from Microsoft Research Asia explore this question and propose SixT, which improves the zero-resource cross-lingual transfer ability of machine translation models built on an MPE. Using only a multilingual pre-trained encoder and a parallel corpus for a single language pair, the trained machine translation model can translate from 100 source languages. Extensive experiments show that, with an appropriate fine-tuning strategy, a translation model trained with XLM-R has better zero-resource cross-lingual transfer ability than one based on mBART.

(Figure 12: Zero-resource cross-lingual transfer on machine translation tasks)

In the figure, the NMT model needs only a pre-trained MPE and an English-German parallel corpus. After training, it can translate from 100 source languages into English without using any monolingual or bilingual data in languages such as Finnish, Hindi, or Chinese.

By comparing different schemes for training an NMT model with an MPE, the researchers propose using the MPE to initialize the encoder and the decoder embedding layer of the NMT model and keeping them fixed during training, while training the rest of the decoder from scratch. To further improve cross-lingual transfer, they adopt a two-stage training strategy and introduce a larger-capacity decoder and a position-disentangled encoder. The contextual representations produced by the trained encoder contain less language- and position-specific information, which yields stronger cross-lingual transfer ability.
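A minimal sketch of the initialize-and-freeze recipe described above, assembled from stock PyTorch modules; the layer counts, dimensions, and module choices are placeholders rather than the paper's actual architecture, and the two-stage schedule and position-disentangled encoder are omitted.

import torch.nn as nn

d_model, vocab = 512, 32000

# Stand-ins for a multilingual pre-trained encoder and its token embedding.
mpe_embed = nn.Embedding(vocab, d_model)
mpe_encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=8), num_layers=6)

# The decoder body is trained from scratch, but its embedding layer is
# initialized from the pre-trained embedding and kept fixed.
decoder_embed = nn.Embedding(vocab, d_model)
decoder_embed.load_state_dict(mpe_embed.state_dict())
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=6)

# Freeze the pre-trained encoder and both embedding layers; only the decoder body learns.
for module in (mpe_encoder, mpe_embed, decoder_embed):
    for p in module.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
print(f"trainable decoder parameters: {trainable:,}")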

As shown in Table 8, among the methods built on the multilingual pre-trained model XLM-R, SixT achieves the best zero-resource cross-lingual transfer, and the gains are large even for source languages distant from the source language of the training data.

(Table 8: Comparison of the zero-resource cross-lingual transfer of different methods on machine translation tasks (BLEU scores))

SixT shows stronger cross-lingual transfer than fine-tuning mBART. CRISS and m2m-100 are state-of-the-art unsupervised and supervised many-to-many machine translation models, respectively. With less training data, SixT achieves better or comparable performance to CRISS and m2m-100 on the many-languages-to-English translation test sets, as shown in Table 9.

(Table 9: Comparison between the proposed method and other multilingual machine translation models)

As shown in Table 10, SixT generally transfers better to languages similar to the source language of its training data. The researchers also offer two suggestions for using SixT for zero-resource cross-lingual transfer to low-resource machine translation: first, larger training datasets transfer better; second, whenever possible, train on data whose source language is similar to the language being transferred to.

(Table 10: The relationship between the language pair of the training data and cross-lingual transfer ability)

[1] Privacy-Preserving News Recommendation Model Learning, EMNLP 2020 Findings

(Reproduced from the Natural Language Computing Group of Microsoft Research Asia)


Welcome to follow the Microsoft China MSDN subscription account for the latest releases!
