COLING 2022 is an important international conference in the field of computational linguistics and natural language processing, hosted by the International Committee on Computational Linguistics (ICCL).
Youdao AI's research paper on machine translation has been officially accepted and published at COLING 2022 as a long paper.
Title: Semantically Consistent Data Augmentation for Neural Machine Translation via Conditional Masked Language Model
Authors: Cheng Qiao, Huang Jin, Duan Yitao
For the full text of the paper, see "Read the original text" at the end of the article
Research Background
Neural machine translation (NMT) usually requires large bilingual parallel corpora for training and easily overfits when the training set is small. High-quality bilingual parallel corpora are difficult to obtain, and manually annotating them is costly. Data augmentation is an effective technique for expanding the scale of the data and has achieved remarkable results in some fields. In computer vision, for example, training data is often augmented by cropping, flipping, warping, or color transformations.
Although data augmentation has become a basic technique for training neural network models in vision, it has not been applied as successfully in natural language processing.
This paper studies data augmentation via word replacement in neural machine translation (NMT). Word replacement expands the data by substituting words in existing parallel sentence pairs. When applying data augmentation, we observe that if the augmented samples retain correct label information, the training data can be effectively scaled up and model quality improves. We call this property semantic consistency.
In a neural machine translation system, training data exists in the form of sentence pairs, including source-side sentences and target-side sentences. Semantic consistency requires that both the source and target sentences are fluent and grammatically correct in their respective languages, and that the target sentences should be high-quality translations of the source sentences.
Existing word replacement methods usually swap, delete, or randomly replace words in the source and target sentences. Due to the discrete nature of natural language, these transformations cannot maintain semantic consistency: they often weaken the fluency of the bilingual sentences or destroy the correspondence between sentence pairs.
Consider the following example:
This example shows a sentence pair from an English-German parallel corpus together with several sentences obtained by word replacement on the English side. Cases 1 and 2 are both problematic substitutions: the former preserves the meaning of the replaced word but is grammatically incorrect, while the latter is grammatically correct but no longer corresponds to the German translation. Case 3 is a good augmented sample because it is syntactically correct and semantically consistent.
Better augmented data can be generated by exploiting contextual and label information. We introduce the Conditional Masked Language Model (CMLM) for data augmentation in machine translation. A masked language model exploits bidirectional context within a sentence, and CMLM is an enhanced version that additionally exploits label information. We show that CMLM generates better replacement-word distributions by forcing the source and target to remain semantically consistent when performing word replacements.
Furthermore, to enhance diversity, we incorporate Soft Contextual Data Augmentation, which replaces a specific word with a distribution over the vocabulary.
The proposed method is tested on 4 datasets of different scales, and the results consistently show that it is more effective and yields higher translation quality than previous word replacement techniques.
Method Introduction
Our goal is to improve data augmentation for machine translation training so that the semantics of the source and target sentences, and the cross-lingual translation relationship between them, are preserved during augmentation.
To achieve this goal, we introduce a Conditional Masked Language Model (CMLM), which generates a context-sensitive distribution of replacement words, from which we can choose the best replacement for a given word. The CMLM is a variant of the MLM that incorporates label information when predicting masked words.
In a machine translation scenario, CMLM imposes two requirements:
- When predicting masked words, the model conditions on both the source and the target;
- During CMLM training, words are masked on only one side at a time, either the source or the target; the two sides are never masked simultaneously.
In actual training, the source and target sentences are concatenated; 15% of the source words are randomly masked, and the CMLM is trained to predict them. Similarly, 15% of the target words can be randomly masked, and a CMLM trained to predict the masked target words from the concatenated bilingual sentence. This reliance on bilingual information when predicting a masked word on one side is the key to maintaining semantic consistency when CMLM-predicted words are used for data augmentation.
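The masking scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token lists and the `[MASK]`/`[SEP]` symbols are assumptions borrowed from standard BERT-style preprocessing.

```python
import random

MASK, SEP = "[MASK]", "[SEP]"

def mask_one_side(src, tgt, side="src", ratio=0.15, rng=None):
    """Concatenate a bilingual pair and mask ~15% of the words on ONE side;
    the other side stays intact as conditioning context for the CMLM."""
    rng = rng or random.Random(0)
    src, tgt = list(src), list(tgt)
    masked = src if side == "src" else tgt
    n_mask = max(1, round(len(masked) * ratio))
    labels = {}
    for pos in rng.sample(range(len(masked)), n_mask):
        labels[pos] = masked[pos]   # training targets: the original words
        masked[pos] = MASK
    return src + [SEP] + tgt, labels

tokens, labels = mask_one_side(
    ["we", "must", "think", "of", "production"],
    ["wir", "müssen", "an", "die", "produktion", "denken"],
    side="src",
)
```

Because only one side is masked, the model must consult the intact side to recover the missing words, which is exactly what ties its predictions to the bilingual context.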
After training the CMLM in this way, it can be used to expand the bilingual training corpus: mask some words on the source or target side, use the CMLM to predict a distribution over candidate words, and then sample a word from that distribution to replace the word at the corresponding position.
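This hard-replacement procedure can be sketched as below. The `cmlm_predict(src, tgt, pos) -> (words, probs)` interface is a hypothetical stand-in for querying a trained CMLM; everything else follows the steps just described.

```python
import random

def augment_by_sampling(src, tgt, cmlm_predict, ratio=0.15, rng=None):
    """Hard-replacement augmentation: pick ~15% of source positions,
    query the CMLM for a candidate-word distribution at each one
    (conditioned on both sides), and sample the replacement from it.
    `cmlm_predict` is an assumed interface, not the paper's API."""
    rng = rng or random.Random(0)
    new_src = list(src)
    n_mask = max(1, round(len(src) * ratio))
    for pos in rng.sample(range(len(src)), n_mask):
        words, probs = cmlm_predict(new_src, tgt, pos)
        new_src[pos] = rng.choices(words, weights=probs, k=1)[0]
    return new_src, list(tgt)
```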
Since the CMLM combines both source and target information, the words it predicts maintain bilingual semantic consistency well. However, this direct replacement method is time-consuming: reducing the sampling variance requires generating enough candidates. To improve efficiency, we incorporate a soft data augmentation approach.
Soft data augmentation does not sample a specific word. Instead, it computes the expected word vector over the vocabulary under the predicted distribution and uses this soft word vector in place of the real word embedding: for a masked position with predicted distribution P(w | context), the soft word vector is e_soft = Σ_{w∈V} P(w | context) · E(w), where E(w) is the embedding of word w.
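Numerically, the soft vector is just a probability-weighted mixture of the embedding rows. A minimal sketch, with a toy 3-word vocabulary and illustrative embedding values:

```python
import numpy as np

def soft_word_vector(probs, embeddings):
    """Expected embedding under the CMLM's predicted distribution:
    a probability-weighted mixture of every word's embedding,
    rather than the embedding of a single sampled word."""
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0), "distribution must sum to 1"
    return probs @ embeddings          # (V,) @ (V, d) -> (d,)

# toy 3-word vocabulary with 2-d embeddings (illustrative values)
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
p = [0.5, 0.25, 0.25]
soft_word_vector(p, E)                 # -> array([0.75, 0.5])
```

Because the expectation is computed in one matrix product, no candidate sampling is needed, which is where the efficiency gain over hard replacement comes from.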
The data augmentation architecture using CMLM in neural machine translation training is shown in the figure below. Two separate CMLMs augment the source and target sides respectively. We initialize the CMLMs with a pretrained multilingual BERT and fine-tune them as described above. During NMT training, the CMLM parameters are frozen, and with a certain probability the soft word vectors generated by the CMLM replace the real word vectors in training the translation model. We explored the effect of different replacement probabilities on translation quality.
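The per-token replacement step can be sketched as follows. The probability value and tensor shapes are assumptions for illustration; in the actual system this mixing happens inside the NMT embedding layer.

```python
import numpy as np

def mix_embeddings(real_vecs, soft_vecs, swap_prob=0.25, rng=None):
    """During NMT training, each token's real embedding is independently
    replaced by the frozen CMLM's soft word vector with probability
    `swap_prob`; the remaining tokens keep their real embeddings."""
    rng = rng or np.random.default_rng(0)
    use_soft = rng.random(len(real_vecs)) < swap_prob   # per-token coin flip
    return np.where(use_soft[:, None], soft_vecs, real_vecs)
```

Setting `swap_prob` to 0 recovers plain NMT training, so the replacement probability directly controls how much augmentation the model sees.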
Experiments and Results
To verify the effect of the proposed method, we conducted experiments on three smaller datasets, IWSLT2014 German, Spanish, and Hebrew to English, and one larger dataset, WMT14 English to German.
We compare our method with several other data augmentation methods, including standard word replacement techniques such as word swapping, deletion, and random replacement, and two methods that use language models for replacement. We also compare it with the sentence-level augmentation method mixSeq. Our baseline system uses no data augmentation.
For comparison, we conduct two sets of data augmentation experiments with CMLM: the first uses the soft word vector replacement described above; the second uses the traditional sampling-based replacement, in which the replacement words are sampled from the CMLM's predicted distribution.
Both methods are applied to both the source and target sides, with the same mask probability gamma = 0.25, which is the optimal configuration we found.
The experimental results are shown in the following table:
The results in the table show that both CMLM-based data augmentation methods significantly outperform the baseline system, with the CMLM soft word vector method achieving the best results on all tasks. In particular, it achieves an improvement of 1.9 BLEU on WMT English to German.
In addition to experiments on public corpora, we also apply the method to the online system of Youdao Translation. The Youdao online translation system ( http://fanyi.youdao.com ) is trained on nearly 100 million sentence pairs, has a model of close to 500 million parameters, and uses a variety of optimization methods, outperforming other products on multiple test sets. Our method also achieves significant improvements on such a leading commercial machine translation system.
Practical Applications
Since the launch of NetEase Youdao Dictionary in 2007, the Youdao AI team has worked on machine translation technology for many years. In 2017, the Youdao Neural Machine Translation engine (YNMT) was launched, bringing a qualitative leap in translation quality.
In addition to NetEase Youdao Dictionary, Youdao's neural machine translation technology has been applied to learning tool apps such as Youdao Translator, Youdao Children's Dictionary, and U-Dictionary, providing translation and language-learning services.
Beyond software, YNMT technology has also been applied to intelligent learning hardware such as Youdao Dictionary Pen, Youdao Smart Learning Lamp, Youdao AI Learning Machine, and Youdao Listening Treasure, where designs customized for on-device power consumption enable core functions such as "millisecond-level lookup" and "0.5s fingertip word search".
Based on self-developed core AI technology and a deep understanding of learning scenarios, NetEase Youdao has developed businesses including learning hardware and tools, literacy courses, university and workplace courses, and educational informatization, and is committed to helping users learn efficiently. In the future, Youdao AI will continue forward-looking research on cutting-edge technologies and promote their implementation in products and real scenarios.