Abstract: Starting from the paper "Shallow-Fusion End-to-End Contextual Biasing", this article explores end-to-end ASR solutions for proprietary domains.
This article is shared from the HUAWEI CLOUD COMMUNITY post "The Road to End-to-End ASR in the Proprietary Field (1)", original author: xiaoye0829.
For product-level automatic speech recognition (ASR), the ability to adapt to the contextual bias of a proprietary domain is very important. For example, ASR on a mobile phone needs to accurately recognize the app names and contact names spoken by the user, rather than other words with the same pronunciation. To be more concrete, the name "Yao Ming" may refer to the well-known athlete in the sports domain, but on a mobile phone it may instead refer to a friend named "Yao Min" in the address book. How to handle this kind of bias as the application domain changes is the main problem explored in this series of articles.
Traditional ASR systems usually have an independent acoustic model (AM), pronunciation model (PM), and language model (LM). When recognition needs to be biased toward a specific domain, the LM can be adapted to that context. In an end-to-end model, however, AM, PM, and LM are folded into a single neural network, which makes contextual biasing much more challenging. The main reasons are as follows:
- The end-to-end model sees text only from the paired audio-text training data, whereas the LM in a traditional ASR system can be trained on much larger amounts of text. As a result, end-to-end models are more prone to errors than traditional models when recognizing rare, context-dependent words and phrases, such as proper noun phrases.
- To keep decoding efficient, the end-to-end model retains only a small number of candidates (typically 4 to 10) at each step of beam search. As a result, rare phrases, such as context-dependent n-grams, are very likely to fall out of the beam.
Previous work mainly tries to integrate an independently trained contextual n-gram language model into the end-to-end model, an approach also known as shallow fusion. However, this approach handles proper nouns poorly: proper nouns are often pruned during beam search, so by the time the contextual LM applies its bias it is already too late, because the bias is usually applied only after each word has been generated, while beam search operates on sub-word units such as graphemes or wordpieces (for English, graphemes are the 26 letters, the space, and 12 common punctuation marks; for Chinese, graphemes are the 3755 first-level Chinese characters, 3008 second-level Chinese characters, and 16 punctuation marks).
In this blog post, we introduce a piece of work that tries to solve this problem: "Shallow-Fusion End-to-End Contextual Biasing", published by Google at InterSpeech 2019. In this work, first, to avoid proper nouns being pruned before the contextual LM can bias them, we explore biasing at the sub-word level. Second, we explore applying the contextual FST before beam pruning. Third, because contextual n-grams usually follow a set of common prefixes ("call", "text", etc.), we also explore incorporating these prefixes into shallow fusion. Finally, to help the model with proper nouns, we explore a variety of techniques for exploiting large-scale text data.
We first introduce shallow fusion. Given a speech sequence x=(x_1, …, x_K), the end-to-end model outputs a sequence of sub-word-level posterior probability distributions y=(y_1, …, y_L), i.e. P(y|x). Shallow fusion merges the end-to-end model's score with the score of an externally trained LM during beam search:
y^{*} = \arg\max_{y} \log P(y|x) + \lambda \log P_C(y)
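To make the interpolation concrete, here is a toy Python sketch of how the two scores might be combined for a single hypothesis. The function name, the λ value, and the example scores are illustrative assumptions, not the paper's implementation.

```python
import math

def shallow_fusion_score(e2e_log_prob, contextual_lm_log_prob, lam=0.3):
    """Combine the end-to-end log-probability of a hypothesis with the
    contextual LM log-probability, weighted by lambda (illustrative value)."""
    return e2e_log_prob + lam * contextual_lm_log_prob

# Two competing hypotheses with similar acoustics; the contact list contains
# "yao min", so the contextual LM gives it a much higher probability.
hyps = {
    "call yao ming": (-2.1, math.log(1e-4)),  # (E2E log-prob, contextual LM log-prob)
    "call yao min":  (-2.3, math.log(5e-2)),
}
best = max(hyps, key=lambda h: shallow_fusion_score(*hyps[h]))
print(best)  # "call yao min": the contextual bias flips the decision
```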
Here \lambda is a parameter that balances the weight of the end-to-end model against the contextual LM. To construct the contextual LM for the end-to-end model, we assume we are given a set of word-level bias phrases, which are compiled into an n-gram WFST (weighted finite-state transducer). This word-level WFST is then composed with a speller FST that converts a sequence of graphemes/wordpieces into the corresponding words, yielding a contextual FST over sub-word units.
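As a rough illustration of a contextual biasing structure over sub-word units, the sketch below approximates it with a prefix trie rather than a real WFST. `tokenize` is a stand-in for a wordpiece tokenizer (here it just splits into characters), and the per-arc bonus value is an assumption; this is not the paper's implementation.

```python
def tokenize(word):
    """Stand-in for a real wordpiece tokenizer: just split into characters."""
    return list(word)

class BiasTrie:
    """Toy approximation of a contextual biasing FST over sub-word units."""
    def __init__(self, phrases, bonus=1.0):
        self.root = {}
        self.bonus = bonus  # the same weight on every arc
        for phrase in phrases:
            node = self.root
            for unit in tokenize(phrase):
                node = node.setdefault(unit, {})
            node["<final>"] = True  # marks the end of a complete bias phrase

    def score(self, units):
        """Accumulated bias bonus for a partial hypothesis (0 if it falls off)."""
        node, total = self.root, 0.0
        for unit in units:
            if unit not in node:
                return 0.0
            node = node[unit]
            total += self.bonus
        return total

trie = BiasTrie(["cat", "caitlin"])
print(trie.score(list("ca")))   # 2.0: partial match is rewarded
print(trie.score(list("car")))  # 0.0: mismatch earns nothing
```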
All previous biasing work, whether for traditional systems or end-to-end models, combines the scores of the contextual LM and the base model (e.g., the end-to-end model or the ASR acoustic model) on a word or sub-word lattice. Because end-to-end models usually decode with a relatively small beam, far fewer paths survive than in traditional systems. This article therefore mainly explores applying the contextual information to the end-to-end model before beam pruning.
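Below is a hedged sketch of what applying the bias before pruning could look like in a single beam-search step, reusing the `BiasTrie` sketch above. The data structures and names are invented for illustration and are not the authors' code.

```python
def beam_step_with_early_biasing(beams, next_unit_log_probs, bias, beam_size=4):
    """One beam-search step where the contextual bonus is added to every
    candidate extension *before* top-k pruning, so biased candidates survive.

    beams: list of (units, score) partial hypotheses
    next_unit_log_probs: dict mapping tuple(units) -> {unit: log prob}
    bias: object with a .score(units) method, e.g. the BiasTrie sketch above
    """
    candidates = []
    for units, score in beams:
        for unit, logp in next_unit_log_probs[tuple(units)].items():
            new_units = units + [unit]
            delta_bias = bias.score(new_units) - bias.score(units)
            candidates.append((new_units, score + logp + delta_bias))
    # pruning happens only after the bias has been applied ("early biasing")
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```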
When we choose to bias at the grapheme level, one concern is that a large number of irrelevant words may match the contextual FST and crowd out the beam.
For example, as shown in the figure above, if we want to bias toward the word "cat", the contextual FST is built to bias the three letters "c", "a", and "t". When we bias the letter "c", we may add not only "cat" but also unrelated words such as "car" to the beam. If we instead bias at the wordpiece level, fewer unrelated sub-words match, so more relevant candidates can be kept in the beam. Taking "cat" again as an example, with wordpiece-level biasing the word "car" will not enter the beam. In this article, we therefore use a wordpiece vocabulary of size 4096.
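The toy snippet below illustrates the difference, assuming a hypothetical tokenization in which "cat" is a single wordpiece `_cat`: at the grapheme level the prefix "ca" of "car" still matches the biased units, while at the wordpiece level it does not.

```python
grapheme_bias = ["c", "a", "t"]   # units biased for "cat" at the grapheme level
wordpiece_bias = ["_cat"]         # assumed single-wordpiece tokenization of "cat"

def prefix_is_biased(emitted_units, bias_units):
    """True if the units emitted so far lie on the biased unit sequence."""
    return emitted_units == bias_units[:len(emitted_units)]

# While decoding "car" grapheme by grapheme, the first two steps still look
# like the biased word "cat", so "car" rides the bias into the beam:
print(prefix_is_biased(["c", "a"], grapheme_bias))  # True
# At the wordpiece level "car" and "cat" are different single units:
print(prefix_is_biased(["_car"], wordpiece_bias))   # False
```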
A further observation is that shallow fusion modifies the posterior probabilities of the output, so it can hurt utterances that contain no bias words, i.e., out-of-context utterances. We therefore explore biasing only those phrases that follow a specific prefix. For example, when looking up a contact on a phone, we usually first say "call" or "message"; when we want to play music, we first say "play". In this article, we take these prefix words into account when constructing the contextual FST: we extract the prefix words that appear more than 50 times before the contextual bias phrases. In the end, we obtain 292 common prefixes for contacts, 11 for songs, and 66 for apps. We build an unweighted prefix FST and concatenate it with the contextual FST, and we also allow an empty prefix option so that the prefix words can be skipped.
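A minimal sketch of prefix-conditioned biasing is shown below. The prefix set is only an illustrative subset, and the idea of giving the empty-prefix path a smaller weight (discussed later in the experiments) uses an assumed value of 0.5, not the paper's number.

```python
PREFIXES = {"call", "text", "message", "play", "open"}  # illustrative subset

def bias_weight(words_so_far, full=1.0, empty_prefix=0.5):
    """Per-unit bias weight for the next phrase: full weight after a known
    prefix word, a smaller weight on the empty-prefix path."""
    if words_so_far and words_so_far[-1] in PREFIXES:
        return full
    return empty_prefix

print(bias_weight(["call"]))        # 1.0: phrase after "call" gets full bias
print(bias_weight(["what", "is"]))  # 0.5: empty-prefix path is down-weighted
```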
One way to increase the coverage of proper nouns is to use large amounts of unsupervised data. The unsupervised data comes from anonymized voice search utterances. These utterances are transcribed with a state-of-the-art model, and only those with high confidence are kept. To make sure the retained utterances mainly contain proper nouns, we run a proper noun tagger (a CRF-based sequence-labeling NER model) and keep only utterances containing proper nouns. With this procedure we obtain 100 million unsupervised utterances, which are combined with 35 million supervised utterances for training. During training, 80% of each batch is supervised data and 20% is unsupervised data.

One problem with unsupervised data is that the machine transcription may be wrong, and it also fixes the spelling of a name to whatever the recognizer produced, e.g., Eric versus Erik. Therefore, we can also take a large list of proper nouns and use TTS to create a synthetic dataset. We mine a large number of contextual bias words from the web for different categories, such as multimedia, social, and apps, and end up with about 580,000 contact names, 42,000 song names, and 70,000 app names. Next, we mine common prefixes from logs; for example, from "call John mobile" we obtain the prefix "call" for the social domain. We then combine category-specific prefixes and proper nouns to generate recognition transcripts and use a speech synthesizer to generate roughly 1 million utterances per category. We further add noise to these utterances to simulate room acoustics. When training with this data, 90% of each batch is supervised data and 10% is synthetic data.
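A simple sketch of the batch-mixing scheme described above (80/20 supervised/unsupervised, or 90/10 supervised/synthetic) might look like the following; the sampling function is an assumption for illustration, not the actual training pipeline.

```python
import random

def sample_batch(supervised, extra, extra_ratio=0.2, batch_size=8, rng=random):
    """Mix supervised and extra (unsupervised or synthetic) utterances in one
    batch: extra_ratio=0.2 gives the 80/20 supervised/unsupervised mix,
    extra_ratio=0.1 the 90/10 supervised/synthetic mix described above."""
    batch = []
    for _ in range(batch_size):
        pool = extra if rng.random() < extra_ratio else supervised
        batch.append(rng.choice(pool))
    return batch

supervised = [("sup_audio_%d" % i, "human transcript") for i in range(100)]
unsupervised = [("unsup_audio_%d" % i, "machine transcript") for i in range(100)]
print(len(sample_batch(supervised, unsupervised)))  # 8
```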
Finally, we explore whether we can inject more proper nouns into the supervised training set. Specifically, we run the proper noun tagger on each utterance to find the proper nouns in it. For each proper noun, we obtain its pronunciation; for example, "Caitlin" can be represented as the phoneme sequence "K eI tl @ n". We then look up words with the same phoneme sequence in the pronunciation dictionary, such as "Kaitlyn". During training, we randomly replace the original word in the reference with one of these alternatives. This lets the model observe more proper nouns: if the model learns to spell more names during training, then at decoding time, combined with the contextual FST, it is better able to spell these names.
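The snippet below sketches this pronunciation-based substitution. The pronunciation dictionary and the swap probability are hypothetical; only the "Caitlin"/"Kaitlyn" pair and the phoneme string come from the text above.

```python
import random

# Hypothetical pronunciation dictionary: phoneme string -> alternative spellings.
PRONUNCIATIONS = {"K eI tl @ n": ["Caitlin", "Kaitlyn"]}

def maybe_swap_proper_noun(words, proper_nouns, word_to_phonemes,
                           swap_prob=0.5, rng=random):
    """Randomly replace tagged proper nouns in the reference transcript with
    other spellings that share the same phoneme sequence."""
    out = []
    for word in words:
        alternatives = PRONUNCIATIONS.get(word_to_phonemes.get(word, ""), [])
        others = [w for w in alternatives if w != word]
        if word in proper_nouns and others and rng.random() < swap_prob:
            out.append(rng.choice(others))
        else:
            out.append(word)
    return out

print(maybe_swap_proper_noun(["call", "Caitlin"], {"Caitlin"},
                             {"Caitlin": "K eI tl @ n"}, swap_prob=1.0))
# ['call', 'Kaitlyn']
```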
Now let's look at the experiments. All experiments are based on an RNN-T model. The encoder contains a time-reduction layer and 8 LSTM layers, each with 2000 hidden units. The decoder contains 2 LSTM layers, each with 2000 hidden units. The encoder and decoder outputs are fed into a joint network with 600 hidden units, followed by a softmax over either 96 graphemes or 4096 wordpieces. At inference time, each utterance comes with a set of bias phrases that are compiled into a contextual FST. In this FST every arc has the same weight, and the weight is tuned separately for each category's test set (music, contacts, etc.).
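For reference, the described setup can be summarized as a small configuration sketch; the key names are invented, while the numbers are the ones quoted above.

```python
# Hypothetical configuration names; the numbers are the ones quoted above.
rnnt_config = {
    "encoder": {"time_reduction_layer": True, "lstm_layers": 8, "hidden_units": 2000},
    "decoder": {"lstm_layers": 2, "hidden_units": 2000},
    "joint_network": {"hidden_units": 600},
    "output_units": {"graphemes": 96, "wordpieces": 4096},  # one softmax per setup
}
```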
The figure above shows the shallow fusion results. E0 and E1 are the grapheme and wordpiece baselines without any biasing. E2 is grapheme-level biasing without any of the improvements in this article. E3 adds a subtractive cost to prevent poor candidates from being kept in the beam, which brings improvements on almost all test sets. Moving from grapheme-level to wordpiece-level biasing, i.e., biasing on longer units, helps keep relevant candidates in the beam and further improves performance. Finally, the E5 model applies the biasing FST before beam-search pruning, which we call early biasing; this helps good candidates stay in the beam earlier and brings an additional improvement. In short, our best shallow fusion model biases at the wordpiece level with subtractive cost and early biasing.
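The subtractive-cost idea can be approximated on top of the `BiasTrie` sketch from earlier: bonuses are handed out while a hypothesis stays on a bias path, kept once a phrase completes, and taken back if the match fails partway. This is a simplification of the FST failure-arc mechanism, not the paper's implementation.

```python
def contextual_bonus_with_subtraction(emitted_units, trie):
    """Net contextual bonus under a subtractive cost, using the BiasTrie above:
    a per-unit bonus accrues while the hypothesis stays on a bias path, is kept
    once a phrase completes, and is taken back if the match fails early."""
    committed, pending, node = 0.0, 0.0, trie.root
    for unit in emitted_units:
        if unit not in node:              # fell off the bias path: subtract bonus
            pending, node = 0.0, trie.root
        if unit in node:                  # start or continue matching a phrase
            node = node[unit]
            pending += trie.bonus
            if "<final>" in node:         # phrase completed: keep its bonus
                committed, pending, node = committed + pending, 0.0, trie.root
    return committed + pending

trie = BiasTrie(["cat"])
print(contextual_bonus_with_subtraction(list("cats"), trie))  # 3.0: completed match
print(contextual_bonus_with_subtraction(list("car"), trie))   # 0.0: failed match nets zero
```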
Since bias phrases may or may not be present in an utterance, we also need to make sure the model does not degrade when no bias phrase is present, i.e., that it does not hurt recognition of utterances without bias words. To test this, we run experiments on the VS test set, randomly selecting 200 bias phrases from the Cnt-TTS test set to build the bias FST. The figure below shows the results:
As the table shows, E1 is the baseline. After adding biasing, the E5 model degrades considerably on VS. To address this, traditional systems include prefix words in the biasing FST. If we apply biasing only after seeing a non-empty prefix (E6), the result improves on VS compared with E5, but degrades on the test sets that contain bias words. If we further allow the prefix to be empty (mainly to cover utterances that contain bias words but no prefix), we only obtain results similar to E5. To resolve this, we assign a smaller weight to a contextual phrase when it is preceded by an empty prefix (i.e., no prefix word). With this method (E8), we observe only a small degradation on VS relative to E1, while keeping the improvement on the test sets with bias phrases.
Based on the above analysis, we further explore whether the biasing ability can be improved when the model is exposed to more proper nouns. The baseline is E8, trained on the 35 million supervised utterances. Adding the unsupervised data and the synthetic data described above, we run the following experiments:
The E9 results show that training with the unsupervised data improves performance on every test set. Training with the synthetic data (E10) brings a larger improvement on the TTS test sets compared with E9, but a larger degradation on the real-scene test set Cnt-Real (7.1 vs 5.8). This suggests that the gain on the TTS bias test sets mainly comes from the matched audio conditions between training and test, rather than from learning a richer vocabulary of proper nouns.