
Foreword

"Voice processing" is a very important scene in the field of real-time interaction. In the " RTC Dev Meetup丨Technical Practice and Application of Voice Processing in the Field of Real-time Interaction " initiated by Shengwang, technologies from Baidu, Huanyu Technology and Yitu Experts have shared on this topic.

This article is based on the talk given at the event by Tan Xu, a senior researcher at Microsoft Research Asia. Follow the official account "Shengwang Developer" and reply with the keyword "DM0428" to download the PPT materials for the event.

Speech recognition error correction further improves recognition accuracy by detecting and correcting errors in speech recognition results. At present, most error correction models use an autoregressive structure based on the attention mechanism, whose high latency hinders online deployment.

This article introduces FastCorrect, a low-latency, high-accuracy error correction model. By using edit alignment and multiple candidate results, the model achieves a 6-9x speedup together with about a 10% reduction in word error rate. The related research papers were published at NeurIPS 2021 and EMNLP 2021.

01 Background Information

1. ASR (Automatic Speech Recognition)

Recognition accuracy is the most critical factor limiting the wide application of speech recognition, so reducing the recognition error rate matters greatly to ASR. There are many ways to improve accuracy and reduce errors. The traditional approach is to improve the core recognition model itself: past research has mainly focused on better modeling paradigms and better training data for ASR. Beyond improving the recognition model, however, the recognition results can also be post-processed to further reduce the error rate.

2. ASR post-processing

What can be done in ASR post-processing? The first option is reranking. When an ASR system generates text, it usually produces multiple candidates; a reranking model can reorder them and select a better one as the final recognition result, improving accuracy. The second option is to perform error correction on the recognition result, which can further reduce the error rate. Both are common post-processing methods for reducing the error rate. Today's sharing focuses on error correction.

3. Why choose error correction

We chose error correction because it directly rewrites the existing recognition output and can therefore produce a result better than any returned candidate, whereas reranking can only select the best sentence among the candidates the ASR system already produced. If the error correction works well enough, it has a higher ceiling than reranking.

02 Formulation of the ASR error correction task

Having explained the choice of technical solution and why error correction was chosen, we now define the form of the ASR error correction task. First, a dataset (S, T) is given, where S is the input speech and T is the corresponding text annotation. The ASR model recognizes the speech as text, producing M(S). The pairs (M(S), T) constitute the training set on which the error correction model is trained. After training, given an ASR output M(S), the model returns the corrected result.
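As a minimal illustration (not the paper's actual data pipeline), the Python sketch below assembles such correction pairs, assuming a hypothetical `asr_model.transcribe(speech)` interface and an iterable of (speech, transcript) pairs:

```python
# A minimal sketch of assembling the correction training set, assuming a
# hypothetical `asr_model.transcribe(speech)` API.
def build_correction_pairs(dataset, asr_model):
    """dataset: iterable of (speech S, transcript T) -> list of (M(S), T) pairs."""
    pairs = []
    for speech, transcript in dataset:
        hypothesis = asr_model.transcribe(speech)  # M(S): possibly erroneous text
        pairs.append((hypothesis, transcript))     # correction model learns M(S) -> T
    return pairs
```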

The error correction task is a typical sequence-to-sequence learning task: the input is the sentence produced by speech recognition, and the output is the corrected sentence. Since it is a sequence-to-sequence problem, previous work naturally treated it as sequence generation and corrected errors through autoregressive encoder-attention-decoder decoding, mapping a wrong sentence to a correct one.

Autoregressive decoding generates tokens one by one, for example generating A, then the next word B, then C and D in turn. The problem with this method is that decoding is slow. In our measurements, the average latency of an online ASR model on CPU is about 500 milliseconds; adding an autoregressive error correction model adds roughly another 660 milliseconds, more than doubling the end-to-end latency, as shown in Figure 1.


■Figure 1

Such a scheme is clearly unacceptable for actual deployment, so our goal is to reduce latency while maintaining correction accuracy. We speed things up with a non-autoregressive method: instead of generating one token at a time as the autoregressive method does, it generates all tokens at once, which greatly improves decoding speed.
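The control-flow difference is easy to see in code. The sketch below assumes a hypothetical model exposing `decode_step` (one token per forward pass) and `decode_all` (every token in a single forward pass); only the loop structure matters here.

```python
# Schematic contrast of the two decoding regimes, assuming hypothetical
# `decode_step` / `decode_all` methods on the model.

def autoregressive_decode(model, src, max_len, bos_id, eos_id):
    tokens = [bos_id]
    for _ in range(max_len):                    # one forward pass per output token
        next_id = model.decode_step(src, tokens)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

def non_autoregressive_decode(model, src, target_len):
    return model.decode_all(src, target_len)    # one forward pass for all tokens
```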

Since non-autoregressive decoding is widely used in machine translation, we first directly applied a typical non-autoregressive translation model and found that it did not reduce the ASR error rate but actually increased it. Why? We found that non-autoregressive error correction for speech recognition differs from machine translation. In translation, say from Chinese to English, essentially every token of the input sequence must be rewritten to turn the Chinese into English. In error correction, however, most of the input sentence is already correct; most words do not need to be modified at all.

If the translation-style approach is adopted unchanged, two problems easily arise: missed corrections and false corrections. This is the central challenge of the task: detecting where the errors are, and correcting them precisely, is the key to accuracy.

03 Naive NAR solution fails

We analyzed the problem in detail, hoping to find task-specific features for designing a dedicated non-autoregressive model. First, machine translation between languages (such as Chinese to English) involves word reordering, because the two languages differ in word order. In error correction, by contrast, the text produced by speech recognition and the final correct text exhibit no reordering errors; they are monotonically aligned.

Second, the token-level errors themselves come in several kinds: insertion errors, deletion errors, and substitution errors. These two pieces of prior knowledge supply fine-grained error patterns that can guide error detection and correction, and this analysis inspired the design of our models.

04 Introduction of FastCorrect series models

Microsoft has carried out a series of works on the FastCorrect model, including FastCorrect 1, FastCorrect 2, and FastCorrect 3, each addressing a different problem and scenario. FastCorrect 1, published at NeurIPS 2021, builds on the prior knowledge analyzed above: it uses the text edit distance to provide guidance signals for insertions, deletions, and substitutions when correcting ASR output. It corrects only the single best ASR result (an ASR system can return one result, or multiple results through beam search decoding). FastCorrect 1 achieves a 7-9x speedup and an 8% WERR (word error rate reduction). Although 8% WERR may look small, it is not easy to achieve when ASR accuracy is already very high.

Although speech recognition usually returns a single candidate in the end, multiple candidates are retained during decoding. If those candidates can provide mutually corroborating information, they can help us correct errors better. So we designed FastCorrect 2, published in the Findings of EMNLP 2021, to exploit the synergy among multiple candidates. Compared with FastCorrect 1, it further reduces the error rate while maintaining a good speedup ratio.

Both works are open source on Microsoft's GitHub ( https://github.com/microsoft/NeuralSpeech ); if you are interested, feel free to try them. Next, the technical details of the two works are introduced.

1. FastCorrect

The core of FastCorrect is to exploit the prior knowledge in text error correction, namely the insertion, deletion, and substitution operations. We first align the wrong text with the correct text, using the edit distance to guide the alignment. The alignment tells us which words to delete, which to insert, which to substitute, and so on; with these fine-grained supervision signals, modeling becomes much easier. To express deletions we use the concept of duration: for each input token, the duration says how many tokens it becomes in the correct target sentence. A duration of zero means the token is deleted, one means it is kept unchanged or substituted, and two or more means tokens are inserted alongside it (possibly combined with a substitution).

With such fine-grained supervision signals, model performance improves over the purely end-to-end learning used in machine translation. The non-autoregressive model is designed in three parts: the encoder takes the wrong text as input and extracts information; the duration predictor predicts how many target tokens each source token should become; and the decoder finally generates the target tokens.
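Below is a simplified PyTorch skeleton of this three-part design. The layer sizes, the use of a plain self-attention stack as the parallel decoder, and the `_expand` helper are illustrative assumptions; the released implementation differs in details such as decoder cross-attention (see https://github.com/microsoft/NeuralSpeech for the official code).

```python
from torch import nn

class NARCorrector(nn.Module):
    """Simplified sketch: encoder + duration predictor + parallel decoder."""

    def __init__(self, vocab_size, d_model=512, nhead=8, layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.duration = nn.Linear(d_model, 1)  # how many tokens each source token becomes
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, layers)  # no causal mask: parallel
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, src, gold_durations=None):
        enc = self.encoder(self.embed(src))          # (batch, src_len, d_model)
        dur_logits = self.duration(enc).squeeze(-1)  # (batch, src_len)
        durations = (gold_durations if gold_durations is not None
                     else dur_logits.round().clamp(min=0).long())
        dec_in = self._expand(enc, durations)        # repeat/drop states by duration
        return self.project(self.decoder(dec_in)), dur_logits

    @staticmethod
    def _expand(enc, durations):
        # Duration 0 drops a state (deletion); duration k >= 2 repeats it (insertion).
        rows = [e.repeat_interleave(d, dim=0) for e, d in zip(enc, durations)]
        return nn.utils.rnn.pad_sequence(rows, batch_first=True)
```

In training, the gold durations come from the edit alignment described next; at inference the model falls back to its own rounded duration predictions.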

(1) Edit alignment

Next, we introduce the edit alignment operation in FastCorrect. In Figure 2, the sequence on the left, BBDEF, is the speech recognition output, and the target sequence ABCDF is the actual correct result, which means the recognition contains errors. We align the two using edit distance: upward arrows indicate deletions, leftward arrows indicate insertions, and diagonal arrows indicate substitutions.


■Figure 2

The edit distance alignment yields several different paths, all with the same edit distance. Each path specifies the alignment between every source token and the target tokens. We then keep the paths with a relatively high match degree; for example, paths a and b match better than path c, so we choose the alignment based on paths a and b. These two paths give three possible alignments. In Align a, for instance, the first B corresponds to A and B while the second B corresponds to C, and so on. A single path can also yield different alignments: in Align b1 a B may instead correspond to B and C, and in Align b2, D may correspond to C and D. We then check which token combinations are common in a text corpus and select the most reasonable alignment by the collocation frequency of words.

From BBDEF and ABCDF at the bottom of Figure 2, we know how many tokens each source token should become. For example, in Align b1 the first B becomes 2 tokens, the second B becomes 1, D becomes 1, E becomes 0, and F becomes 1. With these signals, the duration of every source token is fully specified.
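As a sketch of this procedure, the snippet below derives per-token durations from one shortest edit path, preferring matches and attaching insertions to the preceding source token. Note that several shortest paths exist: this fixed preference yields [1, 2, 1, 0, 1], while Figure 2's Align b1 corresponds to [2, 1, 1, 0, 1]; both are valid, and FastCorrect disambiguates using token n-gram frequency rather than a fixed rule.

```python
def align_durations(source, target):
    """For each source token, count the target tokens it maps to,
    following one shortest edit path (match > insert > delete > substitute)."""
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (source[i - 1] != target[j - 1]),
                           dp[i - 1][j] + 1,   # delete source[i-1]
                           dp[i][j - 1] + 1)   # insert target[j-1]
    dur = [0] * m
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            dur[i - 1] += 1; i, j = i - 1, j - 1     # exact match
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            dur[max(i - 1, 0)] += 1; j -= 1          # insertion: credit previous source token
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                                   # deletion: duration stays 0
        else:
            dur[i - 1] += 1; i, j = i - 1, j - 1     # substitution
    return dur

print(align_durations(list("BBDEF"), list("ABCDF")))  # -> [1, 2, 1, 0, 1]
```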

(2) NAR model

As shown in Figure 3, the encoder takes the wrong sentence as input, the duration predictor predicts how many words each source token will become, and the sentence is spread out accordingly. For example, if the first B becomes two words, B is repeated twice; a token with duration 1 is kept in place; a token with duration 0 is deleted. The expanded sequence is then used as the decoder input and decoded in parallel (a code sketch of this spreading step follows Figure 3). This is the core design of the model.


■Figure 3
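The spreading step takes only a few lines; the sketch below applies the Align b1 durations from Figure 2 to the source BBDEF.

```python
# Spreading source tokens by duration to form the decoder input (Figure 3):
# duration 2 repeats a token, duration 0 drops it.
def expand(tokens, durations):
    out = []
    for tok, d in zip(tokens, durations):
        out.extend([tok] * d)
    return out

print(expand(list("BBDEF"), [2, 1, 1, 0, 1]))  # ['B', 'B', 'B', 'D', 'F']
```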

(3) Pre-training

In training the error correction model, the ASR word error rate is relatively low, so erroneous cases are scarce, the effective training data is insufficient, and model training suffers. We therefore additionally construct pseudo paired data: erroneous input paired with the correct output sentence. Because the data obtainable from past ASR models was not enough, we constructed such data at large scale for pre-training and then fine-tuned on real speech recognition datasets. When constructing the data, we simulate deletions, insertions, and substitutions, with operation probabilities close to the actual error patterns of existing ASR systems. For substitutions, we give priority to homophones, because ASR errors are typically homophone errors. Pre-training on such data helps the model train well.
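A sketch of constructing such pseudo pairs from plain text is below. The `homophones` dictionary (word -> similar-sounding words) and the operation probabilities are illustrative assumptions; in practice the probabilities are matched to the error patterns of the target ASR system.

```python
import random

def corrupt(sentence, homophones, p_del=0.01, p_ins=0.01, p_sub=0.04):
    """sentence: list of words; returns a noisy copy simulating ASR errors."""
    noisy = []
    for word in sentence:
        r = random.random()
        if r < p_del:                                      # simulate a deletion error
            continue
        if r < p_del + p_ins:                              # simulate an insertion error
            noisy.append(random.choice(list(homophones)))
            noisy.append(word)
        elif r < p_del + p_ins + p_sub and word in homophones:
            noisy.append(random.choice(homophones[word]))  # homophone substitution
        else:
            noisy.append(word)                             # keep the word unchanged
    return noisy  # (corrupt(t), t) forms one pseudo pre-training pair
```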

(4) Experiments

Next, we introduce some experimental details. We focus on Chinese speech recognition error correction on public academic data and on Microsoft's internal speech recognition dataset, and use roughly 400 million sentences of constructed data for pre-training.


■Figure 4

The experimental results are shown in Figure 4. The word error rate of the original speech recognition output is about 4.83. The autoregressive model mentioned above (encoder-attention-decoder) achieves roughly a 15% relative reduction in word error rate, but its latency is high. The other baselines are methods used in the past, including non-autoregressive methods from machine translation and some text-editing methods. Relative to the original ASR errors, our method achieves a 13-14% reduction in word error rate, close to the autoregressive model, which means almost no loss of correction ability, while its latency is 7x lower than the autoregressive model. FastCorrect thus maintains the word error rate while improving speed to a level suitable for online deployment.

We also ablate the pre-training on constructed data and the edit-distance alignment. As shown on the two datasets in Figure 5, removing either of these FastCorrect components reduces accuracy, indicating that both modules are useful.


■Figure 5

In an autoregressive encoder-decoder model, the decoder is the time-consuming part because it must decode one token at a time. A natural question arises: to speed up the autoregressive model, can we deepen the encoder and make the decoder shallower, achieving the same speedup while keeping accuracy? We ran a comparison between FastCorrect and different variants of the autoregressive model. As shown in Figure 6, AR 6-6 denotes 6 encoder layers and 6 decoder layers, while AR 11-1 denotes 11 encoder layers and 1 decoder layer. FastCorrect works better: at a similar word error rate, its speedup ratio is much larger, which settles the question.


■Figure 6

As mentioned above, detecting errors and correcting them are both crucial in text error correction. We also compared detection precision and recall, as well as correction ability. The comparison shows that FastCorrect is indeed better than previous methods, which verifies the earlier conjecture: providing fine-grained insertion, deletion, and substitution guidance signals from prior knowledge helps us better detect and correct errors.

2. FastCorrect 2

(1) Multiple candidates

FastCorrect 2 is an extension of FastCorrect 1. An ASR model generally produces multiple candidate sentences, which provide extra information that we call the voting effect. Suppose a piece of speech yields three candidate sentences: "I have cat", "I have hat", and "I have bat". These three sentences corroborate each other. First, the first two words are very likely correct, because all three candidates recognize "I have". The last word differs across candidates, so some or even all of them may be wrong, but the word very likely ends with the "at" sound. With such information, the difficulty of error detection and correction drops substantially; when revising, a more reasonable word can be chosen from the candidates, narrowing the search space. That is the idea behind FastCorrect 2.
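A toy illustration of the voting effect, assuming the candidates are already token-aligned: columns where all candidates agree are probably correct, while a disagreeing column both localizes a likely error and supplies a shortlist of alternatives.

```python
from collections import Counter

candidates = [["I", "have", "cat"],
              ["I", "have", "hat"],
              ["I", "have", "bat"]]

for pos, column in enumerate(zip(*candidates)):
    votes = Counter(column)
    word, count = votes.most_common(1)[0]
    label = "agreed" if count == len(candidates) else f"disputed {dict(votes)}"
    print(pos, word, label)
# 0 I agreed / 1 have agreed / 2 cat disputed {'cat': 1, 'hat': 1, 'bat': 1}
```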

(2) Model structure

The resulting model structure is shown in Figure 7. First, the multiple candidate sentences from speech recognition are aligned before being input, because alignment exposes the mutually corroborating information; in the earlier example, cat, hat, and bat need to be aligned. The encoder then takes the concatenated aligned candidates as input and predicts the duration of each token, i.e., how many words it becomes after correction. A selector, trained with a supervision loss, chooses the best candidate, and the correction is performed on top of it. In Figure 7 the third candidate is the best, so it is used as the decoder input. This is the high-level design of FastCorrect 2.


■Figure 7

(3) Align multiple candidates

One detail is how to align multiple sentences so that the correspondence is accurate. We pick an arbitrary candidate as the anchor and align the other sentences to it; we will not go into too much detail here. The alignment method is the same as in FastCorrect 1: first compute the edit distance, then obtain the edit paths and select a reasonable alignment from them. After every sentence has been aligned to the anchor, the candidates are merged to form a multi-way alignment, which then serves as the model input.
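A much-simplified sketch of this merging, reusing the `align_durations` helper from earlier: every candidate is aligned to the anchor, and tokens that align to the same anchor position fall into the same column ("<pad>" fills deletions; extra inserted tokens are folded away here, which the real method handles more carefully).

```python
def align_to_anchor(anchor, candidate):
    """Place candidate tokens under their anchor positions."""
    durs = align_durations(anchor, candidate)  # candidate tokens per anchor token
    row, k = [], 0
    for d in durs:
        row.append(candidate[k] if d > 0 else "<pad>")
        k += d  # skip any extra inserted tokens (simplification)
    return row

anchor = ["I", "have", "cat"]
for cand in (["I", "have", "hat"], ["I", "has", "bat"]):
    print(align_to_anchor(anchor, cand))  # columns line up with the anchor
```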

For comparison, if the FastCorrect 2 alignment is replaced with naive padding, we get the situation in Figure 8(b): the B tokens cluster together, but C and D end up mixed in the same column. This is odd, because from the model's perspective C and D have no relationship; the simplistic method merely puts them at the same position. The model then cannot obtain the mutual-verification signal, D, E, and F get mixed together, and cat, hat, and bat can no longer corroborate each other to help correct errors.


■Figure 8

(4) Results

Next, the results, shown in Figure 9: the first row is the error rate of the raw speech recognition output, the second row is the error rate after correction with the autoregressive model, and the third row is FastCorrect 1. We also set up combined baselines. As mentioned earlier, there are two ways to post-process speech recognition, reranking and error correction; since multiple candidates are involved here and reranking selects among candidates, we stacked the two methods: first select from the candidates by reranking and then correct with FastCorrect 1; or, given 4 candidates, correct each candidate separately and choose the best one as the final result. FastCorrect 2 instead aligns the multiple candidates and takes them jointly as input.


■Figure 9

Finally, FastCorrect 2 outperforms FastCorrect 1 because it uses more information: it gains more than two additional points of WERR while largely maintaining the speed. As shown in Figure 9, the R+FC method (correcting each candidate and then reranking) is competitive, but its cost is much higher, because multiple candidates must each be corrected before reranking; so we did not adopt it and chose the FastCorrect 2 strategy instead.

When aligning candidates in the dataset, it helps to align words with similar pronunciation. In the earlier example, how should hat in "I have hat" be aligned with cat in "I have cat"? A key factor is the similarity of their phonetic transcriptions: hat and cat sound very close, and giving priority to aligning similar-sounding words produces a better alignment. Does performance suffer if pronunciation similarity is ignored? As shown in Figure 10, removing pronunciation similarity does degrade the result slightly, so words that sound alike should be aligned together first; the naive padding method is even less reasonable here (see the sketch after Figure 10).


■Figure 10
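A toy sketch of pronunciation-aware scoring is below, assuming a tiny hand-written phone lexicon (a real system would use a G2P model or a full pronunciation dictionary). A phone-level edit distance makes "cat" prefer aligning to the near-homophone "hat" over a less similar word.

```python
PHONES = {"cat": ["K", "AE", "T"], "hat": ["HH", "AE", "T"],
          "bat": ["B", "AE", "T"], "have": ["HH", "AE", "V"]}

def phone_distance(a, b):
    """Levenshtein distance over phone sequences (rolling 1-D DP array)."""
    pa, pb = PHONES.get(a, [a]), PHONES.get(b, [b])
    dp = list(range(len(pb) + 1))
    for i, x in enumerate(pa, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(pb, 1):
            prev, dp[j] = dp[j], min(prev + (x != y), dp[j] + 1, dp[j - 1] + 1)
    return dp[-1]

print(phone_distance("cat", "hat"))   # 1: near-homophones, preferred alignment
print(phone_distance("cat", "have"))  # 2: less similar pronunciation
```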

We use multiple candidates as input for error correction, so are more candidates always better? Experiments show that more candidates increase latency. As can be seen from Figure 9, as the number of candidates grows, one eventually faces a trade-off between accuracy and latency.

One might ask whether the gain simply comes from more training data, since compared with one-best correction, the multiple candidate sentences add extra training input. To check, we split the data: for example, four candidates corresponding to one correct sentence are split into four pairs, each pairing one candidate with the correct sentence, making the dataset four times larger. This actually increases rather than decreases the error rate, showing that the improvement does not come from more data but from the mutual-verification signals provided by reasonable alignment.

To reduce the ASR error rate and improve accuracy under an acceptable online latency budget, we carried out the FastCorrect series of work. As shown in Figure 11, FastCorrect 1 and FastCorrect 2 achieve good results on both academic datasets and Microsoft's internal product datasets, approaching the error rate reduction of the autoregressive correction model at a fraction of its latency. If you are interested, follow our GitHub. We are continuing to analyze and design around this problem, and are using the resulting insights to build FastCorrect 3, aiming at even better error detection and correction.


■Figure 11

05 Introduction to Microsoft's research results and projects in the field of speech

Microsoft has also carried out a series of studies across speech, as shown in Figure 12, including front-end text analysis for speech synthesis, low-resource speech synthesis modeling, faster inference for online deployment, more robust speech synthesis, and generalizing speech synthesis capabilities.


■Figure 12

In addition, we have extended speech synthesis to new scenarios, such as talking-face generation, which takes speech as input and outputs video of talking faces and gestures; we also synthesize singing voices and instrumental music. We have conducted a detailed survey of the TTS field and given a tutorial on it. Recently, we developed NaturalSpeech, a speech synthesis system that can generate human-level speech. If you are interested in speech synthesis, feel free to reach out.

Microsoft has also done work in AI music, including traditional music information retrieval and understanding tasks as well as music generation tasks (songwriting, accompaniment generation, arranging, timbre synthesis, and mixing). If you are interested in AI music, check out our open-source project, shown in Figure 13. Microsoft also provides speech synthesis, speech recognition, speech translation, and other services on Azure; if you are interested, see the website shown in Figure 14.


■Figure 13


■Figure 14

The Machine Learning Group of Microsoft Research Asia is currently recruiting full-time researchers and research interns in speech, NLP, machine learning, and generative models. Welcome to join us!


06 Q&A session

1. The relationship and difference between FastCorrect and BART

BART is a pretrained NLP model for sequence-to-sequence tasks; it can be applied to machine translation and any other text sequence-to-sequence learning task. Text error correction is also a sequence-to-sequence task, and in the traditional autoregressive formulation BART can be used directly, since it also decodes autoregressively. FastCorrect instead targets the slow decoding of the autoregressive approach: it is a non-autoregressive model that, unlike BART's word-by-word generation, produces the entire sentence in one pass, improving online inference speed. That is the core of our design, so from this perspective the two are quite different.

2. Is there any targeted design for error correction?

Yes. Beyond general speech recognition models, there are many customized scenarios whose data contains a large amount of specialized vocabulary. To achieve better recognition there, a knowledge base or adaptation can be introduced during error correction. Suppose a general ASR model is to be used in legal, medical, or similar domains whose specialized terms are rare: a topic can be supplied to the model, providing topic-related words for the currently recognized passage as references for recognition, and error correction can use the same mechanism. In addition, alignment is relatively easy in Chinese error correction, but in English and other languages one word may correspond to only some characters of another word; how to design methods for such languages is an issue to consider during adaptation.

About Shengwang Cloud Market

Shengwang Cloud Market is a one-stop real-time interaction solution launched by Shengwang. By integrating the capabilities of technology partners, it gives developers a one-stop development experience, covering the selection, price comparison, integration, account opening, and purchase of real-time interactive modules. It helps developers quickly add various RTE functions, bring applications to market faster, and save 95% of the time spent integrating RTE functions.

Microsoft's real-time speech recognition (multilingual) service is now available on the Shengwang Cloud Market. With this service, audio streams can be transcribed to text in real time and work seamlessly with the translation and text-to-speech products in Speech Services.

You can click here to experience it now.

