Abstract: In this article, we present CLAS, an all-neural, end-to-end contextual ASR model that incorporates contextual information by embedding all of the contextual phrases. In our experimental evaluation, the proposed CLAS model outperforms the standard shallow-fusion biasing method.
This article is shared from the HUAWEI CLOUD COMMUNITY article "The Road to End-to-End ASR in Proprietary Domains (2)", original author: xiaoye0829.
Here we introduce a work on end-to-end ASR for proprietary domains: "DEEP CONTEXT: END-TO-END CONTEXTUAL SPEECH RECOGNITION". This work is also from the same research team at Google.
In ASR, the content of a user's speech depends on the context the user is in, and this context can usually be represented by a set of n-gram phrases. In this work, we also study how to use such contextual information in an end-to-end model. The core approach of this article is the Contextual LAS (CLAS) model, which builds on the LAS model [1] and is jointly optimized with embeddings of the contextual n-grams. It is compared against the standard approach, in which independently trained n-gram and LAS models are combined by shallow fusion during beam search.
In this work, we consider dynamically incorporating contextual information into the recognition process. In traditional ASR systems, a mainstream way to incorporate contextual information is an independently trained on-the-fly rescoring framework that dynamically adjusts the weights of a small number of n-grams relevant to the context of a specific scenario. Extending this technique to Seq2Seq ASR models is important. To bias the recognition process toward a specific task, previous work has tried to incorporate an independent LM into decoding, typically via shallow fusion or cold fusion. In work [2], shallow fusion is used to build a contextual LAS: the output probabilities of LAS are modified by a special WFST constructed from the speaker's context, which improves accuracy.
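As a rough illustration, here is a minimal Python sketch of how shallow fusion combines scores during beam search (the weight value and the helper function are hypothetical, not from the paper):

```python
LAMBDA = 0.3  # interpolation weight for the contextual LM (a tunable hyperparameter)

def shallow_fusion_score(log_p_asr: float, log_p_context_lm: float) -> float:
    """Fuse the ASR score with an external contextual LM score.

    During beam search each hypothesis extension is rescored as
    log P(y|x) + lambda * log P_C(y), so candidates matching a
    context phrase receive a boost.
    """
    return log_p_asr + LAMBDA * log_p_context_lm

# Toy example: without fusion "tall" (-1.1) would beat "call" (-1.2),
# but the contextual LM strongly prefers "call".
candidates = {"call": (-1.2, -0.5), "tall": (-1.1, -6.0)}  # token -> (ASR, LM) log-probs
best = max(candidates, key=lambda tok: shallow_fusion_score(*candidates[tok]))
print(best)  # "call"
```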
Previous work used an externally and independently trained LM for on-the-fly rescoring, which runs counter to the benefit of jointly optimizing a Seq2Seq model. Therefore, in this article we propose Contextual LAS (CLAS), which is given a list of contextual phrases (i.e., bias phrases) to improve recognition. Our method first maps each phrase to a fixed-dimensional embedding, and then uses an attention mechanism to summarize the available contextual information at each step of the model's output prediction. Our method can be seen as a generalization of streaming keyword spotting [3], allowing a variable number of contextual phrases at inference time. Our proposed model requires no specific contextual information during training, needs no careful tuning of rescoring weights, and can still incorporate OOV words.
This article next describes the standard LAS model, the standard contextual LAS baseline, and our proposed modified version of LAS.
The LAS model is a Seq2Seq model consisting of an encoder and a decoder with an attention mechanism. When decoding each token, the attention mechanism dynamically computes a weight for each encoder hidden state and obtains the current attention (context) vector as their weighted linear combination. The input x of this model is the speech signal, and the output y is a sequence of graphemes (i.e., English characters, including a–z, 0–9, <space>, <comma>, <period>, <apostrophe>, and <unk>).
The output of LAS is given by the following formula:
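Sketched in the standard LAS form (the exact projection parameters vary by implementation):

$$
h^{x} = \operatorname{Encoder}(x), \qquad
d_t = \operatorname{DecoderRNN}\big(y_{t-1},\, d_{t-1},\, c_{t-1}^{x}\big), \qquad
c_t^{x} = \operatorname{Attention}\big(d_t,\, h^{x}\big)
$$

$$
P(y_t \mid x, y_{<t}) = \operatorname{Softmax}\big(W_s\,[c_t^{x};\, d_t] + b_s\big)
$$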
This formula depends on the encoder state vectors h^x, the decoder hidden state d_t, and the context vector c_t^x, which is computed by an attention mechanism that aggregates the decoder state and the encoder outputs.
In the standard contextual LAS baseline, we assume that a set of word-level bias phrases is known in advance and compiled into a WFST. This word-level WFST G is then composed with a speller FST S, which converts a sequence of graphemes or word-pieces into the corresponding words, giving a contextual language model C = min(det(S ∘ G)). The score P_C(y) from this contextual language model is then used during decoding to enhance the standard log-probability term:
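In this shallow-fusion form, decoding selects

$$
y^{*} = \arg\max_{y}\; \log P(y \mid x) + \lambda \log P_C(y)
$$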
Here, λ is an adjustable parameter that controls the influence of the contextual language model on the overall model score. The contextual score in this formula is applied only at the word level, as shown below:
Consequently, if the relevant word never appears in the beam, this technique cannot help. Moreover, we observed that although this method works well when the number of contextual phrases is small (e.g., yes, no, cancel), it performs poorly when the contextual phrases contain many proper nouns (e.g., song titles or contact names). Therefore, as shown in Figure c above, we explore applying weights to the sub-word units of each word. To avoid having to hand-tune the weight of prefixes (which match the beginning of a phrase but not the whole phrase), we also include a subtractive cost, shown as the negative weights in Figure c above.
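A toy Python sketch of per-sub-word biasing with a subtractive failure cost (the weights and units are made up for illustration):

```python
PHRASE = ["c", "a", "n", "c", "e", "l"]  # bias phrase split into grapheme units
BONUS_PER_UNIT = 0.5                     # bias weight granted as each unit matches

def bias_bonus(hypothesis: list) -> float:
    """Accumulate per-unit bias while the hypothesis follows the phrase.

    If the hypothesis diverges before the phrase completes, the
    subtractive cost takes back everything granted so far, so partial
    prefixes gain no lasting advantage.
    """
    matched = 0
    for hyp_unit, phrase_unit in zip(hypothesis, PHRASE):
        if hyp_unit != phrase_unit:
            break
        matched += 1
    if matched < len(PHRASE) and matched < len(hypothesis):
        return 0.0  # failure arc: subtract the accumulated bonus
    return matched * BONUS_PER_UNIT

print(bias_bonus(["c", "a", "n", "c", "e", "l"]))  # 3.0: full match keeps the bonus
print(bias_bonus(["c", "a", "t"]))                 # 0.0: dead-end prefix is taken back
print(bias_bonus(["c", "a"]))                      # 1.0: still on the phrase path
```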
Below we introduce the contextual LAS model proposed in this paper, which uses the additional contextual information provided by a set of bias phrases z to effectively model P(y|x, z). Each element of z is a phrase relevant to a particular context, such as a contact name or a song title. We write these bias phrases as z = {z_1, z_2, ..., z_N}. The bias phrases are used to bias the model toward outputting specific phrases; however, not all of them are relevant to the current utterance, so the model must determine which phrases are likely relevant and use them to modify its target output distribution.

We augment LAS with a bias encoder that encodes these phrases into h^z = {h_0^z, h_1^z, ..., h_N^z}; the superscript z distinguishes bias-related vectors from audio-related ones. h_i^z is the embedding of z_i, obtained by feeding the sub-word embedding sequence of z_i into the bias encoder, a multi-layer LSTM, and taking its final state as the feature for the whole phrase. Since it is possible that none of the bias phrases are relevant to the current utterance, we additionally include a learnable vector h_0^z = h_nb^z corresponding to the no-bias option, which allows the model to ignore all bias phrases. We then use an additional attention mechanism over h^z to produce a bias context vector c_t^z; the vector fed into the decoder becomes the concatenation c_t = [c_t^x; c_t^z], and the other parts are the same as in the traditional LAS model. The bias attention is computed as follows:
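A sketch of the bias attention, mirroring the additive attention used over the audio (the parameter names here are illustrative):

$$
u_{ti}^{z} = v_z^{\top}\tanh\big(W_{h}^{z} h_i^{z} + W_{d}^{z} d_t + b^{z}\big), \qquad
\alpha_{t}^{z} = \operatorname{softmax}\big(u_{t}^{z}\big), \qquad
c_t^{z} = \sum_{i=0}^{N} \alpha_{ti}^{z}\, h_i^{z}
$$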
It is worth noting that the above formula explicitly models the probability of attending to each particular bias phrase at the current step, given the audio and the previous outputs.
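A minimal PyTorch sketch of the bias encoder and bias attention (illustrative only; the layer sizes follow the experiment section below, and all class and variable names are hypothetical):

```python
import torch
import torch.nn as nn

class BiasEncoder(nn.Module):
    """Embed each bias phrase with an LSTM and keep its final state."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Learnable "no-bias" vector h_0^z, so the model can ignore all phrases.
        self.no_bias = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, phrases: torch.Tensor) -> torch.Tensor:
        # phrases: (N, max_len) padded sub-word id sequences for N bias phrases.
        # (A real implementation would mask the padding before taking the state.)
        _, (h_final, _) = self.lstm(self.embed(phrases))
        h_z = h_final[-1]                                          # (N, hidden_dim)
        return torch.cat([self.no_bias.unsqueeze(0), h_z], dim=0)  # (N+1, hidden_dim)

class BiasAttention(nn.Module):
    """Additive attention of the decoder state over the bias-phrase embeddings."""
    def __init__(self, hidden_dim: int = 512, dec_dim: int = 256, att_dim: int = 128):
        super().__init__()
        self.w_h = nn.Linear(hidden_dim, att_dim, bias=False)
        self.w_d = nn.Linear(dec_dim, att_dim, bias=True)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, h_z: torch.Tensor, d_t: torch.Tensor) -> torch.Tensor:
        # h_z: (N+1, hidden_dim) phrase embeddings, d_t: (dec_dim,) decoder state.
        scores = self.v(torch.tanh(self.w_h(h_z) + self.w_d(d_t))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)  # relevance of each phrase (and no-bias)
        return alpha @ h_z                    # c_t^z: (hidden_dim,)

enc, att = BiasEncoder(vocab_size=100), BiasAttention()
h_z = enc(torch.randint(1, 100, (5, 7)))      # 5 phrases of 7 sub-words each
c_t_z = att(h_z, torch.randn(256))
print(c_t_z.shape)                            # torch.Size([512])
```

At each decoder step, c_t^z is concatenated with the audio context vector c_t^x before being fed to the decoder, as in the formula above.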
Now let's look at the experiments, which were carried out on 25,000 hours of English data. This data set was augmented with a room simulator: noise and reverberation of varying intensity were added to artificially corrupt clean speech, yielding signal-to-noise ratios between 0 and 30 dB. The noise sources come from YouTube and recordings of everyday noisy environments. The encoder consists of 10 layers of unidirectional LSTMs with 256 units each. The bias encoder is a single-layer LSTM with 512 units. The decoder consists of 4 LSTM layers with 256 units each. The test sets are as follows:
First, to check whether the bias module we introduced hurts decoding when no bias phrases are provided, we compared CLAS with a plain LAS model. The CLAS model was trained with random bias phrases but was given no bias phrases at test time. Surprisingly, even without bias phrases, CLAS still achieved better performance than LAS.
We further compared different on-the-fly rescoring schemes, which differ in how they assign weights to sub-word units. As the table below shows, the best model applies the bias to each sub-word unit, which helps keep the relevant words in the beam. All of the following on-the-fly rescoring experiments bias at the sub-word level.
Next, we compared CLAS against the schemes above:
As this table shows, CLAS significantly outperforms the traditional method, without requiring any additional hyperparameter tuning.
Finally, we combined CLAS with the traditional method, and we can see that both CLAS biasing and on-the-fly rescoring help improve performance.
In this article, we presented CLAS, an all-neural, end-to-end contextual ASR model that incorporates contextual information by embedding all of the contextual phrases. In our experimental evaluation, the proposed CLAS model outperformed the standard shallow-fusion biasing method.
[1] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016.
[2] I. Williams, A. Kannan, P. Aleksic, D. Rybach, and T. N. Sainath, "Contextual speech recognition in end-to-end neural network systems using beam search," in Proc. Interspeech, 2018.
[3] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, "Streaming small-footprint keyword spotting using sequence-to-sequence models," in Proc. ASRU, 2017.