Abstract: In this article, we introduce a speech recognition model based on an RNN and CTC, in which WFST-based decoding can effectively integrate the dictionary and language model.
This article is shared from the HUAWEI Cloud Community post "The Road to End-to-End ASR in the Proprietary Field (3)", original author: xiaoye0829.
In this article, we introduce a work combining CTC with WFSTs (weighted finite-state transducers): "EESEN: END-TO-END SPEECH RECOGNITION USING DEEP RNN MODELS AND WFST-BASED DECODING".
In this work, the acoustic model is an RNN that predicts context-independent phonemes or characters, with CTC used to align the speech with the labels. A distinguishing contribution of the paper is a general WFST-based decoding method that integrates the dictionary and language model into CTC decoding. In this method, the CTC labels, the dictionary, and the language model are each encoded as a WFST and then composed into a single search graph. This WFST-based approach handles the blank labels in CTC naturally and supports beam search.
In this blog post, we will not revisit the RNN and CTC components, and will focus instead on the WFST-based decoding module. A WFST is a finite-state acceptor (FSA) in which each transition has an input symbol, an output symbol, and a weight.
The figure above is a schematic diagram of a language-model WFST. The weight on each arc is the probability of emitting the next word given the previous word. Node 0 is the start node, and node 4 is the end node. A path through the WFST transduces a sequence of input symbols into a sequence of output symbols. Our decoding method represents the CTC labels, the dictionary, and the language model as separate WFSTs, and then, using a highly optimized FST library such as OpenFST, merges them efficiently into a single search graph. Below we describe how each individual WFST is built.
- 1. Grammar. A grammar WFST encodes the word sequences permitted by the language. The figure above shows a simplified language model with two sequences: "how are you" and "how is it". The basic symbols of this WFST are words, and the weights on the arcs are language-model probabilities. With this WFST representation, CTC decoding can in principle use any language model that can be converted into a WFST. Following the convention used in Kaldi, this language-model WFST is denoted G.
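To make the construction concrete, here is a minimal sketch (our own illustration, not the paper's recipe) that writes the toy grammar "how are you" / "how is it" in OpenFST's text format. The arc probabilities are made up, and the resulting file would still need to be compiled into G (e.g. with OpenFST's fstcompile and word symbol tables mapping the labels to integer IDs).

```python
import math

def nlp(p):
    """Arc weight: negative log probability (tropical semiring)."""
    return -math.log(p) if p < 1.0 else 0.0

# Each line of OpenFST text format: src dst input_label output_label weight.
# The probabilities below are illustrative, not estimated from data.
g_arcs = [
    (0, 1, "how", "how", nlp(1.0)),
    (1, 2, "are", "are", nlp(0.5)),   # P(are | how) = 0.5
    (1, 3, "is",  "is",  nlp(0.5)),   # P(is  | how) = 0.5
    (2, 4, "you", "you", nlp(1.0)),
    (3, 4, "it",  "it",  nlp(1.0)),
]

with open("G.txt", "w") as f:
    for src, dst, ilab, olab, w in g_arcs:
        f.write(f"{src} {dst} {ilab} {olab} {w:.4f}\n")
    f.write("4 0.0\n")  # node 4 is the final state
```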
- 2. Dictionary (lexicon). A dictionary WFST encodes the mapping from sequences of dictionary units to words. Depending on the labels the RNN models, there are two cases. If the labels are phonemes, the dictionary is the same standard dictionary used in conventional hybrid models. If the labels are characters, the dictionary simply contains the spelling of each word. The difference between the two cases is that the spelling dictionary can easily be expanded to include any OOV (out-of-vocabulary) word. In contrast, expanding a phoneme dictionary is less straightforward: it relies on grapheme-to-phoneme rules or models and is prone to errors. This dictionary WFST is denoted L. The following figure shows two examples of constructing L:
The first example shows the construction for a phoneme-dictionary entry, "is IH Z"; the second shows the construction for a spelling-dictionary entry, where "is" maps to its characters. For spelling dictionaries, there is one additional complication. When characters are used as the CTC labels, we usually insert an extra space between consecutive words to model the word boundaries in the original transcripts. During decoding, the space is allowed to optionally appear at the beginning and end of a word, which is easy to handle with a WFST.
In addition to English, we also show an entry in a Chinese dictionary here.
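The following is a hedged sketch, again in OpenFST text-format style, of one spelling-dictionary entry with an optional leading space. The function name and the exact handling of the word boundary are our own simplifications, not the Eesen scripts.

```python
def spelling_entry_arcs(word, start_state=0):
    """Arcs (src, dst, input_unit, output_word, weight) for one entry of L.

    Input symbols are characters (plus an optional <space> unit); the word
    itself is emitted on the first character arc, epsilon afterwards.
    A trailing optional <space> could be added symmetrically.
    """
    arcs = []
    s, nxt = start_state, start_state + 1
    # The word may optionally be preceded by a <space> unit.
    arcs.append((s, nxt, "<eps>", "<eps>", 0.0))
    arcs.append((s, nxt, "<space>", "<eps>", 0.0))
    s = nxt
    for i, ch in enumerate(word):
        nxt = s + 1
        arcs.append((s, nxt, ch, word if i == 0 else "<eps>", 0.0))
        s = nxt
    return arcs, s  # s is the final state of this entry

arcs, final_state = spelling_entry_arcs("is")
for arc in arcs:
    print(*arc)
print(final_state)  # printed alone: marks the final state
```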
- 3. Token. The token WFST maps a sequence of frame-level CTC labels to a single dictionary unit (phoneme or character). For each dictionary unit, its token-level WFST subsumes all possible frame-level label sequences. Therefore, this WFST allows occurrences of the blank label ∅ and repetitions of any non-blank label. For example, after 5 frames, the RNN model might output label sequences such as "AAAAA", "∅∅AA∅", or "∅AAA∅"; the token WFST maps all three sequences to the single dictionary unit "A". The following figure shows a WFST for the phoneme "IH"; it allows occurrences of the blank <blank> label and repetitions of the non-blank label "IH". This token WFST is denoted T.
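As a concrete illustration of this collapsing rule, here is a small sketch of a token WFST for a single unit: optional leading blanks, at least one frame of the unit (which emits it once), repeated frames of the unit, and optional trailing blanks. The exact topology in the Eesen recipe may differ.

```python
def token_arcs(unit):
    """Arcs (src, dst, frame_label, output_unit, weight) of a token WFST."""
    arcs = [
        (0, 0, "<blank>", "<eps>", 0.0),  # absorb leading blank frames
        (0, 1, unit, unit, 0.0),          # first frame of the unit: emit it once
        (1, 1, unit, "<eps>", 0.0),       # repeated frames of the same unit
        (1, 2, "<blank>", "<eps>", 0.0),  # a blank frame ends the unit
        (2, 2, "<blank>", "<eps>", 0.0),  # absorb trailing blank frames
    ]
    final_states = [1, 2]                 # the token may end in either state
    return arcs, final_states

# "AAAAA", "∅∅AA∅" and "∅AAA∅" are all accepted and all output just "A".
arcs, finals = token_arcs("A")
```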
- 4. Search graph. After compiling the three individual WFSTs, we compose them into a single search graph. First, the dictionary WFST L is composed with the grammar WFST G. During this step, determinization and minimization are applied; these two operations compress the search space and speed up decoding. The resulting WFST, LG, is then composed with the token WFST, which finally yields the search graph. The overall sequence of FST operations is: S = T ∘ min(det(L ∘ G)). This search graph S encodes the mapping from a sequence of frame-level CTC labels to a sequence of words. Concretely, the words of the language model are first expanded into phonemes to form the LG graph; then the RNN outputs a label (phoneme or blank) for each frame, and decoding searches the graph according to this label sequence.
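The composition S = T ∘ min(det(L ∘ G)) can be expressed with the standard OpenFST command-line tools, as in the hedged sketch below. File names are assumptions, and the actual Eesen build additionally handles details such as disambiguation symbols, which are omitted here.

```python
import subprocess

def run(cmd):
    """Run one OpenFST command and fail loudly if it errors."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# LG = min(det(L o G)); sort arcs on the labels used for composition first.
run(["fstarcsort", "--sort_type=olabel", "L.fst", "L_sorted.fst"])
run(["fstarcsort", "--sort_type=ilabel", "G.fst", "G_sorted.fst"])
run(["fstcompose", "L_sorted.fst", "G_sorted.fst", "LG_raw.fst"])
run(["fstdeterminize", "LG_raw.fst", "LG_det.fst"])
run(["fstminimize", "LG_det.fst", "LG.fst"])

# S = T o LG: the TLG search graph used for decoding.
run(["fstarcsort", "--sort_type=olabel", "T.fst", "T_sorted.fst"])
run(["fstcompose", "T_sorted.fst", "LG.fst", "TLG.fst"])
```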
When decoding a hybrid DNN model, we need to scale the DNN's posteriors by state priors, which are usually estimated from forced alignments of the training data. When decoding a CTC-trained model, we first tried a similar procedure: we ran the final RNN model over the entire training set, took the label with the largest posterior at each frame as the frame-level alignment, and estimated the label priors from this alignment. However, this method did not perform well in our experiments, partly because the post-softmax outputs of a CTC-trained model have a highly peaked distribution: the model tends to emit a single dominant non-blank label, so most frames are assigned the blank label and non-blank labels appear only in very narrow regions, which means the prior estimate is dominated by the blank frames. Instead, we estimate a more robust label prior from the label sequences in the training set, that is, from augmented label sequences. If the original label sequence is "IH Z", an augmented sequence might be "∅ IH ∅ Z ∅", and so on. By counting how the labels are distributed over frames, we obtain the label priors.
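A minimal sketch of this prior estimation, assuming the augmentation simply inserts a blank before, between, and after the reference labels (the paper may augment differently):

```python
from collections import Counter

def augment(labels):
    """'IH Z' -> ['<blank>', 'IH', '<blank>', 'Z', '<blank>']."""
    out = ["<blank>"]
    for lab in labels:
        out += [lab, "<blank>"]
    return out

def estimate_label_prior(training_label_seqs):
    """Count units over all augmented sequences and normalize to a prior."""
    counts = Counter()
    for seq in training_label_seqs:
        counts.update(augment(seq))
    total = sum(counts.values())
    return {lab: c / total for lab, c in counts.items()}

prior = estimate_label_prior([["IH", "Z"], ["IH", "T"]])
print(prior)  # {'<blank>': 0.6, 'IH': 0.2, 'Z': 0.1, 'T': 0.1}
```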
Having introduced the WFST-based method, let us turn to the experiments. After normalizing the posteriors with the priors, the acoustic-model score needs to be scaled down; the scaling factor lies between 0.5 and 0.9, and the best value is determined experimentally (a sketch of this scaling follows below). The experiments in this paper were conducted on WSJ. The best model is a phoneme-based RNN model. On the eval92 test set, it reaches a WER of 7.87% when both the dictionary and the language model are used, and the WER rises sharply to 26.92% when only the dictionary is used. The figure below compares the Eesen model with traditional hybrid models. From this table, we can see that the Eesen model is somewhat worse than the hybrid HMM/DNN models. However, on larger datasets such as Switchboard, CTC-trained models can achieve better results than traditional models.
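For the prior normalization and score scaling mentioned above, a minimal sketch is shown below; the variable names and the uniform prior used in the toy example are our own assumptions.

```python
import numpy as np

def scaled_scores(log_posteriors, log_prior, acoustic_scale=0.7):
    """Divide posteriors by the label prior (in log space) and scale.

    log_posteriors: (num_frames, num_labels) log-softmax outputs of the RNN.
    acoustic_scale: taken from the 0.5-0.9 range discussed in the article.
    """
    pseudo_log_likelihood = log_posteriors - log_prior
    return acoustic_scale * pseudo_log_likelihood

num_frames, num_labels = 100, 72
log_post = np.log(np.random.dirichlet(np.ones(num_labels), size=num_frames))
log_prior = np.log(np.full(num_labels, 1.0 / num_labels))  # toy uniform prior
print(scaled_scores(log_post, log_prior).shape)  # (100, 72)
```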
A significant advantage of Eesen is its much faster decoding compared with the hybrid HMM/DNN model. This speedup comes from a drastic reduction in the number of states. As the decoding speeds in the table below show, Eesen achieves a speedup of more than 3.2× in decoding. Moreover, the TLG graph used by Eesen is also significantly smaller than the HCLG graph used in HMM/DNN systems, which saves the disk space needed to store the model.
In summary, the work described in this article presents a speech recognition model based on an RNN and CTC, in which WFST-based decoding can effectively integrate the dictionary and language model.