
Foreword

"Voice processing" is a very important scene in the field of real-time interaction. In the " RTC Dev Meetup丨Technical Practice and Application of Voice Processing in the Field of Real-time Interaction " initiated by Shengwang, from Microsoft Research Asia, Shengwang, Digital The technical experts of the United States Technology shared related topics around this topic.

This article is based on the talk given by Li Tian, head of NLP technology at Shumei Technology, at the event. Follow the official account "Shengwang Developer" and reply with the keyword "DM0428" to download the PPT materials for the event.


01 The necessity of semi-supervised training in the field of ASR

Although the word accuracy of general-purpose ASR is already very high, it still suffers from scene mismatch in specific scenarios (games, private chat, group chat, live-streaming anchors). Applying general ASR in these domains is relatively difficult, mainly for the following reasons.

1. Scarcity of annotation resources

It is difficult to obtain annotations for the target scenario, and it is usually impossible to quickly collect the large number of labeled samples a business scenario requires. Even when raw samples are easy to collect, obtaining labeled samples is still hard because annotation is expensive. When starting a project or settling on a product direction, you will find that domain-specific ASR tasks must first solve the data problem. In the past, with the phoneme/text split approach, the amount of data required was relatively small. Now that end-to-end technology is commonly used, roughly 1,000 hours of data is the starting point, and whether you label it yourself or turn to a well-known data vendor, that cost is unacceptable at the beginning of a product.

2. Unstable labeling quality

In scenarios such as voice wake-up and Siri-style interaction, users know that the backend will transcribe their speech, but in most business scenarios, people are unaware that ASR transcription is happening.

For example, when communicating with Siri, if Siri does not understand what the speaker means, the person will try again and articulate more clearly. But at the real business level, customers usually do not know that the backend is running ASR on their speech. On a live-streaming platform, for instance, there may be audit requirements, and it is impossible to notify the anchor that their voice is being transcribed and that they should speak more clearly. Unclear articulation and fragmented syntax therefore make the annotation quality very unstable.

So how do we solve these problems at the annotation stage? Shumei's business covers a large number of social scenarios across the Internet and faces diverse data and scenario-specific terms, so obtaining such annotations is very difficult and their quality is hard to guarantee. However, unlabeled data from the same source as each scenario is easy to obtain, so we believe a semi-supervised scheme is the ideal choice.

If you have worked with NLP or CV, you probably already have a clear notion of what semi-supervised learning means. In the ASR field, especially for end-to-end systems, it is generally divided into two approaches: Self-training and Pre-training.

The Self-training line mainly revolves around the well-known pseudo labeling, and its core scheme is based on consistency-regularization logic. In theory, a pseudo label is a noisy version of the true label: when pseudo labels and true labels are trained together, the model is effectively being trained against label noise, which lets it learn gradually. Pre-training is simpler to explain, especially if you come from NLP: ideally you pre-train on data from the target domain, with tasks that reconstruct representations or content and therefore need no extra labels. Unlabeled/untranscribed audio can be used to build such a pre-training task, after which manually transcribed data from the target scenario is used for the ASR training itself.
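As a minimal illustration of the pseudo-labeling idea (the pseudo label acts as a noisy version of the true label and is trained together with it), here is a PyTorch-style sketch; the model, teacher and loss objects are placeholders for illustration, not the system described later in this article:

```python
import torch

def pseudo_label_step(model, teacher, labeled_batch, unlabeled_batch,
                      criterion, optimizer, pseudo_weight=1.0):
    """One training step mixing ground-truth and pseudo labels.

    The teacher transcribes the unlabeled audio, and the student is trained
    on true and pseudo labels together, i.e. learning under label noise.
    """
    feats_l, targets_l = labeled_batch        # supervised audio features + transcripts
    feats_u = unlabeled_batch                 # unlabeled audio features only

    with torch.no_grad():
        pseudo_targets = teacher.transcribe(feats_u)   # placeholder: greedy or beam decoding

    optimizer.zero_grad()
    loss_l = criterion(model(feats_l), targets_l)       # loss on true labels
    loss_u = criterion(model(feats_u), pseudo_targets)  # loss on pseudo labels
    loss = loss_l + pseudo_weight * loss_u
    loss.backward()
    optimizer.step()
    return loss.item()
```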

02 The development of semi-supervised training in the field of ASR

1. Self-training

Generally speaking, Self-training started in CV. Since pseudo labels were first proposed at ICML in 2013, various new systems have appeared: "Learning with Pseudo-Ensembles" in 2014 (the first such system) combined pseudo labels with model ensembles; "Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning" in 2016 argued that pseudo labels themselves should be generated from different perturbations of the same model; and "Mean teachers are better role models: Weight-averaged consistency targets" in 2017 focused on generating higher-quality labels, using weight averaging to obtain a better teacher model and thereby guarantee pseudo-label quality.

Contrastive learning, now a hot topic in CV, was already hinted at in those 2014 and 2016 papers; their derivations are similar in many respects, so you could say the development of technology moves in historical cycles.

2. Pre-training

Pre-training is mainly concentrated in NLP. Of course, CV also has systems such as the ladder network that include the idea of pre-training, but the field where pre-training has developed best is still NLP. The core reason is that the underlying features of NLP are characters, an inherently discrete system, which is hard to compare with the dense inputs of CV.

Along this line, NLP has gone through years of development: from N-gram-based features in 1994, to NN-based systems, to the rise of language models such as RNN and LSTM driven by new network designs; then ELMo appeared, followed by Transformer-based models such as BERT in 2018. Today, both BERT and GPT have been fully validated on all kinds of downstream tasks in NLP.

3. Semi-supervised development in the field of ASR

Generally speaking, it can be divided into two eras according to the development of ASR itself:

① The era of phoneme/text splitting: in many businesses, people still use Kaldi as the underlying ASR solution. The semi-supervised logic of this scheme is that the acoustic model is trained to output generic phonemes, and a downstream language model or rescoring model then produces the text required by the specific business, achieving a partially semi-supervised effect. In terms of process, it looks more like transfer learning. But after Alex Graves's CTC-based end-to-end work around 2013, end-to-end systems began to emerge. Two years later, the EESEN team brought CTC back to the phoneme level, giving the phoneme/text split system a brief revival.

② The end-to-end era: the rise of LAS (Listen, Attend and Spell) and of hybrid CTC/LAS + LM systems allowed end-to-end models to gradually surpass Kaldi and other traditional phoneme/text split architectures in effect, data efficiency, model quality and inference speed, and the industry moved into the end-to-end era. The timeline runs through CTC, Deep Speech, Listen, Attend and Spell, and hybrid CTC/attention.

After 2017, with Watanabe's hybrid CTC/attention proposal and the release of the ESPnet framework, the end-to-end system matured enough to be applied across the industry. It provides a joint decoding framework as flexible as a lattice: its hypotheses-route design offers a flexible fusion scheme for subsequent shallow fusion. If you have used ESPnet, you will have seen that the whole hypotheses-route design is very flexible, and various techniques can be plugged in to jointly score or rescore routes.

Since phonemes and similar intermediate units are no longer used, and training CTC and Seq2Seq models is itself costly, the difficulty of obtaining labeled data made the end-to-end system's dependence on data the core bottleneck for deployment. If you did ASR at a large company early on, especially around 2015 to 2016, the practical rule of thumb was to consider end-to-end only once you had more than 1,000 hours of data.

As a result, how to reduce the data requirements of end-to-end models became a core concern of academia and industry in the later stage (around 2019 to 2020) of optimizing end-to-end ASR and solving its deployment problems. Since then, ASR-oriented Pre-training and Self-training have gradually entered the stage. Related research existed before, but its influence was limited; it was not until 2019 and 2020, when Facebook AI published two papers in these two directions that were industrially applicable and had huge development prospects, that the community began to pay attention.

wav2vec: Unsupervised Pre-training for Speech Recognition is Facebook's Pre-training-based solution. Its principle is very close to word2vec: it uses a negative-sampling technique to train a task that predicts the representation of a future time step. Since its training output can be used as features for any downstream audio task, this system is an important audio technology foundation used by many large companies in the industry.
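As a rough illustration of the objective just described (predicting a future representation against negative samples), here is a simplified sketch; the tensor shapes and the sampling scheme are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_future_loss(context, latents, k=1, num_negatives=10):
    """Simplified wav2vec-style loss.

    context: (B, T, D) outputs of the context network
    latents: (B, T, D) outputs of the local encoder
    Predict the latent at t+k from the context at t, contrasted against
    negatives sampled uniformly from other time steps.
    """
    B, T, D = latents.shape
    c = context[:, : T - k, :]                 # (B, T-k, D) predictors
    z_pos = latents[:, k:, :]                  # (B, T-k, D) true future latents

    pos_logits = (c * z_pos).sum(-1)           # (B, T-k)
    score = F.logsigmoid(pos_logits).mean()

    for _ in range(num_negatives):
        idx = torch.randint(0, T, (B, T - k))             # random distractor frames
        z_neg = latents[torch.arange(B)[:, None], idx]    # (B, T-k, D)
        neg_logits = (c * z_neg).sum(-1)
        score = score + F.logsigmoid(-neg_logits).mean()

    return -score   # minimize the negative of the contrastive objective
```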

Self-training for End-to-End Speech Recognition is work by Jacob Kahn's team at Facebook AI, which aims to comprehensively analyze the practical effect of the pseudo-label approach on ASR. It gave strong baselines for pseudo labeling on several core English ASR datasets and, for the first time, systematically laid out the core issues that must be solved for pseudo labeling to be deployed in the ASR field.

4. Pre-training VS Self-training in ASR

In 2020, as our customer base grew and scene coverage widened, we also faced the need to build dedicated ASR for specific scenarios in order to obtain better results than competing products. Simply using the phoneme/text architecture and swapping the language model for each domain could no longer achieve the desired effect. At the same time, the annotation cost of building a separate end-to-end ASR for every scenario was unacceptable. So we began to weigh Pre-training against Self-training.

Initially, we considered adopting a system similar to what other large companies use, such as wav2vec for Pre-training. But when we actually tried wav2vec several times, the cost was very high, and the downstream post-pretraining for the target domain plus the pre-training itself was very time-consuming, which lengthened the model iteration cycle. More importantly, during the pre-training + post-pretraining stage there is no ASR model output at all, which is unacceptable when a new business requires rapid iteration.

Weighing these trade-offs, we ultimately chose the Self-training route for our business, because with Self-training a usable model can be evaluated after every round of training: you deploy first and optimize afterwards, which is a much friendlier setup for the business.

5. The recent development track of Self-training in the ASR field

After anchoring on Self-training, we have been researching and tracking this field since 2020. We found that Facebook, Google, and Mitsubishi have done the best work here; others, such as the veteran ASR company Nuance and some universities, also publish improvements or studies on specific problems. In 2020 their main research directions were as follows:

(1) 2020

Facebook:

Self-Training for End-to-End Speech Recognition

End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures

Iterative Pseudo-Labeling for Speech Recognition

Their research progression was: strong baselines and analysis of naive pseudo labeling on the CTC framework; the effect of naive pseudo labeling on the CTC/attention hybrid architecture; and research on multi-round iterative pseudo labeling.

Google:

Since Google has a strong technical background in iterative pseudo-labeling from the CV field, they proposed their multi-round iterative pseudo label + model ensemble solution, Noisy Student Training (NST), and took the Librispeech 100+860 SOTA that year. Of course, iterative training has many pitfalls, especially the explosion in the number of data experiments brought about by multiple rounds of iteration; we spell this out in our own scheme.

Mitsubishi:

In the iterative mode, the teacher is trained over multiple rounds of pseudo labeling, and every time a round finishes, the unlabeled data must be re-labeled; after several such rounds, training becomes very cumbersome. So from 2021 onwards we gradually see on-the-fly approaches in various places, such as MPL (an evolution of mean teacher) proposed by Mitsubishi in 2021. But on-the-fly means labels must be generated in real time, and in ASR the quality of label generation is directly tied to decoding cost: simple CTC greedy search is fast but produces poor transcripts, while the more common shallow-fusion scheme, which fuses several models to score, decode and transcribe, is basically impossible to run in real time during training. So in general, the final effect of the on-the-fly mode is not as good as the iterative mode.

Others:

Salesforce staged a "renaissance" by reusing pseudo-label training on the EESEN framework, with labels generated by CTC greedy search. Nuance, a veteran ASR vendor, interpreted the theoretical essence of semi-supervised learning by expounding FixMatch theory, which is essentially consistency training.

(2) 2021

Mitsubishi:

Because of the flaws of the on-the-fly mode, Mitsubishi released an advanced version of MPL in 2021 and returned to the iterative mode. They decoupled the teacher model from the subsequent on-the-fly training process and switched to the Conformer architecture, which is more robust for audio. In the end it surpassed Google's NST and currently sits in second place.

Facebook:

In 2021, Facebook AI used a cache mechanism: a second process decodes in parallel while the model trains; once the cache of decoded data is full, training proceeds jointly on the cached (pseudo-labeled) data and the labeled data, and after N steps the cache is emptied and decoding restarts. So although Facebook AI calls it on-the-fly, it is essentially still a notion of rounds. Using a 36-layer Transformer, it holds the Librispeech 100+860 SOTA to date, and can even train on Librispeech 960 directly to the same level as ESPnet.

03 Problems solved by our semi-supervised solution

1. Iterative or on-the-fly

Given our quality requirements and the current conclusions of academia and industry, we ultimately anchored our technical direction on the iterative mode.

2. Problems of the iterative mode

However, the iterative mode is very cumbersome to train, because the pseudo-label data must be regenerated after each round of training, and according to Google's and Facebook's experience, multiple rounds of iteration are needed to achieve good results.

Each round of iteration then raises three questions. First, how do we generate high-quality pseudo-label data? This is actually the simplest problem: we have all kinds of decoding algorithms, so we simply use whichever works best. Second, how do we filter out high-quality pseudo-label data? Since we do not know which labels are correct, there will always be some errors no matter how high the quality is, so we need to study how to reduce their proportion and which schemes can reduce it. Third, the biggest problem in the whole iterative mode: how do we balance labeled and unlabeled data?
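These three questions correspond to the hooks in the outer loop of iterative pseudo-labeling. The schematic below is our reading of that loop; all callables are placeholders for the generation, filtering and ratio-selection steps discussed in the following sections:

```python
def iterative_pseudo_labeling(labeled_data, unlabeled_data, *,
                              train_supervised, decode, filter_high_quality,
                              select_ratio, train, num_rounds=5):
    """Schematic outer loop of iterative pseudo-labeling (not a full implementation)."""
    model = train_supervised(labeled_data)             # cold-start model (see section 7)
    for _ in range(num_rounds):
        pseudo = decode(model, unlabeled_data)         # 1) generate pseudo labels (beam + fusion)
        pseudo = filter_high_quality(pseudo)           # 2) drop low-confidence hypotheses (section 4)
        mixed = select_ratio(labeled_data, pseudo)     # 3) pick the labeled:unlabeled mix (section 5)
        model = train(model, mixed)                    # retrain on the mixed data
    return model
```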

Google's NST system requires five rounds of iteration, and the labeled:unlabeled ratio differs in each round: roughly 2:7 in the second round and 1:3 in the third. On Librispeech 100+860, a labeled:unlabeled ratio around 1:3 was verified to be reasonable. But on different tasks the ratios also differ: Facebook's experiments on Librispeech + LibriVox show that the ratio needs to be above 1:10. This leads to a huge experimental cost in real business: with, say, five rounds, each round requires several data experiments with different ratios, then model selection and decoding evaluation, and then the next round's ratio experiments, repeated five times. Because ASR training is expensive, the training rhythm of each round is very painful.

In addition, with limited annotation, how do we cold-start the model? Generally speaking, the initial labeled training data is very small; in the iterative setting it usually accounts for only about 1/10 of the obtainable data, so how to perform the cold start becomes a core issue.

04 Improved NIPL solution

Based on these problems, we proposed our own solution, published as "Improved Noisy Iterative Pseudo-Labeling for Semi-supervised Speech Recognition". Let me briefly explain what our solution looks like.

1. Model framework


After 2020, we no longer use the Kaldi system but switched to an ESPnet-like self-developed framework. For the model architecture, we use Transformers for both the shared encoder feeding CTC and the LAS decoder. The left side of Figure 1 shows the diagram from Watanabe's hybrid CTC/attention paper, and the right side introduces our model. In terms of parameters, the shared encoder's front-end sublayer is a 2-layer (3×3, 512) CNN with stride 2, which may differ slightly from ESPnet but is basically the same; the Transformer encoder has 12 layers with 8 heads, 512 dimensions and an FFN size of 2048, which is almost the same as most Transformer-based acoustic models. The attention decoder is a 6-layer Transformer with the same parameter configuration as the encoder.

For the language model, we added a 6-layer Transformer LM; the rest of its configuration matches BERT: 12 heads, 768 dimensions, and an FFN of 3072.


■Figure 1

2. Other general settings

Our experiments use Librispeech 100+860: the 100-hour set as labeled data and the 860-hour set as unlabeled data. The LM data is Librispeech's own training transcripts plus roughly 8 million sentences of the official text corpus. The acoustic features are 100-dimensional Fbank + 3-dimensional pitch. To reduce the number of output labels, we use BPE to compress the vocabulary to 7,002 pieces, which shrinks the final output layer and speeds up CTC training.
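For reference, compressing transcripts into a fixed number of subword pieces can be done with a standard BPE toolkit; the sketch below uses SentencePiece with a 7,002-piece vocabulary as in our setup, with placeholder file paths:

```python
import sentencepiece as spm

# Train a BPE model with 7,002 pieces on the (placeholder) transcript file.
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",   # placeholder path
    model_prefix="bpe7002",
    vocab_size=7002,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe7002.model")
print(sp.encode("nice to meet you", out_type=str))  # subword pieces used as CTC/attention targets
```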

The training configuration mainly concerns the learning rate. The schedule is similar to the standard Transformer one, with one difference: we move the decay to its final stable value 5,000 steps early and then hold it there for a while. This is directly tied to the technique for keeping the model stable described later, so that the model trains stably over that period and the averaged model can keep up.
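As an illustrative sketch (all constants here are assumed for illustration, not our exact configuration), a Noam-style Transformer schedule can be modified so that the decay jumps to its final stable value 5,000 steps before the nominal end and is then held there:

```python
def lr_at_step(step, d_model=512, warmup=25000, total_steps=200000, early=5000):
    """Noam-style schedule with an early clamp to the final stable value (illustrative)."""
    step = max(step, 1)
    noam = d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    final = d_model ** -0.5 * total_steps ** -0.5   # value the decay would reach at the very end
    if step >= total_steps - early:
        return final    # hold the final stable value for the remaining steps
    return noam
```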

3. How to generate pseudo labels on unlabeled data

At present, the most common ways in the industry to generate relatively high-quality transcripts are shallow fusion and deep fusion. We use shallow fusion, searching jointly over the CTC acoustic model, the LAS decoder and the LM with a beam size of 50. The overall process is similar to ESPnet, but with two small changes:

First, we use CTC greedy search to decide the end of a sentence; ESPnet does not do this and has its own end-detection algorithm.

Second, we do not prune paths aggressively, but keep as many paths as possible.
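To make the joint scoring concrete, here is a minimal sketch of one shallow-fusion beam-search step. The interpolation weights and the scorer interfaces (functions returning per-token log-probabilities for a prefix) are assumptions for illustration, not our exact decoder:

```python
import heapq

def shallow_fusion_step(hyps, vocab, att_scorer, ctc_scorer, lm_scorer,
                        beam_size=50, ctc_weight=0.3, lm_weight=0.3):
    """Extend each partial hypothesis by one token and keep the best `beam_size`.

    Each scorer maps a prefix to {token: log-probability}; the attention,
    CTC prefix and LM scores are interpolated (shallow fusion) before pruning.
    """
    candidates = []
    for prefix, score in hyps:                 # hyps: list of (token list, running score)
        att_lp = att_scorer(prefix)
        ctc_lp = ctc_scorer(prefix)
        lm_lp = lm_scorer(prefix)
        for tok in vocab:
            new_score = (score
                         + (1 - ctc_weight) * att_lp[tok]
                         + ctc_weight * ctc_lp[tok]
                         + lm_weight * lm_lp[tok])
            candidates.append((prefix + [tok], new_score))
    # Prune gently: keep beam_size hypotheses (we prefer keeping more paths alive).
    return heapq.nlargest(beam_size, candidates, key=lambda x: x[1])
```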

4. How to screen high-quality pseudo-label data for the next round of semi-supervised training

When pseudo labels are generated, the quality of much of the data is frankly poor, especially early in training, such as the first or second round of NST or iterative pseudo-labeling, when the model's WER on the Librispeech dev and test sets is still around 9 to over 10 points.

For this, Google and Facebook roughly sort the hypotheses and take a percentile, similar to the hypothesis score in ESPnet: they sum the log probabilities during decoding, sort them from low to high, and keep the top 90%. But the confidence scores can have a cliff: for example, the score distribution of the first 85% of the data may be very similar, and then between 85% and 95% the score suddenly drops by several points. To deal with this, we use a distribution test for sample selection: we first assume the scores follow a Gaussian distribution and then keep only the data inside the two-sided 90% or 95% confidence interval for training. Note that a two-sided 90%/95% confidence interval does not mean 90% or 95% of the data is retained; it retains the data whose scores fall inside that interval under the Gaussian assumption, so the amount kept is likely less than a straight 90% cut.
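A minimal sketch of this distribution-based filtering, assuming the per-utterance decoding scores are roughly Gaussian and using SciPy's normal quantiles; variable names are placeholders:

```python
import numpy as np
from scipy.stats import norm

def filter_by_gaussian_interval(utts, scores, confidence=0.95):
    """Keep utterances whose decoding scores lie inside the two-sided
    Gaussian confidence interval fitted to all scores."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    lo, hi = norm.interval(confidence, loc=mu, scale=sigma)  # e.g. +/- 1.96 sigma at 95%
    keep = (scores >= lo) & (scores <= hi)
    return [u for u, k in zip(utts, keep) if k]
```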

5. How to balance the ratio of labeled/unlabeled data so that the model will not be overfitted to labeled data with unlabeled data

How to balance the labeled/unlabeled ratio is the biggest problem in multi-round iterative semi-supervised training. Previous studies did not explain how to choose the ratio; they only reported the approximate ratio for their own task. Facebook works with Librispeech 960 + LibriVox, where the ratio lies between 1:10 and 1:54; Google works with Librispeech 100+860, where the ratio is about 1:3.

None of these conclusions tells you what ratio to use in actual production. For example, ASR for a live-streaming scenario may start with 100 hours of labels, while a large amount of homologous unlabeled data is easy to obtain. But in what proportion should this unlabeled data be mixed with the labeled data so that the model is not dominated by the unlabeled portion, and how should the model be trained to stay stable and perform better? Answering that by brute force requires endless data experiments. Of course, with enough machines you can run them all, but most teams do not have as many machines as Google or Facebook and cannot simply exhaust the search space.

So how can each business line get practical guidance? We ran detailed experiments and qualitative and quantitative analysis on Librispeech 100/860 and derived a guideline that, in our view, is quite precise for choosing the data balance. We first make an assumption, which is directly tied to why we do semi-supervised training with pseudo labels: when training on pseudo labels, labeled and unlabeled data are mixed together, and for some of the pseudo-labeled data we do not know whether the label is correct, so along certain characteristics the model should be trained as conservatively as possible and not overfit to wrong or tail data. At the same time, some sample diversity must be preserved, because a completely conservative model falls into the optimum implied by the data it already trusts, i.e. a local optimum, and multiple rounds of iterative training amplify this, making the model overfit as it trains.

To determine where to be conservative and where to preserve diversity, we profile the data along three dimensions: the first is audio length, the second is text/pieces length, and the third is the distribution of the labels themselves. The question then becomes: along which dimensions should training be as conservative as possible, and along which should sample diversity be preserved? Based on this, we ran a large-scale experiment. After each round of generating new pseudo labels, we construct candidate training sets at different mixing ratios. Before each round of training, we compare every candidate with the previous round's training data along the three dimensions above and rank all candidates: the 1:2 candidate gets a rank on the three dimensions against the previous round, the 1:4 candidate gets a rank, the 1:5 and 1:6 candidates get ranks, and so on.

To evaluate the ranking, we use the KS test for frame length and pieces length, since they are one-dimensional statistics. The label distribution itself is multi-dimensional, so we first normalize the term frequencies and then use Euclidean distance to measure the difference between the current round's data and the previous round's, and rank each candidate accordingly.
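A sketch of how a candidate mix could be compared with the previous round's data along the three dimensions, using SciPy's two-sample KS statistic for the two length distributions and a Euclidean distance between normalized token frequencies for the label distribution; the field names are illustrative:

```python
import numpy as np
from collections import Counter
from scipy.stats import ks_2samp

def candidate_stats(candidate, prev, vocab):
    """Compare a candidate training mix with the previous round's data.

    `candidate` and `prev` are dicts with illustrative keys:
    "frame_lens" (audio lengths), "piece_lens" (pieces lengths), "tokens" (label sequences).
    """
    # 1) & 2) one-dimensional distributions: two-sample KS statistic.
    ks_frames = ks_2samp(candidate["frame_lens"], prev["frame_lens"]).statistic
    ks_pieces = ks_2samp(candidate["piece_lens"], prev["piece_lens"]).statistic

    # 3) label distribution: normalized token frequencies + Euclidean distance.
    def tf_vector(token_lists):
        counts = Counter(t for toks in token_lists for t in toks)
        vec = np.array([counts[t] for t in vocab], dtype=float)
        return vec / max(vec.sum(), 1.0)

    label_dist = np.linalg.norm(tf_vector(candidate["tokens"]) - tf_vector(prev["tokens"]))
    return ks_frames, ks_pieces, label_dist
```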

After extensive experiments, we found a very clear rule: provided the difference in the pieces (label) distribution itself is as small as possible, larger differences in the frame-length and pieces-length distributions generally lead to a better model in the new round. This logic can be written as a general paradigm, as shown in Figure 2.


■Figure 2

6. How to ensure that the model does not overfit to wrong pseudo labels during training

This is a key point we found in the whole system, and we address it along two dimensions. The first is the data level: we add SpecAugment and SpecAugment++ so that the data generalizes better. The second is the model level: similar to MPL, we generate labels both online and offline, selecting the online result in the early rounds and the offline result later; generally, the offline result becomes stably better than the online one after about the fifth round. In addition, we adjust dropout, gradually increasing it from 0.1 to 0.3, because pseudo-label training carries a high risk of overfitting; increasing it beyond 0.4 brings basically no further gain.
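For reference, a simple SpecAugment-style masking function of the kind used for the data-level augmentation above; the mask widths are illustrative defaults, not our production settings:

```python
import numpy as np

def spec_augment(feats, num_freq_masks=2, freq_width=27, num_time_masks=2, time_width=40):
    """Apply simple SpecAugment-style masking to a (time, freq) feature matrix."""
    feats = feats.copy()
    T, F = feats.shape
    for _ in range(num_freq_masks):
        f = np.random.randint(0, freq_width + 1)
        f0 = np.random.randint(0, max(F - f, 1))
        feats[:, f0:f0 + f] = 0.0           # mask a band of frequency channels
    for _ in range(num_time_masks):
        t = np.random.randint(0, time_width + 1)
        t0 = np.random.randint(0, max(T - t, 1))
        feats[t0:t0 + t, :] = 0.0           # mask a span of time frames
    return feats
```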

7. With limited labeled samples, how to perform cold-start supervised training to achieve the best results

We use two-stage training: a first stage with dropout 0.1 for 30 epochs followed by a second stage with dropout 0.3 for 100 epochs works best; the specific experimental results are shown in Figure 3. This illustrates a point: during cold start, a relatively small number of epochs and a relatively small dropout should be used to fit the target quickly, and then dropout should be increased to a more generalization-friendly configuration and trained for more epochs to reach the best model. This cold-start recipe is basically consistent with the cold-start results of Google's NST system.


■Figure 3
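A schematic of the two-stage cold start described above, under the assumption that the second stage continues from the first-stage checkpoint with a larger dropout; `init_model` and `train` are placeholders for the real training code:

```python
def cold_start(init_model, train, labeled_data):
    """Two-stage cold start (schematic): fit quickly with a small dropout,
    then continue with a larger dropout and more epochs for generalization."""
    model = train(init_model(), labeled_data, dropout=0.1, epochs=30)   # stage 1: quick fit
    model = train(model, labeled_data, dropout=0.3, epochs=100)         # stage 2: regularized, longer
    return model
```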

Finally, the overall effect of the improved NIPL. As of the submission deadline for Interspeech 2022, two results on Librispeech 100+860 are stronger than ours. The first is Mitsubishi's MPL with a Conformer, at 3.8%/8.2%; but controlling for the same Transformer architecture, Mitsubishi only reaches 4.8%/10.1%, while ours is 3.93%/9.59%. The other is Facebook's slimIPL: its 36-layer Transformer reaches 3.8%/7.5% without any language model, and with a language model and rescoring it reaches 2.7%/5.2%. That result is beyond our comprehension: we have trained on the full 960 hours, and ESPnet's supervised Librispeech 960 training gives roughly 3.04% WER, which would mean Facebook needs no 860-hour data at all and achieves 2.7%/5.2% with only the 100-hour labels.


05 Q&A session

1. How does the WER compare?

Our test-clean was 3.93 and test-other 9.59, but we have continued with a seventh and eighth round of NIPL training, and test-other can go even lower: as of today, test-clean remains at 3.93 while test-other has dropped to about 9.3. Mitsubishi's Conformer is 3.8%/8.2%, lower than our 3.93, but their Transformer is 4.8%/10.1%. Facebook's slimIPL is 3.8%/7.5%; we are somewhat skeptical of that result, as the effect is a bit scary. From this point of view, we should be third in the world, slightly better than the NST result Google published in 2020.

2. Can you introduce how CTC is used?

When CTC first appeared, its training was hard to optimize and it demanded more data, so its usage was somewhat roundabout, as in the EESEN approach mentioned above: use CTC to train phonemes and then still attach a WFST as everyone did. Since the number of phonemes is much smaller than the number of words, CTC training becomes much easier, making it comparable to MMI, LF-MMI and other schemes in some domains. Training a fully end-to-end CTC ASR directly would require a great deal of data.

If you had asked this question in 2020, I would have recommended trying the EESEN approach for a new business. But it is 2022 now, and the industrial use of CTC has changed a lot. Watanabe's paper shows that the hybrid CTC/LAS system can achieve very good results without the data demands of pure CTC, because the LAS side brings many optimization techniques that help training. So CTC/LAS is the fairly standard recipe today. If you do not have your own ASR training platform, I suggest trying ESPnet or WeNet; if streaming recognition is a core business requirement, WeNet can be the first choice.

Upcoming Events

"RTC Dev Meetup - Hangzhou Station", we will focus on big front-end technology, and invite technical experts from Shengwang, Ant Group and Hikvision to share with us the business structure and cross-end practice in the real-time interaction field in the big front-end era.

It's better to take action, scan the QR code or click here to register!

