This article is based on a talk given at "RTC Dev Meetup丨Technical Practice and Application of Speech Processing in the Field of Real-time Interaction" by Ma Zhiqiang, head of speech recognition research at Huanyu Technology.
01 Status Quo of Speech Recognition Technology
1. Voice has become a key entry point for human-computer interaction in the era of the Internet of Everything, and the speech recognition market continues to grow steadily
In recent years, speech recognition technology has gradually entered our daily life and work. In particular, voice interaction features represented by AI voice assistants have been deployed in a wide range of consumer products, such as smartphones, smart cars, smart appliances, and smart homes. Users only need to wake up the voice assistant and issue a command to complete common tasks such as making calls, checking the weather, and navigating. According to research reports from industry consulting firms, China's intelligent voice vertical industry is expected to reach a scale of 100 billion by 2025. Judging from this growth potential, voice interaction and speech recognition have gradually become key entry points for human-computer interaction in the era of the Internet of Everything.
2. The construction of the "Belt and Road" requires language interoperability, and the demand for multilingual recognition is increasingly strong
The market space for speech recognition is steadily expanding. Against the background of the national "Belt and Road" initiative, the "Five Links" it proposes also require language interoperability as a foundation. At present, the "Belt and Road" covers more than 100 countries and regions, involving dozens of official languages. In this context, the demand for multilingual technology is increasingly strong, and multilingual speech recognition is one of the most important and fundamental technical capabilities. Since 2020, we have been planning and building multilingual capabilities, including the general capabilities of multilingual recognition, multilingual translation, and multilingual synthesis introduced today.
3. AI subtitles for real-time audio and video services effectively improve user experience and communication efficiency
In the past two years, the pandemic has significantly changed the way people work and live. Formats such as online education, online live streaming, and online meetings have gradually gained acceptance, and AI subtitle technology has been successfully applied to these scenarios. Using speech recognition and speech translation, AI subtitles can display recognition results and translation results to users in real time. This helps users better understand the content of a live stream or video and makes it easier to record meeting minutes. Overall, AI subtitle technology provides a systematic solution for speech transcription and translation, greatly improving user experience and communication efficiency.
4. Technical challenges faced by voice assistant business scenarios
Voice assistants and AI subtitles are two typical application scenarios of speech recognition. Deep learning technology is constantly developing and improving, and in many scenarios speech recognition has already reached a usable level, but the two typical scenarios just mentioned still face major technical challenges. For voice assistants, the first problem is recognition in high-noise scenes. Especially in far-field environments, speech is very susceptible to interference from reverberation and noise, and there are many difficult cases such as multiple people speaking and overlapping speech, which cause a sharp drop in far-field speech recognition accuracy.
The second problem is massive entity recognition. In the voice assistant scenario, for example, the assistant may be used to issue navigation commands, which requires the ASR model to recognize at least tens of millions of entities nationwide. Inevitably, many of these entities are homophones written with different characters, so the model can easily confuse them with one another. At the same time, the distribution of entities at this scale is highly unbalanced: tail entities are very sparse, which makes them very difficult for the ASR model to learn.
The third problem is multilingual speech recognition. Those who have worked on ASR or related tasks may know this well: for widely used languages such as Chinese and English, training data is relatively abundant, but for low-resource languages such as Tamil or Urdu, training data is extremely scarce, often only on the order of tens or hundreds of hours. In this case, the ASR models trained for these languages generally perform very poorly.
Similarly, the AI subtitle business scenario also faces technical challenges. The first challenge is complex transcription scenarios. In audio and video subtitle transcription, the audio usually contains various kinds of noise and background sounds, which degrade transcription quality. In addition, online live streams and entertainment broadcasts are often accompanied by sound effects and music, which pose a huge challenge to speech transcription.
The second challenge for AI subtitles is the high real-time requirement. In general, users expect subtitles to stay as synchronized as possible with the audio and video they are watching. If the delay is kept within 1 to 2 seconds, user experience and perception are generally very good, but this raises the bar for the ASR model, especially the AI transcription model.
The third challenge is the on-screen experience of AI subtitles, which mainly involves two aspects. First, the transcribed results delivered to the user as subtitles usually need punctuation, such as periods or commas, to segment sentences; with punctuation the subtitles are easy to understand, but without it the subtitle text is very hard to read. Second is the erasure (rewrite) rate of subtitles. Taking Figure 1 as an example, the three lines are the results of three successive on-screen updates. The word "today" actually changed twice across the three updates: first from "today" to "surprised", and then from "surprised" back to "today". The subtitle jumped twice during this process, and such frequent jumping is not friendly to the user's viewing and comprehension. This is also a problem that AI subtitle technology needs to solve.
■Figure 1
02 Research progress of speech recognition technology
The first part introduced the current status of speech recognition technology; this part focuses on research progress. First, three key technologies for speech recognition are introduced. They can be regarded as the basic, shared technologies underlying the two typical scenarios just mentioned: voice assistants and AI subtitles.
1. Key Technologies
(1) Engineering construction of speech recognition data resources
Those who have worked on ASR or deep learning tasks may know that training data is critical to the model. In general, we can obtain massive amounts of unsupervised data from the live network (production environment), such as text, audio, or video data. There are currently two main processing flows for this unsupervised data. The first is to directly label audio or video with an existing ASR model, producing weakly supervised labeled data. The second is machine-assisted labeling: pre-labeling is performed first, and then linguistic experts manually correct and check the pre-labeled results, yielding accurately annotated, supervised parallel data. Relying on the data resource annotation platform we have built, we are currently able to support large-scale construction of ASR training data.
(2) Unsupervised/Weakly Supervised Training Data Augmentation Framework
At the data level, we propose a semi-supervised speech recognition framework based on speech synthesis and self-training. As shown in Figure 2, for unsupervised speech data, a large amount of pseudo-labeled data can be obtained through the ASR model; for unsupervised text data, synthetic data can be obtained through the TTS model. The pseudo-labeled data, synthetic data, and supervised real data are then mixed together to jointly train the ASR model.
■Figure 2
The right side of Figure 2 shows a feedback loop in which we iteratively update and retrain the ASR model. Through multiple rounds of iteration, the ASR model can absorb a large amount of unsupervised speech and unsupervised text, and its performance gradually improves with each round. In our experiments, starting from 100 hours of supervised data plus a large amount of unlabeled speech and text, the final ASR model can reach the performance of a model trained on thousands of hours of supervised data.
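The following is a minimal Python sketch of the self-training loop described above. All function and model names (`transcribe`, `synthesize`, `train_asr`) are hypothetical placeholders used only to illustrate the control flow, not the speaker's actual implementation.

```python
# A minimal sketch of iterative self-training with pseudo-labels and TTS data.
# The confidence threshold and the number of rounds are illustrative assumptions.

def self_training_loop(supervised_data, unlabeled_audio, unlabeled_text,
                       asr_model, tts_model, rounds=3, conf_threshold=0.9):
    for _ in range(rounds):
        # 1. Pseudo-label unsupervised audio with the current ASR model,
        #    keeping only high-confidence hypotheses.
        pseudo_data = []
        for audio in unlabeled_audio:
            text, conf = asr_model.transcribe(audio)   # hypothetical: returns (text, confidence)
            if conf >= conf_threshold:
                pseudo_data.append((audio, text))

        # 2. Synthesize audio for unsupervised text with the TTS model.
        synthetic_data = [(tts_model.synthesize(t), t) for t in unlabeled_text]

        # 3. Mix pseudo-labeled, synthetic, and real supervised data,
        #    then retrain the ASR model on the combined set.
        mixed = supervised_data + pseudo_data + synthetic_data
        asr_model = train_asr(mixed)                   # hypothetical training routine

    return asr_model
```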
(3) Multilingual end-to-end unified modeling framework
As many readers know, the popular end-to-end model consists of two parts: an encoder and a decoder. For the encoder, multilingual audio is fed directly into a unified acoustic encoder, which learns a unified acoustic representation shared across languages. For the decoder, language-dependent text generation and decoding networks are used; the decoding networks of each language or language family are independent of one another. The advantage is that the characteristics and differences of each language's text can be preserved as much as possible. For example, Chinese, English, and Russian texts differ greatly, but their acoustic features can be shared at the acoustic level.
At present, the end-to-end framework is our main approach. Compared with traditional pipelines, end-to-end ASR abandons components such as dictionaries that require linguistic knowledge resources, which reduces our reliance on linguistic experts and large-scale data annotation.
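Below is a simplified PyTorch sketch of the "shared encoder + language-specific decoders" idea described above. Layer sizes, the use of simple linear output heads, and the vocabulary sizes are illustrative assumptions, not the actual production model.

```python
import torch
import torch.nn as nn

class MultilingualASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, vocab_sizes=None):
        super().__init__()
        # Shared acoustic encoder: all languages pass through the same network
        # and are mapped to a unified acoustic representation.
        self.input_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Language-specific output heads: one per language (or language family),
        # each with its own vocabulary.
        self.decoders = nn.ModuleDict({
            lang: nn.Linear(d_model, vocab) for lang, vocab in (vocab_sizes or {}).items()
        })

    def forward(self, features, lang):
        # features: (batch, time, feat_dim) acoustic features, e.g. filterbanks
        hidden = self.encoder(self.input_proj(features))
        return self.decoders[lang](hidden)   # per-frame logits for the chosen language

# Usage sketch: the same encoder serves Chinese, English, and Russian inputs.
model = MultilingualASR(vocab_sizes={"zh": 5000, "en": 1000, "ru": 2000})
logits = model(torch.randn(2, 100, 80), lang="zh")
```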
2. Advances in Voice Assistant Technology
Next, the two typical scenarios just mentioned are described in detail.
(1) ASR acoustic model structure
Far-field recognition is common in voice assistant scenarios: for large TV screens and smart speakers, wake-up and commands often happen in the far field. For this scenario, we designed a dedicated acoustic model structure, shown in Figure 3. The acoustic model uses an attention mechanism composed of a reinforcement layer and a filter layer to suppress far-field noise and interference from other speakers. The filter layer mainly uses a convolutional network to reduce the temporal resolution and remove disturbances in the acoustic features, while the reinforcement layer retains the important acoustic information by performing self-attention over the output features.
■Figure 3
Experiments also show that this acoustic model structure is robust in complex scenes, especially in noisy environments, where it has relatively strong modeling capabilities.
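As a rough sketch of the "filter layer + reinforcement layer" idea, the PyTorch block below uses a strided convolution to lower the temporal resolution and a self-attention layer to re-weight the remaining frames. Dimensions, strides, and layer counts are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FilterReinforceBlock(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, nhead=4):
        super().__init__()
        # Filter layer: strided 1-D convolution over time (downsamples by 2).
        self.filter = nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2, padding=1)
        # Reinforcement layer: self-attention over the filtered frames.
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic features
        h = self.filter(x.transpose(1, 2)).transpose(1, 2)  # (batch, time/2, d_model)
        attn_out, _ = self.attn(h, h, h)                    # self-attention reinforcement
        return self.norm(h + attn_out)

block = FilterReinforceBlock()
out = block(torch.randn(2, 200, 80))   # -> shape (2, 100, 256)
```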
(2) Massive entity system solutions
For massive entities, we also provide a complete end-to-end systematic solution. Taking voice assistant navigation as an example, users may need to navigate to all kinds of places in a city, and nationwide this amounts to at least tens of millions of POI entities. Our approach is to first model at a fine granularity along the city dimension, and then build an independent language model decoding network for each city.
At run time, the voice assistant system dynamically loads the per-city patch packages just mentioned according to the user's location and current intent. Loading these patch packages improves recognition for the user's own city; and because each city is modeled independently, it also reduces crosstalk between identical or similar entities in different cities, further improving recognition of massive entities.
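The sketch below illustrates the control flow of this per-city patch idea. The patch format, loading interface, and intent detection are all hypothetical placeholders, not the actual system design.

```python
class PatchedDecoder:
    """Toy sketch: combine a nationwide base LM with a city-specific LM patch."""

    def __init__(self, base_lm, patch_store):
        self.base_lm = base_lm          # nationwide base language model
        self.patch_store = patch_store  # maps city name -> city-specific LM patch
        self.active_patch = None

    def select_patch(self, user_location, user_intent):
        # Only load a city patch when the intent actually needs local entities
        # (e.g. navigation); otherwise decode with the base LM alone.
        if user_intent == "navigation" and user_location in self.patch_store:
            self.active_patch = self.patch_store[user_location]
        else:
            self.active_patch = None

    def score(self, hypothesis):
        # Combine the base LM score with the city patch score when a patch is active.
        s = self.base_lm.score(hypothesis)
        if self.active_patch is not None:
            s += self.active_patch.score(hypothesis)
        return s
```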
For this problem, we have developed entity optimization schemes not only for Chinese but also for other languages, mainly using the optimization method shown on the left of Figure 4. As can be seen, for most languages the entity recognition accuracy basically reaches 85% or higher, a level that is essentially usable for end users.
■Figure 4
(3) End-to-end unified modeling solution for multilingual speech recognition
For the multilingual recognition problem mentioned above, we further propose a unified modeling method based on language-family grouping, built on top of the unified multilingual model. This scheme takes into account the linguistic commonalities between different languages and clusters the languages we develop according to these linguistic properties. At present, all languages are divided into four language families. Each family covers a group of linguistically similar languages, and each family has its own independent decoding network within the ASR model.
For example, the Latin language family has its own Latin decoder, and the Arabic language family likewise has its own independent decoder, so that the linguistic information shared between languages can be exploited to the greatest extent. From the experimental results, the Latin family contains many languages, so it has correspondingly more training data; for English, French, German, Spanish, and other languages, this language-family-grouped unified modeling brings an average relative improvement of more than 10% over modeling each language alone. The Arabic family has fewer languages and less training data, but the improvement over the baseline is larger, about 20%. It can be seen that multilingual end-to-end unified modeling based on language-family grouping can greatly improve low-resource languages.
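A small sketch of the routing this implies: each language is mapped to a family, audio is encoded once by the shared encoder, and the output is decoded by that family's decoder. The family assignments and interfaces below are illustrative, not the speaker's exact grouping.

```python
# Illustrative language-to-family mapping; the real grouping is not specified in the talk.
LANGUAGE_TO_FAMILY = {
    "en": "latin", "fr": "latin", "de": "latin", "es": "latin",
    "ar": "arabic", "ur": "arabic",
}

def recognize(audio, encoder, family_decoders, lang):
    """Encode once with the shared encoder, decode with the family-specific decoder."""
    family = LANGUAGE_TO_FAMILY[lang]
    acoustic_repr = encoder(audio)                      # unified acoustic representation
    return family_decoders[family].decode(acoustic_repr, lang=lang)
```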
3. AI subtitle technology progress
Next, the overall research progress of AI subtitle technology is introduced. AI subtitles place relatively high demands on data, scenarios, and real-time performance, so the work starts at the data level, as follows.
(1) Weakly supervised data generation technology
For the large amount of unsupervised subtitle video data, both the audio and the embedded subtitle frames can be extracted from the video. Speech recognition and OCR are then applied to obtain two recognition results of different dimensions for the same audio: one is the speech recognition result, and the other is the OCR result, i.e., the text recognized from the video's own subtitles. Next, the two results are aligned and fused using pronunciation-based correction and glyph-based correction, finally producing a text label that is close to a manual annotation.
In this way, the unsupervised subtitle video data available online can be used to obtain a large amount of AI subtitle training data in a short time. Manual annotation would be very expensive in both time and labor, so weakly supervised data generation is a critical link for AI subtitles. With this technique, a large amount of weakly supervised data usable for AI subtitle training can be generated.
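The toy sketch below illustrates the spirit of fusing an ASR hypothesis with an OCR subtitle hypothesis. Real systems align at the pronunciation and glyph level; here we simply align the two character sequences and fall back to the OCR text wherever they disagree, which is only meant to convey the idea, not the actual fusion algorithm.

```python
import difflib

def fuse_hypotheses(asr_text: str, ocr_text: str) -> str:
    matcher = difflib.SequenceMatcher(a=asr_text, b=ocr_text)
    fused = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            fused.append(asr_text[a0:a1])        # both sources agree
        else:
            fused.append(ocr_text[b0:b1])        # disagreement: trust the on-screen subtitle
    return "".join(fused)

print(fuse_hypotheses("the whether is nice today", "the weather is nice today"))
```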
(2) Low-latency end-to-end transcription recognition technology
AI subtitle scenarios are more complex than voice assistant scenarios: online entertainment live streams, meetings, and real-time transcription of films and TV dramas contain a lot of noise, music, and background voices. Therefore, we added a sound event detection function to the VAD module of the speech processing pipeline. It first detects sound events in the audio fed to the ASR model, identifying common events such as noise, music, and applause. These events are then shown to the user on screen as labels alongside the subtitles, which greatly reduces false triggering of the general transcription model by noise, music, applause, and the like.
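A simplified sketch of this sound-event gating: segments detected as non-speech events are shown as labels instead of being transcribed. The classifier and ASR calls are hypothetical placeholders.

```python
NON_SPEECH_EVENTS = {"music", "applause", "noise"}

def caption_segment(segment, event_classifier, asr_model):
    event = event_classifier.predict(segment)   # hypothetical: returns "speech", "music", ...
    if event in NON_SPEECH_EVENTS:
        return f"[{event}]"                     # show an event label, skip ASR entirely
    return asr_model.transcribe(segment)        # normal speech: transcribe as usual
```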
AI subtitle scenarios are very complex, so we also apply data augmentation strategies at the data level: the training data is augmented with noise, reverberation, and background music, and even sampling rate conversion and speaking rate conversion. This maximizes the richness of the AI subtitle training data so that the model can adapt to real user scenarios. In addition, to meet the high real-time requirements of AI subtitles, we proposed an acoustic model structure with dynamic latency trained with multi-task learning. The trained model can dynamically adapt to various latency requirements, and the multi-task training also improves the ASR model's accuracy. For example, in a real application, if the model is set to 200 milliseconds, its hard latency (overall latency) is about 200 milliseconds. In this way, the real-time requirements of various scenarios can be met.
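The following is a minimal NumPy sketch of two of the augmentations mentioned above: mixing in noise at a target SNR, and speed perturbation by resampling the waveform. Parameter choices are illustrative; a production pipeline would typically also add reverberation and background music.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)            # tile/crop noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def speed_perturb(speech: np.ndarray, factor: float) -> np.ndarray:
    # factor > 1 speeds up (shorter signal), factor < 1 slows down.
    old_idx = np.arange(len(speech))
    new_idx = np.arange(0, len(speech), factor)
    return np.interp(new_idx, old_idx, speech)

clean = np.random.randn(16000).astype(np.float32)     # 1 s of fake audio at 16 kHz
noisy = add_noise(clean, np.random.randn(4000), snr_db=10)
fast = speed_perturb(clean, factor=1.1)
```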
(3) Post-processing optimization technology for transcription recognition
The user's subjective experience matters a great deal: after transcription, making the on-screen subtitles smooth and friendly is very important. This is where post-processing optimization for transcription comes in.
First, the transcribed text mainly goes through a streaming punctuation model, which punctuates the speech recognition results in real time. When a punctuation mark is detected, the recognition result before that mark is directly extracted as a sentence or paragraph. If no punctuation is detected, the position with the highest predicted punctuation probability is used to cut off a complete semantic segment. By combining punctuation prediction and semantic-segment extraction, the subtitle text is split into sentences and segments, which makes it easier for users to follow and improves the intelligibility of the subtitle content. The recognition result can then be sent to the back-end translation model, which translates on the basis of the recognized text.
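The toy sketch below illustrates this streaming segmentation logic: cut at a detected punctuation mark, or, when the buffer grows too long without punctuation, cut at the position with the highest predicted punctuation probability. The interface of the punctuation model (per-token punctuation probabilities) and the thresholds are assumptions for illustration.

```python
PUNCT_THRESHOLD = 0.5     # probability above which a position is treated as punctuation
MAX_BUFFER_TOKENS = 30    # force a cut if the buffer gets this long without punctuation

def segment_stream(tokens, punct_probs):
    """tokens: recognized tokens; punct_probs: punctuation probability after each token."""
    segments, start = [], 0
    for i, p in enumerate(punct_probs):
        if p >= PUNCT_THRESHOLD:
            segments.append(tokens[start:i + 1])          # cut at detected punctuation
            start = i + 1
        elif i - start + 1 >= MAX_BUFFER_TOKENS:
            window = punct_probs[start:i + 1]
            cut = start + window.index(max(window))       # cut at most likely position
            segments.append(tokens[start:cut + 1])
            start = cut + 1
    return segments, tokens[start:]                       # finished segments + pending tail
```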
Next, to address the subtitle erasure rate, we propose a constrained decoding algorithm tailored to the actual application scenario. As shown in Figure 5, the first recognition result is "today". Once this result has been shown to the user as a subtitle, the word "today" is fixed; recognition then continues with "weather", and "today" no longer changes in subsequent recognition passes, i.e., the red part stays unchanged on the screen. This constrained decoding reduces the number of times historical recognition results change: subsequent results continue decoding on top of the already committed prefix, which lowers the erasure rate of the subtitle text and improves the transcription experience. End users find the AI subtitles friendlier to read and understand, which improves the subjective experience.
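Below is a minimal sketch of the prefix-commit behaviour behind this idea: once a prefix has stayed stable across a few consecutive updates, it is frozen on screen, and later hypotheses may only extend it. The stability threshold and data structures are assumptions, not the actual decoder implementation.

```python
class StableCaption:
    def __init__(self, stability: int = 2):
        self.committed = []        # tokens frozen on screen (the "red" part)
        self.pending = []          # tail that may still change
        self.stable_count = 0
        self.stability = stability

    def update(self, hypothesis: list[str]) -> str:
        tail = hypothesis[len(self.committed):]      # decoder must keep the committed prefix
        if self.pending and tail[:len(self.pending)] == self.pending:
            self.stable_count += 1
        else:
            self.stable_count = 0
        self.pending = tail
        if self.stable_count >= self.stability and self.pending:
            self.committed.append(self.pending[0])   # freeze the oldest stable token
            self.pending = self.pending[1:]
        return " ".join(self.committed + self.pending)

cap = StableCaption()
for hyp in (["today"], ["today", "weather"], ["today", "weather", "is"],
            ["today", "weather", "is", "nice"]):
    print(cap.update(hyp))
```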
■Figure 5
03 Implementation and application of speech recognition technology
Next, I will share how speech recognition technology has been put into practice, mainly covering our products and specific cases of common speech recognition applications.
At present, we have speech recognition capabilities for 70 languages in the multilingual direction, and dozens to hundreds of capabilities in other areas such as speech synthesis and machine translation. In addition, we participated in the OpenASR multilingual speech recognition challenge last year and achieved first place in all 15 languages of the constrained track and all 7 languages of the unconstrained track, which also demonstrates our current multilingual capabilities. These capabilities have been made available on our AI open platform and can be used by developers through the interfaces and services it provides.
In voice assistant applications, we have also cooperated in depth with domestic mobile phone manufacturers. At present, we provide voice assistant capabilities in 12 languages for products at home and abroad. In the specific recognition scenarios of the voice assistant, recognition accuracy can basically reach 90% or higher, supporting nearly 30 voice assistant skills such as search, music, encyclopedia, weather, and navigation. In terms of products, we currently cover mobile phones, large screens, speakers, watches, and some smart home products.
In recent years, our voice assistant capabilities have been widely deployed. In China, we also support dialects beyond Mandarin. In smart cars, the voice assistant is a feature that reflects a vehicle's intelligence and differentiation. We can provide a complete "cloud + terminal" system solution, including on-device voice assistant recognition. We have also cooperated with several domestic manufacturers, and this "cloud + terminal" solution has already shipped in some domestic car models.
In terms of AI subtitles, we currently support at least four languages: Chinese, English, Japanese, and Korean. The real-time transcription capabilities for these languages have been applied to some mobile phone products, where the audio and video being played can be transcribed in real time through the AI subtitle function, with translation included. Overall, AI subtitles have made great progress in the past two years and have been widely used in scenarios such as film and TV audio and video, online live streaming, online education, and online meetings.
Question Time
1. What is the acoustic model modeling unit in the AI subtitle scene and how to deal with the contradiction between the accuracy and low latency of streaming recognition?
First, regarding the acoustic modeling units we currently use in the AI subtitle scenario: we mainly use an end-to-end modeling scheme, where the ASR model is end-to-end. In this case, the modeling unit is generally Chinese characters for Chinese and English words for English. As for the contradiction between accuracy and low latency in streaming recognition, accuracy and latency are always a trade-off: to obtain higher accuracy, the latency cannot be too low, and if the latency is very low, the accuracy will suffer.
To address this, we propose a dynamic-latency training method for the acoustic model: dynamic latency is achieved through a masking mechanism, so that, for example, a latency of 200 milliseconds or 600 milliseconds can be configured. Multi-task training is also used so that a single model supports acoustic modeling under different latencies at the same time, improving ASR accuracy and latency jointly from both aspects. For example, some scenarios have very strict latency requirements and use a 200-millisecond latency, while other scenarios can use 600 milliseconds or more in exchange for higher accuracy.
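The small sketch below shows one common way a chunk-based attention mask can realize configurable latency, in the spirit of the masking mechanism mentioned above: each frame may only attend to frames up to the end of its own chunk, so the chunk length bounds the look-ahead and hence the latency. The frame rate and chunk sizes are illustrative assumptions.

```python
import numpy as np

def chunk_attention_mask(num_frames: int, chunk_frames: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if frame i may attend to frame j."""
    frame_chunk = np.arange(num_frames) // chunk_frames      # chunk index of each frame
    chunk_end = (frame_chunk + 1) * chunk_frames - 1         # last visible frame per row
    cols = np.arange(num_frames)
    return cols[None, :] <= chunk_end[:, None]

# With 10 ms frames, a 20-frame chunk corresponds to roughly 200 ms of look-ahead,
# and a 60-frame chunk to roughly 600 ms.
mask_200ms = chunk_attention_mask(num_frames=100, chunk_frames=20)
mask_600ms = chunk_attention_mask(num_frames=100, chunk_frames=60)
```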
2. Why does a smart speaker sometimes respond out of the blue? Is it listening to surrounding sounds in real time?
Smart speakers use an end-to-end solution that involves not only speech recognition but also front-end wake-up, i.e., voice wake-up technology. When a smart speaker suddenly responds, it is usually because the front-end wake-up module has been falsely triggered: a human voice or noise is mistakenly recognized as a wake word (such as "Siri"), and the voice assistant then responds. So the main cause overall is false wake-up.
3. What is the difference between the codec introduced here and the codec in the usual sense, such as opus?
Opus is an audio codec that encodes the audio signal itself. The encoder and decoder of the model we discussed are parts of a deep learning network: the end-to-end model consists of an encoder and a decoder. The encoder performs feature extraction and acoustic modeling on the input audio; the decoder takes the vector representation produced by the encoder and decodes it into the corresponding text or other recognition results. The encoder-decoder introduced today is therefore an end-to-end model whose input is ordinary audio and whose output is the recognition result, i.e., text. That is the essential difference between the two.
About Shengwang Cloud Market
By integrating the capabilities of technology partners, Shengwang Cloud Market provides developers with a one-stop development experience: one-stop selection, price comparison, integration, account opening, and purchase of real-time interaction modules. It helps developers quickly add various RTE features, bring applications to market faster, and save 95% of the time needed to integrate RTE functions.
The Xunfei Voice Real-time Transcription (Chinese/English) plug-in is now available on the Shengwang Cloud Market. It supports real-time transcription of Chinese and English, returns a text stream with accurate timestamps, can be used to generate subtitles, and is suitable for live streaming, voice social networking, video conferencing, and other scenarios. You can click this link to experience it now.