Foreword

"Voice processing" is a very important scene in the field of real-time interaction. In the " RTC Dev Meetup丨Technical Practice and Application of Voice Processing in the Field of Real-time Interaction " initiated by Shengwang, technologies from Baidu, Huanyu Technology and Yitu Experts have shared on this topic.

This article is based on the talk given by Zhou Yuanjian, technical director of Yitu AI SaaS, at the event. Follow the official account "Shengwang Developer" and reply with the keyword "DM0428" to download the slides related to the event.


Yitu is a provider of AI infrastructure and AI solutions, with a broad range of AI capabilities covering image, video, voice, and natural language processing. In addition to AI algorithms, it also provides AI computing power.

With Yitu's background covered, let me talk about the challenges Yitu encountered with audio content review in live-streaming scenarios.

01 The business process of live content review


■Figure 1

Figure 1 shows the business process of content review in the live broadcast scenario.

The basic process is as follows: the anchor starts the live broadcast and pushes the stream to the platform; the platform sends an audit request to the review supplier; the supplier (such as Yitu) pulls the stream from the given address, decodes it, and analyzes it in real time. When violating content is found, the result is returned to the customer through a callback. After receiving the data, the customer generally performs a second, manual review. If the content is confirmed to be illegal, background actions are taken, such as stopping the live broadcast or banning the account.
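To make the callback step concrete, below is a minimal Python sketch of what such a result payload could look like. Every field name here is a hypothetical illustration for this article, not the actual API of Yitu or any platform.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AuditCallback:
    """Hypothetical callback payload for a flagged live-stream segment;
    the schema is illustrative only."""
    stream_id: str
    start_ms: int        # offset of the flagged segment within the stream
    end_ms: int
    category: str        # e.g. "violating_speech", "sensitive_audio"
    confidence: float    # model confidence, used to prioritize manual review
    transcript: str      # ASR text of the segment, if available

payload = AuditCallback("room-001", 125_000, 131_500,
                        "violating_speech", 0.93, "(recognized text)")
# Serialized and sent back to the customer's callback URL for manual re-review
print(json.dumps(asdict(payload), ensure_ascii=False))
```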

02 Live audio auditing algorithm module

Expanding the algorithm modules inside the system, as shown in Figure 2, they can be divided into three categories: the first is basic speech recognition (ASR); the second is detection of violating content based on the recognized text; the third is non-verbal recognition, which covers offending content that is not expressed in words.


■Figure 2

2.1 Technical Difficulties of Speech Recognition (ASR)

Let me first introduce the challenges encountered in ASR.

In general, there are two main challenges. The first is interference from strong background sounds. In Internet voice scenarios there is usually background music or game sound effects, the environment is generally noisy, and there are often multiple people speaking at the same time. The superposition of these factors makes speech recognition much more difficult than in ordinary scenarios.

The second is the recognition of specific proper words. Some violating words rarely appear in everyday speech, so without special optimization the recognizer tends to map their syllables to more common words, causing violating words to be missed.

2.1.1 Performance optimization under strong background sound

So how do we deal with such a problem? For strong background interference, after various attempts we concluded that the most effective approach is to solve it from the data perspective.

In terms of data, there are two main optimizations. The first is to build a relatively sophisticated ambient sound simulator tailored to the business scenario and use it for data augmentation. This approach has been validated in other fields; for example, Tesla's autonomous driving models use similar techniques during training to improve performance.

As shown in the diagram, the simulator is built from multiple dimensions such as sound-generation simulation, room simulation, sound-reception simulation, and channel simulation. Parameters can be adjusted in each dimension, for example the number of speakers, speaking speed, background sound, the position and direction of the sound source, voice attenuation, reverberation, and so on. In total there are probably hundreds of adjustable parameters. The simulator enriches the originally simple training data and makes it closer to a specific scenario, yielding a good performance improvement.
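As a minimal sketch of this kind of data augmentation (not Yitu's actual simulator), the Python snippet below covers only two of the dimensions mentioned above: mixing in a background track at a target SNR and simulating room reverberation with an impulse response. The loader in the usage comment is hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(speech: np.ndarray, background: np.ndarray,
            rir: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix clean speech with a background track at a target SNR,
    then convolve with a room impulse response to add reverberation."""
    # Loop/trim the background to match the speech length
    bg = np.resize(background, speech.shape)

    # Scale the background to reach the desired signal-to-noise ratio
    speech_power = np.mean(speech ** 2) + 1e-12
    bg_power = np.mean(bg ** 2) + 1e-12
    target_bg_power = speech_power / (10 ** (snr_db / 10))
    bg = bg * np.sqrt(target_bg_power / bg_power)

    # Room simulation: convolve the mixture with the impulse response
    noisy = fftconvolve(speech + bg, rir, mode="full")[: len(speech)]

    # Normalize to avoid clipping
    return noisy / (np.max(np.abs(noisy)) + 1e-12)

# Usage sketch: randomize the SNR per sample, one of many simulator knobs
# speech, background, rir = load_waveforms(...)   # hypothetical loader
# sample = augment(speech, background, rir, snr_db=np.random.uniform(0, 15))
```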

Another improvement is to train with hard example mining. In normal model training there are both positive and negative samples, and with large amounts of data there will always be positives that look similar to negatives; such samples are usually called hard examples. Online hard example mining repeatedly feeds these hard examples back into the training process. It is similar to a "wrong-answer notebook": by recording the questions you keep getting wrong and revisiting them, you improve your grades.

Applying this method lets the model learn finer details that are otherwise hard to distinguish, yielding a solid performance gain. With the above techniques, the model performs well even on data distributions with strong background sound.
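For illustration only (this is not Yitu's internal code), here is a minimal PyTorch-style sketch of online hard example mining on a simple classification loss; ASR training would apply the same idea to its sequence loss.

```python
import torch
import torch.nn.functional as F

def ohem_step(model, batch, optimizer, keep_ratio: float = 0.5):
    """One training step with online hard example mining:
    back-propagate only on the hardest (highest-loss) samples in the batch."""
    inputs, labels = batch
    logits = model(inputs)

    # Per-sample loss, no reduction yet
    losses = F.cross_entropy(logits, labels, reduction="none")

    # Keep the top fraction of samples by loss (the "hard examples")
    k = max(1, int(keep_ratio * losses.numel()))
    hard_losses, _ = torch.topk(losses, k)

    optimizer.zero_grad()
    hard_losses.mean().backward()
    optimizer.step()
    return hard_losses.mean().item()
```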

2.1.2 Recognition of specific proper words

Another challenge mentioned earlier is the recognition of proper words. Figure 3 shows an example: the transcription of a piece of Chinese audio. If you have never heard the slang term rendered here as "knock bubble", you may not be able to make sense of the passage, and may even hear it as a similar-sounding common word such as "terrible".


■Figure 3

In response to this problem, we tried several approaches and found two effective improvements. The first is to increase the loss weight of proper words during model training, that is, to give a higher penalty when a proper word is recognized incorrectly. In the example above, where normally one point would be deducted for any wrong word, a mistake on a proper word costs more, so the model works harder to avoid proper-word recognition errors.
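Below is a minimal sketch of the weighting idea on a frame-aligned cross-entropy loss. A real ASR system would apply the same weighting to its sequence loss (e.g. CTC or attention-based), and the weight value here is an assumed illustration, not a number from the talk.

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits, targets, keyword_token_ids, keyword_weight=5.0):
    """Token-level cross entropy where tokens belonging to proper/violating
    words are penalized more heavily than ordinary tokens.

    logits:            (T, V) per-frame vocabulary scores
    targets:           (T,)   reference token ids
    keyword_token_ids: 1-D tensor of token ids that make up proper words
    """
    losses = F.cross_entropy(logits, targets, reduction="none")  # (T,)

    # Up-weight positions whose reference token is part of a proper word
    is_keyword = torch.isin(targets, keyword_token_ids)
    weights = torch.where(is_keyword,
                          torch.full_like(losses, keyword_weight),
                          torch.ones_like(losses))
    return (losses * weights).mean()
```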

The second method is to adjust the range of candidate words searched in the lexicon during decoding. As shown in Figure 4, a speech recognition system first recognizes phonemes from the speech spectrum and then converts the phoneme sequence into possible text.


■Figure 4

To optimize for proper words, more candidate words can be considered when converting the phoneme sequence into text. In the earlier example, if the word "knock bubble" is not in the candidate list, there is no way it can ever be recognized correctly.
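A toy sketch of the idea follows (this is not Yitu's decoder). It shows a word-level beam search over per-step candidate sets: a word that never appears among the candidates can never be output, and enlarging the candidate set raises recall at the price of more expansions per step. The `lm_score` callable is a placeholder for any language-model scoring function.

```python
from itertools import product

def decode(candidates_per_step, lm_score, beam_size=3):
    """Toy word-level beam search over per-step candidate words.

    candidates_per_step: list of lists of (word, acoustic_score) pairs
    lm_score(prev, word): placeholder bigram language-model score
    """
    beams = [([], 0.0)]
    for candidates in candidates_per_step:
        expanded = []
        # Cost per step grows with beam_size * number of candidates
        for (hyp, score), (word, ac) in product(beams, candidates):
            prev = hyp[-1] if hyp else "<s>"
            expanded.append((hyp + [word], score + ac + lm_score(prev, word)))
        # Keep only the highest-scoring partial hypotheses
        beams = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams[0][0]
```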

This idea is intuitive, but implementing it introduces a new problem: the amount of computation increases greatly, roughly with quadratic complexity. In a non-real-time scenario the impact may not be large, but in a live broadcast scenario the extra computation can lead to longer delays.

Live broadcasts are sensitive to delay: generally, better platforms require review results within seconds, and at worst within minutes, so the speed problem has to be solved. Yitu's acceleration solution is to determine the search range of candidate words dynamically. Going back to the business scenario, content review does not require every sentence to be recognized perfectly; the most critical requirement is to accurately identify the offending words, and this can be exploited for optimization.

Specifically, when the preceding phonemes suggest that a violating word may be present, the decoding search range for the subsequent candidate words is expanded. In this way low-frequency violating words are not missed, while computation that has no impact on the final business result is avoided, greatly reducing the overall amount of calculation and keeping the service real-time.
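One way such dynamic expansion could look in code is sketched below; the trigger prefixes and candidate counts are purely illustrative assumptions, not values from the talk.

```python
def candidate_count(recent_phonemes, trigger_prefixes, base_n=5, expanded_n=50):
    """Decide how many candidate words to search at the current decoding step.

    recent_phonemes:  the last few decoded phonemes (a sliding window)
    trigger_prefixes: phoneme prefixes of the violating words we care about

    If the tail of the recent phoneme stream matches the prefix of a
    violating word, temporarily widen the search so the rare word is
    not pruned away; otherwise keep decoding cheap.
    """
    window = " ".join(recent_phonemes)
    if any(window.endswith(prefix) for prefix in trigger_prefixes):
        return expanded_n
    return base_n
```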

2.2 Non-verbal recognition

In live-broadcast scenarios, the needs for non-verbal recognition mainly focus on voiceprint recognition of important people, sensitive audio detection, language classification, and result fusion.

2.2.1 Sensitive audio detection

First, sensitive audio detection. Its task is to identify whether a piece of audio contains ASMR or other prohibited sounds. There are two main technical difficulties. The first is that the sensitive content is very short and of variable length: to evade censorship, publishers may mix sensitive sounds into normal speech, so the sensitive segments are generally short and well concealed. The second is the low concentration of violations in the data, which means the number of false positives must be kept low to control the cost of manual review; maintaining high recall at the same time places high demands on the robustness of the algorithm.

For the problem that the sensitive content is relatively short, as shown in Figure 5, the optimization is mainly done at the network-architecture level.


■Figure 5

Usually, a detection algorithm processes a piece of audio as a whole. When the violating content is short, the signal from the surrounding normal content masks the abnormal signal, and recall drops.

A common way to avoid this is to divide the audio into smaller segments, which does reduce interference from normal sound but also loses the original context of the audio, causing false positives. After many attempts and investigations, Yitu adopted the Attention mechanism to solve this problem.

Attention has developed rapidly in recent years and has achieved good results not only in machine translation but also in text, image, and speech tasks. Simply put, given a sequence of data, it first computes which positions in the sequence are more important and then pays more attention to the data at those positions.

Applied to this scenario, when a piece of audio arrives, the Attention mechanism can retain the complete context while determining which parts are more likely to be sensitive sounds and allocating more attention to them, thereby improving the algorithm's performance.
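A minimal sketch of attention pooling over frame-level features is shown below. Yitu's actual network is not public, so the layer shapes and structure here are assumptions meant only to illustrate how short violating frames can receive larger weights instead of being averaged away.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention pooling over frame-level features: frames that look more
    'sensitive' get larger weights, so short violating segments are not
    drowned out by the surrounding normal audio."""

    def __init__(self, feat_dim: int, num_classes: int = 2):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)           # importance score per frame
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) features from any acoustic encoder
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        pooled = (weights * frames).sum(dim=1)               # weighted average
        return self.classifier(pooled)                       # utterance-level logits

# Usage sketch (shapes are illustrative):
# logits = AttentivePooling(feat_dim=256)(torch.randn(4, 300, 256))
```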

The other challenge is achieving low false positives and high recall at low violation concentrations. Our solution is transfer learning with pre-training, as shown in Figure 6. Transfer learning is widely used across fields: we continue training the model we want on top of another well-trained model, and end up with a better model, which is like standing on the shoulders of giants.


■Figure 6

Yitu has previously achieved good results in voiceprint competitions at home and abroad. Since sensitive audio is closely related to voiceprint, and voiceprint recognition is the same type of algorithm task, we naturally considered transferring this advantage to sensitive audio detection.

As shown in Figure 7, a feature of the Yitu voiceprint model is that it learns invariance to channels, environments, and so on, so it holds up well across various channel conditions. We use our own voiceprint model to initialize the sensitive sound detection model so that it inherits these characteristics, giving it good robustness across channel environments.


■Figure 7
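As a rough sketch of this initialization step, the snippet below loads a pre-trained encoder and swaps in a new task head. The encoder architecture and checkpoint path are placeholders; the real voiceprint backbone is of course far larger and not shown in the talk.

```python
import torch
import torch.nn as nn

def build_sensitive_detector(voiceprint_ckpt: str, feat_dim: int = 256):
    """Initialize a sensitive-audio detector from a pre-trained voiceprint encoder.
    The checkpoint path and layer shapes are illustrative placeholders."""
    encoder = nn.Sequential(                 # stand-in for the voiceprint backbone
        nn.Linear(80, feat_dim), nn.ReLU(),
        nn.Linear(feat_dim, feat_dim), nn.ReLU(),
    )
    # Load weights learned on the voiceprint task (checkpoint must match these keys)
    encoder.load_state_dict(torch.load(voiceprint_ckpt))

    # Replace the task head: binary sensitive / not-sensitive classification
    head = nn.Linear(feat_dim, 2)
    return nn.Sequential(encoder, head)      # fine-tune end-to-end, or freeze the encoder
```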

2.2.2 Language Classification

The task of language classification is to determine which languages an input audio contains. Generally speaking, in live-broadcast scenarios it is riskier for the platform when the anchor speaks content in a language other than Chinese. For example, anchors who teach English on Douyin dare not teach in English continuously; if they do, say for one or two minutes, they will quickly receive a violation reminder from the platform.

With a language classification capability, this risk is greatly reduced for the platform, which can quickly identify risky live rooms. If the platform's audit team understands the anchor's language, they can watch closely for illegal content; if they do not, the simplest option is to close the live room, and the platform avoids the risk.

There are three main challenges encountered in language classification:

The first is that low signal-to-noise-ratio data is prone to false positives or false negatives, caused by environmental noise, reverberation and echo, far-field pickup distortion, channel distortion, and so on. Added interference from background music or live-stream special effects makes language classification even harder.

The second challenge is that the large number of languages makes training difficult. There are thousands of languages in the world, which makes collecting and labeling data very hard, so it is difficult to obtain large amounts of high-quality training data.

The third challenge is the limitations of traditional algorithmic approaches: if a person speaks multiple languages, the voiceprint alone is not enough to make the call; when classifying singing and similar scenes, the model tends to overfit to the background music, hurting generalization; and when the segment is short, it is hard to extract accurate pronunciation features.

These problems are similar to the challenges introduced earlier, so we will not analyze them again here. As shown in Figure 8, they can be addressed with data augmentation, network improvements, and pre-training. Yitu's online customers are already using the language classification function, and from observations in real-world scenarios the overall precision and recall are good.


■Figure 8

About Shengwang Cloud Market

Shengwang Cloud Market is a one-stop real-time interaction solution launched by Shengwang. By integrating the capabilities of technology partners, it gives developers a one-stop development experience, covering selection, price comparison, integration, account opening, and purchase of real-time interaction modules. It helps developers quickly add various RTE functions, bring applications to market faster, and save 95% of the time needed to integrate RTE functions.

Yitu's real-time voice transcription (Chinese) is now available on the Shengwang Cloud Market. It provides streaming speech recognition, supports Mandarin Chinese, is compatible with multiple accents, and returns transcription results while audio data is still being received, allowing you to capture and use the text in real time.

