Guide: On October 21, 2021, the QCon Global Software Development Conference will be held in Shanghai. As producer, Chen Gong, VP of NetEase Intelligent Enterprise Technology, has launched the "Converged Communication Technology in the AI Era" special session and invited technical experts from NetEase Yunxin, the NetEase Audio and Video Lab, and NetEase Cloud Music to share the trends and evolution of converged communication technology, the exploration and practice of key video communication technologies, the practice of audio AI algorithms in RTC, the cross-platform practice of the NetEase Cloud Music network library, and other topics.
We will introduce the four lecture topics one by one. This third installment covers the practice of audio AI algorithms in RTC.
Guest introduction: Hao Yiya, NetEase Yunxin audio algorithm expert and IEEE reviewer. He has published 15 academic papers and applied for 7 patents. He has participated in the NIH hearing-aid audio algorithm project of the US Department of Health, Apple AirPods audio algorithm R&D, Facebook Reality Labs AR/VR audio projects, and real-time audio algorithm R&D at Zoom Video Communications. At NetEase Yunxin, he is mainly responsible for audio laboratory construction, 3A algorithm R&D, AI audio algorithm R&D, and RTC audio standard formulation.
Foreword
With the continuous development of artificial intelligence, computer technology, neural networks, and related fields, AI audio algorithms are also emerging in academia and industry, including the field of online real-time communication (RTC).
Since around 2013, no matter which industry you look at, you hear the same story: using AI for this task works really well, so let's use this technology to make it intelligent. It feels as if everything can be AI, everything can be smart, everything can be done with AI, and the results are always great.
But in fact, many technologies still rely on traditional methods. In the audio field, for example, most products still use signal-processing-based methods accumulated over hundreds of years, while AI methods largely remain at the laboratory and simulation stage.
This talk shares practical experience with AI audio algorithms in RTC from three dimensions: "Difficulties in the RTC application of AI audio algorithms", "AI audio algorithm landing solution", and "NetEase Yunxin AI audio algorithm landing case". I hope that through this sharing, everyone can gain a new understanding of these issues and draw some inspiration from them, whether in audio AI, in RTC, or in other fields.
Difficulties in the RTC application of AI audio algorithms
AI audio algorithm trends
The figure below, proposed by Tsahi, shows the trend of AI audio algorithms in the RTC field. The horizontal axis represents time, and the vertical axis represents algorithm quality.
For a noise reduction algorithm, quality can be understood as the noise reduction strength, i.e. how much noise is removed, together with how well the voice is protected while the noise is being reduced.
The orange line represents traditional methods based on digital signal processing. These algorithms have been developed for hundreds of years, but as the trend progresses they have hit many bottlenecks during optimization. For example, traditional algorithms require certain preset conditions, and only under those conditions can the target effect be achieved. In current applications, however, the actual business scenarios are very complex, such as online entertainment and conference scenarios. In these scenarios the useful signal and the noise signal are heavily aliased together in complicated ways, and in many cases the preset conditions of the algorithm cannot be met. Therefore, it is very difficult to make breakthroughs with traditional algorithms alone.
AI-based, or data-driven, algorithms have developed relatively slowly. I think one reason is that they have been constrained by the computing power of our hardware. However, as computing power keeps improving, their development will become faster and faster, and in the future they will replace traditional algorithms to some extent. Tsahi believes we are currently at the crossover point of the two curves. I personally think our current position is a bit to the left of that point, because many AI algorithms cannot yet replace traditional algorithms, and traditional algorithms are still the cornerstone of the RTC field. For AI algorithms, we mainly exploit their characteristics to solve specific problems.
Why is the final trend that the development of AI algorithms will become faster and faster? One of the most important reasons is the growth of computing power. The chart on the right shows this growth from one dimension: since 2000, the computing power of supercomputers has been increasing roughly tenfold every five years.
Difficulties in applying AI audio algorithms to RTC
What difficulties do our current AI audio algorithms encounter in RTC? Mainly the following aspects.
Computational complexity: When the AI model is relatively large, its computing power demand is also relatively large. In RTC services, many algorithms need to run on terminal devices, including some low-end devices. Whether an AI algorithm can run effectively on these devices is one of the important criteria for judging whether it can be applied. In addition, there are many algorithms in the entire RTC processing pipeline, and the computing resources of a terminal device are limited; how to allocate computing resources reasonably is also a challenge for AI algorithms.
Generalization ability and robustness: Generalization ability refers to how well an AI model, after learning the scenarios covered by the training data, extends to other unseen scenarios. Robustness refers to whether the AI model can maintain a relatively good effect across various scenarios. These two points are difficulties for landing AI audio algorithms in RTC, because the devices and scenes in RTC are very diverse. In the early days, RTC was mostly understood as meetings. That scenario is relatively simple, because the acoustic scene is relatively closed and indoor, and the distortion of the collected signal is relatively small; it is a favorable scenario in RTC. But the RTC scenarios we now face are very rich, such as the entertainment industry, which ranks first among the three industries NetEase Yunxin currently focuses on. Entertainment scenes are much more complicated than conference scenes. First, our desired signal may not only be voice but also music, and many WebRTC audio algorithms are not friendly to music signals. In addition, complex scenes bring complex background noise. For example, the noise of a car on the road is not only a transient noise that is very challenging for noise reduction algorithms; the SNR is also quite low. To apply an AI algorithm, a variety of possible scenarios must be considered within a limited data set.
AI audio algorithm landing solution
We have just covered the overall trends and challenges of AI audio algorithms. This chapter is divided into two parts. The first part takes AI noise reduction as an example to discuss the specific challenges in implementing an AI algorithm. The second part takes the VAD algorithm as an example and analyzes the advantages of AI algorithms, especially deep learning algorithms, by comparing traditional algorithms, machine learning algorithms, and deep learning algorithms.
Examples of difficulties in landing AI audio algorithms in RTC
Taking AI noise reduction as an example, let's look at some of the difficulties in the actual landing process.
The first is computational complexity. In RTC, audio needs to be processed in real time, so the input is an audio stream made up of continuous frames. Assume each frame is ten milliseconds of audio samples. Real-time processing means the i-th frame must be finished before the (i+1)-th frame arrives, so those ten milliseconds are the maximum processing time. If a frame cannot be processed within that time, sample points are lost, which has a very serious impact on sound quality. AI algorithms need to achieve real-time processing, which is difficult for slightly larger models, so many AI audio competitions include a real-time track. But running in real time is only a very basic first step for RTC; it only guarantees that we can keep up with the stream, without touching the real-time rate problem. Looking at the entire audio APM pipeline, AI noise reduction is only one module; the others include AEC, AGC, VAD, and more.
Beyond audio processing, there are other modules such as video processing and network congestion control. Our job is to allow all these algorithms to meet real-time requirements at the same time while reducing CPU usage as much as possible, so as to lower the CPU overhead and power consumption of our SDK on user devices.
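To make the frame budget concrete, here is a minimal, hypothetical Python sketch (not Yunxin's actual pipeline) that measures the real-time factor of a per-frame processing function against the 10 ms frame duration; the sample rate and the pass-through `process_frame` are placeholder assumptions.

```python
import time
import numpy as np

SAMPLE_RATE = 48000          # assumed sample rate
FRAME_MS = 10                # one frame = 10 ms of audio
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def process_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the per-frame APM chain (AEC, NS, AGC, VAD, ...)."""
    return frame  # pass-through for illustration

def measure_realtime_factor(frames) -> float:
    """Real-time factor = processing time / audio duration; it must stay well below 1."""
    total_proc = 0.0
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)
        total_proc += time.perf_counter() - start
    audio_seconds = len(frames) * FRAME_MS / 1000.0
    return total_proc / audio_seconds

# toy input: 100 frames (1 second) of silence
frames = [np.zeros(FRAME_LEN, dtype=np.float32) for _ in range(100)]
print(f"real-time factor: {measure_realtime_factor(frames):.4f}")
```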
The second difficulty is data. It can be said that the most important thing for an AI algorithm is data; once you have the data, you are more than halfway to implementing the algorithm. However, compared with some other fields, the data available for audio algorithms is relatively scarce. When developing AI algorithms, we mainly build data sets from open-source data sets and real recordings made with laboratory equipment, and we also use data augmentation methods to expand them. Among audio algorithms, the AI denoising data set is relatively easy to build, because there is a comparatively large amount of open-source data and labeling is straightforward. For other audio algorithms, collecting and labeling data sets is a very difficult task. For example, for our AI howling detection algorithm, there is basically no howling data set on the Internet, and collecting and labeling howling signals is itself quite challenging.
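As one illustration of the kind of data augmentation mentioned above, the sketch below mixes clean speech with a noise clip at a chosen SNR. The function name, sample rate, and toy signals are assumptions for illustration, not Yunxin's actual augmentation code.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into clean speech at a target SNR (in dB)."""
    # loop/trim the noise to match the speech length
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    speech_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# toy example: 1 s of a 440 Hz tone as "speech", white noise mixed in at 5 dB SNR
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = mix_at_snr(clean, np.random.randn(sr), snr_db=5.0)
```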
Despite these challenges, our Yunxin Audio Lab has been accumulating data and constantly expanding the corpus. We also traverse different terminal devices and record with their microphones, so that more realistic microphone and room responses can be added to the data set. This is very important for applying AI algorithms to different terminal devices, because the AI model needs to learn the responses of different microphones and rooms during training.
With the data in hand, we also encountered difficulties in the tuning process.
Compare this with a traditional noise reduction algorithm based on MMSE: the signal in the upper left corner comes in, we obtain its amplitude and phase through an FFT, then calculate the a priori/a posteriori signal-to-noise ratio and compute a gain, apply the gain to the signal amplitude, recombine it with the phase, and perform an iFFT to get the output time-domain signal. When debugging this algorithm, if we find that voice is mistakenly suppressed as noise, we can take out those few frames and analyze the output of each module section by section to locate the problem. Why can we do this? Because every parameter here has a specific physical meaning. For a deep learning method, however, the entire algorithm is essentially a black box. Take a convolutional neural network: the input is convolved layer by layer, the intermediate parameters of each layer do not correspond to physically meaningful values, and it is impossible to localize a problem this way during optimization. Although we have encountered these problems in algorithm optimization, we have accumulated many parameter tuning methods by continuously summarizing experience. Tuning is still a challenge, but with these accumulated methods we can quickly find suitable models and corresponding parameters to solve different audio problems.
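For reference, a heavily simplified sketch of this kind of spectral-gain pipeline (FFT, SNR estimate, gain, iFFT) might look as follows. It uses a crude Wiener-style gain and a noise estimate taken from the first few frames; it only illustrates the structure being described and is not the production MMSE algorithm.

```python
import numpy as np

def spectral_gain_denoise(x, frame_len=512, hop=256, noise_frames=10):
    """Minimal Wiener-style spectral gain denoiser (illustrative only)."""
    window = np.hanning(frame_len)
    # estimate the noise power spectrum from the first few frames (assumed noise-only)
    noise_psd = np.zeros(frame_len // 2 + 1)
    for i in range(noise_frames):
        seg = x[i * hop: i * hop + frame_len] * window
        noise_psd += np.abs(np.fft.rfft(seg)) ** 2
    noise_psd /= noise_frames

    out = np.zeros_like(x)
    for start in range(0, len(x) - frame_len, hop):
        seg = x[start: start + frame_len] * window
        spec = np.fft.rfft(seg)                                 # FFT -> magnitude and phase
        post_snr = (np.abs(spec) ** 2) / (noise_psd + 1e-12)    # a posteriori SNR
        prior_snr = np.maximum(post_snr - 1.0, 0.0)             # crude a priori SNR estimate
        gain = prior_snr / (prior_snr + 1.0)                    # Wiener gain
        enhanced = np.fft.irfft(gain * spec)                    # apply gain, keep phase, iFFT
        out[start: start + frame_len] += enhanced * window      # overlap-add
    return out
```

Because each step (SNR estimate, gain) has a physical meaning, its intermediate values can be inspected frame by frame when debugging, which is exactly the property a black-box network lacks.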
Those are the three hardest points we encountered when actually landing AI noise reduction. Having talked about so many drawbacks and difficulties, what benefits can AI algorithms actually bring us?
The choice between traditional audio algorithms, Machine-Learning, and AI
Let's talk about the choice between various algorithms in conjunction with VAD (Voice Activity Detection).
The figure on the right below shows a VAD comparison result. The upper part is a clean voice signal. What VAD needs to do is detect the voice signal, marking it as 1 along the time dimension, while 0 represents noise.
Let's take a look at how traditional VAD is generally done; this paper is fairly representative. Look at the flowchart on the left: it is divided into two branches, and the right branch is an energy-based VAD, which is the easiest to understand. For example, when I speak in this venue and everyone hears my voice, the voice must be louder than the noise floor of the environment, i.e. the SNR is greater than 0, so we can set an energy threshold: anything above it is voice, anything below it is noise. Of course, energy-based VAD also involves a lot of other work, including how to estimate the energy more accurately and how to keep adapting the threshold to new environments and noise. The branch on the left is based on other voice features, such as harmonic information like spectral peaks. The last method combines both the energy VAD and the spectral-feature VAD; this is the method proposed in the paper, and it also gives the best detection result on this audio.
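A minimal sketch of such an energy-threshold VAD, assuming a fixed dB threshold and 10 ms frames (real implementations adapt the threshold and keep tracking the noise floor), could look like this:

```python
import numpy as np

def energy_vad(x, sr=16000, frame_ms=10, threshold_db=-40.0):
    """Frame-wise energy VAD: 1 where frame energy exceeds a fixed dB threshold, else 0."""
    frame_len = sr * frame_ms // 1000
    n_frames = len(x) // frame_len
    flags = np.zeros(n_frames, dtype=int)
    for i in range(n_frames):
        frame = x[i * frame_len: (i + 1) * frame_len]
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        flags[i] = int(energy_db > threshold_db)   # above threshold -> voice
    return flags
```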
Take the energy VAD as an example: what is above the threshold is voice and what is below is noise, so it can be regarded as a single-feature classifier. If two features are used, they can be represented like the picture in the upper left corner below. Think of the red dots as noise and the blue dots as speech. Most traditional algorithms are linear classifiers: they make a straight cut like this. After the cut, many blue points fall on the noise side and many red points fall on the voice side; these are misclassifications, i.e. false alarms and missed detections.
Is there a way to distinguish the two classes in a higher dimension? We can introduce machine learning to separate these points in a higher dimension, as shown in the following pictures (see the sketch below). We have only mentioned two features here; in practical problems we often use more features and combine different models to handle them from different dimensions.
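As a toy illustration of this idea, the sketch below trains an RBF-kernel SVM on two synthetic 2-D feature clouds standing in for the noise and speech points; the feature values and the model choice are assumptions for illustration, not the classifier actually used in Yunxin's VAD.

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-D feature points: red = noise, blue = speech (e.g. energy and a spectral feature)
rng = np.random.default_rng(0)
noise_feats = rng.normal(loc=[-0.5, 0.8], scale=0.3, size=(200, 2))
speech_feats = rng.normal(loc=[0.6, 0.2], scale=0.3, size=(200, 2))
X = np.vstack([noise_feats, speech_feats])
y = np.concatenate([np.zeros(200), np.ones(200)])   # 0 = noise, 1 = speech

# an RBF-kernel SVM draws a non-linear boundary instead of a single straight cut
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))
```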
In Yunxin's audio processing we have many different VADs, with different responsibilities and different effects. Let me introduce one of them. Look first at the two pictures in the middle, which are two continuous speech clips in the time domain. Comparing the upper and lower clips, you can see that the framed speech above is denser, with smaller noise gaps in between. We cut the framed audio of about five seconds into a thousand small segments, compute the energy of each segment, and place the results in the histogram on the right.
In the upper histogram you can see a peak at -10 dB and a peak at -35 dB. The -10 dB peak means the average energy of the speech is -10 dB, and -35 dB is the energy of the noise gaps in between. If you do the same statistics on the lower speech clip, you can clearly see that the lower histogram also has a peak at -10 dB, but it is smaller than in the upper histogram, indicating that -10 dB appears less frequently, while the peak at -35 dB becomes relatively prominent because the number of noise gaps increases. On this basis, we added a step that learns the two peaks from data and uses a machine learning model to estimate them. So even if the SNR is relatively low and the two peaks overlap heavily, the machine learning method can still separate them.
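The talk does not specify which machine learning model estimates the two peaks; one plausible, purely illustrative choice is a two-component Gaussian mixture fit to the per-segment energies, as sketched below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_energy_peaks(x, n_segments=1000):
    """Split the clip into short segments, compute per-segment energy in dB, then fit a
    2-component GMM whose means approximate the speech peak and the noise-gap peak."""
    seg_len = max(len(x) // n_segments, 1)
    energies = []
    for i in range(n_segments):
        seg = x[i * seg_len: (i + 1) * seg_len]
        if len(seg) == 0:
            break
        energies.append(10 * np.log10(np.mean(seg ** 2) + 1e-12))
    energies = np.array(energies).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(energies)
    speech_peak, noise_peak = sorted(gmm.means_.ravel(), reverse=True)
    return speech_peak, noise_peak
```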
Now that we are using the idea of machine learning, can we use more complex models to learn from more data? Let's look at how to implement VAD with a neural network. Here is a CNN-based VAD algorithm. The model uses three convolutional layers and one fully connected layer, with log-Mel energy as the input feature; there is no other complicated design, it is a very straightforward solution. Look at the result on the right. This result was chosen because it is a difficult case: first, the SNR is 0 dB, and second, the noise is all non-stationary noise from daily life. In the comparison, G729 is similar to the signal-processing-based VAD we just introduced, Sohn and Random Forest are statistics-based, and CNN is the neural-network-based method. From the results we can see that the statistics-based methods are not good enough on most noises, even falling below G729, while the CNN VAD performs best on all noises.
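A rough PyTorch sketch of such a model (three convolutional layers plus one fully connected layer operating on a log-Mel energy patch) is shown below; the layer widths, patch size, and pooling are assumptions for illustration rather than the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class CnnVad(nn.Module):
    """Small CNN VAD sketch: 3 conv layers + 1 fully connected layer,
    taking a log-Mel energy patch (1 x n_mels x n_frames) and outputting P(speech)."""
    def __init__(self, n_mels=40, n_frames=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * (n_mels // 8) * (n_frames // 8), 1)

    def forward(self, log_mel):            # log_mel: (batch, 1, n_mels, n_frames)
        h = self.features(log_mel)
        return torch.sigmoid(self.classifier(h.flatten(1)))

model = CnnVad()
dummy = torch.randn(4, 1, 40, 32)          # batch of 4 log-Mel patches
print(model(dummy).shape)                  # -> torch.Size([4, 1])
```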
To summarize: the advantage of deep-learning-based AI VAD is that it can distinguish speech from noise at a deeper level, so the upper limit of the algorithm's capability is higher. Transient noise, for example, is basically not covered by traditional algorithms, but AI VAD can handle it. Its weaknesses are still the complexity and generalization issues we analyzed earlier.
NetEase Yunxin AI audio algorithm landing case
We have just covered the specific difficulties encountered during landing and why we still choose AI algorithms despite them. Now let's talk about how Yunxin solves these problems.
Specific implementation process
Returning to the AI noise reduction problems just discussed, our solution consists mainly of the three points on the left.
The first is to use more suitable input features. On the one hand, we minimize the number of input features to reduce the complexity of the entire model; on the other hand, we use features that are better suited to the problem at hand, choosing those that more effectively represent the signal characteristics. That is our first approach.
Second, we use a lightweight and better-suited network. For the noise reduction problem, for example, we chose a GRU model because it retains memory across time while being lighter than an LSTM. During tuning, we also try to compress the number of network layers to guarantee computing speed.
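As an illustration of what such a lightweight GRU-based denoiser might look like, the PyTorch sketch below estimates a per-frequency gain mask from magnitude spectra; the hidden size and feature dimension are assumptions, not Yunxin's actual model.

```python
import torch
import torch.nn as nn

class GruDenoiser(nn.Module):
    """Lightweight GRU mask estimator (illustrative): per-frame spectral features in,
    per-frequency gain mask in [0, 1] out, applied to the noisy magnitude spectrum."""
    def __init__(self, n_bins=257, hidden=128):
        super().__init__()
        self.gru = nn.GRU(input_size=n_bins, hidden_size=hidden,
                          num_layers=1, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_mag):             # noisy_mag: (batch, frames, n_bins)
        h, _ = self.gru(noisy_mag)
        return self.mask(h) * noisy_mag       # masked (enhanced) magnitude

model = GruDenoiser()
frames = torch.rand(1, 100, 257)              # 100 frames of magnitude spectra
enhanced = model(frames)
print(sum(p.numel() for p in model.parameters()))   # parameter count stays small
```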
The third method is targeted optimization. There are many tools for this, such as data set optimization, which includes data augmentation and data cleaning, and model compression to reduce the model size. When integrating the algorithms into the SDK, all of Yunxin's AI-based algorithms use our self-developed NENN inference framework, which further accelerates the AI models and reduces overhead. In the end, on different terminal devices, including many low-end mobile devices, the computing time of our AI noise reduction is about 100-200 microseconds per 10-millisecond frame, a real-time rate close to 0.01.
It is also worth mentioning that our AI algorithms often need to work together with traditional algorithms. In echo cancellation, for example, we could throw the near-end and far-end signals entirely at a model and let it handle the whole AEC chain directly; or we could use traditional methods for delay estimation and linear processing, and separate out only the nonlinear processing to be handled by a trained AI model. The latter decouples the tasks, so the final algorithm effect can be improved, and the computational cost is much smaller than the former.
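To illustrate the second, decoupled approach, the sketch below implements only a generic linear stage as an NLMS adaptive filter, whose residual would then be passed to a small neural post-filter; this is a textbook-style illustration under assumed parameters, not Yunxin's AEC implementation.

```python
import numpy as np

def nlms_linear_aec(far_end, mic, filter_len=256, mu=0.5, eps=1e-6):
    """Linear stage of a hybrid AEC: an NLMS adaptive filter estimates the echo path
    and subtracts the linear echo from the microphone signal."""
    w = np.zeros(filter_len)
    buf = np.zeros(filter_len)
    error = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n] if n < len(far_end) else 0.0
        echo_est = np.dot(w, buf)
        error[n] = mic[n] - echo_est
        w += mu * error[n] * buf / (np.dot(buf, buf) + eps)
    return error   # residual signal, still containing nonlinear echo

# toy usage: a 1 s far-end signal leaking into the mic through a short random echo path
sr = 16000
far = np.random.randn(sr)
echo_path = np.random.randn(64) * 0.1
mic = np.convolve(far, echo_path)[:sr] + 0.01 * np.random.randn(sr)
residual = nlms_linear_aec(far, mic)
```

The residual from this linear stage is what a small neural post-filter (for example a GRU mask model like the one sketched earlier) would be trained to clean up, which is how the two tasks stay decoupled.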
Yunxin AI algorithm landing examples
The AI algorithms we have currently landed include AI noise reduction, AI audio scene detection, and AI howling detection, as well as 3D sound effects, which are not shown in the PPT. An AI echo cancellation module is currently being implemented. Below is a demo video of AI noise reduction.
As you can see from this demo, there are many difficult burst noises, and our algorithm suppresses them very well. However, while reducing noise, the human voice also sounds somewhat muffled. To address this, we are now implementing AI noise reduction 2.0 and plan to launch it by the end of the year. The new version does a better job of protecting voice quality and is essentially lossless for clean speech.
Finally, I want to share something about Yunxin's 3D sound effects in the context of the popular metaverse. NetEase Yunxin is currently the only RTC vendor that has realized 6DoF spatial audio.
Facebook recently released a complete spatial audio technology (MSA). This picture shows all the capabilities MSA can achieve, and Yunxin's 3D sound effects realize all of them as well, such as room reflection and far-field attenuation. Here are two examples of Yunxin's 3D sound effects in production. The bottom left is the FPS game "Wild Action". In this type of game, directional sound for effects such as gunfire is familiar to everyone; on top of those effect sounds, Yunxin's 3D audio also gives the voices of teammates in team voice chat a real-time sense of direction.
The second application case is Yaotai, an immersive event system developed by NetEase Thunder Fire. This system can host large conferences such as QCon, turning them into fully online, fully virtual scenes where participants attend immersively through a virtual avatar from a first-person perspective. Yunxin's 3D sound effects provide real-time spatial audio in this system, allowing participants to hear the real-time voices around their virtual characters, for a deeper and more realistic immersive experience.