With the rapid development of real-time communication technology, users' expectations for noise reduction during calls keep rising, and deep learning is now being applied to real-time noise suppression. At LiveVideoStackCon 2021 Shanghai, Feng Jianyuan, head of audio algorithms at Agora, shared an example of deep learning on mobile devices, the problems encountered along the way, and the prospects for the future.

Text / Feng Jianyuan

Organized by LiveVideoStack


Hello everyone, I'm Feng Jianyuan from Agora. Today I will introduce how we do real-time noise suppression based on deep learning, which is also an example of putting deep learning on mobile devices.


We will cover the topic in this order. First, the different types of noise: how they are classified, how to choose algorithms, and how to solve these noise problems with algorithms. Next, how to design such networks with deep learning, that is, how to design the algorithm around an AI model. In addition, as we all know, deep learning networks demand a lot of computing power and the models are inevitably fairly large, so when we deploy them in RTC scenarios we inevitably run into problems; we will look at which problems have to be solved and how to deal with model size and computing power. Finally, we will show what noise reduction can currently achieve, some application scenarios, and how noise suppression could be made even better.

01. Classification of noise and choice of noise reduction algorithm

Let's first understand what types of noise we usually have.


Noise inevitably comes with the environment you are in: every object around you makes some kind of sound. Each of those sounds has its own meaning, but in real-time communication only the human voice is meaningful, so you may regard everything else as noise. A lot of noise is stationary, or steady-state, noise. For example, while this talk is being recorded through the microphone there may be some background hiss that you cannot hear right now, or the whirring of an air conditioner. Noise like this is stationary: it does not change over time, so we can estimate what the noise looked like earlier, and if it keeps appearing, remove it with a very simple subtraction. Such stationary noise is very common, but not all noise is that smooth and easy to remove. There is also a lot of non-stationary noise: you cannot predict whether someone's phone will suddenly ring in the room, whether someone will start playing music, or when cars will roar past on the road or in the subway. These sounds appear randomly and cannot be handled by prediction, which is exactly why we turn to deep learning: traditional algorithms find it hard to eliminate and suppress non-stationary noise.

In terms of usage scenarios, even in a very quiet meeting room or at home you may still be affected by some background noise or by sudden noise introduced by the device itself, so noise suppression is an unavoidable pre-processing step in real-time communication.


Setting aside our everyday, sensory understanding of these noises, let's look at how they behave at the numeric, signal level. Noise and speech are both transmitted through the air, reach your ears, are sensed by the hair cells, and finally form your perception. In processing, what a microphone collects is a wave signal: a waveform that oscillates up and down. For clean speech, you see a waveform while the person is talking and essentially zero when they are not. Once noise is added, as on the right, the vibration of the noise aliases with the vibration of the voice, the waveform becomes ambiguous, and there is signal even when nobody is talking. That is the view at the waveform level. If we transform to the frequency domain with the Fourier transform, human speech generally lies between about 20 Hz and 2 kHz, with a fundamental frequency, formant peaks, and harmonics, so a voice has a characteristic shape in the spectrum. With noise added, the spectrum becomes blurred, and a lot of energy appears in places where it should not.
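To make the waveform-versus-spectrum picture concrete, here is a minimal sketch, not from the talk, that compares a clean and a noisy recording in the frequency domain with an STFT; the file names are placeholders.

```python
# Minimal sketch (not from the talk): compare a clean and a noisy recording
# in the frequency domain. The file names are placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

fs, clean = wavfile.read("clean_speech.wav")
_, noisy = wavfile.read("noisy_speech.wav")

# 16 kHz audio, 20 ms windows with a 10 ms hop -> one frame every 10 ms.
f, t, S_clean = stft(clean.astype(np.float32), fs=fs, nperseg=320, noverlap=160)
_, _, S_noisy = stft(noisy.astype(np.float32), fs=fs, nperseg=320, noverlap=160)

# Log-magnitude spectrograms: clean speech shows a fundamental and harmonics
# below ~2 kHz, while the noisy one has energy smeared across bins and frames.
log_clean = 20 * np.log10(np.abs(S_clean) + 1e-8)
log_noisy = 20 * np.log10(np.abs(S_noisy) + 1e-8)
print(log_clean.shape, log_noisy.shape)  # (freq_bins, frames)
```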

Noise suppression is essentially the inverse of this process: through filtering, the noisy time-domain signal is turned back into a pure one, or the noise is removed in the frequency domain, so as to recover a relatively clean speech signal.


Noise reduction algorithms have existed for a long time. When Bell Labs invented the telephone, it was already found that noise has a great influence on communication: by Shannon's theorem, the signal-to-noise ratio affects the achievable bandwidth, and a clean signal can even be transmitted with relatively little bandwidth. The algorithms before roughly 2000 can be described collectively as "knowing is knowing": they can only remove the noise they can first estimate.

These methods mainly target relatively stable noise, that is, Stationary Noise. Why "knowing"? Because when nobody is speaking there is only noise, so you can capture the noise in the silent segments and construct its distribution. Since it is steady-state noise, it does not change drastically over time, and even when a voice appears later you can use the estimated noise model to perform spectral subtraction or Wiener filtering. Early components had a considerable noise floor, so this kind of stationary noise was the first to be removed. The methods are essentially spectral subtraction and Wiener filtering; later there were more advanced filtering and wavelet-decomposition approaches, but they do not depart from the original idea: estimate the noise from the silent segments and then remove it in the subsequent processing with some form of spectral subtraction.
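As an illustration of this "estimate in silence, then subtract" idea, here is a minimal spectral-subtraction sketch; it is my own example rather than the speaker's code, and it assumes the first few frames of the recording contain only noise.

```python
# Minimal spectral-subtraction sketch (illustration, not the speaker's code).
# Assumes the first `noise_frames` STFT frames contain only noise.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs=16000, noise_frames=20, floor=0.05):
    f, t, S = stft(noisy, fs=fs, nperseg=320, noverlap=160)
    mag, phase = np.abs(S), np.angle(S)

    # Estimate the stationary noise magnitude from the leading "silent" frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the estimate; keep a small spectral floor so that
    # over-subtraction does not create musical-noise artifacts.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs,
                        nperseg=320, noverlap=160)
    return enhanced
```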

Gradually people found that, apart from Stationary Noise, if you want to keep only the human voice during a call, other noises also have to be dealt with. After 2000 the observation was that the distribution of the human voice and the distribution of, say, wind noise are different: wind passing over the microphone, like me blowing into it, has more energy at low frequencies and attenuates quickly at high frequencies. So voice and noise can be separated by clustering. The main idea is to project the sound signal into a higher-dimensional space and cluster there. The clustering is somewhat adaptive, which makes it something of a predecessor of deep learning: it divides the sound into different types, and when denoising in the high-dimensional space the components with voice characteristics are kept and the other parts can be discarded. Methods of this kind, such as subspace decomposition, achieved great success in the image field, and non-negative matrix factorization works well for wind-noise removal in audio. When there is more than one type of noise and many types have to be decomposed, methods like dictionary learning can also do the job.
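To show what this decomposition looks like in practice, here is a small sketch that factorizes a magnitude spectrogram with non-negative matrix factorization; it is my own illustration, and the component count and the low-frequency heuristic for labelling wind-like bases are arbitrary choices.

```python
# Illustration of NMF-based decomposition of a magnitude spectrogram.
# The component count and the low-frequency heuristic are arbitrary choices.
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def nmf_denoise(noisy, fs=16000, n_components=16):
    f, t, S = stft(noisy, fs=fs, nperseg=320, noverlap=160)
    V = np.abs(S)

    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=400)
    W = nmf.fit_transform(V)          # (freq_bins, components): spectral bases
    H = nmf.components_               # (components, frames): activations

    # Crude heuristic: bases whose spectral centroid sits below ~300 Hz are
    # treated as wind-like noise and dropped; the rest are kept as "voice".
    centroid = (f[:, None] * W).sum(axis=0) / (W.sum(axis=0) + 1e-8)
    keep = centroid > 300.0

    V_voice = W[:, keep] @ H[keep, :]
    mask = V_voice / (W @ H + 1e-8)   # Wiener-like ratio mask
    _, enhanced = istft(mask * S, fs=fs, nperseg=320, noverlap=160)
    return enhanced
```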

A common category is what we call Non-Stationary Noise with Simple Patterns: noise that is unstable, like whistling wind, but still has a fixed pattern. The whirring of the wind sometimes appears and sometimes does not, yet it always follows the characteristics of wind, denser at low frequencies and so on. These can be learned one by one: wind, thunder, background hum, and so forth. But we then found that the types of noise are endless; the eddy noise produced by each kind of machine, each kind of friction, each kind of wind may all be different, and we cannot exhaust all the possible mixtures of noise. At that point we thought of training a model with a large amount of data: mix the collected noise into human voices and let the model keep learning. Call it "practice makes perfect", around 2020. With a large enough number of data samples the model learns enough knowledge and becomes more robust to noise, without having to decompose the noise types one by one.

Following this line of thinking, there are already many deep learning models that achieve such noise suppression while keeping a suppression effect across different kinds of noise.


Many noises do not exist alone; there is compound noise. In a coffee shop, for example, all kinds of sounds are mixed with the voices of various people chatting. We call this background of many voices Babble noise; babble is the murmur of a crowd, and you want to remove this background as well. When multiple sounds are mixed together, the spectrum looks as if a flood has swept through: everything is mixed in and very hard to remove. A traditional algorithm will keep the obvious human voice, but at higher frequencies the aliasing is more severe and hard to separate, so it tends to remove everything above about 4 kHz as noise. These are some shortcomings of traditional noise reduction methods.

For any noise reduction method, deep learning based or not, there are two main criteria for judging its quality:

First, how well is the original voice preserved; is the damage to the speech spectrum as small as possible?

Second, is the noise removed as cleanly as possible?

Judged on these two points, the deep learning result on the right preserves the spectrum at high frequencies and does not leave noise mixed in between.

02. Algorithm design based on deep learning

Now let's look at how to design a method based on deep learning.


Like other deep learning tasks, the design includes the following steps.

The first step is deciding what kind of input to feed the model. The input can be chosen: the sound signal can be given as a raw waveform, as a spectrum, as higher-level MFCC features, or even in the Bark domain based on psychoacoustic hearing thresholds. Different inputs determine the structure of the model. If the input is a spectrum, the model may treat it like an image and process it in a CNN-like way; since sound has temporal continuity, you can also work directly on the waveform. Different model structures are chosen accordingly, but we found that on mobile devices we are also limited by computing power and storage space, so the final solution may be a combination of models rather than a single model. Model selection therefore takes some thought, and another important aspect is choosing suitable data to train the model.

The training process itself is relatively simple: mix the speech signal and the noise signal, feed the mixture to the model, and have the model give you back the clean speech signal. At this point I must decide whether my data covers the different languages. As mentioned in the previous session, the phonemes of different languages differ; for example, Chinese has five or six more phonemes than Japanese, and English has five or six phonemes that differ from Chinese, so multilingual data may be needed to cover them. Gender also differs: if the training corpus is not balanced, the noise reduction ability may be biased between male and female voices. There are also choices about the types of noise; since it is impossible to exhaust all noise, typical noises are selected. So, roughly: choose the features, design the model, and prepare the data. Let's look at which aspects of each deserve attention.
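A typical way to build such training pairs is to mix clean speech with noise at a chosen signal-to-noise ratio. The sketch below is a generic recipe of this kind, an assumption about the data pipeline rather than the speaker's code; it scales the noise so the mixture reaches a target SNR.

```python
# Sketch of building a (noisy, clean) training pair at a target SNR.
# A generic recipe, not the speaker's actual data pipeline.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Loop or trim the noise so it matches the length of the speech.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    speech_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise
    return noisy, clean          # model input and training target

# Example: the same utterance mixed at several SNRs for augmentation.
# pairs = [mix_at_snr(clean, noise, snr) for snr in (0, 5, 10)]
```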


Let's take a look at what kind of input we choose to give the model.

The first consideration was to do end-to-end processing of the raw wave signal, producing a wave signal directly. This idea was rejected at first, because the wave signal is tied to its sampling rate: at a 16 kHz sampling rate, a 10-millisecond frame already contains 160 samples, so the amount of data is very large, and feeding it in directly means only a larger model can handle it. We therefore considered converting to the frequency domain first and reducing the amount of input data there. Before 2017-2018 this was indeed done in the frequency domain, but in 2018 the TasNet model showed that an end-to-end noise reduction effect can be produced in the time domain.

Frequency-domain methods came earlier. Noise removal used to be done in the frequency domain, with the noise problem solved in the form of a mask: the energy belonging to the noise is removed and only the energy of the human voice is retained.
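As a concrete example of the mask formulation, the sketch below computes an ideal ratio mask from a (clean, noise) pair as a training target and applies a mask to a noisy spectrum at inference time; it is an illustration, and the names and parameters are my own.

```python
# Sketch of the mask formulation (illustration; names and parameters are mine).
import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_mask(clean, noise, fs=16000):
    # Training target: per time-frequency bin, the fraction of energy
    # that belongs to speech.
    _, _, S = stft(clean, fs=fs, nperseg=320, noverlap=160)
    _, _, N = stft(noise, fs=fs, nperseg=320, noverlap=160)
    return np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)

def apply_mask(noisy, mask, fs=16000):
    # At inference the model predicts `mask`; multiplying keeps the speech
    # energy in each bin and suppresses the noise energy.
    _, _, Y = stft(noisy, fs=fs, nperseg=320, noverlap=160)
    _, enhanced = istft(mask * Y, fs=fs, nperseg=320, noverlap=160)
    return enhanced
```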

In 2019 a paper made a comparison: both the time domain and the frequency domain can achieve a good noise reduction effect, with comparable model complexity. So the choice of input signal does not, to a large extent, determine the computing power or the effect of your model; either is fine.

On this basis, if both the time domain and the frequency domain are workable, we may want to use higher-level features such as MFCC to reduce the model's computing power further. This is something to consider at the start of model design: under a computing-power limit, going from 200-plus frequency points down to only 40 MFCC bins shrinks the input. And because hearing has masking effects, dividing the spectrum into sufficiently small sub-bands can still achieve the desired noise suppression, so this is also an effective way to reduce the model's computation.
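For this input-compression step, here is a small sketch, assuming librosa is available, that reduces 16 kHz frames from 257 FFT bins to 40 mel or MFCC features; the file name and parameters are placeholders.

```python
# Sketch of compressing the model input from raw FFT bins to 40 features.
# Assumes librosa is available; the file name and parameters are placeholders.
import numpy as np
import librosa

y, sr = librosa.load("noisy_speech.wav", sr=16000)

# A 512-point FFT at 16 kHz gives 257 frequency bins per 10 ms frame ...
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))
print(spec.shape)                      # (257, frames)

# ... compressed to 40 mel bands (a Bark-like perceptual grouping) or 40 MFCCs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=40)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                            n_fft=512, hop_length=160)
print(mel.shape, mfcc.shape)           # (40, frames) each
```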


So much for the input signal. When choosing the model structure, there are also many considerations around its computing power. Compute complexity and parameter count can be laid out on an X-Y plot to show how they relate. In CNN-style methods, because of convolution, many operators can be reused: the convolution kernel is shared across the whole spectrum. So for the same parameter budget its computational complexity is the highest among these structures, because the parameters are multiplexed while the parameter count itself is very small. If a mobile app has a limit on parameter size, for example the app cannot exceed 200 MB and the space given to the model is only 1-2 MB, then try to choose a CNN model.

If the parameter count is not the main limitation but computing power is, for example a weak chip running at only 1 GHz, then the convolutional approach is not suitable. In that case some linear layers may be used instead; a linear layer is just a matrix multiplication, and matrix multiplication is not very expensive on many DSP chips and traditional CPUs. The disadvantage is that the operators are not reusable, so the parameter count is relatively large, though the computation may be smaller. But using only linear layers, as in a plain DNN, means both large parameters and large computation.

As mentioned earlier, speech is continuous in time, so you can use an RNN, which has short-term or long-term memory, to track the current noise state through real-time adaptation; this can further reduce the computing power required.

In general, when choosing a model, use linear layers as little as possible, since they bring a large increase in parameters and computation. You can also combine these different structures, for example use a CNN first and then an RNN, the CRN form: the first stage compresses the dimensionality of the input, and the long short-term memory stage then further reduces the model's computation.
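Here is a minimal CRN-style sketch in PyTorch, my own illustration of the CNN-then-RNN idea rather than Agora's model: convolutions compress the frequency axis, a unidirectional GRU tracks time causally, and a linear layer outputs a per-bin mask.

```python
# Minimal CRN-style sketch (illustration of CNN + RNN, not Agora's model).
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        # Convolutions compress the frequency axis; the kernels are shared
        # across the spectrum, so the parameter count stays small.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 5), stride=(1, 2), padding=(0, 2)),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=(1, 5), stride=(1, 2), padding=(0, 2)),
            nn.ReLU(),
        )
        reduced = freq_bins // 4 + 1   # frequency size after two stride-2 convs
        # Unidirectional GRU: causal, frame by frame, no future information.
        self.rnn = nn.GRU(32 * reduced, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())

    def forward(self, spec):           # spec: (batch, time, freq_bins)
        x = spec.unsqueeze(1)          # (batch, 1, time, freq)
        x = self.encoder(x)            # (batch, 32, time, reduced)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)
        return self.mask(x) * spec     # masked magnitude spectrum

# x = torch.randn(2, 100, 257).abs(); print(TinyCRN()(x).shape)  # (2, 100, 257)
```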

Different scenarios also matter. For offline processing, a bidirectional network may work best; in an RTC scenario you cannot add delay, so a unidirectional network such as an LSTM is more suitable. If you want to reduce the computation further and the three-gate LSTM is still too large, you can use a two-gate GRU, and so on, improving the algorithm in these details.


How to choose the model structure depends on the usage scenario and the available computing power. The other part is how to choose the data fed to the model. One aspect is the quality of the corpus: you need a sufficiently large and clean corpus covering different languages and genders, and since recordings may themselves contain noise, try to choose a relatively pure corpus recorded in a studio's anechoic room. That way the reference target is purer and the result will be better.

Another aspect is whether you can cover the noise. Noise is endless, so you choose some typical noises for your scenario, for example office voices and mobile-phone alert tones for a meeting scenario, as training corpus. In fact many noises are combinations of simple noises, and when the number of simple noises is large enough the robustness of the model improves, and even some unseen noises can be covered. If some noise cannot be collected, you can make it yourself and synthesize it artificially. For example, fluorescent tubes with their glow-discharge noise, or 50 Hz mains current constantly radiating 50 Hz and 100 Hz harmonic noise: this kind of noise can be added to the training set artificially to improve the robustness of the model.
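The mains-hum example can be synthesized in a few lines. Here is a small sketch, with illustrative parameters of my own choosing, that generates 50 Hz hum with its harmonics for augmentation.

```python
# Sketch: synthesize 50 Hz mains hum plus harmonics for data augmentation.
# Parameters are illustrative.
import numpy as np

def mains_hum(duration_s=5.0, fs=16000, base=50.0, n_harmonics=4):
    t = np.arange(int(duration_s * fs)) / fs
    hum = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        # Each harmonic (50, 100, 150, 200 Hz, ...) decays in amplitude.
        hum += (1.0 / k) * np.sin(2 * np.pi * base * k * t)
    hum += 0.01 * np.random.randn(len(t))   # a little broadband floor
    return hum / np.max(np.abs(hum))

# The result can be mixed into clean speech, e.g. with the mix_at_snr
# helper sketched earlier, to create additional training pairs.
```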

03. The dilemma of RTC on mobile devices

Assuming we already have a good model, what difficulties will we encounter when deploying it?


In real-time interactive scenarios, first of all, unlike offline processing, higher real-time performance is required: computation must proceed frame by frame, non-causal processing is not available, and future information cannot be used. In this setting, bidirectional neural networks cannot be used.

In addition, the model must adapt to different phones and different mobile devices, which is constrained by the computing power of various chips. If you want the model to run on a wider range of devices, there are limits on its computing power, and the parameter size must not be too large either: when the chip runs the model, even if the computation is not very high, the IO of reading a large number of parameters will also affect the model's final performance.

The richness of scenarios was also mentioned earlier: different languages such as Chinese, English and Japanese, and different types of noise. In a real-time interactive scenario it is impossible for everyone to be saying the same thing in the same setting, so the richness of scenarios also has to be considered.

04. How to deploy on mobile devices

Given these constraints, how do we put deep learning into practice? We can tackle the problems from two directions.


First, breakthroughs can come from the algorithm itself. As I just mentioned, fully convolutional and fully linear structures trade parameters against computation differently. By combining different models, structures with different computing budgets can be assembled; the effect may differ somewhat from one combination to another, and which combination fits which scenario is something this kind of model structure can answer. Overall it is a combined algorithm whose computing power is shaped through the combination of models, trying to meet the chip and storage-space requirements.

Second, the scenarios the algorithm serves differ, so different models are chosen for them. If the scenario can be fixed at the start, for example a meeting scenario where music or animal sounds will not appear, then those noise categories need not receive special attention, and this can guide how the model is trimmed.


Even so, the algorithm may still yield a model with 5-6 MB of parameters, which you may feel is not small enough. Or its computation may not be optimized for the mobile device, and there may be problems with memory access and on-chip cache that affect the results during inference and actual use: it clearly runs fine during training, but runs differently when deployed on different chips.

There are also breakthroughs to be made in engineering, mainly in how model inference is handled. First, some operators can be optimized at the model level. When training and building the model, layers are added one by one, but many operators can be fused, including operator fusion and related optimizations. The parameters can also be pruned and quantized, which further reduces the model's computation and parameter size.
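As a concrete example of pruning and quantization, the PyTorch sketch below prunes the smallest weights of a stand-in model and applies dynamic int8 quantization to its linear and GRU layers; these are stock PyTorch utilities, not Agora's toolchain, and the model is a placeholder.

```python
# Sketch: magnitude pruning + dynamic int8 quantization with stock PyTorch
# utilities (not Agora's toolchain). The model is a stand-in.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

class TinyDenoiser(nn.Module):
    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(freq_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, freq_bins)

    def forward(self, spec):                 # spec: (batch, time, freq_bins)
        x, _ = self.rnn(spec)
        return torch.sigmoid(self.out(x)) * spec

model = TinyDenoiser()

# Prune 30% of the smallest weights in the output layer (magnitude pruning).
prune.l1_unstructured(model.out, name="weight", amount=0.3)
prune.remove(model.out, "weight")            # bake the zeros into the weights

# Dynamic quantization: Linear/GRU weights stored as int8 and dequantized on
# the fly, roughly a 4x reduction in parameter size.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.GRU}, dtype=torch.qint8
)

spec = torch.randn(1, 100, 257).abs()
print(quantized(spec).shape)                 # still (1, 100, 257)
```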

The first step, pruning and quantizing the model, can already make the model well suited to the scenario. In addition, the chips differ across mobile devices: some phones only have a CPU, while others have GPUs, NPUs, or even DSP chips whose computing power can also be opened up.

To adapt better to the chips, there are different inference frameworks, and various companies offer open-source frameworks to use, such as Apple's Core ML and Google's TensorFlow Lite, which handle chip scheduling and compile-level optimization internally. At this stage the difference between doing this and not doing it is huge, because how the whole algorithm is expressed is one thing, and how memory access, matrix computation and floating-point computation are carried out is another; with this engineering optimization the effect can improve a hundredfold. The optimization can be done with an open-source framework, or you can do the compilation optimization yourself: if you are familiar with the chip, for example how to use its different caches and what their sizes are, your own implementation may be even more targeted than an open-source framework and achieve a better result.
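For the framework route, a typical TensorFlow Lite conversion looks like the sketch below; it assumes a SavedModel version of the denoiser exists, and the paths are placeholders.

```python
# Sketch: converting a trained denoiser to TensorFlow Lite for on-device
# inference. The SavedModel path and file names are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("denoiser_saved_model")
# Default optimizations enable post-training quantization of the weights.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("denoiser.tflite", "wb") as f:
    f.write(tflite_model)

# On the device, the interpreter then runs the model frame by frame.
interpreter = tf.lite.Interpreter(model_path="denoiser.tflite")
interpreter.allocate_tensors()
```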

Once we integrate the model and the inference engine, we get the final product: one that adapts to almost all devices and is fully engineered for all chips, so that it can be used in real time.

05. Noise reduction demo

Let's listen to what the noise reduction effect sounds like.


[Because the platform cannot upload audio, interested developers can listen via the link.]

06. Can we do it better?

After listening to these demos, let's see what can be done to make the effect better and to cover more scenarios.


We still have many difficult problems to solve, including the preservation of music. If you enable noise reduction in a music scenario, you will find that the accompaniment is gone and only the voice remains. Such scenarios may need more refined handling, for example source separation; instrument sounds that cannot be preserved because some music is treated like noise are an area that is harder to solve. Another is voice-like noise, such as Babble noise: this kind of background chatter is sometimes hard to distinguish from the target voice, especially in the cocktail-party situation where everyone is talking, and it is hard for AI to decide which speaker is actually the effective one. Also, what we do is all single-channel noise suppression; microphone arrays could be used for some directional noise reduction, but this is also a harder area. Which sound is worth keeping, and how to separate the human voice from the background, remain difficult directions, and they are also clear directions that we will explore in the future.

That's all for my sharing, thank you all.

Cover image from creativeboom.com


