The most important thing in photography is the light: looking for the light, waiting for the light, and shooting the light. In outdoor audio and video recording, the equivalent is waiting for the sound, or more precisely, waiting for the noise to pass, especially transient noise such as the whizzing of a plane, a school bell, or a car horn.

Handling these requires intelligent noise reduction driven by AI algorithms.
Through model training, AI noise reduction can identify the transient noise that needs to be filtered out.

In the Rongyun RTC Advanced Practice Master Class on June 9, a Rongyun audio algorithm engineer gave a full presentation on AI noise reduction, covering noise reduction technology, AI noise reduction technology, and Rongyun's exploration and practice of AI noise reduction.

This article organizes and presents the content of that course; feel free to save and share it.

Reply [AI noise reduction] in the WeChat official account to receive the complete course slides.


Noise reduction technology

Noise is a relative concept: what counts as useful signal and what counts as noise differs from scenario to scenario.

For example, when people are talking in an environment with background music, the background music is noise and needs to be removed by noise reduction. In a live broadcast, however, the background music the host sings over is a useful signal: not only must it not be removed, it also needs to be preserved without distortion.

Therefore, we need to design noise reduction schemes according to different scenarios.

Noise reduction technology has been developed for many years, and each stage has produced typical algorithms and important breakthroughs: for example, the early linear filtering and spectral subtraction methods, and the later statistical model and subspace algorithms.

In recent years, noise reduction algorithms based on deep learning have developed rapidly; these are the AI noise reduction algorithms discussed in this article. The main ones are deep learning algorithms based on the amplitude spectrum, deep learning algorithms based on the complex spectrum, and, more recently, deep learning algorithms based on time-domain signals.

(Main noise reduction technologies at different stages)

Traditional algorithms are built by researchers who summarize the statistical behavior of noise into a model and then use that model to suppress background noise; they include linear filtering, spectral subtraction, statistical model algorithms, and subspace algorithms.

The linear filtering method filters out signals in a known frequency band with a high-pass (or similar) filter. For example, a 50 Hz interference can be removed with a high-pass filter whose cutoff frequency lies just above 50 Hz.
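
As a minimal illustration (not from the course material), the sketch below uses SciPy to apply a Butterworth high-pass filter that removes a 50 Hz hum; the cutoff frequency, filter order, and sampling rate are arbitrary choices for this example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def remove_low_freq_hum(x, sample_rate=48000, cutoff_hz=80, order=4):
    """Suppress a known low-frequency interference (e.g. 50 Hz hum)
    with a high-pass Butterworth filter; parameters are illustrative."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    # Zero-phase filtering so the speech is not delayed or phase-distorted.
    return sosfiltfilt(sos, x)
```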

Spectral subtraction records the noise spectrum during non-speech segments and then subtracts that noise spectrum from the noisy-speech spectrum to recover clean speech.
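
A minimal spectral-subtraction sketch (illustrative only, not the implementation from the course): the noise magnitude is averaged over leading frames assumed to contain no speech, then subtracted from every frame, with a floor to avoid negative magnitudes.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=20, floor=0.02):
    """Basic magnitude spectral subtraction; the first `noise_frames` STFT
    frames are assumed to be noise-only (an assumption of this sketch)."""
    _, _, Y = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Y), np.angle(Y)
    # Estimate the noise magnitude from the leading non-speech frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise spectrum and floor the result to keep it non-negative.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```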

Statistical model algorithms estimate the speech and noise components at each frequency bin using statistical methods.

The subspace algorithm maps the noisy speech into a signal subspace and a noise subspace, then estimates the truly useful speech signal by discarding the noise-subspace components and retaining the signal-subspace components.
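
For intuition only, here is a toy subspace sketch. It assumes windowed frames of the noisy signal and a known additive white-noise variance (a strong simplification): eigen-directions whose energy stays at or below the noise floor are treated as the noise subspace and discarded.

```python
import numpy as np

def subspace_denoise(frames, noise_var):
    """frames: (n_frames, frame_len) windowed time-domain frames.
    noise_var: assumed variance of additive white noise."""
    R_y = np.cov(frames, rowvar=False)            # covariance of the noisy signal
    eigvals, eigvecs = np.linalg.eigh(R_y)
    # Keep only directions whose energy exceeds the noise floor (signal subspace)
    # and apply a Wiener-like gain inside that subspace.
    gains = np.zeros_like(eigvals)
    signal_dirs = eigvals > noise_var
    gains[signal_dirs] = (eigvals[signal_dirs] - noise_var) / eigvals[signal_dirs]
    H = eigvecs @ np.diag(gains) @ eigvecs.T
    return frames @ H.T
```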

(Principle block diagram of noise reduction algorithm)

The figure above (left) shows the typical principle of traditional noise reduction methods.

The signal y(t) is passed through a short-time Fourier transform (STFT) to obtain the amplitude spectrum and phase spectrum of the noisy speech. Traditional methods usually focus on the amplitude spectrum: the Noise Estimator module estimates the noise from the amplitude spectrum, the Gain Estimator then computes the final gain value, the amplitude spectrum of the noisy speech is multiplied by the gain to obtain the enhanced amplitude spectrum, and this is finally combined with the phase spectrum of the noisy speech and passed through the inverse STFT (iSTFT) to obtain the enhanced speech.
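
A compact sketch of this gain-based pipeline, with a deliberately crude recursive noise tracker standing in for a real Noise Estimator (the smoothing factor, gain floor, and "speech absent" heuristic are all assumptions of this example):

```python
import numpy as np
from scipy.signal import stft, istft

def traditional_denoise(noisy, fs, alpha=0.98, min_gain=0.1):
    """Gain-based spectral denoising with a crude recursive noise estimate."""
    _, _, Y = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Y), np.angle(Y)
    noise_est = mag[:, 0].copy()          # initialize the noise estimate from frame 0
    gains = np.empty_like(mag)
    for i in range(mag.shape[1]):
        # Smooth recursive noise tracking, updated only when the frame looks
        # noise-dominated; this is why fast transient noise is hard to follow.
        update = mag[:, i] < 2.0 * noise_est
        noise_est = np.where(update,
                             alpha * noise_est + (1 - alpha) * mag[:, i],
                             noise_est)
        # Wiener-like gain from the estimated SNR, floored to limit speech distortion.
        snr = np.maximum(mag[:, i] ** 2 / (noise_est ** 2 + 1e-12) - 1.0, 0.0)
        gains[:, i] = np.maximum(snr / (snr + 1.0), min_gain)
    _, enhanced = istft(gains * mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```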

Since the noise estimation module usually tracks noise through smooth recursion, it has difficulty estimating non-stationary noise accurately. This is why AI noise reduction is introduced to further improve performance.

The figure above (right) is a schematic of AI-based noise reduction: features are extracted from the noisy speech, fed to the trained neural network, and the denoised (enhanced) speech is produced.

Its essence is to use the neural network model to learn the characteristics and differences between speech and noise, so as to remove the noise and retain the speech.


AI noise reduction

Research on AI noise reduction mainly covers three aspects.

The first is the model: from the earliest DNN networks to later RNNs, then CNNs, GANs, and, recently, Transformers. AI noise reduction models have evolved along with the development of deep learning models in general.

These neural network models are not only used in AI noise reduction, but are also widely used in speech recognition, speech synthesis, image processing, natural language processing and other fields.

The second is the training target, which generally falls into two categories: Mask-based and Mapping-based.

Mask targets mainly include the ideal binary mask (IBM), the ideal ratio mask (IRM), and the spectral magnitude mask (SMM), all of which use only amplitude information. The later phase-sensitive mask (PSM) was the first mask training target to incorporate phase information, and the subsequent complex ratio mask (CRM) enhances the real and imaginary parts simultaneously.
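
For concreteness, a small sketch of how the IBM and IRM training targets are built. It assumes access to the clean speech and noise STFTs, which is only possible when constructing targets from synthetic mixtures:

```python
import numpy as np

def ideal_masks(speech_spec, noise_spec, eps=1e-8):
    """Compute IBM and IRM from the complex STFTs of clean speech and noise."""
    s_pow = np.abs(speech_spec) ** 2
    n_pow = np.abs(noise_spec) ** 2
    # IBM: 1 where speech dominates the time-frequency bin, 0 otherwise.
    ibm = (s_pow > n_pow).astype(np.float32)
    # IRM: soft ratio of speech energy to total energy in each bin.
    irm = np.sqrt(s_pow / (s_pow + n_pow + eps))
    return ibm, irm
```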

Mapping targets first mapped the input to the amplitude spectrum, which recovers only the magnitude of the speech; later work mapped to the complex spectrum, using the complex-spectrum information of the speech signal; and more recent work maps the speech waveform directly, without any time-frequency transform.

The third aspect is the loss function.

The earliest is the minimum mean square error (MSE), which averages the squared difference between target and predicted values; variants such as LogMSE compute the same squared difference in the logarithmic domain.
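
As a simple illustration of MSE and its log-domain variant (a NumPy sketch; real training code would use a deep learning framework's tensor operations):

```python
import numpy as np

def mse_loss(pred_mag, target_mag):
    # Average of the squared difference between predicted and target magnitudes.
    return np.mean((pred_mag - target_mag) ** 2)

def log_mse_loss(pred_mag, target_mag, eps=1e-8):
    # Same squared error, computed on log-magnitudes so that quiet
    # time-frequency bins weigh as much as loud ones.
    return np.mean((np.log(pred_mag + eps) - np.log(target_mag + eps)) ** 2)
```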

SDR and SI-SDR losses are also widely used.

Neither MSE nor SDR can directly reflect the audio quality of speech.

Speech quality is usually evaluated with PESQ, STOI, and similar metrics, so using PESQ or STOI as the loss reflects speech quality and intelligibility more accurately. Because PESQ is not continuously differentiable, PMSQE is used in practice as a differentiable substitute.

Based on these different training targets, AI noise reduction methods mainly fall into the following types.


Mask class

The noisy speech signal is transformed to the time-frequency domain, a mask is computed for the noisy speech, and the mask is multiplied by the time-frequency spectrum of the noisy speech; this suppresses noise in each frequency band and yields the enhanced speech.

(Mask class method)

The training process is shown in the figure above: the noisy speech is passed through the STFT to obtain its amplitude spectrum → the amplitude spectrum is fed to the deep learning network to produce a mask → the predicted mask and the ideal mask are fed into the loss function module to obtain the loss value → the loss guides the network update, and training iterates until the model converges.

During inference, the noisy speech is transformed by the STFT to obtain the amplitude and phase spectra; the amplitude spectrum is processed by the trained network to obtain the mask, the mask is multiplied by the amplitude spectrum of the noisy speech to obtain the amplitude spectrum of the denoised speech, and this is combined with the phase information and passed through the iSTFT to obtain the time-domain waveform of the denoised speech.

The Mask approach is the earliest method. In principle it resembles the gain value used in traditional methods, except that the mask is inferred by a model. Like the traditional gain, the mask is also limited to a reasonable range, which keeps speech distortion small while still reducing noise effectively.
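
A minimal sketch of this inference path; `model` is a placeholder for any trained mask-predicting network, and clipping the mask to [0, 1] mirrors the range limiting mentioned above:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_inference(noisy, fs, model, nperseg=512):
    """Denoise with a trained mask-predicting model (placeholder callable)."""
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Y), np.angle(Y)
    mask = model(mag)                # network predicts a mask from the magnitude
    mask = np.clip(mask, 0.0, 1.0)   # keep the mask in a reasonable range
    enhanced_mag = mask * mag        # suppress noise per time-frequency bin
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return enhanced
```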

Mapping class

The Mapping approach does not compute an intermediate mask to derive the denoised speech spectrum; the deep learning network predicts the speech spectrum directly.

However, this approach cuts both ways: although the model can output the denoised speech spectrum directly, abnormal outputs become more likely, especially in scenes the model has never seen.

(Mapping class method)

The training process is shown in the figure above: the noisy speech is transformed by the STFT → the amplitude spectrum is passed through the deep learning network to obtain the enhanced speech → the enhanced speech and the clean speech are fed into the loss function module, and the resulting loss guides the model update until convergence.

During inference, the noisy speech is transformed by the STFT to obtain the amplitude and phase spectra; the amplitude spectrum is processed by the trained model to produce the denoised amplitude spectrum, which is then combined with the noisy-speech phase spectrum and transformed by the iSTFT to obtain the enhanced speech.
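
The corresponding inference sketch differs from the Mask sketch above only in that the network output is used directly as the denoised magnitude spectrum (again with a placeholder `model`):

```python
import numpy as np
from scipy.signal import stft, istft

def mapping_inference(noisy, fs, model, nperseg=512):
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Y), np.angle(Y)
    enhanced_mag = model(mag)        # network maps noisy magnitude -> clean magnitude
    _, enhanced = istft(enhanced_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return enhanced
```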

Mask and Mapping fusion

The Mask and Mapping fusion method follows the same core idea as the Mask method above in that the network still predicts a mask, but the loss is not computed on the mask itself: the predicted mask is first applied to obtain the denoised speech, and the loss is then computed between the denoised speech and the clean speech.

The reason is that the mask does not fully reflect how well the output matches the original speech: for the same loss value there can be multiple possible masks, and different masks yield different speech. Computing the loss on the speech itself is therefore closer to the real objective.

(Mask and Mapping fusion method)

The training process again starts by applying the Fourier transform to the noisy speech and taking the amplitude spectrum, which is fed to the network to obtain the mask; the mask is applied to obtain the enhanced speech, the enhanced speech and the clean speech are passed together through the loss computation module, and the resulting loss guides the model update.

The inference process is consistent with the calculation process of the Mask class.
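
To make the difference from a plain mask loss concrete, here is a sketch contrasting the two loss computations (NumPy, with squared error assumed as the distance measure):

```python
import numpy as np

def mask_loss(pred_mask, ideal_mask):
    # Plain Mask training: compare the predicted mask to the ideal mask.
    return np.mean((pred_mask - ideal_mask) ** 2)

def fused_loss(pred_mask, noisy_mag, clean_mag):
    # Mask + Mapping fusion: apply the predicted mask first, then compare the
    # resulting speech magnitude directly with the clean speech magnitude.
    denoised_mag = pred_mask * noisy_mag
    return np.mean((denoised_mag - clean_mag) ** 2)
```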

Complex Mapping

Methods based on the amplitude spectrum use only the amplitude information of the noisy speech and discard the phase, which places a bottleneck on such algorithms. Making use of the phase as well exploits all of the information in the signal and is more effective for noise suppression. The complex spectrum is therefore introduced; the complex Mapping method is explained here.

(Complex Mapping method)

In the training process, the noisy speech is processed by the STFT and passed through the network to obtain the complex spectrum of the enhanced speech; this is fed together with the clean speech into the loss function module, and the resulting loss drives continuous model updates until convergence.

In the inference stage, the noisy speech is processed by the STFT and fed to the model to obtain the complex spectrum of the denoised speech, which is then processed by the iSTFT to obtain the denoised speech.
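
A minimal complex-Mapping inference sketch; the `model` is a placeholder that predicts real and imaginary parts, and stacking them as two channels is just one common convention assumed here:

```python
import numpy as np
from scipy.signal import stft, istft

def complex_mapping_inference(noisy, fs, model, nperseg=512):
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    # Feed real and imaginary parts so the network can also correct the phase.
    features = np.stack([Y.real, Y.imag], axis=0)
    pred = model(features)                  # placeholder: returns shape (2, freq, time)
    enhanced_spec = pred[0] + 1j * pred[1]  # rebuild the complex spectrum
    _, enhanced = istft(enhanced_spec, fs=fs, nperseg=nperseg)
    return enhanced
```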

Waveform class

This type of approach puts almost all processing into the model, giving the model a lot of flexibility to learn.

The previous methods all operate in the time-frequency domain, while waveform methods use CNNs and similar structures to decompose and re-synthesize the data, letting the signal move to whatever domain the model converges in. Precisely because of this flexibility, we have less control over the model, and abnormal cases are more likely.

(Waveform class method)

The training and inference processes are shown in the figure above. It is worth noting that when actually choosing a method, the algorithm design and tuning should still be matched to the scenario and requirements.
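
A toy time-domain model in PyTorch illustrating the encode-separate-decode idea behind waveform methods (layer sizes are arbitrary and this is not Rongyun's network; the output length may differ slightly from the input because of framing):

```python
import torch
import torch.nn as nn

class TinyWaveformDenoiser(nn.Module):
    """Toy time-domain denoiser: 1-D conv encoder, a small separator that
    predicts a mask in the learned latent domain, and a transposed-conv decoder."""
    def __init__(self, channels=64, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.separator = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1), nn.Sigmoid(),
        )
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        latent = self.encoder(wav)
        mask = self.separator(latent)        # learned mask in the latent domain
        return self.decoder(latent * mask)   # back to a time-domain waveform
```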


Traditional noise reduction vs AI noise reduction


(Comparison of traditional noise reduction and AI noise reduction)

Noise suppression

For stationary noise, both traditional and AI noise reduction algorithms perform well.

But for non-stationary noise, whether continuous or transient, traditional methods do not perform well, and they are worst at handling transient noise. Non-stationary noise comes in many varieties whose patterns are hard to summarize, so it is difficult for traditional methods to model.

AI noise reduction, by contrast, can introduce large amounts of non-stationary noise into training so that the model learns its features, and thus achieves good results.

Speech distortion

Traditional noise reduction methods have difficulty estimating the noise level accurately, and over-estimating it causes speech distortion.

AI noise reduction introduces a wide variety of noise into the training set so that the model can estimate speech and noise fairly accurately; speech distortion is usually relatively small.

Algorithm robustness

Traditional methods perform fairly consistently in both seen and unseen environments, and their computational complexity is modest, so classic traditional noise reduction is still used in some scenarios.

AI noise reduction performs outstandingly in known environments, beyond what traditional methods can match, but in unknown environments there is some probability of unsatisfactory results. As AI noise reduction technology develops, its robustness should keep improving.

Music scenes

Using traditional noise reduction directly causes serious damage to music signals, because its noise-tracking principle cannot distinguish music from background noise well. AI noise reduction, by expanding the training data with music-and-noise examples, gives the model the ability to tell music from noise and achieves good results.

Low signal-to-noise ratio

Traditional algorithms have difficulty estimating the noise level accurately at low signal-to-noise ratios and are more likely to produce higher speech distortion and more residual noise; AI noise reduction can improve performance in low-SNR scenes by training with data at multiple signal-to-noise ratios, including low ones.

(Noisy speech)

As shown in the figure above, this is a comparison of results under low signal-to-noise ratio + non-stationary noise.

In the spectrogram of the noisy speech, the low-frequency speech spectrum is hard to make out and the signal-to-noise ratio is very low; listening to it, obvious noise is audible and some of the speech is unclear.

(traditional noise reduction effect)

In the spectrogram processed by the traditional noise reduction algorithm, residual noise is clearly visible, and some noise can obviously still be heard.

(The effect of AI noise reduction)

In the spectrogram after AI noise reduction, no residual noise is apparent, and the audio quality is better than with the traditional algorithm.

How do traditional and AI noise reduction algorithms fare in live-music scenes?

Below is the original speech + music audio recorded in a normal environment.


(original audio)

The spectrogram shows a continuous, high-energy spectrum in which music and voice are hard to tell apart; both music and vocals can be heard in the audio.

(traditional noise reduction effect)

The traditional noise reduction algorithm causes obvious damage in the frequency domain, and the damage is also audible.

So, how effective is the AI noise reduction algorithm? Let's take a look at Rongyun's practice in this regard.


Rongyun AI noise reduction practice

This section mainly discusses the AI noise reduction scheme for the full-band audio live-streaming scenario.

Scenario challenges

First, full-band audio live streaming requires a 48 kHz sampling rate so that the perceived audio quality does not drop noticeably. This effective band is wider than the 16 kHz audio commonly used in academic AI noise reduction, and it demands more complex algorithms and models.

Second, the music signal must be preserved. Music is more complex than speech: speech is dominated by the human voice, while music contains a wide variety of instruments, so the difficulty escalates.

Third, open-source AI noise reduction algorithms that can serve as references for audio live-streaming scenarios are very hard to find; there are almost none.

Finally, because of the high sampling rate, relatively few open-source datasets are available.

(Solution 1)

Solution one

The core idea of this scheme is to combine traditional noise reduction with an AI music-detection algorithm. It keeps the advantages of traditional noise reduction while introducing the strengths of deep learning, so the algorithm remains stable and its performance improves substantially.

The blue box in the figure is the block diagram of the traditional noise reduction algorithm. Specifically, the amplitude spectrum of the noisy speech y(t) is obtained by the STFT; the quantile noise estimation, feature update, and speech presence probability modules produce the noise update value; the gain is then computed; and the enhanced audio is finally obtained through the iSTFT.

The yellow module is the AI music-detection module: the input audio is processed by the STFT and fed to an RNN-based music detector, whose output goes into the Noise Factor module to compute a factor that guides the noise update. This yields a noise estimate that effectively preserves the music signal.

Solution 1 effectively improves the fidelity of music signals without adding much computational complexity. For the detection network, the training data uses the target-signal dataset, with noise signals from multiple scenes as the background-noise set.
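
To make the idea concrete, here is a heavily hedged sketch (not Rongyun's actual implementation) of how a per-frame music-detection probability could slow the noise update so that music is not absorbed into the noise estimate; `music_prob` is assumed to come from the RNN-based detector:

```python
import numpy as np

def update_noise_estimate(noise_est, frame_mag, music_prob, base_alpha=0.9):
    """noise_est / frame_mag: per-bin magnitude spectra; music_prob in [0, 1] is
    the (hypothetical) music-detection output for the current frame."""
    # The more likely the frame contains music, the closer the smoothing factor
    # is pushed to 1, i.e. the noise estimate is updated more slowly.
    noise_factor = base_alpha + (1.0 - base_alpha) * music_prob
    return noise_factor * noise_est + (1.0 - noise_factor) * frame_mag
```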

(Solution two)

Solution two

The core idea is to complete the design with a fully AI-based noise reduction algorithm. The overall block diagram is as follows: the noisy data is transformed by the STFT, passed through a deep learning network, and either a mask or the speech itself is chosen as the target to complete the model training iterations.

The training data likewise uses speech and music signals as the target signals, with noise from various scenes as the background-noise set.

(original audio)


(traditional noise reduction)


(AI noise reduction)

The results show that the AI noise reduction algorithm preserves the music spectrum with higher fidelity than the traditional method.

Looking ahead, from the perspective of fusing traditional and AI noise reduction, Rongyun will further explore introducing deep-learning noise estimation modules and other approaches, to better exploit the strengths of both traditional and AI algorithms.

For the fully AI approach, research will continue on the core network (RNN, GAN, Transformer, etc.) as well as on the impact of different targets and losses.

At the same time, deep learning technology is still developing, and we will continue to explore AI noise reduction technology based on new models.

