Author:
Foreword
Optimizing audio quality is a complex systems-engineering problem, and noise reduction is an important part of it. After decades of development, traditional noise reduction technology has hit a bottleneck, especially in suppressing non-stationary noise, and it is increasingly unable to meet the needs of new scenarios. In recent years, the rise of AI technology represented by machine learning and deep learning has brought new solutions for audio noise reduction in special scenarios. With the growth of online audio and video live-streaming services, Agora has gradually built up its own expertise in this area. This article, produced by the Agora audio technology team, is part of a series on audio evaluation in special scenarios: AI noise reduction. Since the industry has not yet settled on audio evaluation standards, Agora's practice focuses on the engineering path from intrusive (full-reference) to non-intrusive (no-reference) evaluation.
Background
As developers, we want to give users a real-time interactive experience with high clarity, smoothness, and high-fidelity sound quality, but ever-present noise disturbs people during calls. Different settings produce different noises. Noise can be stationary, non-stationary, or transient. Stationary noise does not change over time, such as white noise; non-stationary noise changes over time, such as human speech or road noise. Transient noise is a subclass of non-stationary noise: short-duration, intermittent sounds such as keyboard clicks, knocks on a desk, or a door closing. In a real interactive scenario where two parties talk over mobile devices and one party is in a noisy environment such as a restaurant, a busy street, a subway, or an airport, the other party receives a voice signal heavily contaminated by noise. When the noise is too loud, neither party can hear the other clearly, which easily breeds irritation and negative emotions and degrades the end-user experience. Therefore, to reduce the interference of noise with the voice signal and make calls more pleasant, we apply noise suppression (NS): filtering the noise out of the noisy speech signal while preserving the speech signal as much as possible, so that what both parties hear is undisturbed by noise. An ideal NS technology removes noise while retaining the clarity, intelligibility, and comfort of the voice.
Research on noise reduction began in the 1960s, and decades of development have brought great progress. We roughly divide noise reduction algorithms into the following categories.
(1) Subspace methods. The basic idea is to map the noisy speech signal onto a signal subspace and a noise subspace; the clean speech signal can then be estimated by eliminating the noise-subspace components and retaining the signal-subspace components.
(2) Short-term spectral subtraction. This method assumes the noise is stationary and changes slowly, and subtracts an estimated noise spectrum from the spectrum of the noisy signal to obtain the denoised speech signal.
(3) Wiener filtering. The algorithm estimates the speech signal with a Wiener filter under the minimum mean-square-error criterion and then extracts the speech from the noisy signal.
(4) Methods based on the auditory masking effect. These simulate the perceptual characteristics of the human ear, determine the lowest noise-energy threshold the ear can perceive at a given frequency and time, and keep the noise energy below that threshold, thereby masking residual noise as much as possible while preventing speech distortion.
(5) Methods based on noise estimation. These generally rely on differences between noise and speech characteristics, distinguishing noise components from speech components via VAD (Voice Activity Detection) or a speech-presence probability. When the characteristics are similar, such algorithms often cannot accurately separate the speech and noise components of noisy speech.
(6) AI noise reduction. AI noise reduction can, to some extent, solve problems that traditional techniques cannot, such as certain transient noises (short-duration, high-energy noise such as a door closing or knocking). Its advantages are even more pronounced for non-stationary noises (noise that changes quickly over time with unpredictable random fluctuations, such as a busy street).
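To make method (2) above concrete, here is a minimal spectral-subtraction sketch in NumPy. This is not Agora's implementation; the frame length, hop size, spectral floor, and test signal are all illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=512, hop=256, floor=0.02):
    """Denoise by subtracting an average noise magnitude spectrum per frame."""
    window = np.hanning(frame_len)
    # Estimate the noise magnitude spectrum from a noise-only recording.
    noise_mags = [np.abs(np.fft.rfft(noise_only[i:i + frame_len] * window))
                  for i in range(0, len(noise_only) - frame_len + 1, hop)]
    noise_mag = np.mean(noise_mags, axis=0)

    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[i:i + frame_len] * window)
        mag, phase = np.abs(spec), np.angle(spec)
        # Clamp to a spectral floor to limit "musical noise" artifacts.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        # Overlap-add resynthesis with the noisy phase (Hann, 50% overlap).
        out[i:i + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return out

# Demo on synthetic data: a 440 Hz tone in white noise (16 kHz, 1 s).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(16000)
denoised = spectral_subtraction(noisy, 0.3 * rng.standard_normal(16000))
```

The sketch also illustrates why the method struggles with non-stationary noise: the subtracted noise spectrum is a fixed average, so noise that changes faster than the estimate leaves audible residue.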
Whether we use traditional NS or AI NS, we must consider package size and compute cost when the product ships, so that it can run on mobile and IoT devices; in other words, the model must be lightweight. This is also one of the most challenging aspects of bringing a real product to market. Assuming the model size can be kept in check after launch, can NS performance still meet the bar? Here we focus on how to evaluate NS performance: when tuning NS parameters, refactoring NS, proposing new NS algorithms, or comparing different NS implementations, how do we evaluate NS technology from the perspective of user experience?
First, we classify NS evaluation methods into subjective tests and objective tests; objective tests are further divided into intrusive and non-intrusive, also called full-reference and no-reference. The table below explains their meaning, advantages, and disadvantages.
Method | Meaning | Advantages and disadvantages
---|---|---
Subjective test | Human listeners, under preset principles, give grade opinions or comparison results on speech quality, reflecting the listener's subjective impression of it. Typically Absolute Category Rating (ACR) is used, yielding a Mean Opinion Score (MOS): with no reference speech, the listener hears only the degraded speech and rates it from 1 to 5. | Advantages: directly reflects user experience. Disadvantages: high labor cost, long test cycle, poor repeatability, and affected by individual subjective differences.
Objective test (intrusive) | Predicts the subjective MOS from some form of distance feature between the reference speech and the test speech. For example, most of the literature evaluates NS algorithms with PESQ, signal-to-noise ratio, segmental SNR, Itakura distance, and so on. | Advantages: automated batch testing, saving labor and time. Disadvantages: (1) not fully equivalent to the user's subjective experience; (2) most objective metrics support only a 16 kHz sampling rate; (3) the reference and test signals must be frame-aligned, but real-time RTC audio is inevitably affected by the network, so the data cannot be frame-aligned, which directly hurts the accuracy of objective metrics.
Objective test (non-intrusive) | Predicts speech quality from the test speech alone. | Advantages: quality can be predicted without the original reference signal, enabling real-time evaluation of RTC audio quality. Disadvantages: technically demanding, and the model is hard to build.
We believe subjective tests directly reflect user experience, and when objective test results agree with subjective results, the objective test is validated; at that point the objective test can also stand in for user experience. Let's look at how Agora evaluates NS performance.
Agora's NS Evaluation
We are building a comprehensive NS evaluation system that remains reliable over the long term. We believe it can handle any future noisy scenario (currently covering more than 70 noise types) and any NS technology, and we do not prescribe a specific test corpus: any speaker's speech, at any sampling rate and effective bandwidth, can serve as test material. With this goal as our starting point, we verified existing NS evaluation techniques and found that they cover neither all of our call scenarios nor all of the noise types we test, and they do not represent subjective perception. We therefore fit a new full-reference NS metric and, in parallel, built a no-reference model with deep learning. Below we briefly describe the existing NS evaluation metrics, our verification method, and how we built the full-reference and no-reference NS evaluation models.
1. Existing NS evaluation metrics : Each metric tracks the change in some audio feature before and after NS, and each measures NS performance from a different angle. This naturally raises questions: can these metrics be equated with subjective perception? Beyond being algorithmically sound, how do we ensure they are consistent with subjective judgments? If the objective metrics look fine, will the audio necessarily sound fine? And how do we ensure these metrics cover all our scenarios?
2. Our verification method : To verify the accuracy of the objective metric library we built and its correlation with subjective experience, we ran a crowdsourcing-based subjective audio test and developed a dedicated app for crowdsourced subjective audio annotation. Throughout the process we followed P.808, P.835, and the NS Challenge, setting requirements for test data, duration, environment, equipment, testers, and so on. We focus on three dimensions: speech clarity (SMOS), noise comfort (NMOS), and overall quality (GMOS), each scored from 1 to 5. The descriptions corresponding to each MOS score and the app's page design are given below.
So how well do the subjective annotations correlate with the metrics in the objective metric library mentioned earlier? We computed statistics for all metrics in the library; here we give only the PLCC (Pearson linear correlation coefficient) between PESQ and the subjective annotations:
PLCC | PESQ |
---|---|
Subjective SMOS | 0.68 |
Subjective NMOS | 0.81 |
Subjective GMOS | 0.79 |
The subjective SMOS, NMOS, and GMOS here are averages computed over 200 clips, with each clip annotated by 32 people.
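For clarity, the PLCC reported above is the plain Pearson correlation between per-clip subjective means and objective scores. A minimal computation looks like this; the score arrays below are made-up illustrations, not our annotation data.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Hypothetical per-clip scores: each subjective MOS would be the mean of
# 32 raters' scores; the numbers here are illustrative only.
pesq = [1.8, 2.4, 3.1, 3.6, 4.2]
nmos = [2.0, 2.2, 3.3, 3.5, 4.1]
print(round(plcc(pesq, nmos), 3))  # → 0.981
```

A PLCC near 1 means the objective metric ranks and spaces clips almost exactly as the listeners did; values like the 0.68 SMOS correlation above indicate PESQ tracks speech clarity only loosely.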
3. How we built the full-reference and no-reference NS evaluation models : As subjectively labeled data accumulated, we found that the accuracy of the existing metrics was insufficient to cover all of our scenarios and noise types, let alone represent subjective perception. We therefore fit a new composite MOS score to evaluate NS performance.
Our first solution is a full-reference model: the metrics in the objective metric library serve as input features, and the crowdsourced annotations serve as labels, to train three models whose outputs score speech, noise, and overall quality respectively.
The following experiment uses a data set of 800 clips: 70% of the data was randomly selected as the training set and 30% as the test set, and GBDT (Gradient Boosting Decision Tree) was chosen as the model for GMOS training and testing. The upper half of the figure below shows the training set's true GMOS against the trained model's predicted GMOS; the lower half shows the test set's true GMOS against its predicted GMOS. On the test set, the PLCC between true and predicted GMOS reaches 0.945, the SROCC (Spearman rank-order correlation coefficient) reaches 0.936, and the RMSE (root mean square error) is 0.26.
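An illustrative sketch of this setup (not our production pipeline) can be written with scikit-learn's `GradientBoostingRegressor`. The feature count, the label formula, and the random seed are all assumptions; the synthetic features stand in for the objective metric library and the synthetic labels stand in for crowdsourced GMOS.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the objective metric library: 800 clips x 6 metrics.
X = rng.uniform(size=(800, 6))
# Synthetic GMOS labels on a 1-5 scale, loosely driven by two metrics plus noise.
y = 1.0 + 4.0 * (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.uniform(size=800))

# 70/30 random split, as in the experiment described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Evaluate on the held-out 30%: correlation and error vs. true labels.
plcc = float(np.corrcoef(y_te, pred)[0, 1])
rmse = float(np.sqrt(np.mean((y_te - pred) ** 2)))
print(f"PLCC={plcc:.3f} RMSE={rmse:.3f}")
```

The same pattern, with real objective-metric features and crowdsourced labels, is what produces test-set numbers like the PLCC/SROCC/RMSE reported above.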
Our second solution is a no-reference model. Full-reference objective metrics require frame alignment between the reference and test signals, but real-time RTC audio is inevitably affected by the network, so the data is not frame-aligned, which directly hurts the accuracy of those metrics. To avoid this, we are also building a no-reference SQA (Speech Quality Assessment) model. The core of the current approach is to convert the audio into a mel spectrogram, cut the spectrogram into segments, use a CNN to extract quality features from each segment, then model the feature sequence over time with self-attention so the features interact temporally, and finally use an attention model to compute each segment's contribution to the overall MOS score, mapping to the final MOS.
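The last stage of that pipeline, attention pooling of per-segment quality features into a single MOS, can be sketched as follows. The feature dimensions and the randomly initialized (untrained) projection weights are illustrative assumptions, and the CNN and self-attention stages are omitted; in the real model these weights would be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool_mos(segment_feats, w_score, w_attn, bias=3.0):
    """Map per-segment quality features to one MOS via attention pooling.

    segment_feats: (n_segments, d) features, e.g. CNN embeddings of
                   mel-spectrogram segments (random stand-ins here).
    w_score:       (d,) projection giving each segment a quality score.
    w_attn:        (d,) projection giving each segment an attention logit.
    """
    seg_scores = segment_feats @ w_score     # per-segment quality estimate
    attn = softmax(segment_feats @ w_attn)   # each segment's contribution
    mos = bias + float(attn @ seg_scores)    # weighted sum -> overall score
    return float(np.clip(mos, 1.0, 5.0)), attn

rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 16))  # 12 spectrogram segments, 16-d features
mos, attn = attention_pool_mos(feats,
                               0.1 * rng.standard_normal(16),
                               0.1 * rng.standard_normal(16))
```

The attention weights make the final score interpretable: a brief transient noise burst can dominate the MOS if its segments draw high attention, rather than being averaged away.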
Here we give the current training accuracy of the no-reference SQA model. The data set consists of 1,200 noisy clips, 70% for training and 30% for testing. The abscissa is the epoch; the blue line shows how the training loss changes with epochs, the red line shows the PLCC between predictions and labels on the training set as epochs increase, and the green line shows the same PLCC on the test set. The current offline results are promising, and we will add more scene data for model training in the future.
Future Work
In the future, we will move directly to end-to-end audio quality assessment (AQA), since noise is only one factor in audio that affects subjective experience. We will build a complete online real-time audio evaluation system that remains reliable and highly accurate over the long term, used to gauge how unpleasant or pleasant users find real-time audio interaction. The full process includes scheme design, data set construction, crowdsourced labeling (establishing labeling standards, cleaning and screening the labeled data, and verifying the data distribution), model training and optimization, and online feedback. Although we face challenges now, as long as a clear goal is set, it will be achieved.
About the Dev for Dev Column
Dev for Dev (Developer for Developer) is a developer co-creation initiative jointly launched by Agora and the RTC developer community. Through technology sharing, exchange of ideas, and project co-construction from the engineer's perspective, it gathers the power of developers, mines and delivers the most valuable technical content and projects, and fully releases the creativity of technology.