Author:
Foreword
Optimizing audio quality is a complex systems-engineering problem, and noise reduction is an important part of it. After decades of development, traditional noise reduction technology has hit a bottleneck, especially in suppressing non-stationary noise, and it is increasingly unable to meet the needs of new scenarios. In recent years, the rise of AI technology represented by machine learning and deep learning has brought new solutions for audio noise reduction in special scenarios. With the growth of online audio and video live-streaming services, Agora has gradually built up its own expertise in this area. This article, produced by the Agora audio technology team, is part of a series on audio evaluation in special scenarios: AI noise reduction. Since the industry has not yet settled on audio evaluation standards, Agora's practice focuses on the engineering path from intrusive (full-reference) to non-intrusive (no-reference) evaluation.
Background
As developers, we want to give users a real-time interactive experience with high clarity, smoothness, and high-fidelity sound quality, but ever-present noise disturbs people during calls. Different settings produce different noises. Noise can be stationary, non-stationary, or transient. Stationary noise does not change over time, such as white noise; non-stationary noise changes over time, such as human speech or road noise. Transient noise is a subclass of non-stationary noise: short-duration, intermittent sounds such as keyboard clicks, knocks on a desk, or a door closing. In a real interactive scenario where two parties talk over mobile devices and one party is in a noisy environment such as a restaurant, a busy street, a subway, or an airport, the other party receives a voice signal heavily contaminated by noise. When the noise is too loud, neither party can hear the other clearly, which easily breeds irritation and negative emotions and degrades the end-user experience. Therefore, to reduce the interference of noise with the voice signal and make calls more pleasant, we apply noise suppression (NS): filtering the noise out of the noisy speech signal while preserving the speech signal as much as possible, so that what both parties hear is undisturbed by noise. An ideal NS technology removes noise while retaining the clarity, intelligibility, and comfort of the voice.
Research on noise reduction began in the 1960s, and decades of development have brought great progress. We roughly divide noise reduction algorithms into the following categories.
(1) Subspace methods. The basic idea is to map the noisy speech signal onto a signal subspace and a noise subspace; the clean speech signal can then be estimated by eliminating the noise-subspace components and retaining the signal-subspace components.
(2) Short-term spectral subtraction. This method assumes the noise is stationary and changes slowly, and subtracts an estimated noise spectrum from the spectrum of the noisy signal to obtain the denoised speech signal.
(3) Wiener filtering. The algorithm estimates the speech signal with a Wiener filter under the minimum mean-square-error criterion and then extracts the speech from the noisy signal.
(4) Methods based on the auditory masking effect. These simulate the perceptual characteristics of the human ear, determine the lowest noise-energy threshold the ear can perceive at a given frequency and time, and keep the noise energy below that threshold, thereby masking residual noise as much as possible while preventing speech distortion.
(5) Methods based on noise estimation. These generally rely on differences between noise and speech characteristics, distinguishing noise components from speech components via VAD (Voice Activity Detection) or a speech-presence probability. When the characteristics are similar, such algorithms often cannot accurately separate the speech and noise components of noisy speech.
(6) AI noise reduction. AI noise reduction can, to some extent, solve problems that traditional techniques cannot, such as certain transient noises (short-duration, high-energy noise such as a door closing or knocking). Its advantages are even more pronounced for non-stationary noises (noise that changes quickly over time with unpredictable random fluctuations, such as a busy street).
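To make method (2) above concrete, here is a minimal spectral-subtraction sketch in NumPy. This is not Agora's implementation; the frame length, hop size, spectral floor, and test signal are all illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=512, hop=256, floor=0.02):
    """Denoise by subtracting an average noise magnitude spectrum per frame."""
    window = np.hanning(frame_len)
    # Estimate the noise magnitude spectrum from a noise-only recording.
    noise_mags = [np.abs(np.fft.rfft(noise_only[i:i + frame_len] * window))
                  for i in range(0, len(noise_only) - frame_len + 1, hop)]
    noise_mag = np.mean(noise_mags, axis=0)

    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[i:i + frame_len] * window)
        mag, phase = np.abs(spec), np.angle(spec)
        # Clamp to a spectral floor to limit "musical noise" artifacts.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        # Overlap-add resynthesis with the noisy phase (Hann, 50% overlap).
        out[i:i + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return out

# Demo on synthetic data: a 440 Hz tone in white noise (16 kHz, 1 s).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * rng.standard_normal(16000)
denoised = spectral_subtraction(noisy, 0.3 * rng.standard_normal(16000))
```

The sketch also illustrates why the method struggles with non-stationary noise: the subtracted noise spectrum is a fixed average, so noise that changes faster than the estimate leaves audible residue.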
Whether we use traditional NS or AI NS, we must consider package size and compute cost when the product ships, so that it can run on mobile and IoT devices; in other words, the model must be lightweight. This is also one of the most challenging aspects of bringing a real product to market. Assuming the model size can be kept in check after launch, can NS performance still meet the bar? Here we focus on how to evaluate NS performance: when tuning NS parameters, refactoring NS, proposing new NS algorithms, or comparing different NS implementations, how do we evaluate NS technology from the perspective of user experience?
First, we classify NS evaluation methods into subjective tests and objective tests; objective tests are further divided into intrusive and non-intrusive, also called full-reference and no-reference. The table below explains their meaning, advantages, and disadvantages.
Method | Meaning | Advantages and disadvantages
---|---|---
Subjective test | Human listeners, under preset principles, give grade opinions or comparison results on speech quality, reflecting the listener's subjective impression of it. Typically Absolute Category Rating (ACR) is used, yielding a Mean Opinion Score (MOS): with no reference speech, the listener hears only the degraded speech and rates it from 1 to 5. | Advantages: directly reflects user experience. Disadvantages: high labor cost, long test cycle, poor repeatability, and affected by individual subjective differences.
Objective test (intrusive) | Predicts the subjective MOS from some form of distance feature between the reference speech and the test speech. For example, most of the literature evaluates NS algorithms with PESQ, signal-to-noise ratio, segmental SNR, Itakura distance, and so on. | Advantages: automated batch testing, saving labor and time. Disadvantages: (1) not fully equivalent to the user's subjective experience; (2) most objective metrics support only a 16 kHz sampling rate; (3) the reference and test signals must be frame-aligned, but real-time RTC audio is inevitably affected by the network, so the data cannot be frame-aligned, which directly hurts the accuracy of objective metrics.
Objective test (non-intrusive) | Predicts speech quality from the test speech alone. | Advantages: quality can be predicted without the original reference signal, enabling real-time evaluation of RTC audio quality. Disadvantages: technically demanding, and the model is hard to build.
We believe subjective tests directly reflect user experience, and when objective test results agree with subjective results, the objective test is validated; at that point the objective test can also stand in for user experience. Let's look at how Agora evaluates NS performance.
Agora's NS Evaluation
We are building a comprehensive NS evaluation system that remains reliable over the long term. We believe it can handle any future noisy scenario (currently covering more than 70 noise types) and any NS technology, and we do not prescribe a specific test corpus: any speaker's speech, at any sampling rate and effective bandwidth, can serve as test material. With this goal as our starting point, we verified existing NS evaluation techniques and found that they cover neither all of our call scenarios nor all of the noise types we test, and they do not represent subjective perception. We therefore fit a new full-reference NS metric and, in parallel, built a no-reference model with deep learning. Below we briefly describe the existing NS evaluation metrics, our verification method, and how we built the full-reference and no-reference NS evaluation models.
1. Existing NS evaluation metrics : Each metric tracks the change in some audio feature before and after NS, and each measures NS performance from a different angle. This naturally raises questions: can these metrics be equated with subjective perception? Beyond being algorithmically sound, how do we ensure they are consistent with subjective judgments? If the objective metrics look fine, will the audio necessarily sound fine? And how do we ensure these metrics cover all our scenarios?
2. Our verification method : To verify the accuracy of the objective metric library we built and its correlation with subjective experience, we ran a crowdsourcing-based subjective audio test and developed a dedicated app for crowdsourced subjective audio annotation. Throughout the process we followed P.808, P.835, and the NS Challenge, setting requirements for test data, duration, environment, equipment, testers, and so on. We focus on three dimensions: speech clarity (SMOS), noise comfort (NMOS), and overall quality (GMOS), each scored from 1 to 5. The descriptions corresponding to each MOS score and the app's page design are given below.
So how well do the subjective annotations correlate with the metrics in the objective metric library mentioned earlier? We computed statistics for all metrics in the library; here we give only the PLCC (Pearson linear correlation coefficient) between PESQ and the subjective annotations:
PLCC | PESQ |
---|---|
Subjective SMOS | 0.68 |
Subjective NMOS | 0.81 |
Subjective GMOS | 0.79 |
The subjective SMOS, NMOS, and GMOS here are averages computed over 200 clips, with each clip annotated by 32 people.
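For clarity, the PLCC reported above is the plain Pearson correlation between per-clip subjective means and objective scores. A minimal computation looks like this; the score arrays below are made-up illustrations, not our annotation data.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

# Hypothetical per-clip scores: each subjective MOS would be the mean of
# 32 raters' scores; the numbers here are illustrative only.
pesq = [1.8, 2.4, 3.1, 3.6, 4.2]
nmos = [2.0, 2.2, 3.3, 3.5, 4.1]
print(round(plcc(pesq, nmos), 3))  # → 0.981
```

A PLCC near 1 means the objective metric ranks and spaces clips almost exactly as the listeners did; values like the 0.68 SMOS correlation above indicate PESQ tracks speech clarity only loosely.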
3. How we built the full-reference and no-reference NS evaluation models : As subjectively labeled data accumulated, we found that the accuracy of the existing metrics was insufficient to cover all of our scenarios and noise types, let alone represent subjective perception. We therefore fit a new composite MOS score to evaluate NS performance.
Our first solution is a full-reference model: the metrics in the objective metric library serve as input features, and the crowdsourced annotations serve as labels, to train three models whose outputs score speech, noise, and overall quality respectively.
The following experiment uses a data set of 800 clips: 70% of the data was randomly selected as the training set and 30% as the test set, and GBDT (Gradient Boosting Decision Tree) was chosen as the model for GMOS training and testing. The upper half of the figure below shows the training set's true GMOS against the trained model's predicted GMOS; the lower half shows the test set's true GMOS against its predicted GMOS. On the test set, the PLCC between true and predicted GMOS reaches 0.945, the SROCC (Spearman rank-order correlation coefficient) reaches 0.936, and the RMSE (root mean square error) is 0.26.
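An illustrative sketch of this setup (not our production pipeline) can be written with scikit-learn's `GradientBoostingRegressor`. The feature count, the label formula, and the random seed are all assumptions; the synthetic features stand in for the objective metric library and the synthetic labels stand in for crowdsourced GMOS.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the objective metric library: 800 clips x 6 metrics.
X = rng.uniform(size=(800, 6))
# Synthetic GMOS labels on a 1-5 scale, loosely driven by two metrics plus noise.
y = 1.0 + 4.0 * (0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.uniform(size=800))

# 70/30 random split, as in the experiment described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Evaluate on the held-out 30%: correlation and error vs. true labels.
plcc = float(np.corrcoef(y_te, pred)[0, 1])
rmse = float(np.sqrt(np.mean((y_te - pred) ** 2)))
print(f"PLCC={plcc:.3f} RMSE={rmse:.3f}")
```

The same pattern, with real objective-metric features and crowdsourced labels, is what produces test-set numbers like the PLCC/SROCC/RMSE reported above.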
Our second solution is a no-reference model. Full-reference objective metrics require frame alignment between the reference and test signals, but real-time RTC audio is inevitably affected by the network, so the data is not frame-aligned, which directly hurts the accuracy of those metrics. To avoid this, we are also building a no-reference SQA (Speech Quality Assessment) model. The core of the current approach is to convert the audio into a mel spectrogram, cut the spectrogram into segments, use a CNN to extract quality features from each segment, then model the feature sequence over time with self-attention so the features interact temporally, and finally use an attention model to compute each segment's contribution to the overall MOS score, mapping to the final MOS.
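The last stage of that pipeline, attention pooling of per-segment quality features into a single MOS, can be sketched as follows. The feature dimensions and the randomly initialized (untrained) projection weights are illustrative assumptions, and the CNN and self-attention stages are omitted; in the real model these weights would be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool_mos(segment_feats, w_score, w_attn, bias=3.0):
    """Map per-segment quality features to one MOS via attention pooling.

    segment_feats: (n_segments, d) features, e.g. CNN embeddings of
                   mel-spectrogram segments (random stand-ins here).
    w_score:       (d,) projection giving each segment a quality score.
    w_attn:        (d,) projection giving each segment an attention logit.
    """
    seg_scores = segment_feats @ w_score     # per-segment quality estimate
    attn = softmax(segment_feats @ w_attn)   # each segment's contribution
    mos = bias + float(attn @ seg_scores)    # weighted sum -> overall score
    return float(np.clip(mos, 1.0, 5.0)), attn

rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 16))  # 12 spectrogram segments, 16-d features
mos, attn = attention_pool_mos(feats,
                               0.1 * rng.standard_normal(16),
                               0.1 * rng.standard_normal(16))
```

The attention weights make the final score interpretable: a brief transient noise burst can dominate the MOS if its segments draw high attention, rather than being averaged away.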
Here we give the current training accuracy of the no-reference SQA model. The data set consists of 1,200 noisy clips, 70% for training and 30% for testing. The abscissa is the epoch; the blue line shows how the training loss changes with epochs, the red line shows the PLCC between predictions and labels on the training set as epochs increase, and the green line shows the same PLCC on the test set. The current offline results are promising, and we will add more scene data for model training in the future.
Future Work
In the future, we will move directly to end-to-end audio quality assessment (AQA), since noise is only one factor in audio that affects subjective experience. We will build a complete online real-time audio evaluation system that remains reliable and highly accurate over the long term, used to gauge how unpleasant or pleasant users find real-time audio interaction. The full process includes scheme design, data set construction, crowdsourced labeling (establishing labeling standards, cleaning and screening the labeled data, and verifying the data distribution), model training and optimization, and online feedback. Although we face challenges now, as long as a clear goal is set, it will be achieved.
About the Dev for Dev Column
Dev for Dev (Developer for Developer) is a developer co-creation initiative jointly launched by Agora and the RTC developer community. Through technology sharing, exchange of ideas, and project co-construction from the engineer's perspective, it gathers the power of developers, mines and delivers the most valuable technical content and projects, and fully releases the creativity of technology.