Today I attended Agora's talk "The Past, Present and Future of Real-time Voice Quality Monitoring System". Below I share my own understanding of the topic, drawing on my earlier work experience in audio processing.

Audio and voice are two different things. Audio generally refers to all sounds in nature that humans can hear; the human ear typically covers a range of about 20 Hz to 20,000 Hz. Voice refers to the sound of human speech, and most of its spectral energy lies between 300 Hz and 3,400 Hz. In other words, people can hear a much wider range of sounds than they produce when speaking, which is why we can hear musical instruments, sounds of nature, and screams that we cannot produce in ordinary speech.

Quality evaluation is needed for several reasons. Outside of face-to-face communication, the audio in calls, videos, and music is compressed by a codec so that it can be transmitted more easily and stored at lower cost. Other processing removes noise from the original sound or enhances the original speech. Whether it is codec processing or some other kind of speech processing, the goal is to make the sound more comfortable to listen to, so a quality evaluation method measures how people perceive the sound after it has been processed.

Audio evaluation methods are divided into subjective evaluation and objective evaluation.

Subjective evaluation means that people score speech according to their own auditory perception. Commonly used methods are MOS, CMOS, and the ABX test. I often used A/B tests in my early work: after making a small optimization to a speech enhancement algorithm, I wanted to know whether the improvement was actually audible, so I grouped the speech processed by the original algorithm with the speech processed by the optimized one and asked friends to score them and judge whether the result was better or worse. The International Telecommunication Union (ITU) has standardized the subjective evaluation of voice quality in ITU-T P.800, with the MOS terminology defined in P.800.1. Among these methods, the Absolute Category Rating (ACR) of listening quality is widely used. Participants rate the overall quality of the voice on a scale from 1 to 5; the higher the score, the better the voice quality. The same MOS scale was later adopted by objective quality evaluation. Generally, a MOS of 4 or higher is considered good voice quality, while a MOS below 3.6 is basically unacceptable.
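
To make the ACR scoring concrete, here is a minimal sketch of how a MOS could be computed from a listening panel. The ratings are made up, the function names are my own, and the "borderline" label for scores between 3.6 and 4.0 is just my naming for the gap between the two rules of thumb above.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of ACR ratings (integers 1..5) into a MOS."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    return mean(ratings)


def quality_band(mos):
    """Label a MOS using the rule-of-thumb thresholds quoted in the text."""
    if mos >= 4.0:
        return "good"
    if mos >= 3.6:
        return "borderline"  # hypothetical label, not from any standard
    return "unacceptable"


# Hypothetical scores from eight listeners for one processed clip.
ratings = [4, 5, 4, 3, 4, 4, 5, 4]
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f} ({quality_band(mos)})")
```

In a real test the panel would be larger and a confidence interval would normally be reported alongside the mean.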

Objective evaluation uses algorithms in place of human listeners to score the quality of the sound. It is divided into evaluation with a reference (the intrusive method) and evaluation without a reference (the non-intrusive method).

  • Evaluation with a reference (the intrusive method), as the name implies, compares the processed signal against the original source material. It is therefore limited to offline processing and is not practical for live calls. Common examples are ITU-T P.861 (PSQM/MNB), ITU-T P.862 (PESQ) [2], ITU-T P.863 (POLQA) [3], STOI [4], and BSSEval [5].
  • Evaluation without a reference (the non-intrusive method) does not need the source material. Common examples are ITU-T P.563 [6], ANIQUE+ [7], ITU-T G.107 (E-Model) [8], and deep-learning-based models such as AutoMOS [9], QualityNet [10], NISQA [11], and MOSNet [12].
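
As an illustration of how a no-reference, parametric method ends up on the MOS scale, the E-Model produces a transmission rating R (roughly 0 to 100) from network and equipment parameters, and then converts it to an estimated MOS. The sketch below shows only that last conversion step, using the formula as I remember it from G.107; computing R itself involves many more terms and is not attempted here.

```python
def r_factor_to_mos(r):
    """Convert an E-Model transmission rating R to an estimated MOS.

    Conversion as commonly quoted from ITU-T G.107; treat the constants
    as an assumption and check them against the Recommendation.
    """
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# A few illustrative R values (93.2 is the often-cited default rating).
for r in (50, 70, 80, 93.2):
    print(f"R = {r:5.1f} -> estimated MOS = {r_factor_to_mos(r):.2f}")
```

Because R can be computed from measured delay, loss, and codec parameters alone, this kind of model needs no audio at all, which is what makes it usable during a live call.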

The following chart shows MOS test scores for mainstream voice codecs. It comes from the Opus official website; MOS9, which was released later, has a highest score of 9 points.

[Image: MOS test scores of mainstream voice codecs, from the Opus official website]

Here we will focus on PESQ and POLQA.

PESQ is an objective evaluation method with a reference. It takes two audio signals as input: the reference signal, for example a standard test sample provided by the ITU, and the degraded signal output by the VoIP system under test. The PESQ algorithm extracts characteristic parameters from both signals in the time-frequency (or another transform) domain, computes the differences between them, and maps those differences through a perceptual and cognitive model to an objective quality score. The PESQ score is in effect a mapping to the MOS value.
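
To show what "a mapping to the MOS value" looks like in practice: the raw PESQ score runs from about -0.5 to 4.5, and ITU-T P.862.1 defines a logistic function that maps it to MOS-LQO. The sketch below uses the constants as I recall them from P.862.1; please check them against the Recommendation before relying on them. The function name is my own.

```python
import math


def pesq_raw_to_mos_lqo(raw):
    """Map a raw P.862 PESQ score to MOS-LQO (P.862.1 style logistic curve)."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw + 4.6607))


# Raw scores across the nominal PESQ range of -0.5 to 4.5.
for raw in (-0.5, 1.0, 2.5, 3.5, 4.5):
    print(f"raw PESQ = {raw:4.1f} -> MOS-LQO = {pesq_raw_to_mos_lqo(raw):.2f}")
```

The curve saturates at both ends, so the mapped score stays between roughly 1.02 and 4.55 even though listeners nominally rate from 1 to 5.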


POLQA is a newer generation of voice quality assessment standard, suitable for fixed networks, mobile networks, and IP networks. It was standardized by the ITU-T as Recommendation P.863 and can be used to assess voice quality for HD voice, 3G, 4G/VoLTE, and 5G networks. It replaces and upgrades PESQ (ITU-T Recommendation P.862), which was released in 2001.


Compared with traditional PESQ, the POLQA algorithm has the following advantages:

  • It adds the ability to evaluate wideband and super-wideband voice quality, supporting sampling rates up to 48 kHz.
  • It supports the latest voice coding and VoIP transmission technologies, and has been specifically optimized for existing encoders such as Opus and SILK.
  • It supports multi-language environments, with all languages supported. The ITU provides standard test corpora for targeted testing.

Of course, audio quality evaluation is not only about the codec. Other factors also affect quality, such as VAD-driven transmission, packet loss compensation, changes in network quality (delay, jitter, packet loss), and even capture on the device.
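
To give a feel for the network side, here is a minimal sketch of two of the indicators mentioned above: packet loss rate and jitter. The jitter estimate follows the running-average form described in RFC 3550; the function names and the sample timestamps are my own, and sequence-number wraparound is deliberately ignored.

```python
def loss_rate(received_seqs):
    """Fraction of packets missing between the lowest and highest
    sequence number seen (wraparound is ignored in this sketch)."""
    unique = set(received_seqs)
    if not unique:
        raise ValueError("no packets received")
    expected = max(unique) - min(unique) + 1
    return 1.0 - len(unique) / expected


def interarrival_jitter(packets):
    """RFC 3550 style running jitter over (send_ms, recv_ms) pairs,
    given in arrival order. The result is in milliseconds."""
    jitter = 0.0
    prev_transit = None
    for send_ms, recv_ms in packets:
        transit = recv_ms - send_ms
        if prev_transit is not None:
            # Smooth the change in transit time with a gain of 1/16.
            jitter += (abs(transit - prev_transit) - jitter) / 16.0
        prev_transit = transit
    return jitter


# Hypothetical capture: packet 3 was lost, the last packet arrived late.
seqs = [1, 2, 4, 5]
timing = [(0, 50), (20, 70), (60, 110), (80, 140)]
print(f"loss rate = {loss_rate(seqs):.0%}")
print(f"jitter    = {interarrival_jitter(timing):.3f} ms")
```

Numbers like these are cheap to compute during a call, which is why they are natural inputs for a real-time quality estimate.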

Both the reference and no-reference methods have their limitations, including narrow usage scenarios, poor robustness, and high complexity. Overcoming these problems requires a perceptual quality evaluation algorithm and system that covers multiple scenarios and runs efficiently, and almost none exists. Agora has therefore developed its own audio quality evaluation method, which includes uplink quality assessment and downlink quality assessment.

On the uplink, the sound passes through capture, AEC (acoustic echo cancellation), NS (noise suppression), and AGC (automatic gain control). The uplink assessment therefore covers device capture stability, echo cancellation capability, noise suppression capability, and volume gain capability.

On the downlink, the sound passes through encoding and decoding, network transmission, weak-network countermeasures (which I understand to mean VAD, PLC, error correction, and similar processing), and finally device playback. In tests across multiple weak-network conditions, multiple devices, and multiple modes, the error between Agora's algorithm and POLQA was less than 0.15, which can be called a good result.
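
The talk did not say, as far as I noted, which error measure lies behind the 0.15 figure. As a minimal sketch of how such a comparison could be made, the code below computes both the mean absolute error and the root mean square error between a model's predicted scores and POLQA scores for the same clips. All the numbers are made up for illustration.

```python
import math


def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length score lists."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def root_mean_square_error(predicted, reference):
    """Square root of the average squared difference between two score lists."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("score lists must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


# Hypothetical per-clip scores: no-reference model vs. POLQA.
model_scores = [4.1, 3.2, 2.5, 3.9]
polqa_scores = [4.0, 3.3, 2.7, 3.8]
print(f"MAE  = {mean_absolute_error(model_scores, polqa_scores):.3f}")
print(f"RMSE = {root_mean_square_error(model_scores, polqa_scores):.3f}")
```

RMSE penalizes occasional large misses more heavily than MAE, so the two can tell quite different stories on the same test set.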

As for audio quality evaluation, I personally think it will develop toward more specialized fields. Different content calls for different evaluation, for example voice versus music. Different scenarios do as well: real-time online processing demands low latency and low resource consumption, while offline evaluation is less constrained in that respect but requires higher accuracy, so it can make fuller use of AI and of optimizations across the algorithm system.

