In everyday audio and video meetings, most of us have run into scenes like these: "Hello, can you hear me? Your voice keeps cutting out", "Why am I hearing an echo?", "It's too noisy, I can't hear what you're saying", and so on. These voice quality problems degrade the meeting experience, and in an important meeting they are enough to turn annoyance into anger. So how can we effectively reduce them? This series of articles shares Alibaba Cloud Video Cloud's testing experience in ensuring RTC voice quality.
Author | Ke Huai
Review | Taiyi
Background
Audio quality here covers auditory quality and the quality of the audio 3A algorithms under normal network conditions. Auditory quality is the human ear's subjective impression of how good the voice sounds over a lossless network. In practice, different people may judge the same sound differently, and the judgment is also affected by the listening environment and the listener's state of mind. In testing, we can start from the three elements of sound (loudness, pitch, and timbre) and evaluate a set of indicators quantitatively. In addition, industry-standard algorithms such as POLQA and PESQ apply a weighting to these quantitative indicators in order to approximate subjective perception.
The audio 3A algorithm refers to:
AGC: Automatic gain control
ANS: Adaptive noise suppression
AEC: Acoustic echo cancellation
Many articles on our official account already introduce the principles and implementation of these algorithms in detail, so I won't repeat them here.
Previous articles
Hard Goods Column|WebRTC AEC (Acoustic Echo Cancellation)
This series of articles covers audio quality, adaptation testing, QoS quality, and automation solutions. This article introduces the first part, audio quality (auditory quality and audio 3A algorithm quality under normal network conditions).
Breaking down the RTC voice test link
Before testing, let's look at the overall link diagram of RTC voice transmission. Sound is captured by the microphone, pre-processed by the upstream audio algorithms, passed through the codec and network transmission, and finally played out through the speaker. To test the upstream audio algorithms alone, input sound at (1) and capture the output audio at (2) for analysis. When testing the system as a whole, we usually evaluate end to end: input sound at (1) and capture it for analysis at (4). All test methods in the rest of this article are end-to-end.
Audio quality test plan
Alibaba Cloud Video Cloud combines objective indicators with subjective evaluation, an approach commonly used in the industry, to ensure audio quality. The specific indicators are shown in the accompanying figure.
Objective test methods
Effective bandwidth
Feed a sweep file plus human voice audio sampled at 48 kHz into Line in (see the reference audio material), record the output audio from Line out, and read the effective bandwidth from a frequency analysis.
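As a rough illustration of "reading the effective bandwidth through frequency analysis", the sketch below takes the highest frequency whose spectral magnitude is still within a chosen range of the peak. The function names, the -60 dB floor, and the synthetic two-tone frame are all assumptions made for this example; they are not part of the test tooling described in this article.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for one short frame)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        mags.append(abs(acc))
    return mags


def effective_bandwidth_hz(samples, sample_rate, floor_db=-60.0):
    """Highest frequency whose magnitude is within floor_db of the spectral peak."""
    mags = magnitude_spectrum(samples)
    peak = max(mags)
    if peak == 0:
        return 0.0
    threshold = peak * 10 ** (floor_db / 20)
    highest_bin = max(k for k, m in enumerate(mags) if m >= threshold)
    return highest_bin * sample_rate / len(samples)


# Synthetic stand-in for a recorded frame: tones at 1 kHz and 7 kHz, 48 kHz sampling.
sample_rate = 48000
frame_length = 480  # 10 ms, so each DFT bin is 100 Hz wide
frame = [
    math.sin(2 * math.pi * 1000 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 7000 * i / sample_rate)
    for i in range(frame_length)
]
bandwidth = effective_bandwidth_hz(frame, sample_rate)
print(bandwidth)
```

On a real recording you would normally window the signal, average over many frames, and use an FFT routine instead of the naive DFT shown here; the threshold also needs tuning against the noise floor of the recording chain.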
End-to-end delay
Method 1: Use the VQT test, which reports the delay in its test results.
Method 2: Self-developed. Feed the test material into Line in, record both the untransmitted audio and the transmitted output audio from Line out, and calculate the audio delay.
- Test material: a continuous single tone.
- Index calculation: in the recording, the onset time of the untransmitted audio is recorded as t1 and the onset time of the audio transmitted through the conference as t2; then Delay = t2 - t1.
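The Delay = t2 - t1 calculation can be sketched as follows. This is a minimal example that assumes the two channels of the recording are already available as lists of float samples; `onset_time`, the 0.05 amplitude threshold, and the synthetic tones are hypothetical names and values chosen for illustration.

```python
import math


def onset_time(samples, sample_rate, threshold=0.05):
    """Time in seconds of the first sample whose magnitude exceeds threshold, or None."""
    for i, x in enumerate(samples):
        if abs(x) > threshold:
            return i / sample_rate
    return None


def tone_after_silence(delay_s, duration_s, sample_rate, freq=1000.0):
    """Synthetic stand-in for one recorded channel: silence, then a single tone."""
    silent = int(round(delay_s * sample_rate))
    total = int(round(duration_s * sample_rate))
    return [0.0] * silent + [
        math.sin(2 * math.pi * freq * i / sample_rate) for i in range(total - silent)
    ]


sample_rate = 48000
direct = tone_after_silence(0.100, 0.5, sample_rate)       # audio that was not transmitted
transmitted = tone_after_silence(0.250, 0.5, sample_rate)  # audio that went through the conference

t1 = onset_time(direct, sample_rate)
t2 = onset_time(transmitted, sample_rate)
delay_ms = (t2 - t1) * 1000
print(f"Delay = {delay_ms:.1f} ms")
```

A fixed amplitude threshold only works when the recording is clean before the tone starts; with background noise, an energy-based onset detector or a cross-correlation between the two channels is likely to be more robust.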
ANS
Evaluate the performance of the ANS algorithm in pure-noise scenarios and in mixed speech-plus-noise scenarios. The indicators analyzed are noise reduction consistency, signal-to-noise ratio (SNR) improvement, convergence time, and voice quality after noise reduction.
Test topology
Feed the background noise material and the voice material in through Line in or an external loudspeaker, and record the output audio from Line out at the receiving end for indicator analysis.
Test material
Index calculation
- SNR improvement: if the SNR of the audio after noise reduction is A, then SNR improvement = A - input SNR.
- Noise reduction consistency: calculate the residual noise level for each type of input noise, and check whether the residuals are consistent across noise types.
- Convergence time: record the time at which the noise energy begins to fall as t1 and the time at which the noise has converged to a steady level as t2; convergence time = t2 - t1.
- Sound quality: modify the VQT POLQA test script to calculate the MOS of the output audio at different input SNRs. The accompanying table reports the output MOS for noisy speech at an input SNR of 10 dB.
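The three level-based ANS indicators above can be sketched as follows. This is only an illustration under stated assumptions: levels are RMS values in dB relative to full scale, the SNR is approximated from a speech segment and a noise-only segment, and the 1 dB tolerance, the function names, and every input value are hypothetical rather than measured.

```python
import math


def level_db(samples):
    """RMS level in dB relative to full scale (1.0), floored at -120 dB."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10 * math.log10(max(mean_square, 1e-12))


def snr_db(speech_segment, noise_segment):
    """Approximate SNR: level of a speech segment minus level of a noise-only segment."""
    return level_db(speech_segment) - level_db(noise_segment)


def convergence_time_s(levels_db, frame_ms=10, tolerance_db=1.0):
    """t2 - t1 from per-frame noise levels, or None if the noise never falls.

    t1: first frame more than tolerance_db below the initial level.
    t2: first frame within tolerance_db of the final (plateau) level.
    """
    start, plateau = levels_db[0], levels_db[-1]
    t1 = next((i for i, v in enumerate(levels_db) if v < start - tolerance_db), None)
    if t1 is None:
        return None
    t2 = next(i for i, v in enumerate(levels_db) if v <= plateau + tolerance_db)
    return (t2 - t1) * frame_ms / 1000


def residual_spread_db(residual_by_noise):
    """Gap between the largest and smallest residual noise level across noise types."""
    values = list(residual_by_noise.values())
    return max(values) - min(values)


def tone(amplitude, n=480, freq=1000.0, sample_rate=48000):
    """Synthetic segment used in place of audio cut from a real recording."""
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# SNR improvement = A - input SNR (synthetic segments, not measured data).
input_snr = 10.0
output_snr = snr_db(tone(0.1), tone(0.01))  # A
snr_improvement = output_snr - input_snr

# Per-frame noise level: flat, then falling 0.5 dB per frame, then a plateau.
levels = [-30.0] * 50 + [-30.0 - 0.5 * k for k in range(1, 41)] + [-50.0] * 100
convergence = convergence_time_s(levels)

# Hypothetical residual levels (dBFS) for three noise types.
spread = residual_spread_db({"babble": -52.0, "street": -50.0, "fan": -55.0})
print(snr_improvement, convergence, spread)
```

A small spread suggests the algorithm treats the different noise types consistently. How small is "consistent" is a pass criterion the test team has to set; this sketch does not define one.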
AGC
Evaluate the performance of the AGC algorithm at different input volume levels. The indicators analyzed are sound stability and output loudness.
Test topology
Refer to the ANS test topology: feed the voice material in through Line in or an external loudspeaker, and record the output audio from Line out at the receiving end for indicator analysis.
Test material
Index calculation
- Sound stability: calculate the average RMS of each volume segment of the output audio, then compute the variance of those average RMS values. The accompanying formula defines the average RMS.
- Output loudness: with the Line out method, calculate the average RMS of the output audio; with the external loudspeaker method, measure with a standard sound level meter and record the A-weighted loudness.
- Sound quality: modify the VQT POLQA test script to calculate the MOS of the output audio at different input volumes. The accompanying table reports the output MOS for high, medium, and low input volume.
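The sound stability indicator can be sketched as below. Because the article's formula is not available here, this example assumes the usual definition of RMS (the square root of the mean of the squared samples) expressed in dBFS, and takes the population variance across segments; the function names and the three synthetic segments are illustrative only.

```python
import math
import statistics


def rms_dbfs(samples):
    """RMS level of one segment in dB relative to full scale (1.0)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10 * math.log10(max(mean_square, 1e-12))


def sound_stability(segments):
    """Per-segment RMS levels and their variance (dB^2); smaller means steadier output."""
    levels = [rms_dbfs(segment) for segment in segments]
    return levels, statistics.pvariance(levels)


def tone(amplitude, n=480, freq=1000.0, sample_rate=48000):
    """Synthetic segment used in place of AGC output cut from a real recording."""
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# Hypothetical AGC output for low, medium, and high input volume.
output_segments = [tone(0.40), tone(0.50), tone(0.63)]
levels, variance = sound_stability(output_segments)
print([round(level, 2) for level in levels], round(variance, 3))
```

Whether the variance is taken over linear RMS or over dB values changes the numbers, so the choice should be fixed once and kept across test runs.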
AEC
Evaluate whether the AEC algorithm suffers from echo leakage or voice suppression in single-talk and double-talk scenarios.
Test topology
【Single Talk】
Single-talk voice material is played from the far end, while the end under test sits in an open meeting room by default. Line out records the output sent by the end under test, to judge whether any echo leaks through.
【Double Talk】
Double-talk test material is played at both ends at the same time, and Line out records the output sent by the end under test, to judge whether there is echo leakage or voice suppression.
Test material
Index calculation
- Echo leakage: measure the residual voice level in the recorded audio file. Ideally the value is 0, meaning there is no echo leakage.
- Voice suppression: evaluated in the double-talk scenario. Clipping is assessed according to the 3GPP TS 26.132 standard, and the final judgment is based on type D (continuous clipping longer than 150 ms). The closer the value is to 0, the better the quality.
- Convergence time: record the start of the test as t1 and the time at which AEC has converged and no echo leaks through as t2; convergence time = t2 - t1.
- Voice quality: evaluated in the double-talk scenario. Modify the VQT POLQA test script to calculate the quality score of the voice in the double-talk scene.
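The convergence-time and long-clipping checks can be approximated from per-frame output levels, as in the sketch below. It is a simplified stand-in, not the classification procedure of 3GPP TS 26.132: a frame counts as "silent" when its level is at or below an assumed -70 dBFS threshold, and only runs longer than 150 ms are reported. All names, thresholds, and level sequences are hypothetical.

```python
def aec_convergence_time_s(levels_db, frame_ms=10, silence_db=-70.0):
    """Time from test start (t1 = 0) until the output stays at or below silence_db."""
    last_leak = max((i for i, v in enumerate(levels_db) if v > silence_db), default=-1)
    return (last_leak + 1) * frame_ms / 1000


def long_clips_ms(near_active, out_levels_db, frame_ms=10, silence_db=-70.0, min_ms=150):
    """Durations of runs longer than min_ms where near-end speech is active but the output is silent."""
    clips, run = [], 0
    for active, level in zip(near_active, out_levels_db):
        if active and level <= silence_db:
            run += 1
            continue
        if run * frame_ms > min_ms:
            clips.append(run * frame_ms)
        run = 0
    if run * frame_ms > min_ms:
        clips.append(run * frame_ms)
    return clips


# Single talk: echo leaks for the first 30 frames, then the output is silent.
single_talk_levels = [-40.0] * 30 + [-80.0] * 170
convergence = aec_convergence_time_s(single_talk_levels)

# Double talk: near-end speech is active throughout; the output drops out twice.
near_active = [True] * 100
double_talk_levels = (
    [-30.0] * 20 + [-90.0] * 20 + [-30.0] * 30 + [-90.0] * 10 + [-30.0] * 20
)
clips = long_clips_ms(near_active, double_talk_levels)
print(convergence, clips)
```

In this synthetic case the 200 ms dropout is reported and the 100 ms one is ignored. In practice the near-end activity flags would come from the clean near-end reference material, and the silence threshold depends on the noise floor of the recording chain.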
STOI
Short-Time Objective Intelligibility (STOI) is an objective method for measuring speech intelligibility that is currently regarded in academia as accurate and reliable. Its results reflect the intelligibility and naturalness of speech to a certain extent. Limitation: the audio must be downsampled to 16 kHz for calculation.
- Test topology: Refer to ANS test topology.
- Test material: the standard human voice material provided with ITU-T P.863.
- Index calculation: the STOI calculation process is shown in the accompanying block diagram. MATLAB and Python implementations of this algorithm are already available in the industry.
POLQA
ITU-T P.863 provides a test method that yields MOS scores and audio delay. It supports 8 kHz, 16 kHz, and 48 kHz tests; the limitation is that the equipment is expensive.
- Test topology: Refer to ANS test topology.
- Test material: the standard human voice material provided with ITU-T P.863, plus the voice test material built into VQT.
- Index calculation: POLQA MOS score.
PESQ
ITU-T P.862 provides a test method that yields MOS scores. The limitation is that it supports only 8 kHz and 16 kHz.
- Test topology: Refer to ANS test topology.
- Test material: the standard human voice material provided with ITU-T P.863.
- Index calculation: PESQ MOS score.
Subjective testing methods
Following the scoring rules and dimensions in "YD/T 2309 Audio Quality Subjective Test Method (ITU-R BS.1284)", experts and ordinary users score the audio in different scenarios.
Scoring method
Evaluation dimension
Test scenarios
The test materials are the "Hvi Audition Disc" and "TUT-acoustic-scenes-2017-development".
This article is the first in the RTC audio test series. Later articles will introduce how Alibaba Cloud Video Cloud guarantees RTC voice quality from the perspectives of adaptation testing, QoS quality, and automation solutions. You are welcome to follow the official account "Video Cloud Technology".
"Video Cloud Technology" is an official account dedicated to audio and video technology. Every week it publishes practical technical articles from the front line of Alibaba Cloud and hosts exchanges with first-class engineers in the audio and video field. Reply [Technology] in the official account to join the Alibaba Cloud Video Cloud Product Technology Exchange Group, discuss audio and video technologies with industry leaders, and get the latest industry information.