Introduction: With the popularization of 5G networks and the impact of the pandemic, application scenarios for real-time audio and video technology keep multiplying, including conferencing, connected-mic (co-hosting), audio and video calls, online education, telemedicine, and more. Real-time interactive scenarios place ever higher requirements on RTC audio quality. How to evaluate the RTC audio effect and ensure good transmission quality by building an objective, standardized, and repeatable evaluation system has become an urgent and important topic.
Text | Ma Jianjian, Senior Audio and Video Test Engineer at NetEase Yunxin
Ideal Communication Model
Face-to-face communication generally gives the best results in daily life. If environmental interference is further reduced, as in a quiet laboratory, a nearly ideal communication effect can be achieved. Abstracting this model, we can see that it has the following characteristics:
- Quiet environment: a noise floor of about NR15, equivalent to a very quiet night, so the human ear is not disturbed by other sounds and can concentrate on the target speech.
- Reverberation suitable for listening: reverberation usually affects the listener's comprehension; the stronger the reverberation, the longer the speech tail and the lower the intelligibility. For example, a concert hall with strong reverberation flatters instruments and singing, but is not conducive to conversation.
- Clear and natural speech: the speaker is in a good psychological and physical state, with clear pronunciation, a balanced spectrum, fluent delivery, and a moderate speaking rate.
- Moderate volume: research shows that volume has a significant impact on perceived quality. With other conditions equal, the higher the volume, the better the subjective impression; a speaker who talks loudly improves the listener's intelligibility to a certain extent.
- Timely response and smooth interaction: in RTC, delay is also a very important indicator. Generally speaking, a delay within 200 ms causes no noticeable obstacle or lag, 200-400 ms still permits normal communication, and beyond 400 ms there is a clear sense of lag; in severe cases people start talking over each other, which directly hurts the call experience. In face-to-face conversation the delay is only about 3 ms.
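To make the rule of thumb above concrete, here is a minimal Python sketch that maps a measured one-way delay to the qualitative bands described in this section; the thresholds come straight from the text, while the function name and wording are illustrative.

```python
def rate_one_way_delay(delay_ms: float) -> str:
    """Map mouth-to-ear delay to the qualitative bands described above.
    Thresholds follow the article's rule of thumb, not a formal ITU grading."""
    if delay_ms <= 200:
        return "imperceptible: no obvious obstacle or lag"
    if delay_ms <= 400:
        return "acceptable: normal communication still possible"
    return "degraded: noticeable lag, risk of people talking over each other"

print(rate_one_way_delay(150))   # imperceptible
print(rate_one_way_delay(450))   # degraded
```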
RTC Audio Link
The figure above shows two people communicating in real time over RTC. As the figure shows, speaker A starts to speak; the sound passes through air transmission, microphone capture, A/D conversion, enhancement processing (noise suppression, echo cancellation, volume control, de-reverberation), encoding, and packet transmission, then the receiver performs decoding, NetEQ, and D/A conversion for downlink playback, and B hears the sound. This is the complete sound transmission path in the simplex case.
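To make the ordering of the stages explicit, below is a minimal, hedged sketch of the simplex path as a chain of placeholder callables; the stage names mirror the description above, but the bodies are dummies rather than real DSP, codec, or NetEQ implementations.

```python
import numpy as np

# Placeholder stages for the simplex RTC audio path described above.
# Real SDKs implement each stage with dedicated modules; the dummies here
# only illustrate the ordering of the chain.
def capture(frame):   return frame                # air -> microphone -> A/D
def enhance(frame):   return frame * 0.9          # ANS / AEC / AGC / de-reverb (dummy gain)
def encode(frame):    return frame                # codec encode (identity here)
def transmit(frame):  return frame                # packetize + network send
def decode(frame):    return frame                # receiver decode
def neteq(frame):     return frame                # jitter buffer / NetEQ
def playback(frame):  return frame                # D/A -> loudspeaker

AUDIO_PATH = [capture, enhance, encode, transmit, decode, neteq, playback]

frame = np.zeros(480, dtype=np.float32)           # 10 ms of 48 kHz mono audio
for stage in AUDIO_PATH:
    frame = stage(frame)
```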
Compared with the ideal communication model, the actual RTC link contains many kinds of interference and influence, such as environmental, hardware, link, and network effects; each stage may introduce audio quality degradation. Taken together, these effects can lead to sound problems in the following categories.
- Volume problems: no sound, low volume, clipping and harshness caused by excessive loudness, volume fluctuating up and down, etc.
- Echo problems: echo leakage, residual echo, and speech damage such as over-suppression, clipping, or choppiness.
- Noise problems: residual or non-stationary noise left after suppression.
- Problems introduced by the system: background noise, electrical hum, popping sounds.
- Sound quality problems (narrow sense): blurred speech, distortion, dull or overly sharp timbre, mechanical (robotic) sound.
- Network problems: stuttering, intermittent audio, fast playback, slow playback, mechanical sound.
Subjective test method
The earliest subjective tests are based on two people talking. A and B establish an RTC link and speak in turn or at the same time to reproduce real user scenarios, focusing on the following three dimensions.
Listening Quality: the sound quality perceived by the listener in a simplex (one-way) scenario. For example, when A is speaking, the quality of the sound heard by B is the listening quality. This is the most basic dimension, and the existing objective evaluation methods in the industry are essentially based on listening quality.
Talking Quality: the sound quality perceived by the speaker, i.e. the quality of the sound the speaker hears while talking, which is related to echo, sidetone masking, and the local environment.
Conversation Quality: in addition to the listening quality and talking quality of both A and B, conversation quality is also affected by duplex behaviour. The main influencing factors are echo during double talk and end-to-end delay.
Subjective test dimensions
The concerns of the subjective test are shown in the figure above and fall into several major aspects: sound quality, timbre, volume, delay, echo, and noise reduction.
Timbre
Timbre, also called tone quality, is the perceived character of a sound and is mainly determined by its frequency spectrum. In the RTC link, the factors that shape the frequency response include the frequency characteristics of the microphone, intermediate processing such as EQ, high/low-pass filtering and volume-control algorithms (DRC/AGC), and the frequency response of the speaker or headphones. Different people also have different spectral distributions: male voices generally contain more low-frequency energy and sound thick or dull, while female or children's voices contain more high-frequency components and sound bright, even slightly sharp.
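As a rough, hedged illustration of how spectral balance relates to a "dark" versus "bright" timbre, the sketch below computes a simple low/high band energy ratio with NumPy; the 1 kHz split point and the helper name are arbitrary choices for demonstration, not a standard metric.

```python
import numpy as np

def band_energy_ratio_db(samples: np.ndarray, fs: int, split_hz: float = 1000.0) -> float:
    """Low-band vs. high-band energy ratio in dB: positive values indicate a
    low-frequency-heavy (darker) spectrum, negative values a brighter one."""
    windowed = samples * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    low = spectrum[freqs < split_hz].sum()
    high = spectrum[freqs >= split_hz].sum() + 1e-12
    return 10 * np.log10(low / high + 1e-12)

fs = 16000
t = np.arange(fs) / fs
low_pitched = np.sin(2 * np.pi * 200 * t)        # synthetic "dark" tone
print(band_energy_ratio_db(low_pitched, fs))     # strongly positive
```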
Sound quality
Sound quality is divided into three dimensions: clarity, fluency, and naturalness.
- Clarity: also called intelligibility in the audio field, it indicates how well the semantic content can be understood. Many factors affect intelligibility, for example: noise mixed into the speech makes it hard to hear, reducing intelligibility; heavy reverberation produces speech tails that blur the words.
- Fluency: indicates the continuity of speech. Factors that directly affect it include: a poor network environment causing intermittent audio, freezes, or dropped words; QoS adjustments causing fast or slow playback; and speech damage caused by algorithms such as echo cancellation and noise suppression.
- Naturalness: indicates how similar the received speech is to the original speech. Typical problems affecting naturalness are distortion introduced by algorithm processing, nonlinear distortion of the loudspeaker, and clipping or overload caused by excessive amplification.
Volume
For an RTC SDK vendor, the biggest challenge is the diversity of devices: different platforms (Mac, Windows, Android, iOS, Web), different models, and different external devices can differ greatly in capture and playback volume. The goal of the volume-control strategy is to ensure consistency across devices and platforms, so that users always hear a sufficiently loud sound without obvious damage or degradation of sound quality.
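Below is a minimal sketch of the "consistent level across devices" goal, assuming a single static gain toward a common target level (real AGC/DRC is adaptive and frame-based); the -26 dBFS target and the synthetic signal are illustrative only.

```python
import numpy as np

def rms_dbfs(x: np.ndarray) -> float:
    """RMS level of a [-1, 1] signal in dBFS."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def level_to_target(x: np.ndarray, target_dbfs: float = -26.0) -> np.ndarray:
    """Apply one static gain so RMS hits target_dbfs, then clip to avoid overload."""
    gain = 10 ** ((target_dbfs - rms_dbfs(x)) / 20)
    return np.clip(x * gain, -1.0, 1.0)

quiet_capture = 0.01 * np.random.randn(16000).astype(np.float32)   # roughly -40 dBFS
leveled = level_to_target(quiet_capture)
print(round(rms_dbfs(quiet_capture), 1), "->", round(rms_dbfs(leveled), 1))
```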
Noise
The purpose of the noise reduction algorithm is to remove noise introduced by the environment or equipment, restore the human voice as much as possible, and improve the signal-to-noise ratio. In practice, a noise reduction algorithm inevitably damages the sound quality to some degree. Therefore, noise reduction is mainly evaluated from two aspects:
- Noise suppression capability: including convergence time, suppression strength, stationarity of the residual noise, etc.
- Degree of speech impairment: a good noise reduction algorithm strikes a balance between the two, suppressing noise effectively without obviously damaging speech.
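A rough, hedged sketch of these two aspects: SNR improvement as a proxy for suppression strength, and residual error on speech-active frames as a proxy for speech impairment. The naive energy-threshold VAD and the thresholds are assumptions for illustration, not the lab's actual metrics.

```python
import numpy as np

def snr_db(clean: np.ndarray, observed: np.ndarray) -> float:
    """SNR of `observed` against the time-aligned clean reference."""
    noise = observed - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def evaluate_ans(clean: np.ndarray, noisy_in: np.ndarray, processed: np.ndarray) -> dict:
    """Report suppression (SNR gain) plus a crude speech-impairment proxy."""
    snr_gain = snr_db(clean, processed) - snr_db(clean, noisy_in)
    active = np.abs(clean) > 0.02 * np.max(np.abs(clean))      # naive VAD mask
    speech_distortion = float(np.mean((processed[active] - clean[active]) ** 2))
    return {"snr_improvement_db": snr_gain, "speech_distortion": speech_distortion}
```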
Echo
Echo cancellation is an important module in the RTC link. Its purpose is to eliminate the echo produced by the device and ensure a smooth call experience. Echo is mainly evaluated from two points:
- Echo suppression strength: whether any residual echo remains.
- Damage to near-end speech. In RTC application scenarios, echo is also closely related to the device, platform, model, and external peripherals, so the echo test needs to cover the top models.
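A minimal sketch of a single-talk suppression metric: frame-wise ERLE (Echo Return Loss Enhancement) computed from the time-aligned microphone and AEC output signals during far-end single talk; the frame size and the assumption of alignment are illustrative choices.

```python
import numpy as np

def erle_db(mic_with_echo: np.ndarray, aec_output: np.ndarray, frame: int = 160) -> np.ndarray:
    """Frame-wise ERLE during far-end single talk: how much the AEC attenuates
    the echo picked up by the microphone. Higher is better."""
    n_frames = min(len(mic_with_echo), len(aec_output)) // frame
    erle = []
    for i in range(n_frames):
        s = slice(i * frame, (i + 1) * frame)
        p_in = np.mean(mic_with_echo[s] ** 2) + 1e-12
        p_out = np.mean(aec_output[s] ** 2) + 1e-12
        erle.append(10 * np.log10(p_in / p_out))
    return np.array(erle)
```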
Delay
In network transmission, audio anti-packet-loss algorithms such as FEC, RED, and ARQ, as well as the jitter buffer, introduce extra delay and increase end-to-end latency, which hurts real-time interaction and degrades the experience. Especially in low-latency scenarios, end-to-end latency is an important indicator of how well the system withstands weak networks.
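One common way to measure end-to-end delay, sketched here under the assumption that the reference played at the sender and the signal recorded at the receiver are captured against the same clock (for example, the same sound card): estimate the lag by cross-correlation.

```python
import numpy as np

def end_to_end_delay_ms(reference: np.ndarray, recorded: np.ndarray, fs: int) -> float:
    """Estimate mouth-to-ear delay by cross-correlating the sender-side reference
    with the receiver-side recording captured on a common clock."""
    corr = np.correlate(recorded, reference, mode="full")
    lag = np.argmax(corr) - (len(reference) - 1)   # lag in samples of recorded vs. reference
    return 1000.0 * max(lag, 0) / fs
```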
Pain points of subjective testing
At present, the mainstream evaluation of RTC audio relies mainly on subjective listening tests. This approach demands relatively high professional skill and is relatively inefficient, with the following main pain points:
- Poor repeatability: it is hard to guarantee that two subjective tests are consistent; the sound field, the speaker's pronunciation, the volume, the distance from the device, and many other factors are uncontrollable, so accurate comparative results cannot be obtained.
- Low efficiency: a subjective test requires two people throughout; prolonged listening or speaking causes fatigue and slackness, and scenarios must be switched per test case, so efficiency is very low.
- Low coverage: because of the efficiency problem, only a limited number of scenarios and link combinations can be covered; generally only key scenarios are guaranteed. A tester's own voice is also limited, so a wider variety of voices cannot be covered.
- Strong subjective bias: hearing is highly subjective; the same sound is perceived differently by different people, and conclusions drawn from a single listener may be biased. Moreover, a person's speech and hearing depend heavily on their physical and psychological state, so the same person can give completely different judgments at different times.
In response to these pain points, NetEase Yunxin has built a set of objective evaluation methods for audio effect testing, covering laboratory construction, environment simulation, capture and playback, and evaluation methodology.
Standard Labs
The figure above shows the acoustic laboratory of NetEase Yunxin. The main equipment and hardware configuration are as follows:
- Head and torso simulator: with a built-in mouth simulator and calibrated ear simulators (IEC 60318-4 / ITU-T Rec. P.57 Type 3.3), realistically reproducing the acoustic characteristics of the average adult head and torso for accurate binaural signal acquisition and mouth playback.
- 4 x Hi-Fi loudspeakers: build a uniform diffuse sound field to simulate and play back the noise environments of different scenes at different signal-to-noise ratios.
- Multi-channel sound card: supports simultaneous 8-in/8-out capture and playback, meeting the setup needs of various audio tests.
- 4-in/4-out electrical signal interfaces: support multi-person voice tests and echo single-talk tests.
With this professional audio test laboratory, automated audio testing, competitive product analysis and evaluation, and quick baseline comparison tests can all be carried out, yielding repeatable and objective results. Subjective 3A tests (noise reduction, sound quality, and echo single-talk/double-talk) can also be completed by a single person. AI algorithms are becoming more common, and data is the key to them; with the acoustic laboratory and noise simulation system, AI data can be collected and labeled automatically by scripts, greatly reducing the cost of purchasing and labeling data. The current layout of Yunxin's acoustic laboratory network is shown in the figure above. The laboratory has improved the professionalism of development and testing, mainly in the following applications:
- Automated testing: objective 3A automated tests, such as echo and noise tests, including simulation of multiple participants joining a conference.
- Automatic AI data collection: open-source speech and target noise are played back through the head simulator and the noise playback system respectively and recorded on the target device or platform. Labels can be applied during recording, solving corpus collection and labeling in one pass.
- Subjective testing: a calibrated playback environment and a quiet listening environment.
- Others: device model coverage tests, model adaptation, and verification of algorithm prototype optimizations.
Objective Test Standard
The laboratory mainly provides an objective and repeatable test environment, with hardware that supports customized capture and playback. On top of this, NetEase Yunxin's audio laboratory has also introduced objective test standards as the final means of data evaluation. Audio test standards can be classified along several dimensions.
Subjective/Objective
Subjective methods are based on human evaluation, while objective methods use models to calculate and assess speech quality. A typical subjective evaluation standard is P.800, and a typical objective speech quality evaluation method is PESQ.
With reference / without reference
Full Reference / No Reference (FR/NR) describes the type of measurement algorithm. An FR algorithm takes two signals, the original signal and the distorted signal, while an NR algorithm only needs the distorted signal. A typical FR algorithm is PESQ; a typical NR measurement is P.563. NR methods are also often referred to as "single-ended" tests.
Perceptual / non-perceptual
Perceptual measurement algorithms attempt to model human perception. Perceptual modeling is not limited to quality assessment; well-known codecs such as MP3 and AAC use perceptual models to compress music. Non-perceptual metrics are general physical or technical measures such as level or signal-to-noise ratio.
Perceptual Model-Based Objective Criteria
The most classic and widely used objective indicators based on perceptual models are the intrusive objective speech quality standards of the P.86x series, better known as PESQ and POLQA, which are typical full-reference speech evaluation standards. The general idea of PESQ/POLQA is as follows: the original (reference) signal and the signal that has passed through the system under test are first aligned to a standard listening level, and then filtered with an input filter that simulates a standard telephone handset.
The level-adjusted and filtered signals are then aligned in time and passed through an auditory transformation, which compensates and equalizes for linear filtering and gain changes in the system. The difference between the two auditory-transformed signals is treated as the disturbance; by analysing the disturbance surface, two distortion parameters are extracted, accumulated over frequency and time, and mapped to a predicted subjective mean opinion score. Compared with PESQ, POLQA adds many accuracy optimizations that make the objective results more consistent with subjective results, and it is very widely used in speech evaluation.
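POLQA itself is a licensed algorithm with no free reference implementation, but as a hedged illustration of full-reference scoring, the open-source `pesq` package (pip install pesq) can compute wideband PESQ (ITU-T P.862.2) scores; the file names below are placeholders.

```python
from scipy.io import wavfile
from pesq import pesq  # open-source PESQ wrapper (pip install pesq)

fs_ref, ref = wavfile.read("reference_16k.wav")   # placeholder file names
fs_deg, deg = wavfile.read("degraded_16k.wav")
assert fs_ref == fs_deg == 16000                  # 'wb' mode needs 16 kHz audio

mos_lqo = pesq(fs_ref, ref, deg, "wb")            # wideband PESQ, MOS-LQO scale
print("PESQ MOS-LQO:", mos_lqo)
```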
Automated Testing
POLQA Automated Testing
In the network tests, an electrical signal link is used in order to reduce the influence of hardware capture/playback and the acoustic link: the sending and receiving devices are connected to the sound card with 3.5 mm audio cables. In addition, a TC (traffic control) system provides the network impairment environment; the two devices under test are connected through the TC router, and packet loss, delay, jitter, and bandwidth on both ends are controlled by scripts.
As shown in the figure above, the test host sends a signal through the sound card to device A under test. After local RTC audio processing, it is transmitted over the network to receiving device B, while the weak-network system injects different types and degrees of network impairment in real time. The sound card then records the signal from device B and compares it against the original signal to measure the performance of the RTC weak-network resistance modules.
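A simplified, hedged sketch of the impairment control: TC scripts of this kind typically drive Linux tc-netem on the router; the interface name and the numbers below are placeholders, and running this requires root privileges.

```python
import subprocess

def apply_netem(interface: str = "eth0", loss_pct: int = 10,
                delay_ms: int = 100, jitter_ms: int = 20) -> None:
    """Apply a packet-loss / delay / jitter profile with Linux tc-netem."""
    # Clear any previous qdisc on the interface (ignore failure if none exists).
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=False)
    # Install the new network impairment profile.
    subprocess.run(["tc", "qdisc", "add", "dev", interface, "root", "netem",
                    "loss", f"{loss_pct}%",
                    "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
                   check=True)

apply_netem("eth0", loss_pct=20, delay_ms=150, jitter_ms=30)
```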
- Support interoperability testing on Android, iOS, Windows, Mac, and Web;
- Use TC scripts to automate control of the network environment;
- Use API to automatically control conference joining, profile switching, parameter control, and leaving conferences;
- Automatically obtain bit rate, packet loss, freeze counts, and other monitoring information during the test as auxiliary metrics;
- One-click execution to generate a version baseline report;
3A Objective Automation
NetEase Yunxin has built an end-to-end 3A automated test based on the laboratory. The architecture is shown in the figure above and is mainly divided into the use case management layer, the API/UI control layer, capture and playback, automatic calibration, analysis and calculation, and data and reporting modules. It is mainly used for comprehensive evaluation of echo, noise, and volume control, and is currently applied in version baseline tests, version-to-version comparisons, competitive product comparisons, and other test stages.
Author Introduction
Ma Jianli, senior audio and video test engineer at NetEase Yunxin and core member of the NetEase Yunxin Audio and Video Media Lab, is responsible for building the audio test quality system and for audio and video quality assurance.