Talking about the Real-time Voice Quality Monitoring System

Today, Senior Wang will talk with you about the past and present of real-time voice quality monitoring systems. Real-time voice is presumably familiar to everyone: WeChat voice chat, live video streaming, and countless other examples from everyday life.

In voice communication systems, many factors affect voice quality, including but not limited to delay, packet loss, packet delay variation (jitter), echo, and the distortion introduced by encoding.

Generally speaking, speech quality evaluation methods fall into three types: reference-based objective evaluation, subjective evaluation, and no-reference objective evaluation.

Reference-based objective evaluation method:

This refers to comparing the original reference audio or video with the distorted audio or video, element by element (for video, each corresponding pixel in each corresponding frame; for audio, each corresponding sample). Strictly speaking, what this method yields is not the true perceived quality but the degree of similarity or fidelity of the distorted signal relative to the original. The simplest such measures, mean squared error (MSE) and peak signal-to-noise ratio (PSNR), are widely used.
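As a rough illustration of the two measures just named, here is a minimal Python sketch that computes MSE and PSNR between a reference and a degraded audio signal. The function names, the test tone, and the assumption that samples are floats with a peak magnitude of 1.0 are all choices made for this example, not something taken from a particular tool.

```python
import math

def mse(reference, degraded):
    """Mean squared error between two equal-length sample sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("signals must be non-empty and the same length")
    return sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)

def psnr(reference, degraded, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the largest possible sample magnitude."""
    err = mse(reference, degraded)
    if err == 0:
        return math.inf  # identical signals
    return 10 * math.log10(peak ** 2 / err)

# Toy data (assumed): a 440 Hz tone sampled at 8 kHz, degraded by a constant offset of 0.01.
ref = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
deg = [x + 0.01 for x in ref]

print(mse(ref, deg), psnr(ref, deg))
```

Both numbers only say how close the degraded waveform is to the reference. They say nothing about how a listener would perceive the difference, which is why perceptual models such as PESQ exist.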

Voice quality is an important indicator of voice transmission performance, and how to obtain an accurate and reliable QoE (Quality of Experience) evaluation system has become a focus of current research. PESQ (Perceptual Evaluation of Speech Quality) is a QoE-based speech quality evaluation algorithm proposed by the ITU that subsequently became the ITU-T P.862 standard, and it is currently one of the more popular speech quality evaluation algorithms. Its predecessor, ITU-T P.861, also called PSQM, is the earliest of these standards and is a speech quality evaluation method derived from PAQM. At present, P.862 PESQ and PESQ-WB are the most widely used reference-based evaluation methods. The latest reference-based method is P.863 POLQA. All of these rely on a lossless reference signal.

No-reference objective evaluation method:

Research on the objective evaluation of speech quality has developed rapidly since the 1970s, and scholars at home and abroad have proposed a great many objective evaluation methods. Objective evaluation arose mainly to address the shortcomings of subjective evaluation: people had long hoped for objective methods to assess the sound quality of voice equipment, and many methods based on objective measures were proposed in succession. The hope is that these methods can conveniently and quickly give a speech quality score for the system under test, with the evaluation itself carried out by hardware or software rather than by human listeners. Many of these methods, including the widely used PSQM, PAMS, and PSQM+, compare characteristic parameters of the original and distorted speech signals in the time-frequency domain or a transform domain, so they still need a reference.

Among no-reference methods, P.563 is the best-known narrowband one. The authors of some others, such as ANIQUE+, claim higher accuracy than the reference-based PESQ. There are also parameter-domain methods such as the E-Model and P.1201, and deep-learning methods (the various xxNet models).
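To make "parameter domain" concrete: such methods estimate quality from transmission parameters (packet loss, codec, delay) without looking at the waveform. The sketch below follows the general shape of the E-model (ITU-T G.107): a rating factor R is reduced by impairments and then mapped to an estimated MOS. It is deliberately simplified. It ignores delay and other impairment terms, the base rating of 93.2 is the commonly quoted default, and the codec values `ie=0.0` and `bpl=25.1` are illustrative assumptions for this example.

```python
def effective_equipment_impairment(ie, bpl, loss_percent, burst_ratio=1.0):
    """Ie-eff: codec impairment `ie`, increased by packet loss given in percent.

    `bpl` is the codec's packet-loss robustness factor; burst_ratio=1.0 means random loss.
    """
    return ie + (95.0 - ie) * loss_percent / (loss_percent / burst_ratio + bpl)

def r_to_mos(r):
    """Map the E-model rating factor R to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def estimate_mos(loss_percent, ie=0.0, bpl=25.1, base_r=93.2):
    """Simplified estimate: only the packet-loss impairment is subtracted from base_r."""
    r = base_r - effective_equipment_impairment(ie, bpl, loss_percent)
    return r_to_mos(r)

for loss in (0.0, 1.0, 5.0, 20.0):
    print(loss, round(estimate_mos(loss), 2))
```

Because the inputs are just a handful of parameters, this kind of estimate is cheap to compute online, but it can only be as accurate as the assumed relationship between those parameters and perceived quality.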

Objective evaluation methods also have many disadvantages:

Reference-based methods: can only be used before going online
No-reference, traditional signal domain: narrow application scenarios and poor robustness
No-reference, traditional parameter domain: accurate only under a limited range of weak-network conditions
No-reference, deep learning: limited application scenarios and corpora, and slightly higher complexity

In general, objective speech quality evaluation methods can be proposed from many different directions, but the performance and reliability of any such method must ultimately be judged by its correlation with subjective speech quality evaluation. This judgment is usually made through a fitting process between subjective and objective scores. Subjective and objective scores are collected for speech under different conditions, and a least-squares fit is performed on them, with the objective score on the horizontal axis and the subjective score on the vertical axis. Plotting the resulting curve gives the relationship between subjective and objective evaluation. The prediction mean squared error is usually used to reflect the degree of correlation between the two: the smaller it is, the better the correlation, that is, the better the objective method performs; the larger it is, the worse the correlation and the worse the objective method performs.
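The fitting process described above can be sketched in a few lines of Python. This is a minimal example that assumes a straight-line fit (in practice a higher-order polynomial is often used), and the score lists are made-up numbers for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def prediction_mse(xs, ys, slope, intercept):
    """Mean squared error between the fitted predictions and the subjective scores."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: objective scores (horizontal axis) and subjective MOS (vertical axis).
objective = [1.5, 2.0, 2.8, 3.4, 4.1]
subjective = [1.8, 2.1, 3.0, 3.3, 4.2]

slope, intercept = fit_line(objective, subjective)
print(slope, intercept, prediction_mse(objective, subjective, slope, intercept))
```

A prediction error near zero means the objective score, after mapping through the fitted line, tracks the subjective score closely.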

Development up to the present has relied mainly on a combination of online and offline testing, which is characterized by high precision, wide coverage, low complexity, and strong robustness:

The quality assessment is sufficiently accurate
It covers most business scenarios
It does not introduce too much algorithmic complexity
It is only weakly dependent on voice content

Uplink quality evaluation method: follows the chain capture, AEC (acoustic echo cancellation), NS (noise suppression), AGC (automatic gain control), diagnosis, combining independent detection with unified detection

Features: device capture stability, echo cancellation capability, noise suppression capability, volume adjustment capability

Downlink quality evaluation method: follows the chain encoding, transmission, decoding, playback

Take a certain laboratory as an example. The core indicators it uses when validating data to draw a global audio quality map are: codec performance, network quality, the quality of the weak-network countermeasure algorithms, and device playback capability.

In test cases covering multiple weak-network conditions, multiple devices, and multiple modes, the scores from this method differ from the POLQA reference scores by an MAE of less than 0.1 points and an MSE of less than 0.01, with a maximum error of less than 0.15 points.
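For readers who want to run the same kind of check on their own data, the three statistics quoted above can be computed as follows. This is a minimal sketch: the score lists are invented placeholders, not the laboratory's data, and the function name is an assumption of this example.

```python
def error_stats(predicted, reference):
    """Return (MAE, MSE, maximum absolute error) between two equal-length score lists."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("score lists must be non-empty and the same length")
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    mae = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, max(errors)

# Hypothetical scores: this method's output vs. a reference-based score for the same clips.
method_scores = [4.21, 3.85, 3.10, 2.42, 1.95]
reference_scores = [4.25, 3.80, 3.02, 2.50, 2.05]

mae, mse, max_err = error_stats(method_scores, reference_scores)
print(mae, mse, max_err)
print(mae < 0.1 and mse < 0.01 and max_err < 0.15)
```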

[Figure: test results for a certain device in a certain mode under multiple weak-network conditions]

Here is a brief word about NOMA (Non-Orthogonal Multiple Access), whose theoretical basis is multi-user information theory. NOMA is a very promising 5G technology. Its advantage is that it can improve spectral efficiency (rate/bandwidth) and the number of connections that can be served, which matches the explosive growth in data and access demand expected in the 5G era. The NOMA uplink and downlink offer a simple point of comparison for the uplink and downlink quality evaluation methods.

Uplink and downlink compared

1. The allocation of user transmit power is different.

In downlink NOMA, each user's transmit power is constrained by the base station's total transmit power and by the power allocated to other users, and users with different channel quality are allocated different power: users with poor channel quality (that is, low channel gain) are allocated high transmit power, and users with good channel quality are allocated low transmit power.

In the uplink, each user's transmit power is limited only by the maximum transmit power of its own device, and users with different channel quality are all allowed to transmit at their own maximum power. When the difference in channel quality is small, an allocation method that tries to improve the performance of users with good channel quality while still guaranteeing the performance of users with poor channel quality often ends up harming the users with poor channel quality.

2. SIC decoding order is different.

In the downlink, each receiver receives the superimposed signal from the base station, and each has its own SIC (successive interference cancellation) receiver, which decodes the received signal successively to obtain the signal intended for it. For a given receiver, all components of the superimposed signal pass through the same channel, so when calculating the rate every component is multiplied by the same channel gain. The component with the largest received power is therefore demodulated first.

The decoding order in the uplink is exactly the opposite. The transmitting users can be assumed to have hardware transmitters of equal performance, and although their channel gains differ, they all transmit at the maximum power of their transmitters. The signal from a user near the base station therefore arrives with greater received power (received power = transmit power x channel gain). The base station first demodulates the signal with the largest received power, which is the one with the largest channel gain, since the transmit powers are the same.

Decoding order: in the uplink, priority goes to users with good channel quality, that is, those with high received power at the receiver. In a NOMA system, therefore, whether in the uplink or the downlink, the receiver demodulates first the signal with the highest received power.
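The rule "strongest received signal first" can be illustrated with a toy two-user cluster. The sketch below only orders signals by received power; it does not perform any actual decoding or cancellation. The user names, channel gains, and power split are hypothetical values chosen for this example.

```python
def decode_order(received_power):
    """Return user names sorted by received power, strongest first (the SIC order)."""
    return sorted(received_power, key=received_power.get, reverse=True)

# Hypothetical channel gains: one user near the base station, one far away.
gains = {"near_user": 1.0, "far_user": 0.1}

# Downlink: the base station splits its power, giving more to the user with the weaker channel.
downlink_alloc = {"near_user": 0.2, "far_user": 0.8}
# At any one receiver, both components of the superimposed signal see the same channel gain.
gain_at_receiver = gains["near_user"]
downlink_rx = {user: p * gain_at_receiver for user, p in downlink_alloc.items()}

# Uplink: every user transmits at the same maximum power, so received power follows gain.
max_tx_power = 1.0
uplink_rx = {user: max_tx_power * g for user, g in gains.items()}

downlink_order = decode_order(downlink_rx)
uplink_order = decode_order(uplink_rx)
print(downlink_order)  # far_user's signal is decoded first
print(uplink_order)    # near_user's signal is decoded first
```

In terms of channel quality the two orders are opposite, yet in both cases the receiver starts with whichever signal arrives with the most power.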

3. The interference experienced by users is different.

In the downlink, because users with poor channel quality are allocated high transmit power, their signals are more likely to interfere with other users in the cluster; that is, users with good channel quality are more likely to suffer interference.

In the uplink, since the users each send signals to the base station and the base station receives their superposition, users with poor channel quality are more susceptible to interference than those with better channel quality.

4. The difficulty of realization is different.

The uplink is easier to implement than the downlink. NOMA ultimately requires multi-user detection and successive interference cancellation, and the latter is achieved by an SIC receiver that distinguishes the signals of different users by their received power. In the downlink, the base station sends the superimposed signal to the users, so multi-user detection and successive interference cancellation must be implemented at the user terminal. In the uplink, each user sends its own signal to the base station, so these techniques only need to be implemented at the base station. Compared with the base station, the user terminal has far more limited processing capability, so implementing multi-user detection and successive interference cancellation there is difficult.

If you are interested in NOMA, you can search for relevant papers and materials to learn more; it is positioned as a promising 5G technology.

Next, let's briefly talk about echo leakage, spurious sounds and noise, and low volume in real-time voice.

Reasons for echo leakage

Delay jitter: possible causes include busy threads, severe device nonlinearity, dual devices, and non-causality
Highly reverberant environment: the reverberation length exceeds the filter length
Capture signal overflow: causes the filter to fail to converge
Double talk: strong dependence on NLP (nonlinear processing), where it is easy to fix one problem at the expense of another
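The reverberation item in the list above comes down to simple arithmetic: an adaptive echo-cancellation filter can only model as much of the echo path as its length covers. The sketch below assumes a time-domain filter with one tap per sample, and the 128 ms filter and 300 ms echo tail are made-up figures for illustration.

```python
import math

def taps_needed(echo_tail_ms, sample_rate_hz):
    """Number of filter taps required to span an echo tail of the given duration."""
    return math.ceil(echo_tail_ms * sample_rate_hz / 1000)

def filter_covers_tail(filter_taps, echo_tail_ms, sample_rate_hz):
    """True if the filter is long enough to model the whole echo tail."""
    return filter_taps >= taps_needed(echo_tail_ms, sample_rate_hz)

sample_rate = 16000                           # assumed wideband speech sample rate
filter_taps = taps_needed(128, sample_rate)   # a hypothetical 128 ms filter
print(filter_taps)
print(taps_needed(300, sample_rate))
print(filter_covers_tail(filter_taps, 300, sample_rate))
```

When the check is false, the part of the echo that arrives after the filter's span cannot be cancelled linearly and is left for later stages to suppress, or leaks through.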

Spurious sounds and noise

Device noise: single-frequency tones, power-line hum, laptop fan noise, random noise
Environmental noise: babble, whistles, etc.
Signal overflow: popping sounds
Introduced by algorithms: residual echo, etc.
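Of the causes listed above, signal overflow is the easiest to check for directly: overflowed samples sit at or next to the limits of the sample format. The following is a minimal sketch assuming 16-bit PCM samples. The function name, the `margin` parameter, and the sample frame are illustrative assumptions, and any alarm threshold would need tuning on real data.

```python
def clipping_ratio(samples, full_scale=32767, margin=0):
    """Fraction of samples whose magnitude reaches full scale (within `margin`)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    clipped = sum(1 for s in samples if abs(s) >= full_scale - margin)
    return clipped / len(samples)

# Hypothetical frame of 16-bit PCM samples; two of the five are at the limits.
frame = [0, 1000, 32767, -32768, 200]
ratio = clipping_ratio(frame)
print(ratio)
```

A monitoring module could compute this per frame and raise an overflow flag when the ratio stays above a chosen threshold.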

Reasons for low volume

The device has weak capture capability, so the voice is quiet (this accounts for the majority of cases)
The device has weak playback capability
The analog gain or analog boost gain is small
The digital gain is small
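Low volume can be quantified by measuring the signal level and seeing how much digital gain would be needed to bring it up. Below is a minimal sketch assuming float samples in the range -1.0 to 1.0, with the level expressed in dB relative to an amplitude of 1.0. The function names and the example values are assumptions of this illustration.

```python
import math

def rms_level_db(samples):
    """RMS level in dB relative to an amplitude of 1.0 (float samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)

def apply_gain_db(samples, gain_db):
    """Apply a digital gain given in dB, limiting the result to [-1, 1]."""
    factor = 10 ** (gain_db / 20)
    return [max(-1.0, min(1.0, s * factor)) for s in samples]

# Hypothetical quiet capture: a tone with an amplitude of only 0.05.
quiet = [0.05 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(1600)]
louder = apply_gain_db(quiet, 12.0)

print(rms_level_db(quiet), rms_level_db(louder))
```

Digital gain raises the noise floor along with the speech and can push peaks into the limiter, which is one reason weak capture at the device is harder to fix than a small digital gain setting.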

Finally, the independent monitoring module can be divided into four parts: howling detection, spurious-sound detection, noise detection, and hardware detection.

A Brief Outlook

In the future, I think perception, feedback, and monitoring will be integrated and will become finer, wider, faster, and more complete: the view of internal state will be finer-grained, experience coverage will be wider, feedback will be faster, and coverage of calls will be more complete. I also believe that China's 5G technology, real-time audio and video transmission technology, and quality evaluation systems will keep getting better.

RTE Developer Community
