QoE (Quality of Experience) evaluation for real-time audio and video has long been an important topic in the industry; every year the RTE Real-Time Internet Conference features sessions on it. It remains important precisely because the RTE industry has lacked a good QoE evaluation method suited to real-time interactive scenarios.
Based on objective real-time data and the practical experience of large-scale commercial deployments around the world, Agora has officially launched a self-developed no-reference objective evaluation method for real-time audio user experience: the Agora Real-Time Audio MoS Method. The method has been integrated into Agora Audio/Video SDK 3.3.1 and later. Currently only the score for the downlink (codec-transmission-playback) link is provided; an uplink quality scoring interface will be available in the future. By calling this interface, developers can objectively judge, in real time, the audio interaction experience of the current user, which provides important reference data for optimizing their own business and operations. Click "Read the original text" and search for "mosValue" to browse the detailed documentation for this method.
At this point you may ask: what are MoS scores and QoE? What is the principle behind Agora's MoS method? How does it differ from existing open-source methods?
from "feeding" to QoS, QoE
When voice calls first appeared, there was no QoS (Quality of Service); people could only judge call quality by counting how many times they had to say "hello? hello?".
Later, network-based voice interaction faced the same problem, and QoS was born in this context. Its purpose is to provide end-to-end service quality assurance based on the demand characteristics of different businesses. QoS mechanisms are mainly designed for operators and networks, focusing on network performance and traffic management rather than the end-user experience.
People gradually found that an evaluation system built around QoS rarely matched the user's actual experience. As a result, QoE (Quality of Experience), which pays more attention to the user's perception, was proposed. Over a long period, QoE-based evaluation systems gradually developed, and several evaluation methods strongly related to QoE emerged in the communications field. These methods can be divided into subjective and objective evaluation methods, and both express the current level of user experience as a MoS (Mean Opinion Score).
Existing QoE methods
Subjective evaluation method
Subjective evaluation maps people's subjective perception to a quality score and is therefore limited by the expertise and individual differences of the listeners. There is no single industry standard for subjective audio testing: although the ITU provides recommendations and guidelines, each test has its own design and implementation details. A common practice is to invite enough listeners to collect a statistically significant sample, give them a certain amount of listening training, and then have them score the audio call on signal distortion, background intrusion, and overall quality.
Obtaining a reasonably accurate subjective voice quality score therefore requires a lot of manpower and time, which is why subjective tests are rarely used in the industry to evaluate communication quality.
Objective evaluation method
Objective evaluation methods fall into two categories: reference methods and no-reference methods.
A reference evaluation method quantifies the degradation of an impaired signal given a reference (lossless) signal, and produces an objective voice quality score close to the subjective one. In 2001, the ITU (International Telecommunication Union) P.862 standard defined the reference objective evaluation algorithm PESQ, which is mainly used to evaluate codec impairments in narrowband and wideband speech. The algorithm has been widely used in communication quality evaluation over the past two decades.
As technology developed, PESQ's scope of applicability became narrower and narrower, so in 2011 the P.863 standard defined a more comprehensive and accurate reference objective evaluation algorithm, POLQA. Compared with PESQ, POLQA can evaluate wider bandwidths, is more robust to noisy signals and delay, and produces a voice quality score closer to the subjective score.
No-reference objective evaluation methods do not require a reference signal: a quality score can be obtained by analyzing only the input signal itself or its parameters. Well-known no-reference objective evaluation methods include P.563, ANIQUE+, the E-model, and P.1201.
Among them, P.563 was proposed in 2004, mainly for quality evaluation of narrowband speech. ANIQUE+ was proposed in 2006, also targeting narrowband speech; according to its author, its scoring accuracy exceeds that of the reference method PESQ, but like PESQ it cannot reflect network impairments such as delay and packet loss, so neither is well suited to real-time interactive scenarios carried over the Internet. The E-model was proposed in 2003; unlike the two methods above, it quantifies impairments from VoIP link parameters rather than analyzing the signal domain directly. The P.1201 series was proposed in 2012; for the audio part, the standard likewise does not analyze the audio signal directly, but scores call quality based on network state and signal state.
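To make the parametric idea concrete: the E-model (ITU-T G.107) first combines link parameters into a transmission rating factor R,

R = R0 − Is − Id − Ie,eff + A

where R0 is the basic signal-to-noise term, Is covers impairments that occur simultaneously with the speech, Id covers delay impairments, Ie,eff covers codec and packet-loss impairments, and A is an "advantage" factor. R is then mapped to an estimated MoS; for 0 < R < 100 the standard mapping is

MOS = 1 + 0.035·R + 7·10⁻⁶ · R·(R − 60)·(100 − R)

with MOS = 1 for R ≤ 0 and about 4.5 for R ≥ 100. These formulas are quoted from G.107 for context only and are not part of Agora's method.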
Limited gains from AI algorithms, and hard to put into production
In recent years there have also been papers that use deep learning to score speech signals: the model is typically trained to fit the score that PESQ or another reference objective method would give the speech under test. But this approach has two obvious disadvantages:
- First, its accuracy depends on the model's computational budget, yet in a shipped product a feature that does not directly improve the user experience faces very strict limits on complexity and package size;
- Second, the robustness of such a method is severely tested by the diversity of RTE scenarios. For example, a voice chat room with background music or sound effects poses a big challenge to deep-learning-based methods.
Because reference objective evaluation methods need a lossless reference corpus, their main value lies in verifying the quality of an algorithm, app, or scenario before it goes online; once your app or scenario is live, they cannot evaluate its voice interaction experience. For post-release experience evaluation, the industry has hoped that no-reference objective methods could help, but regrettably, due to the diversity of scenarios and the complexity of the algorithms, the no-reference methods listed above are hard to apply fully in the RTE field.
Take the no-reference method P.563 as an example. The effective spectrum it can measure only reaches 4 kHz, it can only handle speech signals, and its robustness to different corpora is very poor. In the early days we made the core algorithm of P.563 run in real time and ported it into our SDK, but in testing the variance of its scoring error across different types of corpora was too large, and it was never commercialized. With deep learning it should in theory be possible to train an end-to-end evaluation algorithm that is more robust and has smaller error than P.563, but the algorithm's complexity and its low return on investment remain two stumbling blocks.
A new QoE evaluation method for real-time audio interactive scenarios
In summary, if we need a call-quality evaluation method that can run on the end device in real time, none of the above methods is suitable. We need to take a different approach and design a new evaluation system with the following characteristics:
- It must be robust to the corpora (music/voice/mixed) found in a variety of real-time interactive scenarios, without obvious evaluation errors.
- It must be able to evaluate multiple sampling rates (narrowband/wideband/super-wideband/fullband).
- Its complexity must be low enough to evaluate the voice quality of every channel in a multi-party call on any device without introducing noticeable performance overhead.
- The online quality score must align with offline test results; that is, for the same call, the score the method produces online during the call and the score obtained when the call is analyzed offline afterwards should be almost identical.
A QoE evaluation system with these characteristics effectively lets you carry out, after launch, the kind of quality evaluation you could previously only do before launch, and lets you see your users' current call experience score at any time. This is not just an upgrade of the evaluation system itself; it also helps you improve the user experience in a targeted and significant way.
Based on the characteristics of real-time interaction, Agora designed a real-time voice quality assessment method based on hidden states: the Agora Audio Interactive MoS Method. It combines signal processing, psychology, and deep learning, and can score call voice quality in real time with extremely low algorithmic complexity.
Figure: Comparison between the Agora Audio Interactive MoS Method and existing evaluation methods in the industry
The method is divided into two parts: the first is uplink quality assessment performed at the sending end, mainly used to score capture and signal processing; the second is downlink quality assessment performed at the receiving end, mainly used to score the audio after codec and network impairments. The overall architecture is shown in the figure below:
This article focuses on downlink quality evaluation, the part that affects the real-time interactive experience most. Here we also take the encoding module at the sending end into account, so this part covers the encoding-sending-transmission-decoding-post-processing-playback link. Unlike previous methods that fit a score from network status, we focus on monitoring the state of each module inside the SDK. The core idea is simple: on a completely unimpaired network, the downlink introduces only coding impairment before playback, and none of the weak-network countermeasure modules is triggered. Once the network fluctuates, those countermeasure modules start to operate, and each activation affects the final sound quality to a greater or lesser degree. Building the downlink quality evaluation algorithm therefore comes down to obtaining the mapping between the states of the SDK's modules and the resulting sound quality.

Of course, the actual algorithm design also considers several other factors that clearly affect the final listening experience and the quality score, such as the encoder architecture, coding efficiency on different corpora, the effective bit rate, and the network impairment model. Broadly speaking, state-of-the-art evaluation methods have an average scoring error (RMSE) of about 0.3 under weak-network conditions; the method we designed keeps the average error within 0.2 under the same conditions.
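As a purely illustrative sketch of the "module state to MoS" idea (this is our own simplification, not Agora's actual algorithm; the module names, weights, and formula below are invented), the mapping could look something like this:

```java
// Toy illustration of mapping downlink module activity to a 1.0-5.0 score.
// NOT Agora's algorithm: all names and coefficients here are made up.
public class DownlinkMosSketch {

    /** Hypothetical per-interval statistics reported by downlink modules. */
    public static class ModuleStats {
        double codecBaseMos;       // quality ceiling given codec + effective bitrate
        int concealedFrames;       // frames reconstructed by packet-loss concealment
        int jitterBufferStretches; // times the jitter buffer stretched/compressed audio
        int decoderResets;         // hard resets after long loss bursts
    }

    /** Start from the codec quality ceiling and subtract a penalty each time
     *  a weak-network countermeasure fired, then clamp to the MoS range. */
    public static double estimateMos(ModuleStats s) {
        double score = s.codecBaseMos
                - 0.02 * s.concealedFrames
                - 0.05 * s.jitterBufferStretches
                - 0.30 * s.decoderResets;
        return Math.max(1.0, Math.min(5.0, score));
    }

    public static void main(String[] args) {
        ModuleStats s = new ModuleStats();
        s.codecBaseMos = 4.4;        // e.g. wideband codec at a healthy bitrate
        s.concealedFrames = 12;      // some packet loss was concealed this interval
        s.jitterBufferStretches = 3;
        s.decoderResets = 0;
        System.out.printf("estimated MoS: %.2f%n", estimateMos(s));
    }
}
```

The real mapping is of course learned and calibrated against listening data rather than hand-tuned, but the structure (clean-network ceiling minus impairment contributions) conveys the design intuition.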
Based on this downlink quality evaluation algorithm, we have built a global audio network quality map that lets us monitor, in real time, the quality of calls happening in every corner of the world. For users, the MoS score shown there reflects the QoE of their current call:
This Audio Interactive MoS Method, already applied across Agora's global network, has been opened to the public in Agora Audio/Video SDK 3.3.1 and later. You can obtain the quality score of each call in real time through mosValue in AgoraRtcRemoteAudioStats. Currently only the score for the downlink (codec-transmission-playback) link is provided; an uplink quality scoring interface will be opened in the future. For a detailed description of the interface parameters, click "Read the original text", go to the Agora documentation center, and search for "mosValue".
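For example, on Android the per-user statistics arrive through the onRemoteAudioStats callback of IRtcEngineEventHandler (the Android counterpart of AgoraRtcRemoteAudioStats is the nested RemoteAudioStats class). A minimal sketch of reading mosValue there might look like the following; the callback and field names reflect the 3.x Android SDK and should be checked against the official documentation for your exact SDK version:

```java
import io.agora.rtc.IRtcEngineEventHandler;

// Minimal sketch: log the downlink MoS of each remote user as it is reported.
public class AudioQualityHandler extends IRtcEngineEventHandler {
    @Override
    public void onRemoteAudioStats(RemoteAudioStats stats) {
        // mosValue reflects the downlink (codec-transmission-playback) experience,
        // on a MoS-style scale of roughly 1 (poor) to 5 (excellent).
        android.util.Log.d("AudioQoE",
                "uid=" + stats.uid + " downlink MoS=" + stats.mosValue);
    }
}
```

The handler is registered the usual way, for example when creating the engine with RtcEngine.create(context, appId, handler), so no extra call is needed to start receiving the scores.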