
Real-time interaction is now ubiquitous, and video quality is a key indicator of the end-user experience. Relying solely on manual effort for large-scale real-time evaluation is unrealistic, so establishing and promoting an automatic video quality assessment system is the general trend.

But how should video quality be evaluated? Different concerns lead to different answers. For end users of live-streaming services, the focus is real-time quality monitoring; for practitioners who provide video technology services, the focus is fine-grained comparison between video algorithm versions, deciding whether to ship an improvement or roll it back. We therefore need a set of objective metrics for evaluating the subjective video quality experience, serving on the one hand as the client's experience evaluation and fault-detection signal, and on the other hand as the practitioner's reference for algorithm optimization. We call this evaluation system VQA (Video Quality Assessment).

The difficulty lies in two places: how to collect data, that is, how to quantify people's subjective evaluation of video quality, and how to build a model that can replace manual scoring.

In the following sections, I will first review the common evaluation methods in the industry, then introduce how Agora built the Agora-VQA model, and finally summarize future directions for improvement.

How does the industry implement video quality assessment?

Like other deep learning algorithms, building a video quality assessment model can be divided into two steps: collecting VQA data and training the VQA model. The whole training process amounts to an objective model imitating subjective annotations, and the quality of the fit is measured by subjective-objective consistency metrics. Subjective VQA annotation collects end-user feedback in the form of graded ratings and aims to quantify the video experience of real users; objective VQA provides a mathematical model that mimics subjective quality ratings.

Subjective: VQA Data Collection

Subjective evaluation consists of observers' subjective scores of video quality and can be divided into MOS (Mean Opinion Score) and DMOS (Differential Mean Opinion Score). MOS is an absolute evaluation of a video: it fits no-reference scenarios and directly quantifies the quality of massive UGC videos. DMOS is a relative evaluation: it fits reference scenarios and generally compares the differences between videos of the same content.

This article mainly discusses MOS. The procedure given in ITU-R Rec. BT.500 ensures the reliability and validity of subjective experiments. Subjective video perception is projected onto the interval [1, 5], described as follows:

Score  Rating     Experience description
5      Excellent  Good experience
4      Good       Perceptible, but does not affect the experience
3      Fair       Slight impact on the experience
2      Poor       Noticeable impact on the experience
1      Bad        Severe impact on the experience

Two issues need to be explained in detail here:

1. How to form MOS?

The recommendation given by ITU-R Rec. BT.500 is to establish a non-expert panel of at least 15 raters. After the raters annotate the videos, first compute the correlation between each rater and the overall mean and remove the scores with low correlation; then average the evaluations of the remaining raters. When more than 15 people participate in the scoring, the random error of the experiment can be kept within an acceptable range.
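As a concrete illustration, here is a minimal sketch of this screening-and-averaging procedure. The raters-by-videos matrix layout and the 0.7 rejection threshold are illustrative assumptions, not values prescribed by BT.500:

```python
import numpy as np

def compute_mos(ratings, min_corr=0.7):
    """Screen raters by correlation with the overall mean, then average.

    ratings: (n_raters, n_videos) array of 1-5 scores.
    min_corr: illustrative rejection threshold (not specified by BT.500).
    """
    ratings = np.asarray(ratings, dtype=float)
    overall_mean = ratings.mean(axis=0)            # per-video mean over all raters

    # Correlation of each rater's scores with the overall mean
    corrs = np.array([np.corrcoef(r, overall_mean)[0, 1] for r in ratings])

    kept = corrs >= min_corr                       # drop raters with low consistency
    mos = ratings[kept].mean(axis=0)               # MOS from the remaining raters
    return mos, kept

# Example: 3 raters scoring 4 videos (the third rater is inconsistent)
ratings = [[5, 4, 2, 1],
           [4, 4, 3, 2],
           [1, 5, 1, 5]]
mos, kept = compute_mos(ratings)
print(mos, kept)
```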

2. How to interpret MOS? To what extent does the MOS represent the opinion of "me"?

Different raters define the absolute intervals for "good" and "bad" differently and have different sensitivities to image quality impairments, but their judgments of "better" and "worse" still converge. In fact, in public databases such as the Waterloo QoE Database, the mean standard deviation can reach 0.7, indicating that the subjective feelings of different raters can differ by nearly one grade.

Objective: VQA model building

VQA tools can be classified in many ways. Based on how much information from the original reference video is required, they fall into three categories:

Full Reference

These methods rely on the complete original video sequence as the reference. Pixel-wise PSNR and SSIM are the most primitive comparison methods; their drawback is limited agreement with subjective quality. Netflix's VMAF metric also falls into this category.
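For reference, a minimal full-reference sketch using scikit-image, assuming the reference and distorted frames are available as same-sized 8-bit RGB arrays:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(ref_frame, dist_frame):
    """Frame-level PSNR and SSIM between a reference and a distorted frame.

    Both inputs are HxWx3 uint8 arrays of the same size.
    """
    psnr = peak_signal_noise_ratio(ref_frame, dist_frame, data_range=255)
    ssim = structural_similarity(ref_frame, dist_frame, data_range=255,
                                 channel_axis=-1)   # channel_axis for color images
    return psnr, ssim

# A video-level score is typically the average over frames:
# psnrs, ssims = zip(*(full_reference_scores(r, d)
#                      for r, d in zip(ref_frames, dist_frames)))
```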

Reduced Reference

Here the comparison is between corresponding features of the original video sequence and the received video sequence, which suits cases where the complete original sequence is unavailable. This approach sits between Full Reference and No Reference.

No Reference

The No Reference (hereinafter "NR") methods go further and remove the reliance on any additional information, evaluating the current video on its own. Because of how data is monitored online, reference videos are usually unavailable in real scenarios. Common NR metrics include DIIVINE, BRISQUE, BLIINDS, and NIQE. Without a reference video, their accuracy tends to fall short of full-reference and reduced-reference methods.

Subjective-objective consistency metrics

As mentioned above, pixel-based methods such as PSNR and SSIM have limited agreement with subjective scores. So how do we judge the quality of the various VQA tools themselves?

The industry usually evaluates an objective model in terms of prediction accuracy and prediction monotonicity. Prediction accuracy describes how well the objective model linearly predicts subjective evaluations; the related metrics are PLCC (Pearson Linear Correlation Coefficient) and RMSE (Root Mean Square Error). Prediction monotonicity describes the rank consistency of the scores and is measured by SROCC (Spearman Rank-Order Correlation Coefficient).
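A minimal sketch of these consistency metrics, assuming mos holds the subjective scores and pred the objective model's outputs:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def consistency_metrics(mos, pred):
    """Prediction accuracy (PLCC, RMSE) and monotonicity (SROCC)."""
    mos, pred = np.asarray(mos, float), np.asarray(pred, float)
    plcc, _ = pearsonr(pred, mos)      # linear correlation
    srocc, _ = spearmanr(pred, mos)    # rank correlation
    rmse = np.sqrt(np.mean((pred - mos) ** 2))
    return {"PLCC": plcc, "SROCC": srocc, "RMSE": rmse}
```

In practice, a monotonic nonlinear mapping is often fitted between the predictions and MOS before computing PLCC and RMSE, but the simple form above captures the idea.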

How does Agora-VQA implement video quality assessment?

Most public datasets, however, do not reflect real online conditions in either data volume or richness of content. To get closer to real data characteristics and cover different RTE (real-time interaction) scenarios, we built the Agora-VQA Dataset and trained the Agora-VQA Model on top of it. This is the industry's first deep-learning-based MOS evaluation model for subjective video experience that can run on mobile devices. It uses deep learning to estimate the MOS of the subjective video quality experience at the receiver end in RTE scenarios, enabling real-time assessment of online video quality.

Subjective: Agora-VQA Dataset

We built a subjective image quality evaluation database: a scoring system designed with reference to the ITU standards collects subjective scores, the data is then cleaned, and finally each video's subjective experience score (MOS) is obtained. The overall process is shown in the following figure:

[Figure: overall process of subjective data collection]

In the video curation stage, we first make sure that the content sources within one batch of scoring material are rich and varied, to avoid visual fatigue in the raters; second, we try to distribute image quality levels as evenly as possible. The figure below shows the score distribution obtained for one collected batch of videos:

[Figure: score distribution of a collected video batch]

In the subjective scoring stage, we built a scoring app. Each video is 4-8 s long, and 100 videos are collected per batch for scoring. For each rater, the total viewing time is kept within 30 minutes to avoid fatigue.

Finally, in the data cleaning stage there are two options. The first follows the ITU standard: compute the correlation between each rater and the overall mean, exclude raters with low correlation, and then average the evaluations of the remaining raters. The second selects the videos with the highest scoring consistency as a gold standard by computing the 95% confidence interval of each sample, and then filters out participants whose scores deviate strongly on those samples.
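The second option could look like the following sketch. The matrix layout, the number of gold-standard videos, and the deviation threshold are illustrative assumptions, not the exact production procedure:

```python
import numpy as np
from scipy import stats

def screen_by_gold_standard(ratings, n_gold=10, max_mean_dev=1.0):
    """Pick 'gold' videos with the tightest 95% confidence intervals,
    then drop raters whose scores deviate strongly on those videos.

    ratings: (n_raters, n_videos) matrix of 1-5 scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    n_raters, _ = ratings.shape

    # Half-width of the 95% CI of each video's mean score
    sem = stats.sem(ratings, axis=0)
    half_width = sem * stats.t.ppf(0.975, df=n_raters - 1)

    # Videos with the narrowest CI serve as the gold standard
    gold = np.argsort(half_width)[:n_gold]
    gold_mean = ratings[:, gold].mean(axis=0)

    # Mean absolute deviation of each rater from the gold-standard means
    dev = np.abs(ratings[:, gold] - gold_mean).mean(axis=1)
    kept = dev <= max_mean_dev
    return kept, gold
```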

Objective: Agora-VQA Model

On the one hand, to stay close to the user's actual subjective perception, and on the other hand, because the reference video cannot be obtained in live video and similar scenarios, our solution defines the objective VQA as a no-reference evaluation tool operating at the receiver's decoding resolution, monitoring video quality at the decoding end with deep learning methods.

Deep learning models can be trained end-to-end or non-end-to-end. In the end-to-end approach, because videos differ in spatial and temporal resolution, they must be sampled to a uniform size before training; in the non-end-to-end approach, features are first extracted with a pre-trained network, and a regression model is then trained on these video features to fit MOS.

[Figure]
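To illustrate the non-end-to-end route, here is a hedged sketch: frame features from a generic pre-trained CNN (ResNet-50 here purely as an example, not necessarily the backbone Agora-VQA uses), averaged over time and regressed to MOS:

```python
import torch
import torchvision.models as models
from sklearn.linear_model import Ridge

# Pre-trained backbone used only as a frozen feature extractor
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # drop the classification head
backbone.eval()

@torch.no_grad()
def video_features(frames):
    """frames: (T, 3, 224, 224) float tensor, already normalized.
    Returns one feature vector per video via average pooling over time."""
    feats = backbone(frames)               # (T, 2048) frame-level features
    return feats.mean(dim=0).numpy()       # temporal average pooling

# Regression from video features to MOS (X: stacked feature vectors, y: MOS labels)
# X = np.stack([video_features(v) for v in training_videos])
# regressor = Ridge(alpha=1.0).fit(X, y)
# predicted_mos = regressor.predict(video_features(test_video)[None, :])
```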

In the feature extraction stage, the original video can be sampled in different ways. The figure below (taken from the cited paper [1]) shows the correlation between different sampling strategies and subjective scores: sampling in the spatial domain has the largest impact on performance, while sampling in the temporal domain retains the highest correlation with the MOS of the original video.

[Figure: correlation of different sampling strategies with subjective scores, from [1]]

It is not only spatial-domain characteristics that affect the perceived image quality; temporal-domain distortion matters as well, including a temporal hysteresis effect (see [2]). This effect corresponds to two behaviors: the subjective experience drops immediately when video quality degrades, but improves only slowly when video quality recovers. We also took this phenomenon into account when modeling.

[Figure]
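To make the hysteresis idea concrete, here is a simplified sketch of the temporal pooling described in [2]. The window length and mixing weight are illustrative values, and the original paper weights the sorted current-window scores rather than simply averaging them:

```python
import numpy as np

def hysteresis_pool(frame_quality, window=12, alpha=0.8):
    """Simplified temporal hysteresis pooling over per-frame quality scores.

    A drop in quality is remembered (min over the recent past), while an
    improvement is only credited gradually (average over the coming window).
    """
    q = np.asarray(frame_quality, dtype=float)
    pooled = np.empty_like(q)
    for t in range(len(q)):
        memory = q[max(0, t - window):t + 1].min()   # bad moments linger in memory
        current = q[t:t + window].mean()             # improvements register slowly
        pooled[t] = alpha * memory + (1 - alpha) * current
    return pooled.mean()                             # clip-level score

# Example: quality dips briefly, then recovers
print(hysteresis_pool([4.5] * 30 + [2.0] * 10 + [4.5] * 30))
```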

Performance comparison with other VQA tools

Finally, let's look at the correlation performance of different image quality evaluation algorithms on KoNViD-1k and LIVE-VQC:

[Figure: correlation performance on KoNViD-1k and LIVE-VQC]

Comparison of the models' parameter counts and computational cost:

[Figure: comparison of model parameters and computational cost]

Agora-VQA has a substantial computational advantage over the large deep learning models from academia. This advantage makes it possible to evaluate the video communication experience directly on the device while maintaining acceptable accuracy, greatly reducing the computing resources required.

Outlook

Finally, Agora-VQA still has a long way to go before it fully captures QoE (Quality of Experience), the ultimate goal of depicting the user's subjective experience:

1) From decode resolution to render resolution

The concept of decoding resolution is defined relative to rendering resolution: playing a video on different devices, or stretching it to different window sizes on the same device, produces different subjective experiences. At present, Agora-VQA evaluates the quality of the video stream at the decoding end. In the next stage, we plan to support different devices and different stretch sizes, so as to get closer to the end user's perceived quality and achieve "what you see is what you get".

2) From the video clip to the entire call

The VQA datasets used for model training mostly consist of clips of 4-10 s. In an actual call, the recency effect needs to be considered, and the experience of a whole call may not be accurately modeled by simply tracking and reporting clip scores linearly. Guided by users' subjective feelings, the next step is to comprehensively consider clarity, fluency, interaction delay, audio-video synchronization, and so on, to form a time-varying experience evaluation method.

3) From experience score to failure classification

At present, Agora-VQA predicts video quality within the interval [1, 5] to an accuracy of 0.1. When video quality is poor, automatically locating the cause of the fault is also an important part of online quality monitoring, so we plan to build fault-detection capabilities on top of the model.

4) From real-time evaluation to industry standardization

At present, Agora-VQA is being iterated and polished as an internal system and will gradually be opened up. The plan is to integrate the online evaluation function into the SDK and release an offline evaluation tool.

The above is our research and practice in VQA. You are welcome to click "read the original text" to post in the developer community and discuss with us.

References

[1] Z. Ying, M. Mandal, D. Ghadiyaram and A. Bovik, "Patch-VQ: ‘Patching Up’ the Video Quality Problem," 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 14014-14024.

[2] K. Seshadrinathan and A. C. Bovik, "Temporal hysteresis model of time varying subjective video quality," 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 2011, pp. 1153-1156.

Dev for Dev column introduction

Dev for Dev (Developer for Developer) is a developer co-creation practice jointly initiated by Agora and the RTC developer community. Through technology sharing, exchange of ideas, and project co-construction from an engineer's perspective, it gathers the power of developers, mines and delivers the most valuable technical content and projects, and fully releases the creativity of technology.

