
By Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy and Megha Manohara

Terms and abbreviations

DLM - Detail Loss Metric
DMOS - Differential Mean Opinion Score
DSIS - Double Stimulus Impairment Scale
SVM - Support Vector Machine
VIF - Visual Information Fidelity
VMAF - Video Multimethod Assessment Fusion
SRCC - Spearman's rank correlation coefficient

Origin

At Netflix, we care about video quality, and we care about accurately measuring video quality at scale. Our answer is the Video Multimethod Assessment Fusion (VMAF) index, designed to reflect viewers' perception of our streaming quality. We are open-sourcing this tool and invite the research community to collaborate with us on this important project.

Our pursuit of high-quality video

We strive to provide our members with a great viewing experience: smooth video playback without annoying picture artifacts. An important part of this work is to provide video streams with the best perceived quality, given the limitations of network bandwidth and viewing devices. We are continuously working towards this goal through a variety of efforts.

First, we innovate in the field of video coding. Streaming video must be compressed using standards such as H.264/AVC, HEVC, and VP9 in order to stream at reasonable bitrates. When video is compressed too much or incorrectly, these techniques introduce quality impairments known as compression artifacts. Experts call them "blocking," "ringing," or "mosquito noise," but to the average viewer, the video simply doesn't look right. To this end, we regularly compare codec vendors on compression efficiency, stability, and performance, and integrate the best solutions on the market. We evaluate different video coding standards to ensure we are always at the cutting edge of compression technology: for example, we compare H.264/AVC, HEVC, and VP9, and we follow the emerging work of the Alliance for Open Media (AOM) and the Joint Video Exploration Team (JVET). Even within established standards, we continue to experiment with recipe decisions (see the Per-Title Encoding Optimization project) and rate allocation algorithms to take full advantage of the existing toolset.

We encode Netflix video streams in a cloud-based distributed media pipeline, which allows us to scale to meet business needs. To minimize the impact of poor source deliveries, software bugs, and the unpredictability of cloud instances (transient errors), we automate quality monitoring at various points in the pipeline. With this monitoring, we attempt to detect video quality issues at every conversion point, from source ingest through the pipeline.

Finally, as we iterate and run A/B tests in various areas of the Netflix ecosystem (such as adaptive streaming or content delivery network algorithms), we work to ensure that video quality is maintained or improved through system improvements. For example, improvements to adaptive streaming algorithms designed to reduce playback start delays or rebuffering should not degrade overall video quality in a streaming session.

All the above challenging work rests on a basic premise: we can accurately and efficiently measure the perceptual quality of large-scale video streams. Traditionally, in video codec development and research, two methods have been widely used to evaluate video quality: 1) visual subjective tests and 2) computation of simple metrics such as PSNR, or more recently SSIM [1].

Manual visual inspection is clearly infeasible, both operationally and economically, at the throughput of our production pipeline, A/B test monitoring, and encoding research experiments. Measuring image quality is an old problem, and many simple and practical solutions have been proposed. Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and the Structural Similarity Index (SSIM) are examples of metrics originally designed for images and later extended to video. These metrics are often used within codecs ("in-loop") to optimize encoding decisions and to report the final quality of encoded video. While researchers and engineers in the field are well aware that PSNR does not always reflect human perception, it remains the de facto standard for codec comparison and codec standardization efforts.
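
As a concrete illustration of how simple these signal-fidelity metrics are, here is a minimal sketch (in Python with NumPy, assuming 8-bit frames) of MSE and PSNR, where PSNR = 10*log10(MAX^2/MSE); it is illustrative rather than any particular codec's implementation.

import numpy as np

def psnr(reference, distorted, max_value=255.0):
    # Mean squared error between two frames of the same shape.
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)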

Building a Netflix-relevant dataset

To evaluate video quality assessment algorithms, we take a data-driven approach. The first step is to collect datasets relevant to our use case. While there are public databases for designing and testing video quality metrics, they lack the variety of content found in an actual streaming service like Netflix. Many of them are no longer state-of-the-art in terms of source and encoding quality; for example, they contain standard definition (SD) content and only cover older compression standards. Furthermore, since the general problem of assessing video quality is far broader than measuring compression artifacts, existing databases attempt to capture a much wider range of impairments, caused not only by compression but also by transmission losses, random noise, and geometric transformations.

Netflix's streaming service presents a unique set of challenges and opportunities for designing perceptual metrics that accurately reflect the quality of streaming video. For example:

Video source characteristics. Netflix has a huge collection of movies and TV shows that are diverse in genre, such as children's content, animation, fast-action movies, documentaries with original footage, and more. They also exhibit diverse low-level source characteristics such as film grain, sensor noise, computer-generated textures, consistently dark scenes, or very bright colors. Many quality metrics developed in the past were not tuned to accommodate such large variations in source content. For example, many existing databases lack animated content, and most do not take into account film grain, a very common signal characteristic in professional entertainment content.

Source of artifacts. Since Netflix streams are delivered using the robust Transmission Control Protocol (TCP), packet loss and bit errors are never a source of visual impairment. This leaves two types of artifacts introduced in the encoding process that ultimately affect the viewer's quality of experience (QoE): compression artifacts (due to lossy compression) and scaling artifacts (for lower bitrates, the video is downsampled before compression and later upsampled on the viewer's device). By tailoring a quality metric to cover compression and scaling artifacts only, trading generality for precision, its accuracy can be expected to outperform general-purpose metrics.

To build a dataset better suited to the Netflix use case, we selected 34 source clips (also called reference videos), each 6 seconds long, from popular TV shows and movies in the Netflix catalog, and combined them with a selection of publicly available clips. The source clips cover a wide range of high-level features (animation, indoor/outdoor, camera motion, close-up faces, people, water, apparent salience, number of objects) and low-level features (film grain noise, brightness, contrast, texture, motion, color variation, color richness, sharpness). Using the source clips, we encoded H.264/AVC video streams at resolutions from 384x288 to 1920x1080 and bitrates from 375 kbps to 20,000 kbps, resulting in about 300 distorted videos. This covers a wide range of video bitrates and resolutions to reflect the widely varying network conditions of Netflix members.

We then performed subjective tests to determine how non-expert observers rated the impairment of the encoded videos relative to the source clips. For the standardized subjective test, we used the Double Stimulus Impairment Scale (DSIS) method. The reference and distorted videos were displayed sequentially on a consumer TV with controlled ambient lighting (as described in Recommendation ITU-R BT.500-13 [2]). If the distorted video was encoded at a resolution smaller than the reference, it was upscaled to the source resolution before being displayed on the TV. Observers sat on a sofa in a living-room-like environment and were asked to rate the impairment on a scale of 1 (very annoying) to 5 (not noticeable). The scores of all observers were combined to generate a Differential Mean Opinion Score, or DMOS, for each distorted video, normalized to a scale of 0 to 100, with a score of 100 corresponding to the reference video. The collection of reference videos, distorted videos, and observer DMOS scores is referred to herein as the NFLX video dataset.
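
For illustration only, the sketch below shows one simplified way to turn per-observer DSIS ratings into a 0-100 score. The actual procedure follows ITU-R BT.500-13 and includes additional steps such as observer screening, so treat this as a rough approximation rather than our exact computation.

import numpy as np

def dmos_from_ratings(ratings):
    # ratings: impairment scores on the 1 (very annoying) to 5 (imperceptible)
    # scale, one per observer, for a single distorted clip.
    ratings = np.asarray(ratings, dtype=float)
    mean_opinion = ratings.mean()
    # Map the 1-5 scale linearly to 0-100, so a clip rated 5 by everyone
    # (indistinguishable from the reference) scores 100.
    return (mean_opinion - 1.0) / 4.0 * 100.0

print(dmos_from_ratings([4, 5, 4, 3, 5]))  # e.g. 80.0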

Traditional Video Quality Metrics

How do traditional, widely used video quality metrics correlate with "true" DMOS scores on the NFLX video dataset?

[Figure: still-frame patches from four distorted videos, with the PSNR and DMOS values discussed below]

Visual examples

Above, we see patches of still frames captured from 4 different distorted videos; the top two videos report a PSNR value of about 31 dB, while the bottom two report a PSNR value of about 34 dB. Yet one can hardly notice any difference in the "crowd" videos, while the difference between the two "fox" videos is much more pronounced. Human observers confirmed this, giving the two "crowd" videos DMOS scores of 82 (top) and 96 (bottom), while the two "fox" videos received DMOS scores of 27 and 58, respectively.

Detailed results

The figure below shows scatter plots with the observers' DMOS on the x-axis and the predicted scores from different quality metrics on the y-axis. These plots are obtained from a selected subset of the NFLX video dataset, which we label NFLX-TEST (see the next section for details). Each point represents one distorted video. We plot the results for four quality metrics: PSNR, SSIM [1], Multiscale FastSSIM [3], and PSNR-HVS [4].

For more details on SSIM, Multiscale FastSSIM and PSNR-HVS, see the publications listed in the References section. For these three metrics, we used the implementation from the Daala codebase [5], so the titles in subsequent figures are prefixed with "Daala".

[Figure: scatter plots of predicted score vs. DMOS on NFLX-TEST for PSNR, SSIM, Multiscale FastSSIM, and PSNR-HVS]

Note: Points with the same color correspond to distorted videos originating from the same reference video. Some DMOS scores may exceed 100 due to subject variability and the reference videos being normalized to 100.

As can be seen from the figure, these metrics fail to provide scores that consistently predict the observers' DMOS ratings. For example, looking at the PSNR plot in the upper left corner, for PSNR values around 35 dB the "true" DMOS values range from 10 (impairments are annoying) to 100 (impairments are imperceptible). Similar conclusions can be drawn for SSIM and Multiscale FastSSIM, where scores close to 0.90 can correspond to DMOS values from 10 to 100. Above each plot, we report the Spearman rank correlation coefficient (SRCC), the Pearson product-moment correlation coefficient (PCC), and the root mean squared error (RMSE) for each metric, calculated after a nonlinear logistic fit, as described in Annex 3.1 of ITU-R BT.500-13 [2]. SRCC and PCC values close to 1.0 and RMSE values close to 0 are desirable. Among the four metrics, PSNR-HVS exhibits the best SRCC, PCC, and RMSE values, but still lacks prediction accuracy.
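
For readers who want to reproduce this kind of evaluation on their own data, the sketch below shows one way to compute SRCC, PCC, and RMSE after a logistic fit using SciPy. The three-parameter logistic is a common choice for this mapping and may differ from the exact form used in our figures.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import spearmanr, pearsonr

def logistic(x, b1, b2, b3):
    # Monotonic logistic mapping from raw metric scores to the DMOS scale.
    return b1 / (1.0 + np.exp(-b2 * (x - b3)))

def evaluate_metric(metric_scores, dmos):
    metric_scores = np.asarray(metric_scores, dtype=float)
    dmos = np.asarray(dmos, dtype=float)
    # Fit the logistic mapping; the initial guesses are ballpark values.
    p0 = [dmos.max(), 1.0, metric_scores.mean()]
    params, _ = curve_fit(logistic, metric_scores, dmos, p0=p0, maxfev=20000)
    predicted = logistic(metric_scores, *params)
    srcc = spearmanr(metric_scores, dmos).correlation  # rank-based, unaffected by the fit
    pcc, _ = pearsonr(predicted, dmos)
    rmse = float(np.sqrt(np.mean((predicted - dmos) ** 2)))
    return srcc, pcc, rmse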

To perform meaningfully across a variety of content, a metric should also provide a good relative quality score, i.e. a delta in the metric should provide information about the delta in perceived quality. In the figure below, we selected three typical reference videos, a noisy video (blue), a CG animation (green), and a TV drama (rust), and plotted the predicted scores of their distorted videos against DMOS. To be effective as a relative quality score, a constant slope across different clips within the same range of the quality curve is desirable. For example, referring to the PSNR plot below, in the range of 34 dB to 36 dB a PSNR change of about 2 dB for the TV drama corresponds to a DMOS change of about 50 (50 to 100), but a similar 2 dB change for the CG animation in the same range corresponds to a DMOS change of less than 20 (40 to 60). Although SSIM and FastSSIM show more consistent slopes for the CG animation and TV drama clips, their performance is still insufficient.

[Figure: metric score vs. DMOS for three reference clips: noisy video (blue), CG animation (green), TV drama (rust)]

In conclusion, we see that traditional metrics do not work well for our content. To address this issue, we adopted a machine-learning-based model to design a metric that aims to reflect human perception of video quality. This metric is discussed in the next section.

Our Approach: Video Multi-Method Assessment Fusion (VMAF)

Building on our research collaboration [6, 7] with Prof. C.-C. Jay Kuo and his group at USC, we developed Video Multimethod Assessment Fusion, or VMAF, which predicts subjective quality by combining multiple basic quality metrics. The rationale is that each basic metric may have its own strengths and weaknesses with respect to source content characteristics, artifact types, and levels of distortion. The basic metrics are "fused" into a final metric using a machine learning algorithm, in our case a support vector machine (SVM) regressor, which assigns a weight to each basic metric; the final metric preserves all of their strengths and delivers a more accurate final score. The machine learning model is trained and tested using opinion scores obtained through subjective experiments (in our case, the NFLX video dataset).

The current version of the VMAF algorithm and model released as part of the VMAF Development Kit open source software (denoted as VMAF 0.3.1) uses the following basic metrics fused by support vector machine (SVM) regression [8]:
Visual Information Fidelity (VIF) [9]. VIF is a widely adopted image quality metric based on the premise that quality is complementary to a measure of information fidelity loss. In its original form, the VIF score is measured as a loss of fidelity combining four scales. In VMAF, we adopt a modified version of VIF in which the loss of fidelity at each scale is included as a separate basic metric.

Detail Loss Metric (DLM) [10]. DLM is an image quality metric whose rationale is to separately measure the loss of detail that affects content visibility and the additive impairments that distract the viewer. The original metric combines DLM and the additive impairment measure (AIM) to produce a final score. In VMAF, we adopt only DLM as a basic metric. Special care is taken for special cases, such as black frames, where the numerical evaluation of the original formulation breaks down.

Both VIF and DLM are image quality metrics. We further introduce the following simple feature to account for the temporal characteristics of video:
Motion. This is a simple measure of the temporal difference between adjacent frames, computed as the mean absolute pixel difference of the luminance component (see the sketch after this list).
These basic metrics and features were selected from other candidates through iterations of testing and validation.
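
As an illustration of how simple the motion feature is, here is a minimal sketch (assuming 8-bit luminance planes as NumPy arrays); it mirrors the description above but is not the VDK implementation.

import numpy as np

def motion_feature(luma_frames):
    # luma_frames: sequence of 2-D uint8 luminance planes, one per frame.
    scores = []
    for prev, curr in zip(luma_frames[:-1], luma_frames[1:]):
        diff = np.abs(curr.astype(np.float64) - prev.astype(np.float64))
        scores.append(diff.mean())  # mean absolute pixel difference
    return scores  # one value per frame transition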

We compare the accuracy of VMAF with the other quality metrics described above. To avoid unfairly overfitting VMAF to the dataset, we first split the NFLX dataset into two subsets, called NFLX-TRAIN and NFLX-TEST. The two sets have non-overlapping reference clips. The SVM regressor is then trained using the NFLX-TRAIN dataset and tested on NFLX-TEST.
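
Conceptually, the fusion step amounts to fitting a regressor from per-clip feature vectors to DMOS. The sketch below uses scikit-learn's NuSVR with placeholder random data standing in for extracted features and subjective scores; it illustrates the idea (and the hyperparameters from the sample configuration later in this post), not the actual VDK training code.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_train = rng.random((200, 6))        # placeholder features: (clips, basic metrics)
y_train = rng.uniform(0, 100, 200)    # placeholder DMOS labels for the training clips
X_test = rng.random((50, 6))

scaler = MinMaxScaler()                     # normalize each feature to [0, 1]
model = NuSVR(nu=0.5, C=1.0, gamma=0.85)    # hyperparameters as in the sample config below
model.fit(scaler.fit_transform(X_train), y_train)
predicted_quality = model.predict(scaler.transform(X_test)).clip(0.0, 100.0)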
The graph below shows the performance of VMAF on the NFLX-TEST dataset and on the selected reference clips - the noisy video (blue), the CG animation (green), and the TV drama (rust). For ease of comparison, we repeat the plots for PSNR-HVS, the best-performing metric from the previous section. Clearly, VMAF performs much better.

[Figure: VMAF and PSNR-HVS scores vs. DMOS on NFLX-TEST]

We also compare VMAF with a video quality model with variable frame delay (VQM-VFD) [11], which is considered by many to be the state-of-the-art in the field. VQM-VFD is an algorithm that uses a neural network model to fuse low-level features into a final metric. It is similar to VMAF in spirit except that it extracts features such as spatial and temporal gradients at a lower level.

[Figure: VQM-VFD scores vs. DMOS on NFLX-TEST]

It is clear that VQM-VFD performs close to VMAF on the NFLX-TEST dataset. Since the VMAF approach allows new basic metrics to be incorporated into its framework, VQM-VFD could also serve as a basic metric for VMAF.

The table below lists the SRCC, PCC, and RMSE figures on the NFLX-TEST dataset for VMAF models obtained by fusing different combinations of the basic metrics, together with the final performance of VMAF 0.3.1. We also list the performance of VMAF augmented with VQM-VFD. The results support our hypothesis that an intelligent fusion of high-performance quality metrics yields better correlation with human perception.

NFLX-TEST dataset

[Table: SRCC, PCC, and RMSE for different basic-metric combinations on NFLX-TEST]

Summary of results

In the tables below, we summarize the SRCC, PCC, and RMSE of the different metrics discussed earlier on the NFLX-TEST dataset and on three popular public datasets: VQEG HD (vqeghd3 collection only) [12], the LIVE Video Database [13], and the LIVE Mobile Video Database [14]. The results show that VMAF 0.3.1 outperforms the other metrics on all datasets except LIVE, where it still delivers performance competitive with the best-performing VQM-VFD. Since VQM-VFD shows good correlation across all four datasets, we are experimenting with VQM-VFD as a basic metric for VMAF; it is not part of the open-source release VMAF 0.3.1, but it may be integrated into subsequent versions.

NFLX-TEST dataset

[Table: SRCC, PCC, and RMSE on the NFLX-TEST dataset]

LIVE dataset*

[Table: SRCC, PCC, and RMSE on the LIVE dataset]

*Only for compression impairments (H.264/AVC and MPEG-2 video)

VQEGHD3 dataset*

[Table: SRCC, PCC, and RMSE on the VQEGHD3 dataset]

*For source content SRC01 to SRC09 and streaming-relevant impairments HRC04, HRC07, and HRC16 to HRC21

LIVE Mobile dataset
[Table: SRCC, PCC, and RMSE on the LIVE Mobile dataset]

VMAF Development Kit (VDK) open source package

To deliver high-quality video over the Internet, we believe the industry needs good perceptual video quality metrics that are both practical to use and easy to deploy at scale. We developed VMAF to help us meet this need. Today, we are open-sourcing the VMAF Development Kit (VDK 1.0.0) package on GitHub under the Apache License, Version 2.0. By open-sourcing the VDK, we hope it can evolve and improve in performance over time.

The feature extraction (including basic metric calculation) portion of the VDK core is computationally intensive, so it is written in C for efficiency. The control code is written in Python for rapid prototyping.

The package comes with a simple command line interface that allows the user to run VMAF in single mode (run_vmaf command) or batch mode (run_vmaf_in_batch command, optionally enabling parallel execution). Additionally, since feature extraction is the most expensive operation, users can also store the feature extraction results in a data store for later reuse.

The package also provides a framework for further customizing the VMAF model based on the dataset it is trained on, the basic metrics and features used, and the regressor and its hyperparameters.

The command run_training accepts three configuration files: a dataset file containing information about the training dataset, a feature parameter file, and a regressor model parameter file (containing the regressor hyperparameters). Below is sample code that defines a dataset, a set of selected features, a regressor, and its hyperparameters.

##### Define the dataset #####
dataset_name = 'example'
yuv_fmt = 'yuv420p'
width = 1920
height = 1080
ref_videos = [
  {'content_id': 0, 'path': 'checkerboard.yuv'},
  {'content_id': 1, 'path': 'flat.yuv'},
]
dis_videos = [
  {'content_id': 0, 'asset_id': 0, 'dmos': 100, 'path': 'checkerboard.yuv'},  # ref
  {'content_id': 0, 'asset_id': 1, 'dmos': 50,  'path': 'checkerboard_dis.yuv'},
  {'content_id': 1, 'asset_id': 2, 'dmos': 100, 'path': 'flat.yuv'},  # ref
  {'content_id': 1, 'asset_id': 3, 'dmos': 80,  'path': 'flat_dis.yuv'},
]
##### Define the features #####
feature_dict = {
  # VMAF_feature/Moment_feature are aggregate features
  # motion, adm2, dis1st are atomic features
  'VMAF_feature': ['motion', 'adm2'],
  'Moment_feature': ['dis1st'],  # first moment of the distorted video
}
##### Define the regressor and its hyperparameters #####
model_type = 'LIBSVMNUSVR'  # libsvm NuSVR regressor
model_param_dict = {
  # ==== preprocess: normalize each feature ==== #
  'norm_type': 'clip_0to1',  # rescale to within [0, 1]
  # ==== postprocess: clip the final quality score ==== #
  'score_clip': [0.0, 100.0],  # clip to within [0, 100]
  # ==== libsvmnusvr parameters ==== #
  'gamma': 0.85,  # selected
  'C': 1.0,  # default
  'nu': 0.5,  # default
  'cache_size': 200  # default
}

Finally, the FeatureExtractor base class can be extended to develop a customized VMAF algorithm. This can be done by experimenting with the other available basic metrics and features, or by inventing new ones. Similarly, the TrainTestModel base class can be extended to test other regression models. See CONTRIBUTING.md for details. Users can also experiment with alternative machine learning algorithms using existing open-source Python libraries such as scikit-learn [15], cvxopt [16], or tensorflow [17]. An example integration of scikit-learn's random forest regressor is included in the package.
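
For example, one could fuse the same features with a different regressor. The sketch below does this with scikit-learn's RandomForestRegressor on placeholder data; it is a plain scikit-learn illustration of swapping the regressor, not the VDK TrainTestModel API.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 6))       # placeholder basic-metric features
y = rng.uniform(0, 100, 200)   # placeholder DMOS labels

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
scores = forest.predict(X).clip(0.0, 100.0)  # clip to the [0, 100] quality scale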

The VDK package includes the VMAF 0.3.1 algorithm with the selected features and a trained SVM model based on the subjective scores collected on the NFLX video dataset. We invite the community to use the package to develop improved features and regressors for perceptual video quality assessment. We also encourage users to test VMAF 0.3.1 on other datasets and help improve it for our use case and possibly extend it to other use cases.

Our open questions on quality assessment

Viewing conditions. Netflix supports thousands of active devices, including smart TVs, game consoles, set-top boxes, computers, tablets, and smartphones, so our members experience a wide variety of viewing conditions. The viewing set-up and display can significantly affect the perception of quality. For example, a Netflix member watching a 720p movie encoded at 1 Mbps on a 4K 60-inch TV may perceive the quality of that same stream very differently than if it were watched on a 5-inch smartphone. The current NFLX video dataset covers a single viewing condition: watching a TV at a standard distance. To enhance VMAF, we are conducting subjective tests under other viewing conditions. With more data, we can generalize the algorithm so that viewing conditions (display size, distance from the screen, etc.) can be fed into the regressor.

Temporal pooling. Our current VMAF implementation computes a quality score on a per-frame basis. In many use cases, it is desirable to temporally pool these scores into a single value summarizing a longer period of time: for example, a score per scene, per regular time interval, or per entire movie. Our current approach is simple temporal pooling that takes the arithmetic mean of the per-frame values. However, this approach risks "hiding" poor-quality frames. Pooling algorithms that give more weight to lower scores may be more accurate with respect to human perception. A good pooling mechanism is especially important when the aggregate score is used to compare encodes with differing quality fluctuations between frames, or as the target metric when optimizing an encode or a streaming session. A perceptually accurate temporal pooling mechanism for VMAF and other quality metrics remains an open and challenging problem.
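
To make the trade-off concrete, the sketch below contrasts the current arithmetic-mean pooling with two illustrative alternatives that weight poor frames more heavily (a harmonic mean and a low percentile). These alternatives are examples for discussion, not the pooling VMAF uses.

import numpy as np

def pool_scores(frame_scores, method="mean"):
    s = np.asarray(frame_scores, dtype=float)
    if method == "mean":
        return s.mean()                                     # current approach: arithmetic mean
    if method == "harmonic":
        return len(s) / np.sum(1.0 / np.maximum(s, 1e-3))   # penalizes low scores more strongly
    if method == "p5":
        return np.percentile(s, 5)                          # focus on the worst 5% of frames
    raise ValueError(method)

scores = [95, 94, 30, 96, 93]  # one very bad frame is "hidden" by the arithmetic mean
print(pool_scores(scores, "mean"), pool_scores(scores, "harmonic"), pool_scores(scores, "p5"))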

A consistent metric. Since VMAF incorporates full-reference basic metrics, it is highly dependent on the quality of the reference. Unfortunately, video source quality is not consistent across all titles in the Netflix catalog. Sources come into our system at resolutions ranging from SD to 4K. Even at the same resolution, the best source available may still suffer from certain quality issues. As a result, comparing (or summarizing) VMAF scores across different titles can be inaccurate. For example, when a video stream generated from an SD source achieves a VMAF score of 99 out of 100, it by no means has the same perceived quality as a video encoded from an HD source with the same score of 99. For quality monitoring, we would very much like to compute absolute quality scores that are consistent across sources. After all, when viewers watch a Netflix show, they have no reference other than the picture delivered to their screen. We want an automated way to predict their perception of the quality of the video delivered to them, taking into account all the factors that contributed to the final video rendered on that screen.
Summary

We developed VMAF 0.3.1 and the VDK 1.0.0 package to help us deliver the highest-quality video streams to our members. As part of our ongoing pursuit of quality, our teams use them every day to evaluate video codecs and encoding parameters and strategies. VMAF and other metrics have been integrated into our encoding pipeline to improve our automated QC. We are in the early stages of using VMAF as one of the client-side metrics to monitor system-wide A/B tests.

In today's Internet landscape, it is important to improve video compression standards and to make informed decisions in practical encoding systems. We believe that relying on traditional metrics, which do not always correlate with human perception, hinders real progress in video coding technology. Yet it is not feasible to always rely on manual visual testing. VMAF is our attempt to address this problem, using samples from our content to help design and validate the algorithms. Similar to how the industry collaborates on developing new video standards, we invite the community to openly collaborate on improving video quality measures, with the ultimate goal of using bandwidth more efficiently and delivering visually pleasing video to all.

Acknowledgments

We would like to thank the following individuals for their help with the VMAF project: Joe Yuchieh Lin, Eddy Chi-Hao Wu, Prof. C.-C. Jay Kuo (University of Southern California), Prof. Patrick Le Callet (University of Nantes), and Todd Goodall.

References

  1. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
  2. BT.500 : Methodology for the Subjective Assessment of the Quality of Television Pictures, https://www.itu.int/rec/R-REC-BT.500
  3. M.-J. Chen and A. C. Bovik, “Fast Structural Similarity Index Algorithm,” Journal of Real-Time Image Processing, vol. 6, no. 4, pp. 281–287, Dec. 2011.
  4. N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin, “On Between-coefficient Contrast Masking of DCT Basis Functions,” in Proceedings of the 3rd International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM ’07), Scottsdale, Arizona, Jan. 2007.
  5. Daala codec. https://git.xiph.org/daala.git/
  6. T.-J. Liu, J. Y. Lin, W. Lin, and C.-C. J. Kuo, “Visual Quality Assessment: Recent Developments, Coding Applications and Future Trends,” APSIPA Transactions on Signal and Information Processing, 2013.
  7. J. Y. Lin, T.-J. Liu, E. C.-H. Wu, and C.-C. J. Kuo, “A Fusion-based Video Quality Assessment (FVQA) Index,” APSIPA Transactions on Signal and Information Processing, 2014.
  8. C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
  9. H. Sheikh and A. Bovik, “Image Information and Visual Quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.
  10. S. Li, F. Zhang, L. Ma, and K. Ngan, “Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments,” IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, Oct. 2011.
  11. S. Wolf and M. H. Pinson, “Video Quality Model for Variable Frame Delay (VQM_VFD),” U.S. Dept. Commer., Nat. Telecommun. Inf. Admin., Boulder, CO, USA, Tech. Memo TM-11–482, Sep. 2011.
  12. Video Quality Experts Group (VQEG), “Report on the Validation of Video Quality Models for High Definition Video Content,” June 2010, http://www.vqeg.org/
  13. K. Seshadrinathan, R. Soundararajan, A. C. Bovik and L. K. Cormack, “Study of Subjective and Objective Quality Assessment of Video”, IEEE Transactions on Image Processing, vol.19, no.6, pp.1427–1441, June 2010.
  14. A. K. Moorthy, L. K. Choi, A. C. Bovik and G. de Veciana, “Video Quality Assessment on Mobile Devices: Subjective, Behavioral, and Objective Studies,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652–671, Oct. 2012.
  15. scikit-learn: Machine Learning in Python. http://scikit-learn.org/stable/
  16. CVXOPT: Python Software for Convex Optimization. http://cvxopt.org/
  17. TensorFlow. https://www.tensorflow.org/
