
Statement: This article is republished from the Amazon Science website, where the original English text is available; the translation is provided by the developer community.

Streaming video can be affected by defects introduced during recording, encoding, packaging, or delivery, so most subscription video services, such as Amazon Prime Video, constantly evaluate the quality of the content they stream.

Manual content review -- known as eyes-on testing -- doesn't scale well, and it brings its own challenges, such as differences in reviewers' perceptions of quality. More common in the industry is the use of digital signal processing to detect anomalies in video signals that are often associated with defects.

Animation: The initial version of Amazon Prime Video's block corruption detector used a residual neural network to generate a map indicating the probability of corruption at each image location, binarized the map, and calculated the ratio between the corrupted area and the total image area.

Three years ago, Prime Video's Video Quality Analysis (VQA) group began using machine learning to identify flaws in content captured from devices such as game consoles, TVs, and set-top boxes to validate new app versions or change encoding profiles offline. More recently, we've been applying the same technology to things like real-time quality monitoring of our thousands of channels and live events and analyzing new catalog content at scale.

Our team at VQA trains computer vision models to watch videos and find issues that can affect the customer viewing experience, such as blocky frames, unexpected black frames, and audio noise. This lets us operate at the scale of hundreds of thousands of live events and catalog titles.

An interesting challenge we faced was the lack of positive cases in the training data, because audiovisual defects are extremely rare in Prime Video content. We addressed this challenge with a dataset that introduces simulated defects into clean content. After developing a detector on this dataset, we validated that it transfers to production content by testing it on a set of real-world defects.

Example of how we introduce audio clicks into clean audio:

Waveform of the clean audio.

Waveform with clicks added.

Spectrogram of the clean audio.

Spectrogram with clicks added.
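To make the idea of defect simulation concrete, here is a minimal sketch, not Prime Video's actual tooling, of injecting click artifacts into a clean waveform with NumPy; the click rate, duration, and amplitude are arbitrary values chosen for illustration.

```python
import numpy as np

def add_clicks(clean, sample_rate, clicks_per_second=1.0, click_ms=2.0,
               amplitude=0.8, seed=0):
    """Return a copy of `clean` (float32 samples in [-1, 1]) with short
    click artifacts inserted at random positions."""
    rng = np.random.default_rng(seed)
    corrupted = clean.copy()
    n_clicks = max(1, int(clicks_per_second * len(clean) / sample_rate))
    click_len = max(1, int(click_ms * 1e-3 * sample_rate))
    for start in rng.integers(0, len(clean) - click_len, size=n_clicks):
        # A click is a short burst that jumps away from the surrounding signal.
        burst = amplitude * rng.choice([-1.0, 1.0]) * np.ones(click_len)
        corrupted[start:start + click_len] = np.clip(
            corrupted[start:start + click_len] + burst, -1.0, 1.0)
    return corrupted

# Example: add clicks to one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)
corrupted = add_clicks(clean, sr, clicks_per_second=5)
```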

We built detectors for 18 different types of defects, including video freezes and stutters, video tearing, audio-video synchronization issues, and subtitle quality issues. Below, we take a closer look at three of them: block corruption, audio artifacts, and audiovisual synchronization issues.

Block corruption

One disadvantage of using digital signal processing for quality analysis is the difficulty of distinguishing certain types of content from defective content. For example, to a signal processor, a crowded scene or a high-motion scene can look like a scene with block corruption, in which a corrupted transmission causes blocks of pixels within a frame to shift or to all take on the same color value.

Video: Example of block corruption

To detect block corruption, we use a residual neural network designed to enable higher layers to explicitly correct errors (residuals) missed by lower layers. We replace the last layer of the ResNet18 network with a 1x1 convolution (conv6 in the network diagram).

Figure: Architecture of the block corruption detector.

The output of this layer is a two-dimensional map in which each element is the probability of block corruption in a particular image region. The size of the map depends on the size of the input image: in the network diagram, a 224 x 224 x 3 image is passed to the network and the output is a 7 x 7 map, while passing an HD image to the network yields a 34 x 60 map.
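As an illustration of this architecture, the following sketch (assuming PyTorch and torchvision, and not the production model) replaces ResNet18's pooling and fully connected head with a 1x1 convolution, so the network outputs a spatial probability map whose size follows the input resolution:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class BlockCorruptionMap(nn.Module):
    """ResNet18 backbone with its average-pool/fc head replaced by a 1x1
    convolution, so the output is a spatial map of corruption probabilities."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)  # randomly initialized for the sketch
        # Keep everything up to the last residual stage; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.conv6 = nn.Conv2d(512, 1, kernel_size=1)  # the 1x1 convolution

    def forward(self, x):
        prob_map = torch.sigmoid(self.conv6(self.features(x)))
        return prob_map.squeeze(1)  # (batch, H/32, W/32)

model = BlockCorruptionMap().eval()
with torch.no_grad():
    print(model(torch.rand(1, 3, 224, 224)).shape)    # torch.Size([1, 7, 7])
    print(model(torch.rand(1, 3, 1080, 1920)).shape)  # torch.Size([1, 34, 60])
```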

In the initial version of the tool, we binarized the map and calculated the corrupted-area ratio as corruptionArea = areaPositive / totalArea. If this ratio exceeded a certain threshold (0.07 proved to work well), we marked the frame as having block corruption. (See the animation above.)
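Written out, that decision rule is just a few lines; in this sketch the 0.5 binarization cutoff is an assumed value, while the 0.07 area threshold is the one mentioned above.

```python
import numpy as np

def has_block_corruption(prob_map, binarize_at=0.5, area_threshold=0.07):
    """Binarize the per-region corruption probabilities and flag the frame
    when the corrupted fraction of the image exceeds the area threshold."""
    positive = prob_map >= binarize_at
    corruption_area = positive.sum() / positive.size  # areaPositive / totalArea
    return corruption_area > area_threshold

# Example with a 34 x 60 map from an HD frame (random values for illustration).
frame_map = np.random.rand(34, 60)
print(has_block_corruption(frame_map))
```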

However, in the current version of the tool, we move the decision function into the model, so it is learned along with feature extraction.

Audio artifact detection

"Audio artifacts" are unwanted sounds in the audio signal, possibly introduced through the recording process or data compression. In the latter case, this is the audio equivalent of the corrupted block. However, sometimes, for creative reasons, artefacts are also introduced.

To detect audio artifacts in videos, we use a no-reference model, which means that during training it does not have access to the clean audio as a standard of comparison. Based on a pretrained audio neural network, the model classifies one-second audio clips as defect free or as containing audio hum, audio hiss, audio distortion, or audio clicks.

Currently, the model achieves a balanced accuracy of 0.986 on our proprietary simulated dataset. For more information on this model, see our paper "A no-reference model for detecting audio artifacts using pretrained audio neural networks", presented at this year's IEEE Winter Conference on Applications of Computer Vision (WACV).
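For a rough feel for the shape of such a classifier, here is a toy sketch: log-mel features of a one-second clip feed a small convolutional network with five output classes. The feature settings and the tiny network are placeholders and are not the pretrained audio neural network used in the paper.

```python
import torch
import torch.nn as nn
import torchaudio

CLASSES = ["no defect", "hum", "hiss", "distortion", "clicks"]

class AudioArtifactClassifier(nn.Module):
    """Toy stand-in for the paper's model: log-mel spectrogram of a
    one-second clip -> small CNN -> one of five artifact classes."""
    def __init__(self, sample_rate=16000, n_mels=64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, len(CLASSES)))

    def forward(self, waveform):            # waveform: (batch, samples)
        feats = self.to_db(self.melspec(waveform)).unsqueeze(1)
        return self.net(feats)              # (batch, 5) class logits

clip = torch.randn(1, 16000)                # one second of audio at 16 kHz
logits = AudioArtifactClassifier()(clip)
print(CLASSES[logits.argmax(dim=1).item()])
```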

Video: Example with distorted audio

Audio and video synchronization detection

Another common quality issue is audio that doesn't match the video, known as an AV-sync or lip-sync defect. Problems during broadcast, reception, and playback can push audio and video out of sync.

To detect lip-sync defects, we built a detector based on the University of Oxford's SyncNet architecture, which we call LipSync.

The input to the LipSync pipeline is a four-second video clip. It is passed to a shot detection model, which identifies shot boundaries; a face detection model, which identifies faces in each frame; and a face tracking model, which identifies faces in consecutive frames as belonging to the same person.

Figure: Preprocessing pipeline for extracting face tracks from four-second clips centered on a single face.
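Schematically, and with every helper model treated as a hypothetical placeholder rather than one of Prime Video's actual components, the preprocessing step could be composed like this:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np

@dataclass
class FaceTrack:
    """Crops of the same face across consecutive frames within one shot."""
    face_crops: List[np.ndarray]
    start_frame: int

def extract_face_tracks(
    clip_frames: Sequence[np.ndarray],  # decoded frames of a four-second clip
    detect_shots: Callable,             # frames -> list of (start, end) shot boundaries
    detect_faces: Callable,             # frame -> list of face bounding boxes
    track_faces: Callable,              # shot frames + detections -> list of FaceTrack
) -> List[FaceTrack]:
    """Split the clip into shots, detect faces in every frame of each shot, and
    link detections of the same person across frames into face tracks."""
    tracks: List[FaceTrack] = []
    for start, end in detect_shots(clip_frames):
        shot_frames = list(clip_frames[start:end])
        detections = [detect_faces(frame) for frame in shot_frames]
        tracks.extend(track_faces(shot_frames, detections, start))
    return tracks
```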

The output of the face tracking model (called a face track) and the associated audio are then passed to the SyncNet model, which aggregates over the face track to determine whether the clip is in sync, out of sync, or indeterminate, meaning either that no face or face track was detected or that the number of in-sync and out-of-sync predictions is tied.
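The clip-level aggregation can be illustrated with a short sketch that tallies hypothetical per-face-track predictions:

```python
from collections import Counter

def clip_sync_verdict(track_predictions):
    """Aggregate per-face-track predictions ('in_sync' or 'out_of_sync') into a
    clip-level verdict. The clip is indeterminate when no face track was found
    or when the two vote counts are tied."""
    if not track_predictions:
        return "indeterminate"          # no face / face track detected
    votes = Counter(track_predictions)
    in_sync, out_of_sync = votes["in_sync"], votes["out_of_sync"]
    if in_sync == out_of_sync:
        return "indeterminate"          # tied votes
    return "in_sync" if in_sync > out_of_sync else "out_of_sync"

print(clip_sync_verdict(["in_sync", "in_sync", "out_of_sync"]))  # in_sync
print(clip_sync_verdict([]))                                     # indeterminate
```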

Future work

These are just some of the detectors in our arsenal. In 2022, we will continue to refine and improve our algorithms. In ongoing work, we are using active learning (algorithmically selecting particularly useful training examples) to continually retrain our deployed models.

To generate synthetic datasets, we are working on EditGAN, a new approach that gives us more precise control over the output of generative adversarial networks (GANs). We also use our custom cloud-native application and Amazon SageMaker implementation to scale our defect detectors, monitoring all live events and video channels.

Article by Sathya Balakrishnan & Ihsan Ozcelik

Sathya Balakrishnan is a Software Development Manager for Amazon Prime Video.

Ihsan Ozcelik is a Senior Software Development Engineer at Amazon Prime Video.

