
With its solid accumulation and cutting-edge innovation in the field of face generation, Alibaba Cloud Video Cloud, in cooperation with the Hong Kong University of Science and Technology, has had its latest research result, "Depth-Aware Generative Adversarial Network for Talking Head Video Generation", accepted by CVPR 2022. This article is an interpretation of that work.

Paper title: "Depth-Aware Generative Adversarial Network for Talking Head Video Generation"
arXiv link: https://arxiv.org/abs/2203.06605

Will Face Reenactment Algorithms Bring New Breakthroughs to Video Codecs?

In recent years, with the surge of live video streaming, more and more people have turned their attention to the video cloud field. Low latency and high image quality have always been difficult to balance in video transmission. At present, live-streaming latency can be brought below 400 ms, but with growing demand for video conferencing and similar scenarios, such as remote PPT presentations, the balance between image quality and latency faces even higher requirements. The key to breaking through live-streaming latency lies in better encoding and decoding technology. Combining face reenactment algorithms with codec technology can greatly reduce bandwidth requirements in video conferencing scenarios and deliver a more immersive experience; it is an important step toward ultra-low-latency, high-quality video conferencing.

A face reenactment (talking-head) algorithm uses a video to drive a still image, so that the face in the image imitates the head pose, expressions, and movements of the person in the video, effectively bringing the still image to life.

Figure 1

Development status of face reenactment

Current face reenactment methods rely heavily on 2D representations learned from input images. However, we believe that dense 3D geometric information (e.g., pixel-level depth maps) is very important for face reenactment, as it helps generate more accurate 3D facial structures and distinguish the face from noisy, complex backgrounds. Dense 3D annotation of videos, however, is expensive.

Research Motivation & Innovation

In this paper, we introduce a self-supervised 3D geometry learning method that estimates dense head depth maps from videos without requiring any 3D annotations. We further use the depth maps to assist face keypoint detection for capturing head motion. In addition, the depth maps are used to learn 3D-aware cross-modal attention that guides motion field learning and feature warping.
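To make the self-supervised setup concrete, below is a minimal PyTorch sketch of the general recipe such methods follow: depth predicted for one frame, together with a relative camera pose and intrinsics, reprojects that frame onto another frame of the same video, and a photometric loss supervises the depth network with no 3D labels. The function names, the [R | t] pose form, and the plain L1 loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def reproject(src, depth, pose, K, K_inv):
    """Warp a source frame toward a target view using predicted depth.

    src:   (B, 3, H, W) source frame
    depth: (B, 1, H, W) depth map from the depth network
    pose:  (B, 3, 4)    predicted relative camera pose [R | t]
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse
    """
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float().view(3, -1)
    pix = pix.unsqueeze(0).expand(B, -1, -1)            # (B, 3, H*W)
    cam = (K_inv @ pix) * depth.view(B, 1, -1)          # back-project to 3D
    cam = pose[:, :, :3] @ cam + pose[:, :, 3:]         # move to target view
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)      # perspective divide
    u = 2 * uv[:, 0] / (W - 1) - 1                      # normalise for grid_sample
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

def photometric_loss(src, tgt, depth, pose, K, K_inv):
    # Reconstruction error between the reprojected source and the real
    # target frame is the only supervision signal: no 3D labels needed.
    return (reproject(src, depth, pose, K, K_inv) - tgt).abs().mean()
```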


Figure 2

Figure 2 shows the pipeline of DA-GAN proposed in this paper, which mainly consists of three parts:

(1) A depth estimation network \( F_d \), with which we estimate dense face depth maps in a self-supervised way;

(2) A keypoint detection network \( F_{kp} \), with which we concatenate the 3D geometric features represented by the depth map with the appearance features of the RGB image to predict more accurate face keypoints (see the sketch after this list);

(3) A face synthesis network, which can be further divided into a feature warping module and a cross-modal attention module.
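As an illustration of item (2), here is a hedged PyTorch sketch of a geometry-aware keypoint detector: depth features and RGB appearance features are concatenated channel-wise, and keypoints are read out via a soft-argmax over predicted heatmaps. The layer shapes, the keypoint count, and the heatmap read-out are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DepthAwareKeypointDetector(nn.Module):
    """Illustrative F_kp: fuse depth (geometry) and RGB (appearance) features."""

    def __init__(self, num_kp=15, ch=64):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, ch, 3, padding=1)    # appearance branch
        self.depth_enc = nn.Conv2d(1, ch, 3, padding=1)  # geometry branch
        self.head = nn.Conv2d(2 * ch, num_kp, 3, padding=1)

    def forward(self, rgb, depth):
        # channel-wise concatenation of the two modalities
        feat = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        heat = self.head(feat)                                    # (B, K, H, W)
        B, K, H, W = heat.shape
        prob = heat.view(B, K, -1).softmax(-1).view(B, K, H, W)
        xs = torch.linspace(-1, 1, W).view(1, 1, 1, W)
        ys = torch.linspace(-1, 1, H).view(1, 1, H, 1)
        # soft-argmax: expected (x, y) position under each heatmap
        return torch.stack([(prob * xs).sum((2, 3)),
                            (prob * ys).sum((2, 3))], dim=-1)     # (B, K, 2)
```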

The feature warping module converts the input sparse keypoints into a sparse motion field, then learns a dense motion field and uses it to warp image features, as sketched below.
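A minimal sketch of keypoint-driven warping in this spirit: each source/driving keypoint pair contributes a local translation, a dense field is formed as a spatially weighted combination of those translations, and grid_sample applies it to the features. The Gaussian weighting used here to densify the sparse motion is an assumption; the paper learns the dense motion field with a network.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, kp_src, kp_drv, sigma=0.1):
    """feat: (B, C, H, W); kp_src, kp_drv: (B, K, 2) in [-1, 1] coordinates."""
    B, C, H, W = feat.shape
    gy, gx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([gx, gy], dim=-1)                      # (H, W, 2) identity
    # sparse motion: one translation per driving/source keypoint pair
    motion = (kp_src - kp_drv).view(B, -1, 1, 1, 2)           # (B, K, 1, 1, 2)
    # densify with Gaussian weights centred on the driving keypoints
    d2 = ((grid.view(1, 1, H, W, 2)
           - kp_drv.view(B, -1, 1, 1, 2)) ** 2).sum(-1)       # (B, K, H, W)
    w = torch.softmax(-d2 / (2 * sigma ** 2), dim=1).unsqueeze(-1)
    dense = grid.view(1, H, W, 2) + (w * motion).sum(dim=1)   # (B, H, W, 2)
    return F.grid_sample(feat, dense, align_corners=True)
```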

The cross-modal attention module learns attention maps from the depth features to capture finer motion details and correct the facial structure. The structure of the two modules is shown in Figures 3 and 4, and a sketch of the attention block follows Figure 4.


Figure 3


Figure 4
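Below is a hedged sketch of what a cross-modal attention block of this kind can look like: queries are computed from depth features and keys/values from the (warped) appearance features, so geometry decides where appearance is refined. The single-head form, channel sizes, and residual connection are simplifying assumptions rather than the paper's exact block.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Queries from depth features, keys/values from appearance features."""

    def __init__(self, ch=64):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)   # geometry side
        self.k = nn.Conv2d(ch, ch, 1)   # appearance side
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, depth_feat, app_feat):
        B, C, H, W = app_feat.shape
        q = self.q(depth_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.k(app_feat).flatten(2)                     # (B, C, HW)
        v = self.v(app_feat).flatten(2).transpose(1, 2)     # (B, HW, C)
        attn = torch.softmax(q @ k / C ** 0.5, dim=-1)      # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return out + app_feat                               # residual connection
```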

Experimental results

Quantitative experiments

We conduct experiments on the VoxCeleb1[1] and CelebV[2] datasets.

We use structural similarity (SSIM) and peak signal-to-noise ratio (PSNR) to evaluate the similarity between generated frames and driving frames;

average keypoint distance (AKD) and average Euclidean distance (AED) [3] to evaluate keypoint accuracy, and CSIM [4] to evaluate identity preservation;

and PRMSE to evaluate head pose preservation and AUCON to evaluate expression preservation.
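For the two reconstruction metrics, a minimal evaluation snippet looks like the following, using scikit-image's standard implementations; AKD, AED, CSIM, PRMSE, and AUCON require external keypoint, identity, and pose/action-unit estimators and are omitted here.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_metrics(generated, driving):
    """Both inputs: (H, W, 3) uint8 frames of the same size."""
    psnr = peak_signal_noise_ratio(driving, generated)
    # channel_axis needs scikit-image >= 0.19
    ssim = structural_similarity(driving, generated, channel_axis=-1)
    return psnr, ssim
```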

Quantitative comparison


Table 1


Table 2


Table 3

Tables 1 and 2 show quantitative comparisons between DA-GAN and mainstream face reenactment methods on the VoxCeleb1 dataset, and Table 3 shows the quantitative comparison on the CelebV dataset.

Qualitative comparison

Figure 5 shows a qualitative comparison between DA-GAN and mainstream face reenactment methods. Experiments show that the DA-GAN proposed in this paper outperforms the other algorithms across all metrics and in generation quality.


Figure 5

Ablation study

Figure 6 shows the results of the ablation study. It can be seen that both the self-supervised depth estimation and the cross-modal attention module significantly improve the details and micro-expressions of the synthesized faces.


Figure 6

Research summary

From the above results, it can be seen that the face reenactment algorithm can synthesize finer facial details and micro-expressions. In video conferencing scenarios, with the talking-head method only the keypoint coordinates need to be transmitted during communication, rather than every video frame; the receiving end can then reconstruct each frame from the received keypoints. This greatly reduces the bandwidth requirement and yields a high-quality, low-latency video conferencing experience.
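To see why the savings are large, here is a back-of-the-envelope comparison under illustrative assumptions: 15 keypoints per frame, two float32 coordinates each, 30 fps, against a nominal 1 Mbps compressed video stream.

```python
# Back-of-the-envelope bandwidth comparison for the conferencing use case.
# All numbers below are illustrative assumptions, not measured values.
KP, COORDS, BYTES, FPS = 15, 2, 4, 30

kp_bps = KP * COORDS * BYTES * 8 * FPS          # keypoint stream, bits/s
video_bps = 1_000_000                           # assumed compressed video bitrate

print(f"keypoints: {kp_bps / 1000:.1f} kbps")   # ~28.8 kbps
print(f"video:     {video_bps / 1000:.0f} kbps")
print(f"reduction: ~{video_bps / kp_bps:.0f}x")
```

Even allowing generous overhead for packaging and occasional reference-frame refreshes, the keypoint stream is orders of magnitude lighter than the video it replaces.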

"Video Cloud Technology", your most noteworthy public account of audio and video technology, pushes practical technical articles from the frontline of Alibaba Cloud every week, where you can communicate with first-class engineers in the audio and video field. Reply to [Technology] in the background of the official account, you can join the Alibaba Cloud video cloud product technology exchange group, discuss audio and video technology with industry leaders, and obtain more latest industry information.
