
This article is shared from the HUAWEI CLOUD Community article "Portrait Matting: Algorithm Overview and Engineering Implementation (1)", original author: Du Fu.

This article will explain how to implement a real-time, elegant and accurate video portrait matting project from three aspects: algorithm overview, engineering implementation, and optimization and improvement.

What is matting

For an image I, the portrait region we are interested in is called the foreground F, and the rest is the background B. Image I can then be regarded as a weighted fusion of F and B: I = alpha * F + (1 - alpha) * B, and the matting task is to find the appropriate weight alpha. It is worth noting, as shown in the matting ground truth below, that alpha is a continuous value in [0, 1] and can be understood as the probability that a pixel belongs to the foreground. This is different from portrait segmentation: in the segmentation task alpha can only be 0 or 1, so segmentation is essentially a classification task, while matting is a regression task.
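As a minimal sketch of the compositing equation above (the file names foreground.png, background.png and alpha.png are placeholders, not files from this project):

import cv2
import numpy as np

fg = cv2.imread('foreground.png').astype(np.float32)      # foreground F, HxWx3
bg = cv2.imread('background.png').astype(np.float32)      # background B, HxWx3, same size as F
alpha = cv2.imread('alpha.png', cv2.IMREAD_GRAYSCALE)      # alpha matte in [0, 255]
alpha = alpha.astype(np.float32)[..., None] / 255.0        # -> HxWx1, values in [0, 1]

composite = alpha * fg + (1.0 - alpha) * bg                # I = alpha * F + (1 - alpha) * B
cv2.imwrite('composite.png', composite.astype(np.uint8))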

Figure: matting ground truth.

Figure: segmentation ground truth.

Related work

We mainly focus on representative deep-learning-based matting algorithms. Current popular matting algorithms can be roughly divided into two categories. One is Trimap-based methods, which require prior information; in a broad sense the prior can be a trimap, a rough mask, a clean background image, pose information, etc., and the network predicts alpha jointly from the prior and the image. The other is Trimap-free methods, which predict alpha from the image alone; these are friendlier for practical applications, but their results are generally not as good as those of Trimap-based methods.

Trimap-based

Trimap is the most commonly used prior. As the name suggests, a trimap is a ternary image in which each pixel takes one of the values {0, 128, 255}, representing background, unknown and foreground respectively, as shown in the figure.
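A minimal sketch (our own illustration, not from any particular paper) of how a trimap is commonly derived from a ground-truth alpha matte: sure foreground and background come from eroded binary masks, and everything else is marked unknown (128):

import cv2
import numpy as np

def alpha_to_trimap(alpha, kernel_size=10):
    """alpha: uint8 alpha matte in [0, 255]; returns a trimap with values {0, 128, 255}."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode((alpha == 255).astype(np.uint8), kernel)   # sure foreground
    bg = cv2.erode((alpha == 0).astype(np.uint8), kernel)     # sure background
    trimap = np.full_like(alpha, 128)                         # default: unknown region
    trimap[fg == 1] = 255
    trimap[bg == 1] = 0
    return trimap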

Deep Image Matting

Most matting algorithms use a trimap as prior knowledge. Adobe proposed Deep Image Matting [1], the first end-to-end algorithm for predicting alpha. The model is divided into two parts: the matting encoder-decoder stage and the matting refinement stage. The matting encoder-decoder stage is the first part and produces a rough alpha matte from the input image and the corresponding trimap. The matting refinement stage is a small convolutional network used to improve the accuracy and edge quality of the alpha matte.

This paper reached state-of-the-art at the time, and many subsequent works have followed this "coarse-to-fine" matting idea. In addition, because labeling is expensive, matting data used to be very limited. The paper also built a large dataset, Composition-1K, by compositing finely labeled foregrounds with different backgrounds, yielding 45,500 training images and 1,000 test images, which greatly enriched the data available for the matting task.

Background Matting

Background Matting [2] is a matting algorithm proposed by the University of Washington, followed later by Background Matting V2. The method is innovative and has achieved good results in practical engineering applications.

At the same time, because Adobe's data is synthetic, and in order to adapt better to real inputs, the paper proposes training a network G_Real in a self-supervised manner on unlabeled real data. G_Real takes the same inputs as G_Adobe; the alpha matte and foreground F output by G_Adobe supervise the output of G_Real to give one loss. In addition, the RGB image composited from the output of G_Real is passed through a discriminator that judges its realism, giving a second loss; the two losses jointly train G_Real.

The paper shows some test results captured with mobile phones; the results are quite good in most cases.

Background Matting V2

Background Matting achieved good results, but it cannot run in real time and does not handle high-resolution input well. The team therefore released Background Matting V2 [3], which produces good results on 4K input at 30 fps.

An important idea behind achieving efficient high-resolution matting in this paper is that most pixels in the alpha matte are exactly 0 or 1, and only a small number of regions contain transitional pixels. The network is therefore split into a base network and a refine network: the base network processes a low-resolution image, and the refine network selects specific patches of the original high-resolution image to process based on the base network's results.

The input to the base network is the image and background downsampled by a factor of c; through an encoder-decoder it outputs a coarse alpha matte, F, an error map and hidden features. The error map E_c is upsampled to 1/4 of the original resolution to give E_4, so each pixel of E_4 corresponds to a 4x4 patch of the original image. The top-k highest-error pixels of E_4 are selected, i.e. the k 4x4 patches of the original image with the largest errors, and 8x8 patches around the selected locations are cropped and fed to the refine network. The refine network has two stages: the input first passes through several Conv-BN-ReLU (CBR) blocks to give the first-stage output, which is concatenated with 8x8 patches cropped from the original input and fed to the second stage. Finally, the refined patches are swapped back into the base result to obtain the final alpha matte and F.
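A rough sketch of the patch-selection idea (our own simplification, not the official code; the function name and k value are placeholders):

import torch
import torch.nn.functional as F

def select_refine_patches(err_coarse, image_full, k=1000):
    """err_coarse: 1x1xhxw error map from the base network;
    image_full: 1xCxHxW original-resolution image; returns k 8x8 patches."""
    B, C, H, W = image_full.shape
    # upsample the error map to 1/4 resolution: each pixel covers a 4x4 block of the original
    e4 = F.interpolate(err_coarse, size=(H // 4, W // 4), mode='bilinear', align_corners=False)
    topk = torch.topk(e4.flatten(), k).indices
    ys, xs = topk // (W // 4), topk % (W // 4)
    patches = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        cy, cx = y * 4 + 2, x * 4 + 2              # center of the corresponding 4x4 block
        y0 = min(max(cy - 4, 0), H - 8)            # clamp so the 8x8 crop stays inside the image
        x0 = min(max(cx - 4, 0), W - 8)
        patches.append(image_full[:, :, y0:y0 + 8, x0:x0 + 8])
    return torch.cat(patches, dim=0)               # k x C x 8 x 8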

In addition, the paper released two datasets: the video matting dataset VideoMatte240K and the image matting dataset PhotoMatte13K/85. VideoMatte240K collects 484 high-resolution videos and uses chroma-key software to generate 240,000+ foreground and alpha matte pairs. PhotoMatte13K/85 contains 13,000+ foreground and alpha matte pairs obtained from photos taken under good lighting, processed with software and manual adjustment. These large datasets are another important contribution of the paper.

There are also other works such as Inductive Guided Filter [4] and MG Matting [5], which predict alpha using prior information that is easier to obtain than a trimap, such as a rough mask. MG Matting also released a matting dataset, RealWorldPortrait-636, with 636 accurately annotated portraits, which can be extended with data augmentation methods such as compositing.

Trimap-free

In practical applications it is inconvenient to obtain prior information, so some works move the acquisition of the prior into the network itself.

Semantic Human Matting

Alibaba's Semantic Human Matting [6] also decomposes the matting task. The network is divided into three parts: T-Net classifies each pixel into one of three classes to produce a trimap, which is concatenated with the image to form a six-channel input for M-Net; M-Net produces a rough alpha matte through an encoder-decoder; finally, the outputs of T-Net and M-Net are sent to a Fusion Module to obtain a more accurate alpha matte.

During training, the alpha loss is split into an alpha prediction loss and a compositional loss, similar to DIM; together with the pixel classification loss L_t, the final loss is L = L_p + L_t = L_alpha + L_c + L_t. The paper implements an end-to-end Trimap-free matting algorithm, but the network is rather heavy. The paper also built the Fashion Model dataset, collecting 35,000+ annotated images from e-commerce websites, but it was not released.

MODNet

MODNet [7] argues that a neural network is better at learning a single task, so it decomposes matting into three subtasks, each with explicit supervision, optimized simultaneously. It achieves good results at 63 fps with 512x512 input, so we chose MODNet as the baseline for the subsequent project implementation.

The three subtasks of the network are Semantic Estimation, Detail Prediction and Semantic-Detail Fusion. The Semantic Estimation branch (S) consists of a backbone and a decoder; its output is a 16x downsampled semantic map of the input, providing semantic information, and its ground truth is obtained by downsampling and Gaussian-blurring the labeled alpha. The Detail Prediction branch (D) takes three inputs: the original image, intermediate features of the semantic branch, and the output S_p of the S branch; it is also an encoder-decoder. Its loss deserves attention: since the D branch only cares about detail, a trimap is generated from the ground-truth alpha and the L1 loss between d_p and alpha_g is computed only within the unknown region of that trimap. The F branch fuses the semantic information and the detail prediction to obtain the final alpha matte, whose L1 loss against the ground truth is computed. The total training loss is L = lambda_s * L_s + lambda_d * L_d + lambda_alpha * L_alpha.
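A minimal sketch, under our own assumptions, of the loss structure described above: the unknown-region mask is derived from the ground-truth alpha, and the detail loss is computed only inside that band (the lambda weights below are placeholders, not the paper's exact values):

import torch
import torch.nn.functional as F

def detail_loss(d_p, alpha_g, unknown_mask):
    """d_p, alpha_g, unknown_mask: Nx1xHxW tensors; mask is 1 in the unknown band."""
    diff = torch.abs(d_p - alpha_g) * unknown_mask
    return diff.sum() / (unknown_mask.sum() + 1e-6)

def total_loss(s_p, s_g, d_p, alpha_p, alpha_g, unknown_mask,
               lambda_s=1.0, lambda_d=10.0, lambda_alpha=1.0):
    # L = lambda_s * L_s + lambda_d * L_d + lambda_alpha * L_alpha
    l_s = F.mse_loss(s_p, s_g)                     # semantic branch vs. downsampled, blurred alpha
    l_d = detail_loss(d_p, alpha_g, unknown_mask)  # detail branch, unknown region only
    l_alpha = F.l1_loss(alpha_p, alpha_g)          # fused alpha vs. ground truth
    return lambda_s * l_s + lambda_d * l_d + lambda_alpha * l_alpha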

Finally, the paper proposes a post-processing method, OFD (One-Frame Delay), to make video results smoother over time: when the previous and next frames are similar but the middle frame differs a lot from both, the middle frame is smoothed with the average of its neighbors. This, however, delays the actual output by one frame relative to the input.
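A simple sketch of the OFD idea as we understand it (the per-pixel threshold is a placeholder): pixels where the neighboring frames agree but the middle frame deviates from both are replaced with the neighbors' average. Note that processing frame t requires frame t+1, hence the one-frame delay:

import numpy as np

def ofd_smooth(alpha_prev, alpha_curr, alpha_next, threshold=0.1):
    """alpha_*: float alpha mattes in [0, 1] for frames t-1, t, t+1 (same shape)."""
    neighbors_close = np.abs(alpha_prev - alpha_next) < threshold
    curr_far = ((np.abs(alpha_curr - alpha_prev) > threshold) &
                (np.abs(alpha_curr - alpha_next) > threshold))
    flicker = neighbors_close & curr_far
    smoothed = alpha_curr.copy()
    smoothed[flicker] = ((alpha_prev + alpha_next) / 2.0)[flicker]
    return smoothed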

In addition, networks such as U^2-Net and SIM can perform salient-object matting on images; readers who are interested can look into them.

Datasets

  • Adobe Composition-1K
  • matting_human_datasets
  • VideoMatte240K
  • PhotoMatte85
  • RealWorldPortrait-636

Evaluation metrics

Commonly used objective evaluation metrics come from a 2009 CVPR paper [8], mainly SAD (sum of absolute differences), MSE, the gradient error and the connectivity error.
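As a minimal sketch, the two simplest metrics, SAD and MSE, can be computed between a predicted alpha matte and the ground truth (both scaled to [0, 1]); the gradient and connectivity errors follow the definitions in [8] and are omitted here:

import numpy as np

def sad(alpha_pred, alpha_gt):
    # sum of absolute differences, conventionally reported divided by 1000
    return np.abs(alpha_pred - alpha_gt).sum() / 1000.0

def mse(alpha_pred, alpha_gt):
    # mean squared error over all pixels
    return np.mean((alpha_pred - alpha_gt) ** 2)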

In addition, you can browse related work on the Image Matting task on Papers with Code, and the evaluation results of various algorithms on the Alpha Matting website.

The ultimate goal of this project is to implement real-time video portrait matting and background replacement on the HiLens Kit hardware, developed in HiLens Studio, HiLens's companion online development environment. First, let's compare the baseline with the improved results:

Testing with the MODNet pre-trained model modnet_photographic_portrait_matting.ckpt gives the following results:
image.png

Because of unfamiliar scenes, backlighting and other factors, the matting results flicker. Although MODNet can be finetuned in a self-supervised way on a specific video, our goal is better results in the general case, so we did not perform self-supervised learning on this video.

The results of the optimized model are as follows:
image.png

This video was not used as training data. The flickering of the matte is greatly reduced, and details such as hair are largely preserved.

Project deployment

To evaluate the baseline, we first have to deploy it in the target usage scenario. According to the documentation on importing/converting locally developed models, the model format supported by the Ascend 310 AI processor is ".om". A PyTorch model can be converted via "PyTorch -> Caffe -> om" or "PyTorch -> ONNX -> om" (newer versions); here we chose the first route. The PyTorch -> Caffe conversion method and its pitfalls were explained in detail in a previous blog post, so they are not repeated here. Once the Caffe model is ready, it can be converted to an om model directly in HiLens Studio, which is very convenient.
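For reference, a minimal sketch of the first step of the alternative "PyTorch -> ONNX -> om" route might look like the following (the import path, checkpoint name, output index and 1x3x384x384 input shape are assumptions about the MODNet repository, not something we verified here):

import torch
from src.models.modnet import MODNet   # module path in the official MODNet repo (assumption)

# If MODNet.forward takes extra arguments (e.g. an inference flag), wrap it so that
# export only sees a single image tensor.
class ExportWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
    def forward(self, image):
        return self.model(image, True)[2]   # assume the matte is the last output

modnet = MODNet(backbone_pretrained=False)
state = torch.load('modnet_photographic_portrait_matting.ckpt', map_location='cpu')
modnet.load_state_dict({k.replace('module.', ''): v for k, v in state.items()})
modnet.eval()

dummy = torch.randn(1, 3, 384, 384)   # NCHW shape matching the deployed input size (assumption)
torch.onnx.export(ExportWrapper(modnet), dummy, 'modnet_portrait_320.onnx',
                  input_names=['image'], output_names=['matte'], opset_version=11)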

First, create a new skill in HiLens Studio. An empty template is selected here. You only need to modify the skill name.
image.png

Upload the Caffe model to the model folder:
image.png

Run the model conversion command in the console to obtain a runnable om model:

/opt/ddk/bin/aarch64-linux-gcc7.3.0/omg --model=./modnet_portrait_320.prototxt --weight=./modnet_portrait_320.caffemodel --framework=0 --output=./modnet_portrait_320 --insert_op_conf=./aipp.cfg

image.png

Next, write the demo code. During testing, HiLens Studio can either simulate camera input with a video via the toolbar, or use a connected mobile phone for testing:
image.png

The specific demo code is as follows:

# -*- coding: utf-8 -*-
# !/usr/bin/python3
# HiLens Framework 0.2.2 python demo

import cv2
import os
import hilens
import numpy as np
from utils import preprocess
import time


def run(work_path):
    hilens.init("hello")  # must match the verification value used when creating the skill

    camera = hilens.VideoCapture('test/camera0_2.mp4')  # path of the video simulating camera input
    display = hilens.Display(hilens.HDMI)

    # initialize the model
    model_path = os.path.join(work_path, 'model/modnet_portrait_320.om')  # model path
    model = hilens.Model(model_path)

    while True:
        try:
            input_yuv = camera.read()
            input_rgb = cv2.cvtColor(input_yuv, cv2.COLOR_YUV2RGB_NV21)
            # background image used for replacement (must match the 640x640 crop size)
            bg_img = cv2.cvtColor(cv2.imread('data/tiantan.jpg'), cv2.COLOR_BGR2RGB)
            crop_img, input_img = preprocess(input_rgb)  # preprocessing
            s = time.time()
            matte_tensor = model.infer([input_img.flatten()])[0]
            print('infer time:', time.time() - s)
            matte_tensor = matte_tensor.reshape(1, 1, 384, 384)

            alpha_t = matte_tensor[0].transpose(1, 2, 0)
            matte_np = cv2.resize(np.tile(alpha_t, (1, 1, 3)), (640, 640))
            fg_np = matte_np * crop_img + (1 - matte_np) * bg_img  # replace the background
            view_np = np.uint8(np.concatenate((crop_img, fg_np), axis=1))
            print('all time:', time.time() - s)

            output_nv21 = hilens.cvt_color(view_np, hilens.RGB2YUV_NV21)
            display.show(output_nv21)

        except Exception as e:
            print(e)
            break

    hilens.terminate()

The preprocessing part of the code is:

import cv2
import numpy as np


TARGET_SIZE = 640
MODEL_SIZE = 384


def preprocess(ori_img):
    ori_img = cv2.flip(ori_img, 1)  # mirror the image horizontally
    H, W, C = ori_img.shape
    # center-crop the largest square, then resize for display (640) and for the model (384)
    x_start = max((W - min(H, W)) // 2, 0)
    y_start = max((H - min(H, W)) // 2, 0)
    crop_img = ori_img[y_start: y_start + min(H, W), x_start: x_start + min(H, W)]
    crop_img = cv2.resize(crop_img, (TARGET_SIZE, TARGET_SIZE))
    input_img = cv2.resize(crop_img, (MODEL_SIZE, MODEL_SIZE))

    return crop_img, input_img

The demo code is very simple; click Run to see the effect in the simulator:
image.png

The model inference takes about 44ms, and the end-to-end operation takes about 60ms, which achieves the real-time effect we want.

Effect improvement

The pre-trained model suffers from temporal flickering in practice. The original paper proposed the post-processing method OFD to smooth the video output over time, i.e. a middle frame that deviates strongly from two similar neighboring frames is replaced with their average. However, this method only suits slow motion and introduces a one-frame delay, while we want real-time, general-purpose temporal processing of camera input, so OFD does not fit our application scenario.

In the video object segmentation task there are memory-network-based methods such as STM, and there are also new matting papers such as DVM; one could consider introducing temporal memory units to make the matting results more stable over time. However, these methods generally require information from several frames before and after the current one, and in terms of resource usage, real-time inference and applicable scenarios they do not match what we need.

Balancing resource consumption against quality, we concatenate the alpha result of the previous frame with the RGB image of the current frame as the network input, which makes the network more stable over time.

The change to the network is very simple: just specify in_channels=4 when the model is initialized:

modnet = MODNet(in_channels=4, backbone_pretrained=False)
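At inference time, a minimal sketch of the feedback loop might look like the following (frames and infer_fn are placeholders for the real video source and the .om inference call):

import numpy as np

def run_video(frames, infer_fn):
    """frames: iterable of HxWx3 RGB uint8 arrays; infer_fn: callable mapping an HxWx4
    float32 input to an HxW alpha matte in [0, 1]. Both are placeholders."""
    prev_alpha = None
    for frame in frames:
        if prev_alpha is None:                      # first frame: no history yet, use zeros
            prev_alpha = np.zeros(frame.shape[:2], dtype=np.float32)
        net_input = np.concatenate([frame.astype(np.float32) / 255.0,
                                    prev_alpha[..., None]], axis=-1)  # HxWx4
        alpha = infer_fn(net_input)
        prev_alpha = alpha                          # feed the result back for the next frame
        yield alpha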

For training data, we chose video matting datasets: VideoMatte240K and ConferenceVideoSegmentationDataset.

Initially, we trained the model with a simple strategy: use the previous frame's alpha as an extra input channel, and use zeros when the previous frame is missing:

# inside the dataset: load the previous frame's alpha if it exists, otherwise use zeros
if os.path.exists(os.path.join(self.alpha_path, alpha_pre_path)):
    alpha_pre = cv2.imread(os.path.join(self.alpha_path, alpha_pre_path))
else:
    alpha_pre = np.zeros_like(alpha)

net_input = torch.cat([image, alpha_pre], dim=0)  # concatenate RGB and previous alpha along the channel dimension

After the model converged and was deployed, we found that it improves results a lot when the scene is relatively stable, but adapts poorly when people enter or leave the frame. Also, if one frame gives a poor result, it strongly affects subsequent frames. To address these problems, we designed corresponding data augmentation strategies.

  • Poor adaptation when people enter or leave the frame: the dataset contains few empty frames, so the network does not learn enough about people entering and leaving. We therefore increase the probability of empty previous frames during data loading:
# even if the previous frame exists, use it only with probability 0.7;
# otherwise feed an all-zero previous alpha
if os.path.exists(os.path.join(self.alpha_path, alpha_pre_path)) and random.random() < 0.7:
    alpha_pre = cv2.imread(os.path.join(self.alpha_path, alpha_pre_path))
else:
    alpha_pre = np.zeros_like(alpha)
  • A poor result in one frame strongly affects subsequent frames: the network relies too much on the previous frame's alpha and has not learned to discard wrong results. We therefore apply a random affine transformation to alpha_pre with a certain probability during data loading, so that the network learns to ignore previous results with large deviations (see the sketch after this list);
  • In addition, lighting remains a problem: matting quality drops in backlit or strongly lit scenes. We therefore add lighting augmentation; specifically, with a certain probability a simulated point or line light source is superimposed on the original image so that the network becomes more robust to illumination. There are two common ways to augment lighting: simple simulation with OpenCV (see augmentation.py for details) or generating data with a GAN; we use the OpenCV simulation.
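As mentioned in the second item above, a minimal sketch of perturbing the previous-frame alpha (the probability and magnitude parameters are illustrative, not our exact training settings):

import cv2
import numpy as np
import random

def perturb_alpha_pre(alpha_pre, prob=0.3, max_shift=0.05, max_angle=5):
    """With probability prob, apply a small random rotation and translation to alpha_pre
    so the network learns not to trust a misaligned or wrong previous result."""
    if random.random() > prob:
        return alpha_pre
    h, w = alpha_pre.shape[:2]
    angle = random.uniform(-max_angle, max_angle)
    tx = random.uniform(-max_shift, max_shift) * w
    ty = random.uniform(-max_shift, max_shift) * h
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    m[:, 2] += (tx, ty)                      # add the random translation
    return cv2.warpAffine(alpha_pre, m, (w, h))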

After retraining, our model reaches the quality shown earlier in this article, running in real time on the HiLens Kit with 16 TOPS of compute. Going further, we would like the model to be both faster and more accurate. The current directions for improvement are:

  • Replacing the backbone: choosing a backbone suited to the target hardware has always been the most cost-effective way to improve a model; ideally one would search for a model under the hardware's latency and resource constraints. The candidate model found so far was converted to ONNX and tested (192x192 input) with the following results:
GPU:
Average Performance excluding first iteration. Iterations 2 to 300. (Iterations greater than 1 only bind and evaluate)
  Average Bind: 0.124713 ms
  Average Evaluate: 16.0683 ms

  Average Working Set Memory usage (bind): 6.53219e-05 MB
  Average Working Set Memory usage (evaluate): 0.546117 MB

  Average Dedicated Memory usage (bind): 0 MB
  Average Dedicated Memory usage (evaluate): 0 MB

  Average Shared Memory usage (bind): 0 MB
  Average Shared Memory usage (evaluate): 0.000483382 MB

CPU:
Average Performance excluding first iteration. Iterations 2 to 300. (Iterations greater than 1 only bind and evaluate)
  Average Bind: 0.150212 ms
  Average Evaluate: 13.7656 ms

  Average Working Set Memory usage (bind): 9.14507e-05 MB
  Average Working Set Memory usage (evaluate): 0.566746 MB

  Average Dedicated Memory usage (bind): 0 MB
  Average Dedicated Memory usage (evaluate): 0 MB

  Average Shared Memory usage (bind): 0 MB
  Average Shared Memory usage (evaluate): 0 MB
  • Model branching: in our observations, most stable scenes can be handled well by a smaller model, so we are considering finetuning the LRBranch for simple scenes while keeping the HRBranch and FusionBranch for complex scenes. This work is still in progress.

  1. Xu, Ning, et al. "Deep Image Matting." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  2. Sengupta, Soumyadip, et al. "Background Matting: The World Is Your Green Screen." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
  3. Lin, Shanchuan, et al. "Real-Time High-Resolution Background Matting." arXiv preprint arXiv:2012.07810 (2020).
  4. Li, Yaoyi, et al. "Inductive Guided Filter: Real-Time Deep Matting with Weakly Annotated Masks on Mobile Devices." 2020 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2020.
  5. Yu, Qihang, et al. "Mask Guided Matting via Progressive Refinement Network." arXiv e-prints (2020).
  6. Chen, Quan, et al. "Semantic Human Matting." Proceedings of the 26th ACM International Conference on Multimedia. 2018.
  7. Ke, Zhanghan, et al. "Is a Green Screen Really Necessary for Real-Time Human Matting?" arXiv preprint arXiv:2011.11961 (2020).
  8. Rhemann, Christoph, et al. "A Perceptually Motivated Online Benchmark for Image Matting." 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
