Author | Yichuan
Reviewer | Taiyi
Virtual backgrounds rely on portrait segmentation: the person in the image is segmented out and composited onto a replacement background. By application scenario, virtual backgrounds can be roughly divided into three categories:
Live streaming: used for atmosphere creation, such as educational live streams and online annual meetings;
Real-time communication: used to protect user privacy, such as video conferencing;
Interactive entertainment: used to add fun, such as video editing and Douyin (TikTok) character effects.
What technologies are needed to realize the virtual background?
Real-time semantic segmentation
Semantic segmentation aims to predict a class label for every pixel of an image and is widely used in fields such as autonomous driving and scene understanding. With the development of the mobile internet, 5G, and related technologies, performing high-resolution real-time semantic segmentation on devices with limited computing power has become an increasingly urgent need. The figure above lists real-time semantic segmentation methods from recent years; this section introduces several of them.
BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
Earlier real-time semantic segmentation algorithms met real-time requirements by limiting the input size, reducing the number of network channels, or dropping deep network stages. However, because they discard too much spatial detail or sacrifice model capacity, their segmentation accuracy drops considerably. The authors therefore propose the bilateral segmentation network BiSeNet (ECCV 2018), whose structure is shown in the figure above. The network consists of a Spatial Path and a Context Path, which address the loss of spatial information and the shrinkage of the receptive field, respectively.
The spatial path uses a shallow network with wide channels to produce high-resolution features and retain rich spatial information, while the context path uses a lightweight backbone with narrow channels and greater depth, extracting semantic information through fast downsampling and global average pooling. Finally, a feature fusion module (FFM) fuses the features of the two paths, striking a balance between accuracy and speed. The method reaches 68.4% mIoU on the Cityscapes test set.
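The sketch below illustrates the fusion idea in PyTorch: concatenate the two paths, project them with a conv block, then re-weight channels with a global-pooling attention vector. The channel sizes and exact layer choices are illustrative assumptions, not values taken from the released code.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """A minimal sketch of an FFM-style fusion block (sizes are assumptions)."""
    def __init__(self, spatial_ch, context_ch, out_ch):
        super().__init__()
        self.project = nn.Sequential(
            nn.Conv2d(spatial_ch + context_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global average pooling
            nn.Conv2d(out_ch, out_ch, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feat, context_feat):
        x = self.project(torch.cat([spatial_feat, context_feat], dim=1))
        w = self.attention(x)                 # per-channel attention weights
        return x + x * w                      # attention-refined fusion
```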
The upgraded BiSeNetV2 continues the idea of V1; its structure is shown in the figure above. V2 removes the time-consuming skip connections of the V1 spatial path, adds a bilateral guided aggregation layer to strengthen information aggregation between the two branches, and proposes an enhanced (booster) training strategy to further improve segmentation quality. Its mIoU on the Cityscapes test set rises to 72.6%, and it reaches 156 FPS with TensorRT on a 1080Ti.
DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
DFANet (CVPR 2019) designs two feature aggregation strategies, sub-network aggregation and sub-stage aggregation, to improve real-time semantic segmentation. As shown in the figure above, the network consists of three parts: a lightweight backbone, a sub-network aggregation module, and a sub-stage aggregation module. The backbone is a fast Xception variant with a fully connected attention module added at the top to enlarge the receptive field of high-level features. Sub-network aggregation reuses the high-level features extracted by the previous backbone, upsampling them as the input of the next sub-network, which enlarges the receptive field while refining the predictions. Sub-stage aggregation fuses features from the corresponding stages of different sub-networks to combine multi-scale structural details and strengthen feature discrimination. Finally, a lightweight decoder fuses the outputs of the different stages to produce coarse-to-fine segmentation results. On the Cityscapes test set, DFANet reaches 71.3% mIoU at 100 FPS.
Semantic Flow for Fast and Accurate Scene Parsing
Inspired by optical flow, the authors observe that the relationship between any two feature maps of different resolutions generated from the same image can likewise be expressed as a per-pixel flow, and propose SFNet (ECCV 2020); the network structure is shown in the figure above.
The authors therefore propose a Flow Alignment Module (FAM) that learns the semantic flow between features of adjacent stages and warps the features carrying high-level semantics onto the high-resolution features, so that the rich semantics of deep features are efficiently propagated to the shallow features, which then contain both rich semantics and spatial information. The FAM module is inserted seamlessly into an FPN-style network to fuse the features of adjacent stages, as shown in the figure above. Running in real time (26 FPS), SFNet achieves 80.4% mIoU on Cityscapes.
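A rough sketch of the flow-alignment idea is shown below: predict a per-pixel flow between adjacent-stage features and warp the coarse, semantically rich feature up to the fine resolution with `grid_sample`. The layer sizes, the flow normalization, and the assumption that both stages share the same channel count are simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlignmentModule(nn.Module):
    """Sketch of a FAM-style block: warp coarse semantics onto fine features."""
    def __init__(self, channels):
        super().__init__()
        # predicts a 2-channel (dx, dy) flow field from the concatenated features
        self.flow_conv = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, fine_feat, coarse_feat):
        n, c, h, w = fine_feat.shape
        coarse_up = F.interpolate(coarse_feat, size=(h, w),
                                  mode="bilinear", align_corners=False)
        flow = self.flow_conv(torch.cat([fine_feat, coarse_up], dim=1))

        # build a sampling grid in [-1, 1] and offset it by the predicted flow
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=flow.device),
                                torch.linspace(-1, 1, w, device=flow.device),
                                indexing="ij")
        base_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        offset = flow.permute(0, 2, 3, 1) / torch.tensor([w, h], device=flow.device)
        warped = F.grid_sample(coarse_up, base_grid + offset,
                               mode="bilinear", align_corners=False)
        return fine_feat + warped   # fuse aligned semantics into the fine feature
```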
Portrait segmentation
Portrait segmentation is a sub-task of semantic segmentation: the goal is to separate the person in an image from the background, i.e. a two-class segmentation problem. Compared with general semantic segmentation, portrait segmentation is relatively simple and is typically deployed on edge devices such as mobile phones. Current research goals fall roughly into two categories: one is to design lightweight and efficient portrait segmentation models through better network design, and the other is to enhance the detail quality of portrait segmentation.
Boundary-sensitive Network for Portrait Segmentation
BSN (FG 2019) focuses on improving segmentation quality at portrait edges. This is achieved mainly through two kinds of edge supervision: an individual kernel computed for each portrait's edge and a global kernel averaged over the whole portrait dataset. The individual kernel is similar to earlier methods in that the edge label is obtained by dilating and eroding the portrait mask; the difference is that the edge is treated as a third class alongside foreground and background and represented with soft labels, turning portrait segmentation into a three-class segmentation problem, as shown in the figure above. The global kernel label is obtained by averaging the portrait edges over the dataset and gives the network a prior on where portraits roughly appear. To provide additional edge priors, the authors also add a binary classification branch that distinguishes long-hair from short-hair edges and train it jointly with the segmentation network. BSN achieves 96.7% mIoU on the EG1800 portrait segmentation test set, but it has no advantage in speed.
PortraitNet: Real-time Portrait Segmentation Network for Mobile Device
PortraitNet (Computers & Graphics 2019) designs a lightweight U-Net structure based on depthwise separable convolutions, as shown in the figure above. To sharpen segmentation details at portrait edges, the method generates edge labels by dilating and eroding the portrait masks and uses them to compute a boundary loss. To improve robustness to lighting, the method also proposes a consistency constraint loss, shown in the figure below: the segmentation results before and after an appearance transformation of the image are constrained to be consistent, which makes the model more robust. The PortraitNet model has 2.1M parameters and reaches 96.6% mIoU on the EG1800 portrait segmentation test set.
SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder
SINet (WACV 2020) focuses on speeding up portrait segmentation. It consists of an encoder with spatial squeeze modules and a decoder with an information blocking scheme; the network framework is shown in the figure above. Built on ShuffleNetV2 blocks, the spatial squeeze module (shown in the figure below) applies pooling at different scales on different paths to compress the spatial resolution of the features, extracting features with different receptive fields to handle portraits of different scales while reducing latency. The information blocking scheme uses the confidence of the portrait prediction from the deep low-resolution features: when fusing the shallow high-resolution features, high-confidence regions are blocked and only the shallow features of low-confidence regions are fused, which avoids introducing irrelevant noise. SINet achieves 95.3% mIoU on the EG1800 portrait segmentation test set with only 86.9K parameters, a 95.9% reduction compared with PortraitNet.
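The sketch below illustrates the information-blocking idea in a simplified form: use the confidence of the coarse (deep) prediction to gate how much of the shallow, high-resolution feature is mixed back in. The confidence definition, feature shapes, and the assumption that both features share the same channel count are simplifications, not the paper's exact decoder.

```python
import torch
import torch.nn.functional as F

def information_blocking_fuse(deep_logits, deep_feat, shallow_feat):
    """Gate shallow features by (1 - confidence) of the deep prediction."""
    # confidence = probability of the predicted class at each pixel
    prob = torch.softmax(deep_logits, dim=1)
    confidence, _ = prob.max(dim=1, keepdim=True)          # (N, 1, h, w)

    # upsample deep feature/confidence to the shallow resolution
    size = shallow_feat.shape[-2:]
    deep_up = F.interpolate(deep_feat, size=size, mode="bilinear", align_corners=False)
    conf_up = F.interpolate(confidence, size=size, mode="bilinear", align_corners=False)

    # block shallow information where the deep prediction is already confident
    return deep_up + (1.0 - conf_up) * shallow_feat
```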
Portrait matting
An image can be viewed as consisting of two parts, the foreground and the background. Image matting distinguishes the foreground from the background of a given image, as shown in the figure above. Because matting is an under-constrained problem, traditional matting algorithms and early deep-learning methods mainly add constraints through extra semantic input, most commonly a trimap composed of foreground, background, and unknown regions, as shown below. The accuracy of such algorithms depends heavily on the quality of the trimap: when the trimap is poor, the predictions of trimap-based methods degrade severely. Trimaps are obtained either from other algorithms (such as semantic segmentation) or from manual annotation, but trimaps produced by other algorithms are generally coarse, and annotating accurate trimaps by hand is time-consuming and laborious. Trimap-free matting algorithms have therefore gradually attracted attention; this section mainly introduces recent trimap-free portrait matting algorithms.
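The under-constrained nature of matting follows directly from the standard compositing model, reproduced below for reference: each observed pixel provides 3 equations (one per color channel) but has 7 unknowns (3 for the foreground color, 3 for the background color, 1 for alpha), hence the need for a trimap or other priors.

```latex
I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1]
```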
Background Matting: The World is Your Green Screen
Background Matting (CVPR 2020) tries to improve portrait matting by introducing static background information: compared with carefully hand-annotated trimaps, capturing the static background behind the foreground is relatively easy. The workflow is shown in the figure above: first capture two pictures of the scene, one with the foreground and one without; then use a deep network to predict the alpha matte; and finally composite the foreground onto a new background. This method therefore mainly targets matting in scenes with a static background and only slight camera shake.
The model structure is shown in the figure above. The processing flow is roughly as follows: given the input image and the background image, first obtain a coarse segmentation of the foreground with a segmentation model (such as DeepLab); then use a Context Switching Block to select among the different combinations of input information; and finally feed the result to a decoder that predicts the foreground and the alpha matte simultaneously. Training proceeds in two stages. The first stage trains on Adobe's synthetic dataset. To reduce the domain gap between synthetic and real images, the second stage performs adversarial training with LS-GAN on unlabeled real images and background images, narrowing the distance between images composited with the predicted alpha matte and real images to improve matting quality. The method does not work well when the background changes substantially or differs greatly from the foreground.
Boosting Semantic Human Matting with Coarse Annotations
This paper (CVPR 2020) argues that the factors limiting portrait matting come from two aspects: one is the accuracy of the trimap, and the other is that obtaining accurate portrait annotations is costly and inefficient, so portrait matting datasets contain relatively few images. The paper therefore proposes a method that needs only a small amount of finely annotated data ((b) above) together with a large amount of coarsely annotated data ((a) above) to improve portrait matting.
The network structure is shown in the figure above and consists of three modules: MPN, a coarse mask prediction network; QUN, a mask quality unification network; and MRN, a matting refinement network, which together optimize the segmentation result from coarse to fine. During training, the coarsely and finely annotated data are first used together to train MPN to obtain coarse masks, and then the finely annotated data are used to train MRN to refine the segmentation results. However, the authors found that, because the coarse and fine annotations differ, the inputs MRN receives from the two sources have a large gap, which hurts performance. The authors therefore propose QUN to unify the quality of the coarse mask predictions.
The experimental results are shown in the figure above. Compared with training only on finely annotated data, adding coarsely annotated data helps the network extract semantic information. Moreover, by combining QUN and MRN, the coarse annotations of existing datasets can be refined, reducing the cost of obtaining finely annotated data.
Is a Green Screen Really Necessary for Real-Time Human Matting?
Existing portrait matting algorithms either require extra inputs (such as a trimap or a background image) or use multiple models; the cost of obtaining the extra inputs and the computational cost of multiple models both keep them from running in real time. The authors therefore propose MODNet, a lightweight portrait matting algorithm that achieves real-time matting from a single input image, processing 512x512 images at 63 FPS on a 1080Ti.
As shown in the figure above, the method consists of three parts: first, the matting network is trained on labeled data with multi-task supervised learning; then the SOC self-supervised strategy is used to fine-tune on unlabeled real data and improve generalization; finally, the OFD strategy is applied at inference time to improve the temporal smoothness of the predictions. The three parts are described in detail below.
The network structure of MODNet is shown in the figure above. Inspired by trimap-based methods, the authors decompose trimap-free portrait matting into three related sub-tasks trained jointly: semantic estimation, detail prediction, and semantic-detail fusion. The low-resolution semantic branch captures the body of the portrait, the high-resolution detail branch extracts the portrait edges, and the fusion branch combines semantics and details to produce the final matte.
When the model is applied to data from a new scene, the outputs of the three branches may disagree. The authors therefore propose the SOC self-supervised strategy on unlabeled data: the semantic part of the fusion branch's prediction is constrained to be consistent with the semantic branch, and its detail part with the detail branch. Enforcing consistency between the predictions of the different sub-tasks improves the model's generalization ability.
Predicting each video frame independently makes the predictions of adjacent frames temporally inconsistent, producing flicker between frames. The authors observe that flickering pixels can often be corrected using the predictions of the neighboring frames, as shown in the figure above. They therefore propose OFD: when the predictions of the previous and next frames differ by less than a threshold, while the current frame's prediction differs from both neighbors by more than the threshold, the average of the two neighbors is used as the current frame's prediction, suppressing the inter-frame flicker caused by temporally inconsistent predictions.
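The rule can be written compactly as below: a per-pixel check over three consecutive alpha predictions, applied with a one-frame delay. The threshold value is an illustrative assumption.

```python
import numpy as np

def ofd_smooth(prev_alpha, cur_alpha, next_alpha, eps=0.1):
    """Replace flickering pixels of the current frame by the neighbor average."""
    neighbors_agree = np.abs(prev_alpha - next_alpha) < eps
    cur_deviates = (np.abs(cur_alpha - prev_alpha) > eps) & \
                   (np.abs(cur_alpha - next_alpha) > eps)
    flicker = neighbors_agree & cur_deviates
    smoothed = cur_alpha.copy()
    smoothed[flicker] = 0.5 * (prev_alpha[flicker] + next_alpha[flicker])
    return smoothed
```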
Real-Time High-Resolution Background Matting
Although existing portrait matting algorithms can produce fine mattes, they cannot process high-resolution images in real time; for example, Background Matting handles only about 8 frames per second at 512x512 on a 2080Ti GPU, which cannot meet real-time requirements. The authors observe that in a high-resolution image only a small fraction of the area needs fine matting (as shown in the figure above), while most regions only need coarse segmentation; if the network further refines only those small regions, a large amount of computation can be saved. Borrowing the idea of PointRend and building on Background Matting, the authors propose a two-stage portrait matting network for real-time processing of high-resolution images, reaching 60 FPS for HD images (1920x1080) and 30 FPS for 4K images (3840x2160) on a 2080Ti.
The proposed two-stage framework is shown in the figure above and consists of a base network and a refinement network. The first-stage base network adopts an encoder-decoder structure similar to DeepLabV3+ and produces a coarse alpha matte, a foreground residual, an error prediction map, and hidden features carrying global semantics. The second-stage refinement network uses the error map from the first stage to select the top-k image patches that most need refinement, refines only those patches, and finally merges the refined patches with the directly upsampled coarse result to obtain the final matte. Compared with other methods, this approach improves substantially in both speed and model size, as shown in the figure below.
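A simplified sketch of the patch-selection step follows: average the per-pixel error within each patch, then pick the k patches with the largest predicted error for high-resolution refinement. The patch size and k are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_refinement_patches(error_map, patch_size=16, k=1000):
    """error_map: (N, 1, H, W) predicted error; returns top-left pixel coords of top-k patches."""
    # average the per-pixel error within each patch_size x patch_size block
    patch_error = F.avg_pool2d(error_map, kernel_size=patch_size)   # (N, 1, H/ps, W/ps)
    n, _, gh, gw = patch_error.shape
    flat = patch_error.view(n, -1)
    k = min(k, flat.shape[1])
    _, idx = flat.topk(k, dim=1)                       # k most erroneous patches per image
    rows = torch.div(idx, gw, rounding_mode="floor")   # patch row in the coarse grid
    cols = idx % gw                                    # patch column in the coarse grid
    return rows * patch_size, cols * patch_size
```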
Video portrait segmentation
Video Object Segmentation (VOS) aims to obtain a pixel-level segmentation of the object of interest in every frame of a video. Compared with single-frame image segmentation, video segmentation algorithms mainly exploit the continuity across frames to achieve smooth and accurate results. VOS tasks are currently divided into two types: semi-supervised (one-shot) segmentation, which takes the original video and the segmentation of its first frame as input at run time, and unsupervised (zero-shot) segmentation, which takes only the original video. Existing semi-supervised VOS algorithms struggle to be both accurate and real-time, and research generally focuses on one of the two. The effect of an existing VOS algorithm [12] is shown in the figure below.
Application of Virtual Background Technology in Video Conference
Video conferencing is a mid-to-high-frequency scenario in daily office work, and with the rise of working from home it places higher demands on protecting user privacy, which is where the virtual background feature comes in. Unlike high-performance servers in the cloud, the devices running video conferencing in personal scenarios are mainly laptops of all kinds, whose performance varies widely; together with the strong real-time requirements of video conferencing and the variety of meeting backgrounds, this places stricter requirements on the performance of the algorithm.
The real-time requirement means the on-device portrait segmentation model must be sufficiently lightweight, but a small model handles difficult cases poorly (such as the boundary between the portrait and the background) and is more sensitive to data, which easily leads to background regions being misclassified as portrait, blurry portrait edges, and similar problems. To address these problems we made targeted adjustments and optimizations in both the algorithm and the data engineering.
Algorithm exploration
1) Edge optimization
The first way to optimize edges is to construct an edge loss. Following MODNet, we dilate and erode the portrait label to obtain a label for the portrait edge region; computing the loss on that region strengthens the network's ability to capture edge structure. A minimal sketch is given below.
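The sketch shows one way to build the edge band with OpenCV and restrict the loss to it; the kernel size and the choice of cross-entropy are illustrative assumptions, not our exact settings.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def make_edge_mask(mask, kernel_size=15):
    """mask: HxW uint8 array with portrait pixels = 1; returns a float edge band."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(mask, kernel)
    eroded = cv2.erode(mask, kernel)
    return (dilated - eroded).astype(np.float32)   # 1 inside the edge band, 0 elsewhere

def edge_loss(logits, target, edge_mask):
    """Cross-entropy restricted to the edge band (edge_mask: N x H x W tensor)."""
    per_pixel = F.cross_entropy(logits, target, reduction="none")
    return (per_pixel * edge_mask).sum() / edge_mask.sum().clamp(min=1.0)
```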
The second way is to use an OHEM loss. Compared with the body of the portrait, the edge region is much more prone to misclassification; mining hard examples online from the segmentation predictions during training implicitly optimizes the portrait edge region.
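Below is a common OHEM formulation, sketched as one way to realize this idea: keep only the pixels whose loss exceeds a threshold (or the hardest top-k) when averaging, so hard regions such as portrait edges dominate the gradient. The threshold and `min_kept` values are illustrative, not our production settings.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, thresh=0.7, min_kept=100000):
    """Average the loss only over hard pixels (prob of true class below thresh)."""
    per_pixel = F.cross_entropy(logits, target, reduction="none").view(-1)
    loss_thresh = -torch.log(torch.tensor(thresh, device=logits.device))
    hard = per_pixel[per_pixel > loss_thresh]
    if hard.numel() < min_kept:               # always keep at least min_kept pixels
        hard, _ = per_pixel.topk(min(min_kept, per_pixel.numel()))
    return hard.mean()
```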
2) Unsupervised learning
The first unsupervised method is realized through data augmentation. Following PortraitNet, a given input image is transformed with appearance augmentations composed of color jitter, Gaussian blur, and noise; although the transformed image looks different, its foreground mask is unchanged. A KL loss therefore constrains the predictions on the images before and after augmentation to stay consistent, which improves the network's robustness to changes in lighting, blur, and other external conditions.
The second unsupervised method uses unlabeled real images and background images for adversarial training, following Background Matting. During model training an additional discriminator network is introduced to judge whether its input is a real image or an image synthesized from the predicted portrait foreground and a random background, which reduces artifacts in the portrait predictions.
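A sketch of this consistency constraint follows: run the network on the original image and on an appearance-augmented copy, and pull the two predicted distributions together with a KL term. The temperature and the choice to detach the original-image prediction are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def consistency_kl_loss(logits_orig, logits_aug, temperature=1.0):
    """KL divergence between predictions on the original and augmented images."""
    # treat the prediction on the original image as the (detached) target
    target = F.softmax(logits_orig.detach() / temperature, dim=1)
    log_pred = F.log_softmax(logits_aug / temperature, dim=1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```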
3) Multi-task learning
Multi-task learning usually means adding sub-tasks related to the original task for joint training to improve the network on the original task, such as the detection and segmentation tasks in Mask R-CNN. One difficulty of portrait segmentation is that when the person in the video makes certain movements (such as waving a hand), segmentation of the arms and other body parts degrades. To capture human body information better, we introduce human pose information into training as an auxiliary task: following Pose2Seg, analyzing the portrait pose helps capture body movement. At test time only the trained portrait segmentation branch is used for inference, which improves segmentation accuracy without hurting runtime performance.
4) Knowledge distillation
Knowledge distillation is widely used in model compression and transfer learning, usually with a teacher-student strategy. A strong teacher model (such as DeepLabV3+) is trained first; when training the student model, the soft labels produced by the teacher are used as supervision to guide the student. Compared with the original one-hot labels, the teacher's soft labels carry knowledge about the similarity between classes, which makes the student model easier to train and converge.
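A standard distillation loss in this spirit is sketched below: soften both teacher and student logits with a temperature T, and combine the KL term with the usual cross-entropy on ground-truth labels. The values of T and the mixing weight alpha are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.5):
    """Blend soft-label KL distillation with hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, target)
    return alpha * kd + (1.0 - alpha) * ce
```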
5) Lightweight model
To meet the needs of the business scenario, we chose a U-Net structure with a MobileNetV2 backbone and optimized and trimmed the model according to the characteristics of the MNN operators, so as to satisfy the performance requirements of the actual business.
6) Strategy optimization
In real meetings, many participants remain still for long periods, and running portrait segmentation at the full frame rate in this state wastes resources. For this scenario we designed an edge-region frame-difference method: based on changes in the area around the portrait edges in adjacent frames, it accurately judges whether the person is moving, while effectively ignoring interference such as speech, changes in facial expression, and changes outside the portrait region. When participants are still, this method significantly reduces how often the segmentation algorithm runs and thus greatly lowers power consumption. A rough sketch follows.
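The sketch below is a simplified, hypothetical version of such a check (the exact strategy is not spelled out above): restrict the frame difference to an edge band around the previously predicted portrait mask, and skip re-running segmentation when the change there is small. The band width and threshold are illustrative.

```python
import cv2
import numpy as np

def portrait_is_static(prev_gray, cur_gray, prev_mask, band=9, thresh=4.0):
    """prev_mask: HxW uint8 (portrait=1). Returns True if the portrait edge region barely changed."""
    kernel = np.ones((band, band), np.uint8)
    edge_band = cv2.dilate(prev_mask, kernel) - cv2.erode(prev_mask, kernel)
    diff = cv2.absdiff(cur_gray, prev_gray).astype(np.float32)
    band_pixels = edge_band > 0
    if not band_pixels.any():
        return False
    return diff[band_pixels].mean() < thresh   # small change near the edge => static
```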
Data engineering
Portrait segmentation depends heavily on data. Existing open-source datasets differ considerably from meeting scenarios, and annotating segmentation data is time-consuming and laborious. To reduce the cost of data acquisition and make better use of the existing data, we made some attempts at data synthesis and automated annotation.
1) Data synthesis
For data synthesis, we use existing models to filter out high-quality sub-datasets, apply translation, rotation, thin-plate-spline transformation, and similar transforms to increase the diversity of portrait poses and actions, and then composite the portraits onto different meeting-scene backgrounds to expand the training data. During the transformation, if the portrait label touches the image boundary, the coordinate relationship is used so that the label keeps its original intersection with the boundary in the newly composed picture; this avoids portraits that appear detached and floating away from the boundary and makes the generated images more realistic. A compositing sketch is given below.
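The sketch composites a segmented portrait onto a new background with its soft mask after a random rotation and translation; the transform ranges and blending are illustrative assumptions, and it omits the boundary-intersection handling described above.

```python
import cv2
import numpy as np

def synthesize_sample(portrait_rgb, alpha, background_rgb, max_shift=40, max_angle=10):
    """portrait_rgb: HxWx3 uint8; alpha: HxW float in [0,1]; background_rgb: HxWx3 uint8."""
    h, w = background_rgb.shape[:2]
    # random rotation + translation, applied consistently to image and mask
    angle = np.random.uniform(-max_angle, max_angle)
    tx, ty = np.random.uniform(-max_shift, max_shift, size=2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[:, 2] += (tx, ty)
    fg = cv2.warpAffine(portrait_rgb, M, (w, h))
    a = cv2.warpAffine(alpha, M, (w, h))[..., None].astype(np.float32)
    # alpha-blend the transformed portrait onto the meeting background
    composite = a * fg + (1.0 - a) * background_rgb
    return composite.astype(np.uint8), a[..., 0]
```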
2) Automated labeling and cleaning
Using a variety of open-source detection, segmentation, and matting algorithms, we built an efficient set of automatic labeling and cleaning tools for fast automatic annotation and quality inspection, which reduced the cost of acquiring labeled data (more than 50,000 valid labeled images).
Algorithm results
The algorithm is currently in internal use.
1) Technical indicators
2) Effect display
Photo background replacement scene
Besides the real-time communication scenario, we have also experimented with the portrait segmentation algorithm in interactive entertainment scenarios, such as replacing the background of a photo; the effect is shown in the figure below.
References
1. BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
2. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation
3. DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
4. Semantic Flow for Fast and Accurate Scene Parsing
5. Boundary-sensitive Network for Portrait Segmentation
6. PortraitNet: Real-time Portrait Segmentation Network for Mobile Device
7. SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder
8. Background Matting: The World is Your Green Screen
9. Boosting Semantic Human Matting with Coarse Annotations
10. Is a Green Screen Really Necessary for Real-Time Human Matting?
11. Real-Time High-Resolution Background Matting
12. SwiftNet: Real-time Video Object Segmentation
13. Pose2Seg: Detection Free Human Instance Segmentation
14. Distilling the Knowledge in a Neural Network
"Video Cloud Technology" Your most noteworthy audio and video technology public account, pushes practical technical articles from the front line of Alibaba Cloud every week, and exchanges and exchanges with first-class engineers in the audio and video field. The official account backstage reply [Technology] You can join the Alibaba Cloud Video Cloud Product Technology Exchange Group, discuss audio and video technologies with industry leaders, and get more industry latest information.