By Yi Chuan

Proofread | Taiyi

A virtual background is built on portrait segmentation: the person in the frame is segmented out and the original background is replaced with a new one. By application scenario, virtual backgrounds roughly fall into three categories:

Live streaming: used to create atmosphere, e.g. educational live streams, online annual meetings.

Real-time communication: used to protect user privacy, e.g. video conferencing.

Interactive entertainment: used to add fun, e.g. video editing and Douyin character effects.

What technologies are needed to implement a virtual background?

Real-time semantic segmentation

Semantic segmentation aims to predict a class label for every pixel of an image and is widely used in autonomous driving, scene understanding and other fields. With the growth of the mobile Internet, 5G and related technologies, performing high-resolution, real-time semantic segmentation on devices with limited computing power has become an increasingly pressing need. The figure above lists real-time semantic segmentation methods from recent years; this section introduces some of them.

BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

Earlier real-time semantic segmentation algorithms met the latency requirement by limiting the input size, reducing the number of network channels, or discarding deep network modules. Because too much spatial detail was discarded or model capacity was sacrificed, segmentation accuracy dropped sharply. The authors therefore proposed the Bilateral Segmentation Network (BiSeNet, ECCV 2018), whose structure is shown in the figure above. The network consists of a Spatial Path and a Context Path, which address spatial information loss and receptive field shrinkage respectively.

The Spatial Path uses a wide, shallow network to obtain high-resolution features and retain rich spatial information. The Context Path is a lightweight backbone with narrow channels and greater depth that extracts semantic information through fast downsampling and global average pooling. Finally, a Feature Fusion Module (FFM) fuses the features of the two paths to balance accuracy and speed. The method reaches 68.4% mIoU on the Cityscapes test set.
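
As a rough illustration of the two-path idea, here is a minimal PyTorch sketch of an FFM-style fusion module; the channel sizes, the attention layout and other details are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal PyTorch sketch of BiSeNet-style feature fusion (FFM).
# Channel sizes and layer details are illustrative, not the paper's exact config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    def __init__(self, spatial_ch, context_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(spatial_ch + context_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Channel attention re-weights the fused features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, spatial_feat, context_feat):
        # Upsample the low-resolution context features to the spatial path's size.
        context_feat = F.interpolate(
            context_feat, size=spatial_feat.shape[2:], mode="bilinear", align_corners=False
        )
        fused = self.conv(torch.cat([spatial_feat, context_feat], dim=1))
        return fused + fused * self.attn(fused)

# Example: a 1/8-resolution spatial feature and a 1/32-resolution context feature.
spatial = torch.randn(1, 128, 64, 64)
context = torch.randn(1, 256, 16, 16)
out = FeatureFusionModule(128, 256, 128)(spatial, context)  # -> (1, 128, 64, 64)
```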

The upgraded BiSeNetV2 continues the idea of V1; its structure is shown in the figure above. V2 removes the time-consuming skip connections in V1's Spatial Path and adds a Guided Aggregation Layer to increase information exchange between the two branches. An enhanced training strategy is also proposed to further improve segmentation quality. mIoU on the Cityscapes test set rises to 72.6%, and with TensorRT on a 1080Ti the model reaches 156 FPS.

DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation

DFANet (CVPR 2019) designs two feature aggregation strategies, sub-network aggregation and sub-stage aggregation, to improve real-time semantic segmentation. Its structure, shown in the figure above, consists of three parts: a lightweight backbone, sub-network aggregation, and sub-stage aggregation modules. The backbone is an Xception network with fast inference, topped with a fully connected attention module to enlarge the receptive field of the high-level features. Sub-network aggregation reuses the high-level features extracted by the backbone as input to the next sub-network, enlarging the receptive field and refining the prediction. The sub-stage aggregation module fuses features from corresponding stages of different sub-networks to combine multi-scale structural detail and strengthen feature discriminability. Finally, a lightweight decoder fuses the outputs of the different stages and produces coarse-to-fine segmentation results. On the Cityscapes test set the method achieves 71.3% mIoU at 100 FPS.

Semantic Flow for Fast and Accurate Scene Parsing

Inspired by optical flow, the authors observe that the relationship between any two feature maps of different resolutions generated from the same image can also be represented as a per-pixel flow, and propose SFNet (ECCV 2020); the network structure is shown in the figure above.

The authors therefore propose a Flow Alignment Module (FAM) that learns the semantic flow between features of adjacent stages and warps the high-level semantic features onto the high-resolution features, efficiently propagating the rich semantics of deep features to the shallow ones so that the resulting features carry both rich semantic and spatial information. The FAM is inserted seamlessly into an FPN to fuse the features of adjacent stages, as shown in the figure above. SFNet segments in real time (26 FPS) and achieves 80.4% mIoU on Cityscapes.
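
To make the flow idea concrete, below is a simplified PyTorch sketch of flow-based feature alignment in the spirit of FAM; the flow-prediction layer and the residual fusion at the end are my own simplifications, not the paper's exact design.

```python
# Simplified sketch of flow-based feature alignment (the idea behind SFNet's FAM).
# The flow-prediction layer and channel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlign(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-channel semantic flow field from the concatenated features.
        self.flow_conv = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, low_feat, high_feat):
        # low_feat: high-resolution, shallow; high_feat: low-resolution, deep.
        h, w = low_feat.shape[2:]
        high_up = F.interpolate(high_feat, size=(h, w), mode="bilinear", align_corners=False)
        flow = self.flow_conv(torch.cat([low_feat, high_up], dim=1))  # (B, 2, H, W)

        # Build a sampling grid offset by the predicted flow and warp the deep features.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).float().to(low_feat.device)  # (H, W, 2)
        grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)               # (B, H, W, 2)
        # Normalize coordinates to [-1, 1] for grid_sample.
        gx = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0
        gy = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
        warped = F.grid_sample(high_up, torch.stack([gx, gy], dim=-1),
                               mode="bilinear", align_corners=True)
        return low_feat + warped
```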

Portrait segmentation

Portrait segmentation is a sub-task of semantic segmentation whose goal is to separate the person in an image from the background. Compared with general semantic segmentation it is relatively simple and is usually deployed on devices such as mobile phones. Current research goals fall roughly into two categories: designing lightweight, efficient portrait segmentation networks, and improving the detail quality of portrait segmentation.

Boundary-sensitive Network for Portrait Segmentation

BSN (FG 2019) focuses on improving edge segmentation for portraits, mainly through two kinds of boundary kernels: an individual kernel computed per portrait and a global kernel computed as the average boundary over the portrait dataset. As in earlier methods, the individual kernel obtains portrait edge labels through dilation and erosion; the difference is that the edge is treated as a third class distinct from foreground and background and represented with soft labels, turning portrait segmentation into a three-class problem, as shown in the figure above. The global kernel label is obtained by averaging the boundaries over the portrait dataset, giving the network a prior on where portraits roughly lie. To provide more edge priors, the authors also add a binary classification branch that distinguishes long from short hair edges and train it jointly with the segmentation network. BSN reaches 96.7% mIoU on the EG1800 portrait segmentation test set, but has no advantage in speed.

PortraitNet: Real-time Portrait Segmentation Network for Mobile Device

PortraitNet (Computers & Graphics 2019) designs a lightweight U-NET structure based on deep detached-convolution, as shown in the figure above. In order to increase the segmentation details of portrait edges, this method generates portrait edge tags by expanding and corroding portrait tags. Used to calculate Boundary loss. At the same time, in order to enhance the robustness of the illumination, the method proposes Consistency constraint loss, as shown in the figure below, to enhance the robustness of the model by constraining the Consistency of the image segmentation results before and after the illumination transformation. The parameter size of PortraitNet model is 2.1m, and the MIOU of EG1800 portrait segmentation test set is 96.6%.

SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder

SINet (WACV 2020) focuses on speeding up portrait segmentation. It consists of an encoder with spatial squeeze modules and a decoder with an information blocking mechanism; the framework is shown in the figure above. The spatial squeeze module (shown in the figure below), built on the ShuffleNetV2 block, compresses the feature spatial resolution with pooling at different scales on different paths, extracting features with different receptive fields to handle portraits of different scales and reduce latency. The information blocking mechanism predicts portrait confidence from the deep low-resolution features; when fusing shallow high-resolution features, high-confidence regions are blocked and only the shallow features in low-confidence regions are fused, avoiding irrelevant noise. SINet reaches 95.3% mIoU on the EG1800 portrait segmentation test set with only 86.9K parameters, 95.9% fewer than PortraitNet.
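
The information blocking idea can be sketched roughly as follows (a simplified PyTorch illustration, not SINet's exact implementation): the deep prediction provides a confidence map, and shallow features are injected only where confidence is low.

```python
# Rough sketch of SINet's information blocking idea (simplified, assuming PyTorch):
# shallow features are only fused where the deep prediction is uncertain.
import torch
import torch.nn.functional as F

def information_blocking_fusion(deep_logits, deep_feat, shallow_feat):
    """deep_logits: (B, 2, h, w) low-res fg/bg prediction;
    deep_feat / shallow_feat: same-channel features at low / high resolution."""
    h, w = shallow_feat.shape[2:]
    # Confidence = probability of the predicted class at each pixel.
    prob = F.softmax(deep_logits, dim=1)
    confidence = prob.max(dim=1, keepdim=True).values            # (B, 1, h, w)
    blocking = 1.0 - confidence                                   # high where uncertain
    blocking = F.interpolate(blocking, size=(h, w), mode="bilinear", align_corners=False)
    deep_up = F.interpolate(deep_feat, size=(h, w), mode="bilinear", align_corners=False)
    # Shallow detail is injected only in low-confidence (typically boundary) regions.
    return deep_up + blocking * shallow_feat
```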

Portrait matting

An image can be regarded as composed of two parts, foreground and background, and image matting separates the foreground from the background of a given image, as shown in the figure above. Because matting is an under-constrained problem, traditional matting algorithms and early deep-learning methods take additional semantic information as a constraint, most commonly a trimap consisting of foreground, background and unknown regions, as shown in the figure below. The accuracy of such algorithms depends heavily on the quality of the trimap: when the trimap is poor, the predictions of trimap-based methods degrade severely. Trimaps are usually produced by other algorithms (such as semantic segmentation) or by manual annotation, but algorithm-generated trimaps are generally rough, and accurate manual annotation is time-consuming and laborious. Trimap-free matting algorithms have therefore gradually come into view, and this section mainly introduces recent trimap-free image matting algorithms.
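
The compositing relation that matting inverts is worth stating explicitly. Below is a minimal NumPy sketch of the standard compositing equation I = αF + (1 − α)B; replacing B with a new background image is exactly the virtual-background operation.

```python
# The standard compositing equation behind matting: I = alpha * F + (1 - alpha) * B.
# A minimal NumPy sketch; swapping in a new background B yields the virtual background.
import numpy as np

def composite(foreground, background, alpha):
    """foreground, background: (H, W, 3) float images in [0, 1];
    alpha: (H, W) matte in [0, 1]."""
    alpha = alpha[..., None]  # broadcast over the color channels
    return alpha * foreground + (1.0 - alpha) * background
```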

Background Matting: The World is Your Green Screen

Background Matting (CVPR 2020) tries to improve portrait matting by introducing static background information: compared with carefully annotating a trimap by hand, capturing the static background behind the foreground is relatively easy. The workflow is shown in the figure above: two photos are taken, one containing the foreground and one without it, a deep network then predicts the alpha channel, and a new background is composited in. The method is therefore mainly aimed at portrait matting with a static background and only slight camera shake.

The model structure is shown in the figure above, and processing proceeds as follows. Given an input image and a background image, a coarse foreground segmentation is first obtained with a segmentation model such as DeepLab; a Context Switching Block then selects how to combine the different inputs, and a decoder predicts both the foreground and the alpha channel. Training has two stages. The model is first trained on Adobe's composite dataset; then, to reduce the impact of the domain gap between composite and real images, LS-GAN adversarial training is carried out in the second stage on unlabeled real images and background images. By shrinking the distance between images composited with the predicted alpha channel and real images, the matting quality improves. When the background changes substantially or the foreground differs greatly, the method performs poorly.

Boosting Semantic Human Matting with Coarse Annotations

According to this paper (CVPR 2020), the quality of portrait matting is mainly limited by two factors: the accuracy of the trimap, and the high cost and low efficiency of obtaining accurate portrait annotations, which keeps portrait matting datasets small. The paper therefore proposes to improve matting by combining a small amount of finely annotated data ((b) above) with a large amount of coarsely annotated data ((a) above).

The network structure is shown in the figure above and consists of three modules: MPN, a coarse mask prediction network; QUN, a mask quality unification network; and MRN, a matting refinement network, which optimize the segmentation result in a coarse-to-fine manner. During training, the coarsely and finely annotated data are first used together to train MPN to obtain coarse masks, and then the finely annotated data is used to train MRN to refine the results. The authors found, however, that because of the difference between coarse and fine annotations there was a large gap between MRN's predictions and the expected results, hurting performance. QUN was therefore proposed to unify the quality of the coarse mask predictions.

The experimental results are shown in the figure above. Compared with training on fine data alone, adding coarse data greatly helps the network extract semantic information. At the same time, QUN combined with MRN can refine the coarse annotations of existing datasets, reducing the cost of obtaining fine annotations.

Is a Green Screen Really Necessary for Real-Time Human Matting?

Existing portrait matting algorithms either require additional inputs (such as a trimap or a background image) or rely on multiple models; the overhead of obtaining extra inputs, or the computational overhead of running multiple models, keeps them from real-time use. The authors of this paper therefore propose a lightweight matting algorithm that uses only a single input image and runs in real time, reaching 63 FPS on 512×512 images on a 1080Ti.

As shown in the figure above, the method has three parts. First, the matting network is trained with multi-task supervision on annotated data; then the model is fine-tuned on unannotated real data with the SOC (sub-objectives consistency) self-supervised strategy to improve generalization; at test time, the OFD (one-frame delay) strategy smooths the predictions. The three parts are described in detail below.

The network structure of MODNet is shown in the figure above. Inspired by trimap-based methods, the authors decompose the trimap-free portrait matting task into three related sub-tasks trained jointly: semantic prediction, detail prediction, and semantic-detail fusion. A low-resolution semantic branch captures the overall shape of the portrait, a high-resolution detail branch extracts the edge details, and the final matte is obtained by fusing semantics and details.

When the model is applied to data from a new domain, the outputs of the three branches may become inconsistent. The authors therefore propose the SOC self-supervised strategy on unlabeled data: the semantic and detail components of the fusion branch's prediction are constrained to agree with the outputs of the semantic and detail branches respectively. This consistency constraint among the sub-tasks strengthens the model's generalization ability.
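
A hedged sketch of what such a consistency constraint could look like (simplified, assuming PyTorch; the exact terms and weights differ from the MODNet paper):

```python
# Simplified illustration of a MODNet-style sub-objective consistency (SOC) loss on
# unlabeled images: the fused alpha is pushed to agree with the semantic and detail
# branches. Terms and weighting are illustrative, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def soc_loss(semantic_pred, detail_pred, alpha_pred, boundary_mask):
    """semantic_pred: (B,1,h,w) coarse portrait probability (low resolution);
    detail_pred, alpha_pred: (B,1,H,W); boundary_mask: (B,1,H,W) 0/1 edge region."""
    # Downsample the fused alpha and compare it with the semantic branch.
    alpha_lr = F.interpolate(alpha_pred, size=semantic_pred.shape[2:],
                             mode="bilinear", align_corners=False)
    semantic_term = F.mse_loss(alpha_lr, semantic_pred)
    # In the boundary region the fused alpha should match the detail branch.
    detail_term = F.l1_loss(alpha_pred * boundary_mask, detail_pred * boundary_mask)
    return semantic_term + detail_term
```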

Predicting each video frame independently leads to temporal inconsistency between adjacent frames, which appears as inter-frame flicker. The authors found that a flickering pixel can usually be corrected from the predictions of the neighboring frames, as shown in the figure above. OFD therefore works as follows: if the predictions of the previous and next frames differ from each other by less than a threshold, while the current frame's prediction differs from both by more than the threshold, the current frame's value is replaced by the average of the two neighboring predictions, suppressing the flicker caused by inconsistent predictions across frames.
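
This rule translates almost directly into code. Below is a small NumPy sketch of the one-frame-delay smoothing described above; the threshold value is illustrative.

```python
# Sketch of the one-frame-delay (OFD) smoothing rule: if the previous and next frames
# agree with each other but both disagree with the current frame, the current pixel
# is replaced by their average. The threshold value is an illustrative assumption.
import numpy as np

def ofd_smooth(alpha_prev, alpha_cur, alpha_next, threshold=0.1):
    """alpha_*: (H, W) predicted mattes of three consecutive frames in [0, 1]."""
    neighbors_agree = np.abs(alpha_prev - alpha_next) <= threshold
    cur_deviates = (np.abs(alpha_cur - alpha_prev) > threshold) & \
                   (np.abs(alpha_cur - alpha_next) > threshold)
    flicker = neighbors_agree & cur_deviates
    out = alpha_cur.copy()
    out[flicker] = 0.5 * (alpha_prev[flicker] + alpha_next[flicker])
    return out
```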

Real-Time High-Resolution Background Matting

Although existing portrait matting algorithms can produce fairly fine results, they cannot process high-resolution images in real time; for example, Background Matting can only process about 8 frames of 512×512 images per second on a 2080Ti, which does not meet real-time requirements. The authors observed that in a high-resolution image only a few regions need fine segmentation (as shown in the figure above), while most regions only need coarse segmentation; optimizing only the few regions that need refinement saves a great deal of computation. Borrowing the idea of PointRend and building on Background Matting, they propose a two-stage portrait matting network that processes high-resolution images in real time: 60 FPS on HD images (1920×1080) and 30 FPS on 4K images (3840×2160) on a 2080Ti.

The proposed two-stage network, shown in the figure above, consists of a base network and a refinement network. In the first stage, the base network uses an encoder-decoder structure similar to DeepLabV3+ to produce a coarse alpha matte, a foreground residual, an error prediction map, and hidden features with global semantics. In the second stage, the error map from the first stage is used to select the top-K patches that need refinement, which are then further optimized. Finally, the refined patches are merged with the directly upsampled coarse result to obtain the final matte. Compared with other methods, this approach improves markedly in both speed and model size, as shown in the figure below.
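
A simplified PyTorch sketch of the patch-selection step: average the error map over a patch grid and keep the K patches with the largest predicted error. The patch size, K, and the exact selection rule are illustrative assumptions.

```python
# Sketch of second-stage patch selection: pick the top-K patches with the largest
# predicted error and refine only those regions at high resolution.
import torch
import torch.nn.functional as F

def select_refinement_patches(error_map, patch_size=8, k=1000):
    """error_map: (B, 1, H, W) error prediction from the base network.
    Returns (row, col) indices of the K patches with the highest mean error."""
    patch_err = F.avg_pool2d(error_map, kernel_size=patch_size)   # (B, 1, H/p, W/p)
    b, _, ph, pw = patch_err.shape
    flat = patch_err.view(b, -1)
    k = min(k, flat.shape[1])
    _, idx = flat.topk(k, dim=1)                                   # (B, K)
    rows = torch.div(idx, pw, rounding_mode="floor")
    cols = idx % pw
    return rows, cols  # patch-grid coordinates to refine in the second stage
```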

Video portrait segmentation

Video Object Segmentation (VOS) aims to obtain a pixel-level segmentation of the object of interest in every frame of a video. Compared with single-frame segmentation, video segmentation mainly exploits the continuity across frames to achieve smooth and accurate results. VOS tasks are currently divided into semi-supervised (one-shot) and unsupervised (zero-shot) settings: the former takes the original video plus the segmentation of its first frame as input, while the latter takes only the original video. Existing semi-supervised VOS algorithms struggle to be both accurate and real-time, so research usually focuses on one of the two. The results of an existing VOS algorithm [12] are shown in the figure below.

Application of virtual background technology in video conference

Video conferencing is a high-frequency scenario in daily work, and with the rise of remote work the demand for protecting user privacy has grown, giving birth to the virtual background feature. Unlike high-performance cloud servers, the devices running video conferencing in personal scenarios are mostly laptops of widely varying performance. At the same time, video conferencing demands high real-time performance and must handle varied meeting backgrounds, which imposes stricter requirements on the on-device algorithm.

The real-time requirement forces the on-device portrait segmentation model to be extremely light, but a small model handles hard cases poorly (for example, portrait edges or backgrounds that resemble the person) and is sensitive to the data, which easily leads to background regions being misclassified as portrait, blurred portrait edges, and similar problems. To address these issues, we made targeted adjustments and optimizations in both the algorithm and data engineering.

Algorithm exploration

1) Edge optimization:

The first edge optimization method constructs an edge loss. Following MODNet, the portrait mask is dilated and eroded to obtain an edge-region label, and the loss computed on this edge region strengthens the network's ability to capture edge structure.
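
A minimal sketch of this edge loss, assuming OpenCV and PyTorch; the kernel size and the weighting scheme are illustrative choices, not our exact production settings.

```python
# Sketch of the edge loss described above (simplified): dilate and erode the portrait
# mask to get an edge band, then up-weight the cross-entropy loss inside that band.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def edge_region(mask, kernel_size=7):
    """mask: (H, W) uint8 portrait mask in {0, 1}. Returns the edge band as {0, 1}."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(mask, kernel)
    eroded = cv2.erode(mask, kernel)
    return (dilated - eroded).astype(np.float32)  # 1 only near the boundary

def edge_loss(logits, target, edge_mask, weight=2.0):
    """logits: (B, 2, H, W); target: (B, H, W) long; edge_mask: (B, H, W) float."""
    per_pixel = F.cross_entropy(logits, target, reduction="none")
    return (per_pixel * (1.0 + weight * edge_mask)).mean()
```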

The second edge optimization method uses an OHEM loss. Compared with the main body of the portrait, the edge region is more prone to misclassification; by mining hard examples from the segmentation predictions online during training, the edge region is implicitly optimized.
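
A common OHEM formulation for per-pixel cross-entropy looks like the following sketch (assuming PyTorch; the keep ratio and threshold are illustrative, and this is not necessarily the exact variant we use):

```python
# Sketch of an OHEM cross-entropy loss: keep only the hardest pixels when averaging.
import torch
import torch.nn.functional as F

def ohem_ce_loss(logits, target, keep_ratio=0.25, min_threshold=0.7):
    """logits: (B, C, H, W); target: (B, H, W) long."""
    per_pixel = F.cross_entropy(logits, target, reduction="none").view(-1)
    n_keep = max(1, int(per_pixel.numel() * keep_ratio))
    sorted_loss, _ = per_pixel.sort(descending=True)
    # Keep at least the top fraction of hardest pixels, plus any clearly hard ones.
    threshold = min(sorted_loss[n_keep - 1].item(), min_threshold)
    hard = per_pixel[per_pixel >= threshold]
    return hard.mean() if hard.numel() > 0 else per_pixel.mean()
```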

2) Unsupervised learning:

The first unsupervised learning method works through data augmentation, following PortraitNet. A given input image is processed with color jitter, Gaussian blur and added noise to produce a transformed image; although the transformed image differs in appearance, the foreground person in the two images is the same, so a KL loss can constrain the predictions before and after augmentation to be consistent, making the network more robust to changes in illumination, blur and other external conditions.
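
A minimal sketch of such a consistency loss, assuming PyTorch; treating the prediction on the clean image as a detached target is one possible choice, not necessarily the exact formulation used here.

```python
# Sketch of the consistency constraint described above: predictions for an image and
# its photometrically augmented copy (color jitter / blur / noise) should match.
import torch
import torch.nn.functional as F

def consistency_kl_loss(logits_clean, logits_aug, temperature=1.0):
    """logits_*: (B, C, H, W) predictions before and after photometric augmentation."""
    p_clean = F.softmax(logits_clean.detach() / temperature, dim=1)  # target distribution
    log_p_aug = F.log_softmax(logits_aug / temperature, dim=1)
    return F.kl_div(log_p_aug, p_clean, reduction="batchmean")
```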

The second unsupervised learning method is adversarial training on unlabeled real images and background images, following Background Matting. During model training an additional discriminator network judges whether its input is a composite of the network's predicted portrait foreground over a random background or a real photo, which reduces artifacts in the predictions.

3) Multi-task learning

Multi-task learning usually means adding sub-tasks related to the original task for joint training to improve performance on the original task, as with the detection and segmentation tasks in Mask R-CNN. One difficulty of portrait segmentation is that when the person in the video makes certain movements (such as waving), the segmentation of the arms and similar parts is poor. To better capture body information, we introduced human pose information into model training as an additional task, following Pose2Seg, so that analyzing the portrait's pose helps capture body movement. At inference only the trained portrait segmentation branch is used, improving segmentation accuracy while preserving performance.

4) Knowledge distillation

Knowledge distillation is widely used in model compression and transfer learning and usually follows a teacher-student strategy. A strong teacher model (such as DeepLabV3+) is trained in advance; when the student model is trained, the soft labels produced by the teacher serve as supervisory information to guide the student. Compared with the original one-hot labels, the teacher's soft labels carry knowledge about the similarity between classes, which helps the student model converge better.
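
A sketch of the standard soft-label distillation loss in the sense of [14], assuming PyTorch and per-pixel logits; the temperature and weighting are illustrative.

```python
# Standard soft-label distillation: the student matches the teacher's softened class
# distribution plus the usual cross-entropy term on the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.5):
    """student_logits, teacher_logits: (B, C, H, W); target: (B, H, W) long."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, target)
    return alpha * kd + (1.0 - alpha) * ce
```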

5) Model lightweighting

Based on the needs of the business scenario, we chose a U-Net structure with a MobileNetV2 backbone and, according to the characteristics of MNN operators, optimized and pruned the model to meet the performance requirements of the actual service.

6) Strategy optimization

In actual meetings, many participants stay still for long stretches, and running portrait segmentation at the full frame rate in that state wastes resources. For such scenes we designed an edge-position frame-difference method: based on the changes in the edge region of the portrait between adjacent frames, it accurately judges whether the person is moving, while effectively ignoring interference from speech, facial expression changes, and changes outside the portrait region. When participants are still, the method greatly reduces how often the segmentation algorithm runs and thus significantly lowers energy consumption.
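
A rough sketch of the edge-position frame-difference trigger, assuming OpenCV; the band width and threshold are illustrative, and the real implementation may differ.

```python
# Sketch of the edge-position frame-difference idea: compute the mean absolute
# difference between consecutive frames inside the previous portrait's edge band, and
# only re-run segmentation when it exceeds a threshold. Constants are illustrative.
import cv2
import numpy as np

def portrait_moved(prev_gray, cur_gray, prev_mask, diff_threshold=8.0, kernel_size=15):
    """prev_gray, cur_gray: (H, W) uint8 grayscale frames;
    prev_mask: (H, W) uint8 portrait mask in {0, 1} from the last segmentation."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    edge_band = cv2.dilate(prev_mask, kernel) - cv2.erode(prev_mask, kernel)
    diff = cv2.absdiff(cur_gray, prev_gray).astype(np.float32)
    mean_edge_diff = (diff * edge_band).sum() / max(edge_band.sum(), 1)
    return mean_edge_diff > diff_threshold  # True -> re-run the segmentation model
```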

Data engineering

Portrait segmentation is heavily data-dependent. Existing open-source datasets differ considerably from the meeting scenario, and segmentation annotations are time-consuming and laborious to obtain. To reduce data acquisition cost and make better use of existing data, we experimented with data synthesis and automatic annotation.

1) Data synthesis

For data synthesis, we used existing models to screen out high-quality sub-datasets, applied translation, rotation, thin-plate-spline transformation and other methods to increase the diversity of portrait poses and movements, and then composited the portraits onto different meeting-scene backgrounds to expand the training data. During these transformations, if the portrait mask intersects the image boundary, the coordinate relationship is used to preserve the original intersection when the new image is composited, so that the portrait does not detach from the boundary or appear to float, making the generated images more realistic.

2) Automatic labeling and cleaning

Using a variety of existing open-source detection, segmentation and matting algorithms, we built a set of efficient automatic annotation and cleaning tools for fast, automated data labeling and quality inspection, reducing the cost of annotation (over 50,000 valid images annotated).

Algorithm results

At present, the algorithm has been put into use internally.

1) Technical indicators

2) Effect display

Photo background replacement

Besides the real-time communication scenario, we also experimented with the portrait segmentation algorithm in interactive entertainment scenarios, such as replacing the background of a photo; the effect is shown in the pictures below.

References

  1. BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
  2. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation
  3. DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
  4. Semantic Flow for Fast and Accurate Scene Parsing
  5. Boundary-sensitive Network for Portrait Segmentation
  6. PortraitNet: Real-time Portrait Segmentation Network for Mobile Device
  7. SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder
  8. Background Matting: The World is Your Green Screen
  9. Boosting Semantic Human Matting with Coarse Annotations
  10. Is a Green Screen Really Necessary for Real-Time Human Matting?
  11. Real-Time High-Resolution Background Matting
  12. SwiftNet: Real-time Video Object Segmentation
  13. Pose2Seg: Detection Free Human Instance Segmentation
  14. Distilling the Knowledge in a Neural Network
