Instance segmentation is one of the fundamental problems in computer vision. Although instance segmentation in still images has been studied extensively, research on Video Instance Segmentation (VIS) is still relatively scarce. Yet most of what real-world cameras receive is streaming video rather than isolated images, whether it is the surroundings perceived in real time by vehicles in autonomous driving or the long and short videos on online media platforms. Research on video modeling is therefore of great significance. This article is an interpretation of a paper published by the Meituan Unmanned Delivery team at CVPR 2021.

Preface

Instance segmentation is one of the fundamental problems in computer vision. While instance segmentation in static images has been studied extensively, research on Video Instance Segmentation (VIS) remains relatively scarce. However, what cameras in the real world receive is mostly streaming video rather than isolated images, whether it is the surroundings perceived in real time by vehicles in autonomous driving or the long and short videos on online media platforms. Research on video modeling is therefore of great significance. This article is an interpretation of "End-to-End Video Instance Segmentation with Transformers", an Oral paper published by the Meituan Unmanned Delivery team at CVPR 2021. The conference received 7015 valid submissions and finally accepted 1663 papers, an acceptance rate of 23.7%, while the Oral acceptance rate was only about 4%.

Background

Image instance segmentation is the task of detecting and segmenting the objects of interest in a static image. Video is an information carrier consisting of multiple image frames. Compared with a static image, a video carries richer information and is therefore more complex to model: in addition to spatial information, it also contains information along the temporal dimension, which makes it a closer depiction of the real world. Video instance segmentation is the task of detecting, segmenting and tracking the objects of interest in a video. As shown in Figure 1, the first row is the multi-frame image sequence of a given video, and the second row is the video instance segmentation result, where the same color corresponds to the same instance. Video instance segmentation therefore not only requires detecting and segmenting objects in every single frame, but also requires finding the correspondence of each object across frames, that is, associating and tracking them.

Related work

Existing video instance segmentation algorithms are usually complex pipelines with multiple modules and stages. The earliest algorithm, MaskTrack R-CNN [1], contains both an instance segmentation module and a tracking module. It is built by adding a tracking branch, mainly used for extracting instance features, to the image instance segmentation network Mask R-CNN [2]. In the prediction stage, the method uses an external memory module to store multi-frame instance features and uses those features as one of the cues for instance association. In essence, this method is still single-frame segmentation plus traditional tracking-based association. MaskProp [3] adds a mask propagation module on top of MaskTrack R-CNN to improve the quality of mask generation and association; this module propagates the masks extracted from the current frame to the surrounding frames. However, since the propagation depends on pre-computed single-frame segmentation masks, multiple refinement steps are needed to obtain the final masks. In essence, the method is still single-frame extraction plus inter-frame propagation, and because it relies on a combination of multiple models, it is complex and slow.

STEm-Seg [4] splits video instance segmentation into two modules: instance discrimination and category prediction. To discriminate instances, the model constructs a multi-frame clip of the video as a 3D volume and separates different objects by clustering pixel features. Since this clustering process does not predict instance categories, an additional semantic segmentation module is required to provide per-pixel category information. In summary, most existing algorithms follow the idea of single-frame image instance segmentation, splitting the video instance segmentation task into single-frame feature extraction plus multiple modules for associating frames, each supervised and trained for a single sub-task. They are slow and cannot fully exploit the temporal continuity of video. This paper aims to propose an end-to-end model that integrates instance detection, segmentation and tracking into one framework, which makes it easier to mine the spatial and temporal information of the whole video and can solve video instance segmentation at higher speed.

VisTR algorithm introduction

Redefining the problem

First, we rethink the task of video instance segmentation. Compared with a single-frame image, a video contains more complete and richer information about each instance, such as the trajectories and motion patterns of different instances, which can help overcome problems that are difficult in single-frame instance segmentation, such as similar appearance, object proximity or occlusion. On the other hand, the better representation of a single instance provided by multiple frames also helps the model track the object. Our approach therefore aims at a framework that models video instances end to end. To achieve this, our first thought was: video is itself sequence-level data, so can it be modeled directly as a sequence prediction task? For example, borrowing the sequence-to-sequence (Seq2Seq) idea from natural language processing (NLP), we can model video instance segmentation as a Seq2Seq task: given a multi-frame image sequence as input, directly output the multi-frame sequence of segmentation masks. This requires a model capable of modeling multiple frames at the same time.

The second consideration is that video instance segmentation actually covers two tasks, instance segmentation and object tracking: can they be realized within a unified framework? Our view is that segmentation is essentially the learning of similarity between pixel-level features, while tracking is essentially the learning of similarity between instance-level features, so in theory the two can be unified under the same similarity-learning framework.

Based on the above considerations, we chose a model that can perform sequence modeling and similarity learning at the same time: the Transformer [5] from natural language processing. The Transformer can naturally be used for Seq2Seq tasks, i.e., given an input sequence it produces an output sequence, and it is particularly good at modeling long sequences, which makes it well suited for modeling multi-frame sequence information in the video domain. Second, the core mechanism of the Transformer, the self-attention module, learns and updates features based on the similarity between pairs of features, which makes it possible to unify the learning of pixel-feature similarity and instance-feature similarity within one framework. These properties make the Transformer an appropriate choice for the VIS task. In addition, the Transformer has already been applied to object detection in computer vision by DETR [6]. We therefore designed VisTR, a Transformer-based video instance segmentation model.
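
To make the similarity-learning view concrete, below is a minimal sketch of scaled dot-product self-attention, the core operation referred to above. The function and tensor names are illustrative assumptions, not VisTR's actual implementation (which uses multi-head attention inside the Transformer).

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head scaled dot-product self-attention.

    x: (N, C) sequence of N feature vectors (e.g. pixel or instance features).
    w_q, w_k, w_v: (C, C) projection matrices (real models use learned
    nn.Linear layers and multiple heads; this is a bare-bones illustration).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project to query/key/value
    scores = q @ k.t() / (k.shape[-1] ** 0.5)     # pairwise similarities (N, N)
    attn = F.softmax(scores, dim=-1)              # normalize over the sequence
    return attn @ v                               # each feature becomes a
                                                  # similarity-weighted mixture

# Usage: 100 features of dimension 256, e.g. flattened pixel features
x = torch.randn(100, 256)
w_q, w_k, w_v = (torch.randn(256, 256) * 0.02 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)            # (100, 256)
```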

VisTR algorithm flow

Following the above ideas, the overall framework of VisTR is shown in Figure 2. The leftmost part of the figure is the input multi-frame image sequence (three frames as an example), and the right part is the output sequence of instance predictions, where the same shape corresponds to outputs of the same frame and the same color corresponds to outputs of the same object instance. Given a multi-frame image sequence, a convolutional neural network (CNN) first extracts the initial image features, and the multi-frame features are then combined into a feature sequence and fed into the Transformer, which models the sequence and produces the output sequence.

It is easy to see that VisTR is an end-to-end model that models multiple frames simultaneously. It does so by casting the problem as a Seq2Seq task: the input is a multi-frame image sequence, and the model directly outputs the predicted sequence of instances. Although the input and output are ordered frame by frame along the temporal dimension, the instances within each single frame are initially unordered, which would still prevent instance association and tracking. We therefore force the instance order of every output frame to be the same (in the figure, the colors under the same shape appear in the same order), so that instances at corresponding positions across frames are automatically associated, without any post-processing. To achieve this, features belonging to the same instance position must be modeled along the temporal dimension. Specifically, to realize sequence-level supervision we propose the Instance Sequence Matching module, and to realize sequence-level segmentation we propose the Instance Sequence Segmentation module. End-to-end modeling treats the spatial and temporal features of the video as a whole, so the model can learn the information of the entire video from a global perspective, while the dense feature sequence modeled by the Transformer preserves detailed information.
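
As a rough illustration of this Seq2Seq view, the sketch below wires a CNN backbone, a Transformer encoder-decoder and a fixed set of instance queries into a single forward pass. The module names, feature dimensions, class count (40 YouTube-VIS categories plus a "no object" class) and the omission of positional encoding are simplifying assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torchvision

class VisTRSketch(nn.Module):
    """Simplified end-to-end pipeline: a clip of frames in, instance sequences out."""

    def __init__(self, num_frames=3, num_instances=4, d_model=256, num_classes=41):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (T, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, nhead=8, batch_first=True)
        # one learned query per (frame, instance) slot: 3 frames x 4 instances = 12
        self.queries = nn.Embedding(num_frames * num_instances, d_model)
        self.class_head = nn.Linear(d_model, num_classes)  # categories + "no object"
        self.box_head = nn.Linear(d_model, 4)              # normalized box per query

    def forward(self, clip):                       # clip: (T, 3, H, W)
        feats = self.proj(self.backbone(clip))     # (T, C, h, w)
        T, C, h, w = feats.shape
        # flatten space and time into one long sequence (positional encoding omitted)
        src = feats.flatten(2).permute(0, 2, 1).reshape(1, T * h * w, C)
        tgt = self.queries.weight.unsqueeze(0)     # (1, T*N, C) instance queries
        hs = self.transformer(src, tgt)            # decoded instance features
        return self.class_head(hs), self.box_head(hs).sigmoid()

clip = torch.randn(3, 3, 256, 256)                 # 3 frames as in Figure 2
logits, boxes = VisTRSketch()(clip)                # (1, 12, 41), (1, 12, 4)
```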

VisTR network structure

The detailed network structure of VisTR is shown in Figure 3. Its components are introduced below:

  • Backbone: extracts the initial image features. For each input frame of the sequence, a CNN backbone first extracts the initial image features, and the multi-frame features are then serialized into a feature sequence along the temporal and spatial dimensions. Since this serialization loses the original spatial and temporal positions of pixels, while detection and segmentation are very sensitive to positional information, we encode the original temporal and spatial positions as positional encodings and add them to the extracted feature sequence to preserve the original positional information. The positional encoding follows the scheme of Image Transformer [7], only extended from two-dimensional to three-dimensional positions; details are given in the paper.
  • Encoder: models and updates the multi-frame feature sequence as a whole. Given the multi-frame feature sequence from the previous step, the Transformer encoder uses self-attention to fuse and update all features in the sequence by learning the similarity between every pair of points. By modeling the temporal and spatial features as a whole, this module can better learn and strengthen the features belonging to the same instance.
  • Decoder: decodes the output sequence of predicted instance features. Because the encoder output fed into the decoder is a dense sequence of pixel features, sparse instance features must be decoded from it. Following DETR, we introduce Instance Queries to decode representative instance features. An Instance Query is an embedding learned by the network itself; it attends to the dense input feature sequence and selects the features that can represent each instance. Taking 3 frames with 4 predicted objects per frame as an example, the model needs 3 × 4 = 12 Instance Queries to decode 12 instance predictions. Consistent with the earlier notation, the same shape denotes predictions of the same frame and the same color denotes predictions of the same object instance across frames. In this way we obtain a prediction sequence for each instance, corresponding to Instance 1 … Instance 4 in Figure 3, and in the subsequent steps the model treats the sequence of a single object instance as a whole.
  • Instance Sequence Matching: performs sequence-level matching and supervision of the predictions. The previous steps go from an input image sequence to a predicted instance sequence, but the ordering of that prediction sequence rests on two assumptions: that the frame order of the input is preserved in the output, and that the instance order within every frame's predictions is consistent. The frame order is easy to maintain, since we simply keep the input and output order the same, but the order of instances within different frames is not guaranteed, so a dedicated supervision module is needed to maintain it. In conventional object detection, every location has a corresponding anchor, so the ground-truth supervision for each location can be assigned directly. In our model, however, there is no explicit anchor or location prior, so there is no ready-made assignment of instances to input points. To obtain this supervision and supervise the sequence dimension directly, we propose the Instance Sequence Matching module, which performs bipartite matching between the prediction sequence of each instance and the ground-truth sequence of each instance in the annotations. The Hungarian algorithm is used to find the closest ground-truth sequence for each prediction, which then serves as its ground truth for supervision and for the subsequent loss computation and learning.
  • Instance Sequence Segmentation: obtains the final sequence of segmentation results. The Seq2Seq prediction process introduced above already completes sequence prediction and tracking association, but so far we have only found one representative feature vector per instance, while the final task to be solved is segmentation. How to turn this feature vector into the final mask sequence is the problem solved by the Instance Sequence Segmentation module. As discussed earlier, instance segmentation is essentially the learning of pixel similarity. The initial mask is therefore computed as the self-attention similarity between the instance prediction and the encoded feature map, and the resulting similarity map is taken as the initial attention mask of this instance for the corresponding frame. To make better use of temporal information, the multi-frame attention masks belonging to the same instance are stacked as a mask sequence and fed into a 3D convolution module for segmentation, which directly outputs the final mask sequence. In this way, the multi-frame features of the same instance are used to strengthen the single-frame segmentation, maximizing the benefit of the temporal sequence; a simplified sketch of this step is shown after this list.
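
The sketch below illustrates, in simplified form, the last step described above: each decoded instance feature is compared with the encoded feature map of its frame to obtain an initial attention map, and the per-frame maps of one instance are then stacked and refined jointly by a small 3D convolutional head. Shapes, layer sizes and names are assumptions for readability rather than the paper's exact segmentation architecture.

```python
import torch
import torch.nn as nn

class InstanceSequenceSegmentationSketch(nn.Module):
    """Turn one instance's per-frame features into a mask sequence (simplified)."""

    def __init__(self, d_model=256):
        super().__init__()
        # a small 3D conv head that refines the (T, H, W) attention volume jointly
        self.head = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(8, 1, kernel_size=3, padding=1),
        )

    def forward(self, inst_feats, enc_feats):
        # inst_feats: (T, C)        decoded feature of one instance in each frame
        # enc_feats:  (T, C, H, W)  Transformer-encoded feature maps
        T, C, H, W = enc_feats.shape
        # initial attention map: similarity between the instance feature and pixels
        attn = torch.einsum('tc,tchw->thw', inst_feats, enc_feats) / C ** 0.5
        # refine the whole sequence with 3D convolutions to exploit temporal cues
        masks = self.head(attn[None, None])        # (1, 1, T, H, W)
        return masks[0, 0].sigmoid()               # (T, H, W) soft mask sequence

# Usage: a 3-frame sequence with 256-dim features and 32x32 feature maps
seg = InstanceSequenceSegmentationSketch()
mask_seq = seg(torch.randn(3, 256), torch.randn(3, 256, 32, 32))
```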

VisTR loss function

As described above, losses need to be computed at two places during training: one is the matching cost in the Instance Sequence Matching stage, and the other is the loss function of the whole network, computed after the supervision has been assigned.

The computation of the Instance Sequence Matching stage is given in Formula 1. Since the matching stage is only used to find the supervision, and computing distances between masks is relatively expensive, only the bounding boxes and the predicted categories are considered at this stage. In the first line, yi denotes the ground-truth sequence of the i-th instance, where c denotes the category, b denotes the bounding box and T denotes the number of frames, i.e., the sequence of categories and bounding boxes of this instance over the T frames. The second and third lines describe the prediction sequence, where p denotes the predicted probability of category ci and b-hat denotes the predicted bounding box. The distance between two sequences is obtained by computing a loss between the values at corresponding positions of the two sequences, denoted Lmatch in the figure. For each prediction sequence, the ground-truth sequence with the lowest Lmatch is chosen as its supervision, and with this assignment the loss function of the whole network can then be computed.
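
Formula 1 itself is not reproduced here; based on the description above and the DETR-style matching it follows, the sequence-level matching can be written roughly as below. This is a reconstruction for readability, not a verbatim copy of the paper's notation.

```latex
% y_i = (c_i, b_{i,1}, ..., b_{i,T}): ground-truth category and T-frame box
% sequence of instance i; \hat{p}_{\sigma(i),t}(c_i) and \hat{b}_{\sigma(i),t}
% are the matched prediction's class probability and box in frame t.
\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{N}
    \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big),
\qquad
\mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)
    = \sum_{t=1}^{T} \Big[ -\hat{p}_{\sigma(i),t}(c_i)
    + \mathcal{L}_{\mathrm{box}}\big(b_{i,t}, \hat{b}_{\sigma(i),t}\big) \Big]
```

In practice the optimal assignment over instance sequences can be computed with the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment), as in DETR, but with the costs summed over the T frames of each sequence.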

Because our method integrates classification, detection, segmentation and tracking into one end-to-end network, the final loss function also contains three parts, category, bounding box and mask, while tracking is reflected by computing the losses directly over sequences. Formula 2 gives the segmentation loss: after the supervision has been assigned, the Dice loss and Focal loss between the corresponding sequences are computed as the mask loss.
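
As a concrete, hedged illustration of this sequence-level mask loss, the sketch below computes Dice and Focal losses over a whole T-frame mask sequence of one instance; the loss weights, reduction and hyper-parameters (alpha, gamma) are common defaults assumed here, not necessarily the paper's settings.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    """Dice loss over an entire (T, H, W) mask sequence."""
    pred = pred_logits.sigmoid().flatten()
    target = target.flatten()
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_loss(pred_logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss over an entire (T, H, W) mask sequence."""
    prob = pred_logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(pred_logits, target, reduction='none')
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def mask_sequence_loss(pred_logits, target, w_dice=1.0, w_focal=1.0):
    """Mask loss for one instance's whole sequence (illustrative weights)."""
    return w_dice * dice_loss(pred_logits, target) + w_focal * focal_loss(pred_logits, target)

# Usage: predicted logits and binary ground truth for a 3-frame sequence
loss = mask_sequence_loss(torch.randn(3, 32, 32),
                          torch.randint(0, 2, (3, 32, 32)).float())
```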

The final loss function, shown in Formula 3, is the sum of the sequence-level losses for classification (category probability), bounding box and segmentation (mask).
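
Written out, the overall objective described above takes roughly the following form. This is a reconstruction from the description (Formula 3 is not reproduced here); the lambda weights are assumed hyper-parameters, and each term is computed over the whole T-frame sequence of instance i.

```latex
\mathcal{L}(y, \hat{y}) = \sum_{i=1}^{N} \Big[
      -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
    + \lambda_{\mathrm{box}}  \, \mathcal{L}_{\mathrm{box}} \big(b_i, \hat{b}_{\hat{\sigma}(i)}\big)
    + \lambda_{\mathrm{mask}} \, \mathcal{L}_{\mathrm{mask}}\big(m_i, \hat{m}_{\hat{\sigma}(i)}\big)
\Big]
```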

Experimental results

To verify the effectiveness of the method, we conducted experiments on the widely used video instance segmentation dataset YouTube-VIS, which contains 2238 training videos, 302 validation videos and 343 test videos, covering 40 object categories. The evaluation metrics are AP and AR, with the IoU between multi-frame masks along the video dimension used as the threshold.
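
To make the evaluation criterion concrete: the IoU is computed between whole mask sequences rather than single frames, roughly as in the simplified sketch below (an illustration of the idea, not the official evaluation code; frames where an instance is absent are treated as all-zero masks).

```python
import numpy as np

def video_mask_iou(pred_seq, gt_seq):
    """Spatio-temporal IoU between two binary mask sequences of shape (T, H, W)."""
    pred_seq = pred_seq.astype(bool)
    gt_seq = gt_seq.astype(bool)
    inter = np.logical_and(pred_seq, gt_seq).sum()   # accumulated over all frames
    union = np.logical_or(pred_seq, gt_seq).sum()
    return inter / union if union > 0 else 0.0

# Usage: two random 3-frame mask sequences
iou = video_mask_iou(np.random.rand(3, 64, 64) > 0.5,
                     np.random.rand(3, 64, 64) > 0.5)
```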

The importance of temporal information

Compared with existing methods, the biggest difference of VisTR is that it models the video directly, and the main difference between video and images is that video contains rich temporal information. Since effectively mining and learning this temporal information is the key to video understanding, we first explore its importance. Temporal information has two aspects here: its amount (the number of frames) and the effect of ordered versus shuffled frames.

Table 1 shows the final evaluation results of models trained with clips containing different numbers of frames. It is easy to see that as the number of frames increases from 18 to 36, the AP of the model also improves, which shows that the richer temporal information provided by more frames helps model learning.

Table 2 compares models trained with clips in shuffled temporal order and clips in the original temporal order. Training in the original temporal order brings about one point of improvement, which shows that VisTR learns how objects change over time and that modeling the video in its natural temporal order also helps video understanding.

Exploring the queries

The second set of experiments explores the queries. Since our model directly models 36 frames and predicts 10 objects per frame, 360 queries are required, which corresponds to the last row ("Prediction Level") of Table 3. We wanted to see whether queries belonging to the same frame or to the same instance are correlated and can be shared. To this end we designed a "Frame Level" experiment, in which each frame uses only one query feature for prediction, and an "Instance Level" experiment, in which each instance uses only one query feature for prediction.

The results show that Instance Level is only about one point lower than Prediction Level, while Frame Level is about 20 points lower. This indicates that queries belonging to the same instance in different frames can be shared, whereas queries of different instances within the same frame cannot. The number of Prediction Level queries is proportional to the number of input frames, while Instance Level queries make the model independent of the number of input frames: only the number of instances to be predicted is fixed, not the number of input frames, which is a direction for future research. In addition, we also designed a Video Level experiment, in which the whole video is predicted with a single query embedding; this model achieves 8.4 AP.

Other designs

Below are some other designs that we found helpful during the experiments.

Since the original spatial and temporal information is lost during feature serialization, we provide positional encodings to preserve the original positional information. Comparing the rows with and without this module in Table 5, the positional information provided by the positional encoding brings an improvement of about five points.
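
A minimal sketch of extending sine-cosine positional encoding from two dimensions (x, y) to three (t, y, x) is given below; the even split of channels across the three axes, the temperature constant and the zero padding are assumptions made for illustration, not necessarily the exact scheme used in the paper.

```python
import torch

def sine_embed(positions, num_channels, temperature=10000.0):
    """1D sine-cosine embedding for a 1D tensor of positions."""
    dim_t = torch.arange(num_channels // 2, dtype=torch.float32)
    dim_t = temperature ** (2 * dim_t / num_channels)
    pos = positions[:, None] / dim_t                  # (L, num_channels / 2)
    return torch.cat([pos.sin(), pos.cos()], dim=-1)  # (L, num_channels)

def positional_encoding_3d(T, H, W, d_model=256):
    """3D positional encoding: channels split evenly over the (t, y, x) axes."""
    c = d_model // 3 // 2 * 2                         # even channel count per axis
    t = sine_embed(torch.arange(T, dtype=torch.float32), c)   # (T, c)
    y = sine_embed(torch.arange(H, dtype=torch.float32), c)   # (H, c)
    x = sine_embed(torch.arange(W, dtype=torch.float32), c)   # (W, c)
    pe = torch.cat([
        t[:, None, None, :].expand(T, H, W, c),
        y[None, :, None, :].expand(T, H, W, c),
        x[None, None, :, :].expand(T, H, W, c),
    ], dim=-1)                                        # (T, H, W, 3c)
    pad = d_model - pe.shape[-1]                      # pad if d_model % 3 != 0
    if pad > 0:
        pe = torch.cat([pe, pe.new_zeros(T, H, W, pad)], dim=-1)
    return pe                                         # (T, H, W, d_model)

pe = positional_encoding_3d(3, 32, 32)                # e.g. 3 frames of 32x32 features
```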

During segmentation, the initial attention mask is obtained by computing self-attention between the instance prediction and the encoded feature map. In Table 6 we compare segmentation using CNN-encoded features against Transformer-encoded features; the latter improves the result by about one point, which demonstrates the effectiveness of the global feature update in the Transformer.

Table 6 also compares the segmentation module with and without 3D convolution. Using 3D convolution brings about one point of improvement, which demonstrates the effectiveness of using temporal information to segment multi-frame masks directly.

Visualization results

Figure 4 shows visualization results of VisTR on YouTube-VIS, where each row is the sequence of one video and the same color corresponds to the segmentation results of the same instance. It can be seen that in the challenging cases shown in (a)-(d), including occlusion between instances, relative position changes between instances, and instances in different poses, the model achieves good segmentation and tracking, which shows that VisTR still performs well under challenging conditions.

Comparison with other methods

Table 7 compares our method with other methods on the YouTube-VIS dataset. Our approach achieves the best result among single-model methods (MaskProp combines multiple models), reaching 40.1 AP at 57.7 FPS. Of the two reported speeds, 27.7 FPS includes the sequential data-loading part (which could be optimized through parallelization), while 57.7 FPS is the speed of pure model inference. Because our method directly models 36 frames at the same time, compared with frame-by-frame processing by a model of the same scale it can ideally be up to about 36 times faster, which is more conducive to the wide application of video models.

Summary and outlook

Video instance segmentation is the task of simultaneously classifying, segmenting and tracking the objects of interest in a video. Existing approaches often design complex pipelines to solve this problem. In this paper we propose VisTR, a new Transformer-based framework for video instance segmentation that treats the task as a direct end-to-end parallel sequence decoding and prediction problem. Given a video with multiple frames as input, VisTR directly outputs, in order, the sequence of masks for each instance in the video. At its core is a new strategy for instance sequence matching and segmentation, which supervises and segments instances at the level of the whole sequence. VisTR unifies instance segmentation and tracking under the framework of similarity learning, which greatly simplifies the pipeline. Without bells and whistles, VisTR achieves the best result among all single-model methods and the fastest speed on the YouTube-VIS dataset.

To the best of our knowledge, this is the first time Transformers have been applied to video segmentation. We hope our method can inspire more research on video instance segmentation, and that this framework can be applied to more video understanding tasks in the future. For more details, please refer to the original paper, "End-to-End Video Instance Segmentation with Transformers". The code is open-sourced on GitHub at github.com/Epiphqny/Vi… ; you are welcome to read or use it.

Authors

  • Meituan Unmanned Vehicle Delivery Center: Yuqing, Zhaoliang, Baoshan, Shenhao, et al.
  • University of Adelaide: Xinlong Wang, Chunhua Shen.

References

  • [1] Video Instance Segmentation, arxiv.org/abs/1905.04… .
  • [2] Mask R-CNN, arxiv.org/abs/1703.06… .
  • [3] Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation, arxiv.org/abs/1912.04… .
  • [4] STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos, arxiv.org/abs/2003.08… .
  • [5] Attention Is All You Need, arxiv.org/abs/1706.03… .
  • [6] End-to-End Object Detection with Transformers, arxiv.org/abs/2005.12… .
  • [7] Image Transformer, arxiv.org/abs/1802.05… .

Recruitment information

The Meituan Unmanned Vehicle Delivery Center is continuously recruiting algorithm, system and hardware engineers and experts. If you are interested, please send your resume to [email protected] (email subject: Meituan Unmanned Vehicle Team).


This article was produced by the Meituan technical team and its copyright belongs to Meituan. You are welcome to reprint or use its content for non-commercial purposes such as sharing and communication, provided you credit "Content reprinted from the Meituan technical team". It may not be reproduced or used commercially without permission; for any commercial use, please email [email protected] to apply for authorization.