This paper proposes YOLACT, a real-time instance segmentation algorithm built on a one-stage object detection framework. The overall architecture is very lightweight and achieves a good trade-off between speed and accuracy. Source: [Xiaofei algorithm engineering notes] official account

YOLACT: Real-time Instance Segmentation

  • Paper: Arxiv.org/abs/1904.02…
  • Code: Github.com/dbolya/yola…

Introduction


Although the accuracy of current instance segmentation methods has improved greatly, they all lack real-time performance. To address this, the paper proposes YOLACT, the first real-time (above 30 fps) instance segmentation algorithm. The main contributions are as follows:

  • YOLACT, a real-time instance segmentation algorithm built on a one-stage object detection framework. The overall architecture is very lightweight and achieves a good trade-off between speed and accuracy.
  • Fast NMS, an accelerated NMS algorithm that is about 12 ms faster than standard NMS.

YOLACT


YOLACT’s main idea is to add a mask branch directly into a one-stage object detection algorithm, without any ROI-pooling (feature re-localization) operation, splitting instance segmentation into two parallel branches:

  • An FCN is used to generate larger-resolution prototype masks that are not specific to any instance.
  • The object detection branch adds an extra head to predict a mask coefficient vector, which is used to weight the prototype masks for each specific instance.

The rationale is that masks are spatially coherent, and convolution preserves this property well, so the prototype masks are generated by a fully convolutional network. Fully connected layers cannot preserve spatial coherence, but they are good at predicting semantic vectors, so they are used to generate the instance-wise mask coefficient vectors. Combining the two predictions preserves spatial coherence while adding semantic information, and keeps the speed of a one-stage detector. Finally, for each instance that survives the detection branch's NMS, the prototype masks are weighted by its mask coefficients, and the combined results are output as the instance masks.

Prototype Generation

The prototype mask branch (Protonet) predicts k prototype masks for the whole image. Protonet is implemented as an FCN, as shown in Figure 3, whose final convolution has k output channels, and it is attached to a feature map of the backbone network. The overall implementation is similar to most semantic segmentation models, but differs in that the backbone uses FPN to increase network depth while keeping a large output resolution (138×138, 1/4 of the original image size) to improve the recognition of small objects. In addition, the paper found it important not to bound Protonet's output, so that the network can give an overwhelming response for very confident prototypes (such as background). The output prototype masks can use ReLU activation or no activation; the paper finally chose ReLU.

Mask Coefficients

In classical anchor-based object detection algorithms, the detection head generally has two branches, which predict the category and the bbox offsets respectively. YOLACT adds a third branch for mask coefficient prediction, predicting k mask coefficients per anchor.

To better control and enrich the combination of the prototype masks, the mask coefficients are passed through a tanh activation, which yields more stable values with both positive and negative signs. The resulting branch is shown in Figure 2.

Mask Assembly

The prototype masks and mask coefficients are combined linearly, and sigmoid activation is applied to the result to output the final masks: M = σ(P Cᵀ), where P is the h×w×k prototype mask matrix, C is the n×k mask coefficient matrix, and n is the number of instances left after the detection branch's NMS and score filtering.
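The combination above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the shapes follow the text (an h×w×k prototype tensor and an n×k coefficient matrix), and the toy values are made up for illustration.

```python
import numpy as np

def assemble_masks(protos, coeffs):
    """Combine prototype masks with per-instance coefficients.

    protos: (h, w, k) prototype masks from Protonet.
    coeffs: (n, k) tanh-activated mask coefficients, one row per
            instance surviving NMS.
    Returns (h, w, n) instance masks in [0, 1].
    """
    # Linear combination M = sigmoid(P C^T), applied per pixel.
    logits = protos @ coeffs.T            # (h, w, n)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

# Tiny toy example: a 1x2 image, k=2 prototypes, 1 instance whose
# coefficients select the first prototype and suppress the second.
protos = np.array([[[5.0, 0.0], [0.0, 5.0]]])  # (1, 2, 2)
coeffs = np.tanh(np.array([[3.0, -3.0]]))      # (1, 2)
masks = assemble_masks(protos, coeffs)
```

Because the coefficients can be negative (thanks to the tanh), one prototype can actively subtract from another, which is what lets a small set of prototypes express many different instances.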

  • Losses

The training loss consists of three terms: the classification loss L_cls, the box regression loss L_box, and the mask loss L_mask, with weights 1, 1.5 and 6.125 respectively. The classification and regression losses are computed in the same way as in SSD, and the mask loss is a pixel-wise binary cross entropy between the assembled masks and the ground truth.

  • Cropping Masks

In the inference stage, the predicted bbox is used to crop the instance out of the final mask, which is then binarized with a manually set threshold of 0.5. During training, the GT bbox is used for cropping, and the mask loss is divided by the area of the GT bbox, which helps preserve small objects in the prototypes.
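A minimal NumPy sketch of the cropping and area-normalized mask loss described above. The helper names are hypothetical and the real implementation works on batched tensors; this only illustrates why dividing by the GT box area keeps small objects from being drowned out by large ones.

```python
import numpy as np

def crop_mask(mask, box):
    """Zero out everything outside an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    cropped = np.zeros_like(mask)
    cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]
    return cropped

def mask_loss(pred, gt, gt_box):
    """Pixel-wise BCE inside the GT box, divided by the box area."""
    x1, y1, x2, y2 = gt_box
    area = max((x2 - x1) * (y2 - y1), 1)
    p = np.clip(crop_mask(pred, gt_box), 1e-6, 1 - 1e-6)
    g = crop_mask(gt, gt_box)
    bce = -(g * np.log(p) + (1 - g) * np.log(1 - p))
    return bce.sum() / area

# A confident prediction of an all-foreground GT, cropped to a 2x2 box:
# the loss is the mean BCE over the box, independent of the box size.
pred = np.full((4, 4), 0.9)
gt = np.ones((4, 4))
loss = mask_loss(pred, gt, (0, 0, 2, 2))
```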

Emergent Behavior

Generally speaking, FCN-based segmentation needs extra tricks to introduce translation variance, such as position-sensitive feature maps. Although YOLACT's only explicit source of translation variance is cropping the final mask with the predicted box, the paper found that not cropping still works well for large objects, which means YOLACT's prototype masks learn to respond differently to different instances, as shown in Figure 5; instances can be obtained by combining the prototype masks appropriately. Note that even for an all-red input image, the prototype mask activations differ across locations. This is because padding is used in each convolution, which makes the image boundary distinguishable, so the backbone network itself carries some translation variance.

Backbone Detector

Predicting both prototype masks and mask coefficients requires rich features. To balance speed and feature richness, the backbone adopts a RetinaNet-like structure: FPN is added, P2 is removed while P6 and P7 are added, head prediction is performed on multiple levels (P3–P7), and the P3 features are used for prototype mask prediction.

The YOLACT head operates on the P3–P7 features, with anchor sizes of 24, 48, 96, 192 and 384 respectively and aspect ratios of 1, 1/2 and 2. Each head shares one 3×3 convolution, and then each branch goes through its own independent convolution for prediction, which is lighter than RetinaNet's head, as shown in Figure 4. Bbox prediction is trained with smooth-L1 loss, classification is trained with softmax cross entropy over c+1 classes (including background), and OHEM is used with a 3:1 negative-to-positive ratio.
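The anchor configuration can be made concrete with a small sketch. The per-level scales and aspect ratios below are the values reported in the paper; the area-preserving width/height convention (w·h = scale², w/h = ratio) is an assumption borrowed from RetinaNet-style anchor generation.

```python
# Anchor base size per FPN level (paper values for P3-P7) and
# the three aspect ratios shared by all levels.
scales = {"P3": 24, "P4": 48, "P5": 96, "P6": 192, "P7": 384}
aspect_ratios = [1.0, 0.5, 2.0]

def anchor_shapes(scale, ratios):
    """Return (w, h) pairs with area scale**2 and w/h = ratio."""
    return [(scale * r ** 0.5, scale / r ** 0.5) for r in ratios]

# Three anchor shapes per level, five levels -> 15 distinct shapes.
anchors = {lvl: anchor_shapes(s, aspect_ratios) for lvl, s in scales.items()}
```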

Other Improvements


Fast NMS

Standard NMS suppresses bboxes sequentially within each class, which is fast enough for a 5 fps algorithm but becomes a significant bottleneck for a 30 fps one. This paper therefore proposes Fast NMS for acceleration.

Firstly, the detection results are sorted by score in descending order within each class, and the pairwise IoUs are computed to obtain a c×n×n IoU matrix X, where c is the number of classes and n is the number of bboxes. If a bbox's IoU with any higher-scoring bbox of the same class exceeds the threshold t, that bbox is removed. The computation logic is as follows:

  • The lower triangle and diagonal of each n×n slice of X are set to zero, leaving only each bbox's IoUs with higher-scoring bboxes.
  • Take the maximum of each column (formula 2) to obtain, for each bbox, its largest IoU with any higher-scoring bbox.
  • For each class, keep the detections whose maximum IoU is below the threshold t.
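For a single class, the three steps above reduce to a few matrix operations. This is a minimal NumPy sketch; the actual implementation batches this over all c classes at once on the GPU.

```python
import numpy as np

def box_iou(boxes):
    """Pairwise IoU of (n, 4) boxes in (x1, y1, x2, y2) form."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def fast_nms(boxes, scores, iou_thr=0.5):
    """Fast NMS for one class; returns indices of kept boxes."""
    order = np.argsort(-scores)      # sort by descending score
    boxes = boxes[order]
    iou = box_iou(boxes)
    iou = np.triu(iou, k=1)          # zero lower triangle and diagonal
    max_iou = iou.max(axis=0)        # largest IoU with a higher-scoring box
    return order[max_iou <= iou_thr] # keep boxes that are not suppressed

# Two heavily overlapping boxes plus one separate box: the
# lower-scoring duplicate (index 1) should be suppressed.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = fast_nms(boxes, scores)
```

Unlike sequential NMS, a box suppressed here can still suppress others (its row is not cleared), which is where the small mAP drop comes from.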

Experiments show that Fast NMS is about 11.8 ms faster than standard NMS, while mAP drops by only 0.1.

Semantic Segmentation Loss

To improve accuracy without affecting inference speed, a semantic segmentation branch is added during training, and its loss is used to assist training. A 1×1 convolution with c output channels is attached to the largest feature map (P3). Since a pixel may belong to more than one class, sigmoid activation is used on the output instead of softmax. With a loss weight of 1, this gains about 0.4 mAP.
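A sketch of this auxiliary loss, assuming binary per-pixel, per-class target maps (sigmoid plus binary cross entropy rather than softmax, as described above); the function name and shapes are illustrative, not the paper's code.

```python
import numpy as np

def semantic_seg_loss(logits, targets):
    """Per-pixel, per-class sigmoid BCE.

    logits:  (h, w, c) raw outputs of the 1x1 conv head.
    targets: (h, w, c) binary maps; a pixel may belong to several
             classes at once, hence sigmoid instead of softmax.
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, 1e-6, 1 - 1e-6)
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return bce.mean()

# Zero logits give p = 0.5 everywhere, so the loss is log(2).
logits = np.zeros((2, 2, 3))
targets = np.zeros((2, 2, 3))
loss = semantic_seg_loss(logits, targets)
```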

Results


Mask Results

Mask Quality

Ablations

Conclusion


This paper proposes YOLACT, a real-time instance segmentation algorithm built on a one-stage object detection framework. The overall architecture is very lightweight and achieves a good trade-off between speed and accuracy.




