This article participated in the Juejin "Newcomer Creation Ceremony" activity, setting out on the road of content creation together.

References

  • mp.weixin.qq.com/s/NooqX7aTa…
  • Paper: openaccess.thecvf.com/content/CVP…
  • Code: PeizeSun/SparseR-CNN: End-to-End Object Detection with Learnable Proposals, CVPR 2021 (github.com)

Comparison with RetinaNet and Faster R-CNN

(a) RetinaNet: densely classifies and regresses k anchor boxes at every pixel of the feature map.

(b) Faster R-CNN: selects N proposal boxes from the W × H × k dense anchors via a region proposal network (dense-to-sparse).

(c) The method proposed in this paper: directly provides N learnable proposal boxes (sparse).
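
To make the three schemes concrete, the rough candidate counts can be compared in a few lines of Python (the feature-map size, anchors per pixel, and N below are assumed typical values, not figures from the paper):

```python
# Illustrative candidate counts for the three schemes above.
# H x W is a hypothetical dense feature grid, k the anchors per pixel,
# and N the number of sparse learnable proposals.
H, W, k = 100, 152, 9
N = 100

retinanet_candidates = H * W * k   # (a) dense: every anchor is scored
faster_rcnn_pool = H * W * k       # (b) dense-to-sparse: full anchor pool...
faster_rcnn_selected = N           # ...from which N proposals survive
sparse_rcnn_candidates = N         # (c) sparse: just N learnable boxes

print(retinanet_candidates)        # 136800
print(sparse_rcnn_candidates)      # 100
```

The three-orders-of-magnitude gap between (a) and (c) is what "sparse" refers to throughout the paper.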

Abstract

This paper proposes Sparse R-CNN, a purely sparse method for object detection.

Existing work on object detection relies heavily on dense object candidates.

In contrast, this paper provides a fixed sparse set of learnable object proposals, N in total. Sparse R-CNN completely avoids all the work associated with dense candidate design and many-to-one label assignment. More importantly, the final predictions are output directly, without non-maximum suppression (NMS) post-processing. Sparse R-CNN demonstrates accuracy, runtime, and training convergence on par with well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP with the standard 3× training schedule while running at 22 FPS using a ResNet-50 FPN model.

Introduction and Related Work

Dense detectors rely on hand-crafted heuristic assignment rules during training and on NMS-style post-processing at inference time.

Sparse R-CNN

Overview of the Sparse R-CNN pipeline. The input consists of an image, a set of proposal boxes, and proposal features, where the latter two are learnable parameters. The backbone extracts a feature map; each proposal box and proposal feature is fed into its own dedicated dynamic head to generate an object feature, which finally yields the classification and localization outputs.

Sparse R-CNN consists of a backbone network, a dynamic instance-interactive head, and two task-specific prediction layers.

  • The input consists of an image, a set of proposal boxes, and proposal features
  • FPN serves as the backbone for processing the image
  • Proposal boxes: N × 4 parameters that are independent of the backbone
  • The proposal features are likewise independent of the backbone
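
As a minimal sketch of these inputs, the two learnable parameter sets can be modeled as plain arrays (NumPy stands in for PyTorch's nn.Parameter here; N and C are assumed typical values):

```python
import numpy as np

rng = np.random.default_rng(0)

N, C = 100, 256   # number of proposals and feature dimension (assumed values)

# Learnable proposal boxes: N x 4, normalized (cx, cy, w, h) in [0, 1].
# Independent of the backbone; in the real model these are parameters
# updated by back-propagation.
proposal_boxes = np.full((N, 4), 0.5)   # centers at the image center
proposal_boxes[:, 2:] = 1.0             # full-image width/height init

# Learnable proposal features: N x C, also independent of the backbone.
proposal_features = rng.standard_normal((N, C))

print(proposal_boxes.shape, proposal_features.shape)  # (100, 4) (100, 256)
```

The full-image initialization used here is one of the schemes the paper reports as working well, which is consistent with the claim that the model is largely insensitive to initialization.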

Backbone (Feature Pyramid Network, FPN)

Backbone: a ResNet-based FPN provides multi-scale feature maps.
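
As a quick illustration of what the FPN supplies, the spatial sizes of its pyramid levels follow directly from the standard ResNet-FPN strides (the input resolution below is an assumption):

```python
import math

# Spatial sizes of FPN levels P2..P6 for an assumed 800 x 1216 input.
# Strides 4..64 are the standard ResNet-FPN configuration.
H, W = 800, 1216
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64}

feature_sizes = {name: (math.ceil(H / s), math.ceil(W / s))
                 for name, s in strides.items()}
print(feature_sizes["P2"])  # (200, 304)
print(feature_sizes["P6"])  # (13, 19)
```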

Learnable proposal box

  • Each proposal box is represented by four parameters normalized to the range [0, 1]
  • Performance is largely insensitive to how the boxes are initialized
  • They can be viewed as learned statistics of where objects are likely to appear in the training data
  • The parameters are updated during training
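
Since the boxes are stored in normalized form, they must be scaled back to pixel coordinates before RoI feature extraction. A minimal sketch, assuming the normalized (cx, cy, w, h) parameterization described above and an illustrative image size:

```python
# Convert a normalized (cx, cy, w, h) proposal box in [0, 1] to absolute
# (x1, y1, x2, y2) pixel coordinates -- the form RoI operations consume.
def denormalize_box(box, img_h, img_w):
    cx, cy, w, h = box
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return x1, y1, x2, y2

# A full-image initial proposal maps back to the whole image:
print(denormalize_box((0.5, 0.5, 1.0, 1.0), img_h=480, img_w=640))
# (0.0, 0.0, 640.0, 480.0)
```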

Learnable proposal feature

  • The proposal box is a relatively crude way to describe an object: it lacks a lot of information, such as the object's shape and pose
  • The proposal feature is used to express this richer object information

Dynamic instance interactive head

Given N proposal boxes, Sparse R-CNN first extracts features for each box using the RoIAlign operation, then generates the final prediction from each box feature with the prediction head. Inspired by dynamic algorithms, we propose the dynamic instance-interactive head: each RoI feature is fed into its own dedicated head for object localization and classification, and each head is conditioned on a specific proposal feature.

  • The features of each object are obtained through the proposal boxes and the RoI operation, then combined with the proposal features to produce the final prediction
  • The number of heads equals the number of learnable proposal boxes; each head, learnable proposal box, and learnable proposal feature correspond one-to-one
def dynamic_instance_interaction(pro_feats, roi_feats):
    # pro_feats: (N, C)
    # roi_feats: (N, S*S, C)
    # parameters of two 1x1 convs: (N, 2*C*C/4)
    dynamic_params = linear1(pro_feats)
    # parameters of first conv: (N, C, C/4)
    param1 = dynamic_params[:, :C*C//4].view(N, C, C//4)
    # parameters of second conv: (N, C/4, C)
    param2 = dynamic_params[:, C*C//4:].view(N, C//4, C)
    # instance interaction for roi features: (N, S*S, C)
    roi_feats = relu(norm(bmm(roi_feats, param1)))
    roi_feats = relu(norm(bmm(roi_feats, param2)))
    # flattened roi features: (N, S*S*C)
    roi_feats = roi_feats.flatten(1)
    # object features: (N, C)
    obj_feats = linear2(roi_feats)
    return obj_feats
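
The pseudocode above can be made runnable; the NumPy sketch below reimplements the dynamic interaction with random stand-in weights for linear1/linear2 and a simplified layer norm, purely to verify the shapes (N, S, and C are assumed values, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, S, C = 100, 7, 64   # proposals, RoI grid size, channels (assumed)
Ck = C // 4            # reduced channel dimension of the dynamic convs

def relu(x):
    return np.maximum(x, 0.0)

def norm(x):
    # per-feature layer norm, a simplified stand-in for the real LayerNorm
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True) + 1e-5
    return (x - mu) / sd

W1 = rng.standard_normal((C, 2 * C * Ck)) * 0.02   # stands in for linear1
W2 = rng.standard_normal((S * S * C, C)) * 0.02    # stands in for linear2

def dynamic_instance_interaction(pro_feats, roi_feats):
    # pro_feats: (N, C), roi_feats: (N, S*S, C)
    dynamic_params = pro_feats @ W1                       # (N, 2*C*Ck)
    param1 = dynamic_params[:, :C * Ck].reshape(N, C, Ck)  # (N, C, Ck)
    param2 = dynamic_params[:, C * Ck:].reshape(N, Ck, C)  # (N, Ck, C)
    # two batched matmuls, each followed by norm + relu
    x = relu(norm(np.einsum('nsc,nck->nsk', roi_feats, param1)))
    x = relu(norm(np.einsum('nsk,nkc->nsc', x, param2)))
    return x.reshape(N, -1) @ W2                          # (N, C)

obj = dynamic_instance_interaction(rng.standard_normal((N, C)),
                                   rng.standard_normal((N, S * S, C)))
print(obj.shape)  # (100, 64)
```

Note how each proposal feature generates its own pair of 1×1-conv kernels, so every RoI interacts only with its own proposal: this is the "sparse feature interaction" discussed below.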

Image from: mp.weixin.qq.com/s/NooqX7aTa…

Two notable features of Sparse R-CNN are sparse object candidates and sparse feature interaction: there are neither thousands of dense candidates nor any dense global feature interactions. Sparse R-CNN can be regarded as extending the object detection framework along the direction from dense, through dense-to-sparse, to fully sparse.

(to be continued)