The design of RepPoints is clever: it represents a target with a set of semantically rich points and implements this representation neatly with deformable convolution. The overall network design is thorough and worth studying.





Source: WeChat official account "Xiaofei’s Algorithm Engineering Notes"

RepPoints: Point Set Representation for Object Detection

  • Paper: https://arxiv.org/abs/1904.11490
  • Code: https://github.com/microsoft/RepPoints

Introduction


Although the classical bounding box is convenient to compute, it ignores the shape and pose of the target. Features extracted from the rectangular region may also be heavily contaminated by background content or by other targets, and such low-quality features degrade detection performance. To address these problems, the paper proposes a new object representation called RepPoints, which offers finer localization and better classification.

As shown in Figure 1, RepPoints is a set of points that adaptively surround the target and carry the semantic features of local regions. Training is driven jointly by localization and classification, which constrains the points to enclose the target tightly and guides the detector to classify it correctly. This adaptive representation is differentiable, can be used consistently across multiple stages of the detector, and does not need anchors to generate a large number of initial boxes.

The RepPoints Representation


As mentioned above, a bounding box is only a coarse-grained representation of the target position. It covers the rectangular extent of the target but ignores its shape, pose, and semantically rich local regions, all of which could help the network localize the target and extract features. To address this shortcoming, RepPoints represents the target with a set of adaptive sample points:

$$\mathcal{R}=\{(x_k, y_k)\}^{n}_{k=1}$$

$n$ is the total number of sample points used to represent a target, set to 9 by default.

RepPoints refinement

Progressively refining the bounding box localization and the extracted features is key to the success of multi-stage detectors. For RepPoints, the refinement can be expressed simply as:

$$\mathcal{R}_r=\{(x_k+\Delta x_k,\ y_k+\Delta y_k)\}^{n}_{k=1}$$

$\{(\Delta x_k, \Delta y_k)\}^{n}_{k=1}$ are the predicted offsets of the new sample points relative to the old ones. All offsets are on the same scale, which avoids the scale mismatch between center-point coordinates and side lengths that arises when regressing a bounding box.
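The refinement step amounts to adding one predicted offset to each sample point. The following is a minimal sketch in plain Python; the function name `refine_points` and the example coordinates are illustrative assumptions, and a real detector would perform the same addition on tensors.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y)


def refine_points(points: List[Point], offsets: List[Point]) -> List[Point]:
    """Add one predicted (dx, dy) offset to each (x, y) sample point."""
    if len(points) != len(offsets):
        raise ValueError("points and offsets must have the same length")
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, offsets)]


# Hypothetical example: n = 9 points that all start at the same center.
center = (10.0, 20.0)
points = [center] * 9
# Made-up offsets that spread the points over a 3x3 neighbourhood.
offsets = [(dx, dy) for dy in (-1.0, 0.0, 1.0) for dx in (-1.0, 0.0, 1.0)]
refined = refine_points(points, offsets)
```

Because every offset is expressed in the same units as the point coordinates, no per-parameter normalization is needed in this step.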

Converting RepPoints to bounding box

To train with bounding box annotations and to evaluate the RepPoints-based detector, a conversion function $\mathcal{T}: \mathcal{R}_P\to \mathcal{B}_P$ turns the RepPoints into a pseudo-prediction box. There are three conversion methods:

  • $\mathcal{T}=\mathcal{T}_1$: min-max function, which takes the minimum and maximum of all RepPoints along both axes to obtain the prediction box $\mathcal{B}_P$
  • $\mathcal{T}=\mathcal{T}_2$: partial min-max function, which applies the min-max operation to a subset of the RepPoints to obtain the prediction box $\mathcal{B}_P$
  • $\mathcal{T}=\mathcal{T}_3$: moment-based function, which computes the center and size of the prediction box $\mathcal{B}_P$ from the mean and standard deviation of the RepPoints; the size is scaled by the globally shared learnable parameters $\lambda_x$ and $\lambda_y$

These functions are differentiable and can be added to the detector for end-to-end training. The experimental results show that the three conversion methods are all effective.
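As a rough illustration, the min-max and moment-based conversions can be sketched in plain Python. The function names are hypothetical, and the moment-based version assumes that the half-width and half-height are the standard deviations multiplied directly by `lambda_x` and `lambda_y`; the exact parameterization of these multipliers in a given implementation may differ.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def minmax_box(points: List[Point]) -> Box:
    """Min-max conversion: the tightest axis-aligned box around all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def moment_box(points: List[Point],
               lambda_x: float = 1.0,
               lambda_y: float = 1.0) -> Box:
    """Moment-based conversion (assumed form): center = mean of the points,
    half-size = lambda * population standard deviation along each axis."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx, cy = mean(xs), mean(ys)
    half_w = lambda_x * pstdev(xs)
    half_h = lambda_y * pstdev(ys)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Made-up point set: four corners of a 4x2 rectangle plus its center.
pts = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0), (2.0, 1.0)]
box_minmax = minmax_box(pts)
box_moment = moment_box(pts)
```

With `lambda_x = lambda_y = 1` the moment-based box is smaller than the min-max box for this point set, which is why learnable multipliers are useful for matching the scale of the ground-truth box.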

RPDet: an Anchor Free Detector


The paper designs RPDet, an anchor-free object detector based on RepPoints, with two recognition stages. Because deformable convolution samples a set of irregularly distributed points to compute its output, it is well suited to RepPoints and lets the recognition feedback guide where the sample points are placed.

Center point based initial object representation

RPDet uses the center point as the initial target representation and then refines it step by step into the final RepPoints; the center point can itself be regarded as a special case of RepPoints. Center-based methods usually face an ambiguity problem when two targets fall on the same location of the feature map. Previous approaches handle this by placing multiple default anchors at each location, whereas RPDet relies on FPN:

  • Targets of different sizes are recognized from features at different levels
  • The feature maps of the levels that handle small objects are relatively large, which reduces the chance that two objects fall on the same location

According to the statistics in the paper, only 1.1% of the objects in the COCO dataset still have this problem once the FPN constraints above are applied.

Utilization of RepPoints

As shown in Figure 2, RepPoints is the basic target representation throughout RPDet. Starting from the center point, the first set of RepPoints is obtained by regressing offsets from the center point. The second set of RepPoints represents the final target location and is obtained by refining the first set. RepPoints learning is primarily driven by two objectives:

  • The distance loss between the top-left and bottom-right corner points of the pseudo-prediction box and those of the GT box
  • The subsequent target classification loss

The first group of RepPoints is guided by distance loss and classification loss, while the second group of RepPoints is guided by distance loss only, mainly to learn more accurate target positioning.
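The corner-based distance loss can be sketched as follows. This is only an illustration: it assumes a smooth L1 penalty with a hypothetical `beta` parameter, sums over the four corner coordinates, and omits any normalization by stride or number of positive samples.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left and bottom-right corners


def smooth_l1(diff: float, beta: float = 1.0) -> float:
    """Smooth L1 penalty: quadratic for small errors, linear for large ones."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def corner_distance_loss(pred: Box, gt: Box, beta: float = 1.0) -> float:
    """Sum of smooth L1 penalties over the top-left and bottom-right corner coordinates."""
    return sum(smooth_l1(p - g, beta) for p, g in zip(pred, gt))


# Made-up pseudo-prediction box and GT box.
pseudo_box = (0.0, 0.0, 10.0, 10.0)
gt_box = (0.5, 0.0, 12.0, 10.0)
loss = corner_distance_loss(pseudo_box, gt_box)
```

In a real detector the pseudo-prediction box comes from a differentiable conversion of the RepPoints, so the gradient of this loss flows back to the point offsets.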

Backbone and head architectures

The FPN backbone contains five feature pyramid levels, from stage 3 (8× downsampling) to stage 7 (128× downsampling). The structure of the head is shown in Figure 3. The head is shared across levels and contains two independent subnets, responsible for localization (generating RepPoints) and classification:

  • The localization subnet first applies three 256-d $3\times 3$ convolutions to extract features, each followed by a group normalization layer, and then uses two small networks in succession to compute the offsets of the two sets of RepPoints.
  • The classification subnet likewise applies three 256-d $3\times 3$ convolutions, each followed by a group normalization layer. The offsets of the first set of RepPoints output by the localization subnet are then fed into a 256-d $3\times 3$ deformable convolution to extract further features, and finally the classification results are output.
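One practical detail when feeding RepPoints offsets into a deformable convolution: a deformable convolution adds its learned offsets to the regular kernel grid, whereas RepPoints offsets are measured from the center point. A possible conversion, sketched below, subtracts the regular $3\times 3$ grid positions. The function names, the `(x, y)` ordering, the row-major point order, and dilation 1 are all assumptions made for illustration; actual libraries may order or scale offsets differently.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in feature-map cells


def base_grid(kernel_size: int = 3) -> List[Point]:
    """Regular kernel sampling positions relative to the center, row-major."""
    r = kernel_size // 2
    return [(float(dx), float(dy))
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)]


def to_dcn_offsets(rep_offsets: List[Point], kernel_size: int = 3) -> List[Point]:
    """Convert center-relative point offsets to grid-relative offsets, so that
    grid position + returned offset equals the center-relative point offset."""
    grid = base_grid(kernel_size)
    if len(rep_offsets) != len(grid):
        raise ValueError("expected one offset per kernel position")
    return [(ox - gx, oy - gy) for (ox, oy), (gx, gy) in zip(rep_offsets, grid)]


# All nine points sitting on the center: every kernel tap is pulled back to it.
collapsed = to_dcn_offsets([(0.0, 0.0)] * 9)
# Points that coincide with the regular grid: a plain 3x3 convolution.
regular = to_dcn_offsets(base_grid())
```

With $n = 9$ points, each point corresponds to one tap of the $3\times 3$ kernel, which is what makes deformable convolution a natural way to extract features at the RepPoints.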

Although RPDet uses two localization stages, it is even more efficient than the one-stage RetinaNet: the anchor-free design reduces the computation of the classification layer, which more than offsets the small cost of the additional localization stage.

Localization/class target assignment

Localization consists of two stages. The first stage obtains the first set of RepPoints from the center point, and the second stage refines the first set into the second set of RepPoints. Positive samples are defined differently in the two stages:

  • In the first stage, a feature point is a positive sample if: 1) its feature pyramid level equals $s(B)=\lfloor \log_2(\sqrt{w_Bh_B}/4)\rfloor$, where $w_B$ and $h_B$ are the width and height of the GT box $B$; 2) the center point of the target projects onto that feature point on the feature map.
  • In the second stage, a feature point is a positive sample only if the pseudo-prediction box generated for it in the first stage has an IoU greater than 0.5 with the target. This is similar to current anchor-based approaches, with the first-stage output treated as the anchor.
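The first-stage rule can be sketched as two small helpers: one computes the pyramid level from the GT box size, the other finds the feature-map cell that contains the box center. Clamping the level to the range 3 to 7 and using floor division to locate the cell are assumptions made for this sketch, and the function names are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def target_level(box: Box, min_level: int = 3, max_level: int = 7) -> int:
    """s(B) = floor(log2(sqrt(w * h) / 4)), clamped to the available levels."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    level = math.floor(math.log2(math.sqrt(w * h) / 4))
    return max(min_level, min(max_level, level))


def center_cell(box: Box, level: int) -> Tuple[int, int]:
    """(column, row) of the feature-map cell containing the box center,
    assuming a stride of 2**level at that pyramid level."""
    x1, y1, x2, y2 = box
    stride = 2 ** level
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (int(cx // stride), int(cy // stride))


# Made-up 128x128 GT box.
gt = (100.0, 60.0, 228.0, 188.0)
lvl = target_level(gt)        # sqrt(128 * 128) / 4 = 32, log2(32) = 5
cell = center_cell(gt, lvl)   # center (164, 124) at stride 32
```

Only the feature point at `cell` on level `lvl` would be a first-stage positive for this box; all other feature points are negatives for it.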

Classification considers only the first set of RepPoints. A feature point is a positive sample if the pseudo-prediction box generated from its first set of RepPoints has an IoU greater than 0.5 with the target, a background sample if the IoU is less than 0.4, and is ignored otherwise.
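The classification labeling rule can be written down directly. The sketch below handles a single GT box for simplicity, which is an assumption; with several targets one would typically use the highest IoU over all of them. The function names are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classification_label(pseudo_box: Box, gt_box: Box,
                         pos_thr: float = 0.5, neg_thr: float = 0.4) -> str:
    """IoU > 0.5 is positive, IoU < 0.4 is background, anything else is ignored."""
    overlap = iou(pseudo_box, gt_box)
    if overlap > pos_thr:
        return "positive"
    if overlap < neg_thr:
        return "background"
    return "ignore"


# Made-up boxes against a 10x10 GT box.
gt_box = (0.0, 0.0, 10.0, 10.0)
label_a = classification_label((0.0, 0.0, 10.0, 8.0), gt_box)    # IoU 0.8
label_b = classification_label((0.0, 0.0, 10.0, 4.5), gt_box)    # IoU 0.45
label_c = classification_label((5.0, 5.0, 15.0, 15.0), gt_box)   # IoU 25/175
```

The gap between the two thresholds keeps borderline boxes out of both the positive and the background sets during training.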

Experiments


Performance comparison of the different pseudo-prediction box generation methods.

Performance comparison with other SOTA detection methods.

Conclusion


The design of RepPoints is clever: it represents a target with a set of semantically rich points and implements this representation neatly with deformable convolution. The overall network design is thorough and worth studying.




