Guided Anchoring replaces conventional hand-designed preset anchors with anchors generated online from image features, and provides two embedding schemes that account for the adaptive nature of the generated anchors, making it a very complete solution





Source: Xiaofei’s algorithm engineering notes (WeChat official account)

Region Proposal by Guided Anchoring

  • Paper: https://arxiv.org/abs/1901.03278
  • Code: https://github.com/open-mmlab/mmdetection

Introduction


Anchors are an important mechanism in many object detection algorithms, but they also bring two problems:

  • The sizes and aspect ratios of the anchors must be designed in advance; a poor design hurts both speed and accuracy
  • To achieve a sufficient recall rate, a large number of anchors must be tiled over the feature map, which introduces many negative samples and wastes computation

Therefore, this paper proposes Guided Anchoring, which generates anchors online from image features: it first predicts the likely positions of targets and then learns the target shape at each position, so that sparse candidate anchors can be learned online from image features. However, the generated anchors vary in shape, and a fixed receptive field may not match them, so Guided Anchoring also performs adaptive feature extraction according to each anchor's shape before refining and classifying the predicted boxes.

Guided Anchoring


Guided Anchoring tries to learn the shapes and positions of anchors online, obtaining a set of anchors that is unevenly distributed over the feature map. A target can be represented by a 4-tuple $(x,y,w,h)$, and its position and shape can be assumed to follow a distribution conditioned on the image $I$:

$$p(x, y, w, h \mid I) = p(x, y \mid I)\, p(w, h \mid x, y, I) \tag{1}$$

Formula 1 consists of two parts: 1) given an image, targets exist only in certain regions, and 2) a target's shape is closely related to its position.

Based on Formula 1, the paper designs the anchor generation module in Fig. 1, which includes two branches: position prediction and shape prediction. Given an image $I$, the feature map $F_I$ is first obtained. The position prediction branch predicts, from $F_I$, the probability that each pixel is a target center, while the shape prediction branch predicts the shape associated with each position. Based on the two branches, positions whose probability exceeds a threshold are selected, together with the most suitable shape at each position, yielding the final set of anchors. Since the shapes of anchors in this set may vary greatly, each position needs features covering regions of different sizes, so the paper proposes a feature adaptation module that extracts features adaptively according to the anchor shape. The generation process above operates on a single feature map; since the overall network includes an FPN, each pyramid level is equipped with a Guided Anchoring module, with parameters shared across levels.
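As a concrete sketch, the two branches can be implemented as lightweight heads on a single FPN feature map. The module below is a minimal illustration assuming 256-channel features and the $1\times 1$ convolutions described in the text; the names and channel sizes are my own, and the official mmdetection implementation differs in detail.

```python
import torch
import torch.nn as nn

class GuidedAnchorHead(nn.Module):
    """Minimal sketch of the two-branch anchor generation module (Fig. 1)."""

    def __init__(self, in_channels=256):
        super().__init__()
        # N_L: 1x1 conv -> objectness score, later squashed by sigmoid
        self.loc_conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        # N_S: 1x1 conv -> two channels (dw, dh)
        self.shape_conv = nn.Conv2d(in_channels, 2, kernel_size=1)

    def forward(self, feat):
        loc_prob = torch.sigmoid(self.loc_conv(feat))  # p(i, j | F_I)
        shape_pred = self.shape_conv(feat)             # (dw, dh) per position
        return loc_prob, shape_pred

feat = torch.randn(1, 256, 32, 32)   # one FPN level
head = GuidedAnchorHead()
loc_prob, shape_pred = head(feat)
print(loc_prob.shape, shape_pred.shape)
# torch.Size([1, 1, 32, 32]) torch.Size([1, 2, 32, 32])
```

The same head is attached to every FPN level with shared weights, so the cost of generating anchors is small compared to the backbone.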

Anchor Location Prediction

The position prediction branch predicts a probability map $p(\cdot|F_I)$ over the feature map $F_I$, where each $p(i,j|F_I)$ is the probability that the corresponding position is the center of a target; the corresponding coordinate on the input image is $((i+\frac{1}{2})s, (j+\frac{1}{2})s)$, where $s$ is the stride of the feature map. In the implementation, the probability map is predicted by a subnetwork $\mathcal{N}_L$: a $1\times 1$ convolution on the backbone feature map $F_I$ produces an objectness score, which is then converted to a probability by an element-wise sigmoid. A more complex subnetwork could bring higher accuracy; the paper adopts the structure with the best accuracy-speed trade-off. Finally, selecting the pixel positions whose probability exceeds a threshold $\epsilon_L$ filters out about 90% of the irrelevant regions while maintaining a high recall.
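The location selection above — sigmoid the $1\times 1$-conv scores, threshold at $\epsilon_L$, and map the surviving cells back to image coordinates — can be sketched in a few lines. The function name and threshold value here are illustrative assumptions:

```python
import numpy as np

def select_anchor_locations(scores, stride, eps_l=0.5):
    """Threshold the objectness score map and return image-plane centers.

    `scores` is the raw (H, W) output of the 1x1 conv on one feature level;
    `eps_l` stands in for the paper's threshold epsilon_L.
    """
    prob = 1.0 / (1.0 + np.exp(-scores))     # element-wise sigmoid
    ys, xs = np.nonzero(prob > eps_l)        # cells above the threshold
    # map feature-map cell (i, j) to its center on the input image
    centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=1)
    return centers, prob

scores = np.full((4, 4), -10.0)
scores[1, 2] = 10.0                          # one confident location
centers, prob = select_anchor_locations(scores, stride=16)
print(centers)  # [[40. 24.]]
```

Only the surviving positions go on to the shape branch, which is where the sparsity of the generated anchor set comes from.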

Anchor Shape Prediction

The goal of the shape prediction branch is to predict the optimal shape $(w,h)$ of the target at each position. However, because the numerical range of $(w,h)$ is large, directly regressing the raw values is very unstable, so a transformation is applied first:

$$w = \sigma \cdot s \cdot e^{dw}, \qquad h = \sigma \cdot s \cdot e^{dh} \tag{2}$$

The shape prediction branch outputs $dw$ and $dh$, which are mapped to the shape $(w,h)$ by Formula 2, where $s$ is the stride of the feature map and $\sigma=8$ is a manually set scaling factor. This nonlinear transformation maps roughly $[0, 1000]$ to $[-1, 1]$, making the target easier to learn. In the implementation, a subnetwork $\mathcal{N}_S$ predicts the shape: a $1\times 1$ convolution produces a two-channel feature map corresponding to $dw$ and $dh$, which is then transformed by Formula 2. Because the anchor at each position is learned rather than fixed in advance, the learned anchors achieve a higher recall rate than preset ones.
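Formula 2's decoding, assuming the paper's $\sigma = 8$ (the function name is mine): a unit step in $dw$ multiplies the width by $e \approx 2.72$, so a small prediction range covers a wide range of absolute sizes.

```python
import numpy as np

def decode_shape(dw, dh, stride, sigma=8.0):
    """Equation 2: w = sigma * s * exp(dw), h = sigma * s * exp(dh)."""
    return sigma * stride * np.exp(dw), sigma * stride * np.exp(dh)

# dw = dh = 0 gives the "base" shape sigma * stride
w, h = decode_shape(0.0, 0.0, stride=16)
print(w, h)  # 128.0 128.0
```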

Anchor-Guided Feature Adaptation

In conventional preset-anchor methods, the anchors at every position are the same, so the same feature extraction can be applied everywhere before adjusting the anchors and predicting classes. With Guided Anchoring, however, the anchor differs at each position: ideally, a larger anchor needs features with a larger receptive field, while a smaller anchor needs a smaller one. Therefore, the paper designs the anchor-guided feature adaptation component, which transforms the feature at each position according to the anchor shape at that position:

$$f^{'}_i = \mathcal{N}_T(f_i, w_i, h_i) \tag{3}$$

$f_i$ is the feature at position $i$, $(w_i, h_i)$ is the corresponding anchor shape, and $\mathcal{N}_T$ is a $3\times 3$ deformable convolution whose offsets are obtained by passing the shape prediction branch's output through a $1\times 1$ convolution. $f^{'}_i$ is the adapted feature used for the subsequent anchor refinement and classification prediction, as shown in Figure 1.

Training


Joint objective

The overall loss function of the network is composed of four parts: classification loss, regression loss, anchor location loss, and anchor shape loss:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{loc} + \lambda_2 \mathcal{L}_{shape} + \mathcal{L}_{cls} + \mathcal{L}_{reg} \tag{4}$$
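In code, the joint objective is just a weighted sum. $\lambda_1 = 1$ and $\lambda_2 = 0.1$ are the weights I believe the paper uses, but treat the exact values here as assumptions:

```python
def joint_loss(l_cls, l_reg, l_loc, l_shape, lam1=1.0, lam2=0.1):
    """Weighted sum of the four loss terms (Equation 4)."""
    return l_cls + l_reg + lam1 * l_loc + lam2 * l_shape

print(joint_loss(1.0, 1.0, 1.0, 1.0))  # 3.1
```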

Anchor location targets

Assume a target at $(x_g, y_g, w_g, h_g)$ maps to $(x^{'}_g, y^{'}_g, w^{'}_g, h^{'}_g)$ on the feature map; define the following three regions:

  • Center region $CR = \mathcal{R}(x^{'}_g, y^{'}_g, \sigma_1 w^{'}_g, \sigma_1 h^{'}_g)$; points inside are positive samples
  • Ignore region $IR = \mathcal{R}(x^{'}_g, y^{'}_g, \sigma_2 w^{'}_g, \sigma_2 h^{'}_g) \setminus CR$ with $\sigma_2 > \sigma_1$; points inside are ignored and do not participate in training
  • Outside region $OR$: everything outside $CR$ and $IR$; points inside are negative samples

The backbone uses an FPN, and each FPN level should only be responsible for targets within a specific size range. Because the features of adjacent levels are similar, the $IR$ region is also mapped onto the adjacent levels; on those levels no $CR$ region is marked, and the mapped region likewise does not participate in training, as shown in Figure 2. When multiple targets overlap, $CR$ takes precedence over $IR$, and $IR$ takes precedence over $OR$. Focal loss is used to train the location branch.
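The $CR$/$IR$/$OR$ assignment for a single ground-truth box can be sketched as follows, labeling each feature-map cell as positive (1), ignored (-1), or negative (0). The $\sigma_1, \sigma_2$ values are illustrative; the scheme only requires $\sigma_2 > \sigma_1$:

```python
import numpy as np

def location_targets(h, w, box, sigma1=0.2, sigma2=0.5):
    """Label cells of an (h, w) feature map for one GT box.

    `box` = (cx, cy, bw, bh), already mapped to feature-map coordinates.
    """
    cx, cy, bw, bh = box
    ys, xs = np.mgrid[0:h, 0:w]
    in_cr = (np.abs(xs - cx) <= sigma1 * bw / 2) & (np.abs(ys - cy) <= sigma1 * bh / 2)
    in_ir = (np.abs(xs - cx) <= sigma2 * bw / 2) & (np.abs(ys - cy) <= sigma2 * bh / 2)
    labels = np.zeros((h, w), dtype=np.int64)  # OR: negative by default
    labels[in_ir] = -1                          # IR: ignored
    labels[in_cr] = 1                           # CR: positive (overrides IR)
    return labels

labels = location_targets(16, 16, box=(8, 8, 10, 10))
```

Writing $CR$ after $IR$ implements the precedence rule for a single box; with multiple overlapping boxes the same ordering (all $IR$ first, then all $CR$) preserves it.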

Anchor shape targets

First, define the variable-shape anchor $a_{wh} = \{(x_0, y_0, w, h) \mid w > 0, h > 0\}$ and the optimization problem between it and the ground truth:

$$\mathrm{vIoU}(a_{wh}, \mathrm{gt}) = \max_{w > 0,\, h > 0} \mathrm{IoU}(a_{wh}, \mathrm{gt}) \tag{5}$$

Solving Equation 5 exactly at every position would be prohibitively expensive, so the paper approximates it by sampling: the sampling range is a set of common anchor shapes, such as the 9 anchors of RetinaNet, and for each position the sampled anchor with the largest IoU is taken as the solution of Equation 5. A larger sampling range makes the generated anchors more accurate but costs more computation. The shape branch is trained with a bounded, smooth-L1-style loss:

$$\mathcal{L}_{shape} = \mathcal{L}_1\Big(1 - \min\big(\tfrac{w}{w_g}, \tfrac{w_g}{w}\big)\Big) + \mathcal{L}_1\Big(1 - \min\big(\tfrac{h}{h_g}, \tfrac{h_g}{h}\big)\Big) \tag{6}$$
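The sampling approximation and the shape loss can be sketched as follows. The candidate set mimics RetinaNet's 3 scales × 3 aspect ratios, and for simplicity the smooth L1 is replaced by a plain absolute value (an assumption):

```python
import numpy as np

def iou_wh(w1, h1, w2, h2):
    """IoU of two boxes sharing the same center; depends only on shape."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def shape_target(gt_w, gt_h, candidates):
    """Approximate Eq. 5: among sampled (w, h) pairs, keep the best-IoU one."""
    ious = [iou_wh(w, h, gt_w, gt_h) for w, h in candidates]
    return candidates[int(np.argmax(ious))]

def shape_loss(w, h, w_g, h_g):
    """Eq. 6 with |x| standing in for the smooth L1."""
    return abs(1 - min(w / w_g, w_g / w)) + abs(1 - min(h / h_g, h_g / h))

# 9 RetinaNet-style candidates: 3 scales x 3 aspect ratios
scales, ratios = [64, 128, 256], [0.5, 1.0, 2.0]
candidates = [(s * np.sqrt(r), s / np.sqrt(r)) for s in scales for r in ratios]
best = shape_target(100, 100, candidates)
print(best)  # (128.0, 128.0)
```

Note that the loss is driven by the width and height ratios rather than their differences, so it treats small and large boxes on an equal footing.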

The Use of High-quality Proposals


Embedding Guided Anchoring into RPN yields the enhanced GA-RPN. Compared with the original RPN, Figure 3 shows that:

  • GA-RPN produces more positive samples
  • GA-RPN produces more high-IoU candidate boxes

From these results, GA-RPN is clearly much better than RPN. However, directly replacing RPN with GA-RPN brings an AP increase of less than 1 point. The observation is that exploiting high-quality proposals requires adjusting the distribution of training samples to match the proposal distribution: when GA-RPN is used, higher positive/negative IoU thresholds should be set so that the network focuses on the high-quality candidate boxes. In addition, the paper finds that GA-RPN can also be used to fine-tune a two-stage detector: given a trained detection model, replacing its RPN with GA-RPN and training for a few more epochs brings a good performance improvement.
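The threshold adjustment amounts to a stricter label assignment for proposals. A sketch, where the actual threshold values are assumptions for illustration:

```python
import numpy as np

def assign_labels(max_ious, pos_thr, neg_thr):
    """Label proposals by their max IoU with any GT: 1 pos, 0 neg, -1 ignore."""
    labels = np.full(max_ious.shape, -1, dtype=np.int64)
    labels[max_ious >= pos_thr] = 1
    labels[max_ious < neg_thr] = 0
    return labels

ious = np.array([0.45, 0.55, 0.75])
standard = assign_labels(ious, pos_thr=0.5, neg_thr=0.4)  # [-1  1  1]
stricter = assign_labels(ious, pos_thr=0.7, neg_thr=0.5)  # [ 0 -1  1]
# raising both thresholds keeps only the highest-quality proposals positive
```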

Experiments


Comparison with various proposal methods.

Comparison of the embedding effect in different detectors.

Fine-tuning comparison.

Conclusion


Guided Anchoring solves the problem of conventional hand-designed preset anchors by generating anchors online, and provides two embedding schemes that account for the adaptive nature of the generated anchors, making it a very complete solution. However, the anchor shape targets rely on a sampling approximation of Equation 5, which keeps the computation manageable but is not entirely accurate; hopefully someone will propose a more precise and efficient way to solve Equation 5.





If you found this article helpful, please feel free to like or share it


For more content, please follow the WeChat official account [Xiaofei’s algorithm engineering notes]