Guided Anchoring addresses the problems of conventional manually preset anchors by generating anchors online from image features, and offers two schemes for embedding it into existing detectors, making it a fairly complete solution.

Source: WeChat public account [Algorithm Engineering Notes of Xiaofei]

Paper: Region Proposal by Guided Anchoring

  • Paper address: Arxiv.org/abs/1901.03…
  • Paper code: Github.com/open-mmlab/…

Introduction


Anchors are an important mechanism in many object detection algorithms, but they also bring two problems:

  • Suitable anchor sizes and aspect ratios must be designed in advance; a poor design hurts both speed and accuracy
  • To achieve a sufficient recall rate, a large number of anchors must be tiled over the feature map, which introduces many negative samples and wastes computation

Therefore, the paper proposes Guided Anchoring, which generates anchors online from image features: it first predicts where targets may appear, and then learns the shape of the target at each of those positions, yielding a sparse set of candidate anchors. However, since the anchors generated online vary in shape, a fixed receptive field may not match them. Guided Anchoring therefore extracts features adaptively according to each anchor's shape before refining and classifying the predicted boxes.

Guided Anchoring


Guided Anchoring learns anchor shapes and their positions online, producing a set of anchors that is not uniformly distributed over the feature map. A target can be represented by a 4-tuple $(x, y, w, h)$, and its position and shape can be treated as a distribution conditioned on the image $I$:

$$p(x, y, w, h|I) = p(x, y|I)\,p(w, h|x, y, I) \tag{1}$$

Formula 1 consists of two parts: 1) given the image, the target only exists in certain regions; 2) the shape is closely related to the position.

Based on Formula 1, the paper designs the anchor generation module in Figure 1, which contains a location prediction branch and a shape prediction branch. Given an image $I$, the feature map $F_I$ is obtained first. The location prediction branch predicts, from $F_I$, the probability that each pixel is a target position, while the shape prediction branch predicts the shape associated with each position. Based on the two branches, positions whose probability is above a threshold, together with the most suitable shape for each such position, are selected to obtain the final anchor set. Since the anchor shapes in the set can vary greatly, each location needs features covering regions of different sizes, so the paper proposes Feature adaption to extract features adaptively according to the anchor shape. The generation process above operates on a single feature map; the overall network uses FPN, so each level is equipped with a Guided Anchoring module, and the parameters are shared across levels.
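
A minimal PyTorch sketch of the two prediction branches (layer sizes are assumptions for illustration; the Feature adaption step is sketched separately in its own subsection below):

```python
import torch
import torch.nn as nn

class AnchorGeneration(nn.Module):
    """Sketch of Figure 1's two branches, applied to one FPN level's feature map F_I."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.loc_conv = nn.Conv2d(in_channels, 1, kernel_size=1)    # location branch
        self.shape_conv = nn.Conv2d(in_channels, 2, kernel_size=1)  # shape branch -> (dw, dh)

    def forward(self, feat):
        loc_prob = torch.sigmoid(self.loc_conv(feat))  # p(i, j | F_I), one score per position
        shape_pred = self.shape_conv(feat)             # (dw, dh), decoded via Formula 2 below
        return loc_prob, shape_pred
```

Because the parameters are shared across FPN levels, a single instance of such a module would be applied to every level's feature map.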

Anchor Location Prediction

The location prediction branch predicts a probability map $p(\cdot|F_I)$ for the feature map $F_I$, where each $p(i, j|F_I)$ is the probability that position $(i, j)$ is the center of a target, corresponding to coordinate $((i+\frac{1}{2})s, (j+\frac{1}{2})s)$ in the input image, with $s$ the stride of the feature map. Concretely, an objectness score is obtained from $F_I$ with a $1\times 1$ convolution and converted to a probability with an element-wise sigmoid. A more complex subnetwork could bring higher accuracy, but the paper adopts the structure with the best accuracy/speed trade-off. Finally, positions whose probability is above a threshold $\epsilon_L$ are kept, which filters out about 90% of the irrelevant regions while maintaining high recall.
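
A sketch of the thresholding and the mapping back to image coordinates (the threshold value 0.01 is a placeholder, not the paper's setting):

```python
import torch

def select_locations(loc_prob: torch.Tensor, stride: int, eps_loc: float = 0.01):
    """Keep positions whose objectness exceeds eps_loc and map them to image coordinates.

    loc_prob: (H, W) probability map from the location branch (after the sigmoid).
    """
    keep = loc_prob > eps_loc                       # filters out most irrelevant positions
    ys, xs = torch.nonzero(keep, as_tuple=True)     # feature-map indices (i, j)
    cx = (xs.float() + 0.5) * stride                # (j + 1/2) * s in the input image
    cy = (ys.float() + 0.5) * stride                # (i + 1/2) * s in the input image
    return cx, cy
```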

Anchor Shape Prediction

The goal of the shape prediction branch is to predict the best shape $(w, h)$ of the target corresponding to each position. However, because the numerical range of $(w, h)$ is large, directly regressing the raw values is unstable, so a transformation is applied first:

$$w = \sigma \cdot s \cdot e^{dw}, \qquad h = \sigma \cdot s \cdot e^{dh} \tag{2}$$

The shape prediction branch outputs $dw$ and $dh$, which are transformed into the shape $(w, h)$ by Formula 2, where $s$ is the stride of the feature map and $\sigma=8$ is a manually set scaling factor. This non-linear transformation maps roughly [0, 1000] to [-1, 1] and is easier to learn. In the implementation, shape prediction is carried out by a subnet $\mathcal{N}_S$: a $1\times 1$ convolution first produces two feature maps corresponding to $dw$ and $dh$, which are then transformed by Formula 2. Because the anchor at each position is learned rather than preset and fixed, the recall rate of the learned anchors is higher.
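
A small sketch of the transform in Formula 2 and its inverse (the inverse is what one would use when encoding shape targets; the constant follows the text above):

```python
import torch

SIGMA = 8.0  # scaling factor from the paper

def decode_shape(shape_pred: torch.Tensor, stride: int):
    """Formula 2: (dw, dh) -> (w, h). shape_pred has two channels: dw and dh."""
    return SIGMA * stride * torch.exp(shape_pred)

def encode_shape(wh: torch.Tensor, stride: int):
    """Inverse of Formula 2: (w, h) -> (dw, dh), used when building regression targets."""
    return torch.log(wh / (SIGMA * stride))
```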

Anchor-Guided Feature Adaptation

With conventional preset anchors, the anchors at every position are the same, so the same feature extraction can be applied everywhere before adjusting the anchors and predicting classes. With Guided Anchoring, however, the anchor at each position is different. Ideally, a larger anchor needs features with a larger receptive field, and a smaller anchor needs a smaller one. The paper therefore designs an anchor-guided feature adaptation component, which transforms the feature at each position according to its anchor shape:

$$f'_i = \mathcal{N}_T(f_i, w_i, h_i) \tag{3}$$

$f_i$ is the feature at position $i$, $(w_i, h_i)$ is the corresponding anchor shape, and $\mathcal{N}_T$ is a $3\times 3$ deformable convolution whose offsets are obtained by passing the output of the shape prediction branch through a $1\times 1$ convolution. $f'_i$ is the adapted feature, used for the subsequent anchor refinement and classification, as shown in Figure 1.
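
Isolating the adaptation step with torchvision's deformable convolution (the $1\times 1$ conv mapping the 2-channel shape prediction to the 18 offset channels of a $3\times 3$ kernel is an assumption about layer sizes, not the exact open-mmlab layout):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

channels = 256
offset_conv = nn.Conv2d(2, 2 * 3 * 3, kernel_size=1)                     # offsets from (dw, dh)
adapt_conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)  # N_T

feat = torch.randn(1, channels, 50, 50)    # F_I for one level
shape_pred = torch.randn(1, 2, 50, 50)     # output of the shape prediction branch

offset = offset_conv(shape_pred)           # (1, 18, 50, 50): x/y offset per kernel tap
adapted = adapt_conv(feat, offset)         # f'_i, same spatial size as feat
```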

Training


Joint objective

The overall loss function of the network consists of four parts: classification loss, regression loss, anchor location loss and anchor shape loss:

$$L = \lambda_1 L_{loc} + \lambda_2 L_{shape} + L_{cls} + L_{reg} \tag{4}$$
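
In code the joint objective is just a weighted sum (the λ defaults below are placeholders, not necessarily the paper's settings):

```python
def joint_loss(loss_cls, loss_reg, loss_loc, loss_shape,
               lambda1: float = 1.0, lambda2: float = 0.1):
    """Formula 4: combine the four loss terms with balancing weights."""
    return loss_cls + loss_reg + lambda1 * loss_loc + lambda2 * loss_shape
```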

Anchor location targets

Assume the target $(x_g, y_g, w_g, h_g)$ is mapped onto the feature map as $(x'_g, y'_g, w'_g, h'_g)$; define the following three regions:

  • Center region $CR = \mathcal{R}(x'_g, y'_g, \sigma_1 w'_g, \sigma_1 h'_g)$; all points inside it are positive samples
  • Ignore region $IR = \mathcal{R}(x'_g, y'_g, \sigma_2 w'_g, \sigma_2 h'_g) \setminus CR$ with $\sigma_2 > \sigma_1$; points inside it are ignored and do not participate in training
  • Outside region $OR$, everything outside $CR$ and $IR$; points inside it are negative samples

The backbone uses FPN, and each FPN level should only be responsible for targets within a specific size range. Since the features of adjacent levels are similar, the IR region (excluding CR) is also mapped to the adjacent levels, and those regions likewise do not participate in training, as shown in Figure 2. When multiple targets overlap, CR takes precedence over IR, and IR takes precedence over OR. The location branch is trained with Focal Loss.
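
A per-level sketch of assigning the three regions for a single ground-truth box (the σ1, σ2 values and the 1/-1/0 label encoding are illustrative choices, not the paper's exact settings):

```python
import torch

def location_targets(h, w, gt_box, stride, sigma1=0.2, sigma2=0.5):
    """Label one feature map: CR -> 1 (positive), IR -> -1 (ignored), OR -> 0 (negative).

    gt_box: (x1, y1, x2, y2) in image coordinates; sigma1 < sigma2.
    """
    labels = torch.zeros(h, w)                               # OR: negatives by default
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2 / stride, (y1 + y2) / 2 / stride  # box center on the feature map
    bw, bh = (x2 - x1) / stride, (y2 - y1) / stride          # box size on the feature map

    ys = torch.arange(h).float().view(-1, 1)
    xs = torch.arange(w).float().view(1, -1)

    def centered_region(scale):
        return ((xs - cx).abs() <= scale * bw / 2) & ((ys - cy).abs() <= scale * bh / 2)

    labels[centered_region(sigma2)] = -1                     # IR (ignored)
    labels[centered_region(sigma1)] = 1                      # CR overrides IR (priority rule)
    return labels
```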

Anchor shape targets

First, define the variable anchor $a_{wh} = \{(x_0, y_0, w, h)\,|\,w > 0, h > 0\}$; the optimization problem between it and the ground truth is:

$$vIoU(a_{wh}, gt) = \max_{w>0,\,h>0} IoU_{normal}(a_{wh}, gt) \tag{5}$$

Solving Formula 5 exactly for every position would be computationally expensive, so the paper approximates it by sampling: a set of common anchor shapes is enumerated (e.g. the 9 anchor shapes of RetinaNet), and for each position the shape with the largest IoU to the ground truth is taken as the solution of Formula 5. The larger the sampling range, the more accurate the resulting anchors, but the more extra computation is required. The shape targets are then trained with a bounded-IoU-style smooth L1 loss:

$$L_{shape} = L_1\Big(1 - \min\big(\tfrac{w}{w_g}, \tfrac{w_g}{w}\big)\Big) + L_1\Big(1 - \min\big(\tfrac{h}{h_g}, \tfrac{h_g}{h}\big)\Big) \tag{6}$$
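
A sketch of the sampling approximation of Formula 5: enumerate a few common (w, h) candidates centered at a position and keep the one with the highest IoU against the ground truth (the candidate shapes below are illustrative, in the spirit of RetinaNet's 3 scales × 3 aspect ratios, not the exact set used in the paper):

```python
def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-6)

def approx_best_shape(cx, cy, gt_box, candidate_shapes):
    """Approximate Formula 5: pick the candidate (w, h) with the largest IoU to the GT."""
    best_iou, best_wh = 0.0, None
    for w, h in candidate_shapes:
        anchor = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        iou = box_iou(anchor, gt_box)
        if iou > best_iou:
            best_iou, best_wh = iou, (w, h)
    return best_wh, best_iou

# Nine candidate shapes: three scales x three aspect ratios (illustrative values).
shapes = [(s * r, s / r) for s in (32, 64, 128) for r in (0.5 ** 0.5, 1.0, 2.0 ** 0.5)]
best_wh, viou = approx_best_shape(100, 100, (60, 80, 160, 140), shapes)
```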

The Use of High-quality Proposals


Embedding Guided Anchoring into RPN yields the enhanced GA-RPN. Comparing it with the original RPN, Figure 3 shows that:

  • GA-RPN produces more positive samples
  • GA-RPN produces more candidate boxes with high IoU

According to these results, GA-RPN is clearly better than RPN. However, simply replacing RPN with GA-RPN brings an AP improvement of less than 1 point. The paper observes that the prerequisite for exploiting high-quality proposals is that the distribution of the training samples must be adjusted to match the distribution of the proposals. Therefore, when using GA-RPN, higher positive/negative sample IoU thresholds (and fewer proposals) should be used so that the network focuses on the high-quality proposals. In addition, GA-RPN can also boost a trained two-stage detector by fine-tuning: given a trained detection model, replacing its RPN with GA-RPN and training for a few more iterations brings a good performance improvement.
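
As an illustration of which knobs move when training the second stage on GA-RPN proposals, here is an mmdetection-style config fragment with made-up numbers (fewer proposals kept, higher IoU thresholds for positives); the real values should be taken from the repository's GA configs:

```python
# Illustrative values only; consult the official GA configs for the real settings.
train_cfg = dict(
    rpn_proposal=dict(nms_pre=2000, max_per_img=300),  # keep fewer, higher-quality proposals
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.6,   # raised from the usual 0.5
            neg_iou_thr=0.6,
            min_pos_iou=0.6,
        ),
    ),
)
```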

Experiments


Comparison with various proposal generation methods.

Embedding effect comparison.

Fine-tuning comparison.

Conclusion


Guided Anchoring solves the problems of conventional manually preset anchors by generating anchors online and offers two schemes for embedding it into existing detectors, making it a fairly complete solution. The weak point is the generation of anchor shape targets: to keep the computation tractable, Formula 5 is only solved approximately by sampling, which is not ideal. Hopefully someone will propose a more accurate and efficient way to solve Formula 5.




