FoveaBox is an anchor-free detector from the same period as FCOS and FSAF, and like them its overall structure follows the DenseBox-plus-FPN strategy. The main differences are that FoveaBox predicts only from the central region of each target, regresses normalized offsets, and selects multiple FPN levels for training according to target size. All of these are worth learning from.

Source: Xiaofei algorithm engineering Notes public account

FoveaBox: Beyond Anchor-based Object Detector

  • Paper: Arxiv.org/abs/1904.03…
  • Code: Github.com/taokong/Fov…

Introduction


The paper argues that anchors are not necessarily the optimal way to search for targets. Inspired by the fovea of the human eye, where the center of the visual field has the highest acuity, it proposes the anchor-free object detector FoveaBox.

FoveaBox jointly predicts, for each valid location, the likelihood that it is a target center and the size of the corresponding target, and outputs category confidences together with the size information used to recover the target region. If you have read many anchor-free detection schemes, the implementation here may seem quite ordinary. Indeed, this paper dates from the early boom of anchor-free work; its overall idea is plain, and one that many researchers had probably also thought of. The following details deserve attention when reading:

  • Classification and regression are both predicted only from the central region of the target
  • The regression target is a normalized offset
  • A target can be assigned to multiple FPN levels during training
  • A feature alignment module is proposed, which uses the regression output to adjust the input features of the classification branch

FoveaBox


Object Occurrence Possibility

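Given a ground-truth box $(x_1, y_1, x_2, y_2)$, first map it onto feature pyramid level $P_l$:

$$x_1' = \frac{x_1}{s_l},\quad y_1' = \frac{y_1}{s_l},\quad x_2' = \frac{x_2}{s_l},\quad y_2' = \frac{y_2}{s_l},\quad c_x' = x_1' + 0.5(x_2' - x_1'),\quad c_y' = y_1' + 0.5(y_2' - y_1') \tag{1}$$

$s_l$ is the stride of the feature level relative to the input image. The positive sample region $R^{pos}$ is a shrunken version of the mapped box, scaled about its center:

$$x_1'' = c_x' - 0.5(x_2' - x_1')\sigma,\quad y_1'' = c_y' - 0.5(y_2' - y_1')\sigma,\quad x_2'' = c_x' + 0.5(x_2' - x_1')\sigma,\quad y_2'' = c_y' + 0.5(y_2' - y_1')\sigma \tag{2}$$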

$\sigma$ is a hand-set shrink factor. During training, feature points inside the positive region are labeled with the target's category, and all remaining points are negatives. The classification output of each pyramid level has shape $C \times H \times W$, where $C$ is the number of categories.
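The positive-region construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fovea_region` and `positive_cells` are hypothetical, `sigma=0.4` is just an example value, and the rule that a cell is positive when its integer index lies inside the shrunken box is one possible rounding convention rather than something stated above.

```python
import math


def fovea_region(box, stride, sigma=0.4):
    """Shrunken positive region of a GT box on one pyramid level.

    box is (x1, y1, x2, y2) in input-image pixels; the returned region
    is expressed in feature-map coordinates of that level.
    sigma=0.4 is an illustrative default (assumption).
    """
    x1, y1, x2, y2 = (v / stride for v in box)   # map box onto the level
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)    # center of the mapped box
    half_w = 0.5 * (x2 - x1) * sigma             # shrunken half width
    half_h = 0.5 * (y2 - y1) * sigma             # shrunken half height
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def positive_cells(region, height, width):
    """Feature cells (x, y) whose integer index lies inside the region.

    The rounding rule here is an assumption made for this sketch.
    """
    rx1, ry1, rx2, ry2 = region
    xs = range(max(0, math.ceil(rx1)), min(width, math.floor(rx2) + 1))
    ys = range(max(0, math.ceil(ry1)), min(height, math.floor(ry2) + 1))
    return [(x, y) for y in ys for x in xs]


region = fovea_region((64, 64, 192, 128), stride=8)
cells = positive_cells(region, height=50, width=50)
print(region, len(cells))
```

For a 128×64 box on a stride-8 level, the mapped box is 16×8 cells and the positive region covers only its central 7×3 block, which shows how few locations per target are actually labeled positive.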

Scale Assignment

The network has to predict target boundaries, but regressing them directly is unstable because target sizes span a very wide range. The sizes are therefore divided into several intervals, one per feature pyramid level, and each level is responsible for a specific size range. With the base size of levels $P_3$ to $P_7$ set to $r_l = 2^{l+2}$, the size range handled by level $l$ is:

$$[\, r_l / \eta,\ r_l \cdot \eta \,] \tag{3}$$

$\eta$ is a hand-set parameter that controls the size range regressed by each pyramid level. Targets whose size falls outside a level's range are ignored when training that level. A target may fall within the range of several levels, in which case it is trained on all of them. This multi-level training has the following benefits:

  • Adjacent feature pyramid layers usually have similar semantic information and can be optimized simultaneously.
  • The number of training samples for each level increases substantially, which makes training more stable.
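A minimal sketch of the level assignment, assuming the target size is measured as the square root of the box area and that a level is responsible for sizes between $r_l/\eta$ and $r_l \cdot \eta$. The function name `assigned_levels` and the value `eta=2.0` are hypothetical choices for illustration.

```python
import math


def assigned_levels(box, eta=2.0, levels=range(3, 8)):
    """Return the pyramid levels l (P3..P7) responsible for a GT box.

    Assumptions for this sketch: target size = sqrt(width * height),
    base size r_l = 2 ** (l + 2), valid range [r_l / eta, r_l * eta].
    """
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    chosen = []
    for l in levels:
        r_l = 2 ** (l + 2)                   # base size of level P_l
        if r_l / eta <= scale <= r_l * eta:  # inside this level's range
            chosen.append(l)
    return chosen


print(assigned_levels((0, 0, 100, 100)))  # ranges of P4 and P5 overlap here
print(assigned_levels((0, 0, 20, 20)))    # only P3 covers this size
print(assigned_levels((0, 0, 10, 10)))    # too small for every level
```

With `eta=2.0` the ranges of adjacent levels overlap (for example 32 to 128 for P4 and 64 to 256 for P5), so a 100×100 box is trained on both levels, while a 10×10 box is ignored by all of them.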

Box Prediction

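To predict the target size, FoveaBox directly regresses the normalized offsets from each positive location $(x, y)$ to the four target boundaries, expressed in log space, where $s_l$ is the stride and $r_l$ the base size of level $P_l$:

$$t_{x_1} = \log\frac{s_l(x + 0.5) - x_1}{r_l},\quad t_{y_1} = \log\frac{s_l(y + 0.5) - y_1}{r_l},\quad t_{x_2} = \log\frac{x_2 - s_l(x + 0.5)}{r_l},\quad t_{y_2} = \log\frac{y_2 - s_l(y + 0.5)}{r_l} \tag{4}$$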

Equation (4) first maps the feature-map location $(x, y)$ back to the input image, then computes the normalized offset to each box boundary. The regression branch is trained with Smooth L1 loss.
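The encoding and its inverse can be sketched as follows. The sketch assumes the log-space form of the normalized offset, a stride of $2^l$ for level $P_l$ (the usual FPN convention) and a base size of $2^{l+2}$; the helper names `encode_box` and `decode_box` are hypothetical.

```python
import math


def _project(x, y, level):
    """Map feature cell (x, y) on level P_l back to input-image pixels."""
    stride = 2 ** level              # assumed FPN stride of P_l
    return stride * (x + 0.5), stride * (y + 0.5)


def encode_box(box, x, y, level):
    """Regression targets (t_x1, t_y1, t_x2, t_y2) for one positive cell.

    The projected cell center must lie strictly inside the box, which
    holds for cells taken from the central positive region.
    """
    x1, y1, x2, y2 = box
    px, py = _project(x, y, level)
    r_l = 2 ** (level + 2)           # base size used for normalization
    return (math.log((px - x1) / r_l), math.log((py - y1) / r_l),
            math.log((x2 - px) / r_l), math.log((y2 - py) / r_l))


def decode_box(t, x, y, level):
    """Invert encode_box: recover (x1, y1, x2, y2) from predictions."""
    px, py = _project(x, y, level)
    r_l = 2 ** (level + 2)
    return (px - r_l * math.exp(t[0]), py - r_l * math.exp(t[1]),
            px + r_l * math.exp(t[2]), py + r_l * math.exp(t[3]))


targets = encode_box((64, 64, 192, 128), x=16, y=12, level=3)
print(targets)
print(decode_box(targets, x=16, y=12, level=3))
```

Dividing by the level's base size keeps the targets in a similar numeric range on every level, which is the point of normalizing the offsets before regression.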

Network Architecture

The network structure is shown in Figure 4. The backbone is a feature pyramid network, and each pyramid level is followed by a prediction head consisting of a classification branch and a regression branch. The paper uses a simple head structure; a more complex head can achieve better performance.

Feature Alignment

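The paper also proposes a feature alignment trick, which mainly modifies the prediction head: the output of the regression branch is used to adjust the input features of the classification branch. The structure is shown in Figure 7.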

Experiment


Comparison with SOTA methods.

Conclusion


FoveaBox is an anchor-free detector from the same period as FCOS and FSAF, and its overall structure likewise follows the DenseBox-plus-FPN strategy. The main differences are that FoveaBox predicts only from the central region of each target, regresses normalized offsets, and selects multiple FPN levels for training according to target size. Because the overall scheme is so plain and so similar to other anchor-free methods, it took the authors until now to get the paper accepted.




