Like FCOS and FSAF, FoveaBox is an anchor-free detector from the same period, and its overall structure also follows the DenseBox-plus-FPN strategy. The main differences are that FoveaBox predicts only from the central region of each target and regresses normalized offsets. During training, a target can be assigned to several FPN layers according to its size. Read on for the details.





Source: Xiaofei’s algorithm engineering notes (WeChat official account)

FoveaBox: Beyound Anchor-Based Object Detection

  • Paper: https://arxiv.org/abs/1904.03797
  • Code: https://github.com/taokong/FoveaBox

Introduction


The paper argues that anchors are not necessarily the best way to search for targets. Inspired by the fovea of the human retina, where the center of the visual field has the highest acuity, it proposes FoveaBox, an anchor-free object detection method.

FoveaBox jointly predicts, for each valid location, the likelihood that it lies at the center of a target and the size of the corresponding target, and outputs category confidences along with the size information used to recover the target box. If you have seen many anchor-free detectors, this scheme may look quite ordinary, and indeed the paper belongs to the early wave of anchor-free work. The overall idea is very clean, and it is one that many researchers arrived at independently. Its main points are:

  • Classification and regression are both predicted from the central region of the target.
  • The regression branch predicts normalized offsets.
  • During training, a target can be assigned to multiple FPN layers at the same time.
  • A feature alignment module is proposed, which uses the regression output to adjust the input features of the classification branch.

FoveaBox


Object Occurrence Possibility

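Given a GT target box $(x_1, y_1, x_2, y_2)$, map it to the feature pyramid layer $P_l$:

$$x_1' = \frac{x_1}{s_l},\quad y_1' = \frac{y_1}{s_l},\quad x_2' = \frac{x_2}{s_l},\quad y_2' = \frac{y_2}{s_l},\quad c_x' = x_1' + 0.5(x_2' - x_1'),\quad c_y' = y_1' + 0.5(y_2' - y_1') \tag{1}$$

$s_l$ is the stride of the feature layer relative to the input. The positive sample area $R^{pos}$ is roughly a shrunken version of the mapped box:

$$x_1'' = c_x' - 0.5(x_2' - x_1')\sigma,\quad y_1'' = c_y' - 0.5(y_2' - y_1')\sigma,\quad x_2'' = c_x' + 0.5(x_2' - x_1')\sigma,\quad y_2'' = c_y' + 0.5(y_2' - y_1')\sigma \tag{2}$$

$\sigma$ is a manually set shrink factor. During training, feature points inside the positive sample area are labeled with the corresponding target category, and the remaining area is treated as the negative sample area. The classification output of each feature pyramid layer has shape $C\times H\times W$, where $C$ is the total number of categories.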
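The mapping and shrinking steps described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names are hypothetical, the default `sigma=0.4` is an assumed value, and testing each cell by its center `(x + 0.5, y + 0.5)` is an assumption about how feature points are matched to the region.

```python
def positive_region(gt_box, stride, sigma=0.4):
    """Map a GT box to a feature layer and shrink it around its center.

    sigma=0.4 is an assumed default, not a value stated in this article.
    """
    x1, y1, x2, y2 = (v / stride for v in gt_box)   # map to the feature layer
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)       # center of the mapped box
    half_w = 0.5 * (x2 - x1) * sigma                # shrunken half width
    half_h = 0.5 * (y2 - y1) * sigma                # shrunken half height
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h


def positive_cells(gt_box, stride, feat_h, feat_w, sigma=0.4):
    """List (x, y) cells whose centers lie inside the positive region.

    Using the cell center (x + 0.5, y + 0.5) is an assumption of this sketch.
    """
    px1, py1, px2, py2 = positive_region(gt_box, stride, sigma)
    cells = []
    for y in range(feat_h):
        for x in range(feat_w):
            if px1 <= x + 0.5 <= px2 and py1 <= y + 0.5 <= py2:
                cells.append((x, y))
    return cells
```

For example, a 128×128 box centered at (128, 128) on a stride-8 layer with $\sigma = 0.4$ marks a 6×6 block of cells as positive; every other cell on that layer is negative.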

Scale Assignment

The goal of the network is to predict target boundaries, but direct prediction is unstable because target sizes vary over a wide range. The paper therefore divides target sizes into several intervals, one per feature pyramid layer, so that each layer is responsible for predicting targets in a specific size range. Given the basic size $r_l = 2^{l+2}$ of feature pyramid layers $P_3$ to $P_7$ (from 32 at $P_3$ to 512 at $P_7$), layer $l$ is responsible for the following target size range:

$$[\,r_l / \eta,\ r_l \cdot \eta\,] \tag{3}$$

$\eta$ is a manually set parameter that controls the regression size range of each feature pyramid layer; training targets outside a layer's size range are ignored by that layer. A target may fall within the size ranges of several layers, in which case it is used to train all of them. Multi-layer training has the following benefits:

  • Adjacent feature pyramid layers usually have similar semantic information and can be optimized simultaneously.
  • The number of training samples for each layer increases significantly, making the training process more stable.
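To make the overlap between layers concrete, here is a small sketch of the assignment rule. It is a hedged illustration: `assigned_levels` is a hypothetical helper, measuring target size as $\sqrt{w \cdot h}$ is an assumption, and the default `eta=2.0` is an assumed setting rather than one given in this article.

```python
import math


def assigned_levels(gt_box, eta=2.0, levels=range(3, 8)):
    """Return the pyramid levels l whose size range contains the target.

    Assumptions of this sketch: target size is sqrt(w * h), and eta=2.0.
    """
    x1, y1, x2, y2 = gt_box
    size = math.sqrt((x2 - x1) * (y2 - y1))
    assigned = []
    for l in levels:
        r_l = 2 ** (l + 2)                 # basic size: 32 for P3 ... 512 for P7
        if r_l / eta <= size <= r_l * eta:
            assigned.append(l)             # this layer trains on the target
    return assigned
```

With these assumptions, a 100×100 target falls into the ranges of both $P_4$ ([32, 128]) and $P_5$ ([64, 256]), so it contributes training samples to two layers, while a 10×10 target lies outside every range and is ignored.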

Box Prediction

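To predict the target size, FoveaBox directly computes the normalized offsets from a location $(x, y)$ in the positive sample region to the four boundaries of the target box $(x_1, y_1, x_2, y_2)$:

$$t_{x_1} = \log\frac{s_l(x + 0.5) - x_1}{r_l},\quad t_{y_1} = \log\frac{s_l(y + 0.5) - y_1}{r_l},\quad t_{x_2} = \log\frac{x_2 - s_l(x + 0.5)}{r_l},\quad t_{y_2} = \log\frac{y_2 - s_l(y + 0.5)}{r_l} \tag{4}$$

Formula 4 first maps the feature pyramid location back to the input image, then computes the offsets to the box boundaries, normalized by the layer's basic size $r_l$. The L1 loss is used for training.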
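The encoding of a regression target, and its inverse at inference time, can be sketched as below. This is an illustrative reading of the normalized-offset idea, not the reference implementation: `encode_box` and `decode_box` are hypothetical names, and the log-space form and the `+ 0.5` cell-center shift are assumptions of the sketch.

```python
import math


def encode_box(x, y, gt_box, stride, r_l):
    """Normalized log offsets from feature cell (x, y) to the GT box sides.

    Assumes the cell lies inside the box (true for positive cells), so every
    distance is positive and the logarithm is defined.
    """
    x1, y1, x2, y2 = gt_box
    px, py = stride * (x + 0.5), stride * (y + 0.5)   # back to image coordinates
    return (math.log((px - x1) / r_l), math.log((py - y1) / r_l),
            math.log((x2 - px) / r_l), math.log((y2 - py) / r_l))


def decode_box(x, y, t, stride, r_l):
    """Invert encode_box: recover (x1, y1, x2, y2) from predicted offsets."""
    tx1, ty1, tx2, ty2 = t
    px, py = stride * (x + 0.5), stride * (y + 0.5)
    return (px - r_l * math.exp(tx1), py - r_l * math.exp(ty1),
            px + r_l * math.exp(tx2), py + r_l * math.exp(ty2))
```

Dividing by $r_l$ keeps the targets in a similar numeric range on every layer, since each layer only handles boxes whose size is close to its own basic size.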

Network Architecture

The network structure is shown in Figure 4. The backbone is a feature pyramid network, and each pyramid layer is followed by a prediction head with a classification branch and a regression branch. The paper uses a relatively simple head structure; a more complex head can achieve better performance.
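As a rough picture of what the two branches produce, the sketch below lists the output shapes per pyramid layer. It is only a shape calculation under stated assumptions: `head_output_shapes` is a hypothetical helper, the strides 8 to 128 for $P_3$ to $P_7$ are the usual FPN setting and are assumed here, rounding up with `ceil` is an assumption about padding, and the regression branch is assumed to emit 4 channels (one per box side).

```python
import math


def head_output_shapes(img_h, img_w, num_classes, strides=(8, 16, 32, 64, 128)):
    """Per-layer output shapes of the classification and regression branches.

    Assumed: strides 8..128 for P3..P7 and 4 regression channels per location.
    """
    shapes = {}
    for level, s in enumerate(strides, start=3):
        h, w = math.ceil(img_h / s), math.ceil(img_w / s)
        shapes[f"P{level}"] = {
            "cls": (num_classes, h, w),   # one score map per category
            "reg": (4, h, w),             # offsets to the four box sides
        }
    return shapes
```

Calling `head_output_shapes(512, 512, 80)` gives one `cls` and one `reg` shape for each of the five layers, which makes it easy to see how few locations the coarse layers contribute compared with $P_3$.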

Feature Alignment

The paper also proposes a feature alignment trick, which mainly modifies the prediction head: the output of the regression branch is used to adjust the input features of the classification branch. The structure is shown in Figure 7.

Experiment


FoveaBox is compared with state-of-the-art (SOTA) methods.

Conclusion


FoveaBox is an anchor-free detector from the same period as FCOS and FSAF, and its overall structure likewise follows the DenseBox-plus-FPN strategy. The main differences are that FoveaBox predicts only from the central region of each target and regresses normalized offsets, and that a target can be assigned to several FPN layers according to its size during training. Because the overall design is very clean and resembles other anchor-free methods, the paper was not published until recently, which cannot have been easy for the authors.




