Focal Loss for Dense Object Detection

Address: arxiv.org/abs/1708.02…

Code address:

  • Official Github: github.com/facebookres…
  • Tensorflow: github.com/tensorflow/…
  • Keras: github.com/fizyr/keras…
  • PyTorch: github.com/yhenon/pyto…

At present, two-stage methods dominate object detection in accuracy. One-stage detectors, which densely sample candidate locations, should in principle be faster and simpler, yet their accuracy lags behind. The authors argue that the main reason is the extreme imbalance between positive and negative samples (foreground and background) during training. They therefore propose a new loss, called Focal Loss, which modifies the standard cross-entropy loss to down-weight well-classified samples, so that training focuses on a small number of hard samples instead of being overwhelmed by a large number of easy ones.

In addition, a new detection framework called RetinaNet, built on an FPN backbone, is proposed. Experiments show that its accuracy surpasses the best two-stage detectors while its speed matches that of existing one-stage detectors.

Contact information:

Github:github.com/ccc013/AI_a…

Zhihu column: Machine Learning and Computer Vision, AI paper notes

1. Introduction

Mainstream object detection methods currently fall into two categories, two-stage and one-stage:

  • Two-stage methods generate a sparse set of candidate Bboxes in the first stage, then classify each Bbox in the second stage to decide whether it is a target object or background. Two-stage detectors are currently the most accurate object detectors.
  • One-stage methods process a dense set of candidate Bboxes directly, performing detection and classification at the same time. They are very fast, but their accuracy is still 10%-40% lower than that of the best two-stage methods.

According to the authors, the main reason one-stage detectors fall short is that they do not handle the class imbalance problem well:

  • Two-stage detectors filter out most background samples in the first stage, when candidate Bboxes are generated by a proposal mechanism such as RPN. In the second stage, a biased sampling strategy, e.g. a fixed foreground-to-background ratio (1:3) or online hard example mining (OHEM), keeps the class proportions reasonable;
  • One-stage detectors must process a much larger set of candidate Bboxes (over 100k, densely covering the image). Similar sampling strategies can be applied, but they are inefficient because the easily classified background samples still dominate training. This is a classic problem in object detection, usually tackled by bootstrapping or hard example mining;
  • The class imbalance problem exists for both one-stage and two-stage detectors and causes two problems:
    • Training is inefficient, because the large number of easily classified negative samples (background) provides no useful learning signal;
    • En masse, these easy negatives can overwhelm training and lead to degenerate models.

The authors want to combine the advantages of one-stage and two-stage detectors, i.e., to be both fast and accurate, so they propose a new loss function called Focal Loss. It dynamically scales the cross-entropy loss: the scaling factor depends on how easy a sample is to classify, as shown in the figure below:

Intuitively, this scaling factor automatically down-weights easily classified samples, so that the model focuses on a small number of hard samples during training.

To verify the effectiveness of this loss, the authors design a one-stage detection framework called RetinaNet, which uses an FPN (built on ResNet) as the backbone network. The experimental results are shown below: RetinaNet achieves a good balance between speed and accuracy, running faster than two-stage detectors while being more accurate than previous one-stage detectors.

In addition, the authors emphasize that RetinaNet's results come mainly from the improved loss; the network structure itself contains no particular innovation.

2. Focal Loss

Focal Loss is a loss function designed to address the extreme class imbalance (e.g., a positive-to-negative sample ratio of 1:1000) encountered by one-stage detectors, and it is a modification of the standard cross-entropy loss.

First, the standard cross entropy function formula is as follows:


$CE(p, y) = CE(p_t) = -\log(p_t)$

Here y is the ground-truth label of the sample; taking binary classification as an example, y is 1 or -1, and p is the probability predicted by the model, in the range [0, 1]. Then $p_t$ is defined as:

$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$

In Figure 1 in the Introduction, the top blue curve shows the cross-entropy loss for different samples. Even for easily classified samples, i.e., samples with $p_t$ well above 0.5, the cross-entropy loss is non-trivial. Summed over a large number of such easy samples, these small losses can overwhelm the signal from the small number of hard samples.
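A rough back-of-the-envelope illustration (the numbers here are chosen for illustration, not taken from the paper): an easy background anchor with $p_t = 0.9$ contributes $-\log(0.9) \approx 0.105$ to the loss, while a hard foreground anchor with $p_t = 0.1$ contributes $-\log(0.1) \approx 2.3$. With roughly $10^5$ easy negatives per image against only a handful of hard positives, the easy samples contribute on the order of $10^4$ total loss, dwarfing the few units of loss coming from the hard samples.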

2.1 Balanced Cross Entropy

A balanced cross-entropy loss had been proposed before, as follows:

$CE(p_t) = -\alpha_t \log(p_t)$

Here a weighting factor $\alpha \in [0, 1]$ is introduced ($\alpha_t$ is defined analogously to $p_t$), typically set inversely proportional to the class frequency, so the rarer class receives a larger loss weight. The authors use this $\alpha$-balanced loss as the baseline that Focal Loss is compared against.

However, the problem with this loss is that the weighting factor only distinguishes positive from negative samples; it cannot distinguish easy samples from hard ones. Focal Loss is proposed precisely to improve on this point.

2.2 Focal Loss

The Focal Loss is computed as follows:

$FL(p_t) = -(1 - p_t)^\gamma \log(p_t)$

Here a hyperparameter $\gamma \ge 0$ is added, which the authors call the focusing parameter. In their experiments, $\gamma = 2$ works best; when $\gamma = 0$, the loss reduces to the standard cross-entropy.

Focal Loss has two properties:

  1. When a sample is misclassified and $p_t$ is small, the modulating factor $(1 - p_t)^\gamma$ is close to 1 and the loss is barely affected. As $p_t$ approaches 1, the factor tends to 0, so the loss of well-classified samples is down-weighted.
  2. The focusing parameter $\gamma$ smoothly adjusts the rate at which easy samples are down-weighted; increasing $\gamma$ strengthens the effect of the modulating factor, and experiments show $\gamma = 2$ works best. Intuitively, the modulating factor reduces the loss contribution of easy samples and extends the range of $p_t$ over which a sample receives low loss.

In practice, Focal Loss is combined with the $\alpha$-balanced form, so that the weight of positive versus negative samples can be adjusted while the modulating factor controls the weight of hard versus easy samples:

$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$

Experiments show that this $\alpha$-balanced form gives slightly better accuracy than the form without $\alpha$.
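A minimal sketch of the $\alpha$-balanced Focal Loss in PyTorch (this is an assumption of these notes, not the authors' released code); it follows the formula above and uses `binary_cross_entropy_with_logits` to obtain $-\log(p_t)$ in a numerically stable way:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for binary targets (0/1), summed over elements.

    logits:  raw predictions, any shape
    targets: same shape as logits, values in {0, 1}
    """
    # -log(p_t), computed stably from the logits
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # p_t as defined above
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # alpha_t as defined above
    loss = alpha_t * (1 - p_t) ** gamma * ce                 # FL(p_t)
    return loss.sum()

# toy usage: 5 anchors, 3 classes, one positive label
logits = torch.randn(5, 3)
targets = torch.zeros(5, 3)
targets[0, 1] = 1.0
print(focal_loss(logits, targets))
```

In the paper, the total focal loss of an image is the sum over all anchors normalized by the number of anchors assigned to a ground-truth box; that normalization is left to the caller in this sketch.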

2.3 Class Imbalance and Model Initialization

For binary classification, models are by default initialized so that both classes are output with equal probability. Under extreme class imbalance, such an initialization lets the loss of the dominant class occupy most of the total loss, which makes early training unstable.

To address this, the authors introduce a prior $\pi$ for the value of $p$ that the model estimates at the start of training for the rare class (the foreground), and set it to a small value (e.g., $\pi = 0.01$). Experiments show that, whether the standard cross-entropy or Focal Loss is used, this improves training stability under severe class imbalance.
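Concretely, the prior is realized by initializing the bias of the final classification layer so that, assuming the corresponding weights start near zero, the sigmoid output equals $\pi$:

$b = -\log\left(\frac{1 - \pi}{\pi}\right)$

For $\pi = 0.01$ this gives $b = -\log(99) \approx -4.6$, i.e. the model starts out predicting "background" with high confidence everywhere.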

2.4 Class Imbalance and Two-stage Detectors

Two-stage detectors generally use the standard cross-entropy loss rather than balanced cross-entropy or Focal Loss. They handle class imbalance mainly through two mechanisms:

  1. A two-stage cascade;
  2. Biased mini-batch sampling.

In the first stage, the nearly unlimited set of candidate Bboxes is reduced to 1-2k, which removes the vast majority of negative samples in a non-random way. In the second stage, biased sampling is used to build each mini-batch, e.g. with a 1:3 ratio of positive to negative samples; this ratio plays the same role as the $\alpha$ factor in the balanced cross-entropy.


3. RetinaNet Detector

The overall structure of the RetinaNet is shown below:

RetinaNet is a one-stage detection framework consisting of a backbone network and two task-specific sub-networks: the backbone computes a feature map of the input image, and the two sub-networks respectively perform classification and candidate-box regression on the backbone's output.

Feature Pyramid Network Backbone

The backbone of RetinaNet is an FPN, whose structure is shown in (a)-(b) of the figure above. The FPN here is built on ResNet; it is a top-down convolutional network with lateral connections, which produces multi-scale features from a single-resolution input.

The paper builds a pyramid from $P_3$ to $P_7$, where $l$ denotes the pyramid level: $P_l$ has resolution $1/2^l$ of the input resolution, and every pyramid level has C = 256 channels.
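As a concrete illustration (the input size here is an example chosen for these notes, not from the paper): levels $P_3$ through $P_7$ correspond to strides 8, 16, 32, 64 and 128, so for a 640×640 input the feature maps are roughly 80×80, 40×40, 20×20, 10×10 and 5×5.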

Anchors

Translation-invariant anchor boxes are used. The anchors have areas from $32^2$ to $512^2$ on pyramid levels $P_3$ to $P_7$ respectively. At each pyramid level, three aspect ratios {1:2, 1:1, 2:1} are used together with three scales $\{2^0, 2^{1/3}, 2^{2/3}\}$ of the base size. This gives A = 9 anchors per position per pyramid level, and the anchors cover a scale range of 32 to 813 pixels with respect to the network's input image.

Each anchor is assigned a length-K one-hot vector of classification targets (K is the number of classes) and a 4-vector of box-regression targets. An anchor is assigned to a ground-truth object if their IoU (intersection-over-union) is at least 0.5, to background if the IoU is in [0, 0.4), and it is ignored during training if the IoU falls in [0.4, 0.5).

The box-regression target is the offset between each anchor and its assigned object box; if an anchor has no assigned object, its regression targets are ignored.
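A small sketch of how the anchor shapes per level could be enumerated under the settings above (an illustration based on the description; this is not the authors' anchor code, and the height/width convention for the ratios is an assumption):

```python
import math

# base anchor sizes 32..512 (areas 32^2..512^2) for pyramid levels P3..P7
base_sizes = {3: 32, 4: 64, 5: 128, 6: 256, 7: 512}
aspect_ratios = [0.5, 1.0, 2.0]             # 1:2, 1:1, 2:1
scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]

def anchor_shapes(level):
    """Return the A = 9 (width, height) pairs for one pyramid level."""
    base = base_sizes[level]
    shapes = []
    for scale in scales:
        area = (base * scale) ** 2          # area is fixed per (level, scale)
        for ratio in aspect_ratios:         # ratio = height / width
            w = math.sqrt(area / ratio)
            h = w * ratio
            shapes.append((w, h))
    return shapes

print(len(anchor_shapes(3)))                # 9 anchors per position on P3
```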

Classification Subnets

The classification subnet predicts, at each spatial position, the probability of object presence for each of the A anchors and K object classes. The subnet is a small FCN attached to every FPN level, and its parameters are shared across all pyramid levels.

The design of this subnet is simple. Given an input feature map with C channels from a pyramid level, the subnet applies four 3×3 convolutional layers, each with C filters and followed by a ReLU activation, then a final 3×3 convolutional layer with K×A filters. A sigmoid activation then outputs KA binary predictions per spatial position. In the experiments, C = 256 and A = 9.

Compared with the RPN, the classification subnet used here is deeper, uses only 3×3 convolutions, and does not share parameters with the box-regression subnet described below. The authors found that these higher-level design decisions matter more than the specific values of the hyperparameters.

Box Regression Subnet

In parallel with the classification subnet, a second small FCN is attached to every FPN level in order to regress the offset from each anchor box to a nearby ground-truth object, if one exists.

Its design mirrors that of the classification subnet, except that it terminates in 4A linear outputs per spatial location. For each of the A anchors at a position, these 4 outputs predict the relative offset between the anchor and the ground-truth box. Also, unlike most prior work, a class-agnostic bounding box regressor is used, which has fewer parameters and is equally effective.

The classification subnet and the box-regression subnet share the same network structure but use separate parameters.
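A minimal PyTorch sketch of the two heads as described above (the module layout and names are assumptions of these notes; the paper only specifies the stack of four 3×3, C-filter, ReLU convolutions followed by a K×A or 4A output convolution):

```python
import torch
import torch.nn as nn

def make_head(in_channels, out_channels, num_convs=4):
    """4 x (3x3 conv, C filters, ReLU) followed by a 3x3 output conv."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

C, A, K = 256, 9, 80                        # channels, anchors per position, classes
cls_head = make_head(C, K * A)              # classification subnet (sigmoid applied in the loss)
box_head = make_head(C, 4 * A)              # box-regression subnet, separate parameters

# the same two heads are applied to every FPN level P3..P7
p3 = torch.randn(1, C, 80, 80)              # e.g. a stride-8 feature map of a 640x640 input
print(cls_head(p3).shape)                   # torch.Size([1, 720, 80, 80]) -> K*A per position
print(box_head(p3).shape)                   # torch.Size([1, 36, 80, 80])  -> 4*A per position
```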

4. Experiments

The experimental results are as follows:

A) Results of adding the weighting factor $\alpha$ to the standard cross-entropy loss, where $\alpha = 0.5$ corresponds to the ordinary cross-entropy. The table shows that $\alpha = 0.75$ works best, improving AP by 0.9;

B) Comparison of $\gamma$ and $\alpha$. AP increases significantly as $\gamma$ grows, and $\gamma = 2$ gives the best result;

C) Comparison of the effect of anchor scales and aspect ratios; the best result is obtained when 2 scales and 3 aspect ratios are selected, respectively;

D) Comparison with OHEM: the best OHEM variant reaches AP = 32.8, while Focal Loss reaches AP = 36, an improvement of 3.2 AP. In addition, OHEM 1:3 means the mini-batch built by OHEM keeps a 1:3 ratio of positive to negative samples, but this does not improve AP;

E) Comparison of AP and speed for different network depths and input image sizes.

5. Conclusion

The authors argue that the fundamental reason one-stage detectors have not surpassed two-stage detectors in accuracy is the extreme class imbalance. To address it, they propose Focal Loss, a modification of the standard cross-entropy loss that makes the network focus on the hard samples during training. The method is simple but effective, and a fully convolutional one-stage detection framework is designed to verify it. Experiments show that it achieves state-of-the-art accuracy and speed.