Series navigation:

  • Target Detection (1) RCNN — the first deep-learning object detection work
  • Target Detection (2) SPP Net — sharing the convolution computation
  • Target Detection (3) Fast RCNN — making the RCNN model trainable end-to-end
  • Target Detection (4) Faster RCNN — an RPN network replaces Selective Search
  • Target Detection (5) YOLOv1 — opening the chapter of one-stage object detection

1. Motivation: an RPN network to replace Selective Search

In the same year, 2015, Ross Girshick's team introduced the Faster RCNN detection algorithm, another major work following Fast RCNN. The previous articles introduced RCNN, SPP Net, and Fast RCNN respectively, and this line of development is a gradual move toward end-to-end training: from RCNN's Selective Search + CNN + SVM classifier + bounding-box regression, to SPP Net, where the convolution only needs to be computed once, to Fast RCNN's Selective Search + CNN (with fully connected layers predicting classification and regression), both training time and inference time were greatly reduced. The whole development toward end-to-end can be represented as follows:

As mentioned in the previous article on Fast RCNN, the forward inference of the network takes only 0.32s, but Selective Search takes about 2s. The main contradiction, i.e. the current performance bottleneck, lies in the time-consuming Selective Search algorithm. To solve this, Faster RCNN proposes the RPN (Region Proposal Network) to replace the original Selective Search algorithm; everything else is the same as Fast RCNN, so Faster RCNN = RPN + Fast RCNN. The key to understanding Faster RCNN is understanding how the RPN mechanism works.

2. Principle of Faster RCNN

2.1 Pipeline

Faster RCNN = RPN + Fast RCNN, and the pipeline is mainly divided into the following steps (sketched in code after the list):

  • The image is fed into a CNN (the backbone) to obtain a feature map.
  • The RPN generates candidate boxes, and these candidates are projected onto the feature map from the first step to obtain the corresponding feature matrices.
  • Each feature matrix is scaled to a fixed 7 × 7 feature map by the ROI Pooling layer; finally, the feature map is flattened and passed through a series of fully connected layers to obtain the classification and regression results.
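To make the pipeline concrete, here is a minimal PyTorch-style sketch of the forward pass. The `rpn` and `head` modules, their call signatures, and the 1/16 spatial scale are illustrative assumptions, not the actual torchvision Faster RCNN API:

```python
import torch
import torchvision

# Any CNN that returns a feature map can serve as the backbone, e.g. VGG16 conv layers.
backbone = torchvision.models.vgg16(weights=None).features

def faster_rcnn_forward(image, rpn, head):
    """image: [1, 3, H, W] tensor; rpn and head are hypothetical modules."""
    feature_map = backbone(image)                       # 1. shared conv features
    proposals = rpn(feature_map, image.shape[-2:])      # 2. RPN candidate boxes [K, 4]
    # 3. project proposals onto the feature map and pool each to a fixed 7x7 size
    rois = torchvision.ops.roi_pool(feature_map, [proposals],
                                    output_size=(7, 7), spatial_scale=1 / 16)
    # 4. flatten and run the Fast RCNN head (fc layers -> classification + bbox regression)
    cls_scores, bbox_deltas = head(rois.flatten(start_dim=1))
    return cls_scores, bbox_deltas
```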

2.2 RPN network structure

2.2.1 Input and output of RPN

The following figure shows the RPN network structure. First, let's figure out what the input and output of the RPN are.

  • The input of the RPN is the feature matrix output by the backbone. If the backbone is ZFNet or AlexNet, there are 5 convolutional layers in front of the RPN; if it is VGG16, there are 13. With AlexNet, the input to the RPN is a 13 × 13 × 256 feature map.
  • The output of the RPN is 2k score predictions and 4k bbox offset predictions (k is the number of anchor boxes per position; see 2.2.3). Each candidate box gets two scores (foreground or background) and four bbox offset values (the offsets are defined the same way as in RCNN and Fast RCNN).

2.2.2 RPN Network details

The RPN takes the feature map extracted by the backbone (assume the input dimension is 13 × 13 × 256) and passes it in turn through a 3 × 3 convolution kernel (3 × 3 × 256) and a 1 × 1 convolution kernel (1 × 1 × 256), yielding the 256-d feature vector shown in the figure below. The first branch, the cls layer, convolves it with 2k 1 × 1 × 256 kernels and outputs 2k numbers representing the object scores of each region. The second branch, the reg layer, convolves it with 4k 1 × 1 × 256 kernels and outputs 4k numbers representing the offsets of x, y, w, and h.
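For intuition, here is a minimal PyTorch sketch of the RPN head (not torchvision's implementation; for simplicity it uses only the 3 × 3 convolution before the two sibling 1 × 1 branches), assuming a 256-channel feature map and k = 9:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head sketch: 3x3 conv -> two sibling 1x1 convs (cls and reg)."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)  # 2k objectness scores
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)  # 4k box offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# e.g. a 13x13x256 feature map from an AlexNet/ZFNet-like backbone
scores, deltas = RPNHead()(torch.randn(1, 256, 13, 13))
print(scores.shape, deltas.shape)  # [1, 18, 13, 13] and [1, 36, 13, 13]
```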

Two questions arise naturally: if these are offsets, offsets with respect to what? And what does k refer to? Section 2.2.3 explains both in detail.

2.2.3 The Anchor mechanism

The k above refers to the number of anchor boxes: each position (each sliding window) corresponds to k = 3 × 3 = 9 anchors on the original image, a combination of 3 scales and 3 aspect ratios (a generation sketch follows the list):

  • scale: [128, 256, 512]
  • ratio: 1:1, 1:2, 2:1
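As referenced above, here is a minimal sketch of how the 9 base anchors per position could be generated from these scales and ratios; the area is kept at roughly scale² while the aspect ratio varies (illustrative, not the original implementation):

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = 3*3 = 9 anchors centered at (0, 0) as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the area close to s*s while setting the aspect ratio h/w = r
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(base_anchors().round(1))  # 9 rows, one per scale/ratio combination
```

These 9 base anchors are then shifted to every sliding-window position, which is what produces the roughly W × H × 9 anchors per image discussed below.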

After taking the feature map as input, the RPN performs the 3 × 3 convolution, and each position on the feature map corresponds to a location in the original image. The reference anchor boxes are centered at the center of the sliding window (the convolution kernel), so every position convolved on the conv5 feature map automatically corresponds to 9 anchor boxes. In this way, the bounding-box offsets we fit are actually offsets relative to the anchor boxes.

Problem 1: too many anchors are generated; how is the impact on efficiency handled? For a 1000 × 600 × 3 image, about 60 × 40 × 9 (20,000+) anchors are generated. After ignoring the anchors that cross the image boundary, about 6,000 anchors are left. The candidate boxes generated by the RPN overlap heavily, so NMS is applied based on their cls scores, leaving about 2,000 candidate boxes per image, which is basically consistent with the number of candidates generated by Selective Search. The authors verify that reducing the number of candidate boxes does not reduce the accuracy of the model. To summarize, the author compresses the generated anchors in the following four ways (a filtering sketch follows the list):

  • Filter out anchors whose corresponding region on the original image crosses the image boundary.
  • Filter out (ignore) anchors whose IoU with the GT boxes falls in [0.3, 0.7].
  • After the classifier and regressor, suppress the generated region proposals with NMS.
  • After NMS, use the softmax classification score to select the top-N region proposals.
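The boundary filtering, NMS, and top-N steps can be sketched as below using torchvision's `nms` (the IoU-range filtering in the second bullet happens during training-label assignment, see 2.2.5; the thresholds here are the commonly cited values and are assumptions, not a specific implementation):

```python
import torch
from torchvision.ops import nms

def filter_proposals(boxes, scores, image_size, nms_thresh=0.7, top_n=2000):
    """boxes: [N, 4] (x1, y1, x2, y2) proposals decoded from the RPN offsets,
    scores: [N] foreground scores, image_size: (H, W)."""
    H, W = image_size
    # 1. drop proposals that cross the image boundary
    inside = (boxes[:, 0] >= 0) & (boxes[:, 1] >= 0) & (boxes[:, 2] <= W) & (boxes[:, 3] <= H)
    boxes, scores = boxes[inside], scores[inside]
    # 2. non-maximum suppression on the remaining proposals
    keep = nms(boxes, scores, iou_threshold=nms_thresh)
    # 3. keep the top-N proposals by classification score (nms returns indices sorted by score)
    return boxes[keep][:top_n]
```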

For ZFNet the receptive field is 171 and for VGG16 it is 228, while the largest anchor scale reaches 512. Why can a feature whose receptive field is only 171 learn to regress a box of size 512? The author's view is that although the receptive field is smaller than the largest anchor box and may cover only part of the object, partial information about the object is enough for the detection network to learn from, just as when people see only the upper body of a person without the lower body, they still know it is a person.

2.2.4 RPN loss function

Here, the loss function of RPN is the same as that of Fast RCNN, as shown below:
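For reference, the multi-task loss as defined in the Faster RCNN paper:

$$
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) \;+\; \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
$$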

Here p_i denotes the predicted probability that the i-th anchor is an object; the ground-truth label p_i* is 1 for a positive sample and 0 for a negative sample. t_i denotes the predicted bounding-box regression parameters of the i-th anchor, and t_i* the regression parameters of the GT box associated with the i-th anchor. N_cls is the number of samples in a mini-batch (256), and N_reg is the number of anchor locations (not the number of anchors), about 2,400. λ is the balance coefficient between the two losses; the original paper takes λ = 10. Observe that 1/N_cls is 1/256 while λ · 1/N_reg is about 1/240, which are nearly equal; therefore the official Torch implementation simply sets the coefficient of the regression loss to 1/N_cls, the same as for the classification loss.

L_cls is the multi-class cross-entropy loss (softmax + negative log-likelihood). Many technical blogs describe it as a binary cross-entropy loss, but that is not correct: if binary cross-entropy were used, the cls layer would only need to output k scores instead of 2k scores. L_reg is computed only when the candidate box is actually an object and, as in Fast RCNN, still uses the smooth L1 loss.
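A small PyTorch sketch of the distinction (shapes and channel layout are illustrative): with 2k scores per position the cls branch is trained with softmax cross-entropy over two classes per anchor, whereas with only k scores it would have to use sigmoid + binary cross-entropy.

```python
import torch
import torch.nn.functional as F

N, k, H, W = 1, 9, 13, 13
labels = torch.randint(0, 2, (N * H * W * k,))        # 0 = background, 1 = foreground

# 2k outputs per position -> softmax cross-entropy over two classes per anchor
cls_2k = torch.randn(N, 2 * k, H, W)
logits = cls_2k.permute(0, 2, 3, 1).reshape(-1, 2)    # [N*H*W*k, 2]
loss_softmax = F.cross_entropy(logits, labels)

# k outputs per position -> sigmoid + binary cross-entropy (the alternative formulation)
cls_k = torch.randn(N, k, H, W)
loss_bce = F.binary_cross_entropy_with_logits(cls_k.reshape(-1), labels.float())
```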

Here, the bbox regression is an offset regression: what the author fits is the offset between the predicted box and the anchor, pushed towards the offset between the ground-truth box and the anchor, so that the two are as equal as possible (P = G is equivalent to P − A = G − A). In the offset formulas below, x, y, w, h are the center coordinates, width, and height of the predicted box; x_a, y_a, w_a, h_a are those of the anchor on the original image; and x*, y*, w*, h* are those of the GT box associated with the anchor. A natural question: why not regress the predicted box directly onto the GT box? Consider two cases: a large 300 × 300 target whose predicted box deviates from the GT box by 30 pixels, and a small 15 × 15 target whose predicted box deviates by 2 pixels. If we regressed the predicted box directly onto the GT box, the loss would be heavily biased towards large targets, small targets would learn almost nothing, and the final detection performance on small targets would be poor. This is the motivation for offset regression.
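For completeness, the offset parameterization from the paper (the variables are the ones defined above):

$$
t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}
$$

$$
t_x^* = \frac{x^* - x_a}{w_a},\quad t_y^* = \frac{y^* - y_a}{h_a},\quad t_w^* = \log\frac{w^*}{w_a},\quad t_h^* = \log\frac{h^*}{h_a}
$$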

2.2.5 Division of positive and negative samples

  • 256 anchors are sampled from each image as a mini-batch, keeping the ratio of positive to negative samples as close to 1:1 as possible; if there are fewer than 128 positive samples, negatives are used to fill the batch.
  • Conditions for positive samples: the paper gives two conditions, and meeting either one makes an anchor positive (a labeling sketch follows this list).
    • The IoU between the anchor and a GT box is greater than 0.7.
    • The anchor has the highest IoU with some GT box.
  • Condition for negative samples: the IoU with every GT box is less than 0.3.
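A minimal NumPy sketch of this labeling and sampling, assuming an IoU matrix `iou` of shape [A, G] between the A anchors and the G GT boxes has already been computed (the helper names and structure are assumptions, not the paper's code):

```python
import numpy as np

def assign_labels(iou, pos_thresh=0.7, neg_thresh=0.3):
    """iou: [A, G] IoU between A anchors and G GT boxes.
    Returns labels: 1 = positive, 0 = negative, -1 = ignored."""
    labels = np.full(iou.shape[0], -1, dtype=np.int64)
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0       # negatives: IoU < 0.3 with every GT box
    labels[max_iou > pos_thresh] = 1       # positives, condition 1: IoU > 0.7
    labels[iou.argmax(axis=0)] = 1         # positives, condition 2: best anchor for each GT box
    return labels

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5):
    """Sample up to 128 positives; fill the rest of the 256 with negatives."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    keep_pos = np.random.permutation(pos)[:num_pos]
    keep_neg = np.random.permutation(neg)[:batch_size - num_pos]
    return keep_pos, keep_neg
```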

2.2.6 RPN training process

Nowadays the whole network is usually trained end-to-end with the joint RPN loss + Fast RCNN loss. Back in 2015 this was limited by various factors (perhaps networks were not as easy to train as they are now; Google proposed BN in the same year), so in the paper Faster RCNN is fine-tuned in stages:

  • The backbone parameters are initialized with an ImageNet pre-trained model and the RPN is trained.
  • The RPN-specific convolutional and fully connected layer parameters are then fixed; the front convolutional layers are re-initialized from the ImageNet-pre-trained classification model, and the proposal boxes generated by the RPN are used to train the Fast RCNN parameters.
  • The front convolutional layers trained by Fast RCNN are fixed, and the RPN-specific convolutional and fully connected layers are fine-tuned.
  • Likewise, with the front convolutional layers kept fixed, the fully connected layers of Fast RCNN are fine-tuned. In the end, the RPN and Fast RCNN share the front convolutional layers and form a unified network.

3. Effects and disadvantages of Faster RCNN

As shown in the figure below, compared with Fast RCNN, Faster RCNN improves both accuracy and speed. mAP@0.5 improves from 39.3% to 42.7%. Meanwhile, Fast RCNN needs 2.32s to run inference on one image, while Faster RCNN needs only 0.198s, a huge improvement in speed. To this day, Faster RCNN is still widely used in industry; in practice, two-stage detection is still more accurate than one-stage and anchor-free algorithms. However, Faster RCNN has poor real-time inference performance, generally below 10 fps, and can only be used in scenarios with low requirements on detection speed.

Reference:

  1. Arxiv.org/pdf/1506.01…
  2. www.bilibili.com/video/BV1af…