Object Detection · RCNN paper interpretation

Please credit the author when reposting: Dream tea

Object detection, as the name implies, means detecting target objects in an image, and specifically finding where each object is located. The most common benchmark is the PASCAL VOC series. From 2010 to 2012 object detection progressed slowly, with no major advance after DPM. It was not until CVPR 2014 that Ross Girshick (RBG) brought CNNs, which were red-hot at the time, into detection and raised the mAP on PASCAL VOC to 53.7%. RBG's CVPR 2014 paper:

Rich feature hierarchies for accurate object detection and semantic segmentation

Key insights

A CNN can be applied to local regions of an image to decide whether each region contains a target object
When labeled data are scarce, a model can be pre-trained on another data set and then fine-tuned

RCNN Overview

The input image
About 2k region proposals are generated by Selective Search

A defining feature of the detection problem is that we need to know not only whether an image contains a target object but also where that object is. There are several ways to do this. One is to treat localization as a regression problem, as in [38], but the accuracy is very low. Another is the sliding-window approach, which cuts the image into many small windows and analyzes each one. For a CNN this is difficult: after several convolution and pooling layers the receptive field of each unit becomes very large. RCNN uses a network with five convolutional layers, whose top units have a receptive field of 195×195 pixels in the input image (and a stride of 32×32 pixels), so precise localization with a sliding window is hard to guarantee.
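The 195×195 figure is the receptive field of a single pool5 unit. A minimal sketch of the arithmetic follows, assuming the standard AlexNet kernel sizes and strides (the layer table below comes from the AlexNet architecture, not from this article, and the function name is made up for illustration):

```python
# (name, kernel_size, stride) for every layer up to pool5.
# Assumption: standard AlexNet settings; padding does not affect the result.
ALEXNET_LAYERS = [
    ("conv1", 11, 4),
    ("pool1", 3, 2),
    ("conv2", 5, 1),
    ("pool2", 3, 2),
    ("conv3", 3, 1),
    ("conv4", 3, 1),
    ("conv5", 3, 1),
    ("pool5", 3, 2),
]


def receptive_field(layers):
    """Return (receptive_field, stride) in input pixels after the last layer."""
    rf, jump = 1, 1  # one input pixel sees itself; neighbours are 1 px apart
    for _name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap widens the field by one jump
        jump *= stride             # striding moves neighbouring units further apart
    return rf, jump


rf, stride = receptive_field(ALEXNET_LAYERS)
print(rf, stride)  # 195 32
```

With a step of 32 pixels between neighbouring pool5 units and a field of view of 195 pixels each, a sliding window over pool5 can only place boxes very coarsely.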

Selective Search is a good way to narrow down candidate regions. The image is first segmented into many small pieces, and adjacent pieces are then merged according to basic cues such as color histogram, gradient histogram, area, and position, so as to pick out regions of the picture that carry some semantic meaning. More background on recognition with regions can be found in this paper: Recognition Using Regions (CVPR2009).

Each region proposal is passed through the CNN to extract features
An SVM is trained for each class and used to decide which class a region proposal belongs to
NMS is applied to the scored region proposals of each class, keeping the best ones and suppressing those that overlap them
A bounding-box regressor refines the predicted positions to further improve accuracy

Non-maximum suppression (NMS), as its name implies, suppresses elements that are not maxima and keeps the local maxima. "Local" here refers to a neighborhood, which has two parameters: its dimension and its size. The general NMS algorithm is not discussed here; in object detection it is used to keep the highest-scoring window. For example, in pedestrian detection, features are extracted from sliding windows and classified, so each window gets a score. However, sliding windows produce many windows that contain or largely overlap other windows. NMS is then used to keep the window with the highest score in each neighborhood (the one most likely to be a pedestrian) and suppress the lower-scoring windows. (Translated from the Zhihu column: Notes on Machine Learning by Xiaolei)
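The window-suppression idea described above can be sketched as a greedy loop. This is a minimal illustration, not the paper's code: the (x1, y1, x2, y2) box format, the function names, and the 0.3 default threshold are all assumptions made for the example. In a detector it would be run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 14, 112, 108), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.90, so it is suppressed; the third box is far away and survives.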

Training

As the overview above shows, two main parts need to be trained: the CNN shared by all classes and a separate SVM for each class.

Network Structure

RCNN tried two CNN architectures. One is AlexNet, from ImageNet Classification with Deep Convolutional Neural Networks, published by Hinton and his team at NIPS2012

It has five convolutional layers followed by three fully connected layers. The input is a 224×224 image and the output is a 1000-way class prediction.

The other is VGG16 (Very Deep Convolutional Networks for Large-Scale Image Recognition)

Here are the results of the two networks:

VGG16 is more accurate, but it requires much more computation and is considerably slower than AlexNet. For convenience, the following analysis is based on AlexNet.

Supervised Pretraining

The network is first pre-trained on ImageNet. The input is an image and the output is the category of the object it contains, without any position, because the ImageNet classification data carry no bounding box information. After AlexNet has been trained to roughly the classification accuracy reported by Hinton and his team, it is fine-tuned with detection data.

Domain Specific Fine Tuning

A CNN pre-trained only on ImageNet will not perform well enough on PASCAL VOC, so it is next fine-tuned on PASCAL VOC detection data. ImageNet classification has 1000 categories, whereas VOC has 20 and the ILSVRC2013 detection task has 200. The last fully connected classification layer is therefore replaced by a fully connected layer with N+1 outputs, where N is the number of classes in the target task and the extra one is a background class (21 outputs for VOC). The input data are the region proposals obtained by Selective Search, labeled by matching them against the ground-truth bounding boxes.

Fine-tuning requires deciding which class each region proposal belongs to. The VOC training set provides ground-truth bounding boxes with class labels. RBG and colleagues compute the overlap (IoU) between each region proposal and the ground-truth bounding boxes in the training set. If the overlap with a bounding box reaches the threshold (0.5, chosen experimentally), the region proposal takes the class of that bounding box and is used as a positive input for fine-tuning; otherwise it is treated as background.
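The labeling rule for fine-tuning can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples, class ids are positive integers, and `BACKGROUND = 0` and the function names are hypothetical choices made for the example.

```python
BACKGROUND = 0  # assumed id for the extra background class


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, gt_boxes, gt_labels, threshold=0.5):
    """Give a proposal the class of its best-overlapping ground-truth box,
    or BACKGROUND when the best overlap is below the threshold."""
    best_iou, best_label = 0.0, BACKGROUND
    for box, label in zip(gt_boxes, gt_labels):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_label = overlap, label
    return best_label if best_iou >= threshold else BACKGROUND


gt_boxes, gt_labels = [(0, 0, 100, 100)], [7]
print(finetune_label((10, 10, 100, 100), gt_boxes, gt_labels))  # 7  (IoU 0.81)
print(finetune_label((60, 60, 160, 160), gt_boxes, gt_labels))  # 0  (IoU about 0.09)
```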

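However, these inputs vary in size and must be resized to the fixed input size the network expects (the AlexNet paper states 224×224; the RCNN implementation uses 227×227). Several preprocessing methods are discussed in Appendix A.

A. The original proposal at its actual scale
B. Isotropic scaling, with the empty parts filled with the surrounding original image
C. Isotropic scaling, with the empty parts filled with the mean value
D. Direct (anisotropic) scaling to the input size
The paper reports that direct warping (D), with 16 pixels of surrounding context added, works best, outperforming the alternatives by 3 to 5 mAP points. Many other preprocessing methods are possible, for example filling the empty part by replicating the region.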

Training used mini-batch SGD with an initial learning rate of 0.001 (1/10 of the pre-training rate). Each mini-batch contained 32 positive samples (drawn from all classes together) and 96 negative (background) samples.
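The batch composition can be sketched as below. This is only an illustration of the 32 + 96 split: the function name, the list-of-samples representation, and the `rng` parameter are assumptions made for the example, not details from the paper.

```python
import random


def sample_minibatch(positives, negatives, n_pos=32, n_neg=96, rng=random):
    """Draw one SGD mini-batch: n_pos positive windows (all classes mixed)
    and n_neg background windows, without replacement, then shuffle."""
    batch = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(batch)
    return batch


# Toy data: in practice each item would be a warped region plus its label.
positives = [("pos", i) for i in range(500)]
negatives = [("neg", i) for i in range(5000)]

batch = sample_minibatch(positives, negatives, rng=random.Random(0))
print(len(batch))                                  # 128
print(sum(1 for kind, _ in batch if kind == "pos"))  # 32
```

Background windows vastly outnumber positives among the proposals, so fixing the ratio at 1:3 deliberately biases each batch toward the rare positive windows.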

Object category classifiers

Each class has its own binary linear SVM classifier (a very simple SVM, with no complex kernel). Its input is the output of the second-to-last layer of the CNN, a vector of length 4096. From these feature vectors and their labels the SVM learns its weights, that is, which components of the feature vector are most useful for distinguishing the current class.

The training data for the SVMs differ from the data used to fine-tune the CNN. Only the ground-truth bounding boxes of the PASCAL VOC training set are used as positive samples, and regions whose overlap with the ground-truth boxes of the class is less than 0.3 are used as background (negative samples). This 0.3 threshold was also chosen by tuning. Because there are so many negative samples, the paper uses hard negative mining to select the negatives that are difficult to classify. As a result, positive and negative samples are defined differently for the SVMs and for the CNN, and the SVMs have far fewer positive samples (proposals that merely overlap a bounding box by more than 0.5 are not used as positives).
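The SVM sample definition and one round of hard negative mining can be sketched as follows. This is a hedged illustration: the function names and the `score_fn` callable are hypothetical, and selecting negatives that score above a margin of -1 is an assumption based on the usual hinge-loss formulation, since the article does not spell out the mining details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def svm_label(proposal, class_gt_boxes, neg_threshold=0.3):
    """Label a proposal for one class's SVM: -1 (negative) or None (ignored).
    Positives are the ground-truth boxes themselves, never proposals."""
    best = max((iou(proposal, gt) for gt in class_gt_boxes), default=0.0)
    return -1 if best < neg_threshold else None


def mine_hard_negatives(negatives, score_fn, margin=-1.0):
    """Keep the negatives that the current SVM scores above the margin,
    i.e. the ones it misclassifies or is not confident about."""
    return [x for x in negatives if score_fn(x) > margin]


class_gt_boxes = [(0, 0, 100, 100)]
print(svm_label((60, 60, 160, 160), class_gt_boxes))  # -1   (IoU about 0.09)
print(svm_label((10, 10, 100, 100), class_gt_boxes))  # None (IoU 0.81, ignored)

# Toy scores stand in for SVM outputs on negative feature vectors.
print(mine_hard_negatives([-2.0, -0.5, 0.3], score_fn=lambda s: s))  # [-0.5, 0.3]
```

Note the gap between the two definitions: a proposal with IoU 0.81 is a positive for fine-tuning but is simply left out of SVM training.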

As explained in Appendix B, RBG and colleagues initially fine-tuned the CNN with the SVM's definition of positive and negative samples, but the results were very poor. An SVM can do well with a small number of samples, but a CNN cannot, so more data are needed for fine-tuning. Treating region proposals with an overlap greater than 0.5 as positive samples yields about 30 times more positive data. The cost of adding these inexact samples is that the detected positions become less accurate, because samples whose positions are offset from the object are still counted as positives.

This raises a natural question: given plenty of accurate data, could the CNN with softmax output the 21-way classification directly, without the SVMs? RBG and colleagues tried using the fine-tuned network's softmax output directly and found that its accuracy is still fairly high (50.9%) but not as good as the SVM result (54.2%). One reason is that the positive samples are not precise enough; another is that the negative samples have not gone through hard negative mining. At the least, this suggests that it may be possible to reach good detection performance by training the CNN directly, which would speed up training and make the method simpler and more elegant.

Bounding-box regression

This part is described in Appendix C (because of CVPR's space limits). A bounding-box regressor is trained for each class, in the spirit of the bounding-box regression used in DPM. In DPM, the regressor works from geometric features computed on the part locations inferred by the HOG-based model. In contrast, the regressor in RCNN-BB works from features computed by the CNN. Its input is a region proposal (its location together with its CNN features), and its output is the refined location of the bounding box.

The position of a region proposal is defined as P=(Px, Py, Pw, Ph), where x and y are the center of the region proposal and w and h are its width and height. The corresponding ground-truth bounding box is G=(Gx, Gy, Gw, Gh). The training goal of the regressor is to learn a mapping P->G, which is decomposed into four parts:
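Following Appendix C of the RCNN paper, the four parts can be written as below, where \hat{G} denotes the predicted box. The center is shifted by an amount proportional to the proposal's size, and the width and height are rescaled in log space:

```latex
\begin{aligned}
\hat{G}_x &= P_w \, d_x(P) + P_x \\
\hat{G}_y &= P_h \, d_y(P) + P_y \\
\hat{G}_w &= P_w \exp\big(d_w(P)\big) \\
\hat{G}_h &= P_h \exp\big(d_h(P)\big)
\end{aligned}
```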

Here dx(P), dy(P), dw(P), dh(P) are four linear functions. Each takes as input the pool5 features of proposal P, computed by the CNN described above, and outputs a real number.

Training solves an optimization problem: find the four weight vectors w that minimize the difference between the predicted G and the real G, measured by the sum of squared differences. The simplified form is as follows:
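As given in Appendix C of the paper, this is a regularized least-squares (ridge regression) problem, solved once for each of the four coordinates. In the notation below, \phi_5(P^i) is the pool5 feature vector of the i-th training proposal, t_\star^i is its regression target in coordinate \star, and \lambda is the regularization weight:

```latex
w_\star = \arg\min_{\hat{w}_\star}
  \sum_{i=1}^{N} \Big( t_\star^{i} - \hat{w}_\star^{\top} \phi_5(P^{i}) \Big)^2
  + \lambda \,\lVert \hat{w}_\star \rVert^2 ,
\qquad \star \in \{x, y, w, h\}
```

Each linear function is then d_\star(P) = w_\star^{\top} \phi_5(P).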

Among them,
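the regression targets for a training pair (P, G) are, following Appendix C of the paper, the offsets that would map P exactly onto G:

```latex
\begin{aligned}
t_x &= (G_x - P_x) / P_w \\
t_y &= (G_y - P_y) / P_h \\
t_w &= \log(G_w / P_w) \\
t_h &= \log(G_h / P_h)
\end{aligned}
```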

These correspond to the four mappings above, and an L2 regularization constraint on w is added to suppress overfitting

Once the four mappings are learned, they can be applied at test time to refine the predicted regions and improve the localization accuracy of the detection boxes.
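The encode and decode steps of the regression can be sketched in a few lines. This is a minimal illustration that uses the (center x, center y, width, height) box format defined above; the function names are hypothetical, and the learned linear functions on pool5 features are replaced by the exact targets to show that the two transforms are inverses.

```python
import math


def regression_targets(P, G):
    """Targets (tx, ty, tw, th) for proposal P and ground-truth box G,
    both given as (center_x, center_y, width, height)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_deltas(P, d):
    """Refine proposal P with predicted offsets d = (dx, dy, dw, dh)."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py,
            pw * math.exp(dw), ph * math.exp(dh))


P = (50.0, 50.0, 100.0, 80.0)    # a region proposal
G = (60.0, 46.0, 120.0, 100.0)   # its ground-truth box
t = regression_targets(P, G)
refined = apply_deltas(P, t)     # a perfect regressor would recover G
print(refined)
```

At test time the offsets would come from the four learned functions dx(P), dy(P), dw(P), dh(P) rather than from the ground truth.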

At this point, the whole training and testing process has been covered.

Metaphysics time

In the paper, RBG also opens up the convolutional layers of RCNN to analyze what they do. In the AlexNet paper, Hinton and his team visualized the first convolutional layer, which captures the outlines and colors of objects, but the later layers cannot be visualized directly because they cannot be rendered as images. RBG's method is to feed in each region of an image, look at the response of each unit in pool5 (the max-pooled output of the last convolutional layer), and draw boxes around the regions with high response:

The pool5 feature map has size 6x6x256. In the figure, the 16 images in each row are the 16 regions to which one unit responds most strongly. The area of each image that drives the response is outlined with a white box. Only 6 units are shown here (so there are only 6 rows). A unit is one real number in the 6x6x256 tensor; the larger the number, the stronger the response to the input.
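Picking the 16 regions for one unit amounts to a top-k selection over that unit's activations. The sketch below is an illustration only: the function name and the dictionary layout are assumptions, and the toy vectors have 2 entries instead of the real 6 * 6 * 256 = 9216. The paper also applies non-maximum suppression before displaying the regions, which is omitted here.

```python
import heapq


def top_regions_for_unit(activations, unit, k=16):
    """Return the ids of the k regions that activate `unit` most strongly.

    `activations` maps a region id to its flattened pool5 vector
    (length 6 * 6 * 256 = 9216 in the real network).
    """
    return heapq.nlargest(k, activations,
                          key=lambda rid: activations[rid][unit])


# Toy example with 3 regions and 2 units.
activations = {
    "region_a": [0.1, 2.0],
    "region_b": [0.5, 0.3],
    "region_c": [0.9, 1.0],
}
print(top_regions_for_unit(activations, unit=0, k=2))  # ['region_c', 'region_b']
print(top_regions_for_unit(activations, unit=1, k=2))  # ['region_a', 'region_c']
```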

It can be seen that different units specialize in different things. The unit in the first row responds strongly to people, and the unit in the second row responds strongly to dogs and dot arrays. From this perspective, each unit can be regarded as a separate object detector.

There are more visualizations in Appendix D

The reason for calling this metaphysics is that, although this kind of visualization shows to some extent what the CNN has learned, it still fails to explain why this particular unit learned this particular information.

Summary

RCNN was the first to apply a CNN combined with region proposals to the detection task, and it achieved good results. The paper also reflects many popular techniques in deep learning for vision, such as pre-training and fine-tuning, and the combination of traditional methods with deep learning (segmentation + detection, CNN + SVM, bounding-box regression). It is a good paper that is well worth reading.