Introduction to R-CNN

R-CNN was proposed in 2014 and should be regarded as the pioneering work applying convolutional neural networks to the object detection task. There was also the OverFeat algorithm in the same year, but it will not be discussed here.

In the following years, more and more CNN models for object detection appeared, with better speed and accuracy, but the most classical model is still worth studying.

So let’s get started:

My personal understanding of the R-CNN model is that it is really a clever combination of four existing algorithms, originally designed for different tasks, that together achieve good results on object detection. The combination feels more like an engineering method than an algorithmic breakthrough. Of course, the pipeline is gradually improved and integrated into a single model in the subsequent Fast R-CNN and Faster R-CNN, but no such integration exists in R-CNN.

Therefore, R-CNN consists of four parts, which are as follows:

1. Region Proposal Algorithm (Selective Search, SS)

2. Feature Extraction Algorithm (AlexNet)

3. Linear classifier (Linear SVM)

4. Bounding Box Regression Models

Region proposal algorithm:

First, the region proposal algorithm. Such algorithms existed before CNNs, and there is more than one of them: SS (Selective Search) is a famous one, and there are also EdgeBoxes, MSER, MCG, and others.

So what is the SS algorithm used for in R-CNN? This starts with the object detection task itself. To detect objects in an image, one of the simplest ideas is a sliding window: classify the sub-image inside every window position, and whenever the classification result is the target category, detection is achieved; the object's category comes from the classifier, and its location comes from the sliding window. However, a single traversal already produces a large number of sub-images, and once different strides and window sizes are considered, the number of images to classify explodes, which clearly has no practical value. The SS algorithm was therefore developed: an algorithm that generates proposal regions from the information in the image itself. It produces roughly 1,000 to 2,000 potential object regions, far fewer than the sliding-window approach.
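As a rough illustration (not from the original R-CNN code), Selective Search is available in OpenCV's contrib module; a minimal sketch of generating proposals, assuming opencv-contrib-python is installed and `example.jpg` is a placeholder image path:

```python
import cv2  # requires opencv-contrib-python for the ximgproc module

img = cv2.imread("example.jpg")  # placeholder input image

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # the "quality" mode yields more proposals but is slower

rects = ss.process()               # N x 4 array of (x, y, w, h) proposals
proposals = rects[:2000]           # R-CNN keeps on the order of 1,000-2,000 regions
print(f"{len(rects)} proposals generated, keeping {len(proposals)}")
```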

Feature extraction algorithm:

The feature extraction algorithm here is in fact a convolutional neural network. R-CNN uses AlexNet, but the author (Ross Girshick) does not use AlexNet as a classifier. Instead, he only uses the network as a feature extractor for the image regions output by the SS algorithm: the fc7 features go to the SVM classifiers, and the fifth-layer (conv5) features go to the bounding box regression model.
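A minimal sketch of this two-branch feature extraction using torchvision's pre-trained AlexNet (an illustration only, not the author's original Caffe implementation; the layer slicing and ImageNet normalization below are assumptions tied to torchvision's layout):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained AlexNet from recent torchvision; R-CNN itself used a Caffe model.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# R-CNN warps each SS region to a fixed size before feeding it to the CNN.
preprocess = T.Compose([
    T.Resize((227, 227)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(region_img: Image.Image):
    """Return the conv5 feature map (for bbox regression) and fc7 vector (for the SVMs)."""
    x = preprocess(region_img).unsqueeze(0)
    with torch.no_grad():
        conv5 = alexnet.features(x)                 # conv5/pool5 feature map
        flat = torch.flatten(alexnet.avgpool(conv5), 1)
        fc7 = alexnet.classifier[:6](flat)          # through the second FC layer + ReLU (fc7)
    return conv5, fc7
```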

Linear classifier:

R-CNN uses a linear SVM classifier. Not much needs to be said about it; it is a highly regarded machine learning algorithm. What must be made clear is its role in the detection task: if, for example, the task is to detect cats and dogs, then besides boxing the locations of cats and dogs, we also need to decide whether each box is a cat or a dog, and that is the role of the SVM in R-CNN. Therefore, however many categories of objects are to be detected, there should be that many binary SVM classifiers. In the example above, two binary classifiers are needed: a "cat vs. non-cat" model and a "dog vs. non-dog" model. In R-CNN there are 20 classifiers, and their input features are the fc7-layer features extracted by AlexNet.

Bounding box regression model:

The bounding box is also an old topic. Among the common tasks in computer vision there is a localization task, sitting between classification and detection, in which only one object is boxed in an image; it uses a bounding box regression model.

In R-CNN, bounding box regression is used to correct the boundaries of the regions proposed by SS, and its input features are the fifth-layer (conv5) features of AlexNet. Like the SVM classifiers, there is one model per category, 20 models in total.

Above, we introduced the four parts of R-CNN and their functions. As you can see, they all existed before; the success of R-CNN lies in finding a way to combine the four parts in training and testing, while the accuracy improves greatly thanks to the introduction of the CNN. Compare it with the HOG+SVM method for pedestrian detection: HOG is a hand-crafted feature, while in R-CNN it is replaced by CNN-extracted features.

So my personal opinion is that the key to understanding R-CNN is not the four algorithms mentioned above, but how they are trained and tested in R-CNN!

R-CNN's training

R-CNN trains the CNN, the SVMs, and the bounding box models; the SS algorithm needs no training, haha~~

Once SS has generated its 1,000-2,000 proposal regions, it plays no further role in training; the training samples are constructed from the sub-images produced by the SS regions.

And the three parts are trained entirely separately; they are not integrated.

1. Training the CNN

The CNN is an AlexNet model pre-trained on ImageNet; in R-CNN it is fine-tuned. The fine-tuning process replaces AlexNet's softmax layer with one sized to the number of categories required by the task, and then trains it as an ordinary classification model. The training samples are built from the sub-images generated by SS: when a sub-image's IoU with a ground-truth box is greater than or equal to 0.5, it is treated as a positive sample of that class (there are 20 such classes); when the IoU is less than 0.5, it is treated as a negative (background) sample. With these samples, AlexNet can be fine-tuned. After fine-tuning, AlexNet's softmax layer is thrown away, leaving only the trained parameters, which are used for feature extraction.
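A hedged PyTorch sketch of that fine-tuning step, assuming 20 object classes plus a background class and a placeholder `proposal_loader` that yields warped SS sub-images with their IoU-based labels:

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 20 + 1   # 20 object classes + background (proposals with IoU < 0.5)

# Start from the ImageNet-pre-trained AlexNet and swap the 1000-way output layer.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def finetune_one_epoch(proposal_loader):
    """proposal_loader is a placeholder DataLoader yielding (warped sub-images, labels)."""
    model.train()
    for images, labels in proposal_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```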

2. Training the SVMs

As mentioned before, the input features for the SVMs are the fc7 outputs of AlexNet. Each SVM is a binary classifier, and there are 20 SVM models in total. For a given classifier, its positive samples are the features produced by passing all ground-truth regions of that class through AlexNet, and its negative samples are the features of proposal regions whose IoU with the ground truth is less than 0.3. Once the features and labels are determined, the SVM can be trained.
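A rough scikit-learn sketch of one such per-class SVM, assuming the fc7 features have already been extracted and cached; the array names are placeholders, and the hard-negative mining used in the actual R-CNN training is omitted:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_class_svm(gt_fc7_feats, proposal_fc7_feats, proposal_ious):
    """Train one binary "class vs. non-class" SVM for a single category.

    gt_fc7_feats:       fc7 features of this class's ground-truth boxes (positives)
    proposal_fc7_feats: fc7 features of the SS proposals
    proposal_ious:      max IoU of each proposal with this class's ground truth
    """
    negatives = proposal_fc7_feats[proposal_ious < 0.3]          # IoU < 0.3 -> negative
    X = np.vstack([gt_fc7_feats, negatives])
    y = np.concatenate([np.ones(len(gt_fc7_feats)), np.zeros(len(negatives))])
    return LinearSVC(C=1.0).fit(X, y)

# One binary classifier per category, e.g. 20 of them for PASCAL VOC:
# svms = {cls: train_class_svm(...) for cls in classes}
```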

3. Training the bounding box regression models

There are also 20 bounding box regression models, one per class, whose input is AlexNet's conv5 feature. Note that 20 refers to the number of classes, but each bounding box regressor has 4 sets of parameters, because it regresses 4 numbers, the four values that describe a bounding box. The loss function of the model is as follows:
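In the notation of the R-CNN paper, with $\phi_5(P^i)$ denoting the conv5 feature of proposal $P^i$ and $*$ ranging over $\{x, y, w, h\}$, it is a regularized least-squares (ridge regression) objective:

$$\mathbf{w}_* = \arg\min_{\hat{\mathbf{w}}_*} \sum_{i=1}^{N} \left( t_*^{\,i} - \hat{\mathbf{w}}_*^{\top} \phi_5(P^i) \right)^2 + \lambda \left\lVert \hat{\mathbf{w}}_* \right\rVert^2$$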

Here i indexes the training samples, and * stands for one of the 4 numbers x, y, w, h, where (x, y) is the center position and (w, h) are the width and height. P is the region given by SS, described by Px, Py, Pw, Ph. This region is passed through AlexNet to obtain its fifth-layer (conv5) features, and a weight vector w is trained on top of those features for each of the four dimensions; the same set of features thus has four sets of weights, obtained by regressing against the four targets. The target is t, which also consists of 4 numbers tx, ty, tw, th, computed as follows:
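In the paper's notation:

$$t_x = \frac{G_x - P_x}{P_w},\qquad t_y = \frac{G_y - P_y}{P_h},\qquad t_w = \log\frac{G_w}{P_w},\qquad t_h = \log\frac{G_h}{P_h}$$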

Here G is the ground-truth bounding box toward which the proposal should be corrected; it is also four numbers Gx, Gy, Gw, Gh. As can be seen from the formulas above, t is the offset of the bounding box. Finally, which SS regions can be used as input? In this case, those with an IoU greater than 0.6.

The bounding box regression model can be summarized in one sentence: for the regression model of a given class, the conv5 features of SS regions with IoU > 0.6 are used as input, and the same features are used to train four groups of weights, which respectively regress the four attribute values of the bounding box.
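Since each per-class regressor is just regularized (ridge) linear regression on conv5 features, a minimal scikit-learn sketch might look like this; the inputs are placeholders, and alpha plays the role of the paper's lambda (set to 1000 there):

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_bbox_regressor(phi5_feats, proposals, gt_boxes, alpha=1000.0):
    """Fit the four offset regressors for one class.

    phi5_feats: flattened conv5 features of SS regions with IoU > 0.6, shape (N, D)
    proposals:  those proposals as (x, y, w, h), with (x, y) the centre, shape (N, 4)
    gt_boxes:   the matched ground-truth boxes, same layout, shape (N, 4)
    """
    Px, Py, Pw, Ph = proposals.T
    Gx, Gy, Gw, Gh = gt_boxes.T
    # Regression targets t_x, t_y, t_w, t_h as defined above.
    targets = np.stack([(Gx - Px) / Pw,
                        (Gy - Py) / Ph,
                        np.log(Gw / Pw),
                        np.log(Gh / Ph)], axis=1)
    # A single multi-output ridge fit is equivalent to four separate weight vectors.
    return Ridge(alpha=alpha).fit(phi5_feats, targets)
```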

After the above three parts, R-CNN training is complete. As you can see, it is genuinely troublesome, not only slow but also tedious, because the samples have to be re-constructed at each step.

R-CNN's test

After training, R-CNN can be used for testing. The test process runs in a single pass and consists of the following steps (a rough sketch tying them together is given after the list):

1. The SS algorithm extracts 1,000-2,000 regions;

2. Warp (normalize) all regions to the fixed size accepted by the CNN;

3. Extract two sets of features with the AlexNet network: one from the fc7 layer, the other from the conv5 layer;

4. For each region's fc7 features, run all 20 classifiers and take the one with the highest score to determine the region's category; do this for every region;

5. Apply non-maximum suppression to all the labeled regions to obtain a subset without redundant (overlapping) boxes; the regions that survive non-maximum suppression are the ones that will finally be drawn;

6. Take the conv5 features of the regions remaining after step 5, feed them into the bounding box model, and correct the boxes according to the model's output;

7. Label each box according to the SVM result and draw it according to the corrected coordinates;

8. End !!!!!!
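Tying steps 4-7 together, here is a rough, pseudocode-style sketch that reuses the per-class models from the training sketches above and torchvision's nms for step 5 (all array and model names are placeholders, and NMS itself is treated as a black box):

```python
import numpy as np
import torch
from torchvision.ops import nms

def detect(boxes_xywh, fc7, phi5, svms, regressors, score_thresh=0.0, nms_iou=0.3):
    """Steps 4-7 at test time, given SS proposals and their cached AlexNet features.

    boxes_xywh: proposals as (x, y, w, h) with (x, y) the box centre, shape (N, 4)
    fc7, phi5:  fc7 vectors and flattened conv5 features of those proposals
    svms, regressors: per-class models from the training sketches above
    """
    x, y, w, h = boxes_xywh.T
    boxes_xyxy = torch.as_tensor(
        np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1),
        dtype=torch.float32)

    detections = []
    for cls in svms:                                       # step 4: score every region per class
        scores = torch.as_tensor(svms[cls].decision_function(fc7), dtype=torch.float32)
        keep = nms(boxes_xyxy, scores, nms_iou)            # step 5: non-maximum suppression
        for i in keep.tolist():
            if scores[i] <= score_thresh:
                continue
            tx, ty, tw, th = regressors[cls].predict(phi5[i:i + 1])[0]     # step 6: predicted offsets
            gx, gy = w[i] * tx + x[i], h[i] * ty + y[i]                    # correct the centre
            gw, gh = w[i] * np.exp(tw), h[i] * np.exp(th)                  # correct width / height
            detections.append((cls, float(scores[i]), (gx, gy, gw, gh)))   # step 7: label + box
    return detections
```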

R-CNN performance evaluation

The appearance of R-CNN brought a qualitative leap in mAP, the standard performance metric for object detection in computer vision.

But R-CNN also has a fatal flaw: extremely long training and testing times. Training takes 84 hours, and even if training time were unimportant, the test time of 47 s for a single image robs R-CNN of its practicality. Fortunately, various later algorithms improved on it, which will be covered later.

1. Non-maximum suppression is not covered here.

2. How to correct a box from the output of the bounding box model: the model outputs the (proportional) offsets of the four values, and the final position is then obtained with the following formulas.
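In the paper's notation, where (Px, Py, Pw, Ph) is the proposal and (tx, ty, tw, th) are the predicted offsets, the corrected box is:

$$\hat G_x = P_w t_x + P_x,\qquad \hat G_y = P_h t_y + P_y,\qquad \hat G_w = P_w \exp(t_w),\qquad \hat G_h = P_h \exp(t_h)$$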

The ridge-regression objective given in the training section is what the bounding box model optimizes in order to produce these offsets.

Original text: quant. La/Article/Vie…