The difference between object recognition and object detection:

Recognition only needs to determine which category an object belongs to, while detection must determine both the category and the object's specific location.


RCNN

ø Region CNN (RCNN) Background and significance

Proposed by Ross Girshick et al.

RCNN can be regarded as the pioneering work in applying deep learning to object detection.


ø Comparison with classical object detection algorithms

Compared with the DPM algorithm, detection performance is significantly improved.


ø Feature extraction for candidate regions

Classical object detection algorithms: hand-crafted features (such as Haar and HOG)

RCNN: features learned by a deep network


ø Available datasets

Recognition dataset (ImageNet ILSVRC 2012): 10 million images, 1000 classes. Only the object class is annotated for each image. Used for pre-training.

Detection dataset (PASCAL VOC 2007): 10,000 images, 20 categories. Both the class and the position of objects are annotated in each image. Used for fine-tuning and evaluation.


ø Selective Search

Algorithm steps:

1. Generate the initial region set R and compute the similarity set S1 between all neighboring regions;

2. Merge the two regions with the highest similarity into a new region and add it to R;

3. Delete from S1 all similarities involving the two regions that were merged;

4. Compute the similarities between the new region and its neighbors, giving the updated similarity set S2;

5. Repeat steps 2-4 on S2 until the similarity set is empty (a sketch of this merge loop is given after the notes below).



Initial region set generation: each pixel of the image is a vertex and adjacent pixels are connected by edges; the vertices grouped by a minimum-spanning-tree based (graph-based) segmentation form one region.


Similarity: a weighted sum of the color, texture, size, and fill (overlap) similarities, each with its own coefficient.
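
A minimal, illustrative Python sketch of the merge loop and the weighted similarity described above. It assumes the initial region set comes from a graph-based segmentation supplied elsewhere, represents each region by a simple dict of histograms, size and bounding box, and, for brevity, ignores the adjacency restriction the real algorithm applies when pairing regions.

import itertools

import numpy as np


def combined_similarity(r1, r2, weights=(1.0, 1.0, 1.0, 1.0), image_size=1.0):
    # Weighted sum of color, texture, size and fill (overlap) similarities.
    w_col, w_tex, w_size, w_fill = weights
    s_col = np.minimum(r1["color_hist"], r2["color_hist"]).sum()
    s_tex = np.minimum(r1["texture_hist"], r2["texture_hist"]).sum()
    s_size = 1.0 - (r1["size"] + r2["size"]) / image_size
    # Fill: how tightly the two regions fill their joint bounding box.
    x1 = min(r1["bbox"][0], r2["bbox"][0]); y1 = min(r1["bbox"][1], r2["bbox"][1])
    x2 = max(r1["bbox"][2], r2["bbox"][2]); y2 = max(r1["bbox"][3], r2["bbox"][3])
    s_fill = 1.0 - ((x2 - x1) * (y2 - y1) - r1["size"] - r2["size"]) / image_size
    return w_col * s_col + w_tex * s_tex + w_size * s_size + w_fill * s_fill


def merge(r1, r2):
    # Merge two regions: histograms are size-weighted averages, bbox is the union.
    size = r1["size"] + r2["size"]
    return {
        "size": size,
        "bbox": (min(r1["bbox"][0], r2["bbox"][0]), min(r1["bbox"][1], r2["bbox"][1]),
                 max(r1["bbox"][2], r2["bbox"][2]), max(r1["bbox"][3], r2["bbox"][3])),
        "color_hist": (r1["color_hist"] * r1["size"] + r2["color_hist"] * r2["size"]) / size,
        "texture_hist": (r1["texture_hist"] * r1["size"] + r2["texture_hist"] * r2["size"]) / size,
    }


def selective_search(regions, image_size):
    # regions: dict {int id: region dict} produced by the initial segmentation (step 1).
    active = set(regions)
    sims = {(a, b): combined_similarity(regions[a], regions[b], image_size=image_size)
            for a, b in itertools.combinations(active, 2)}      # similarity set S1
    proposals = [r["bbox"] for r in regions.values()]           # keep every region seen
    next_id = max(regions) + 1

    while sims:                                                  # step 5: until S is empty
        a, b = max(sims, key=sims.get)                           # step 2: most similar pair
        regions[next_id] = merge(regions[a], regions[b])
        proposals.append(regions[next_id]["bbox"])
        sims = {k: v for k, v in sims.items() if a not in k and b not in k}   # step 3
        active -= {a, b}
        for other in active:                                     # step 4: new similarities
            sims[(other, next_id)] = combined_similarity(
                regions[other], regions[next_id], image_size=image_size)
        active.add(next_id)
        next_id += 1

    return proposals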


ø Candidate region generation

Selective Search is run in multiple color spaces (HSV, RGB, Lab, etc.), applying the four similarity measures above at the same time; all resulting regions are pooled and duplicates are removed to obtain the candidate regions.

All regions that appear during the merging process are output, yielding roughly 2000-3000 candidate regions per image.
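
A sketch of how proposals from several color spaces could be pooled and deduplicated, reusing selective_search from the sketch above. The initial segmentation is passed in as a caller-supplied function because it is outside the scope of this note, and OpenCV is used only for the color conversions.

import cv2


def generate_proposals(image_bgr, segment):
    # segment: a caller-supplied function that builds the initial region set
    # for one image (e.g. graph-based segmentation); an assumption of this sketch.
    color_spaces = [
        cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB),
        cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV),
        cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB),
    ]
    h, w = image_bgr.shape[:2]
    proposals = set()                              # a set drops exact duplicate boxes
    for converted in color_spaces:
        regions = segment(converted)
        proposals.update(selective_search(regions, image_size=h * w))
    return list(proposals)                         # typically ~2000-3000 boxes per image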

ø Feature extraction

1. Normalize each candidate region to a fixed size of 227×227.

2. Pre-training network structure:


Learning rate: 0.01. The network extracts 4096-dimensional features and outputs 1000-dimensional category labels.

3. Fine-tuning network structure:

The only difference from pre-training is the last layer: the output changes from 1000 dimensions to 21, representing the 20 classes plus background. The learning rate is 0.001 (see the sketch below).
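
A sketch of the feature-extraction and fine-tuning setup, using torchvision's AlexNet (recent torchvision weights API) as a stand-in for the original network: each proposal is warped to 227×227, the 1000-way output layer is replaced by a 21-way one (20 classes + background), and training uses the smaller fine-tuning learning rate of 0.001. The momentum value and the omission of mean/std normalization are assumptions of this sketch.

import cv2
import torch
import torch.nn as nn
import torchvision

# 2. Network pre-trained on ImageNet (1000-way output, 4096-d fc7 features).
model = torchvision.models.alexnet(
    weights=torchvision.models.AlexNet_Weights.IMAGENET1K_V1)

# 3. Fine-tuning: replace only the last layer, 1000 -> 21 outputs.
model.classifier[6] = nn.Linear(4096, 21)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # momentum assumed
criterion = nn.CrossEntropyLoss()


def warp_proposal(image_bgr, box):
    # 1. Crop a candidate box and warp it to the fixed 227x227 input size.
    x1, y1, x2, y2 = box
    crop = cv2.resize(image_bgr[y1:y2, x1:x2], (227, 227))
    # HWC uint8 -> 1xCxHxW float tensor in [0, 1]; normalization omitted here.
    return torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0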

ø Category judgment

1. Classifiers

Classifier: one binary SVM per class; input: 4096-dimensional features; output: whether the region belongs to that class.


2. Positive and negative samples

Positive samples: the ground-truth boxes of this class.

Negative samples: candidate boxes whose overlap (IoU) with every ground-truth box of this class is less than 0.3 (see the sketch below).
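
A sketch of how the training data for one per-class SVM could be assembled from these rules: ground-truth boxes are the positives, and proposals with IoU below 0.3 against every ground-truth box of the class are the negatives. scikit-learn's LinearSVC stands in for the SVM, the 4096-d features are assumed to be extracted beforehand, and the regularization constant C is an arbitrary choice.

import numpy as np
from sklearn.svm import LinearSVC


def iou(box_a, box_b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)


def train_class_svm(gt_feats, proposal_feats, proposal_boxes, gt_boxes):
    # One binary SVM per class; gt_* belong to this class, proposals are screened.
    neg = [f for f, box in zip(proposal_feats, proposal_boxes)
           if all(iou(box, gt) < 0.3 for gt in gt_boxes)]   # negatives: IoU < 0.3
    X = np.vstack([gt_feats, np.asarray(neg)])              # positives: ground-truth boxes
    y = np.hstack([np.ones(len(gt_feats)), np.zeros(len(neg))])
    return LinearSVC(C=1.0).fit(X, y)                       # C = 1.0 is assumed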


3. Hard negative mining

Hard negative mining is used during training. At the start of training, a random batch of boxes that have no overlap with the positive samples forms the negative set; however, the classifier trained this way tends to produce many false positives.

Hard negative mining therefore takes these misclassified detections, adds them to the training set as negative samples, and retrains the classifier, which then performs noticeably better (see the sketch below).
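
A sketch of that retraining loop under the same assumptions: proposals the current SVM scores as positive even though they have low overlap with the ground truth are treated as hard negatives and added back before retraining. The number of rounds and the zero score threshold are illustrative, not values from these notes.

import numpy as np


def hard_negative_mining(svm, pos_feats, neg_feats, pool_feats, pool_boxes,
                         gt_boxes, rounds=3):
    # svm and iou(): see the previous sketch; pool_* is a large pool of proposals.
    for _ in range(rounds):
        cand = np.asarray([f for f, b in zip(pool_feats, pool_boxes)
                           if all(iou(b, gt) < 0.3 for gt in gt_boxes)])
        if len(cand) == 0:
            break
        scores = svm.decision_function(cand)
        hard = cand[scores > 0]                    # false positives = hard negatives
        if len(hard) == 0:
            break
        neg_feats = np.vstack([neg_feats, hard])   # grow the negative set
        X = np.vstack([pos_feats, neg_feats])
        y = np.hstack([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
        svm = svm.fit(X, y)                        # retrain on the harder negatives
    return svm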


4. Category judgment

For each candidate box, find the ground-truth box on the current image with which it has the largest overlap. If that overlap ratio (IoU) is greater than 0.5, the candidate box is labeled as the class of the ground-truth box; otherwise it is labeled as background (see the sketch below).
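
A short sketch of this labeling rule, reusing iou() from the earlier sketch; treating index 0 as the background class is an assumption of this example.

def assign_label(proposal_box, gt_boxes, gt_labels, background=0):
    # gt_boxes: list of (x1, y1, x2, y2); gt_labels: class index of each box.
    if not gt_boxes:
        return background
    overlaps = [iou(proposal_box, gt) for gt in gt_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return gt_labels[best] if overlaps[best] > 0.5 else background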

ø Position refinement

1. Linear ridge regressor

Regularization term λ = 10000. Input: the 4096-dimensional features of the pool5 layer; output: the scaling and translation of the box in the x and y directions.


2. Training samples

Candidate boxes whose overlap (IoU) with a ground-truth box is greater than 0.6 are used as the training samples for this class's regressor (see the sketch below).
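
A sketch of the position refinement step under the rules above: a ridge regressor with α = 10000 maps pool5 features to a shift of the box center and a log-scale change of its width and height, trained only on proposals whose IoU with a ground-truth box of the class exceeds 0.6. scikit-learn's Ridge stands in for the original regressor, and the target parameterization (center offsets normalized by box size, log width/height ratios) follows the common R-CNN formulation rather than anything stated explicitly in these notes.

import numpy as np
from sklearn.linear_model import Ridge


def to_cxcywh(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1


def regression_targets(proposal, gt):
    # Shift of the center (relative to proposal size) and log scale of w/h.
    px, py, pw, ph = to_cxcywh(proposal)
    gx, gy, gw, gh = to_cxcywh(gt)
    return [(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)]


def train_bbox_regressor(pool5_feats, proposal_boxes, gt_boxes):
    # pool5_feats[i] is the feature vector of proposal_boxes[i]; iou() as above.
    X, T = [], []
    for feat, prop in zip(pool5_feats, proposal_boxes):
        overlaps = [iou(prop, gt) for gt in gt_boxes]
        best = int(np.argmax(overlaps))
        if overlaps[best] > 0.6:                   # training-sample rule
            X.append(feat)
            T.append(regression_targets(prop, gt_boxes[best]))
    return Ridge(alpha=10000.0).fit(np.array(X), np.array(T))   # λ = 10000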