CNNs have become the go-to tool for image classification, detection and segmentation ever since a CNN-based method (AlexNet) took the ILSVRC competition by surprise in 2012. For image detection, the R-CNN series is a set of classic methods. From the initial R-CNN to the later Fast R-CNN, Faster R-CNN and this year's Mask R-CNN, we can see how CNNs improve step by step at image detection. Let's take a look at the evolution of these methods, and the creative ideas along the way, through the R-CNN family.

Here are four articles in the R-CNN series:

  1. R-CNN: arxiv.org/abs/1311.25…
  2. Fast R-CNN: arxiv.org/abs/1504.08…
  3. Faster R-CNN: arxiv.org/abs/1506.01…
  4. Mask R-CNN: arxiv.org/abs/1703.06…

The task of image detection is to find the different objects in an image of a complex scene and give a bounding box for each object. Three well-known datasets for image detection are PASCAL VOC, ImageNet and Microsoft COCO. PASCAL VOC contains 20 object categories, ImageNet contains more than a thousand, and COCO has 80 object categories and 1.5 million object instances.

PASCAL VOC object detection
COCO object detection and instance segmentation

1. R-CNN

R-CNN has three intuitive steps: 1. obtain several candidate regions; 2. classify each candidate region with a CNN; 3. predict a bounding box for each candidate region.

Before R-CNN appeared, the popular approach to object detection was to first obtain some candidate regions from the image, extract features from those regions, and then run a classifier on the features. The classification result together with the bounding box of the candidate region serves as the detection output. Selective Search is one method for obtaining candidate regions; it produces candidate regions at different scales, and each candidate region is a connected region. In the figure below, smaller candidate regions are obtained on the left and larger ones on the right, where at that scale a single region covers the entire person.

Selective Search obtains candidate regions at different scales
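For reference, OpenCV's contrib module ships an implementation of Selective Search. A minimal usage sketch is shown below; it assumes the opencv-contrib-python package is installed and that a test image exists at the example path.

```python
import cv2

# Load an image (the path is just an example) and run OpenCV's Selective Search.
image = cv2.imread("test.jpg")

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()      # the "fast" mode trades some recall for speed
rects = ss.process()                  # N x 4 array of (x, y, w, h) proposals

print(len(rects), "candidate regions")
print(rects[:5])                      # the first few boxes
```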

R-CNN's idea is: since CNNs perform well on image classification and can learn features automatically, why not use one to extract features from each candidate region? So, on top of the Selective Search candidate regions, R-CNN warps each candidate region to a fixed size and feeds it to AlexNet (the winner of the ImageNet 2012 image classification competition) to extract its features in turn. A support vector machine is then used to classify the features of each region. R-CNN's pipeline is as follows:

R-CNN schematic diagram
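As a rough illustration of this per-region pipeline (not the original Caffe implementation), the sketch below warps each proposal to a fixed size and extracts fc7 features with torchvision's AlexNet; classification would then be done by a linear SVM trained separately, as in the paper.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Backbone: AlexNet, as in the original R-CNN (pretrained ImageNet weights,
# torchvision >= 0.13 API).
backbone = models.alexnet(weights="DEFAULT")
backbone.eval()

def extract_region_features(image, boxes):
    """Warp each candidate region to 227x227 and run the CNN once per region.

    image: float tensor of shape (3, H, W), values in [0, 1]
    boxes: list of (x1, y1, x2, y2) candidate regions, e.g. from Selective Search
    """
    feats = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in boxes:
            crop = image[:, y1:y2, x1:x2]                    # crop the region
            crop = TF.resize(crop, [227, 227]).unsqueeze(0)  # warp to a fixed size
            x = backbone.features(crop)                      # conv features
            x = backbone.avgpool(x).flatten(1)
            x = backbone.classifier[:-1](x)                  # fc7 features (4096-d)
            feats.append(x.squeeze(0).numpy())
    return np.stack(feats)

# Toy example: one random image and two proposal boxes.
img = torch.rand(3, 375, 500)
feats = extract_region_features(img, [(10, 20, 200, 180), (50, 60, 300, 350)])
print(feats.shape)   # (2, 4096); a per-class linear SVM would score these vectors
```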

This already gives a detection result. But the candidate regions obtained by Selective Search do not necessarily coincide with the true boundary of the target object, so R-CNN further adjusts the bounding box: a linear regressor predicts the true boundary of the object within a candidate region. The regressor's input is the feature of the candidate region and its output is the coordinates of the bounding box.
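The bounding-box regressor in the R-CNN paper is parameterized relative to the proposal box. The small NumPy sketch below shows that parameterization, with a toy example of applying predicted offsets back to a proposal (the variable names are illustrative).

```python
import numpy as np

def box_to_xywh(box):
    """Convert (x1, y1, x2, y2) to center/size form (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h

def regression_targets(proposal, gt):
    """Targets (tx, ty, tw, th) the regressor learns, following the R-CNN parameterization."""
    px, py, pw, ph = box_to_xywh(proposal)
    gx, gy, gw, gh = box_to_xywh(gt)
    return np.array([(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)])

def apply_deltas(proposal, t):
    """Apply predicted offsets to a proposal to get the refined box."""
    px, py, pw, ph = box_to_xywh(proposal)
    cx, cy = px + t[0] * pw, py + t[1] * ph
    w, h = pw * np.exp(t[2]), ph * np.exp(t[3])
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])

# Example: a proposal that is slightly off from the ground-truth box.
proposal = np.array([50, 60, 150, 200], dtype=float)
gt = np.array([55, 58, 160, 210], dtype=float)
t = regression_targets(proposal, gt)
print(apply_deltas(proposal, t))   # recovers the ground-truth box
```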

R-CNN worked well, improving mAP by about 30% relative to the previous best VOC 2012 result. But it is slow, mainly for three reasons: 1. generating candidate regions is time-consuming; 2. AlexNet has to be run more than 2000 times on a single image to extract features for all the candidate regions; 3. feature extraction, classification and box regression are three independent steps that must be trained separately and are inefficient at test time.

2. Fast R-CNN

To address R-CNN's inefficiency, Fast R-CNN observes that R-CNN runs AlexNet more than 2000 times on one image to obtain features for each region separately, even though many of those regions overlap. Can these repeated computations be avoided by running AlexNet only once per image and then deriving the features of the different regions from that single pass?

Therefore, Fast R-CNN proposes an ROI Pooling method: a single CNN forward pass on the input image first produces a feature map of the whole image, and the features of each candidate region are then extracted from that feature map. Since candidate regions have different sizes but the corresponding features need a fixed size, ROI Pooling is applied to each candidate region. The method works as follows: suppose the ROI of a candidate region has size h × w and the desired output size is H × W; the ROI is divided into an H × W grid of cells, each of size roughly (h/H) × (w/W), and max pooling within each cell yields an H × W feature map.

Fast R-CNN schematic diagram: each candidate region is divided into an H × W grid for pooling
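A minimal NumPy sketch of that pooling step for a single candidate region is shown below; it illustrates only the grid-and-max idea, not the efficient implementation used in practice.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one ROI on a feature map to a fixed (out_h, out_w) grid.

    feature_map: array of shape (C, H, W)
    roi: (x1, y1, x2, y2) in feature-map coordinates
    """
    x1, y1, x2, y2 = roi
    c = feature_map.shape[0]
    out = np.zeros((c, out_h, out_w), dtype=feature_map.dtype)
    # Split the ROI into out_h x out_w cells of size roughly (h/out_h) x (w/out_w).
    ys = np.linspace(y1, y2, out_h + 1).astype(int)
    xs = np.linspace(x1, x2, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)   # guard against empty cells
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

# Example: a 512-channel feature map and one ROI, pooled to 7x7.
fmap = np.random.rand(512, 38, 50).astype(np.float32)
pooled = roi_max_pool(fmap, roi=(10, 5, 45, 30))
print(pooled.shape)   # (512, 7, 7)
```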

The second idea of Fast R-CNN is to merge the three previously independent steps (feature extraction, classification and regression) into a single unified network. The network simultaneously predicts the object category of a candidate region and the object's bounding box, using two fully connected output layers for category prediction and box prediction respectively (as shown in the figure below). The two tasks are trained at the same time with a joint cost function, which in the Fast R-CNN paper takes the form

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

The two terms in the formula are the classification loss and the regression loss respectively; λ balances the two tasks, and the indicator [u ≥ 1] means the regression term only counts for non-background proposals.

Feature extraction – classification – regression combined network
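A compact PyTorch sketch of such a joint loss is given below. The tensor names and the λ weight are illustrative; cross-entropy is used for classification and smooth L1 for box regression, as in the Fast R-CNN paper.

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, box_preds, labels, box_targets, lam=1.0):
    """Joint loss: cross-entropy for classification + smooth L1 for box regression.

    class_logits: (N, K+1) scores for K object classes plus background
    box_preds:    (N, K, 4) predicted box offsets, one set per class
    labels:       (N,) ground-truth class indices, 0 = background
    box_targets:  (N, 4) regression targets for the ground-truth class
    """
    cls_loss = F.cross_entropy(class_logits, labels)

    # The regression term only applies to foreground proposals (label >= 1).
    fg = torch.nonzero(labels > 0).squeeze(1)
    if fg.numel() > 0:
        fg_boxes = box_preds[fg, labels[fg] - 1]        # offsets of the true class
        reg_loss = F.smooth_l1_loss(fg_boxes, box_targets[fg])
    else:
        reg_loss = box_preds.sum() * 0.0                # no foreground in this batch
    return cls_loss + lam * reg_loss

# Toy example: 8 proposals, 20 object classes (a PASCAL VOC-sized problem).
logits = torch.randn(8, 21)
boxes = torch.randn(8, 20, 4)
labels = torch.randint(0, 21, (8,))
targets = torch.randn(8, 4)
print(fast_rcnn_loss(logits, boxes, labels, targets))
```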

Using VGG16 as the feature-extraction network, Fast R-CNN processes test images more than 200 times faster than R-CNN, with higher accuracy. If the time for candidate-region generation is ignored, it approaches real-time detection. However, the Selective Search algorithm that generates the candidate regions takes about 2 s per image, so it becomes the bottleneck of this method.

3. Faster R-CNN

The above two methods rely on Selective Search to generate candidate regions, which is time-consuming. Since CNNs are so powerful, Faster R-CNN proposes to use a CNN to generate the candidate regions as well. Suppose there are two convolutional networks: one is the region proposal network, which produces the candidate regions in the image, and the other is the network that classifies each candidate region and regresses its bounding box. The first several layers of both networks compute convolutions; if those layers share their parameters, and only the last few layers perform their respective tasks, then a single forward pass through the shared convolutional layers on an image yields both the candidate regions and, for each candidate region, its class and bounding box.

Faster R-CNN: the Region Proposal Network operates on the feature map produced by the convolutional layers

The Region Proposal Network (RPN) works as follows. First, a feature map is obtained by passing the input image through several convolutional layers, and candidate regions are then generated on this feature map. A 3 × 3 sliding window transforms the local feature map into a low-dimensional feature vector, from which the network predicts, for k regions, whether each one contains an object (the cls layer, with 2k outputs) and the corresponding bounding box (the reg layer, with 4k outputs). These k regions are called anchors; they are rectangles of different scales and different aspect ratios, all centered on the sliding window. If the feature map after convolution has size W × H, there are W × H × k anchors in total. This way of extracting features and generating candidate regions is translation invariant.

Predicting candidate regions from the anchors at a sliding-window position
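To make the anchor idea concrete, the NumPy sketch below enumerates anchors for every position of a feature map, using 3 scales and 3 aspect ratios (k = 9) and a feature stride of 16, values taken from the paper's VGG16 setting.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchor boxes (x1, y1, x2, y2) in image coordinates."""
    base = []
    for s in scales:
        for r in ratios:
            # Boxes of area s*s with aspect ratio r, centered at the origin.
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                            # (k, 4)

    # Shift the k base anchors to every sliding-window position on the feature map.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)   # (H*W, 1, 4)
    return (shifts + base).reshape(-1, 4)                            # (H*W*k, 4)

anchors = generate_anchors(feat_h=38, feat_w=50)
print(anchors.shape)   # (38 * 50 * 9, 4) = (17100, 4)
```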

After the candidate regions are obtained by the RPN, Fast R-CNN is still used for classification and bounding-box regression of those regions, and the two networks share their convolutional layers. Because Fast R-CNN needs a fixed candidate-region generation method during training, the RPN and Fast R-CNN cannot simply be trained jointly with backpropagation. The paper completes training in four steps: 1. train the RPN on its own; 2. train Fast R-CNN separately, using the region proposals obtained in step 1; 3. use the network obtained in step 2 to initialize the RPN and train it again; 4. train Fast R-CNN once more, fine-tuning its parameters.

The accuracy of Faster R-CNN is about the same as Fast R-CNN's, but training and test times are roughly 10 times shorter.

4. Mask R-CNN

Faster R-CNN achieves very good performance in object detection, and Mask R-CNN goes a step further: it produces detection results at the pixel level. For each object, it not only gives the bounding box but also marks whether each pixel inside the box belongs to the object.

Mask R-CNN: Target detection at pixel level

Mask R-CNN keeps the existing network structure of Faster R-CNN and adds a mask branch (an extra head) that uses an FCN to perform binary segmentation for each region.
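A minimal PyTorch sketch of such a mask branch is shown below. It roughly follows the paper's 14 × 14 → 28 × 28 setup, but the number of convolution layers is reduced for brevity, so the exact layer configuration is illustrative.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """A small FCN applied to each ROI feature, producing one binary mask per class."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        # Upsample 14x14 ROI features to 28x28 and predict K mask logits per pixel.
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.predict = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_feats):             # roi_feats: (N, C, 14, 14)
        x = self.convs(roi_feats)
        x = torch.relu(self.upsample(x))
        return self.predict(x)                # (N, K, 28, 28) mask logits

masks = MaskHead()(torch.randn(4, 256, 14, 14))
print(masks.shape)   # torch.Size([4, 80, 28, 28])
```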

Mask R-CNN also proposes two small improvements that make the segmentation result better. First, the coupling between different classes is removed when each region is segmented. Given K object classes, a typical segmentation method directly predicts an output with K channels, one per class, so the classes compete with each other. Mask R-CNN instead predicts a separate binary foreground/background mask for each of the K classes, so the predictions for different classes are independent. Second, the ROI Pooling used in Faster R-CNN maps each ROI to a fixed size and rounds coordinates during pooling, so there is no exact, continuous correspondence between the feature maps before and after pooling. For example, if the size before pooling is 112 × 112 and the size after pooling is 7 × 7, a pixel whose abscissa before pooling is x maps to abscissa x/16 after pooling (112/7 = 16), which is then rounded to find its position in the 7 × 7 grid. Since rounding introduces error, Mask R-CNN does not round; it obtains the feature values by bilinear interpolation instead, and this method is called RoIAlign.
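The 112 × 112 to 7 × 7 example can be made concrete in a few lines: ROI Pooling rounds the mapped coordinate to the nearest cell, while RoIAlign keeps the fractional coordinate and reads the feature value by bilinear interpolation. The NumPy sketch below contrasts the two for a single coordinate (single-channel feature map for brevity).

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Read fmap (H x W) at a fractional (x, y) location by bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, fmap.shape[1] - 1), min(y0 + 1, fmap.shape[0] - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * fmap[y0, x0] + wx * (1 - wy) * fmap[y0, x1] +
            (1 - wx) * wy * fmap[y1, x0] + wx * wy * fmap[y1, x1])

fmap = np.random.rand(112, 112)    # region feature map before pooling
x_before = 30                      # abscissa of a pixel before pooling
x_after = x_before / 16            # 112 / 7 = 16, so the 7x7 coordinate is 1.875

# ROI Pooling: round the coordinate, which introduces a quantization error.
print(round(x_after))              # 2

# RoIAlign: keep the fractional coordinate and sample the underlying
# feature map at fractional positions by bilinear interpolation instead.
print(bilinear_sample(fmap, x=30.4, y=51.7))
```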

Code for the above methods:

R-CNN

  • Caffe version: RBGirshick/RCNN

Fast R-CNN

  • Caffe version: RBGirshick/fast-RCNN

Faster R-CNN

  • Caffe: github.com/rbgirshick/…
  • PyTorch version: github.com/longcw/fast…
  • MATLAB version: github.com/ShaoqingRen…

Mask R-CNN

  • PyTorch version: github.com/felixgwu/ma…
  • TensorFlow version: github.com/CharlesShan…

Note: R-CNN, Fast R-CNN and Faster R-CNN were summarized in previous articles; Mask R-CNN is added here. More object-detection methods are surveyed in the paper "Progress of Deep Convolutional Neural Networks in Object Detection". The article "A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN" also covers these methods in detail.