Heart of Machine has covered a number of object detection algorithms before; readers interested in computer vision can deepen their understanding by pairing this article with the earlier ones:

  • A comprehensive review of deep learning object detection models: Faster R-CNN, R-FCN and SSD

  • A PyTorch project from scratch: implementing YOLO v3 object detection

  • Deconstructing Faster R-CNN like Lego: how object detection works

  • Progress in object detection and instance segmentation in the post-R-CNN era

  • An overview of object detection algorithms: from traditional detection methods to deep neural network frameworks

Region proposal-based object detectors

Sliding window detector

Since AlexNet won the ILSVRC 2012 challenge, classification with CNNs has become mainstream. A brute-force approach to object detection is to slide a window across the image from left to right and top to bottom, and use classification to identify the object inside the window. To detect different object types at different viewing distances, we use windows of different sizes and aspect ratios.

Sliding windows (right to left, top to bottom)

We cut image patches from the image according to the sliding window. Since many classifiers only take fixed-size images, the patches are warped to that size. This does not hurt classification accuracy, because the classifier is trained to handle warped images.

Warping an image patch into a fixed-size image

The warped image patches are fed into a CNN classifier, which extracts 4,096 features. We then apply an SVM classifier to identify the class and a separate linear regressor to refine the bounding box.

System flow chart of sliding window detector.

Here is the pseudocode. We create many windows to detect different objects at different locations; a sketch of how such multi-scale windows can be enumerated follows the loop below. One obvious way to improve performance is to reduce the number of windows.

for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)
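To make the idea concrete, here is a minimal sketch of how such multi-scale windows could be enumerated. The sizes, aspect ratios and stride below are illustrative assumptions, not values from any particular detector.

import itertools

# Enumerate sliding windows of several sizes and aspect ratios over an image.
# The concrete sizes/stride here are arbitrary, purely for illustration.
def generate_windows(img_w, img_h, sizes=(64, 128, 256),
                     aspect_ratios=(0.5, 1.0, 2.0), stride=32):
    windows = []
    for size, ar in itertools.product(sizes, aspect_ratios):
        w = int(size * ar ** 0.5)          # keep area ~ size^2, width/height = ar
        h = int(size / ar ** 0.5)
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                windows.append((x, y, w, h))   # (left, top, width, height)
    return windows

print(len(generate_windows(640, 480)))     # already thousands of windows

Even for a modest image the count reaches thousands of windows, which is why reducing the number of windows matters so much.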

Selective search

Instead of a brute-force approach, we use a region proposal method to create regions of interest (ROIs) for object detection. In selective search (SS), we start by treating each individual pixel as a group. We then compute the texture of each group and merge the two groups that are closest. To keep a single region from swallowing all the others, we prefer to merge smaller groups first. We keep merging regions until everything is combined. The first row of the figure below shows how the regions grow, and the blue rectangles in the second row show all the possible ROIs produced during the merging; a sketch of the merging loop follows the figure.

Source: Van de Sande et al. ICCV’11
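Below is a hedged sketch of that merging loop. The `initial_segments` over-segmentation and the `similarity` measure (colour, texture, size and fill in the actual paper) are placeholders passed in by the caller, and the `bbox` and `merge` helpers on each region are hypothetical as well.

# Hedged sketch of selective search's hierarchical grouping.
# `initial_segments(image)` and `similarity(a, b)` stand in for the
# over-segmentation and the colour/texture/size/fill measures of the paper;
# each region object is assumed to expose `.bbox` and `.merge(other)`.
def selective_search(image, initial_segments, similarity):
    regions = list(initial_segments(image))
    proposals = [r.bbox for r in regions]      # every region is a candidate ROI
    while len(regions) > 1:
        # pick the most similar pair (the real method only considers neighbours,
        # and its size term makes smaller groups merge first)
        a, b = max(((a, b) for a in regions for b in regions if a is not b),
                   key=lambda pair: similarity(*pair))
        merged = a.merge(b)
        regions.remove(a)
        regions.remove(b)
        regions.append(merged)
        proposals.append(merged.bbox)
    return proposals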

R-CNN

R-CNN uses a region proposal method to create about 2,000 ROIs. The regions are warped into fixed-size images and fed into a convolutional neural network one by one. The network is followed by several fully connected layers that classify the object and refine the bounding box.

Region proposals, a CNN, and fully connected (affine) layers are used to locate objects.

Here’s a flow chart of R-CNN’s entire system:

By using far fewer but higher-quality ROIs, R-CNN is faster and more accurate than the sliding-window approach.

ROIs = region_proposal(image)
for ROI in ROIs:
    patch = get_patch(image, ROI)
    results = detector(patch)

Bounding box regressor

Region proposal methods are computationally expensive. To speed things up, we typically build the ROIs with a cheaper region proposal method and then further refine the bounding boxes with a linear regressor (implemented with fully connected layers); the usual parameterization is sketched after the figure below.

A regression method is used to refine the original blue bounding boxes into red ones.
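As a concrete reference, the parameterization commonly used in the R-CNN family expresses the regression targets as offsets relative to the proposal's centre plus log-scaled size ratios. The sketch below assumes boxes are given as (centre x, centre y, width, height).

import math

# Regression targets in the usual R-CNN-style parameterization:
# offsets are relative to the proposal's centre and log-scaled in size.
def bbox_regression_targets(proposal, gt):
    px, py, pw, ph = proposal      # centre x, centre y, width, height
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw
    ty = (gy - py) / ph
    tw = math.log(gw / pw)
    th = math.log(gh / ph)
    return tx, ty, tw, th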

Fast R-CNN

R-CNN needs a large number of region proposals to be accurate, and many of these regions overlap each other, so R-CNN is very slow in both training and inference. If we have 2,000 proposals and each of them is fed to the CNN independently, we repeat feature extraction 2,000 times for different ROIs.

In addition, the feature maps of a CNN represent spatial features in a dense way. Can we use the feature maps directly, instead of the original image, to detect objects?

Use feature maps directly to calculate ROI.

Fast R-CNN uses a feature extractor (a CNN) to extract features from the whole image once, instead of extracting features from each image patch from scratch. The region proposal method can then be applied directly to the extracted feature maps. For example, Fast R-CNN selects the conv5 layer of VGG16 to generate ROIs, which are then combined with the corresponding feature maps and cropped into feature patches for object detection. We use ROI pooling to convert the feature patches to a fixed size and feed them to fully connected layers for classification and localization. Because Fast R-CNN does not extract features repeatedly, it cuts processing time significantly.

Region proposals are applied directly to the feature map and pooled into fixed-size feature patches with ROI pooling.

Here’s a flow chart of Fast R-CNN:

In the pseudocode below, the computationally expensive feature extraction is moved out of the for loop, which is a significant speedup. Fast R-CNN is 10 times faster in training and 150 times faster in inference than R-CNN.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

The most important point about Fast R-CNN is that the whole network, including the feature extractor, the classifier and the bounding box regressor, can be trained end-to-end with a multi-task loss. This multi-task loss combines a classification loss and a localization loss, which greatly improves accuracy; a sketch of such a loss follows below.
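A minimal sketch of such a multi-task loss, assuming a single ROI with softmax class probabilities, a ground-truth class index (0 = background) and box offsets; the smooth-L1 form and the lambda balancing weight follow the Fast R-CNN paper.

import numpy as np

# Hedged sketch of a multi-task loss for one ROI: cross-entropy on the class
# plus a smooth-L1 box loss that only counts for non-background ROIs.
def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def multi_task_loss(class_probs, true_class, pred_box, true_box, lam=1.0):
    cls_loss = -np.log(class_probs[true_class] + 1e-9)
    diff = np.asarray(pred_box) - np.asarray(true_box)
    loc_loss = smooth_l1(diff).sum() if true_class != 0 else 0.0
    return cls_loss + lam * loc_loss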

ROI pooling

Because Fast R-CNN uses fully connected layers, we apply ROI pooling to convert ROIs of different sizes into a fixed size.

As a simple example, we convert an 8×8 feature map into a predefined 2×2 size below.

  • Top left: the feature map.

  • Top right: overlay the ROI (blue) on the feature map.

  • Bottom left: split the ROI into the target grid. For a 2×2 target, we split the ROI into four sections of similar or equal size.

  • Bottom right: take the maximum of each section to obtain the transformed feature map.

Input feature map (top left), output feature map (bottom right), ROI (top right, blue box).

Following these steps, we obtain a 2×2 feature patch that can be fed into the classifier and the bounding box regressor; a minimal pooling sketch follows below.
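Here is a minimal ROI max-pooling sketch in NumPy for the 8×8 to 2×2 example above; the ROI coordinates are assumed to already be expressed on the feature map grid.

import numpy as np

# Minimal ROI max-pooling sketch: split an ROI of the feature map into an
# output_size x output_size grid and take the max of each cell.
def roi_pooling(feature_map, roi, output_size=2):
    x0, y0, x1, y1 = roi                       # ROI in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, output_size + 1, dtype=int)
    xs = np.linspace(0, w, output_size + 1, dtype=int)
    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

fmap = np.arange(64).reshape(8, 8)             # an 8x8 feature map, as in the example
print(roi_pooling(fmap, (0, 3, 7, 8)))         # a 2x2 pooled block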

Faster R-CNN

Fast R-CNN relies on an external region proposal method such as selective search, but these algorithms run on the CPU and are slow. In testing, Fast R-CNN takes 2.3 seconds to make a prediction, of which 2 seconds are spent generating 2,000 ROIs.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)         # Expensive!
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

Faster R-CNN adopts the same design as Fast R-CNN, except that it replaces the external region proposal method with an internal deep network. The new region proposal network (RPN) is more efficient at generating ROIs and runs at about 10 milliseconds per image.

The flow chart of Faster R-CNN is the same as that of Fast R-CNN.

An internal deep network replaces the external region proposal method.

Region proposal network (RPN)

The region proposal network (RPN) takes the output feature maps of the first convolutional network as input. It slides 3×3 convolutional filters over the feature maps to build class-agnostic region proposals using a convolutional network (the ZF network shown below). Other deep networks such as VGG or ResNet can be used for more comprehensive feature extraction, at the expense of speed. The ZF network outputs 256 values, which are fed into two separate fully connected layers that predict a bounding box and two objectness scores measuring whether the box contains an object. We could just as well use a regressor to compute a single objectness score, but for simplicity Faster R-CNN uses a classifier with just two classes: object and non-object.

For each location in the feature map, the RPN makes k guesses, so it outputs 4×k coordinates and 2×k scores per location. The figure below shows an 8×8 feature map with a 3×3 filter, which outputs a total of 8×8×3 ROIs (for k = 3). The right side of the figure shows the three proposals made at a single location.

Here we get three guesses, which we will refine later. Since we only need one guess to be correct, our initial guesses had better cover different shapes and sizes. For this reason, Faster R-CNN does not create random bounding boxes; instead, it predicts offsets (such as δx, δy) relative to reference boxes called anchors anchored at the top left of each position. Because we constrain the values of these offsets, our guesses still resemble the anchors; a decoding sketch follows below.
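A small sketch of how predicted offsets are typically applied to an anchor in this family of detectors (the inverse of the regression-target parameterization sketched earlier): shifts are scaled by the anchor size and the width/height offsets are exponentiated, so the refined box stays close to the anchor's shape.

import math

# Apply predicted offsets (dx, dy, dw, dh) to an anchor given as
# (centre x, centre y, width, height) to obtain the refined box.
def decode(anchor, deltas):
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + dx * aw,            # shift the centre, scaled by anchor size
            ay + dy * ah,
            aw * math.exp(dw),       # rescale width/height multiplicatively
            ah * math.exp(dh))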

To make k predictions for each location, we need k anchors centered at each location. Each prediction is associated with a specific anchor, but different locations share the same set of anchor shapes.

These anchors are carefully chosen so that they are diverse and cover realistic objects at different scales and aspect ratios. This lets initial training work from better guesses and allows each prediction to specialize in a particular shape, which makes early training more stable and easier.

Faster R-CNN uses more anchors: it deploys 9 anchor boxes, 3 different scales at 3 different aspect ratios. With 9 anchors per location, each location generates 2×9 objectness scores and 4×9 coordinates; these 9 anchor shapes are enumerated in the sketch after the figure.

Photo source: https://arxiv.org/pdf/1506.01497.pdf
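The sketch below enumerates those 9 anchor shapes from 3 scales and 3 aspect ratios. The concrete pixel scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) are the ones reported in the Faster R-CNN paper; the area-preserving formulation here is a common simplification.

# The 9 anchor shapes shared by every location: 3 scales x 3 aspect ratios.
def anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    shapes = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5         # keep area ~ s^2 while width/height = r
            h = s / r ** 0.5
            shapes.append((round(w), round(h)))
    return shapes

print(anchor_shapes())               # 9 (width, height) pairs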

Performance of the R-CNN methods

As you can see in the figure below, Faster R-CNN is much faster.

Region-based Fully Convolutional Network (R-FCN)

Suppose we only have one feature map that detects the right eye. Can we use it to locate a face? Yes. Since the right eye should be in the upper-left region of a face image, we can use that to locate the whole face.

If we have other feature maps that detect the left eye, nose or mouth, then we can combine the results to better locate the face.

Now let’s review the problem. In Faster R-CNN, the detector applies multiple fully connected layers to make predictions. With 2,000 ROIs, this is very expensive.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    class_scores, box = detector(patch)         # Expensive!
    class_probabilities = softmax(class_scores)

R-FCN speeds this up by reducing the amount of work needed for each ROI. The region-based feature maps are independent of the ROIs and can be computed outside of them. The remaining per-ROI work is much simpler, so R-FCN is faster than Faster R-CNN.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
score_maps = compute_score_map(feature_maps)
for ROI in ROIs:
    V = region_roi_pool(score_maps, ROI)
    class_scores, box = average(V)                   # Much simpler!
    class_probabilities = softmax(class_scores)

Now consider a 5×5 feature map M that contains a blue square. We divide the square evenly into a 3×3 grid of regions. We then create a new feature map from M that detects only the top-left (TL) corner of the square. This new feature map is shown below (right); only the yellow grid cell [2, 2] is activated.

Create a new feature map (right) from the one on the left to detect the top-left corner of the object.

By dividing the square into nine parts, we create nine feature maps, each detecting the corresponding region of the object. These feature maps are called position-sensitive score maps, because each one scores a sub-region of the object.

Generate 9 score maps

The red dotted rectangle in the figure below is a proposed ROI. We divide it into 3×3 regions and ask how likely each region is to contain the corresponding part of the object; for example, how likely the top-left ROI region is to contain the left eye. We store the results in a 3×3 vote array, shown below (right). For example, vote_array[0][0] holds the score for whether the top-left region contains the corresponding part of the object.

Apply the ROI to the feature maps and output a 3×3 array.

The process of mapping score maps and ROIs to the vote array is called position-sensitive ROI pooling. It is very similar to the ROI pooling discussed earlier.

Calculate V[i][j] by overlaying part of the ROI onto the corresponding score map.

After computing all the values of position-sensitive ROI pooling, the class score is the average of all its elements; a sketch follows the figure below.

ROI pooling
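A hedged sketch of position-sensitive ROI pooling for a single class, assuming the k×k score maps are stacked into an array of shape (k, k, H, W) and the ROI is given in score-map coordinates.

import numpy as np

# Position-sensitive ROI pooling sketch: cell (i, j) of the ROI is pooled
# from score map (i, j) only, then the k x k votes are averaged.
def ps_roi_pool(score_maps, roi, k=3):
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, k + 1, dtype=int)
    ys = np.linspace(y0, y1, k + 1, dtype=int)
    votes = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            cell = score_maps[i, j, ys[i]:ys[i+1], xs[j]:xs[j+1]]
            votes[i, j] = cell.mean() if cell.size else 0.0
    return votes

def class_score(votes):
    return votes.mean()              # average of all k x k elements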

Let’s say we have C classes to detect. We extend this to C + 1 classes by adding a background (non-object) class. Each class has its own 3×3 score maps, giving (C + 1) × 3 × 3 score maps in total. Each class’s score maps are used to predict that class’s score, and we then apply a softmax over these scores to compute the probability of each class.

Here is the data flow diagram, in our case, k=3.

Conclusion

We first looked at the basic sliding window algorithm:

for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)

Then we tried to reduce the number of windows to minimize the work inside the for loop.

ROIs = region_proposal(image)
for ROI in ROIs:
    patch = get_patch(image, ROI)
    results = detector(patch)

Single-shot object detectors

In this second part, we review single-shot object detectors (including SSD, YOLO, YOLOv2 and YOLOv3). We analyze FPN to understand how multi-scale feature maps improve accuracy, especially for small objects, where single-shot detectors usually perform poorly. We then analyze Focal Loss and RetinaNet to see how they address class imbalance during training.

Single-shot detectors

In Faster R-CNN, there is a dedicated region proposal network followed by a classifier.

Faster R-CNN workflow

Region-based detectors are accurate, but they come at a cost: Faster R-CNN processes only 7 frames per second (7 FPS) on the PASCAL VOC 2007 test set. Like R-FCN, researchers have streamlined the pipeline by reducing the amount of work done per ROI.

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_align(feature_maps, ROI)
    results = detector2(patch)    # Reduce the amount of work here!

But do we really need a separate region proposal step? Can we obtain the bounding boxes and classes directly in one step?

feature_maps = process(image)
results = detector3(feature_maps)    # No more separate step for ROIs

Let’s look at the sliding-window detector again. We can detect objects by sliding windows over the feature map, using different window types for different object types. The fatal flaw of the earlier sliding-window approach is that it uses the window itself as the final bounding box, which requires a very large number of window shapes to cover most objects. A more efficient approach is to treat the window as an initial guess, so we get a detector that predicts both the class and the bounding box starting from the current sliding window.

Prediction based on sliding window

This concept is similar to the anchors in Faster R-CNN, except that the single-shot detector predicts both the bounding box and the class at the same time. For example, with an 8×8 feature map we make k predictions at each location, for a total of 8×8×k predictions.

64 locations

At each location, we have k anchors (fixed initial bounding box guesses), each tied to that particular location. The anchors are carefully chosen, and every location uses the same set of anchor shapes.

Four predictions are made at each location using four anchors.

Below are the four anchors (green) and the four predictions (blue), each of which corresponds to a specific anchor.

Four predictions, one anchor for each prediction.

In Faster R-CNN, we use a convolutional filter to predict 5 parameters: 4 for the predicted box of an anchor and 1 for the objectness confidence score. So a 3×3×D×5 convolutional filter transforms the feature map from 8×8×D to 8×8×5.

Predictions were calculated using a 3×3 convolution kernel.

In a single-shot detector, the convolutional filters also predict C class probabilities for classification (one per class). So we apply a 3×3×D×25 convolutional filter to transform the feature map from 8×8×D to 8×8×25 (with C = 20); the shape bookkeeping is sketched after the figure below.

k predictions are made at each location, and each prediction has 25 parameters.
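The bookkeeping below just tallies the numbers from the text: at every feature-map location, each of the k anchors predicts 4 box offsets, 1 objectness score and C class scores, all produced by one convolutional layer.

# Shape bookkeeping for the single-shot prediction head.
def head_output_shape(fmap_h=8, fmap_w=8, k=3, num_classes=20):
    per_anchor = 4 + 1 + num_classes           # 25 when C = 20
    return (fmap_h, fmap_w, k * per_anchor)

print(head_output_shape())                      # (8, 8, 75) for k = 3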

Single-shot detectors usually trade accuracy for real-time processing speed. They also tend to have trouble with objects that are too close together or too small. In the image below there are nine Santas in the lower-left corner, but one single-shot detector finds only five of them.

SSD

SSD is a single-shot detector that uses a VGG16 network as the feature extractor (equivalent to the CNN in Faster R-CNN). We add custom convolutional layers after that network (blue) and use convolutional filters (green) to make predictions.

Perform a single prediction for both category and location.

However, convolutional layers reduce the spatial dimensions and resolution, so the model above can only detect larger objects. To fix this, we perform independent object detection from multiple feature maps.

Multi-scale feature maps were used for detection.

The following is an illustration of the feature map.

Photo source: https://arxiv.org/pdf/1512.02325.pdf

SSD uses layers that are already deep in the convolutional network to detect objects. If we redraw the figure close to its real scale, we notice that the spatial resolution has dropped significantly and that SSD may already have missed the chance to locate small objects that are too hard to detect at low resolution. If this is a problem, we need to increase the resolution of the input image.

YOLO

YOLO is another single-pass target detector.

YOLO uses DarkNet for feature extraction, followed by convolutional layers for detection.

However, it does not perform independent detections from multi-scale feature maps. Instead, it partially flattens a higher-resolution feature map and concatenates it with a lower-resolution one. For example, YOLO reshapes a 28×28×512 layer into 14×14×2048 and concatenates it with a 14×14×1024 feature map. YOLO then applies convolutional filters to the new 14×14×3072 layer to make predictions; a sketch of this reshape-and-concatenate step follows below.
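A sketch of that reshape-and-concatenate step (often called a pass-through or reorg layer) using NumPy; the exact layer names and tensor layout in the actual Darknet implementation may differ.

import numpy as np

# Space-to-depth sketch: every 2x2 spatial block of the 28x28x512 map becomes
# extra channels, giving 14x14x2048, which is then concatenated with the
# 14x14x1024 map along the channel axis.
def space_to_depth(x, block=2):
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // block, w // block, c * block * block)

fine = np.zeros((28, 28, 512))
coarse = np.zeros((14, 14, 1024))
merged = np.concatenate([space_to_depth(fine), coarse], axis=-1)
print(merged.shape)                             # (14, 14, 3072)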

YOLO (V2) made a number of implementation improvements, increasing the mAP value from 63.4 when first released to 78.6. The YOLO9000 can detect 9,000 different categories of targets.

Photo source: https://arxiv.org/pdf/1612.08242.pdf

Below is the mAP and FPS comparison of different detectors in YOLO’s paper. YOLOv2 can process input images of different resolutions. Lower resolution images result in higher FPS, but lower mAP values.

Photo source: https://arxiv.org/pdf/1612.08242.pdf

YOLOv3

YOLOv3 uses a more sophisticated backbone network to extract features. Darknet-53 mainly consists of 3×3 and 1×1 convolutional filters with skip connections like those in ResNet. Darknet-53 has fewer BFLOPs (billions of floating-point operations) than ResNet-152 but achieves the same classification accuracy at twice the speed.

Photo source: https://pjreddie.com/media/files/papers/YOLOv3.pdf

YOLOv3 also adds a feature pyramid to better detect small targets. Here is the trade-off between accuracy and speed for different detectors.

Photo source: https://pjreddie.com/media/files/papers/YOLOv3.pdf

Feature Pyramid Network (FPN)

Detecting objects at different scales is challenging, especially for small objects. The Feature Pyramid Network (FPN) is a feature extractor designed with both accuracy and speed in mind. It replaces the feature extractor in detectors such as Faster R-CNN and produces higher-quality feature map pyramids.

The data flow

FPN (figure source: https://arxiv.org/pdf/1612.03144.pdf)

FPN consists of bottom-up and top-down paths. The bottom-up path is a commonly used convolutional network for feature extraction. The spatial resolution declines from the bottom up. As higher-level structures are detected, the semantic value of each layer increases.

Feature Extraction in FPN (edited from original paper)

SSD performs detection from multiple feature maps, but it does not use the bottom layers for detection. Those layers have high resolution but not enough semantic value to justify their use, since the slowdown would be significant. So SSD uses only the upper layers for detection, and therefore performs much worse on small objects.

Figure modified from the paper https://arxiv.org/pdf/1612.03144.pdf

FPN provides a top-down path to build high-resolution layers from semantically rich layers.

Top-down reconstruction of spatial resolution (edited from the original paper)

Although the reconstructed layers are semantically strong, the locations of objects are no longer precise after all the down-sampling and up-sampling. Adding lateral connections between the reconstructed layers and the corresponding feature maps makes location predictions more accurate; one merge step is sketched after the figure.

Add skip connections (from the original paper)
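A minimal sketch of one top-down merge step, assuming the lateral map has already passed through its 1×1 convolution and using nearest-neighbour upsampling; the real FPN also applies a 3×3 convolution to each merged map before it is used for prediction.

import numpy as np

# One FPN merge step: upsample the coarser top-down map 2x and add the
# laterally connected bottom-up map (assumed already 1x1-convolved).
def upsample2x(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

def merge_level(top_down, lateral):
    return upsample2x(top_down) + lateral

p5 = np.zeros((7, 7, 256))
c4_lateral = np.zeros((14, 14, 256))            # hypothetical lateral feature map
p4 = merge_level(p5, c4_lateral)
print(p4.shape)                                  # (14, 14, 256)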

The figure below details the bottom-up and top-down pathways, where P2, P3, P4 and P5 form the feature map pyramid used for object detection.

FPN with RPN

FPN is not an object detector by itself; it is a feature extractor that works together with an object detector. Object detection is performed separately on each feature map (P2 to P5).

FPN combined with Fast R-CNN or Faster R-CNN

With FPN, we generate a pyramid of feature maps and use an RPN (see above) to generate ROIs. Based on the size of each ROI, we select the feature map layer at the most appropriate scale to extract the feature patches; the level assignment is sketched below.
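The level assignment can be sketched with the formula from the FPN paper, which maps an ROI of width w and height h (in image pixels) to pyramid level k, with k0 = 4 corresponding to ROIs around 224×224 and the result clipped to the available levels P2 to P5.

import math

# FPN-style level assignment: larger ROIs go to coarser pyramid levels.
def assign_pyramid_level(w, h, k0=4):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return min(max(k, 2), 5)

print(assign_pyramid_level(224, 224))   # 4
print(assign_pyramid_level(64, 64))     # 2, a small ROI uses a finer level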

Hard negative mining

For most detection algorithms such as SSD and YOLO, we make far more predictions than there are actual objects, so there are many more negative matches than positive ones. This creates a class imbalance that hurts training: the model learns more about the background than about detecting objects. However, we still need negative samples to learn what a bad prediction looks like. So we compute the confidence loss of the training samples, sort them, and pick the hardest ones so that the ratio of negatives to positives is at most 3:1. This makes training faster and more stable; a sketch of the selection follows below.
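A minimal sketch of that selection step, assuming per-prediction confidence losses and a boolean mask marking the positive matches.

import numpy as np

# Hard negative mining sketch: keep all positives, then keep only the
# negatives with the highest confidence loss, at most 3 per positive.
def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    conf_loss = np.asarray(conf_loss)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos_idx = np.where(is_positive)[0]
    neg_idx = np.where(~is_positive)[0]
    order = neg_idx[np.argsort(-conf_loss[neg_idx])]      # hardest negatives first
    keep_neg = order[: neg_pos_ratio * len(pos_idx)]
    return np.concatenate([pos_idx, keep_neg])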

Non-maximum suppression during inference

The detector makes duplicate detections of the same object. We use non-maximum suppression (NMS) to remove the lower-confidence duplicates: sort the predictions by confidence score, and starting from the top, remove any remaining prediction of the same class whose IoU with a kept prediction is greater than 0.5. A minimal sketch follows below.
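A minimal per-class NMS sketch with boxes given as (x0, y0, x1, y1); in practice this is run separately for each class.

import numpy as np

# IoU of two axis-aligned boxes given as (x0, y0, x1, y1).
def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

# Keep the highest-confidence box, drop remaining boxes with IoU > threshold.
def nms(boxes, scores, iou_threshold=0.5):
    order = list(np.argsort(-np.asarray(scores)))
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep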

Focal Loss (RetinaNet)

Class imbalance hurts performance. SSD resamples the ratio of object and background examples during training so that it is not overwhelmed by the image background. Focal Loss (FL) takes a different approach: it reduces the loss of well-trained (high-confidence) classes. So as long as the model detects the background well, its loss is reduced and training re-focuses on the object classes. We start with the cross-entropy loss CE and add a weight that down-scales the CE of high-confidence classes (sketched below the figure).

For example, if γ = 0.5, the Focal loss of a well-classified sample tends to 0.

Edited from the original paper
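In symbols, with p_t the predicted probability of the true class, CE(p_t) = -log(p_t) and FL(p_t) = -(1 - p_t)^γ · log(p_t); the α-balancing term from the RetinaNet paper is omitted here. A tiny numeric sketch, using the γ = 0.5 from the example above:

import math

# Cross entropy and focal loss for a single prediction, where p_t is the
# predicted probability of the true class (alpha-balancing omitted).
def cross_entropy(p_t):
    return -math.log(p_t)

def focal_loss(p_t, gamma=0.5):
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

print(cross_entropy(0.99), focal_loss(0.99))   # well-classified: FL is ~10x smaller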

This is RetinaNet, built on an FPN with a ResNet backbone and trained with Focal Loss.

RetinaNet


Original links:
https://medium.com/@jonathan_hui/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9
https://medium.com/@jonathan_hui/what-do-we-learn-from-single-shot-object-detectors-ssd-yolo-fpn-focal-loss-3888677c5f4d