Series so far:
[Object detection (1)] R-CNN in detail — the first deep-learning object detection work
[Object detection (2)] SPP-Net — sharing convolution computation
[Object detection (3)] Fast R-CNN — making R-CNN trainable end-to-end
[Object detection (4)] Faster R-CNN — an RPN replaces Selective Search
[Object detection (5)] YOLOv1 — opening the chapter of one-stage object detection
[Object detection (6)] YOLOv2 — introducing anchors; Better, Faster
Understanding the regression-box loss functions of object detection — IoU, GIoU, DIoU, CIoU: principles and Python code
FPN in detail — multi-scale feature fusion through the feature pyramid network

Early in the development of object detection, two-stage algorithms did not use multi-scale features. The SSD algorithm later demonstrated the effectiveness of introducing multi-scale features into object detection, and FPN brought multi-scale features to two-stage detectors by proposing the feature pyramid network, which greatly improves detection accuracy.

FPN Original paper: arxiv.org/pdf/1612.03…

1. Motivation

The object detection task consists of two sub-tasks: classification and localization. Among the features learned by a deep neural network, shallow layers capture low-level physical information, such as the corners and edges of objects, while deep layers capture semantic information, which is more high-level and abstract. For classification, the features learned by deep layers tend to matter more; for localization, deep and shallow features are equally important, because shallow physical detail is essential for accurate localization. Most object detection algorithms before FPN made predictions only from the top-level features, whose spatial detail is relatively coarse; even when feature fusion was used, prediction was generally made only from the final fused features. FPN proposes a feature pyramid network that makes independent predictions at different feature levels.

2. The principle of FPN

2.1 Pyramid design concept

The figure above, taken from the FPN paper, illustrates several ideas for pyramid design:

  • Figure (a): several images of different scales are fed into the model. This is very inefficient, since it requires multiple forward passes.
  • Figure (b): only single-scale features are used for prediction, as in Faster R-CNN.
  • Figure (c): predictions are made at different scales inside the CNN, introducing the multi-scale concept into the network itself, as in SSD and YOLOv3.
  • Figure (d): a top-down pathway is added, with interaction between high-level and low-level feature maps (upsampling plus lateral element-wise addition). The authors call this structure the feature pyramid network.

2.2 FPN network structure

FPN is structurally similar to U-Net: it first downsamples, then upsamples and fuses features. The FPN structure has three main parts: the bottom-up pathway, the top-down pathway, and lateral connections.

2.2.1 Bottom-up

The bottom-up pathway is the ordinary forward pass: the image is fed into the network and the backbone extracts the output of the last layer of each module (stage). Taking ResNet as an example, the outputs of conv2 through conv5 are denoted {C2, C3, C4, C5}; each is the output of the last residual block of its stage, with sizes of {1/4, 1/8, 1/16, 1/32} of the input image. For example, for a 640*640*3 input, C2-C5 have shapes {160*160*256, 80*80*512, 40*40*1024, 20*20*2048} respectively: from one stage to the next, the spatial size is halved and the number of channels is doubled.
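The stride/channel pattern above can be sketched as a small helper (a hypothetical function, assuming the ResNet-style strides of 4, 8, 16, 32 and channel counts 256, 512, 1024, 2048 from the example):

```python
def backbone_shapes(h, w):
    """Compute the (H, W, C) shapes of C2-C5 for an h*w input image,
    assuming ResNet-style strides and channel counts."""
    strides = [4, 8, 16, 32]
    channels = [256, 512, 1024, 2048]
    return [(h // s, w // s, c) for s, c in zip(strides, channels)]

print(backbone_shapes(640, 640))
# [(160, 160, 256), (80, 80, 512), (40, 40, 1024), (20, 20, 2048)]
```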

2.2.2 Top-down

The top-down pathway propagates the high-level feature maps downward via upsampling. High-level features are rich in semantic information, so when they are propagated top-down, the low-level features, rich in physical information, also come to contain rich semantics. The original paper uses nearest-neighbor interpolation to enlarge the feature map by a factor of two.
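Nearest-neighbor 2x upsampling simply repeats each pixel along both spatial axes; a minimal numpy sketch:

```python
import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map:
    each pixel is repeated twice along both spatial axes."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

p5 = np.zeros((20, 20, 256))
print(upsample2x_nearest(p5).shape)  # (40, 40, 256)
```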

2.2.3 Lateral Connection

Lateral connection mainly includes three steps:

(1) The feature map Cn output by each stage first passes through a 1*1 convolution, which reduces and unifies the channel dimension (to 256 in the paper).

(2) The feature map of the level above, Pn+1, is then upsampled and merged with it by element-wise addition. Note that in the backbone, consecutive extracted feature levels differ in spatial size by a factor of two, and the channel dimensions are unified by the top-down upsampling and the lateral 1*1 convolution, so the two maps can be added directly.

(3) After the addition, a 3*3 convolution produces the output feature Pn of this level. The paper states that the purpose of the 3*3 convolution is to eliminate the aliasing effect of upsampling; this refers to the gray-level discontinuities produced by interpolation mentioned above, where obvious jagged edges can appear at abrupt intensity changes. Because all pyramid levels share the same classifiers/regressors, the output dimensions are unified to 256, i.e., these 3*3 convolutions all have 256 output channels.

2.3 FPN for RPN

The figure below illustrates the application of FPN in the RPN. On the far left a ResNet50 backbone is taken as an example; it passes through the FPN to generate the P2-P6 feature levels. P6 is used only by the RPN in Faster R-CNN, not by the Fast R-CNN head, and is obtained by downsampling P5; it is used to predict larger targets. The RPN and Fast R-CNN heads are shared across the different feature levels, so the channels of P2-P6 are all 256. Each of P2-P6 is followed by an RPN head that generates 3 anchors at each point on the feature map (one scale per level, each with 3 aspect ratios). The five levels {P2, P3, P4, P5, P6} correspond to anchor areas of {32^2, 64^2, 128^2, 256^2, 512^2} respectively, so there are 15 anchor shapes in total, followed by two parallel 1*1 convolution branches for anchor classification and regression.
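The 15 anchor shapes can be enumerated directly; this sketch assumes the usual aspect ratios {0.5, 1, 2} and keeps each anchor's area equal to its level's base scale squared:

```python
# One base scale per level P2..P6, three aspect ratios (height/width) per scale.
scales = [32, 64, 128, 256, 512]
ratios = [0.5, 1.0, 2.0]

anchors = []
for s in scales:
    for r in ratios:
        # Keep the anchor area equal to s*s while varying the aspect ratio.
        w = s / r ** 0.5
        h = s * r ** 0.5
        anchors.append((w, h))

print(len(anchors))  # 15
```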

Here, the division of anchors into positive and negative samples is consistent with Faster R-CNN: an anchor with the maximum IoU with a GT box, or with IoU greater than 0.7, is a positive sample, while an anchor with IoU less than 0.3 is a negative sample. Note that the RPN head is shared across all levels.

2.4 FPN for Fast RCNN

In Fast R-CNN there is an RoI Pooling layer, which takes the region proposals and the feature maps as input, pools the features corresponding to each proposal, and feeds them to the classification and bounding-box regression branches. Previously, Fast R-CNN used a single-scale feature map, but now feature maps of different scales are available (note that P2-P5 are used here, not P6). So from which scale should each RoI extract its features? The paper argues that RoIs of different scales should use different feature levels as the input to the RoI Pooling layer: large RoIs use the later pyramid levels, such as P5, while small RoIs use earlier, finer levels, such as P4. How is the level chosen for a given RoI? The paper defines the following formula:

k = ⌊k0 + log2(√(wh)/224)⌋. In the formula, k indicates which level Pk is passed into the RoI Pooling layer as the feature map; the initial value is k0 = 4, and w and h are the width and height of the region proposal given by the RPN. For example, if w and h are both 112, then √(wh)/224 = 0.5 and log2(0.5) = -1, so k = 3: the P3 feature level and the region proposal are fed into RoI Pooling to obtain a 7*7 feature, which is flattened and then passed to the fully connected layers.
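The level-assignment rule is easy to implement; this sketch additionally clamps k to the levels P2-P5 that the Fast R-CNN head actually uses (the clamping bounds are an assumption following standard practice):

```python
import math

def roi_level(w, h, k0=4, k_min=2, k_max=5):
    """FPN RoI-to-level assignment: k = floor(k0 + log2(sqrt(w*h) / 224)),
    clamped to the available pyramid levels."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

print(roi_level(112, 112))  # 3 -- the example from the text
print(roi_level(224, 224))  # 4 -- a 224*224 RoI maps to the default level k0
print(roi_level(32, 32))    # 2 -- small RoIs fall to the finest level (clamped)
```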

3. Analysis of the effects and advantages and disadvantages of FPN

3.1 Effects and advantages of FPN

  • The improvement in accuracy is significant, especially for small targets, as shown in the figure below.

  • FPN can be embedded into almost any network structure and task as a freely pluggable module, and has been widely adopted in industry.

3.2 Disadvantages of FPN

  • High computation and memory cost: inserting a large U-Net-like structure into the detection network inevitably reduces its speed.

Reference:

  1. FPN paper: arxiv.org/pdf/1612.03…
  2. zhuanlan.zhihu.com/p/62604038
  3. Summary of FPN improvements: zhuanlan.zhihu.com/p/148738276