Link to the paper: arxiv.org/abs/1711.11…

Paper code: not published yet

Summary of background

Before the rise of deep learning, many improvement methods in the object detection field added context information and relationships between objects on top of classical detection models in order to boost detection performance. However, this kind of approach does not seem to carry over well to deep architectures (I had also thought about this kind of improvement before, but training deep networks really does feel like alchemy: simply throwing in the raw data often trains better). This is because deep learning is still a black-box model even now, and the mainstream view holds that convolutional neural networks have large receptive fields and already learn the context information of objects during training. This paper is inspired by Google’s NLP paper “Attention Is All You Need”, and many of its formulas are written in analogy to those in Google’s paper. Google’s paper is built entirely on attention and achieves state-of-the-art results without using any recurrent or convolutional structure. Since I don’t know much about NLP, I had not read Google’s paper before reading this one. Still, the idea of this paper is very interesting, and I believe that combining the clever attention mechanism with the powerful feature extraction ability of CNNs will bring more improvements worth looking forward to.

Object Relation Module

In previous CNN-based object detection methods, each object is recognized individually. This paper instead applies a Relation Module to a group of objects simultaneously: each object’s features are fused with relation features computed from the other objects, which helps enrich the representation. Moreover, the feature dimension does not change before and after the Relation Module, which means the module can be plugged into any classical CNN-based object detection framework. The module diagram of this model is as follows:




The relation feature $f_R(n)$ of the $n$-th object with respect to the whole set of objects is

$$f_R(n) = \sum_{m} \omega^{mn} \cdot \left(W_V \cdot f_A^m\right) \tag{1}$$

where $W_V$ is a linear transformation and $\omega^{mn}$ is the relation weight, indicating how strongly the $n$-th object is affected by the $m$-th object.


The relation weight is computed as

$$\omega^{mn} = \frac{\omega_G^{mn} \cdot \exp\left(\omega_A^{mn}\right)}{\sum_{k} \omega_G^{kn} \cdot \exp\left(\omega_A^{kn}\right)} \tag{2}$$

The denominator is simply the normalization of the numerator. $\omega_A^{mn}$ is the appearance weight, computed by a dot product:


$$\omega_A^{mn} = \frac{\operatorname{dot}\!\left(W_K f_A^m,\; W_Q f_A^n\right)}{\sqrt{d_k}} \tag{3}$$

where $W_K$ and $W_Q$ project the original features $f_A^m$ and $f_A^n$ into a subspace in which how well they match can be measured. $\omega_G^{mn}$ is the geometric weight:


$$\omega_G^{mn} = \max\!\left\{0,\; W_G \cdot \mathcal{E}_G\!\left(f_G^m, f_G^n\right)\right\} \tag{4}$$

where the computation of $\mathcal{E}_G$ is divided into two steps:

  1. Embed the geometric features of the two objects into a high-dimensional representation, denoted $\mathcal{E}_G$. First, compute the relative position of objects $m$ and $n$: a four-dimensional vector built from the center coordinates and the widths and heights of the two boxes (a small sketch of this computation follows after this list).
  2. Map this 4-dimensional relative-position vector to a 64-dimensional vector, take the inner product with $W_G$, and pass the result through a ReLU (the $\max\{0,\cdot\}$ in Eq. (4)).
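As a small aside, the 4-dimensional relative position in step 1 can be sketched roughly as follows, assuming boxes are given as (x_center, y_center, width, height) and a log-scaled relative form; the exact ordering and scaling of the components is my assumption, not taken from released code.

```python
import torch

def relative_geometry(box_m: torch.Tensor, box_n: torch.Tensor) -> torch.Tensor:
    """4-d relative geometry between boxes m and n, both given as
    (x_center, y_center, width, height). Log-scaling makes the feature
    insensitive to translation and overall scale."""
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    eps = 1e-6  # avoid log(0) when centers coincide
    return torch.stack([
        torch.log((xm - xn).abs().clamp(min=eps) / wm),
        torch.log((ym - yn).abs().clamp(min=eps) / hm),
        torch.log(wn / wm),
        torch.log(hn / hm),
    ])
```

Stacking this over all ordered pairs of boxes gives the (N, N, 4) pairwise geometry tensor consumed by the relation module sketch later in this post.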

The $f_R(n)$ in Equation (1) is one relation feature extracted for the $n$-th object. For each object, $N_r$ such relation features are extracted ($N_r = 16$ in the paper), concatenated, and then added to the original feature of the $n$-th object itself to obtain the relation-augmented feature. The formula is as follows:


$$f_A^n \;\leftarrow\; f_A^n + \operatorname{Concat}\!\left[f_R^1(n), \ldots, f_R^{N_r}(n)\right] \tag{5}$$

To keep the dimensions of the concatenated relation features and the object’s own feature consistent, each $W_V$ plays the role of dimensionality reduction (its output dimension is $1/N_r$ of the input feature dimension).
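To make the formulas above more concrete, here is a minimal PyTorch sketch of a single Relation Module written from this description. It is not the authors’ code (which is not released); in particular, the learned linear layer standing in for the 64-d geometric embedding, the shared $W_V$ split across heads, and all names and default sizes are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationModule(nn.Module):
    """Minimal sketch of an object relation module (Eqs. (1)-(5) above)."""

    def __init__(self, d_f: int = 1024, n_heads: int = 16, d_k: int = 64, d_g: int = 64):
        super().__init__()
        assert d_f % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_k
        self.w_q = nn.Linear(d_f, n_heads * d_k, bias=False)
        self.w_k = nn.Linear(d_f, n_heads * d_k, bias=False)
        self.w_v = nn.Linear(d_f, d_f, bias=False)       # split into n_heads chunks of size d_f // n_heads
        self.geo_embed = nn.Linear(4, d_g)               # stand-in for the 64-d geometric embedding
        self.w_g = nn.Linear(d_g, n_heads, bias=False)   # one geometric weight per relation head

    def forward(self, f_a: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # f_a: (N, d_f) appearance features.
        # geo: (N, N, 4), where geo[n, m] is the relative geometry of object m w.r.t. object n.
        n = f_a.size(0)
        q = self.w_q(f_a).view(n, self.n_heads, self.d_k)
        k = self.w_k(f_a).view(n, self.n_heads, self.d_k)
        v = self.w_v(f_a).view(n, self.n_heads, -1)

        # Appearance weight (Eq. (3)): scaled dot product for every pair (n, m), per head.
        w_a = torch.einsum('nhd,mhd->hnm', q, k) / self.d_k ** 0.5      # (H, N, N)

        # Geometric weight (Eq. (4)): embed relative geometry, project to a scalar per head, ReLU.
        w_g = F.relu(self.w_g(self.geo_embed(geo))).permute(2, 0, 1)    # (H, N, N)

        # Relation weight (Eq. (2)): softmax over m with the geometric weight multiplied in;
        # adding log(w_g) before the softmax is equivalent to the w_g * exp(w_a) form.
        w = F.softmax(torch.log(w_g.clamp(min=1e-6)) + w_a, dim=2)      # (H, N, N)

        # Relation features (Eq. (1)) per head, concatenated, then residual fusion (Eq. (5)).
        f_r = torch.einsum('hnm,mhd->nhd', w, v).reshape(n, -1)         # (N, d_f)
        return f_a + f_r
```

Because the output has exactly the same shape as the input, the module can be inserted after any fully connected layer or stacked several times, which is exactly what the detection framework below relies on.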

Relation Networks For Object Detection

This paper applies the proposed Relation Module to object detection. A typical CNN-based object detection architecture consists of four steps:

  1. Pretrain a network model on a large dataset (typically ImageNet);
  2. Extract features of the candidate regions;
  3. Perform instance recognition (classification and bounding-box regression);
  4. Remove duplicate detection boxes.

Based on these properties of the Relation Module, the authors insert a Relation Module after each fully connected layer, and also use it to replace the commonly used NMS algorithm for removing duplicate detection boxes, as shown in the figure below.

Relation for Instance Recognition

In the original R-CNN-style model, after RoI pooling, bounding-box regression and object classification are carried out after two fully connected layers, i.e. the flow is:

RoI_Feat → 1024-d FC → 1024-d FC → (score, bbox)

Since the feature dimensions do not change after the Relation Module processes them, a Relation Module can be appended after each fully connected layer. The instance recognition flow then becomes:

RoI_Feat → {1024-d FC, RM × r1} → {1024-d FC, RM × r2} → (score, bbox)

In the flow above, r1 and r2 denote the number of times the Relation Module is repeated. The schematic diagram of instance recognition is shown in the figure below.
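As a rough code sketch of this enhanced head, reusing the `RelationModule` class sketched earlier: the 1024-d layer sizes follow the flow above, while the class name `Relation2FCHead` and the way the pairwise geometry is threaded through are my own choices.

```python
import torch.nn as nn

class Relation2FCHead(nn.Module):
    """Sketch of the 2fc+RM instance-recognition head described above.
    Assumes the RelationModule sketch defined earlier in this post."""

    def __init__(self, in_dim: int, num_classes: int, r1: int = 1, r2: int = 1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 1024)
        self.rm1 = nn.ModuleList([RelationModule(1024) for _ in range(r1)])
        self.fc2 = nn.Linear(1024, 1024)
        self.rm2 = nn.ModuleList([RelationModule(1024) for _ in range(r2)])
        self.cls_score = nn.Linear(1024, num_classes)
        self.bbox_pred = nn.Linear(1024, num_classes * 4)

    def forward(self, roi_feat, geo):
        # roi_feat: (N, in_dim) pooled RoI features; geo: (N, N, 4) pairwise geometry.
        x = self.fc1(roi_feat).relu()
        for rm in self.rm1:          # r1 relation modules after the first FC
            x = rm(x, geo)
        x = self.fc2(x).relu()
        for rm in self.rm2:          # r2 relation modules after the second FC
            x = rm(x, geo)
        return self.cls_score(x), self.bbox_pred(x)
```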

Relation for Duplicate Removal

First, the authors point out that NMS is a suboptimal choice, because it is a greedy algorithm and its parameters must be chosen by hand. They then note that duplicate removal is really a binary classification problem: for each ground-truth object, only one detection box is correct, and all other boxes can be regarded as duplicates. The input of the proposed module is the output of instance recognition, i.e. a set of detected objects, each with a 1024-dimensional feature, a bounding box, and a classification score $s_0$. As shown in the figure below, the module outputs a binary score $s_1 \in [0, 1]$, and the final detection score is $s_0 \cdot s_1$. Let us look at how $s_1$ is computed. The steps of this module are as follows:

  1. First, the authors point out that it is more effective to convert the classification score into a rank rather than using its raw numerical value. The rank and the 1024-dimensional appearance feature are then each transformed into 128 dimensions ($W_{fR}$ and $W_f$ in the figure above) and fused.
  2. The fused features of all objects are then transformed through a Relation Module.
  3. Each transformed feature is passed through a linear classifier ($W_s$ in the figure), and the output is normalized to $[0, 1]$ by a Sigmoid, giving $s_1$.

The Relation Module is the core of the steps above, because it can fuse the bounding box, the original appearance feature, and the classification score, so that the whole object detection framework remains an end-to-end model. A rough sketch of this head follows below.
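Under the same caveats as before, a minimal sketch of the duplicate-removal head described by the three steps above might look like this; the `nn.Embedding` used for the rank feature and the single binary classifier are my simplifications.

```python
import torch
import torch.nn as nn

class DuplicateRemovalHead(nn.Module):
    """Sketch of the duplicate-removal module: rank + appearance features fused
    to 128-d, one relation module, then a linear classifier with a sigmoid.
    Assumes the RelationModule sketch defined earlier in this post."""

    def __init__(self, appearance_dim: int = 1024, d: int = 128, max_rank: int = 100):
        super().__init__()
        self.rank_embed = nn.Embedding(max_rank, d)  # stand-in for the paper's rank embedding
        self.app_proj = nn.Linear(appearance_dim, d)
        self.relation = RelationModule(d_f=d)
        self.classifier = nn.Linear(d, 1)

    def forward(self, appearance, s0, geo):
        # appearance: (N, 1024) features from instance recognition
        # s0: (N,) classification scores; geo: (N, N, 4) pairwise geometry
        rank = s0.argsort(descending=True).argsort()          # rank of each detection by score
        rank = rank.clamp(max=self.rank_embed.num_embeddings - 1)
        fused = self.app_proj(appearance) + self.rank_embed(rank)
        fused = self.relation(fused, geo)
        s1 = torch.sigmoid(self.classifier(fused)).squeeze(-1)  # s1 in [0, 1]
        return s0 * s1                                           # final detection score
```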

The next task is to determine which detected objects are correct and which are duplicates. The authors first set an IoU threshold; detections above this threshold are retained. Then, among the retained detections for each ground-truth object, the one with the largest IoU is labeled correct, and the rest are labeled duplicate.
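Following this summary’s description of the labeling rule, a small sketch of the label assignment (the IoU threshold value and the tensor layout are assumptions):

```python
import torch

def duplicate_labels(ious: torch.Tensor, eta: float = 0.5) -> torch.Tensor:
    """ious: (num_detections, num_gt) pairwise IoU matrix. For each ground-truth
    object, detections with IoU > eta are retained; the one with the largest IoU
    is labeled correct (1) and everything else is duplicate (0)."""
    labels = torch.zeros(ious.size(0), dtype=torch.long)
    for g in range(ious.size(1)):
        candidates = (ious[:, g] > eta).nonzero(as_tuple=True)[0]
        if candidates.numel() > 0:
            best = candidates[ious[candidates, g].argmax()]
            labels[best] = 1
    return labels
```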

Experiments

The experiments use the COCO dataset with 80 categories; ResNet-50 and ResNet-101 are used as the CNN backbones.

Relation for Instance Recognition

First, look at the instance recognition experiments. The plain 2fc head is compared with the 2fc+RM (Relation Module) head, and various hyperparameters of the RM are compared.

Relation for Duplicate Removal

Comparisons across a variety of network models and a variety of parameter settings.

What exactly has the Relation Module learned?

The Relation Module proposed by the authors is a good research direction, but it is a pity that there is no thorough explanation of what the Relation Module has actually learned; the authors say this is beyond the scope of the paper. To give an intuitive illustration of the proposed model, the authors analyze the relation weights in the RM after the last FC layer, as shown in the figure below. Blue represents the detected object, and the orange boxes and values represent the related objects and weights that are most helpful to its detection.

Questions raised by the authors

1. If only one sample is labeled as correct, will this cause a severe imbalance between positive and negative samples? The answer is no; the network trains well. Why? Because the authors found in practice that most objects’ final scores $s_0 \cdot s_1$ are very low (close to the negative label 0), so both their loss values and their gradients are small.

2. Are the goals of the two designed modules contradictory? For instance recognition, as many high-scoring objects as possible should be recognized, while the goal of duplicate removal is to keep only one positive sample. The authors believe this apparent contradiction is reconciled by $s_0$ and $s_1$: the instance-recognition output $s_0$ can be adjusted downward by a low $s_1$.

3. Unlike NMS, the duplicate removal module is learnable, so during end-to-end training changes in the instance-recognition output directly affect this module. Does this make training unstable? The answer is also no. In fact, the authors find that end-to-end training works even better, which they attribute to the unstable labels playing, to some extent, the role of an averaging regularization.

Finally, a small advertisement: you are welcome to follow my Nuggets account and personal blog.