The design of the DenseBox detection algorithm was well ahead of its time, and traces of it can be found in many of today's anchor-free methods. Had it not arrived slightly later than Faster R-CNN, object detection might have moved toward anchor-free designs much earlier.

Source: Xiaofei algorithm engineering Notes public account

DenseBox: Unifying Landmark Localization with End to End Object Detection

  • Paper address: Arxiv.org/abs/1509.04…

Introduction


DenseBox is an early anchor-free object detection algorithm. At the time, the R-CNN family had an obvious bottleneck on small-object detection, so the authors proposed DenseBox, which also performed well on small objects. Shortly after DenseBox was proposed, the famous Faster R-CNN appeared, and its strong performance steered the field toward anchor-based methods. It was not until the emergence of FPN that anchor-free algorithms saw a large performance boost and more work entered the anchor-free area. Many current anchor-free detection studies carry traces of DenseBox, which shows how forward-looking its design was.

DenseBox for Detection


The overall design of DenseBox is shown in Figure 1. A single convolutional network simultaneously outputs multiple predicted boxes together with their class confidences; the output feature map has size $5 \times \frac{m}{4} \times \frac{n}{4}$. Suppose pixel $i$ is located at $(x_i, y_i)$; its 5-dimensional target vector is $\hat{t}_i = \{\hat{s}, \hat{dx^t} = x_i - x_t, \hat{dy^t} = y_i - y_t, \hat{dx^b} = x_i - x_b, \hat{dy^b} = y_i - y_b\}$, where the first element is the classification confidence and the last four are the distances from the pixel position to the target's top-left and bottom-right boundaries. Finally, the outputs of all pixels are converted into predicted boxes, and the result is filtered by NMS.
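The decoding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: function names are my own, and it assumes a $5 \times H \times W$ output map at stride 4, inverting the distance definitions ($x_t = x_i - \hat{dx^t}$, etc.) and finishing with greedy NMS.

```python
import numpy as np

def decode_and_nms(output, score_thresh=0.5, iou_thresh=0.3, stride=4):
    """Decode a DenseBox-style 5 x H x W output map into boxes.

    output[0] is the per-pixel score; output[1:5] are the distances
    (dx_t, dy_t, dx_b, dy_b) from the pixel to the top-left and
    bottom-right box corners, following the paper's target vector.
    """
    scores = output[0]
    ys, xs = np.nonzero(scores >= score_thresh)
    boxes, confs = [], []
    for y, x in zip(ys, xs):
        cx, cy = x * stride, y * stride          # back to input coordinates
        dxt, dyt, dxb, dyb = output[1:5, y, x]
        # invert the target definition: x_t = x_i - dx_t, x_b = x_i - dx_b, ...
        boxes.append([cx - dxt, cy - dyt, cx - dxb, cy - dyb])
        confs.append(scores[y, x])
    boxes, confs = np.array(boxes), np.array(confs)

    # greedy NMS over the decoded boxes
    keep = []
    order = confs.argsort()[::-1]
    while order.size:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        xx1 = np.maximum(boxes[i, 0], rest[:, 0])
        yy1 = np.maximum(boxes[i, 1], rest[:, 1])
        xx2 = np.minimum(boxes[i, 2], rest[:, 2])
        yy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]
    return boxes[keep], confs[keep]
```

Two pixels that decode to the same box will thus be merged into a single detection, which is exactly why the dense per-pixel output needs the NMS step.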

Ground Truth Generation

DenseBox does not take the complete picture as input during training; instead it crops a large patch that contains the target and sufficient background. During training, the cropped patch is resized to $240 \times 240$, with the face centred in the patch at a height of about 50 pixels, and the network outputs a $5 \times 60 \times 60$ feature map. The positive-sample region is the circle of radius $r_c$ around the target's centre point; $r_c$ is proportional to the target's size, with the scaling factor set to 0.3 in the paper. If the cropped patch contains multiple faces, only faces whose scale falls within 0.8 to 1.25 of the centred face are kept as positives; the others are treated as negative samples.
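The ground-truth construction above can be sketched as follows. This is a minimal single-face version with names of my own; the 0.3 factor follows the paper, while tying $r_c$ to the box height specifically is an assumption here.

```python
import numpy as np

def make_targets(face_box, out_size=60, stride=4, radius_scale=0.3):
    """Build a 5 x 60 x 60 ground-truth map for one centred face.

    face_box is (x_t, y_t, x_b, y_b) in input-image coordinates.
    Pixels inside a circle of radius r_c around the face centre are
    positive (score 1) and regress their four corner distances.
    """
    xt, yt, xb, yb = face_box
    cx, cy = (xt + xb) / 2, (yt + yb) / 2
    r_c = radius_scale * (yb - yt)       # radius proportional to face size
    target = np.zeros((5, out_size, out_size), dtype=np.float32)
    for j in range(out_size):
        for i in range(out_size):
            x_i, y_i = i * stride, j * stride   # pixel position in input coords
            if (x_i - cx) ** 2 + (y_i - cy) ** 2 <= r_c ** 2:
                target[0, j, i] = 1.0                        # positive score
                target[1:, j, i] = (x_i - xt, y_i - yt,
                                    x_i - xb, y_i - yb)      # corner distances
    return target
```

For the paper's setting (a 50-pixel face centred in a 240×240 patch), this marks a small disc of positives around the centre of the 60×60 map and leaves everything else as negatives.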

Model Design

The network structure of DenseBox is shown in Figure 3 and contains 16 convolutional layers, the first 12 of which are initialised from VGG19. The network also adds feature fusion between different layers to combine features from different receptive fields.

Multi-Task Training

The network performs classification and position prediction at the same time and is trained on the two tasks jointly. The loss of the classification task is computed directly with an L2 loss:
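In the paper's notation, with predicted score $\hat{y}$ and ground-truth label $y^*$ for a pixel, this is Formula 1:

$\mathcal{L}_{cls}(\hat{y}, y^*) = \| \hat{y} - y^* \|^2$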

The loss of the position-prediction task is likewise computed with an L2 loss:
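In the paper's notation, with predicted distance vector $\hat{d} = \{\hat{dx^t}, \hat{dy^t}, \hat{dx^b}, \hat{dy^b}\}$ and ground truth $d^*$, this is Formula 2:

$\mathcal{L}_{loc}(\hat{d}, d^*) = \sum_{i \in \{tx, ty, bx, by\}} \| \hat{d}_i - d_i^* \|^2$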

Since the paper trains on cropped patches, it faces the problem of sample assignment, and DenseBox does some specific work on constructing and learning from positive and negative samples:

  • Ignoring Gray Zone: the gray zone is the transitional area between positive and negative points and does not participate in the loss computation. A non-positive pixel is assigned to the gray zone if there is a positive pixel within a radius of 2 pixels.
  • Hard Negative Mining: during training, the samples are sorted by the loss in Formula 1 and the top 1% are taken as hard negatives, which helps the network focus on learning these difficult samples.
  • Loss with Mask: a mask $M(\hat{t}_i)$ over the feature map is defined according to the type of each pixel, and the final loss is obtained by combining Formula 1, Formula 2 and the mask:
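Combining the two task losses with the mask as in the paper, with $\lambda_{loc}$ balancing the two tasks and $[y_i^* > 0]$ restricting the regression loss to positive pixels, the detection loss takes the form:

$\mathcal{L}_{det}(\theta) = \sum_i \left( M(\hat{t}_i)\, \mathcal{L}_{cls}(\hat{y}_i, y_i^*) + \lambda_{loc}\, [y_i^* > 0]\, M(\hat{t}_i)\, \mathcal{L}_{loc}(\hat{d}_i, d_i^*) \right)$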

In addition to the points above, in order to better exploit negative samples, the paper also generates enough random negative samples by randomly cropping the training set. During training, positive patches and random negative patches are fed to the network at a 1:1 ratio. Furthermore, to strengthen the robustness of the network, several data augmentations are applied:

  • Randomly jitter each cropped patch
  • Left-right flipping
  • Horizontal translation within 25 pixels
  • Random scaling in [0.8, 1.25]
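Two of the augmentations above can be sketched as below. This is an illustrative fragment, not the paper's pipeline: the function name is my own, jitter and scaling are omitted, and `np.roll` wraps pixels around rather than padding, which a real implementation would handle differently.

```python
import random
import numpy as np

def augment(patch, box):
    """Random flip / horizontal-shift sketch of the paper's augmentations.

    patch: H x W x 3 array; box: [x1, y1, x2, y2] in patch coordinates.
    """
    h, w = patch.shape[:2]
    if random.random() < 0.5:                    # left-right flip
        patch = patch[:, ::-1]
        box = [w - box[2], box[1], w - box[0], box[3]]
    tx = random.randint(-25, 25)                 # horizontal shift <= 25 px
    patch = np.roll(patch, tx, axis=1)           # note: wraps around the edge
    box = [box[0] + tx, box[1], box[2] + tx, box[3]]
    return patch, box
```

Whatever combination is drawn, the box moves with the image, so the target's size is preserved and only its position (and orientation) changes.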

Landmark Localization

Based on the design above, DenseBox can also be used for landmark localization, simply by adding branches that predict the landmarks. The paper also found that the detection results can be further refined by fusing the landmark branch with the classification branch, as shown in Figure 4. As with the classification loss, the L2 function is used as the loss for the refined output. At this point the complete network loss becomes:
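In the paper's notation, with $\mathcal{L}_{lm}$ the landmark loss, $\mathcal{L}_{rf}$ the refinement loss, and $\lambda_{det}$, $\lambda_{lm}$ balancing weights, the full loss is:

$\mathcal{L}_{full}(\theta) = \lambda_{det}\, \mathcal{L}_{det}(\theta) + \lambda_{lm}\, \mathcal{L}_{lm}(\theta) + \mathcal{L}_{rf}(\theta)$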

Experiments


Performance comparison on facial landmark localization.

Performance comparison on the vehicle keypoint detection task.

Conclusion


The design of the DenseBox detection algorithm was well ahead of its time, and traces of it can be found in many of today's anchor-free methods. Had it not arrived slightly later than Faster R-CNN, object detection might have moved toward anchor-free designs much earlier.





