The design of the DenseBox detection algorithm was ahead of its time, and many current anchor-free methods bear its imprint. Had DenseBox not appeared slightly later than Faster R-CNN, object detection might have moved toward anchor-free methods much earlier.

Source: WeChat official account "Xiaofei’s algorithm engineering notes"

DenseBox: Unifying Landmark Localization with End to End Object Detection

  • Paper address:


DenseBox is an early anchor-free object detection algorithm. At the time, the R-CNN family had clear weaknesses in detecting small objects, so the authors proposed DenseBox, which also performs well on small objects. Around the time DenseBox was proposed, the well-known Faster R-CNN appeared, and its strong performance steered object detection toward anchor-based methods. Only after FPN appeared did the performance of anchor-free algorithms improve substantially, and more work began to explore the anchor-free direction. Many current anchor-free detectors bear the imprint of DenseBox, which shows how forward-looking its design was.

DenseBox for Detection

The overall design of DenseBox is shown in Figure 1. A single convolutional network outputs multiple predicted boxes together with their class confidences. For an $m\times n$ input image, the output feature map has size $5\times \frac{m}{4}\times \frac{n}{4}$. Let the target box have top-left corner $(x_t, y_t)$ and bottom-right corner $(x_b, y_b)$. A pixel $i$ located at $(x_i, y_i)$ then has the 5-dimensional vector $\hat{t}_i = \{\hat{s}, \hat{dx^t} = x_i - x_t, \hat{dy^t} = y_i - y_t, \hat{dx^b} = x_i - x_b, \hat{dy^b} = y_i - y_b\}$. The first element is the classification confidence and the last four are the distances from the pixel to the target's boundaries. Finally, the outputs of all pixels are converted into predicted boxes, which are filtered with NMS to give the final result.
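
The decoding step described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the function names `decode_boxes`, `iou` and `nms`, the nested-list layout `output[channel][y][x]`, and both thresholds are assumptions made here.

```python
def decode_boxes(output, score_threshold=0.5):
    """Turn a 5 x H x W output map into (x_t, y_t, x_b, y_b, score) candidates.

    output[0] is the confidence map; output[1..4] hold dx^t, dy^t, dx^b, dy^b.
    Coordinates are in output-map space. The threshold is an assumed value.
    """
    scores, dxt, dyt, dxb, dyb = output
    boxes = []
    for y_i, row in enumerate(scores):
        for x_i, s in enumerate(row):
            if s < score_threshold:
                continue
            # dx^t = x_i - x_t, so x_t = x_i - dx^t (same for the other terms).
            x_t = x_i - dxt[y_i][x_i]
            y_t = y_i - dyt[y_i][x_i]
            x_b = x_i - dxb[y_i][x_i]
            y_b = y_i - dyb[y_i][x_i]
            boxes.append((x_t, y_t, x_b, y_b, s))
    return boxes


def iou(a, b):
    """Intersection over union of two (x_t, y_t, x_b, y_b, ...) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept
```

Every pixel above the threshold votes for one box, so neighbouring pixels inside the same target produce near-duplicate boxes; the NMS pass is what collapses them into a single detection.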

Ground Truth Generation

Rather than taking the full image as input, DenseBox is trained on cropped patches that contain the target and enough surrounding background. During training, each patch is cropped and resized to $240\times 240$ so that the face sits at the center of the patch with a height of about 50 pixels, and the network outputs a $5\times 60\times 60$ feature map. The positive region is the filled circle of radius $r_c$ around the target center. $r_c$ scales with the size of the target, and the paper sets it to 0.3 times the box size. If a patch contains multiple faces, only those whose scale is within 0.8 to 1.25 times that of the face at the patch center are kept as positives, and the others are treated as negative samples.
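
As a rough sketch of this labelling scheme, the snippet below builds the target map for a single box. The name `make_ground_truth`, the nested-list layout `target[channel][y][x]`, and the choice of box height as the "size" that $r_c$ scales with are assumptions made for illustration. For simplicity it fills the four distance channels only inside the positive circle.

```python
def make_ground_truth(box, size=60, radius_ratio=0.3):
    """Build a 5 x size x size target map for one box.

    box is (x_t, y_t, x_b, y_b) in output-map coordinates.
    Channel 0 is the positive label; channels 1..4 are dx^t, dy^t, dx^b, dy^b.
    """
    x_t, y_t, x_b, y_b = box
    cx, cy = (x_t + x_b) / 2.0, (y_t + y_b) / 2.0
    # Assumption: the target "size" is taken to be the box height.
    r_c = radius_ratio * (y_b - y_t)
    target = [[[0.0] * size for _ in range(size)] for _ in range(5)]
    for y in range(size):
        for x in range(size):
            # Positive region: filled circle of radius r_c around the center.
            if (x - cx) ** 2 + (y - cy) ** 2 <= r_c ** 2:
                target[0][y][x] = 1.0
                target[1][y][x] = x - x_t
                target[2][y][x] = y - y_t
                target[3][y][x] = x - x_b
                target[4][y][x] = y - y_b
    return target
```

With a face about 50 pixels tall in the $240\times 240$ patch, the box is roughly 12 pixels tall in the $60\times 60$ output map, so the positive circle covers only a handful of pixels around the center.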

Model Design

The network structure of DenseBox is shown in Figure 3. It contains 16 convolutional layers, the first 12 of which are initialized from VGG19. The network also fuses features from different layers, combining features with different receptive fields.

Multi-Task Training

The network performs classification and location prediction simultaneously and is trained jointly on both tasks. The loss of the classification task (Formula 1) is computed directly as an L2 loss, and the loss of the location prediction task (Formula 2) is also an L2 loss.
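
Written out, the two L2 losses plausibly take the following form. The notation is an assumption made here to match the output vector defined earlier: $\hat{s}$ and $s^*$ are the predicted and ground-truth confidence at a pixel, and $\hat{d}$ and $d^*$ are the predicted and ground-truth values of each of the four distance terms.

$$\mathcal{L}_{cls}(\hat{s}, s^*) = \|\hat{s} - s^*\|^2 \tag{1}$$

$$\mathcal{L}_{loc} = \sum_{d \in \{dx^t,\, dy^t,\, dx^b,\, dy^b\}} \|\hat{d} - d^*\|^2 \tag{2}$$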

Because the paper trains on cropped patches, it has to deal with how training samples are constructed. DenseBox handles the construction and learning of positive and negative samples as follows:

  • Ignoring Gray Zone. The gray zone is the transition area between positive and negative points, and it does not contribute to the loss. A non-positive pixel is assigned to the gray zone if there is a positive pixel within a radius of 2.
  • Hard Negative Mining. During training, samples are sorted by their Formula 1 (classification) loss and the top 1% are taken as hard negatives, which helps the network focus on these difficult samples.
  • Loss with Mask. A mask $M(\hat{t}_i)$ over the feature map is defined according to the type of each pixel, and the final loss combines Formula 1, Formula 2 and the mask.
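
One plausible way to write the masked loss, with $\mathcal{L}_{cls}$ and $\mathcal{L}_{loc}$ standing for Formula 1 and Formula 2 evaluated at pixel $i$, $s_i^*$ the ground-truth label, and $\lambda_{loc}$ an assumed balancing weight, is:

$$\mathcal{L}_{det} = \sum_i \Big( M(\hat{t}_i)\,\mathcal{L}_{cls,i} + \lambda_{loc}\,[s_i^* > 0]\,M(\hat{t}_i)\,\mathcal{L}_{loc,i} \Big)$$

The sketch below puts the three steps together. It is a simplification under stated assumptions: the helper names `build_mask` and `masked_detection_loss` are hypothetical, the top 1% is taken over the non-gray negatives, only positives and hard negatives are selected, and the default `lambda_loc=1.0` is a placeholder rather than the paper's value.

```python
def build_mask(labels, cls_loss, gray_radius=2, hard_ratio=0.01):
    """Return an H x W 0/1 mask from 0/1 labels and per-pixel Formula 1 losses."""
    h, w = len(labels), len(labels[0])

    def in_gray_zone(y, x):
        # A non-positive pixel is gray if a positive lies within gray_radius.
        for yy in range(max(0, y - gray_radius), min(h, y + gray_radius + 1)):
            for xx in range(max(0, x - gray_radius), min(w, x + gray_radius + 1)):
                close = (yy - y) ** 2 + (xx - x) ** 2 <= gray_radius ** 2
                if labels[yy][xx] == 1 and close:
                    return True
        return False

    mask = [[0] * w for _ in range(h)]
    negatives = []
    for y in range(h):
        for x in range(w):
            if labels[y][x] == 1:
                mask[y][x] = 1  # positives always contribute
            elif not in_gray_zone(y, x):
                negatives.append((cls_loss[y][x], y, x))  # gray zone is skipped
    # Hard negative mining: keep the negatives with the largest loss.
    negatives.sort(reverse=True)
    n_hard = max(1, int(len(negatives) * hard_ratio)) if negatives else 0
    for _, y, x in negatives[:n_hard]:
        mask[y][x] = 1
    return mask


def masked_detection_loss(mask, labels, cls_loss, loc_loss, lambda_loc=1.0):
    """Sum the masked classification loss and the location loss on positives."""
    total = 0.0
    for y, row in enumerate(mask):
        for x, m in enumerate(row):
            total += m * cls_loss[y][x]
            total += lambda_loc * labels[y][x] * m * loc_loss[y][x]
    return total
```

Pixels in the gray zone never enter the mask, so an ambiguous prediction right next to a face is neither rewarded nor penalized.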

In addition, to make better use of negative samples, the paper generates enough random negative samples by randomly cropping the training set. During training, positive patches and random negative patches are fed to the network at a ratio of 1:1. To improve the robustness of the network, each cropped patch is also randomly jittered as data augmentation:

  • Left-right flip
  • Horizontal shift of up to 25 pixels
  • Random scaling in [0.8, 1.25]
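
The jitter settings listed above could be sampled per patch roughly as follows. The function name `sample_jitter`, the dictionary keys and the 50% flip probability are assumptions for illustration; only the shift bound and the scale range come from the text.

```python
import random


def sample_jitter(rng, max_shift=25, scale_range=(0.8, 1.25)):
    """Draw one set of augmentation parameters for a cropped patch."""
    return {
        "flip": rng.random() < 0.5,  # assumed flip probability
        "shift_x": rng.randint(-max_shift, max_shift),  # pixels, horizontal
        "scale": rng.uniform(*scale_range),
    }


# Example: a seeded generator keeps the augmentation reproducible.
params = sample_jitter(random.Random(0))
```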

Landmark Localization

With the design above, DenseBox can also be used for landmark localization simply by adding a few layers on top of the original network to predict landmarks. The paper also finds that fusing the landmark branch with the classification branch can further refine the detection results, as shown in Figure 4. The loss for the refined output is an L2 loss, just like the classification loss. The complete network loss then combines the detection loss, the landmark loss and the refinement loss.
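
A plausible form of the combined objective is a weighted sum of the three terms. The symbols are assumptions made here: $\mathcal{L}_{det}$ is the masked detection loss, $\mathcal{L}_{lm}$ the landmark loss, $\mathcal{L}_{rf}$ the refinement loss, and $\lambda_{det}$, $\lambda_{lm}$ are balancing weights whose values this article does not give.

$$\mathcal{L}_{full}(\theta) = \lambda_{det}\,\mathcal{L}_{det}(\theta) + \lambda_{lm}\,\mathcal{L}_{lm}(\theta) + \mathcal{L}_{rf}(\theta)$$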


Performance comparison on the face key point detection task.

Performance comparison on the vehicle key point detection task.


In summary, the design of the DenseBox detection algorithm was ahead of its time, and many current anchor-free methods bear its imprint. Had DenseBox not appeared slightly later than Faster R-CNN, object detection might have moved toward anchor-free methods much earlier.
