This paper optimizes the performance of CornerNet and proposes two optimized variants, CornerNet-Saccade and CornerNet-Squeeze. The optimizations are highly targeted and therefore somewhat limited in scope, but there is still a lot to learn from them.

Source: Xiaofei’s Algorithm Engineering Notes (WeChat official account)

CornerNet-Lite: Efficient Keypoint-Based Object Detection

  • Paper Address:
  • Paper Code:


CornerNet is a classic keypoint-based object detection method. It is accurate, but its inference is very slow, at about 1.1 s per image. Simply shrinking the input image speeds up inference, but it also reduces accuracy sharply, to the point where CornerNet performs much worse than YOLOv3. To address this, the paper proposes two lightweight CornerNet variants:

  • CornerNet-Saccade: This variant speeds up inference by reducing the number of pixels that need to be processed. It first estimates rough object positions on a downsized image, and then crops a small region around each position for detection. It reaches 43.2% AP at 190 ms per image.
  • CornerNet-Squeeze: This variant speeds up inference mainly by reducing the amount of processing per pixel. It brings the ideas of SqueezeNet and MobileNets into a new Hourglass backbone network and reaches 34.4% AP at 30 ms per image.

The paper also tried combining the two variants but found that performance got worse. CornerNet-Saccade needs a strong backbone network to produce sufficiently accurate feature maps, whereas CornerNet-Squeeze gains its speed by weakening the representational capacity of the backbone, so the combination does not give better results.


CornerNet-Saccade detects objects only in small regions around where objects are likely to be. It first predicts attention maps from a downsized version of the full image to obtain the rough position and size of each candidate box, and then crops a region centered on each position from the higher-resolution image for detection.

Estimating Object Locations

CornerNet-Saccade first estimates the rough position and size of the objects that may be present:

  • The input image is downsized to two scales, with the longer side resized to 255 pixels and 192 pixels. The smaller image is zero-padded so that both can be fed through the network at the same time.
  • For each downsized image, three attention maps are predicted: one for small objects (longer side < 32 pixels), one for medium objects (32 pixels <= longer side <= 96 pixels), and one for large objects (longer side > 96 pixels). This distinction helps decide how much a region needs to be enlarged, with small objects needing the most enlargement. This is covered in the next section.
  • The attention maps are predicted from the feature maps of different modules in the upsampling part of the Hourglass network, and the finer-scale (higher-resolution) feature maps are used for smaller objects (the backbone structure is described later). Each attention map is generated by a $3\times 3$ Conv-ReLU module followed by a $1\times 1$ Conv-Sigmoid module.
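To make the two-scale resizing and the size buckets concrete, here is a minimal sketch in plain Python. The function names are made up for illustration, and the rounding rule and the padding target of 255 are assumptions; the article only gives the two longer-side lengths and the 32/96 pixel thresholds.

```python
def downsized_shape(height, width, long_side):
    """Resize so that the longer side equals `long_side`, keeping the aspect ratio.

    Returns (new_height, new_width, scale). Rounding to the nearest pixel is an
    assumption, not something the article specifies.
    """
    scale = long_side / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale)), scale


def pad_amounts(height, width, canvas=255):
    """Zero-padding (rows, columns) needed to fill a canvas x canvas input.

    Padding up to 255 is assumed so that both scales share one input size.
    """
    return canvas - height, canvas - width


def size_bucket(box_width, box_height):
    """Bucket an object by its longer side: small < 32, medium 32..96, large > 96."""
    longer = max(box_width, box_height)
    if longer < 32:
        return "small"
    if longer <= 96:
        return "medium"
    return "large"


if __name__ == "__main__":
    for side in (255, 192):
        h, w, s = downsized_shape(480, 640, side)
        print(side, (h, w), round(s, 4), pad_amounts(h, w))
    print(size_bucket(20, 31), size_bucket(40, 96), size_bucket(97, 5))
```

The bucket of a candidate decides which zoom factor is applied to it in the detection step.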

At test time, only predicted locations with confidence above the threshold $t=0.3$ are processed. During training, the center of each ground-truth box on the corresponding attention map is treated as a positive sample and all other positions as negative samples, and the attention maps are trained with Focal Loss with $\alpha=2$.
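The training loss and the test-time threshold can be sketched as follows. This is a plain binary focal loss with $\alpha=2$ over a flattened attention map. Normalizing by the number of positives and clamping probabilities with `eps` are assumptions made for the sketch, and the function names are hypothetical.

```python
import math


def attention_focal_loss(probs, labels, alpha=2.0, eps=1e-6):
    """Binary focal loss over a flattened attention map.

    probs  : predicted probabilities in (0, 1), one per position
    labels : 1 at ground-truth box centers, 0 everywhere else
    Dividing by the number of positives is an assumption of this sketch.
    """
    total = 0.0
    num_pos = 0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1:
            total -= (1.0 - p) ** alpha * math.log(p)
            num_pos += 1
        else:
            total -= p ** alpha * math.log(1.0 - p)
    return total / max(num_pos, 1)


def keep_confident(locations, threshold=0.3):
    """Keep (x, y, score) locations whose score is above the threshold t."""
    return [loc for loc in locations if loc[2] > threshold]


if __name__ == "__main__":
    print(attention_focal_loss([0.9, 0.1], [1, 0]))
    print(keep_confident([(1, 2, 0.3), (3, 4, 0.31), (5, 6, 0.9)]))
```

The $(1-p)^\alpha$ and $p^\alpha$ factors down-weight positions that are already classified well, which matters here because almost every position on an attention map is a negative.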

Detecting Objects

Based on the rough position and size of each predicted box, CornerNet-Saccade enlarges the downsized image and crops a $255\times 255$ region centered on that position for detection. To make sure the object is clear enough, the downsized image is first enlarged by a factor that depends on the estimated object size: $s_s=4$ for small objects, $s_m=2$ for medium objects, and $s_l=1$ for large objects, so that $s_s>s_m>s_l$. The cropped regions are processed by the same Hourglass network, and all detections are finally merged and filtered with Soft-NMS. The detection network is trained and run in the same way as the original CornerNet, using corner heatmaps, embeddings, and offsets.
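The zoom-and-crop step comes down to a little coordinate arithmetic, sketched below. The zoom factors and the 255 window size are from the text. The function name, the rounding, and the choice to return coordinates in the enlarged image's frame are assumptions for illustration. The window may extend past the image border, where some form of padding would be needed.

```python
# Zoom factor per size bucket, as given in the text: s_s = 4, s_m = 2, s_l = 1.
ZOOM = {"small": 4, "medium": 2, "large": 1}
CROP = 255


def crop_window(x, y, bucket, crop=CROP):
    """Return (left, top, right, bottom) of the crop x crop window.

    (x, y) is a location on the downsized image. The window is expressed in the
    coordinates of the downsized image after it has been enlarged by the zoom
    factor of `bucket`. Values can be negative or exceed the image size; how
    that is handled (for example zero-padding) is left open here.
    """
    s = ZOOM[bucket]
    cx, cy = round(x * s), round(y * s)
    half = crop // 2
    left, top = cx - half, cy - half
    return left, top, left + crop, top + crop


if __name__ == "__main__":
    for bucket in ("small", "medium", "large"):
        print(bucket, crop_window(100, 50, bucket))
```

A small object at (100, 50) is examined at 4x magnification, so the fixed 255-pixel window covers a much smaller part of the scene than it does for a large object.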

Some special cases, shown in Figure 3, need extra handling:

  • A detection that touches the edge of a cropped region is removed, because the crop probably contains only part of that object.
  • When objects are close to each other, their cropped regions overlap heavily, and the network is likely to produce highly overlapping, duplicate results. To avoid this redundant work, an NMS-like procedure is applied to the predicted positions at each scale to remove positions that are too close together, which improves efficiency.
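Both special cases can be sketched in a few lines. The `margin` and `min_dist` parameters are hypothetical, since the article gives no values, and ranking locations by score alone is a simplification assumed here.

```python
def drop_boundary_boxes(boxes, crop=255, margin=1):
    """Remove boxes (x1, y1, x2, y2, score) that touch the border of the crop.

    `margin` is a hypothetical tolerance in pixels.
    """
    return [
        b for b in boxes
        if b[0] > margin and b[1] > margin
        and b[2] < crop - margin and b[3] < crop - margin
    ]


def suppress_close_locations(locations, min_dist):
    """Greedy NMS-like filtering of (x, y, score) locations.

    Locations are visited from highest to lowest score, and a location is kept
    only if it is at least `min_dist` away from every location kept so far.
    """
    kept = []
    for x, y, score in sorted(locations, key=lambda loc: loc[2], reverse=True):
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_dist ** 2 for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


if __name__ == "__main__":
    boxes = [(0, 10, 50, 60, 0.9), (10, 10, 50, 60, 0.8), (100, 100, 255, 200, 0.7)]
    print(drop_boundary_boxes(boxes))
    print(suppress_close_locations([(10, 10, 0.9), (12, 11, 0.8), (100, 100, 0.5)], 5))
```

Suppressing nearby locations before cropping means fewer regions go through the detector, which is where the efficiency gain comes from.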

In addition, to make detection more efficient, the paper implements the following details:

  • The regions are cropped in batches.
  • The original image is kept in GPU memory, and it is resized and cropped directly on the GPU.
  • The cropped regions are run through the detector in batches.

Backbone Network

The paper designs a new backbone network, Hourglass-54, which has fewer parameters and fewer layers than the Hourglass-104 used in the original CornerNet. Hourglass-54 is 54 layers deep and consists of three Hourglass modules, with the input downsampled twice before the first module. Each module downsamples its input three times while gradually increasing the number of channels (384, 384, 512). There is one residual module with 512 channels in the middle of each module, and one residual module after each upsampling layer.
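As a rough check on the downsampling described above, the following sketch traces the feature-map side length through one Hourglass-54 module. It assumes every downsampling step halves the resolution and rounds up, which the article does not state, and the function names are hypothetical.

```python
def halve(n):
    """One stride-2 downsampling step, assumed to give ceil(n / 2)."""
    return (n + 1) // 2


def hourglass54_resolutions(input_side=255, pre_downsamples=2, module_downsamples=3):
    """Side length at the module input and after each downsampling inside it."""
    side = input_side
    for _ in range(pre_downsamples):   # two downsamplings before the first module
        side = halve(side)
    trace = [side]
    for _ in range(module_downsamples):  # three downsamplings inside a module
        side = halve(side)
        trace.append(side)
    return trace


if __name__ == "__main__":
    print(hourglass54_resolutions(255))
```

Under these assumptions a $255\times 255$ crop enters the module at a side length of 64 and is 8 on a side at the middle of the module, where the 512-channel residual module sits.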


In CornerNet, most of the inference time is spent in the Hourglass-104 backbone. To address this, CornerNet-Squeeze combines ideas from SqueezeNet and MobileNet to reduce the complexity of Hourglass-104 and designs a new lightweight Hourglass network. The core of SqueezeNet is the fire module, which first reduces the number of input channels with a squeeze layer made of $1\times 1$ convolutions, and then extracts features with an expand layer made of $1\times 1$ and $3\times 3$ convolutions. MobileNet replaces the standard $3\times 3$ convolution with a $3\times 3$ depthwise separable convolution, which effectively reduces the number of network parameters.
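The parameter savings behind these two ideas are easy to verify with a few lines of arithmetic. The sketch below counts convolution weights only (biases and normalization layers are ignored). The channel sizes in the example are arbitrary illustrative values, not the ones from Table 1, and the fire module here is the generic SqueezeNet form rather than the exact block used in CornerNet-Squeeze.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out


def fire_params(c_in, squeeze, expand_1x1, expand_3x3):
    """Weights of a generic fire module: 1x1 squeeze, then 1x1 and 3x3 expand."""
    return (
        conv_params(c_in, squeeze, 1)
        + conv_params(squeeze, expand_1x1, 1)
        + conv_params(squeeze, expand_3x3, 3)
    )


if __name__ == "__main__":
    c = 256  # arbitrary channel count for illustration
    print("standard 3x3:       ", conv_params(c, c, 3))
    print("depthwise separable:", depthwise_separable_params(c, c))
    print("fire module:        ", fire_params(c, c // 2, c // 2, c // 2))
```

With 256 input and output channels, the standard $3\times 3$ convolution has 589,824 weights, the depthwise separable version 67,840, and this fire module configuration 196,608.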

The new module is shown in Table 1. In addition to replacing the residual module, the new backbone network also makes the following modifications:

  • To reduce the maximum feature-map resolution inside the Hourglass modules, one more downsampling layer is added before the first Hourglass module. Correspondingly, one downsampling layer is removed from each Hourglass module.
  • The $3\times 3$ convolution in the prediction modules is replaced with a $1\times 1$ convolution.
  • The nearest-neighbor upsampling layer is replaced with a $4\times 4$ transposed convolution (deconvolution).
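To see why a $4\times 4$ transposed convolution can stand in for a 2x nearest-neighbor upsampling, the output-size formula can be evaluated directly. The stride of 2 and padding of 1 used below are assumptions (the article only gives the kernel size), and the formula is the usual one for a transposed convolution without dilation or output padding.

```python
def transposed_conv_out(size_in, kernel=4, stride=2, padding=1):
    """Output side length of a transposed convolution.

    Standard formula with dilation 1 and no output padding. stride=2 and
    padding=1 are assumed values chosen so the layer doubles the resolution.
    """
    return (size_in - 1) * stride - 2 * padding + kernel


def prediction_conv_ratio(channels):
    """How many times fewer weights a 1x1 convolution has than a 3x3 one."""
    return (3 * 3 * channels * channels) // (1 * 1 * channels * channels)


if __name__ == "__main__":
    for side in (8, 16, 32):
        print(side, "->", transposed_conv_out(side))
    print(prediction_conv_ratio(256))
```

Under these assumptions the layer exactly doubles the side length, just like the upsampling it replaces, but with learnable weights. Swapping the $3\times 3$ prediction convolution for a $1\times 1$ one cuts that layer's weights by a factor of 9 regardless of the channel count.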


CornerNet-Saccade comparison experiments.

CornerNet-Squeeze comparison experiments.

Performance comparison with other object detectors.


The paper optimizes the performance of CornerNet and proposes two optimized variants, CornerNet-Saccade and CornerNet-Squeeze. The optimizations are highly targeted and therefore somewhat limited in scope, but there is still a lot to learn from them.
