Paper: Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework

Code: github.com/TencentYout…

Focused on technical summaries of computer vision, tracking of the latest techniques, and interpretation of classic papers.

Preface:

Locating individuals in a crowd meets the practical needs of downstream high-level crowd analysis tasks better than simply counting them. However, existing localization-based methods rely on intermediate representations as learning targets (i.e., density maps or pseudo-boxes), which are counterintuitive and error-prone.

This paper presents a purely point-based framework that jointly counts the crowd and localizes individuals. Under this framework, the paper proposes a new metric, density-normalized Average Precision (nAP), which provides a more comprehensive and accurate performance assessment than merely reporting absolute count errors at the image level.

In addition, the paper designs an intuitive solution under this framework, called the Point-to-Point Network (P2PNet). P2PNet discards redundant intermediate steps and directly predicts a set of point proposals to represent heads in the image, which is consistent with the way humans annotate. Through thorough analysis, the paper shows that a key step in realizing this idea is assigning optimal learning targets to these proposals.

P2PNet not only significantly outperforms state-of-the-art methods on popular counting benchmarks, but also achieves promising localization accuracy.

Motivation

  1. Among the tasks related to crowd analysis, crowd counting is a basic pillar: it estimates the number of individuals in a crowd. However, a single number clearly falls far short of the practical needs of downstream higher-level crowd analysis tasks such as crowd tracking, activity recognition, anomaly detection, and traffic/behavior prediction.

  2. In fact, there is a clear trend in this area toward more challenging fine-grained estimates (i.e., individual positions) beyond simple counts. Specifically, some approaches treat crowd counting as a head-detection problem, but this shifts the burden to labor-intensive box annotation of small-scale heads. Other methods try to generate pseudo bounding boxes for heads from point annotations alone, which is tricky and inaccurate at best. Several approaches that directly localize individuals run into trouble suppressing or splitting candidate instances that are too close together, and they are error-prone under extreme head-scale variation, especially in highly crowded regions.

  3. In terms of metrics, some far-sighted work encourages fine-grained evaluation with patch-level metrics, but these provide only a coarse measure of localization. Other existing localization-aware metrics either ignore the significant density variation within crowds or fail to penalize duplicate predictions.

Innovative ideas

  1. To solve the above problems, the paper proposes a purely point-based framework for jointly counting and localizing individuals in a crowd. The framework directly uses point annotations as learning targets and outputs points to localize individuals. It benefits from the high-precision localization of point representations and their relatively cheap annotation cost.

  2. A new metric, density-normalized Average Precision (nAP), is proposed to provide a comprehensive evaluation of both localization and counting errors. nAP accepts either box or point representations as input (i.e., as predictions or annotations) and avoids the defects mentioned above.

  3. As an intuitive solution under the new framework, a method is developed that directly predicts a set of point proposals, each with head coordinates and a confidence score. Specifically, the Point-to-Point Network (P2PNet) is trained directly on the annotated head points and predicts points at inference time.

    To make this idea work, the paper delves into the ground-truth target assignment process and reveals that the key lies in this association: whether multiple proposals match a single ground-truth point or vice versa, the model will be confused during training, leading to over- or under-estimated counts.

    Therefore, the proposals are matched one-to-one with the ground-truth targets using the Hungarian algorithm, and unmatched proposals are classified as negative samples. Empirically, this matching improves the nAP metric and is a key component of the paper's solution under the new framework. This simple, intuitive, and efficient design yields state-of-the-art counting performance and promising localization accuracy.

Methods

Purely Point-based Framework

Here is a brief overview of the new framework. Given an image with N individuals, N points are used to represent the center of each individual's head. The network outputs two things: the predicted head centers P and the confidence C of each predicted point. The goal is to make the predicted points as close as possible to the ground-truth points, with sufficiently high confidence.

Compared with traditional counting methods, the framework provides individual positions, which helps tasks built on analyzing moving crowds, such as person tracking, activity recognition, and anomaly detection. In addition, the framework does not depend on labor-intensive box annotations, inaccurate pseudo boxes, or tricky post-processing; it benefits from the high-precision localization of the original point annotations, especially in highly crowded regions.

Therefore, this new framework deserves more attention, given its advantages over traditional crowd counting and its practical value. However, the task is very challenging due to severe occlusion, density variation, and labeling errors; it was even considered ideal but infeasible in [13].

Density Normalized Average Precision

A predicted point p̂_j is classified as a true positive (TP) only if it matches some ground-truth point p_i. The matching is guided by a criterion 1(p̂_j, p_i) based on pixel-level Euclidean distance. However, using raw pixel distance to measure affinity ignores the side effects of the large density variation within crowds. Therefore, density normalization is introduced into the matching criterion to mitigate this problem.

Simply put, for each ground-truth point we take its k nearest neighbors (k = 3) and use their average distance to normalize the pixel distance.

The formula is as follows:
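The formula itself is missing from this version of the article; the following is a reconstruction based purely on the description above (notation assumed; δ denotes the match threshold and k = 3):

```latex
1(\hat{p}_j, p_i) =
\begin{cases}
1, & d(\hat{p}_j, p_i) \,/\, d_{kNN}(p_i) < \delta \\
0, & \text{otherwise}
\end{cases}
\qquad
d_{kNN}(p_i) = \frac{1}{k} \sum_{m=1}^{k} d(p_i, p_i^{(m)})
```

Here d(·, ·) is pixel-level Euclidean distance and p_i^(m) are the k nearest neighboring ground-truth points of p_i, so a prediction must fall within a density-adaptive radius of a ground-truth point to count as a TP.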

Prediction and Ground Truth matching scheme

(a) If each ground-truth point selects its nearest proposal, multiple ground-truth points may match the same proposal, leading to under-counting. (b) If each proposal selects its nearest ground-truth point, multiple proposals may match the same ground-truth point, leading to over-counting. (c) The one-to-one matching performed by the Hungarian algorithm in this paper has neither defect, so it is suitable for direct point prediction.
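The one-to-one scheme in (c) can be sketched with `scipy.optimize.linear_sum_assignment` (SciPy's Hungarian-algorithm solver). The cost weighting below (`tau`, and subtracting confidence so high-score proposals are preferred) is illustrative, not the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_proposals(proposals, scores, gt_points, tau=0.5):
    """One-to-one matching between point proposals and ground-truth points.

    proposals: (M, 2) predicted point coordinates
    scores:    (M,)   confidence scores in [0, 1]
    gt_points: (N, 2) annotated head points, with M >= N
    tau:       assumed weight balancing distance against confidence
    Returns (rows, cols): gt_points[rows[i]] is matched to proposals[cols[i]];
    every proposal not in cols is treated as a negative sample.
    """
    # pairwise Euclidean distances, shape (N, M)
    dist = np.linalg.norm(gt_points[:, None, :] - proposals[None, :, :], axis=-1)
    # lower cost for nearby, high-confidence proposals
    cost = tau * dist - scores[None, :]
    rows, cols = linear_sum_assignment(cost)
    return rows, cols

proposals = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
scores = np.array([0.9, 0.8, 0.1])
gt_points = np.array([[0.2, 0.0], [9.8, 10.0]])
rows, cols = match_proposals(proposals, scores, gt_points)
```

Because the assignment is one-to-one, no proposal can absorb two ground-truth points (case a) and no ground-truth point can be claimed by two proposals (case b).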

P2PNet

The overall architecture of P2PNet

Based on VGG16, P2PNet first introduces an upsampling path to obtain a fine-grained deep feature map. It then uses two branches to simultaneously predict a set of point proposals and their confidence scores. The key step in the pipeline is the one-to-one matching between point proposals and ground-truth points, which determines the learning targets of the proposals.
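The two prediction branches can be sketched as follows in PyTorch. The channel width, kernel size, and number of proposals per feature-map location are assumptions for illustration; the actual P2PNet predicts offsets relative to a fixed grid of reference points, which is omitted here:

```python
import torch
import torch.nn as nn

class P2PHead(nn.Module):
    """Sketch of P2PNet's twin branches: point regression and confidence."""

    def __init__(self, in_channels=256, proposals_per_loc=4):
        super().__init__()
        # regression branch: a 2-D coordinate per proposal at each location
        self.reg = nn.Conv2d(in_channels, proposals_per_loc * 2,
                             kernel_size=3, padding=1)
        # classification branch: one confidence score per proposal
        self.cls = nn.Conv2d(in_channels, proposals_per_loc,
                             kernel_size=3, padding=1)

    def forward(self, feat):
        b = feat.shape[0]
        # flatten spatial grid into a set of point proposals
        points = self.reg(feat).permute(0, 2, 3, 1).reshape(b, -1, 2)
        scores = torch.sigmoid(self.cls(feat).permute(0, 2, 3, 1).reshape(b, -1))
        return points, scores

head = P2PHead(in_channels=64, proposals_per_loc=4)
points, scores = head(torch.randn(2, 64, 8, 8))  # 8*8*4 = 256 proposals
```

At inference, proposals whose confidence exceeds a threshold are kept directly as the predicted head locations, with no density map or box post-processing.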

The loss function is as follows:
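The formula is missing from this version of the article; a hedged reconstruction consistent with the one-to-one matching described above (symbols assumed: λ1 is a balancing weight, ξ(i) is the index of the proposal matched to ground-truth point p_i):

```latex
\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{loc},
\qquad
\mathcal{L}_{loc} = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{p}_{\xi(i)} - p_i \rVert_2^2
```

where L_cls is a cross-entropy loss over the confidence scores of all proposals, with matched proposals as positives and unmatched ones as negatives, and L_loc pulls each matched proposal toward its assigned ground-truth point.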

Conclusion

This article comes from the paper-sharing series of the public account CV Technical Guide.

