Abstract: Multi-object tracking is a challenging task that requires initializing and localizing targets and building their spatio-temporal trajectories. This paper formulates the task as a frame-to-frame set prediction problem and proposes TrackFormer, an end-to-end Transformer-based multi-object tracking method.

This article is shared from the Huawei Cloud Community post "TrackFormer: a Transformer-based multi-object tracking method", original author: Gu Yu Run a wheat.

Multi-object tracking is a challenging task that requires initializing and localizing targets and building their spatio-temporal trajectories. This paper formulates the task as a frame-to-frame set prediction problem and proposes TrackFormer, an end-to-end Transformer-based multi-object tracking method. The model performs data association between frames through the attention mechanism and predicts tracking trajectories across the video sequence. The Transformer decoder simultaneously initializes new targets from static object queries and follows existing tracks from track queries, updating their positions. Both types of queries attend to global frame-level features through self-attention and encoder-decoder attention. This removes the need for additional graph optimization and matching, as well as for explicit motion and appearance models.

1. Motivation

The multi-object tracking task needs to follow the trajectories of a set of targets and keep their distinct tracking IDs as the targets move through the video sequence. Existing tracking-by-detection methods generally include two steps: (1) detecting targets in individual video frames; (2) associating the detections across frames to form a trajectory for each target. Data association in traditional tracking-by-detection usually requires graph optimization or a convolutional neural network that predicts matching scores between objects. This paper proposes a new tracking-by-attention paradigm, which models multi-object tracking as a set prediction problem and implements an end-to-end trainable online multi-object tracking network, TrackFormer. The network uses a Transformer encoder to encode the image features produced by a convolutional backbone, and the decoder then decodes query vectors into bounding boxes and the corresponding identity IDs. The track queries in the network carry out the data association between frames.

2. Network structure

TrackFormer, proposed in this paper, is an end-to-end Transformer-based multi-object tracking method that models multi-object tracking as a set prediction problem and introduces the tracking-by-attention paradigm. The following introduces the network in terms of the overall pipeline, the tracking process, and the loss function.

Figure 1 TrackFormer training flow chart

2.1 Multi-object tracking as a set prediction problem

Given a video sequence containing targets with K distinct identities, the multi-object tracking task is to generate a trajectory for each identity k in K, consisting of bounding boxes and identity IDs over a subset of the T total frames (t1, t2, …), i.e., the span of time from when the object enters the scene until it leaves.
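In this notation, a track can be formalized as follows (a minimal sketch; the specific symbols are illustrative assumptions rather than the paper's exact notation):

```latex
% Sketch of the notation: the track of identity k is the sequence of its bounding
% boxes over the sub-interval of the T frames in which the object is visible.
T_k = \{ b_t^{k} \}_{t = t_{\mathrm{start}}^{k}}^{t_{\mathrm{end}}^{k}},
\qquad k \in K, \quad 1 \le t_{\mathrm{start}}^{k} \le t_{\mathrm{end}}^{k} \le T
```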

To model MOT (multi-object tracking) as a set prediction problem, this paper uses the encoder-decoder structure of the Transformer. The model performs online tracking through the following four steps, outputting the bounding box, class, and identity ID of each target in every frame (a code sketch of the pipeline is given below):

1) Frame-level features are extracted by a standard convolutional backbone such as ResNet

2) The frame-level features are encoded by the self-attention modules of the Transformer encoder

3) Query entities are decoded through the self-attention and cross-attention of the Transformer decoder

4) The decoder outputs are mapped by multi-layer perceptrons to produce the bounding box and class predictions

The decoder's attention is of two kinds: (1) self-attention over all query vectors, which lets the queries reason jointly about all targets in the scene; (2) encoder-decoder attention, which gives access to the global visual information of the current frame. In addition, because the Transformer is permutation invariant, additional positional encodings and object encodings have to be added to the feature input and to the decoded query entities, respectively.
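To make the four steps concrete, the following is a minimal PyTorch-style sketch of the pipeline. Module names, dimensions, the number of queries, and the omission of positional/object encodings are all simplifying assumptions for illustration, not the authors' released implementation:

```python
import torch
import torch.nn as nn
import torchvision


class TrackFormerSketch(nn.Module):
    """Illustrative sketch: backbone -> Transformer encoder-decoder -> prediction heads."""

    def __init__(self, d_model=256, num_queries=100, num_classes=1):
        super().__init__()
        # 1) Frame-level features from a convolutional backbone (e.g. ResNet-50).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # 2) + 3) Transformer encoder and decoder (positional/object encodings omitted here).
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        # Static object queries, learned during training, that detect new targets.
        self.object_queries = nn.Embedding(num_queries, d_model)
        # 4) Prediction heads mapping decoder outputs to class scores and boxes.
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the background class
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),  # normalized (cx, cy, w, h)
        )

    def forward(self, frame, track_queries=None):
        # frame: (B, 3, H, W); track_queries: (B, N_track, d_model) from the previous frame.
        feats = self.input_proj(self.backbone(frame))        # (B, d_model, h, w)
        memory = feats.flatten(2).permute(0, 2, 1)           # (B, h*w, d_model)
        queries = self.object_queries.weight.unsqueeze(0).expand(frame.size(0), -1, -1)
        if track_queries is not None:
            # Autoregressive track queries are decoded jointly with the object queries.
            queries = torch.cat([queries, track_queries], dim=1)
        hs = self.transformer(memory, queries)               # decoder output embeddings
        return self.class_head(hs), self.box_head(hs), hs    # class logits, boxes, embeddings
```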

2.2 Tracking process based on decoder query vector

The Transformer decoder's query vectors come from two sources: (1) static object queries, which allow the model to initialize new tracks in any frame, and (2) autoregressive track queries, which follow a target from frame to frame. Decoding object and track queries together unifies detection and tracking, which is what the tracking-by-attention paradigm refers to. The detailed network structure is shown in the figure below:

Figure 2 TrackFormer network structure

2.2.1 Track initialization

New targets in the scene are detected by a fixed number of object queries, which are learned during training so that they can encode any target appearing in the scene. The Transformer decoder then predicts the class and location of each new target, which initializes a track.
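As a hedged illustration of this initialization step, confident detections produced by the object queries can be turned into new tracks; the threshold and the dictionary layout below are assumptions, and the tensors are the outputs of the sketch model above:

```python
import torch


def initialize_tracks(class_logits, boxes, embeddings, score_thresh=0.5):
    """Illustrative sketch: start new tracks from confident object-query detections.

    class_logits: (N_object, num_classes + 1) with background as the last class,
    boxes:        (N_object, 4) predicted boxes,
    embeddings:   (N_object, d_model) decoder output embeddings.
    """
    probs = class_logits.softmax(dim=-1)
    scores = 1.0 - probs[:, -1]          # foreground score = 1 - background probability
    keep = scores > score_thresh
    # Each kept detection becomes a new track; its output embedding is reused as a
    # track query in the next frame (see Section 2.2.2).
    return {"boxes": boxes[keep], "scores": scores[keep], "queries": embeddings[keep]}
```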

2.2.2 Track queries

To track targets across frames, the decoding process introduces the concept of a track query. A track query follows its target throughout the video sequence and autoregressively adjusts the target's predicted location while carrying its identity information. To achieve this, the output embedding of a detection in one frame is used to initialize the corresponding track query for the next frame; the encoder and decoder then establish the attention between the current frame's features and the query vectors, so that the identity and location of every instance carried by the track queries are updated.

The track queries are shown in the colored boxes of Figure 1. The Transformer output embeddings of the previous frame initialize the query vectors of the current frame, which then attend to the features of the current frame, completing target tracking between frames.
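Putting the two query types together, the frame-to-frame tracking loop can be sketched as below. This is a simplified illustration that reuses the sketch model above and ignores details such as track termination heuristics and non-maximum suppression:

```python
import torch


@torch.no_grad()
def track_video(model, frames, score_thresh=0.5):
    """Illustrative online tracking loop: track queries are carried across frames."""
    track_queries = None                     # no existing tracks before the first frame
    results = []
    for frame in frames:                     # frame: (1, 3, H, W)
        logits, boxes, embeddings = model(frame, track_queries)
        scores = 1.0 - logits.softmax(dim=-1)[0, :, -1]
        keep = scores > score_thresh
        # Kept queries (newly detected or previously tracked) retain their identity, and
        # their output embeddings become the track queries for the next frame.
        track_queries = embeddings[0, keep].unsqueeze(0)
        results.append({"boxes": boxes[0, keep], "scores": scores[keep]})
    return results
```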

2.3 Network training and loss function

Because track queries must follow targets into the next frame and interact with the object queries, TrackFormer requires dedicated frame-to-frame tracking training. As shown in Figure 1, tracking is trained on two adjacent frames at a time, and all multi-object tracking objectives are optimized jointly. The set prediction loss measures the similarity between all bounding box predictions output for a frame and the ground-truth targets. It is computed in two parts:

1) The loss of the N_object object queries of the previous frame (t−1)

2) The loss of the total of N queries of the current frame t, covering both the track queries carried over from the previous step and the object queries for newly detected targets

Because the Transformer output is unordered, the output embeddings must be matched to the ground-truth labels before the set prediction loss can be computed. This matching considers the tracking IDs together with the similarity of bounding boxes and classes. First, considering the tracking IDs, let K_{t−1} denote the set of tracking IDs in frame t−1 and K_t the set of tracking IDs in the current frame t. These two sets give a hard matching between the N_track track queries and the ground-truth labels, with three cases: (1) IDs in the intersection of K_{t−1} and K_t are matched directly to their corresponding ground-truth labels; (2) IDs that appear in K_{t−1} but not in K_t have left the scene and are matched to the background class; (3) IDs that appear in K_t but not in K_{t−1} are new targets. For the new targets, the Hungarian algorithm optimizes the matching between the object queries and the ground-truth bounding boxes and classes to obtain the assignment with minimum cost. The matching process is as follows:

σ denotes the mapping between the ground-truth targets and the N_object object queries. The optimization objective is to minimize the matching cost, which includes both a class cost and a bounding box cost, as shown below:
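In the DETR-style notation that this description follows, the matching objective can be sketched as below; the exact symbols and the form of the cost are assumptions consistent with the description, not a reproduction of the paper's equation:

```latex
% Sketch of the bipartite matching objective (DETR-style; symbols assumed)
\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{N_{\mathrm{object}}}
    \mathcal{C}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big),
\qquad
\mathcal{C}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)
    = -\,\hat{p}_{\sigma(i)}(c_i) + \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big)
```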



After the matching results are obtained, the set prediction loss can be calculated finally, including the loss of tracking and target query output. The calculation method is as follows:
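A hedged sketch of this loss in the same DETR-style notation, with a cross-entropy class term and a bounding box term summed over the matched track and object queries (symbols are assumptions consistent with the description above):

```latex
% Sketch of the per-frame set prediction loss (symbols assumed)
\mathcal{L}_{\mathrm{set}}(y, \hat{y}, \pi) =
    \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\pi(i)}(c_i)
    + \mathbf{1}_{\{c_i \neq \emptyset\}} \, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\pi(i)}\big) \Big]
```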



π is the matching between the outputs and the ground truth, obtained from the tracking IDs and the Hungarian algorithm.
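A hedged sketch of how the identity-based matching described above could be implemented; the function name, inputs, and the simplified cost matrix are assumptions, with SciPy's Hungarian solver handling the new-target case:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_queries_to_targets(track_ids_prev, gt_ids_curr, object_query_costs):
    """Illustrative matching for one training frame pair.

    track_ids_prev:     identity IDs carried by the N_track track queries (frame t-1)
    gt_ids_curr:        ground-truth identity IDs present in frame t
    object_query_costs: (N_object, N_new) cost matrix (class + box costs) between the
                        object queries and the ground-truth boxes of new identities
    """
    gt_set = set(gt_ids_curr)
    track_matches, background = {}, []
    for q_idx, track_id in enumerate(track_ids_prev):
        if track_id in gt_set:
            track_matches[q_idx] = track_id   # (1) identity still present: direct match
        else:
            background.append(q_idx)          # (2) identity left the scene: background class
    # (3) identities not present in frame t-1 are new targets, matched to object
    #     queries with the Hungarian algorithm at minimum total cost.
    rows, cols = linear_sum_assignment(np.asarray(object_query_costs))
    new_matches = list(zip(rows.tolist(), cols.tolist()))
    return track_matches, background, new_matches
```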

3. Experimental results



Table 3-1 Tracking results on MOT17

As the results in Table 3-1 show, there is still a gap in tracking performance with private detections, mainly because Transformer-based detectors do not yet match the current SOTA detectors. However, when tracking online with the shared public detections, there is a clear improvement in both MOTP and IDF1.

Table 3-2 Tracking results on MOTS20

In addition to object detection and tracking, TrackFormer can also predict instance-level segmentation masks. As Table 3-2 shows, TrackFormer outperforms the existing SOTA methods both in the cross-validation results and on the test set.
