Abstract: Multi-target tracking requires initializing and localizing targets while simultaneously building their tracking trajectories in time and space. This paper formulates the task as a frame-to-frame set prediction problem and proposes TrackFormer, an end-to-end transformer-based multi-target tracking method.

This article is shared from the HUAWEI CLOUD community post "Paper Interpretation Series Fourteen: A Detailed Interpretation of TrackFormer, a Transformer-based Multi-target Tracking Method"; original author: Gu Yurun Yimai.

The challenging task of multi-target tracking requires initializing and localizing targets while building their tracking trajectories in time and space. This paper formulates the task as a frame-to-frame set prediction problem and proposes TrackFormer, an end-to-end transformer-based multi-target tracking method. The model performs data association between frames through the attention mechanism and predicts tracking trajectories across the video sequence. The transformer decoder simultaneously initializes new targets from static object queries and follows existing trajectories from track queries, updating their positions. Both types of queries attend, through self-attention and encoder-decoder attention, to global frame-level features. Additional post-processing and matching steps can therefore be omitted, and no explicit motion or appearance model is needed.

1. Motivation

The multi-target tracking task must follow the trajectories of a set of targets and keep their tracking IDs distinct as the targets move through the video sequence. Existing tracking-by-detection methods generally involve two steps: (1) detecting targets in individual video frames, and (2) associating the detections across frames to form a trajectory for each target. Traditional data association in tracking-by-detection usually relies on graph optimization, or on convolutional neural networks that predict affinity scores between targets. This paper proposes a new paradigm, tracking-by-attention, which models multi-target tracking as a set prediction problem and realizes an end-to-end trainable online multi-target tracking network, TrackFormer. The network encodes convolutional image features with a transformer encoder, then decodes query vectors into bounding boxes with corresponding identity IDs; track queries are used to associate data between frames.

2. Network structure

TrackFormer, proposed in this paper, is an end-to-end transformer-based multi-target tracking method. It models multi-target tracking as a set prediction problem and introduces the new tracking-by-attention paradigm. The following introduces the network from three aspects: the overall pipeline, the tracking process, and the loss function.

Figure 1 Trackformer training flow chart

2.1 Multi-target tracking as set prediction

Given a video sequence containing K targets with distinct identities, the multi-target tracking task is to generate a tracking trajectory for each target, i.e. a sequence of bounding boxes associated with an identity k:

T_k = {b^k_{t1}, b^k_{t2}, ...}

where the subset (t1, t2, ...) of the total T frames records the time steps from the target entering the scene to leaving it.

In order to model MOT (multi-object tracking) as a set prediction problem, this paper uses the transformer's encoder-decoder structure. The model performs online tracking through the following four steps, outputting the bounding box, class, and identity ID of each target in every frame:

1) Extract frame-level features through a common convolutional neural network backbone, such as ResNet

2) Complete frame-level feature encoding through the self-attention module of the transformer encoder

3) Complete the decoding of the query entity through the self- and cross-attention of the decoder of the transformer

4) The output of the decoder is mapped through the multilayer perceptron to complete the prediction of the bounding box and category
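The four steps above can be sketched as a minimal PyTorch module. This is a hypothetical illustration, not the authors' code: layer sizes, the single-conv "backbone", and all names (`MiniTrackFormer`, `object_queries`, `class_head`, `box_head`) are assumptions.

```python
import torch
import torch.nn as nn

class MiniTrackFormer(nn.Module):
    """Minimal, hypothetical sketch of the four-step pipeline."""
    def __init__(self, d_model=64, n_object=10, n_classes=2):
        super().__init__()
        # 1) frame-level features from a CNN backbone (stand-in for ResNet)
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # 2) + 3) transformer encoder (self-attention over frame features)
        #         and decoder (self- and cross-attention over queries)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        # static object queries that detect new targets
        self.object_queries = nn.Embedding(n_object, d_model)
        # 4) prediction heads: class (+1 for background) and box (cx, cy, w, h)
        self.class_head = nn.Linear(d_model, n_classes + 1)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, images, track_queries=None):
        feats = self.backbone(images)              # (B, C, H', W')
        src = feats.flatten(2).transpose(1, 2)     # (B, H'*W', C) tokens
        b = images.shape[0]
        queries = self.object_queries.weight.unsqueeze(0).expand(b, -1, -1)
        if track_queries is not None:              # append track queries from t-1
            queries = torch.cat([queries, track_queries], dim=1)
        hs = self.transformer(src, queries)        # decoded query embeddings
        return self.class_head(hs), self.box_head(hs).sigmoid(), hs
```

Passing track queries from the previous frame simply lengthens the decoder's query dimension, so detection and tracking share one decoding pass.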

The decoder uses two kinds of attention: (1) self-attention over all query vectors, which lets the queries account for all targets in the scene; (2) encoder-decoder (cross-)attention, which gives the queries access to the global visual information of the current frame. In addition, because the transformer is permutation-invariant, positional encodings must be added to the feature input and learned embeddings to the decoder queries.
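Because attention itself carries no notion of position, some positional signal must be injected into the flattened frame features. A minimal sketch using standard sinusoidal encodings (a common choice; the exact scheme the model uses may differ):

```python
import numpy as np

def add_positional_encoding(features):
    """Add sinusoidal positional encodings to flattened frame features.

    Assumes `features` has shape (num_tokens, d_model) with even d_model.
    """
    n, d = features.shape
    pos = np.arange(n)[:, None]                      # token index
    freq = 10000.0 ** (-np.arange(0, d, 2) / d)      # per-channel frequency
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(pos * freq)                 # even channels
    pe[:, 1::2] = np.cos(pos * freq)                 # odd channels
    return features + pe
```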

2.2 Tracking process based on decoder query vector

The transformer's decoder query vectors are initialized in two ways: (1) static object queries, which allow the model to initialize new targets in any frame; (2) autoregressive track queries, which are responsible for following a target from frame to frame. The transformer decodes object and track queries at the same time, unifying detection and tracking, which is why the tracking-by-attention paradigm is introduced. The detailed network structure is shown in the figure below:

Figure 2 Trackformer network structure

2.2.1 Tracking initialization

New targets in the scene are detected by a fixed number N_object of object query vectors. These N_object query vectors are learned continuously during training so that they can encode any target in the scene; the transformer decoder then decodes them into class and location predictions for the new targets, thereby initializing their tracks.

2.2.2 Tracking query

To achieve frame-to-frame tracking, this paper introduces the concept of a "track query" in the decoding process. A track query follows its target through the video sequence: it carries the target's identity information while adaptively updating the target's position prediction in an autoregressive manner. To this end, the output embedding of each target detected in the previous frame is used to initialize a track query for the current frame; during decoding, encoder-decoder attention between the current frame's features and the query vectors then updates the identity and location of every instance carried by a track query.

The track query vectors are shown as the colored squares in Figure 1. The transformer output embeddings of the previous frame initialize the query vectors of the current frame, which then attend to the current frame's features to complete target tracking between frames.
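The bookkeeping of this autoregressive propagation can be sketched as follows. This is a hypothetical simplification: `decode` stands in for the full encoder-decoder pass, the keep-threshold and all names are assumptions, and the real model's track management (e.g. re-identification, query masking) is omitted.

```python
import numpy as np

def track_video(frames, decode, n_object=10, d=64, keep_thresh=0.5):
    """Sketch of frame-to-frame query handling.

    `decode(frame, queries)` must return (embeddings, scores, boxes)
    aligned with the input queries.
    """
    rng = np.random.default_rng(0)
    object_queries = rng.normal(size=(n_object, d))   # learned in practice
    track_queries, track_ids, next_id = np.zeros((0, d)), [], 0
    results = []
    for frame in frames:
        queries = np.concatenate([object_queries, track_queries])
        emb, scores, boxes = decode(frame, queries)
        new_track_q, new_ids = [], []
        for i in range(len(queries)):
            if scores[i] < keep_thresh:
                continue                 # unmatched query: background / track ends
            if i < n_object:             # object query fired: new target, new id
                tid, next_id = next_id, next_id + 1
            else:                        # track query: identity is carried over
                tid = track_ids[i - n_object]
            new_track_q.append(emb[i]); new_ids.append(tid)
            results.append((tid, boxes[i]))
        # output embeddings initialize the next frame's track queries
        track_queries = np.array(new_track_q).reshape(-1, d)
        track_ids = new_ids
    return results
```

The key point is the last two lines: whatever a query decodes to in frame t-1 becomes the starting query (and keeps its ID) in frame t.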

2.3 Network training and loss function

Because the track queries must follow targets into the next frame and interact with the object queries, TrackFormer requires dedicated frame-to-frame tracking training. As shown in Figure 1, training is performed on two adjacent frames at the same time, and all multi-target tracking objectives are optimized jointly. The set prediction loss measures, over all outputs of each frame, the similarity of the predicted classes and bounding boxes to the ground-truth targets. It is computed over two parts:

1) the loss over the N_object object queries on the previous frame (t-1);

2) the loss on the current frame (t) over all N queries: the track queries carried over from step 1 plus the object queries for newly detected targets.

Because the transformer's output is unordered, the output embeddings must be matched to the ground-truth labels before the set prediction loss can be computed. This matching uses the track IDs together with bounding-box and class similarity. Let K_{t-1} denote the set of track IDs in frame t-1 and K_t the set in the current frame t. With these two sets, a hard matching between the N_track track queries and the ground-truth labels falls into three cases: (1) IDs in the intersection of K_{t-1} and K_t are matched directly to the track query embedding carrying the same ID; (2) IDs in K_{t-1} but not in K_t have left the scene, and the corresponding track queries are matched to the background label; (3) IDs in K_t but not in K_{t-1} are new targets, whose ground-truth boxes and classes are matched to the object queries by the Hungarian algorithm so as to minimize the matching cost. The matching process is shown in the following formula:
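The three cases can be sketched in code. This is a hypothetical illustration (names and the cost-matrix convention are assumptions); `linear_sum_assignment` from SciPy implements the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(track_ids_prev, gt_ids, cost_obj):
    """Sketch of the three-case matching.

    track_ids_prev: identity carried by each track query (from frame t-1)
    gt_ids:         identities of the ground-truth targets in frame t
    cost_obj:       matching-cost matrix (class + box terms) between the
                    N_object object queries and all ground-truth targets
    """
    matches = {}
    # (1) id present in both frames: the track query matches its
    #     ground truth directly; (2) id gone in frame t: background
    for q, tid in enumerate(track_ids_prev):
        matches[('track', q)] = tid if tid in gt_ids else 'background'
    # (3) ids new in frame t: assign ground truth to object queries with
    #     the Hungarian algorithm, minimizing the matching cost
    new_cols = [j for j, g in enumerate(gt_ids) if g not in track_ids_prev]
    if new_cols:
        rows, cols = linear_sum_assignment(cost_obj[:, new_cols])
        for r, c in zip(rows, cols):
            matches[('object', r)] = gt_ids[new_cols[c]]
    return matches
```

Only the genuinely new targets go through the Hungarian step; continuing tracks are resolved for free by their IDs.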
σ̂ = argmin_σ Σ_i C_match(y_i, ŷ_σ(i))

σ is the mapping from the ground-truth targets to the N_object object queries. The optimization minimizes the matching cost, which consists of a class term and a bounding-box term:
C_match(y_i, ŷ_σ(i)) = -1_{c_i≠∅} · p̂_σ(i)(c_i) + 1_{c_i≠∅} · C_box(b_i, b̂_σ(i))
C_box(b_i, b̂_σ(i)) = λ_ℓ1 · ||b_i - b̂_σ(i)||_1 + λ_iou · C_iou(b_i, b̂_σ(i))

After the matching result is obtained, the set prediction loss, covering both the track and the object query outputs, can be computed as follows:
L_MOT(y, ŷ, π) = Σ_{i=1}^{N} [ -λ_cls · log p̂_π(i)(c_i) + 1_{c_i≠∅} · L_box(b_i, b̂_π(i)) ]

π denotes the combined matching result: track-ID matching for the track queries and Hungarian matching for the object queries.
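A minimal numeric sketch of this loss, assuming a matching that maps each query index to a ground-truth index (or None for background). The generalized-IoU term of the box loss and the per-term weights used in the paper are simplified here; names and defaults are assumptions.

```python
import numpy as np

def set_prediction_loss(cls_logits, boxes, matches, gt_cls, gt_boxes,
                        l_cls=1.0, l_l1=5.0):
    """Cross-entropy + L1 box loss over matched queries.

    matches: dict mapping query index -> ground-truth index, or None
             for queries matched to background (class 0 here).
    """
    total = 0.0
    for q, g in matches.items():
        # log-softmax of the class logits for query q
        logp = cls_logits[q] - np.log(np.exp(cls_logits[q]).sum())
        if g is None:                         # background: class term only
            total += -l_cls * logp[0]
        else:                                 # matched: class + box terms
            total += -l_cls * logp[gt_cls[g]]
            total += l_l1 * np.abs(boxes[q] - gt_boxes[g]).sum()
    return total / max(len(matches), 1)
```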

3. Experimental results

Table 3-1 Tracking results on MOT17

From the results in Table 3-1, there is still a gap to the state of the art when a private detector is used, mainly because transformer-based detection is not yet as strong as the current SOTA detectors. With the shared public detections and online tracking, however, both MOTA and IDF1 improve significantly.

Table 3-2 Tracking results on MOTS20

In addition to target detection and tracking, TrackFormer can also predict instance-level segmentation masks. Table 3-2 shows that TrackFormer outperforms the existing SOTA methods on both the cross-validation results and the test set.


