01 Background

NeurIPS (the Conference on Neural Information Processing Systems) is a leading conference on machine learning and computational neuroscience. The INTERPRET Trajectory PREdicTion Challenge was part of the NeurIPS 2020 Competition Track workshop. The competition, sponsored by the UC Berkeley MSC Lab, aims to build a common dataset for evaluating the performance of trajectory prediction algorithms in the autonomous driving field. The Meituan Unmanned Vehicle Delivery Center team won first place on the Generalizability track and second place on the Regular track of the NeurIPS 2020 INTERPRET Trajectory Prediction Challenge.

02 Introduction to the Problem

The INTERPRET competition consists of two tracks: the Generalizability track and the Regular track. In the Generalizability track, the test-set trajectories differ substantially from the training-set trajectories (they are taken from different scenes), and no high-definition (HD) map is provided. In the Regular track, the test trajectories follow the same distribution as the training trajectories (they come from the same scenes), and an HD map is available. The data were collected in several countries, including the US, China, and Germany, covering scenarios such as highway merging/lane changes, urban roads, roundabouts with stop/yield signs, and unprotected left turns. The obstacles in these scenes include pedestrians, bicycles, and motor vehicles.

In this competition, teams are required to predict the trajectory of each obstacle over the next three seconds (30 frames) based on its trajectory over the past one second (10 frames). An obstacle's trajectory is represented by a set of discrete sample points at 10 Hz, i.e., one trajectory point every 0.1 seconds. The competition allows teams to output up to 50 predicted trajectories per obstacle, but ranks them only by the Average Displacement Error (ADE) of the best trajectory (rank 1). The ADE is computed as follows:


$$ADE=\frac{1}{N T_p} \sum_{i=1}^{N} \sum_{t=1}^{T_p} \left\lVert Y_i^t - \hat{Y}_i^t \right\rVert_2$$

where $N$ is the number of obstacles, $T_p$ is the number of predicted time steps, $Y$ is the ground-truth trajectory, and $\hat{Y}$ is the predicted trajectory.
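As a concrete illustration, below is a minimal NumPy sketch of this metric, together with a best-of-K variant matching the "score only the best of the submitted trajectories" rule. The array shapes, function names, and the per-obstacle aggregation of the best-of-K rule are our own assumptions, not the official evaluation code.

```python
import numpy as np

def ade(y_true, y_pred):
    """ADE for a single predicted trajectory per obstacle.

    y_true, y_pred: (N, T_p, 2) arrays of ground-truth / predicted points.
    """
    # L2 displacement at each step, averaged over all N * T_p points.
    return np.linalg.norm(y_true - y_pred, axis=-1).mean()

def min_ade(y_true, y_pred_k):
    """Best-of-K ADE: per obstacle, keep only the closest of K candidates
    (the standard minADE convention; the challenge's exact aggregation
    may differ).

    y_pred_k: (K, N, T_p, 2) array, K candidate trajectories per obstacle.
    """
    per_candidate = np.linalg.norm(y_true[None] - y_pred_k, axis=-1).mean(axis=2)  # (K, N)
    return per_candidate.min(axis=0).mean()
```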

03 Algorithm Introduction

Part 1 Map Data Processing

Since the data forms of the Generalizability track and the Regular track differ (the Regular track provides an HD map while the Generalizability track does not), we represent the scene in two different forms to keep the algorithm effective, as shown in Figure 1.

In the Regular track, an HD map accompanies every test set, and we can query the map for the lane lines near any location (as shown in Figure 1-left, the road topology of the scene is complete). In the Generalizability track, the test set provides no corresponding HD map, so complete structural information about the road cannot be obtained. To handle this, we design a location-based semantic map to describe the drivable area in these unstructured scenarios. The location-based semantic map relies on the historically observed trajectories of obstacles in the scene, and is drawn in three main steps:

  1. A scene area of fixed size (e.g., 50 m x 50 m) is divided into a $W \times H$ grid ($W=250$, $H=250$) at resolution $D$ (e.g., 0.2 m);
  2. From the historical trajectory of each obstacle, the heading of every trajectory point is computed and assigned to the grid cell at the corresponding position;
  3. The heading information accumulated in each cell is aggregated into a direction descriptor: the 360 degrees are divided into 8 bins of 45 degrees each, the number of trajectory points falling into each bin is counted, and the counts are normalized.

Finally, the location-based semantic map has size $H \times W \times 8$.
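To make the construction concrete, here is a minimal NumPy sketch of the direction-descriptor grid. The function name, the heading computation from consecutive displacements, and the bin boundaries are illustrative assumptions, not the exact competition code.

```python
import numpy as np

W, H, D = 250, 250, 0.2       # grid size and resolution (0.2 m per cell)
N_BINS = 8                    # 8 direction bins of 45 degrees each

def build_semantic_map(trajectories):
    """Build an H x W x 8 location-based semantic map.

    trajectories: list of (T_i, 2) arrays of (x, y) positions in meters,
                  relative to the scene origin.
    """
    grid = np.zeros((H, W, N_BINS))
    for traj in trajectories:
        # Heading of each point, estimated from consecutive displacements.
        deltas = np.diff(traj, axis=0)
        headings = np.arctan2(deltas[:, 1], deltas[:, 0])          # (-pi, pi]
        bins = ((headings + np.pi) / (2 * np.pi) * N_BINS).astype(int) % N_BINS
        # Drop each point into the grid cell at its position.
        cols = (traj[:-1, 0] / D).astype(int)
        rows = (traj[:-1, 1] / D).astype(int)
        valid = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
        np.add.at(grid, (rows[valid], cols[valid], bins[valid]), 1.0)
    # Normalize each cell's direction histogram into a descriptor.
    totals = grid.sum(axis=-1, keepdims=True)
    return np.where(totals > 0, grid / np.maximum(totals, 1e-9), 0.0)
```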

Part 2 Prediction Model Design

An important problem in the design of trajectory prediction algorithms is how to model the complex interaction between an obstacle and its surrounding environment, which usually contains multiple types of traffic elements such as other traffic participants, road network topology, and traffic lights. Existing prediction algorithms model obstacle interaction in different ways: early work relied on simple positional relations [1-3], later work encoded interaction with semantic maps and CNNs [4-6], and more recent work uses (graph) attention mechanisms [7-11]. As the understanding of obstacle interaction has deepened and new techniques have iterated, the accuracy of trajectory prediction algorithms has gradually improved. In this competition, we propose a prediction algorithm based on a mixed attention mechanism that solves the prediction problem of both tracks in a unified form. The algorithm builds on the currently mainstream graph attention mechanism; the overall idea is to introduce mixed attention so that the algorithm extracts obstacle motion features and lane topology features more accurately, while also encoding the complex interactions between obstacles and between obstacles and lanes.

Figure 2 shows the overall structure of the algorithm. The whole model follows the mainstream Encoder-Decoder design and consists of a feature encoding network and an interaction & prediction network. The feature encoding network uses a Timewise + Agentwise dual-attention mechanism and a dual-channel GRU to perform high-quality feature enhancement and temporal encoding of obstacle trajectories and map information. The interaction & prediction network uses Agentwise + Conditional Attention to model the interaction between agents and outputs multi-modal predicted trajectories with their probabilities. Both networks are graph networks built on mixed attention, and their cores are the Enc-MAT and Dec-MAT (Mixture Attention Transformer Encoder) modules. Enc-MAT and Dec-MAT are improved versions of the existing Transformer Encoder (BERT-like) structure; Figure 3 compares the traditional Transformer Encoder with Enc-MAT and Dec-MAT.

As Figure 3 shows, compared with a traditional Transformer Encoder, Enc-MAT and Dec-MAT add an extra attention channel, hence the name mixed attention mechanism. The Enc-MAT encoder uses a Timewise + Agentwise hybrid attention mechanism, while the Dec-MAT encoder uses an Agentwise + Conditional (the same as Distance-based Attention in Figure 3-c) hybrid attention mechanism. The algorithm replaces a single attention mechanism with mixed attention to strengthen the feature representations of obstacles and environment topology as actually needed. The lower part of Figure 2 shows the three attention structures. In terms of computation, the three attention modes share the same formula:


$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The three attention modes differ in how the $Q$, $K$, $V$ features of the attention module are generated:

  1. In Timewise Attention: $Q=\phi_{q,ta}(e;\, W_{q,ta})$, $K=\phi_{k,ta}(e;\, W_{k,ta})$, $V=\phi_{v,ta}(e;\, W_{v,ta})$, with $e \in \mathbb{R}^{N \times T \times D}$;

  2. In Agentwise Attention: $Q=\phi_{q,aa}(e;\, W_{q,aa})$, $K=\phi_{k,aa}(e;\, W_{k,aa})$, $V=\phi_{v,aa}(e;\, W_{v,aa})$, with $e \in \mathbb{R}^{T \times N \times D}$;

  3. In Conditional Attention: $Q=\phi_{q,ca}(c;\, W_{q,ca})$, $K=\phi_{k,ca}(e;\, W_{k,ca})$, $V=\phi_{v,ca}(e;\, W_{v,ca})$, with $e \in \mathbb{R}^{N \times T \times D}$ and $c \in \mathbb{R}^{N \times T \times S}$.

In the formulas above, $e$ is the embedding feature of the input data (such as trajectories and lanes), of size $N \times T \times D$; $c$ is additional conditional input containing manual priors, of size $N \times T \times S$, which will be explained below; $\phi_*(\cdot)$ is an encoding function, for example a linear layer, and $W_*$ are the parameters of the corresponding encoding function. In addition, all three attention modules apply a residual Add of the input.
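The following NumPy sketch illustrates, under our own simplified assumptions (single head, no masking, no residual Add, linear $\phi$ realized as plain matrix multiplication), how the same scaled dot-product attention becomes timewise, agentwise, or conditional simply by choosing which axis plays the sequence role and where the query comes from:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, applied over the last two axes."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

N, T, D = 4, 10, 32                      # obstacles, time steps, feature dim
e = np.random.randn(N, T, D)             # embedding of obstacle trajectories
Wq, Wk, Wv = (np.random.randn(D, D) for _ in range(3))

# Timewise Attention: e is (N, T, D), so each of the N slices is a
# length-T sequence and attention runs across time per obstacle.
timewise = scaled_dot_product_attention(e @ Wq, e @ Wk, e @ Wv)

# Agentwise Attention: transpose to (T, N, D) so attention runs across
# the N obstacles independently at each time step.
e_t = e.transpose(1, 0, 2)
agentwise = scaled_dot_product_attention(e_t @ Wq, e_t @ Wk, e_t @ Wv)

# Conditional Attention: the query comes from the conditional input c
# (e.g., positions), while keys and values still come from e.
S = 2
c = np.random.randn(N, T, S)
Wq_c = np.random.randn(S, D)
conditional = scaled_dot_product_attention(c @ Wq_c, e @ Wk, e @ Wv)
```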

Part 3 Trajectory Prediction Process

First, a few words about notation:

Assume that the number of obstacles in the scene is $N$, the historical observation duration is $T$, and the historical observed trajectory of all obstacles is $P$, of size $N \times T \times 6$; the 6-dimensional features include coordinates, velocities, and category. The prediction horizon is $T_p$, the predicted trajectory is $\hat{Y}$, and the ground-truth future trajectory is $Y$. When a single trajectory is output, both $\hat{Y}$ and $Y$ have size $N \times T_p \times 2$.

In addition, when an HD map is available, assuming the scene contains $K$ lanes each with $L$ sampling points, the scene (discrete lanes) can be expressed as $M$ of size $K \times L \times 2$. When no HD map is available, the location-based semantic map is used directly, and the scene (semantic image) is likewise represented as $M$, of size $H \times W \times 8$.

The encoding process mainly consists of obstacle history trajectory encoding and scene topology encoding. Taking obstacle feature encoding as an example, the process has two steps:

  1. Enc-MAT encoding. Given the observed obstacle trajectories $P$, the algorithm first uses Enc-MAT to enhance the features of each obstacle. In Enc-MAT, the first attention channel performs Timewise Attention over the time dimension, strengthening the information of each moment of an obstacle using the information of its other historical moments. The second attention channel performs Agentwise Attention over the obstacle dimension, strengthening the information of each obstacle at every moment using the position information of the other obstacles. Finally, the features of the two channels are concatenated to obtain the obstacle trajectory feature $R_{agent}$;

  2. Slow + Fast Channel GRU sequence encoding. The feature $R_{agent}$ is the obstacle information strengthened by other moments (or obstacles). We then extract the refined motion features and the main motion features of obstacles with a Slow Channel GRU and a Fast Channel GRU, respectively (see the sketch after this list). The Slow Channel uses all temporal information of $R_{agent}$ ($N \times T \times 2D$) and obtains $H_1$ through temporal encoding by a Slow GRU. The Fast Channel first downsamples $R_{agent}$ in time ($N \times T_{down} \times 2D$) and then obtains $H_2$ through a Fast GRU. Combining $H_1$ and $H_2$ yields the final sequence encoding of the obstacles, $H_{agent}$, which records their temporal motion characteristics.
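A minimal PyTorch sketch of the dual-channel GRU encoder, assuming temporal downsampling by simple striding and feature fusion by concatenation (both assumptions of ours; the post does not specify these details):

```python
import torch
import torch.nn as nn

class SlowFastGRU(nn.Module):
    """Dual-channel temporal encoder: a Slow GRU over the full sequence
    and a Fast GRU over a temporally downsampled copy."""

    def __init__(self, feat_dim, hidden_dim, stride=2):
        super().__init__()
        self.stride = stride
        self.slow_gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fast_gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, r_agent):
        # r_agent: (N, T, 2D) enhanced obstacle features from Enc-MAT.
        _, h1 = self.slow_gru(r_agent)                    # full-rate channel
        _, h2 = self.fast_gru(r_agent[:, ::self.stride])  # downsampled channel
        # Fuse the two channels; h1/h2 have shape (1, N, hidden_dim).
        return torch.cat([h1[-1], h2[-1]], dim=-1)        # (N, 2 * hidden_dim)

encoder = SlowFastGRU(feat_dim=64, hidden_dim=64)
h_agent = encoder(torch.randn(4, 10, 64))                 # H_agent: (4, 128)
```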

A similar method is used to encode the road topology, with two differences from trajectory encoding:

  1. Feature enhancement differs. To simplify computation, we enhance the scene information with only a conventional Transformer Encoder performing single-channel Timewise Attention, obtaining the feature $R_{env}$. In HD-map mode, $R_{env}$ is a three-dimensional feature of size $K \times L \times D$; in semantic-map mode, $R_{env}$ is a three-dimensional feature of size $1 \times HW \times D$.
  2. Temporal feature encoding differs. Based on the feature $R_{env}$, in HD-map mode a Fast Channel bi-GRU directly encodes the bidirectional road topology feature. In semantic-map mode, we use a Fast X bi-GRU and a Fast Y bi-GRU to encode along the horizontal and vertical directions of the image, respectively (see the sketch below). Either way, the result is the feature $H_{env}$.
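A minimal PyTorch sketch of the two map-encoding modes; the dimension choices and the additive fusion of the X and Y passes are illustrative assumptions:

```python
import torch
import torch.nn as nn

D, HID = 64, 64

# HD-map mode: one bi-GRU over the L sampling points of each of the K lanes.
lane_bigru = nn.GRU(D, HID, batch_first=True, bidirectional=True)
r_env_hd = torch.randn(6, 20, D)               # (K, L, D) lane features
_, h = lane_bigru(r_env_hd)                    # h: (2, K, HID)
h_env_hd = h.permute(1, 0, 2).reshape(6, -1)   # (K, 2 * HID) per-lane feature

# Semantic-map mode: separate bi-GRUs scan the image rows (X) and columns (Y).
H, W = 250, 250
x_bigru = nn.GRU(D, HID, batch_first=True, bidirectional=True)
y_bigru = nn.GRU(D, HID, batch_first=True, bidirectional=True)
r_env_sem = torch.randn(H, W, D)               # (H, W, D) semantic-map features
x_out, _ = x_bigru(r_env_sem)                  # scan each row along W
y_out, _ = y_bigru(r_env_sem.transpose(0, 1))  # scan each column along H
# Fuse the two directional encodings into one map feature (assumed: add).
h_env_sem = x_out + y_out.transpose(0, 1)      # (H, W, 2 * HID)
```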

The decoding process consists of two stages: high-level interaction and trajectory prediction. The former uses Dec-MAT, a mixed attention network, while the latter uses a basic MLP for multi-task prediction of trajectories and probabilities. Before describing the process, we state two reasonably safe facts:

  • Fact 1: The movement direction of an obstacle is correlated with the direction of the lanes in the scene (motion trend correlation).
  • Fact 2: The movement of an obstacle depends more strongly on the adjacent lanes that are closer to it (relative position correlation).

Based on these two facts, the two stages of the decoder can be described as follows:

  1. High-level interaction stage. The obstacle sequence features $H_{agent}$ and the scene topology features $H_{env}$ form a global graph, which can be represented as a mixture matrix $H_{mix}$ (of size $(N+L) \times 1 \times 2D$ in HD-map mode and $(N+1) \times 1 \times 2D$ in semantic-map mode). Based on this mixed graph, we again use Dec-MAT, a dual-channel mixed attention module, for feature enhancement and encoding. The first channel performs Agentwise Attention over the first dimension of $H_{mix}$, carrying out global interaction according to the temporal motion features of each obstacle; this draws on Fact 1 above. The second channel introduces the position information $c$ as a condition and performs Conditional Attention on $H_{mix}$, carrying out global interaction according to the relative positions of the obstacles; this draws on Fact 2 above. Conditional Attention is also called Distance-based Attention because $c$ consists of the absolute position coordinates of obstacles and lanes. Finally, the encoded features of the two channels are added to the input feature $H_{mix}$, and after filtering out the road information we obtain the obstacle feature $G_{agent}$ (size $N \times 2D$);

  2. Trajectory prediction stage. After the backbone, $G_{agent}$ is passed through two MLP heads that output the predicted trajectories $\hat{Y}$ and their probabilities $Pr$ (see the sketch below).
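A minimal PyTorch sketch of the multi-modal prediction heads, assuming `k_modes` candidate trajectories per obstacle and a softmax over mode probabilities (the head widths and mode count are illustrative, not the competition settings):

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Two MLP heads on top of the interaction feature G_agent: one
    regresses k_modes trajectories, the other scores their probabilities."""

    def __init__(self, feat_dim, t_pred, k_modes):
        super().__init__()
        self.t_pred, self.k_modes = t_pred, k_modes
        self.traj_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, k_modes * t_pred * 2),      # (x, y) per step
        )
        self.prob_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, k_modes),
        )

    def forward(self, g_agent):
        # g_agent: (N, 2D) obstacle features after Dec-MAT interaction.
        n = g_agent.shape[0]
        y_hat = self.traj_head(g_agent).view(n, self.k_modes, self.t_pred, 2)
        pr = self.prob_head(g_agent).softmax(dim=-1)   # mode probabilities
        return y_hat, pr

heads = PredictionHeads(feat_dim=128, t_pred=30, k_modes=6)
y_hat, pr = heads(torch.randn(4, 128))   # y_hat: (4, 6, 30, 2), pr: (4, 6)
```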

With this approach, we won the Generalizability track with an ADE of 0.5339 m and finished second on the Regular track with an ADE of 0.1912 m.

04 Summary

Obstacle trajectory prediction is critical to the safe driving of unmanned vehicles, and it remains a challenging problem recognized by both academia and industry. We hope to keep building better solutions so as to continuously improve the autonomous driving system's ability to predict obstacles, and to provide stronger technical support for Meituan's delivery business and the mobility field.

05 References

  • [1] Alahi A, Goel K, Ramanathan V, et al. Social LSTM: Human trajectory prediction in crowded spaces[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 961-971.
  • [2] Gupta A, Johnson J, Fei-Fei L, et al. Social GAN: Socially acceptable trajectories with generative adversarial networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2255-2264.
  • [3] Zhu Y, Qian D, Ren D, et al. StarNet: Pedestrian trajectory prediction using deep neural network in star topology[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. 2019: 8075-8080.
  • [4] Chai Y, Sapp B, Bansal M, et al. MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction[J]. arXiv preprint arXiv:1910.05449, 2019.
  • [5] Chang M F, Lambert J, Sangkloy P, et al. Argoverse: 3d tracking and forecasting with rich maps[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 8748-8757.
  • [6] Liang J, Jiang L, Niebles J C, et al. Peeking into the future: Predicting future person activities and locations in videos[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 5725-5734.
  • [7] Mohamed A, Qian K, Elhoseiny M, et al. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 14424-14432.
  • [8] Liang M, Yang B, Hu R, et al. Learning lane graph representations for motion forecasting[C]//European Conference on Computer Vision. Springer, Cham, 2020: 541-556.
  • [9] Huang Y, Bi H K, Li Z, et al. STGAT: Modeling spatial-temporal interactions for human trajectory prediction[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6272-6281.
  • [10] Gao J, Sun C, Zhao H, et al. VectorNet: Encoding HD maps and agent dynamics from vectorized representation[J]. arXiv preprint arXiv:2005.04259, 2020.
  • [11] Zhao H, Gao J, Lan T, et al. TNT: Target-driveN trajectory prediction[J]. arXiv preprint arXiv:2008.08294, 2020.

06 Introduction to the Authors

Yan Liang, Fu Zhuang, Deheng, and Dong Chun are algorithm engineers at the Meituan Unmanned Vehicle Delivery Center.
