1. Background

ETA (Estimated Time of Arrival) is the estimated time it takes a rider to deliver a takeout order to the user after the order is placed. The predicted delivery time is displayed on the user's client page as the "estimated delivery time" and is a critical parameter in the delivery system: it directly affects users' willingness to order, capacity scheduling, and rider assessment, and therefore the overall cost and user experience of the delivery system.

For the delivery system as a whole, ETA is not only the entry point and a global constraint of the system, but also its regulating hub. Specifically:

  • ETA must be shown at the moment the user places an order, and the estimate then accompanies the order throughout its life cycle: it is first given to the user as a punctuality promise, then used by the dispatching system as the basis for and a constraint on order assignment, and finally followed by the rider when delivering; whether delivery is on time also serves as an assessment result for the rider.
  • As the regulating hub of the system, ETA needs to balance the interests of users, riders, merchants, and the delivery platform. Users want delivery as fast as possible and on time; from the rider's point of view, an ETA that is too short creates great pressure; from the scheduling perspective, an ETA that is too long or too short hurts delivery efficiency; and merchants want as many orders as possible to be delivered, since this is tied to their income.

Under such multi-dimensional constraints, modeling and estimating the ETA of takeout delivery becomes more complex. Compared with ETA in the taxi-hailing scenario, ETA in the takeout scenario faces the following challenges:

  • In the takeout scenario, ETA is an important part of the fulfillment commitment to the customer, so both users and riders have high requirements for its accuracy. In the taxi-hailing scenario, users care more about whether they can get a taxi at all; ETA is only a reference, and drivers are not particularly concerned about its accuracy.
  • Since the delivery ETA carries the fulfillment commitment, whether an order is delivered on time according to the ETA is both an assessment metric for riders and an important indicator for the delivery system as a whole. Once the commitment is given, both the dispatching system and the rider will do their best to deliver on time. An ETA that is too short therefore puts great pressure on riders, reduces the room for scheduling and order assignment, and raises delivery cost; an ETA that is too long badly hurts the user experience.
  • The delivery ETA scenario involves more links. The rider's full fulfillment process includes traveling to the merchant, the merchant preparing the food, waiting for pickup, route planning, delivering to different buildings, and so on. Moreover, a high orders-per-rider ratio couples concurrent orders together, so the uncertainty is large and a reasonable estimate is harder to make.

The figure below shows the timeline of the rider's whole fulfillment process, which involves a variety of duration parameters. There are more than ten nodes, of which seven are key durations. These durations involve multiple parties, such as riders (traveling to the merchant, picking up, delivering), merchants (food preparation), and users (handover at delivery), and they span both indoor and outdoor scenes, so the challenge is considerable. ETA modeling is therefore not a single simple time estimate but a time estimate over the whole link, and at the same time it must balance order volume, capacity, and user conversion rate. The evolution of delivery ETA includes continuous improvement of data and features, as well as the evolution of the model from LR to XGBoost to FM to DeepFM to a custom structure.

For ETA's position in the overall delivery business and the delivery division's machine learning practice as a whole, please refer to "Practice of Machine Learning in Meituan's Delivery System: Restoring the Real World with Technology".

2. Model improvement in business process iteration

2.1 Iteration and selection of basic model

Similar to the iteration path of most CTR models, the business iteration of the delivery ETA model went through LR -> tree model -> Embedding -> DeepFM -> targeted structural modification, and the features were likewise iterated and enriched.

  • At the model level, iteration went from linear combinations of features, to dense-feature fusion in tree models, to Embedding for ID-class features, to second-order feature combinations via the low-rank decomposition of the FM mechanism, and finally to adjusting the model according to business-metric requirements.
  • At the feature level, features were gradually enriched to user profile / rider profile / merchant profile / address features / trajectory features / regional features / time features / sequence features / order features, among other dimensions.

After comparing Wide&Deep, DeepFM, AFM, and other common models, DeepFM was finally selected as the initial base model in consideration of computational performance and effect. The whole DeepFM model embeds the features and then adds a deep part on top of the FM (Factorization Machine) part, fusing sparse and dense features separately in a targeted way. The FM part handles first-order terms and second-order feature interactions via inner products of latent vectors, while the DNN part learns higher-order feature interactions through feed-forward layers. During training, learning rate decay, gradient clipping, optimizer selection, dropout, and activation function selection were all applied; they are not described further here.
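As a rough illustration of this structure, the following TF 1.x sketch builds a minimal DeepFM forward pass for a regression target. It is not Meituan's production code; the vocabulary size, field count, embedding size, and layer widths are illustrative assumptions.

```python
import tensorflow as tf  # TF 1.x API, matching the rest of this article

NUM_FEATURES = 10000   # assumed total vocabulary size over all ID features
NUM_FIELDS = 20        # assumed number of feature fields per sample
EMBED_DIM = 8          # assumed latent dimension k

feat_ids = tf.placeholder(tf.int32, [None, NUM_FIELDS])    # feature indices
feat_vals = tf.placeholder(tf.float32, [None, NUM_FIELDS]) # feature values

# First-order weights and the latent embedding matrix
w = tf.get_variable("w", [NUM_FEATURES, 1])
v = tf.get_variable("v", [NUM_FEATURES, EMBED_DIM])

# FM first-order term: sum_i w_i * x_i
first_order = tf.reduce_sum(
    tf.squeeze(tf.nn.embedding_lookup(w, feat_ids), -1) * feat_vals, 1)

# FM second-order term via the (square-of-sum minus sum-of-squares) / 2 identity
emb = tf.nn.embedding_lookup(v, feat_ids) * tf.expand_dims(feat_vals, -1)
sum_sq = tf.square(tf.reduce_sum(emb, 1))
sq_sum = tf.reduce_sum(tf.square(emb), 1)
second_order = 0.5 * tf.reduce_sum(sum_sq - sq_sum, 1)

# Deep part: feed-forward layers over the concatenated embeddings
deep = tf.reshape(emb, [-1, NUM_FIELDS * EMBED_DIM])
for units in (128, 64):
    deep = tf.layers.dense(deep, units, activation=tf.nn.relu)
deep_out = tf.squeeze(tf.layers.dense(deep, 1), -1)

# Regression output: ETA in minutes (no sigmoid, unlike the CTR use of DeepFM)
eta_pred = first_order + second_order + deep_out
```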

2.2 Loss function

In the ETA estimation scenario, on-time rate and confidence are important business metrics. A first attempt was to replace the squared loss with an absolute loss, which intuitively matches MAE's stricter constraint compared with MSE. With an appropriate learning rate decay, training converges and the result is stable.

At the same time, for the same committed ETA, under the constraint of N minutes before or after it, being 1 minute early is better than being 1 minute late. The loss function was therefore designed to push the overall estimate slightly earlier: the numerical penalty is lowered appropriately on the early side and raised appropriately on the late side. After several rounds of tuning, N minutes before, N minutes after, and the origin were finally chosen as the three segmentation points. On top of the original absolute loss, an absolute function with 1.2x slope was designed on the early segment and one with 1.8x slope on the late segment, so that the results converge toward the center overall and the estimates are more likely to allow early delivery. This significantly improved all ETA metrics.
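A minimal sketch of such an asymmetric piecewise absolute loss follows. The assignment of the 1.2 and 1.8 slopes to the early/late segments is our reading of the description above, not the production formula, and for simplicity this sketch only changes slope outside the +-N window; N=5 is an assumed value.

```python
import tensorflow as tf

N = 5.0  # half-width of the on-time window in minutes; assumed value

def eta_loss(y_true, y_pred):
    """Asymmetric piecewise absolute loss. err > 0 means the ETA is
    overestimated (the rider arrives before the promise, i.e. 'early');
    err < 0 means the delivery would be late relative to the promise."""
    err = y_pred - y_true
    # Slope 1 inside [-N, +N]; slope 1.2 beyond +N (early side);
    # slope 1.8 beyond -N (late side).
    early_extra = 0.2 * tf.nn.relu(err - N)
    late_extra = 0.8 * tf.nn.relu(-err - N)
    return tf.reduce_mean(tf.abs(err) + early_extra + late_extra)
```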

2.3 Incorporating business rules into the model

The current business architecture is "model + rules": after the model estimates an ETA value, specific business rules are superimposed on that time to meet the needs of particular scenarios. The rules are the product of multiple iterations against business metrics. Under this setup, the model and the rules are optimized separately; the influence of the rule time cannot be taken into account during model training. Yet the rule time fluctuates across different periods of the year, so after repeated iterations over a period of time the two drift further apart.

After trying different schemes, the overall rules were finally written into the TF model, and the parameters of the overall rules were adjusted inside the TF model.

  • For simple rules such as (a*b + c)*d, the rule logic can be implemented directly with TF ops; for example, with b and d as constants, a and c become learnable parameters, as in the sketch after this list.
  • For rules that are too complex, a suitable model structure can be used to fit and replace them; deep nesting of complex ops is hard to optimize jointly.
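A minimal sketch of such a learnable rule sub-graph (all names and constant values are illustrative, not the actual business rule):

```python
import tensorflow as tf

model_eta = tf.placeholder(tf.float32, [None])  # time estimated by the model

# Rule of the form (a*b + c)*d: b and d are fixed business constants,
# while a and c are learnable parameters tuned jointly with the model.
b = tf.constant(1.5)   # assumed fixed coefficient
d = tf.constant(0.9)   # assumed fixed coefficient
a = tf.get_variable("rule_a", initializer=1.0)
c = tf.get_variable("rule_c", initializer=0.0)

rule_time = (a * b + c) * d
final_eta = model_eta + rule_time  # rule time superimposed on the model output
```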

By adjusting the fitted components and parameters, multiple rules were fully implemented inside the TF model. This ultimately brought a large improvement in business metrics, and it preserved the ability to intervene in the model manually by modifying certain fixed parameters.

With this, the overall architecture is reduced to a multi-objective prediction architecture. The Shared Parameters structure commonly used in multi-task learning is adopted, and the different targets are trained alternately in proportion. Structurally, starting from the fusion layer in the middle of the model, the conventional estimation head and several rule-time heads are each implemented in TF, while the corresponding labels still come from the conventional historical values and the rule-time values respectively (a sketch follows the list below). The design considers the following points:

  • In the model estimate, the influence of the rules on the overall result (such as the superposition effect of multiple rules) is fully taken into account as a whole.
  • The rule time is fed into the model as an auxiliary label, which further helps model convergence and acts as a form of regularization.
  • Different losses are used for different targets, making targeted optimization easier and further improving the effect.
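A minimal sketch of the shared-parameters, multi-head arrangement described above, with a separate loss and train op per target (layer sizes, loss choices, and names are illustrative assumptions):

```python
import tensorflow as tf

features = tf.placeholder(tf.float32, [None, 64])
eta_label = tf.placeholder(tf.float32, [None])   # historical delivery time
rule_label = tf.placeholder(tf.float32, [None])  # rule time as auxiliary label

# Shared bottom: stands in for the fusion layer in the middle of the model
shared = tf.layers.dense(features, 32, activation=tf.nn.relu, name="shared")

# Separate heads branching off the shared layer
eta_head = tf.squeeze(tf.layers.dense(shared, 1, name="eta_head"), -1)
rule_head = tf.squeeze(tf.layers.dense(shared, 1, name="rule_head"), -1)

# Different losses per target (absolute loss for ETA, squared for the rule head)
eta_loss = tf.reduce_mean(tf.abs(eta_head - eta_label))
rule_loss = tf.reduce_mean(tf.square(rule_head - rule_label))

train_eta = tf.train.AdamOptimizer(1e-3).minimize(eta_loss)
train_rule = tf.train.AdamOptimizer(1e-3).minimize(rule_loss)
# During training, run train_eta / train_rule steps alternately in proportion.
```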

While tuning the prediction targets in this structure, two observations stood out:

  • Both fixing and not fixing the parameters of the shared network were tried; not fixing the shared parameters had a clearly better effect.
  • In general, the choice of activation function makes little difference, but in the layers from the shared part to the target-specific heads, different activation functions differ markedly.

2.4 Missing value handling

At the feature level, some missing values are inevitable in model processing. Missing-value handling borrows entirely from the method in Meituan's "Practice of the Deep Learning Ranking Model in 'Guess You Like'": for a feature x, the TF model checks whether the value is missing; if it is missing, a parameter w1 is used in its place, and if not, the model uses w2*x. Here w1 and w2 are learnable parameters trained with the network, replacing mean or zero imputation.
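A minimal sketch of this learnable imputation, assuming missingness arrives as an explicit boolean mask:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])        # raw feature value
is_missing = tf.placeholder(tf.bool, [None, 1])  # missingness indicator

w1 = tf.get_variable("w1_fill", initializer=0.0)   # learned substitute value
w2 = tf.get_variable("w2_scale", initializer=1.0)  # learned scale for present values

# Missing entries use the learned parameter w1; present entries use w2 * x.
feat = tf.where(is_missing, w1 * tf.ones_like(x), w2 * x)
```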

3. Optimization of the long-tail problem

3.1 Model prediction result + long-tail rule padding

The base model learns the overall statistical distribution, but it does not learn some long-tail cases well enough, which shows up as ETAs that are too short in long-tail situations (since ETA is used to assess riders, an underestimate is very harmful to the rider). The long tail is therefore decomposed into two parts for analysis:

  • The business long tail, i.e., the long tail caused by the overall sample distribution, mainly along dimensions such as distance and price. The greater the distance or the higher the price, the longer the actual delivery time, but the smaller the sample proportion, so the model's estimates for this part skew short overall.
  • The model long tail, i.e., the long tail caused by the uncertainty of the model's own estimates. The model learns the overall statistical distribution but is not equally "confident" about every sample. In practice, the standard deviation of the outputs of the multiple decision trees in an RF (Random Forest) is used to measure this uncertainty. The trees generated by the RF model are independent; each tree can be regarded as an expert, and multiple experts score jointly, so the standard deviation of the scores measures the degree of "disagreement" among the experts (and hence the "confidence" of the prediction). As the figure below shows, as the RF standard deviation increases, the confidence and the on-time rate of the model decrease.

With this decomposition, the problem of underestimation in the long tail is handled with a time-padding rule: the padding combines a business long-tail factor with a model long-tail factor. The business long-tail factor covers business dimensions such as distance and price, and the model long-tail factor is the RF standard deviation. The final ETA strategy is the model's prediction plus the long-tail padding time.
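A minimal sketch of computing the RF standard deviation as the model long-tail factor, here with scikit-learn's random forest (the article does not name the library; the data below is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 5)                        # stand-in features
y_train = 30 * X_train[:, 0] + rng.rand(1000)      # stand-in delivery minutes
X_new = rng.rand(10, 5)                            # samples to score

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Each tree is an independent "expert"; the std of their predictions
# measures their disagreement, i.e. the model long-tail factor.
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
rf_std = per_tree.std(axis=0)  # one uncertainty value per sample
```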

4. Engineering development practice

4.1 Training practice

Overall training process

For offline training, the following training process is adopted:

Spark raw data integration -> Spark TFRecord generation -> data-parallel training -> TensorFlow Serving GPU evaluation -> CPU inference for online prediction

With hundreds of millions of samples, the whole routine training process over several epochs takes about 4 hours. During TF training it was found that TF's actual computing efficiency is not very high and a large share of the time goes to data IO; generating the TFRecord files with Spark sped this part up by about 3.6x. In the data-parallel training part, scaling is nearly linear up to 16 cards, showing good extensibility. Since the parameter count does not exceed what a single machine can hold, the parameters are not sharded across the PS for the time being. The TensorFlow Serving GPU evaluation step is a non-essential part of the pipeline: although the chief worker's validation set is useful during training, calling the Serving GPU evaluation with Spark data covers the whole serving path in a short time and allows a large number of complex custom metrics to be specified.
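For reference, a minimal TF 1.x sketch of the TFRecord format sitting between the Spark stage and training. Feature names and shapes are illustrative; in production the records are written by Spark rather than by this loop.

```python
import tensorflow as tf

# Write a few records (in production this step is done by Spark).
with tf.python_io.TFRecordWriter("train.tfrecord") as writer:
    for feat_vals, label in [([1.0, 2.0], 30.0), ([0.5, 1.5], 25.0)]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "feats": tf.train.Feature(float_list=tf.train.FloatList(value=feat_vals)),
            "eta": tf.train.Feature(float_list=tf.train.FloatList(value=[label])),
        }))
        writer.write(example.SerializeToString())

# Parse the records back with the Dataset API as training input.
def parse(serialized):
    spec = {"feats": tf.FixedLenFeature([2], tf.float32),
            "eta": tf.FixedLenFeature([1], tf.float32)}
    parsed = tf.parse_single_example(serialized, spec)
    return parsed["feats"], parsed["eta"]

dataset = tf.data.TFRecordDataset("train.tfrecord").map(parse).batch(32)
```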

Data parallel training mode

The whole model is trained on Meituan's AFO platform, where a distributed scheme and a single-machine multi-card scheme were tried in turn. Considering production stability and result stability, the single-machine multi-card scheme is used for routine training of the online model.

  • Distributed scheme:

TF's native PS-Worker architecture is used with asynchronous data parallelism, and tf.train.MonitoredTrainingSession coordinates the whole training process. All model parameters are stored on the PS; at each step every worker pulls the parameters, computes on its own data in parallel, and pushes the gradients back to complete an update. The current model reaches a throughput of 10,000-20,000 samples/s per worker, and several epochs over hundreds of millions of samples finish within a few hours. The speedup ratio on the platform was also tested: up to about 16 cards, computing power grows linearly with the number of workers, with a slight divergence beyond 16 cards. In current business practice, 4-6 cards are basically enough to complete routine training in a short time.
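A minimal sketch of this PS-Worker setup with the TF 1.x API (host addresses, the toy loss, and the stopping step are placeholders):

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],                                  # assumed host:port
    "worker": ["worker0:2222", "worker1:2222"]})
job_name, task_index = "worker", 0                       # set per process
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                                        # PS just serves variables
else:
    # Variables live on the PS; ops run on this worker.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        w = tf.get_variable("w", initializer=1.0)
        loss = tf.square(w)                              # stand-in model loss
        train_op = tf.train.AdamOptimizer(1e-3).minimize(
            loss, global_step=global_step)

    # MonitoredTrainingSession handles init, checkpointing, and recovery;
    # each worker pushes its updates asynchronously.
    hooks = [tf.train.StopAtStepHook(last_step=1000)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```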

  • Single-machine multi-card scheme:

The PS-Worker scheme scales well on the platform, but it also has drawbacks: the RPC communication is easily disturbed by other tasks, the whole training process is bounded by the slowest worker, and asynchronous updates introduce some fluctuation in the results. Therefore, for online production, the single-machine multi-card scheme was finally chosen, trading some scalability for overall training effect and stable training speed. In the single-machine multi-card scheme, the device of each op is manually assigned across the multiple GPUs, variables are shared across the devices, and finally the losses and gradients are aggregated and the averaged gradient is applied to the model parameters.
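A minimal sketch of this single-machine multi-card pattern: ops are pinned to each GPU, variables are shared via scope reuse, and the per-tower gradients are averaged before a single update (two GPUs and a toy loss assumed):

```python
import tensorflow as tf

opt = tf.train.AdamOptimizer(1e-3)
batches = [tf.ones([32]), 2.0 * tf.ones([32])]  # stand-in per-GPU data shards
tower_grads = []

for i, batch in enumerate(batches):
    # Pin this tower's ops to GPU i; reuse shares variables across towers.
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        w = tf.get_variable("w", initializer=1.0)
        loss = tf.reduce_mean(tf.square(batch * w))  # stand-in model loss
        tower_grads.append(opt.compute_gradients(loss))

# Average gradients variable by variable, then apply one update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    avg_grads.append((tf.add_n(grads) / len(grads), grads_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)
```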

Preprocessing integrated into the TF model

During model training, low-frequency filtering of ID-class features requires vocabulary files, and continuous features need normalization. This generates a large number of preprocessing files. In offline processing they are easy to apply in Spark, with the data written out in LibSVM format and loaded into the model for training. For online prediction, however, the engineering side would have to load the multiple vocabularies and the normalization files for continuous features (files of AVG/STD values, etc.); and since the model is updated daily, the file versions for different dates would also have to be kept aligned.

To simplify the engineering work, all preprocessing files are written into the TF computation graph during model training, so that each online prediction only needs the most raw features as input and the result is obtained directly, with no engineering-side preprocessing:

  • For ID-class features, low-frequency filtering is applied first and a vocabulary is built. The TF model reads the vocabulary array list_arr into the graph and, for each request, maps the raw IDs ph_vals to the corresponding vocabulary indices ph_idx.
```python
# Vocabulary (list_arr) baked into the graph as a constant;
# ph_vals are raw IDs, ph_idx the resulting embedding indices.
tf_look_up = tf.constant(list_arr, dtype=tf.int64)
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(tf_look_up, idx_range), 0)
ph_idx = table.lookup(ph_vals) + idx_bias
```
  • For continuous features, the AVG/STD values computed in Spark are written directly into the TF computation graph as constant nodes; each input ph_in passes through the two nodes to obtain the normalized ph_out.
```python
# Per-feature mean and std baked into the graph as constants;
# ph_in is the raw feature vector, ph_out the normalized one.
constant_avg = tf.constant(feat_avg, dtype=tf.float32, shape=[feat_dim], name="avg")
constant_std = tf.constant(feat_std, dtype=tf.float32, shape=[feat_dim], name="std")
ph_out = (ph_in - constant_avg) / constant_std
```

4.2 Online prediction with the TF model

The delivery machine learning platform has a built-in model management platform that manages and schedules the models produced by algorithm training in a unified way: it manages the versions of the models used online and supports model version switching and rollback, as well as per-node model version state management.

The DeepFM model used for ETA is trained with TensorFlow and exported in the SavedModel format, so the model management platform must support the TensorFlow SavedModel format.

There are several options for loading a TensorFlow SavedModel in the online service:

  • Build a TensorFlow Serving CPU service and call it via the gRPC API or RESTful API. This scheme is simple to implement, but not compatible with the existing Thrift-based model management platform.
  • Use TensorFlow Serving on Meituan's AFO GPU platform.
  • In the model management platform itself, call the TensorFlow Java API (which wraps the native library via JNI) and add SavedModel support to the platform.

In the end, the TensorFlow Java API is used to load the SavedModel and run prediction on CPU. In tests with batch=1, the prediction time is under 1 ms, so scheme 3 was chosen as the implementation.

Remote computing mode

The C++ dynamic link library underlying the TensorFlow Java API requires the libstdc++.so shipped with GCC 4.8.3 or later. If every online business server were to support local SavedModel computation, thousands of servers would need a GCC upgrade, which is a lot of work and may introduce other risks.

Therefore, a few dozen remote computing servers were requisitioned instead: a business server only needs to serialize its input data and send it to the TensorFlow remote cluster, which returns the serialized output after computation. Only those few dozen computing servers need the GCC upgrade.

Online performance

Since going online, the model has supported the algorithm needs of multiple business parties. The TP99 of remote-cluster computation time is basically under 5 ms, which meets the business parties' computing requirements.

5. Summary and outlook

After the model was implemented and launched, business metrics improved significantly. The model will be optimized further along with the business to keep improving the effect:

  • Further enrich the multi-objective learning framework: consider the stages of the whole delivery life cycle, such as pickup, delivery, and scheduling, at the model level; model multiple objectives within the order life cycle; and improve the interpretability of the model.
  • Further upgrade feature-level fusion in the model: beyond Embedding, fuse features better with LSTM/CNN/custom-designed structures.
  • Further enrich the feature set.

About the authors

  • Kezer, a technical expert at Meituan-Dianping, currently leads strategy iteration for the machine learning group in the Delivery Algorithm Strategy Department.
  • Zhou Yue joined the Algorithm Strategy Group of Meituan's Delivery Division in 2017 and is mainly responsible for ETA strategy development.
  • Xianjie, a technical expert at Meituan-Dianping, joined Meituan in 2018 and is mainly responsible for deep learning R&D for the delivery algorithm data platform.