I. Foreword

With the continuous development of deep learning in images, language, advertising CTR prediction, and other fields, many teams have begun to explore the practice and application of deep learning at the business level. For advertising CTR prediction, new models emerge in an endless stream: Wide & Deep[1], Deep & Cross Network[2], DeepFM[3], xDeepFM[4], and many posts on the Meituan tech blog have introduced them in detail. However, when an offline model needs to go online, it runs into all kinds of new problems: whether the offline model's performance can meet online requirements, how to plug model prediction into the existing engineering system, and so on. Only with an accurate understanding of the deep learning framework can the model be deployed online well, staying compatible with the existing engineering system while meeting online latency requirements.

This article first introduces the business scenario and offline training process of the User Growth group of the Meituan platform, and then walks through the whole process of deploying the WDL model online with TensorFlow Serving and how we optimized online serving performance, hoping it will be helpful to you.

II. Business Scenario and Offline Process

2.1 Business Scenario

In the ad fine-ranking scenario, up to several hundred candidate ads are recalled for each user. Based on the user's features and the features of each ad, the model estimates the user's click-through rate for every ad and ranks them. Because the Ad Exchange imposes a timeout limit on the DSP, the average response time of our ranking module must be kept within 10ms. Meanwhile, Meituan DSP bids in real time according to the estimated click-through rate, so the requirements on the model's prediction performance are high.

2.2 Offline training

For offline data, Spark is used to generate TFRecord, TensorFlow's native data format[5], to speed up data reading.
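A minimal input_fn sketch (TF 1.x style) for reading these TFRecord files is shown below; the feature names, types, and batch size are illustrative assumptions, not the actual schema.

import tensorflow as tf

def input_fn(file_pattern, batch_size=512):
    # Parsing spec: the feature names and types here are hypothetical
    feature_spec = {
        'label': tf.FixedLenFeature([], tf.float32),
        'channel': tf.FixedLenFeature([], tf.string),
    }
    files = tf.gfile.Glob(file_pattern)
    dataset = (tf.data.TFRecordDataset(files)
               .batch(batch_size)
               .map(lambda batch: tf.parse_example(batch, feature_spec))
               .prefetch(1))
    features = dataset.make_one_shot_iterator().get_next()
    label = features.pop('label')
    return features, label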

For the model, the classic Wide & Deep model is used, with user-dimension, scene-dimension, and item-dimension features. The Wide part has more than 80 feature inputs and the Deep part has more than 60. The embedding input layer is about 600 dimensions wide, followed by three fully connected layers of 256 units each. The model has about 350,000 parameters in total, and the exported model file is about 11 MB.

For offline training, TensorFlow's distributed synchronous training with backup workers[6] is used, avoiding both the update staleness of asynchronous training and the slowness of plain synchronous training.
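As an illustration only, the sketch below shows how synchronous training with backup workers could be set up with tf.train.SyncReplicasOptimizer; the worker counts and optimizer are assumptions, not the actual configuration.

import tensorflow as tf

num_workers = 10        # illustrative: total number of workers
num_backup_workers = 2  # illustrative: slowest workers ignored each step

opt = tf.train.AdagradOptimizer(0.01)
# Aggregate gradients from only the fastest (N - B) workers per step,
# so stragglers do not stall synchronous updates.
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=num_workers - num_backup_workers,
    total_num_replicas=num_workers)
# train_op = opt.minimize(loss, global_step=tf.train.get_global_step())
# hooks = [opt.make_session_run_hook(is_chief)]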

For distributed PS parameter placement, GreedyLoadBalancing can be used to assign parameters according to their estimated sizes instead of round-robin placement, which balances the load across all parameter servers.
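A minimal sketch of this placement strategy (TF 1.x, assuming tf.contrib is available) might look like the following; num_ps is an illustrative placeholder for the number of parameter servers.

import tensorflow as tf

num_ps = 4  # illustrative number of parameter servers

# Assign each variable to the least-loaded PS by estimated byte size,
# instead of the default round-robin placement.
greedy = tf.contrib.training.GreedyLoadBalancingStrategy(
    num_tasks=num_ps,
    load_fn=tf.contrib.training.byte_size_load_fn)

with tf.device(tf.train.replica_device_setter(ps_tasks=num_ps,
                                              ps_strategy=greedy)):
    pass  # build the model graph here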

For compute devices, we found that training on CPU only was actually faster than using GPUs. Although GPU compute itself may be faster, it adds the cost of transferring data between CPU and GPU; when the model computation is not too complex, CPU-only training works better.

Meanwhile, the high-level Estimator API is used to encapsulate data reading, distributed training, model validation, and model export for TensorFlow Serving (a minimal sketch follows the list below). The main benefits of using Estimator are:

  1. Single-machine training and distributed training can be switched easily, and little code needs to change when using different devices: CPU, GPU, or TPU.
  2. The Estimator framework is clear enough to make communication between developers easier.
  3. Beginners can also directly use pre-made Estimators: DNN models, boosted-tree models, linear models, etc.
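As a concrete illustration of this setup, here is a minimal Wide & Deep sketch with the canned tf.estimator.DNNLinearCombinedClassifier; the column lists, files, and directories are placeholders, and the hidden-unit sizes simply mirror the structure described in Section 2.2.

import tensorflow as tf

# wide_columns / deep_columns would be built with tf.feature_column
# (see Section 3.2.2); here they are assumed to exist already.
estimator = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=job_dir,                      # placeholder output directory
    linear_feature_columns=wide_columns,    # ~80 Wide-side features
    dnn_feature_columns=deep_columns,       # ~60 Deep-side (embedding) features
    dnn_hidden_units=[256, 256, 256])       # three 256-unit layers

train_spec = tf.estimator.TrainSpec(input_fn=lambda: input_fn(train_files))
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: input_fn(eval_files))
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)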

III. TensorFlow Serving and Performance Optimization

3.1 Introduction to TensorFlow Serving

TensorFlow Serving is a high-performance open-source serving system for machine learning models. It deploys trained models online and accepts external calls through a gRPC interface. TensorFlow Serving supports model hot updates and automatic model version management, and is very flexible.

The diagram below shows the TensorFlow Serving architecture. The Client keeps sending requests to the Manager, which manages model updates according to the version policy and returns the result computed with the latest model to the Client.

The image is from the official documentation of TensorFlow Serving

Inside Meituan, the data platform provides a dedicated TensorFlow Serving service that runs distributed on the cluster via YARN, periodically scans the HDFS path to check for new model versions, and updates them automatically. Of course, TensorFlow Serving can also be installed on any local machine for experimentation.

In our ad fine-ranking scenario, for each incoming user the online request side assembles the user's features and the features of the roughly 100 recalled ads into the model's input format and sends them to TensorFlow Serving as one batch. After receiving the request, TensorFlow Serving computes the CTR estimates and returns them to the requester.
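For illustration, a minimal Python client sketch (assuming the tensorflow-serving-api package, and that the exported model takes a batch of serialized tf.Example protos under an input named 'examples'; the host, model name, and output name are assumptions) could look like this:

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('serving-host:8500')     # placeholder address
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'wdl'                          # placeholder model name
request.model_spec.signature_name = 'serving_default'
# serialized_examples: one serialized tf.Example per recalled ad (~100 per user)
request.inputs['examples'].CopyFrom(
    tf.make_tensor_proto(serialized_examples, dtype=tf.string))

response = stub.Predict(request, 0.01)  # 10ms deadline, matching the latency budget
scores = tf.make_ndarray(response.outputs['probabilities'])  # assumed output name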

When the first version of TensorFlow Serving was deployed, at a QPS of about 200, packing the request took 5ms, the network added a fixed 3ms, and the model inference alone took 10ms. The TP50 latency of the whole process was about 18ms, which completely failed to meet the online requirement. Next, we describe our performance optimization process in detail.

3.2 Performance Optimization

3.2.1 Requester-side Optimization

The main requester-side optimization is to process the hundred ads in parallel. We use OpenMP multithreading to process the data in parallel, reducing the request-packing time from 5ms to about 2ms.

#pragma omp parallel for
for (int i = 0; i < request->ad_feat_size(); ++i) {
    // Build one tensorflow::Example per candidate ad in parallel
    tensorflow::Example example;
    data_processing();  // feature packing for the i-th ad
}

3.2.2 Model-Building OPS Optimization

Before this optimization, the model input was raw, unprocessed data. For example, the channel feature might be strings such as 'channel 1' and 'channel 2', which were then One-Hot encoded inside the model.

The initial model used many high-level tf.feature_column APIs to turn the data into One-Hot and embedding formats. The advantage of tf.feature_column is that no preprocessing of the raw data is needed at input time; many common feature transformations can be done inside the model with the feature_column API, for example tf.feature_column.bucketized_column for bucketing and tf.feature_column.crossed_column for crossing categorical features. But this pushes the burden of feature processing into the model itself.
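For example, a couple of these transformations (with hypothetical feature names and boundaries) look like this:

import tensorflow as tf

# Bucketize a numeric feature (names and boundaries are hypothetical)
price = tf.feature_column.numeric_column('price')
price_bucket = tf.feature_column.bucketized_column(
    price, boundaries=[10.0, 50.0, 100.0])

# Cross two categorical features
channel = tf.feature_column.categorical_column_with_vocabulary_list(
    'channel', ['channel 1', 'channel 2'])
channel_x_price = tf.feature_column.crossed_column(
    [channel, price_bucket], hash_bucket_size=1000)

# One-Hot / embedding representations built inside the model
wide_columns = [tf.feature_column.indicator_column(channel_x_price)]
deep_columns = [tf.feature_column.embedding_column(channel, dimension=8)]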

To analyze how much time feature_column processing costs, we used the tf.profiler tool to profile the entire offline training process. Using tf.profiler within the Estimator framework is very convenient and requires only one line of code.

with tf.contrib.tfprof.ProfileContext(job_dir + '/tmp/train_dir') as pctx:
    estimator = tf.estimator.Estimator(model_fn=get_model_fn(job_dir),
                                       config=run_config,
                                       params=hparams)

The chart below shows the time distribution of forward propagation obtained with tf.profiler. You can see that feature processing with the feature_column API takes a large share of the time.

Forward propagation accounted for 55.78% of the total training time, mostly spent in the feature_column OPS that preprocess the raw data

To solve the problem of time-consuming feature processing inside the model, we One-Hot map all string-format raw data in advance during offline data processing and store the mapping in a feature_index file used both online and offline. This effectively skips the One-Hot computation on the model side and replaces it with an O(1) dictionary lookup. Also, when building the model, lower-level APIs with predictable performance are used instead of high-level APIs such as feature_column. The chart below shows the proportion of forward-propagation time in the whole training process after this optimization; the share of forward propagation drops considerably.

After optimization, the profiler shows forward propagation accounting for 39.53% of the total training time
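A minimal sketch of the idea described above (names and sizes are illustrative): the offline job writes a feature_index file mapping every raw string value to an integer id, both online and offline code do an O(1) dictionary lookup, and the model consumes the ids directly through low-level ops.

import tensorflow as tf

# Mapping loaded from the shared feature_index file (contents illustrative)
feature_index = {'channel 1': 0, 'channel 2': 1}

def encode(raw_values):
    # Unknown values map to a reserved "out of vocabulary" id
    return [feature_index.get(v, len(feature_index)) for v in raw_values]

# Inside the model, low-level ops replace the feature_column API
channel_ids = tf.placeholder(tf.int64, [None], name='channel_ids')
channel_emb = tf.get_variable('channel_emb', [len(feature_index) + 1, 8])
channel_vec = tf.nn.embedding_lookup(channel_emb, channel_ids)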

3.2.3 XLA and JIT Compilation Optimization

TensorFlow expresses the whole computation as a directed dataflow graph: nodes represent OPS, data is carried by Tensors, and directed edges between nodes indicate the direction of data flow.

XLA (Accelerated Linear Algebra) is a compiler designed to optimize linear-algebra computation in TensorFlow. When JIT (Just-In-Time) compilation mode is turned on, the XLA compiler is used. The whole compilation process is as follows:

TensorFlow computation flow

First, the whole TensorFlow computation graph is optimized and redundant computation is pruned. The optimized graph is lowered to HLO (High Level Optimizer) primitive operations; the XLA compiler then optimizes these HLO operations and finally hands them to LLVM IR, which generates machine code for the different back-end devices.

JIT compilation helps LLVM IR generate more efficient machine code from the HLO operations, and fuses multiple fusible HLO primitives into more efficient compound operations. However, JIT compiles while the code is running, which adds some compilation overhead at run time.
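As an assumption of how this is typically switched on (not necessarily the exact configuration used here), JIT compilation can be enabled globally through the session config:

import tensorflow as tf

# Enable XLA JIT compilation for the whole session (TF 1.x)
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

# Plain session usage, or pass it to an Estimator via RunConfig
# sess = tf.Session(config=config)
run_config = tf.estimator.RunConfig(session_config=config)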

Influence of network structure and Batch Size on JIT performance [7]

The figure above compares the running time with and without JIT compilation for different network structures and batch sizes. The performance gain is obvious for large batch sizes, while changes in the number of layers and neurons have little influence on the JIT optimization.

In practice, the actual effect varies with the network structure, model parameters, hardware devices, and other factors.

3.2.4 Final Performance

After the above series of optimizations, the model's inference time dropped from the initial 10ms to 1.1ms, and the request-packing time from 5ms to 2ms. The whole process, from packing and sending the request to receiving the result, takes about 6ms.

Model-inference monitoring: QPS 1308, TP50 1.1ms, TP999 3.0ms. The four charts below show: the latency distribution (most requests complete within 1ms); the request volume (about 80,000 requests per minute, i.e. roughly 1308 QPS); the average latency (1.1ms); and the success rate (100%).

3.3 Latency Spikes During Model Switching

Monitoring shows a large number of request timeouts whenever the model is updated. As shown in the figure below, each update causes many requests to time out, which greatly affects the service. Analysis of the TensorFlow Serving logs and source code reveals two main causes. First, the threads that update and load the model share a thread pool with the threads that serve requests, so requests cannot be processed while the model is being switched. Second, the computation graph uses lazy initialization after the model is loaded, so the first request has to wait for the graph to be initialized.

Requests time out during model switching

Problem 1 is mainly caused by the thread-pool configuration for loading and unloading models. In the source code:

uint32 num_load_threads = 0;
uint32 num_unload_threads = 0;

These two parameters default to 0, meaning no separate thread pool is used and loading runs in the same thread as the Serving Manager. Changing them to 1 solves this problem.

The core of model loading is the RestoreOp, which reads the model file from storage, allocates memory, and looks up the corresponding variables; it is executed by calling the Session's run method. By default, all Session operations within a process share the same thread pool, so during model loading the load operations and the request-serving operations compete for the same pool, delaying the serving requests. The solution is to configure multiple thread pools via the configuration file and make the load operations use a separate pool.
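As a rough sketch of the mechanism (the pool sizes and indices are assumptions, not the actual configuration), TF 1.x lets a process define several inter-op thread pools in the session config and pick one per Session.run call:

import tensorflow as tf

config = tf.ConfigProto()
# Pool 0: serves prediction requests
serving_pool = config.session_inter_op_thread_pool.add()
serving_pool.num_threads = 8
# Pool 1: reserved for expensive operations such as the RestoreOp at load time
load_pool = config.session_inter_op_thread_pool.add()
load_pool.num_threads = 1

# At model-load time, run the restore on the dedicated pool
load_options = tf.RunOptions(inter_op_thread_pool=1)
# sess.run(restore_op, options=load_options)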

For problem 2, where the model's first run takes a long time, we run a warm-up right after the model is loaded, so the slow first run does not affect request latency. The warm-up works by reading the input types from the Signature defined when the model was exported, constructing dummy input data accordingly, and running the model once to initialize it.
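A rough sketch of such a warm-up (the signature handling and dummy values are illustrative assumptions) could be:

import tensorflow as tf
from tensorflow_serving.apis import predict_pb2

def build_warmup_request(model_name, signature_def):
    # Construct dummy inputs from the exported Signature so the first
    # real request does not pay the lazy graph-initialization cost.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = 'serving_default'
    for name, info in signature_def.inputs.items():
        dtype = tf.as_dtype(info.dtype)
        dummy = [b''] if dtype == tf.string else [0]
        request.inputs[name].CopyFrom(
            tf.make_tensor_proto(dummy, dtype=dtype))
    return request

# The request is sent once through the normal Predict path right after
# the new model version is loaded.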

With these two optimizations, the request-latency problem after model switching is solved well. As shown in the figure below, the latency spike during model switching drops from the original 84ms to about 4ms.

The latency spike during model switching is reduced

IV. Summary and Outlook

This article introduced the User Growth group's exploration of online prediction with TensorFlow Serving and how we located, analyzed, and solved the performance problems, arriving at an online service with high performance and strong stability that supports various deep learning models.

With a complete offline-training and online-prediction framework in place, we will speed up strategy iteration. On the model side, we can quickly try new models and explore combining reinforcement learning with bidding; on the performance side, driven by engineering needs, we will further explore TensorFlow's graph optimization, low-level operators, and operator fusion. Besides that, TensorFlow's support for model analysis is useful: Google has released the What-If Tool to help model developers analyze models in depth. Finally, we will combine model analysis with a re-examination of our data and features.

References

[1] Cheng, H. T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., … & Anil, R. (2016). Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (pp. 7-10). ACM.
[2] Wang, R., Fu, B., Fu, G., & Wang, M. (2017). Deep & cross network for ad click predictions. In Proceedings of ADKDD'17 (p. 12). ACM.
[3] Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
[4] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining explicit and implicit feature interactions for recommender systems. arXiv preprint arXiv:1803.05170.
[5] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … & Kudlur, M. (2016). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).
[6] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., … & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
[7] Neill, R., Drebes, A., & Pop, A. (2018). Performance analysis of just-in-time compilation for training TensorFlow multi-layer perceptrons.

About the Authors

Zhong Da graduated from the University of Rochester in 2017 with a degree in data science and then worked at Stentor Technology in the California Bay Area. He joined Meituan in 2018 and is mainly responsible for landing deep learning and reinforcement learning in the user growth team's business scenarios.

Hongjie joined Meituan-Dianping in 2015 and is the algorithm lead of the user growth group in the Meituan Platform and Hotel & Travel business group. He previously worked at Alibaba. He focuses on using machine learning to increase the number of active users on the Meituan-Dianping platform; as technical lead, he has driven algorithm projects such as Meituan DSP advertising and on-site user acquisition, improving marketing efficiency and reducing marketing costs.

Tingwen joined Meituan-Dianping in 2015 and works on YARN resource scheduling and GPU computing platform construction in Meituan-Dianping's offline computing direction.

Recruitment

Meituan DSP is the core business direction of Meituan's online digital marketing. Join us and you can personally participate in building and optimizing a marketing platform that reaches hundreds of millions of users and guides their life and entertainment decisions. At the same time, you will face the challenges of accurate, efficient, and low-cost marketing, and get access to cutting-edge AI algorithm systems and big-data solutions in the field of computational advertising. You will work with Meituan's marketing and technical teams to build the traffic-operation ecosystem and support the continued rapid growth of the hotel, takeout, in-store, taxi, finance, and other businesses. We sincerely invite you, with passion, ideas, experience, and ability, to join us! You will participate in building Meituan-Dianping's off-site advertising system, optimize online advertising algorithms based on large-scale user-behavior data, improve DAU and ROI, and improve the relevance and effect of online ads. Please email wuhongjie#meituan.com for inquiries.