GPUs and other dedicated chips provide massive computing power at low cost and have become a core weapon of machine learning, playing an increasingly important role in the era of artificial intelligence. How to put GPUs to work for a business is a problem many engineers have to face. This article shares the design and implementation of a GPU-based architecture for model inference in Meituan's takeout search/recommendation business, in the hope of being useful or inspiring to readers working on related applications.

1 Introduction

In recent years, with the rapid development of machine learning, a series of dedicated chips represented by the GPU have been widely recognized and adopted in the field thanks to their superior computing capability and steadily falling cost. Their continuing integration with traditional CPU systems has formed a new heterogeneous hardware ecosystem.

In this wave of technology, many engineers face the same questions: what can we gain by applying GPU hardware to our business? How do we switch quickly and smoothly from a traditional CPU system? What impact and changes will it bring to machine learning algorithm design? And among the many technical routes and architecture choices in the GPU ecosystem, how do we find the path best suited to our own scenario?

The Meituan takeout search and recommendation team faces the same challenges. In this article we share the design and implementation of our GPU-based architecture for model inference in the takeout search/recommendation business, disclosing technical details and test data in the hope of providing useful references for our technical peers.

2 Background

At present, Meituan takeout distributes traffic mainly through search and recommendation to meet users' demand for having everything delivered to their door. In addition to the search and recommendation functions on the home page, key categories have their own independent entrances on the home page (hereinafter the "Kingkong" entrances); each Kingkong entrance has search and recommendation areas similar to those on the home page, and the different scene entrances jointly serve the final takeout order. The linkage between the home page, the Kingkong entrances, and the store pages is shown below:

Deep learning for CTR/CVR prediction is the core technology of every e-commerce search/recommendation product: it directly determines user experience and conversion, and it is also a "major consumer" of machine resources. Designing and practicing the CTR/CVR ranking model is therefore a field the Meituan takeout search and recommendation (hereinafter "search & rec") team must master and keep pushing toward excellence.

From the system-design point of view, different search and recommendation entrances naturally form independent call links. Under the traditional design approach, the CTR/CVR/PRICE targets of different entrances and funnel stages were modeled independently, which was also how Meituan takeout designed its models in the past. Starting in 2021, with global optimization across scenarios in mind, the CTR/CVR models of the search and recommendation scenarios have gradually moved toward unification: by making full use of data from multiple entrances and combining the business characteristics of each, they enable joint optimization across entrances and move step by step toward the goal of "One Model to Serve All".

From the model-computation point of view, the evolution of the takeout ranking model has greatly expanded the amount of computation in the dense network. A software and hardware architecture with the CPU as the main computing force can no longer keep up with the algorithm's demands: even with sharply rising costs, the computing-power ceiling remains within sight. The advantages of GPU hardware for dense computation match the characteristics of the new model exactly, and can fundamentally break through the computing-power bottleneck in ranking-model prediction and training. Therefore, starting in 2021, the deep learning systems of Meituan takeout search and recommendation have gradually shifted from a pure CPU architecture to a CPU+GPU heterogeneous computing platform, to meet the computing-power requirements of the evolving model algorithms.

The rest of this article starts from the design of the ranking model in the takeout search and recommendation scenarios and, combined with Meituan's actual software and hardware conditions, shares in detail our exploration and practice in moving ranking-model prediction from a pure CPU architecture to a CPU+GPU heterogeneous platform, for the reference of technical peers.

3 The Ranking Model in the Takeout Search and Recommendation Scenario

This chapter introduces the evolution, characteristics, and practical challenges of multi-model unification in the takeout scenario. It gives only a brief description of the model design ideas, as a lead-in to the practical considerations of running the model's computation on GPUs.

3.1 Design Ideas of the Ranking Model

As mentioned above, under the multi-entrance linkage that characterizes Meituan takeout, the classic single-model design has the following limitations:

  1. The home-page recommendation and each Kingkong-entrance recommendation maintain their own ranking models. This not only raises maintenance cost but also fragments the training data, so no single ranking model can capture the user's interests across all recommendation scenes.
  2. The ranking model of a recommendation scenario uses only that scenario's training samples and ignores the user's behavior in other important entrances such as search and the order page, so the model learns only the user's preferences in a local scenario.
  3. Position bias exists in the training samples of recommendation scenes: a user may click a merchant simply because it ranks high in the recommendation feed, not because the user is genuinely interested in it. Such bias leads to biased model training.
  4. A Bayesian constraint exists among the multiple targets but is not reflected in the network structure: since CXR = CTR × CVR, the CXR estimate should be smaller than the CTR estimate. On the model's validation set, however, CXR was sometimes higher than CTR, indicating inaccurate predictions.

Based on this, in 2021 Meituan takeout proposed the idea of "One Model to Serve All", which is embodied in the model design as follows:

  1. CTR/CXR multi-objective fusion unifies the multi-objective prediction models.
  2. The fusion of scene expert networks and an Attention network achieves model generalization and unification across different traffic entrances.
  3. The integration of domain-specific networks with a shared network enables transfer learning from the recommendation scenario to the search scenario.

As the takeout ranking model evolved, the number of parameters in the dense network grew significantly, with a single inference reaching 26M FLOPs, which put great pressure on the CPU-based computing architecture. On the other hand, FP16 compression, automatic feature selection, and learned feature crosses replacing manual cross features shrank the model from 100 GB to under 10 GB, and model quality was preserved through optimization along the way.

To sum up, the dense part of the takeout ranking model is complex while the volume of the sparse part is well controlled. These characteristics provide a suitable algorithmic basis for running inference on GPU hardware. Next we discuss how to use GPUs to effectively solve the cost and performance problems of online ranking-model prediction in the high-throughput, low-latency takeout search and recommendation system, and present our practice and results.

3.2 Characteristics and challenges of model application

In the search/recommendation field, sparse-model prediction (CTR/CVR) is the core element that determines algorithm quality. The model prediction service is an indispensable part of a search and recommendation system, and there are many classic implementation schemes in the industry. Before getting into the actual practice, let us describe the characteristics of our scenario:

① Requirement level

  • Model structure: as described above, the dense network of the ranking model in the takeout scenario is relatively complex, at 26M FLOPs per inference, while the sparse part has been heavily optimized so that its volume is well controlled and the model size stays within 10 GB.
  • Service quality requirements: recommendation is a classic high-performance consumer-facing scenario. The end-to-end timeout of most comparable systems in the industry is controlled at the 100-millisecond level, which leaves a budget at the 10-millisecond level for the prediction service.

② Software framework level

  • Development framework: models are developed with TensorFlow [1]. As a mainstream deep-learning framework, TensorFlow has strong expressive power, but this also leads to fine-grained operators, which bring considerable extra overhead on both CPU and GPU architectures.
  • Online serving framework: we use TensorFlow Serving [2]. With this framework, a model trained offline can be deployed online and serve real-time predictions over RPC. TensorFlow Serving supports model hot updates and version management, and offers both flexibility and good performance. A minimal client sketch is shown below.
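
For concreteness, here is a hypothetical gRPC client sketch for such a TensorFlow Serving deployment; the model name, signature, tensor names, and port are illustrative assumptions rather than our actual configuration.

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to a TensorFlow Serving instance exposing the gRPC prediction API.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "ctr_ranking"                   # assumed model name
request.model_spec.signature_name = "serving_default"
request.inputs["features"].CopyFrom(                      # assumed input tensor name
    tf.make_tensor_proto([[0.1] * 128], dtype=tf.float32))

response = stub.Predict(request, timeout=0.02)            # ~20 ms budget, cf. section 4
print(response.outputs["scores"].float_val)               # assumed output tensor name
```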

③ Hardware level

  • Machine type: to improve computing-power density, Meituan uses GPU BOX servers for the prediction service. Compared with traditional GPU servers, each GPU card in this machine type comes with relatively limited CPU cores and memory, so when designing the online service we must carefully consider how CPU and GPU computation and data are distributed in order to achieve a good utilization balance.
  • Inherent GPU properties: GPU execution can be roughly divided into stages such as data transfer, kernel launch, and kernel computation [3], where each kernel launch takes roughly 10 µs. GPU prediction therefore faces a common problem: a large number of small operators means each kernel runs only briefly and launch time dominates, while adjacent kernels must exchange data through GPU memory, incurring heavy memory-access overhead. Since memory-access throughput is far lower than compute throughput, performance and GPU utilization end up low.

In conclusion, compared with other mainstream search and recommendation scenarios in the industry, our CTR model has two distinct features:

  • The dense network has high computational complexity, while the sparse network, after extensive optimization in the model design, is relatively small.
  • With GPU BOX servers, the CPU quota per GPU card is limited, so the CPU computing load must be optimized accordingly.

Based on these two characteristics, our GPU-oriented optimization work can be more targeted.

4 Model Service Architecture Overview

This chapter briefly introduces the overall architecture and role division of Meituan takeout's online prediction service for search and recommendation, which forms the engineering basis for putting the takeout ranking model onto GPUs.

Key System Roles

  • Dispatch: handles feature retrieval and feature computation. As mentioned above, Meituan builds the prediction service on GPU BOX servers, where CPU resources for inference are very tight, so it is natural to deploy online feature engineering separately to avoid competing for CPU. This part has little to do with the GPU work and is not the focus of this article.
  • Engine: performs online model inference, taking the feature matrix as input and returning predictions over RPC. It runs on GPU BOX servers (8 CPU cores + 1 NVIDIA Tesla T4 per container), and its average response time must stay within 20 ms. The GPU optimizations described below mainly target this module.
  • Booster: a model optimizer that runs offline when the model is updated. In the form of optimizer plug-ins, it combines manual optimization plug-ins with DL-compiler optimization plug-ins, and is the executor of the GPU optimizations described below.

5 GPU Optimization Practice

This chapter walks through the optimization process of implementing ranking-model inference on the GPU architecture.

Unlike CV, NLP, and other classic deep-learning fields, sparse models such as CTR models have highly variable, business-specific structures that have not converged, so it is difficult for hardware vendors to provide end-to-end optimization tools for them. In large-scale CTR applications, optimization is therefore generally done case by case, based on the application scenario and the characteristics of the GPU. By objective, these optimizations can be roughly divided into system optimization and computational optimization:

① System optimization: generally refers to scheduling computation, storage, and transfer so that the CPU+GPU heterogeneous hardware system works together more efficiently. Typical system optimizations include:

  • Device placement
  • Operator fusion
  • GPU concurrency/pipelining optimization

② Computational optimization: generally refers to optimizing the structure of the model's forward-inference network and the execution logic of its operators for the hardware characteristics, so that inference on the GPU costs less and runs more efficiently. Typical computational optimizations include:

  • Removal of redundant computation
  • Quantized (low-precision) computation
  • Use of high-performance libraries

In the optimization work introduced here, we explored and practiced most of the common ideas above. The following sections elaborate on them one by one and give the optimization results, together with a summary and analysis oriented to our actual scenario.

5.1 System Optimization

5.1.1 Device Placement

By default, TensorFlow automatically assigns a runtime device to each node in the graph, placing heavy computation on the GPU and light computation on the CPU. In a complex computation graph, a full inference pass then shuttles data back and forth between CPU and GPU, and the H2D/D2H transfers are so heavy that transfer time far exceeds operator (OP) compute time: with this default placement, prediction latency on the GPU was on the order of seconds, far worse than the CPU-only version. In addition, as mentioned earlier, CPU resources on the GPU servers we used were limited (only an 8-core CPU per T4 card), which was the core technical challenge to solve in the heterogeneous architecture design.

To address TensorFlow's automatic device assignment, we manually set the runtime device of each node in the computation graph. Given the limited CPU resources, we place the heavier subgraphs (the Attention subgraph and the MLP subgraph) on the GPU and the lighter subgraphs (mainly the embedding-lookup subgraph) on the CPU. A minimal sketch of this manual placement is shown below.
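
This is a hypothetical sketch of rewriting a frozen graph's device fields; the node-name prefixes and file paths are illustrative assumptions, not our real graph.

```python
import tensorflow as tf  # TF 1.x graph-mode style, matching the article's setting

# Pin the heavy dense subgraphs to the GPU and keep the light embedding-lookup
# subgraph on the CPU. Prefixes like "embedding/" and "attention/" are illustrative.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("model/frozen_graph.pb", "rb") as f:   # assumed path
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.name.startswith(("embedding/", "lookup/")):       # light sparse part
        node.device = "/device:CPU:0"
    elif node.name.startswith(("attention/", "mlp/")):        # heavy dense part
        node.device = "/device:GPU:0"

with tf.io.gfile.GFile("model/placed_graph.pb", "wb") as f:
    f.write(graph_def.SerializeToString())
```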

To further reduce data transfer between devices, Concat and Split ops were inserted at the CPU/GPU boundary: CPU-side data is first concatenated, transferred to the GPU in one piece, and then split as needed and routed to the corresponding ops, reducing the number of H2D/D2H transfers from thousands to just a few. As shown in the figure below, there was a large amount of H2D traffic before the placement optimization; after the optimization only 3 H2D transfers remain, a very noticeable improvement. A simplified sketch of this bridging pattern follows.
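
A hypothetical sketch of the concat-transfer-split bridge; the embedding dimensions and names are illustrative assumptions.

```python
import tensorflow as tf

def bridge_to_gpu(embeddings, split_sizes):
    """Pack CPU-side embeddings, cross the device boundary once, unpack on GPU."""
    with tf.device("/device:CPU:0"):
        packed = tf.concat(embeddings, axis=1)          # one [batch, sum(dims)] tensor
    with tf.device("/device:GPU:0"):
        # A single H2D copy happens at this device boundary instead of one per embedding.
        slices = tf.split(packed, split_sizes, axis=1)  # recover per-feature embeddings
    return slices

# Usage: embs = [emb_user, emb_item, emb_context]; dims = [64, 32, 16]
# gpu_slices = bridge_to_gpu(embs, dims)
```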

5.1.2 All On GPU

After the basic device-placement optimization, the light sparse lookups ran on the CPU and the dense computation ran on the GPU. Although the CPU work was light, load testing showed it was still the overall throughput bottleneck. Given the small size of the whole graph (about 2 GB), we naturally asked whether the entire graph could run on the GPU, bypassing the CPU quota limitation altogether, i.e. "All On GPU". To move the sparse lookups from the CPU to the GPU, we added a GPU implementation of the LookupTable op. As shown in the figure below, the hash table is placed in GPU global memory with its keys and values stored in buckets, and multiple thread blocks query multiple groups of input keys in parallel.
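
To make the lookup structure concrete, here is a small host-side Python sketch of a bucketed hash table of the kind described above; the real implementation is a CUDA kernel in which thread blocks scan buckets for groups of keys in parallel, and the bucket count, bucket size, and hashing below are illustrative assumptions.

```python
import numpy as np

BUCKETS, BUCKET_SIZE, EMB_DIM = 1 << 16, 8, 32   # illustrative sizes

# Each bucket holds up to BUCKET_SIZE (key, value) slots; key 0 marks an empty slot.
keys   = np.zeros((BUCKETS, BUCKET_SIZE), dtype=np.int64)
values = np.zeros((BUCKETS, BUCKET_SIZE, EMB_DIM), dtype=np.float32)

def insert(key, value):
    b = hash(key) % BUCKETS
    slot = np.argmax(keys[b] == 0)      # first empty slot (overflow handling omitted)
    keys[b, slot], values[b, slot] = key, value

def lookup(query_keys):
    # On the GPU, each thread block would scan one bucket for a group of queries;
    # here the bucket is scanned serially for clarity.
    out = np.zeros((len(query_keys), EMB_DIM), dtype=np.float32)
    for i, key in enumerate(query_keys):
        b = hash(key) % BUCKETS
        hit = np.where(keys[b] == key)[0]
        if hit.size:
            out[i] = values[b, hit[0]]  # missing keys return zeros
    return out
```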

At the same time, to improve GPU utilization and reduce kernel-launch overhead, TVM was used to compile and optimize the computation graph (detailed below). The optimized All On GPU model removed the bottleneck caused by limited CPU resources, and overall throughput improved significantly (QPS from 55 to 220, roughly 4×).

5.1.3 Operator Fusion

The takeout ranking model is very complex, and its computation graph contains tens of thousands of nodes. When executing on GPUs, every node incurs kernel-launch overhead, and adjacent nodes must exchange data through GPU memory. The TensorFlow framework itself also adds per-node overhead: input/output tensors are created and destroyed on every node execution, and memory management brings extra cost. Too many nodes in the graph therefore seriously hurt execution efficiency. The common remedy is operator fusion: merging multiple nodes into one while keeping the computation equivalent, which reduces the node count, turns inter-node memory accesses into register accesses, and also cuts the fixed per-node cost.

Operator fusion is mainly carried out in three ways:

  • Manual fusion of specific operators. For example, in the training graph several nodes may query the same embedding table; for online prediction they can be merged into a single node, so that lookup nodes and embedding tables correspond one to one. Operators can then be fused further, with one node responsible for querying multiple embedding tables.
  • General-purpose automatic operator fusion, mainly performed with the TensorFlow Grappler [4] optimizer (see the configuration sketch after this list).
  • Automatic fusion using deep learning compilers is described in more detail below.
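
As a small illustration of the second approach, Grappler's built-in graph rewrites can be toggled through the session config (TF 1.x style). Which passes to enable is a per-model judgment call, so treat the flags below as illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

rewrite_options = rewriter_config_pb2.RewriterConfig(
    # Typical Grappler passes that merge or simplify nodes; the choice is model-specific.
    arithmetic_optimization=rewriter_config_pb2.RewriterConfig.ON,
    constant_folding=rewriter_config_pb2.RewriterConfig.ON,
    remapping=rewriter_config_pb2.RewriterConfig.ON,          # fuses e.g. MatMul+BiasAdd+Relu
    layout_optimizer=rewriter_config_pb2.RewriterConfig.ON,
)
config = tf.compat.v1.ConfigProto(
    graph_options=tf.compat.v1.GraphOptions(rewrite_options=rewrite_options)
)

with tf.compat.v1.Session(config=config) as sess:
    # Load the frozen graph and run inference as usual; Grappler rewrites it on first run.
    pass
```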

5.2 Computational Optimization

5.2.1 Low-precision optimization of FP16

On the one hand, in the CPU architecture the embedding table is already compressed and stored as FP16 [5] to reduce memory overhead, but it is expanded back to FP32 for computation, which introduces conversion overhead. On the other hand, model prediction only runs the forward pass of the graph, where the error introduced by low-precision computation is small. Low-precision model prediction is therefore widely used in the industry.

For our business scenario we tried FP16, INT8 [6], and other low-precision computation. FP16 half precision had no significant impact on model quality, while INT8 quantization caused a noticeable drop. We therefore adopted FP16 half-precision computation, further improving prediction throughput without affecting the model's quality.
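
As a hypothetical sketch of this idea, an FP16-stored embedding can be fed into an FP16 matmul directly instead of being expanded to FP32, and the numeric error can be checked against an FP32 reference; the shapes and error check below are illustrative.

```python
import numpy as np
import tensorflow as tf

emb_fp16 = tf.constant(np.random.randn(256, 128), dtype=tf.float16)   # FP16-stored embedding
w_fp32   = tf.constant(np.random.randn(128, 64), dtype=tf.float32)    # dense-layer weights

out_fp16 = tf.matmul(emb_fp16, tf.cast(w_fp32, tf.float16))           # FP16 GEMM (Tensor Core eligible)
out_fp32 = tf.matmul(tf.cast(emb_fp16, tf.float32), w_fp32)           # FP32 reference

# Relative error of the half-precision forward pass against the FP32 baseline.
rel_err = tf.reduce_max(tf.abs(tf.cast(out_fp16, tf.float32) - out_fp32) /
                        (tf.abs(out_fp32) + 1e-6))
print("max relative error:", float(rel_err))
```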

5.2.2 Broadcast optimization

The data in the model graph fall into two categories, user and item, and a request typically contains one user and many items. The sparse part of the model fetches user embeddings and item embeddings respectively; the dense part then combines the two into one matrix for computation. On further analysis we found redundant lookups and computation in the graph (the orange parts in the figure): in the sparse part, user features are broadcast to batch size before the embedding lookup, so the same user embedding is queried batch-size times; in the dense part, user features are likewise broadcast to batch size before all subsequent computation, even though the broadcast is not actually needed until the user-item cross, so the computation before the cross is redundant as well.

To address this, we manually optimized the model graph (the purple parts in the figure): in the sparse part, the user embedding is looked up only once; in the dense part, user features are broadcast to batch size only at the point where user and item features are crossed. In other words, the broadcast of user information is postponed as a whole.
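
A hypothetical sketch of this "postponed broadcast": the user tower is computed once with batch size 1, and the result is tiled to the item batch size only at the cross. The layer sizes and the concat-style cross are illustrative assumptions.

```python
import tensorflow as tf

def score(user_feat, item_feats):            # user_feat: [1, Du], item_feats: [B, Di]
    user_hidden = tf.keras.layers.Dense(64, activation="relu")(user_feat)   # runs once, not B times
    item_hidden = tf.keras.layers.Dense(64, activation="relu")(item_feats)

    batch = tf.shape(item_hidden)[0]
    user_tiled = tf.tile(user_hidden, [batch, 1])        # broadcast only at the cross point
    cross = tf.concat([user_tiled, item_hidden], axis=1)
    return tf.keras.layers.Dense(1, activation="sigmoid")(cross)

# scores = score(tf.ones([1, 32]), tf.ones([100, 48]))   # 1 user, 100 candidate items
```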

5.2.3 High-performance Library Applications

On the CPU, the Intel MKL [7] library can be used to speed up computation, but the gains are limited by the CPU hardware itself. On the GPU, we can use Tensor Cores [8] for acceleration. Each Tensor Core is a matrix-multiply unit; the NVIDIA T4 card has 320 Tensor Cores, delivering 65 TFLOPS in mixed precision versus 8.1 TFLOPS in single precision. In TensorFlow, cuBLAS [9] can be used to invoke Tensor Cores to accelerate GEMM, and cuDNN [10] to accelerate CNN and RNN computation.
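
As a rough illustration, an FP16 matmul in TensorFlow is dispatched to cuBLAS and is eligible for Tensor Core execution on the T4; timing it against FP32 gives a feel for the gap. The sizes and timing loop below are illustrative, and the actual ratio depends on the card, driver, and shapes.

```python
import time
import tensorflow as tf

def bench(dtype, n=4096, iters=20):
    a = tf.random.normal([n, n], dtype=dtype)
    b = tf.random.normal([n, n], dtype=dtype)
    tf.matmul(a, b)                      # warm-up (kernel selection, autotune)
    start = time.time()
    for _ in range(iters):
        c = tf.matmul(a, b)              # cuBLAS GEMM; FP16 can use Tensor Cores
    _ = c.numpy()                        # force device sync before stopping the clock
    return (time.time() - start) / iters

print("fp32:", bench(tf.float32), "s/iter")
print("fp16:", bench(tf.float16), "s/iter")
```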

5.3 Automatic optimization based on DL compiler

As deep-learning networks grow wider and deeper and hardware devices grow more diverse (CPU, GPU, NPU), optimizing neural networks becomes increasingly difficult. Optimization on a single piece of hardware with a single framework is constrained by the vendor's optimization library and is hard to push further, while optimizations made for one hardware/framework combination rarely generalize or port to another. The result is a great deal of manual tuning and high cost.

To reduce the cost of manual optimization, deep learning compilers are widely used in the industry to tune computation graphs automatically. Popular deep learning compilers include TensorRT [11], TVM [12], and XLA [13]. For our current model we made a number of optimization attempts with deep learning compilers, described in detail below.

5.3.1 Attempts based on TensorRT

TensorRT is NVIDIA's high-performance inference optimization framework for deep learning. It supports automatic operator fusion, quantized computation, multi-stream execution, and other optimizations, and can select the best implementation for each kernel. Every optimization is controlled by a corresponding switch, making it very easy to use. However, TensorRT is closed source and supports a limited set of operators, so only part of the computation graph can be optimized: operators it does not recognize are simply skipped, which greatly reduces optimization coverage. After TensorRT optimization our graph still contained a large number of ops, and the overall performance gain was limited. We tackled this from the following two angles.
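
For reference, a minimal sketch of a TF-TRT conversion with FP16 precision (TF 2.x API); our production pipeline operates on a TF 1.x frozen graph, and the paths and parameters below are illustrative assumptions.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.FP16,   # FP16, as discussed in section 5.2.1
    minimum_segment_size=3,                     # skip building engines for tiny clusters
    maximum_cached_engines=1,
)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model/ctr",    # assumed export path
    conversion_params=params,
)
converter.convert()                             # recognized subgraphs become TRTEngineOp nodes
converter.save("saved_model/ctr_trt")
```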

① Manual subgraph partitioning

When TensorRT optimizes a graph, it first uses a union-find algorithm to find the recognizable ops in the whole graph and cluster them; each cluster is then compiled and optimized in detail and a corresponding TRTEngineOp is generated. Because the graph contains many unrecognized ops, the clustering process is disrupted: even ops that can be recognized may fail to be clustered and thus never get compiled and optimized, leading to low optimization coverage. To solve this, we cut the graph manually before optimization, dividing the full graph into several subgraphs, placing each recognized op into the appropriate subgraph, and handing the subgraphs to TensorRT for optimization. This effectively solves the problem of ops being left unoptimized and substantially reduces the number of ops in the whole graph.

② Operator substitution

As mentioned above, TensorRT supports a limited set of op types, and the graph contains many ops TensorRT cannot recognize, leading to low optimization coverage. To alleviate this, we try to replace ops that TensorRT does not recognize with equivalent ops that it does support. For example, in the figure below TensorRT cannot recognize the Select op, so we replace it with the supported Multiply op and remove the associated ExpandDims op from the graph. After a series of such equivalence transformations, the number of unrecognized ops is effectively reduced and compilation coverage is improved.
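
One common case where such a substitution is exact is when Select merely masks values with zeros using a 0/1 mask; a minimal sketch (shapes are illustrative):

```python
import tensorflow as tf

mask   = tf.constant([[1.0], [0.0], [1.0]])                     # 0/1 mask, shape [B, 1]
values = tf.constant([[0.3, 0.7], [0.5, 0.5], [0.9, 0.1]])

# Original pattern: Select (tf.where) keeps rows of `values` or zeros them out.
selected = tf.where(tf.cast(tf.broadcast_to(mask, tf.shape(values)), tf.bool),
                    values, tf.zeros_like(values))

# TensorRT-friendly substitute: multiply by the mask; broadcasting replaces ExpandDims/Select.
substituted = values * mask

tf.debugging.assert_near(selected, substituted)                  # identical for a 0/1 mask
```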

5.3.2 Attempts based on TVM

While working with TensorRT, we found that its operator coverage for TensorFlow is low (only about 50+ operators are covered), and more than ten operators in our computation graph were unsupported. Even after the laborious operator-substitution work, many operators remained hard to replace, so we considered other deep learning compilers for graph optimization.

TVM is an end-to-end automatic compilation framework for machine learning launched by Tianqi Chen's team and widely used in the industry. Compared with TensorRT, TVM is open source and is more extensible and customizable. In addition, TVM supports more than 130 TensorFlow operators, a much higher coverage than TensorRT. In our current computation graph, the only op TVM does not support is the custom LookupTable op responsible for embedding lookups, which is left out of compilation.
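
For reference, a minimal sketch of compiling a TensorFlow graph with TVM's Relay frontend and the CUDA backend; the paths, input names, and shapes are illustrative assumptions, and the production pipeline additionally wraps the compiled module in the custom TVMEngineOp described below.

```python
import tvm
from tvm import relay
import tensorflow as tf

# Load the frozen TF graph (assumed path) and import it into Relay.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("model/dense_subgraph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

shape_dict = {"dense_input": (100, 512)}                  # illustrative input name/shape
mod, params = relay.frontend.from_tensorflow(graph_def, shape=shape_dict)

# Compile for the GPU; opt_level=3 enables operator fusion among other passes.
target = tvm.target.cuda()
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The resulting lib can be run with tvm.contrib.graph_executor,
# or wrapped in a custom TensorFlow op (our TVMEngineOp) for serving.
```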

We therefore tried TVM instead of TensorRT to automatically compile and optimize the current graph. TensorFlow officially supports TensorRT and XLA and implements the corresponding wrapper ops, but it does not currently support TVM, so we adapted TensorFlow and implemented a TVMEngineOp in a manner similar to the TensorRT integration. Considering the model's characteristics, we place the computation-heavy Attention and MLP subgraphs inside TVMEngineOp and compile and optimize them with TVM, as shown below:

6 Performance and analysis

This chapter presents test data from the actual production environment and analyzes how the typical optimization ideas discussed above perform in our specific scenario, along with the reasons behind the results.

In the load-test environment, the CPU setup is a 32-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz with 32 GB of memory, and the GPU setup is an 8-core Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz with a Tesla T4 GPU and 16 GB of memory. In the figure, the left panel compares the ranking model's inference latency (Y-axis) at different QPS levels (X-axis) under the different optimization approaches, where Base-GPU denotes simple graph optimization running on the GPU, TRT denotes TensorRT optimization running on the GPU, and TVM denotes TVM optimization combined with All On GPU, running on the GPU. The right panel shows the CPU and GPU utilization of each approach at its maximum QPS. From the figure:

  • With CPU-only prediction, the maximum QPS is 55, at which point CPU utilization has already reached 76% and is the bottleneck.
  • With conventional manual optimization (device placement + operator fusion + broadcast optimization + high-performance libraries) on the GPU, latency drops greatly at the same QPS and the maximum QPS rises to 85 (55% higher than the CPU version). At maximum throughput, GPU utilization is still low and the bottleneck remains the CPU.
  • With TensorRT optimization (manual optimization + TensorRT + FP16), graph compilation reduces latency by a further ~40% at the same QPS. Maximum throughput is unchanged because the bottleneck is still the CPU.
  • With TVM optimization (manual optimization + TVM + FP16 + All On GPU), all ops run on the GPU and the CPU only handles basic RPC work, which largely removes the CPU-quota bottleneck. At the same QPS, latency drops significantly by about 70%, and maximum throughput rises significantly by about 120%. At maximum throughput, GPU utilization is high and becomes the bottleneck.

After this series of optimizations, overall throughput increased to roughly 4× the original (QPS from 55 to 220), a very significant improvement.

7 Summary

In summary, based on the business characteristics of the Meituan takeout scenario, we gradually evolved the classic CTR/CVR models from separate single models for each entrance, link, and objective into the unified "One Model to Serve All" form.

At the same time, building on Meituan's hardware conditions and infrastructure, we switched from a pure-CPU prediction architecture to a CPU+GPU heterogeneous architecture, effectively freeing up computing power at fixed cost and increasing computational throughput by nearly 4×. In view of the CPU-resource limitation of the GPU BOX servers, we combined manual optimization with DL-compiler optimization and moved the entire model network's computation onto the GPU (All On GPU), which effectively improved GPU utilization during model prediction. This article has also shared the optimization process and measured data from the GPU rollout in detail.

8 About the Authors

  • Yang Jie, Junwen, Ruidong, Feng Yu, Wang Chao, Zhang Peng, et al., from the Home Business Group / In-home R&D Platform / Search and Recommendation Technology Department.
  • Wang Xin, Chen Zhuo, 𫘝 Fei, et al., from the Basic R&D Platform / Data Science and Platform Department / Data Platform.

9 References

  • [1] www.usenix.org/system/file…
  • [2] www.tensorflow.org/tfx/guide/s…
  • [3] docs.nvidia.com/cuda/cuda-c…
  • [4] www.tensorflow.org/guide/graph…
  • [5] en.wikipedia.org/wiki/Half-p…
  • [6] www.nvidia.com/en-us/data-…
  • [7] www.intel.com/content/www…
  • [8] www.nvidia.com/en-us/data-…
  • [9] docs.nvidia.com/cuda/cublas…
  • [10] developer.nvidia.com/zh-cn/cudnn
  • [11] docs.nvidia.com/deeplearnin…
  • [12] tvm.apache.org/docs/
  • [13] www.tensorflow.org/xla


This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use it for non-commercial purposes such as sharing and communication, provided you credit "Content reprinted from the Meituan technical team". It may not be reproduced or used commercially without permission; for any commercial use, please email [email protected] to request authorization.