Meituan's machine learning platform has developed the Booster GPU training architecture on top of its internally deeply customized TensorFlow. The overall design of the architecture fully considers the characteristics of the algorithms, the architecture, and the new hardware, and carries out in-depth optimization from multiple perspectives such as data, computation and communication. In the end, the architecture reaches 2~4 times the cost performance of CPU tasks. This article mainly describes the design and implementation of the Booster architecture, its performance optimization, and its landing in business, hoping to help or inspire engineers engaged in related development.

1 Background

In the training scenario of recommendation systems, the deeply customized version of TensorFlow (TF) [1] supports a large number of Meituan's internal businesses through CPU computing power. However, with the development of these businesses, the sample size of a single model training run keeps growing and the model structure is becoming more and more complex. Taking the fine-ranking model of Meituan Waimai (food delivery) recommendation as an example, a single training run already uses tens of billions or even hundreds of billions of samples, one experiment requires thousands of CPU cores, and the CPU utilization of the optimized training task has exceeded 90%. To support rapid business development, the frequency and concurrency of model iteration experiments keep increasing, which further raises the demand for computing power. Under a limited budget, how to achieve high-speed model training with high cost performance, and thereby guarantee efficient model development and iteration, is an urgent problem we need to solve.

In recent years, the hardware capability of GPU servers has improved by leaps and bounds. A new-generation NVIDIA A100 80GB SXM GPU server (8 cards) [2] provides 640GB of GPU memory, 1~2TB of host memory and 10+TB of SSD; in terms of communication, the bidirectional bandwidth between any two cards reaches 600GB/s and multi-machine networking reaches 800~1000Gbps; in terms of compute, the GPUs deliver 1248 TFLOPS (TF32 Tensor Cores) and the CPUs provide 96~128 physical cores. If the training architecture can take full advantage of this new hardware, the cost of model training will be greatly reduced. However, the TensorFlow community has no efficient and mature solution for recommendation system training scenarios. We also tried an optimized TensorFlow CPU Parameter Server [3] (PS for short) + GPU Worker mode for training, but it only brought limited benefit for complex models. Although NVIDIA's open-source HugeCTR [4] performs well on classic deep learning models, it needs more work before it can be used directly in Meituan's production environment.

The training engine team of the machine learning platform in Meituan's Basic R&D Platform, together with the algorithm efficiency team of the Daojia (home delivery) Search & Recommendation Technology Department and the NVIDIA DevTech team, set up a joint project group. Based on Meituan's internally deeply customized TensorFlow and NVIDIA HugeCTR, the group developed Booster, a high-performance GPU training architecture for recommendation system scenarios. It has now been deployed in the Meituan Waimai recommendation scenario; with the offline effect fully aligned with the algorithm across multiple model generations, the cost performance is 2~4 times that of the previously optimized CPU tasks. Because Booster is well compatible with the native TensorFlow interface, an existing TensorFlow CPU task can be migrated with only one line of code. As a result, Booster was quickly validated on multiple Meituan business lines, with an average cost performance more than twice that of the previous CPU tasks. This article focuses on the design and optimization of the Booster architecture and the whole process of landing it in the Meituan Waimai recommendation scenario, hoping to be of help or inspiration to everyone.

2 GPU training and optimization challenges

GPU training has been widely applied to deep learning models in CV, NLP, ASR and other scenarios within Meituan, but it had not been widely applied in recommendation system scenarios, which is closely related to both the characteristics of the models in these scenarios and the hardware characteristics of GPU servers.

Features of deep learning models for recommendation systems

  • Large sample size: training samples range from tens of TB to hundreds of TB, while CV and similar scenarios are usually within hundreds of GB.
  • Large number of model parameters: the model contains both large-scale sparse parameters and dense parameters, requiring hundreds of GB or even TB of storage, while models in CV and other scenarios mainly consist of dense parameters, usually within tens of GB.
  • Relatively low computational complexity: a single training step of a recommendation system model on GPU takes only 10~100ms, whereas a CV model takes 100~500ms and an NLP model 500ms~1s per step on GPU.

GPU Server Features

  • GPU cards have strong computing power, but GPU memory is still limited: to make full use of GPU computing power, all kinds of data used in GPU computation need to be placed into GPU memory in advance. From 2016 to 2020, the computing power of NVIDIA Tesla GPU cards [5] increased by more than 10 times, but their memory size only increased by about 3 times.
  • Resources in other dimensions are relatively limited: compared with the growth of GPU computing power, the CPU capacity and network bandwidth of a single machine grow more slowly. For a model that is heavy on these two types of resources, the GPU capability cannot be fully utilized, and the cost performance of a GPU server is not much higher than that of a CPU server.

To sum up, model training in CV, NLP and similar scenarios is compute-intensive, and most models can fit into the memory of a single GPU card, which matches the strengths of GPU servers very well. In recommender system scenarios, however, the models are relatively less complex, the volume of samples read remotely is large, and feature processing consumes a lot of CPU, which puts great pressure on the single machine's CPU and network; at the same time, when the number of model parameters is large, the model cannot fit into the GPU memory of a single machine. These weaknesses of GPU servers are exactly what recommendation system scenarios hit.

Fortunately, the NVIDIA A100 GPU server largely makes up for these shortcomings in GPU memory, CPU and bandwidth through hardware upgrades. However, without a proper system implementation and optimization, the cost-performance benefit will still not be high. During the landing of the Booster architecture, we mainly faced the following challenges:

  • Data pipeline: how to use multiple network cards and multi-socket CPUs to build a high-performance data pipeline, so that the data supply can keep up with the consumption speed of the GPUs.
  • Mixed parameter computation: when the GPU memory of a single card cannot directly hold the large-scale sparse parameters, how to make full use of the high computing power of the GPUs and the high bandwidth between GPU cards to implement large-scale sparse parameter computation, while also handling dense parameter computation.

3 System Design and Implementation

In the face of the above challenges, it is difficult to find a solution purely from the system side. Booster adopts an "algorithm + system" co-design, which greatly simplifies the design of this generation of the system. In terms of implementation path, considering the expected delivery time and the implementation risk for the business, we did not land the multi-machine multi-card version of Booster in one step; the first version is a single-machine multi-card GPU version, and this article also focuses on the single-machine multi-card work. In addition, relying on the powerful computing capability of the NVIDIA A100 server, the computing power of a single machine can already satisfy the single-experiment needs of most of Meituan's businesses.

3.1 Rationalization of Parameter Scale

The large-scale use of sparse discrete features leads to rapid expansion of the Embedding parameters in deep prediction models, and multi-TB models were once popular in major business scenarios across the industry. However, the industry soon realized that, with limited hardware budgets, excessively large models place a heavy burden on production deployment and operation as well as on experimental iteration and innovation. Academic studies [10-13] show that the effectiveness of a model depends on its information capacity rather than its parameter count. Practice shows that the former can be improved by optimizing the model structure, while the latter has a lot of room for reduction under the premise of preserving the effect. In 2020, Facebook proposed Compositional Embedding [14], which compresses the parameter size of recommendation models by several orders of magnitude. Alibaba also published related work [15], reducing the prediction model of a core business scenario from several TB to tens of GB or even smaller. In general, the industry's approaches mainly follow these ideas:

  • Removing cross features: cross features are generated by the Cartesian product of single features, which creates a huge feature ID space and a correspondingly huge Embedding parameter table. By now there are many methods that model the interaction between single features through the model structure, avoiding the Embedding expansion caused by cross features, such as the FM family [16], AutoInt [17], CAN [18], etc.
  • Streamlining features: in particular, NAS-based ideas that achieve adaptive feature selection for deep neural networks at low training cost, such as Dropout Rank [19] and FSCD [20].
  • Compressing the number of Embedding vectors: composite ID encoding and Embedding mapping are applied to feature values, so that the number of Embedding vectors is far smaller than the feature value space while still achieving rich feature representation, such as Compositional Embedding [14] and Binary Code Hash Embedding [21].
  • Compressing the Embedding vector dimension: the dimension of a feature's Embedding vector determines the upper bound of the information it can represent, but not all feature values carry enough information to need that much capacity; therefore each feature value can adaptively learn a reduced Embedding dimension to compress the total parameter count, such as AutoDim [22] and AMTL [23].
  • Quantization compression: more aggressive schemes that quantize the model parameters with half precision or even INT8, such as DPQ [24] and MGQE [25].

The model of Meituan Waimai recommendation once exceeded 100GB. By applying the above schemes, we kept the model below 10GB while keeping the loss of model prediction accuracy under control.

Based on this basic algorithmic assumption, we set the design goal of the first stage as supporting a parameter scale of up to 100GB. This fits well with the GPU memory of the A100 and can be stored across the cards of a single machine; with a bidirectional inter-card bandwidth of 600GB/s, it can give full play to the processing capability of the GPUs and also meets the needs of most of Meituan's models.

3.2 System Architecture

An architecture designed for a GPU system should fully consider the characteristics of the hardware in order to give full play to its performance advantages. The hardware topology of our NVIDIA A100 server is similar to that of the NVIDIA DGX A100 [6]: each server contains 2 CPUs, 8 GPUs and 8 network cards. The overall architecture of Booster is shown below:

The whole system mainly consists of three core modules: data module, computing module, communication module:

  • Data module: Meituan has developed a data distribution system that supports multiple data sources and multiple training frameworks. On the GPU system, we modified the data module to support data download over multiple network cards and, with NUMA awareness in mind, deployed one data distribution service on each CPU.
  • Computing module: Each GPU card starts a TensorFlow training process to perform training.
  • Communication module: We use Horovod[7] for distributed training inter-card communication. We start a Horovod process on each node to perform corresponding communication tasks.

The above design is consistent with the native design paradigms of TensorFlow and Horovod. The core modules are decoupled from each other and can iterate independently, and incorporating the latest features of the open source community requires no architectural changes.

Let’s take a look at the brief execution process of the whole system. The internal execution logic of TensorFlow process started on each GPU card is shown as follows:

The whole training process involves several key modules such as parameter storage, the optimizer, and inter-card communication. The input features of a sample can be divided into sparse features (ID-type features) and dense features. In actual business scenarios, sparse features usually have a very large number of IDs, so it is appropriate to store the corresponding sparse parameters in a HashTable data structure; because there are so many parameters that they cannot fit into the GPU memory of a single card, we partition them across the GPU memory of multiple cards by ID modulo. For sparse features with a small number of IDs, businesses usually express the parameters as a multidimensional matrix (in TensorFlow, a Variable); since there are few parameters, a single GPU card can hold them, so the parameters are replicated into the GPU memory of every card. Dense parameters likewise use the Variable data structure and are placed into GPU memory as replicas. A small sketch of the modulo-based placement rule follows, and the internal implementation of the Booster architecture is then described in detail.
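
The snippet below is only an illustrative sketch of the ID-modulo placement rule described above (one HashTable shard per card); the card count, IDs and helper name are assumptions of this example, and the real routing happens inside Booster's GPU HashTable and AllToAll pipeline.

```python
import numpy as np

NUM_GPUS = 8  # one training process / HashTable shard per GPU card

def partition_ids_by_modulo(ids):
    """Route each sparse feature ID to the card that owns its embedding row."""
    owner = ids % NUM_GPUS                                   # owning card of each ID
    return [ids[owner == rank] for rank in range(NUM_GPUS)]

# Example: hashed feature IDs from one card's input batch.
ids = np.array([17, 1024, 33, 8, 9, 42], dtype=np.int64)
buckets = partition_ids_by_modulo(ids)
# buckets[0] -> [1024, 8], buckets[1] -> [17, 33, 9], buckets[2] -> [42], ...
```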

3.3 Key Implementation

3.3.1 Parameter Storage

As early as in the PS architecture for the CPU scenario, we had already implemented the whole set of logic for large-scale sparse parameters. To move this logic to GPU, the first thing to implement was a GPU version of the HashTable. We surveyed various GPU HashTable implementations in the industry, such as cuDF, cuDPP, cuCollections and WarpCore, and finally chose to implement the TensorFlow GPU HashTable based on cuCollections. The main reason is that in actual business scenarios, the total number of large-scale sparse features is usually unknown and feature crossing may happen at any time, so the total number of sparse features can vary greatly; "dynamic expansion" therefore becomes a necessary capability of the GPU HashTable, and cuCollections was the only implementation that could scale dynamically. On top of the cuCollections GPU HashTable we implemented a dedicated interface (find_or_insert), optimized its large-scale read and write performance, encapsulated it into TensorFlow, and implemented low-frequency feature filtering on top of it, aligning the sparse parameter storage module with the CPU version.
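
For intuition only, here is a host-side toy sketch of the find_or_insert contract described above; the real implementation is a cuCollections-based GPU HashTable wrapped as a TensorFlow op, and the class name, embedding dimension and initializer below are purely illustrative.

```python
import numpy as np

EMB_DIM = 16  # illustrative embedding dimension

class ToyHashTable:
    """Toy illustration of find_or_insert semantics (not the GPU implementation)."""

    def __init__(self):
        self._table = {}                                    # grows dynamically as IDs arrive

    def find_or_insert(self, ids):
        rows = []
        for i in ids:
            if i not in self._table:                        # unseen ID: insert a fresh embedding
                self._table[i] = np.random.normal(scale=0.01, size=EMB_DIM)
            rows.append(self._table[i])                     # known ID: return the stored row
        return np.stack(rows)
```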

3.3.2 Optimizer

At present, the optimizers for sparse parameters are not compatible with the optimizers for dense parameters. On top of the GPU HashTable we implemented a variety of sparse optimizers, all with optimizer momentum fusion and similar features, mainly covering the Adam, Adagrad, FTRL and Momentum optimizers. For real business scenarios, these optimizers already cover most business usage. The dense part of the parameters can directly use the sparse/dense optimizers natively supported by TensorFlow.

3.3.3 Inter-card Communication

During the actual training, we have different processing procedures for different types of features:

  • Sparse features (ID-type features, large scale, stored in a HashTable): since the input sample data on each card is different, the embedding vectors for the input sparse features may be stored on other GPU cards. Concretely, in the forward pass, the ID features of each card are first partitioned to the other cards by modulo through an inter-card AllToAll; each card then queries its own GPU HashTable for the sparse feature vectors, and a second inter-card AllToAll returns the IDs received in the first AllToAll, together with their feature vectors, back to the cards they came from. After these two AllToAll communications, every ID feature in each card's input samples has obtained its embedding vector. In the backward pass, an inter-card AllToAll is used again to partition the gradients of the sparse parameters to the other cards by modulo; once each card has received the sparse gradients belonging to its own shard, it runs the sparse optimizer to complete the update of the large-scale sparse parameters (the three communication patterns in this list are sketched in code after the list). The detailed process is shown in the figure below:

  • Sparse features (small scale, stored in a Variable): compared with the HashTable case, every GPU card holds the full set of parameters, so the model parameters can be looked up directly on the local card. In the backward pass, the gradients on all cards are aggregated through an inter-card AllGather and averaged, and then handed to the optimizer for the parameter update.
  • Dense features: dense parameters are also fully replicated on each card, so the parameters can be read directly on the local card for training. In the backward pass, the dense gradients of all cards are aggregated through an inter-card AllReduce, and then the dense optimizer is executed.
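
To make the three communication patterns above concrete, here is a minimal Horovod sketch assuming one process per card as in Section 3.2. The placeholder tensors, the table size and the embedding_lookup standing in for the GPU HashTable are illustrative only; real code also passes explicit alltoall splits derived from the ID-modulo routing, which are omitted here.

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one training process per GPU card

# Illustrative placeholder tensors (not Booster's real graph).
local_ids = tf.constant([3, 11, 42, 1024], dtype=tf.int64)     # sparse IDs in this card's batch
local_shard = tf.get_variable("emb_shard", [1000, 16])          # this card's partition of the table
replica_sparse_grad = tf.random.normal([4, 16])                 # grad of a replicated sparse Variable
dense_grad = tf.random.normal([128])                            # grad of a dense parameter

# 1) Large-scale sparse features (HashTable, partitioned by ID modulo):
#    AllToAll sends each ID to the card that owns it, the owner looks it up,
#    and a second AllToAll returns the embeddings to the requesting card.
ids_for_me = hvd.alltoall(local_ids)
vecs_for_others = tf.nn.embedding_lookup(local_shard, ids_for_me % 1000)  # stand-in for the GPU HashTable
my_vecs = hvd.alltoall(vecs_for_others)

# 2) Small-scale sparse features (replicated Variable): backward gradients are
#    AllGathered across cards and averaged before the optimizer runs.
gathered_grad = hvd.allgather(replica_sparse_grad)

# 3) Dense parameters (replicated Variable): backward gradients are averaged
#    across cards with AllReduce, then the dense optimizer is applied.
avg_dense_grad = hvd.allreduce(dense_grad)
```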

Throughout this execution process, the sparse and dense parameters are all placed in GPU memory, the model computation is entirely processed on GPU, and the communication bandwidth between GPU cards is high enough to give full play to the powerful computing power of the GPUs.

To summarize, the core differences between Booster training architecture and CPU scenario PS architecture are as follows:

  • Training mode: the PS architecture uses asynchronous training, while the Booster architecture uses synchronous training.
  • Parameter distribution: in the PS architecture, model parameters are stored in PS memory; in the Booster architecture, sparse parameters (HashTable) are partitioned across the eight cards of a single machine, while dense parameters (Variable) are replicated on every card. The Worker role in the Booster architecture therefore combines the functions of both the PS and Worker roles of the PS architecture.
  • Communication mode: in the PS architecture, PS and Workers communicate over TCP (gRPC/Seastar); in the Booster architecture, Workers communicate over NVSwitch (NCCL [8]), with a bidirectional bandwidth of 600GB/s between any two cards. This is also one of the reasons for the large training speed improvement of the Booster architecture.

Since the input data of each card is different and the model parameters are either partitioned or replicated across cards, the Booster architecture involves both model parallelism and data parallelism. In addition, because the NVIDIA A100 requires CUDA 11.0 or above, and among TensorFlow 1.x releases only NV1.15.4 supports CUDA 11.0 while the vast majority of Meituan business scenarios still use TensorFlow 1.x, all of our changes are developed on top of NV1.15.4.

The above is an introduction to Booster’s overall system architecture and internal execution process. The following sections focus on some performance optimizations we made based on the initial implementation of Booster architecture.

4 System performance optimization

After implementing the first version of the system based on the above design, we found that the end-to-end performance fell short of expectations: the GPU SM Activity metric was only 10%~20%, not much better than the CPU tasks. To analyze the performance bottlenecks of the architecture, we used NVIDIA Nsight Systems (hereafter NSYS), Perf, uPerf and other tools, and through module-level stress testing, simulation analysis and other methods we located bottlenecks in the data layer, the computation layer and the communication layer, and made the corresponding optimizations for each. In the following, we take a Meituan Waimai recommendation model as an example to introduce our performance optimization work on the data layer, computation layer and communication layer of the GPU architecture.

4.1 Data Layer

As mentioned above, the deep learning models of recommendation systems have a large sample size and relatively simple structure, so data I/O itself is a bottleneck. If the data I/O that used to be spread over dozens of CPU servers is all performed on a single GPU server, the data I/O pressure becomes even greater. Let's first look at the sample data flow of the current system, as shown in the figure below:

Core process: the data distribution process reads HDFS sample data (in TFRecord format) into memory over the network, and then passes the sample data to the TensorFlow training process through shared memory. After receiving the sample data, the TensorFlow training process runs the native TensorFlow feature parsing logic, and the resulting feature data is copied into GPU memory via MemcpyH2D. Through module-level stress testing, we found that the sample pulling in the data distribution layer, the feature parsing in the TensorFlow layer, and the MemcpyH2D of feature data to GPU all had significant performance problems (shown as the yellow steps in the figure). The performance optimization work on these parts is described in detail below.

4.1.1 Optimization of sample pulling

Sample pulling and batch assembly are done by the data distribution process. Our main optimizations here are as follows. First, each data distribution process is bound to its own NUMA node with numactl, avoiding data movement across NUMA nodes. Second, data download is expanded from a single network card to multiple network cards, increasing the download bandwidth. Finally, the transmission channel between the data distribution process and the TensorFlow process is extended from a single shared memory region to an independent shared memory region per GPU card, avoiding the memory bandwidth problem of a single shared memory region; on top of this, zero-copy of the input data during feature parsing is realized inside TensorFlow.

4.1.2 Feature Parsing Optimization

At present, the sample data of most of Meituan's internal businesses is still in TFRecord format, which is essentially the ProtoBuf (PB for short) format. PB deserialization is CPU-intensive, with the ReadVarint64Fallback method being particularly prominent. The actual profiling results are shown below:

The reason is that training samples in CTR scenarios usually contain a large number of int64 features. Int64 values are stored as Varint64 data in PB, and the ReadVarint64Fallback method is used to parse them. Whereas a normal int64 occupies a fixed 8 bytes, Varint64 uses a variable storage length depending on the data range. When parsing a Varint value, PB first has to determine its length: each byte stores 7 data bits, and the highest bit is a marker indicating whether the next byte still belongs to the same value; if the highest bit of the current byte is 0, the Varint value ends at this byte. In our actual business scenarios, the ID features are mostly hash values, which become long when expressed as Varint64. As a result, the parsing process repeatedly checks whether the value has ended and repeatedly shifts and stitches bytes together to produce the final value, so the CPU suffers from a large amount of branch prediction and temporary variables during parsing, which severely hurts performance. Here is a flowchart for parsing a 4-byte Varint:
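
As a companion to the flowchart, a plain scalar decoder for this format would look roughly like the sketch below (illustrative only, not Booster's actual parser); the per-byte loop, branch and shift/OR assembly is exactly the work that the SIMD optimization batches away.

```python
def decode_varint(buf, pos=0):
    """Decode one Varint starting at buf[pos]; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift   # take the low 7 data bits of this byte
        pos += 1
        if b & 0x80 == 0:               # highest bit 0: this byte ends the Varint
            return result, pos
        shift += 7                      # otherwise the next byte continues the value

# Example: the value 300 is encoded as the two bytes 0xAC 0x02.
value, next_pos = decode_varint(bytes([0xAC, 0x02]))
assert value == 300
```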

This process is very well suited to batch optimization with the SIMD instruction set. Taking the 4-byte Varint type as an example, our optimization consists of two main steps:

  1. SIMD finds the highest bit: a SIMD instruction performs a bitwise AND between each byte of the Varint data and 0x80, and finds the first byte whose result is 0; that byte is the end position of the current Varint value.
  2. SIMD processes the Varint: normally one would shift each byte of the Varint right by 3/2/1/0 bits respectively and combine them to get the final int, but SIMD has no such per-byte variable-shift instruction. Therefore, we handle the high and low 4 bits of each byte separately with SIMD instructions: the high 4 bits and the low 4 bits of the Varint data are processed into int_h4 and int_l4 respectively, and an OR of the two yields the final int value. The specific optimization process is shown in the figure below (for 4 bytes of data):

For Varint64 data, we simply split it into two Varint values and process them the same way. Through these two SIMD optimizations, the sample parsing speed is greatly improved, and on top of the improvement in end-to-end GPU training speed, CPU usage is reduced by 15%. We mainly used the SSE instruction set here; we also tried AVX and other wider instruction sets, but the benefit was not obvious, so we did not adopt them in the end. In addition, SIMD instructions can cause severe CPU downclocking on older machines, which is why the official community has not introduced this optimization; since the CPUs of our GPU machines are relatively new, we can safely use SIMD for this optimization.

4.1.3 MemcpyH2D pipeline

After the samples are parsed into feature data, the features need to be moved to the GPU to perform the model computation, which requires CUDA's MemcpyH2D operation. Profiling this part with NSYS, we found that the GPU had a lot of idle time during execution: the GPU had to wait for the feature data Memcpy to arrive before it could start model training, as shown in the following figure:

For a GPU system, data should be transferred in advance into the GPU memory closest to the GPU processor so that the GPU's computing capability can be fully utilized. Based on TensorFlow's prefetch mechanism, we implemented a GPU version of PipelineDataset that copies the data into GPU memory before the computation starts. Note that when copying from CPU memory to GPU memory, the CPU side needs to use pinned memory rather than ordinary paged memory, which speeds up the MemcpyH2D process.
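
A hedged sketch of this prefetch idea is shown below, using the community tf.data.experimental.prefetch_to_device transformation as an approximation of Booster's GPU PipelineDataset. The file name, batch size and feature schema are illustrative stand-ins; the real implementation additionally stages host-side buffers in pinned (page-locked) memory to speed up MemcpyH2D.

```python
import tensorflow as tf

def parse_features(serialized):
    # Illustrative schema only; real samples carry many sparse and dense features.
    return tf.io.parse_example(serialized, {
        "ids": tf.io.VarLenFeature(tf.int64),
        "dense": tf.io.FixedLenFeature([128], tf.float32),
    })

dataset = (tf.data.TFRecordDataset(["part-00000.tfrecord"])    # illustrative file name
           .batch(8192)
           .map(parse_features, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           # Copy parsed batches into local GPU memory ahead of the computation,
           # overlapping MemcpyH2D with the previous training step.
           .apply(tf.data.experimental.prefetch_to_device("/gpu:0", buffer_size=2)))
```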

4.1.4 Hardware Tuning

During the performance optimization of the data layer, the server, network and operating system teams of Meituan's internal Basic R&D Platform also helped us with the related tuning:

  • In terms of network transmission, to reduce the processing cost of the network protocol stack and improve the efficiency of data copying, we optimized the NIC configuration and enabled LRO (Large Receive Offload), TC Flower hardware offload, tx-nocache-copy and other features, finally increasing the network bandwidth by 17%.
  • In terms of CPU performance, profiling showed that memory latency and bandwidth were the bottlenecks. So, combining the business scenario with the NUMA characteristics, we tried three NPS (NUMA nodes per socket) configurations and chose NPS2. Combined with other BIOS configurations (such as APBDIS, P-State, etc.), this reduced memory latency by 8% and increased memory bandwidth by 6%.

Through the above optimizations, the peak network bandwidth increased by 80%, and under the bandwidth actually required by the business, the GPU's H2D bandwidth increased by 86%. In the end we also obtained a 10%+ performance gain in data parsing.

After the data layer optimizations of sample pulling, feature parsing, MemcpyH2D and hardware tuning, the Booster architecture improves end-to-end training speed by 40%, the training cost performance reaches 1.4 times that of CPU, and the data layer is no longer the performance bottleneck of the architecture.

4.2 Computation Layer

4.2.1 Embedding pipeline

We had already implemented the Embedding Pipeline [1] when optimizing TensorFlow training performance in the CPU scenario: the whole computation graph is split into two subgraphs, the Embedding Graph (EG) and the Main Graph (MG), which are executed asynchronously and independently so that the two overlap (the whole splitting process can be made transparent to users). The EG mainly covers extracting the Embedding keys from the samples, querying and assembling the Embedding vectors, and updating the Embedding vectors. The MG mainly includes the computation of the dense sub-network, gradient computation, and the update of the dense parameters.

The interaction between the two subgraphs is as follows: the EG passes the Embedding vectors to the MG (from the MG's perspective, it is reading values from a dense Variable), and the MG passes the gradients of the corresponding Embedding parameters back to the EG. Both of these processes are expressed as TensorFlow computation graphs. We use two Python threads and two TensorFlow Sessions to execute the two computation graphs concurrently, so that the two stages overlap and a higher training throughput is achieved.
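
The sketch below illustrates this overlap with two tiny placeholder graphs, assuming TF 1.x Session semantics; Booster splits the real model graph automatically and keeps the split transparent to the user, so the graphs, tensor shapes and step count here are illustrative only.

```python
import queue
import threading
import tensorflow as tf

handoff = queue.Queue(maxsize=2)               # EG -> MG: assembled embedding vectors

emb_graph = tf.Graph()
with emb_graph.as_default():
    emb_out = tf.random.normal([1024, 16])     # stands in for HashTable lookup + AllToAll

main_graph = tf.Graph()
with main_graph.as_default():
    emb_in = tf.placeholder(tf.float32, [1024, 16])
    train_step = tf.reduce_sum(emb_in)         # stands in for the dense network + AllReduce

def run_eg(steps):
    with tf.Session(graph=emb_graph) as sess:
        for _ in range(steps):
            handoff.put(sess.run(emb_out))     # EG of step N overlaps MG of step N-1

def run_mg(steps):
    with tf.Session(graph=main_graph) as sess:
        for _ in range(steps):
            sess.run(train_step, feed_dict={emb_in: handoff.get()})

threads = [threading.Thread(target=run_eg, args=(100,)),
           threading.Thread(target=run_mg, args=(100,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```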

We also implemented this process under the GPU architecture and added the inter-card synchronization steps: the AllToAll communication of the large-scale sparse features and the AllToAll communication of their backward gradients are executed in the EG; the inter-card AllGather synchronization of the backward gradients of ordinary sparse features and the inter-card AllReduce synchronization of the backward gradients of dense parameters are executed in the MG. Note that in the GPU scenario, EG and MG execute their CUDA kernels on the same GPU Stream. We tried executing EG and MG on separate GPU Streams, but performance became worse; the root cause is related to the underlying CUDA implementation and is still to be resolved.

4.2.2 Operator optimization and XLA

Compared with optimization on CPU, optimization on GPU is more complex. First, some TensorFlow operators have no GPU implementation; when such CPU operators are used in the model, data copies between host memory and GPU memory occur with the upstream and downstream GPU operators, hurting overall performance. We implemented GPU versions of the operators that are used frequently and have a large impact. In addition, the operator granularity of the TensorFlow framework is very fine, which makes it convenient for users to flexibly build all kinds of complex models, but is a disaster for the GPU processor: the large number of kernel launches and the associated memory overhead prevent the GPU computing power from being fully utilized. Optimization on GPU usually goes in two directions, manual optimization and compiler optimization. In terms of manual optimization, we re-implemented some commonly used operators and layers (Unique, DynamicPartition, Gather, etc.).

Take the Unique operator as an example: the Unique operator of native TensorFlow requires the order of the output elements to be the same as the order of the input elements, but in our actual scenarios we do not need this restriction.

In terms of compiler optimization, we currently mainly use XLA [9] provided by the TensorFlow community for some automatic optimization. Enabling XLA normally in native TensorFlow 1.15 can bring a 10~20% end-to-end performance improvement. However, XLA does not handle operators with dynamic shapes well, and this situation is very common in recommendation system models, which causes XLA acceleration to fall short of expectations or even to be a negative optimization. Therefore, we did the following mitigation work:

  • Local optimization: for the dynamic-shape operators we introduced manually (such as Unique), we mark their subgraphs so that they are not compiled by XLA; XLA only optimizes the subgraphs that can be accelerated stably (a minimal scoping sketch follows this list).
  • OOM safety net: XLA caches compilation results keyed by operator type, input type, shape and other information to avoid repeated compilation. However, due to the particularity of sparse scenarios and our GPU architecture implementation, operators such as Unique and DynamicPartition naturally have dynamic output shapes, so these operators and the operators downstream of them cannot hit the XLA cache during compilation and are recompiled over and over. New cache entries keep accumulating while old ones are never released, which eventually causes host-memory OOM. We implemented an LRUCache inside XLA to proactively evict old compilation cache entries and avoid the OOM problem.
  • Const Memcpy elimination: when XLA rewrites TensorFlow operators into HLO, it marks data that is fixed at compile time as Const, but the output of these Const operators is only defined on the Host side, so an extra MemcpyH2D is needed to send the Host-side output to the Device side. This occupies TensorFlow's original H2D Stream and interferes with copying sample data to the GPU ahead of time. Since the Const output of XLA is fixed at compile time, there is no need to do MemcpyH2D on every step; we cache the Device-side output and read it directly from the cache on subsequent uses, avoiding redundant MemcpyH2D.
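
The minimal scoping sketch promised above is shown here, assuming the community jit_scope API shipped with TensorFlow 1.15; the tensors and the choice of ops are illustrative, not Booster's actual graph-marking pass.

```python
import tensorflow as tf

jit_scope = tf.xla.experimental.jit_scope

dense_input = tf.random.normal([32, 64])                 # illustrative inputs
sparse_ids = tf.constant([3, 3, 7, 9], dtype=tf.int64)

with jit_scope():                         # stable-shape dense subgraph: let XLA compile it
    hidden = tf.layers.dense(dense_input, 256, activation=tf.nn.relu)

with jit_scope(compile_ops=False):        # dynamic-shape subgraph: keep it out of XLA clusters
    unique_ids, idx = tf.unique(sparse_ids)
```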

The XLA work above is mostly a matter of problem fixing. Currently, XLA can be enabled normally in our GPU scenario and improves training speed by 10~20%. It is worth mentioning that the deep learning compiler team of Meituan's internal Basic R&D machine learning platform already has a thorough solution to the compilation of dynamic-shape operators, and we will solve this problem together in the future.

After the Embedding pipeline and XLA optimizations in the computation layer, the end-to-end training speed of the Booster architecture improves by a further 60%, and the cost performance of single-machine eight-card GPU training reaches 2.2 times that of CPU with the same resources.

4.3 Communication Layer

During single-machine multi-card training, analysis with Nsight Systems showed that the proportion of time spent on inter-card communication was very high, and GPU utilization was very low during that time, as shown in the following figure:

As can be seen from the figure, inter-card communication takes a long time during training, and GPU utilization is very low while the communication is in progress; inter-card communication is the key bottleneck blocking further training performance improvement. Breaking down the communication process, we found that the negotiation time of inter-card communication (AllToAll, AllReduce, AllGather, etc.) is much longer than the data transmission time:

Taking AllToAll, which is responsible for large-scale sparse parameter communication, as an example, we observed with Nsight Systems that the long negotiation time was mainly caused by an operator on some card starting to execute too late. Since TensorFlow's operator scheduling is not strictly ordered, the actual execution time of the same embedding_lookup operator varies from card to card: the embedding_lookup that executes first on one card may execute last on another. We therefore suspected that inconsistent operator scheduling across cards causes the cards to initiate communication at different times, which ultimately makes the negotiation time excessively long. We also verified through several groups of simulation experiments that the problem is indeed caused by operator scheduling. The most direct solution would be to modify the core scheduling algorithm of the TensorFlow computation graph, but this has always been a hard problem even in academia. We took another route and solved the problem by fusing key operators; based on profiling statistics, we chose the operators related to HashTable and Variable.

4.3.1 HashTable Operator Fusion

We designed and implemented a graph optimization pass that automatically merges the HashTables in a graph that can be merged, together with their corresponding embedding_lookup operations; the merging strategy is to merge HashTables whose embedding sizes are the same. Meanwhile, to avoid ID conflicts between the original features after the HashTables are merged, we introduced automatic unified feature encoding: different original features are given different offsets and fall into different feature domains, so that a unified feature encoding is used during training.
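
The snippet below sketches only the unified feature encoding idea (per-feature offsets that keep merged ID spaces disjoint); the field names, offsets and bit widths are illustrative assumptions, and Booster applies the encoding automatically inside its graph optimization pass.

```python
import numpy as np

# Each original feature (field) gets its own offset, so IDs from different
# fields can never collide inside the merged HashTable.
FIELD_OFFSET = {
    "user_id":  0 << 50,
    "item_id":  1 << 50,
    "query_id": 2 << 50,
}

def encode(field, raw_ids):
    """Map a field's raw hashed IDs into the merged table's unified ID space."""
    return np.asarray(raw_ids, dtype=np.int64) + FIELD_OFFSET[field]

# After encoding, one merged HashTable / embedding_lookup serves all three fields,
# so the EmbeddingGraph issues one AllToAll instead of one per field.
merged_ids = np.concatenate([
    encode("user_id",  [12, 7]),
    encode("item_id",  [12, 99]),      # same raw ID 12, but a different unified ID
    encode("query_id", [5]),
])
```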

We tested this graph optimization on a real business model: it merged 38 HashTables into 2 and 38 embedding_lookup operations into 2, which reduces the number of embedding_lookup-related operators in the EmbeddingGraph by 90% and the number of synchronous inter-card communications by 90%. In addition, after the merging, the GPU operators inside embedding_lookup are also fused, reducing the number of kernel launches and making the EmbeddingGraph execute faster.

4.3.2 Fusion of Variable related operators

Similar to the idea behind HashTable fusion, we observed that a business model usually contains dozens to hundreds of native TensorFlow Variables, and these Variables need inter-card synchronization during training; likewise, a large number of Variables makes the inter-card synchronization negotiation take a long time. We automatically merge all trainable Variables together through Concat/Split operators, so that the backward pass of the whole MG produces only a few gradient Tensors, which greatly reduces the number of inter-card synchronizations. Moreover, after Variable fusion, the number of operators actually executed in the optimizer drops sharply, which also speeds up the execution of the computation graph itself.
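
A hedged manual sketch of the Concat/Split idea follows: one fused trainable Variable backs several dense weights, so the backward pass yields a single gradient tensor and a single AllReduce negotiation. Booster performs this rewrite automatically on the real graph; the shapes, toy loss and optimizer below are illustrative.

```python
import numpy as np
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

shapes = [[64, 128], [128, 1]]                       # e.g. two MLP weight matrices
sizes = [int(np.prod(s)) for s in shapes]

# One fused Variable; Split/reshape views act as the individual dense weights.
fused_var = tf.get_variable("fused_dense", [sum(sizes)])
w1, w2 = [tf.reshape(p, s) for p, s in zip(tf.split(fused_var, sizes), shapes)]

x = tf.random.normal([32, 64])                       # toy forward pass
loss = tf.reduce_mean(tf.matmul(tf.nn.relu(tf.matmul(x, w1)), w2))

(grad,) = tf.gradients(loss, [fused_var])            # a single gradient tensor for all dense weights
grad = hvd.allreduce(grad)                           # one inter-card synchronization instead of many
train_op = tf.train.AdamOptimizer(1e-3).apply_gradients([(grad, fused_var)])
```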

It should be noted that TensorFlow Variables fall into two types. One is the Dense Variable, all of whose parameters participate in training at every step, such as the weights of the MLP. The other is the Variable used exclusively by embedding_lookup, of which only part of the values participate in training at each step, called a Sparse Variable. For the former, merging Variables does not affect the algorithm effect. For the latter, the backward gradient is an IndexedSlices object and the inter-card synchronization defaults to AllGather communication. If the business model optimizes Sparse Variables with a lazy optimizer, i.e. each step only updates some rows of the Variable, then merging Sparse Variables turns the backward gradient from IndexedSlices objects into Tensor objects and the inter-card synchronization from AllGather into AllReduce, which may affect the algorithm effect. For this case, we provide a switch that lets the business control whether to merge Sparse Variables. Our measurements show that merging the Sparse Variables of one recommendation model improved training performance by 5~10%, while the impact on the actual business effect was within one per mille.

The fusion of these two kinds of operators not only optimizes inter-card communication performance but also improves in-card computation performance. With these two fusion optimizations in place, the end-to-end training speed of the GPU architecture improves by a further 85% without affecting the effect of the business algorithm.

4.4 Performance Metrics

After the performance optimization of the data layer, computation layer and communication layer, the GPU architecture achieves 2~4 times the cost performance of our TensorFlow [3] CPU scenario (different business models see different benefits). Based on a Meituan Waimai recommendation model, we compared the training performance of native TensorFlow 1.15 and our optimized TensorFlow 1.15 on a single GPU node (A100 with eight cards) against a CPU cluster of the same cost. The specific data are as follows:

It can be seen that the training throughput of our optimized TensorFlow GPU architecture is more than 3 times that of the native TensorFlow GPU, and more than 4 times that of the optimized TensorFlow CPU scenario.

Note: native TensorFlow uses tf.Variable to store the Embedding parameters.

5 Business Landing

Landing the Booster architecture in business production requires not only good system performance, but also a complete training ecosystem and guarantees on the effect of the models it trains.

5.1 Completeness

A complete model training experiment, in addition to the Train task, usually also needs Evaluate or Predict tasks for the model. We encapsulated the training architecture based on the TensorFlow Estimator paradigm, so that a single copy of user-side code uniformly supports Train, Evaluate and Predict tasks in both GPU and CPU scenarios, with flexible switching via configuration switches; users only need to focus on developing the model code itself. We encapsulated all the architectural changes inside the engine, so that a user can migrate from the CPU scenario to the GPU architecture with just one line of code:

```python
tf.enable_gpu_booster()
```
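
Below is a hedged sketch of how this one-line switch fits into an otherwise unchanged Estimator-style script; model_fn, the input functions and the model_dir come from the user's existing CPU job and are illustrative here.

```python
import tensorflow as tf

tf.enable_gpu_booster()                        # the only change needed to move to the GPU architecture

# model_fn / train_input_fn / eval_input_fn are defined in the user's existing CPU job.
estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="./ckpt")
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)   # Train + Evaluate as before
```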

In real business scenarios, users usually use the train_and_evaluate mode to evaluate the model's effect while the training task is running. After adopting the Booster architecture, training runs so fast that the Evaluate step can no longer keep up with the pace at which training produces checkpoints. On top of the GPU training architecture, we therefore added support for Evaluate on GPU: a business can apply for a dedicated A100 GPU for Evaluate, and the Evaluate speed of a single GPU is 40 times that of a single Evaluate process in the CPU scenario. At the same time, we also added GPU support for Predict; with eight cards on a single machine, Predict is 3 times faster than CPU at the same cost.

In addition, we provide complete options for task resource configuration: on top of the single-machine eight-card setup (each A100 machine can be configured with at most eight cards), we also support single-machine one-card, two-card and four-card setups. We also made the checkpoints of the one-card, two-card, four-card and eight-card setups and of the CPU PS architecture compatible with one another, so that users can freely switch between these training modes and continue training from existing checkpoints. This makes it convenient for users to choose reasonable resource types and amounts for their experiments, and businesses can also train new models starting from the checkpoints of existing models.

5.2 Training Effect

Compared with CPU training in the PS/Worker asynchronous mode, single-machine multi-card training is fully synchronous across cards, thus avoiding the impact that delayed gradient updates in asynchronous training have on the training effect. However, in synchronous mode the effective batch size of each iteration is the sum of the samples on all cards, and in order to make full use of the A100's computing power we enlarge the per-card batch size (the number of samples per step) as much as possible. This makes the effective training batch size (on the order of tens of thousands) much larger than that of the PS/Worker asynchronous mode. We therefore have to face the hyperparameter tuning problem of large-batch training [26,27]: with the number of epochs unchanged, enlarging the batch size reduces the number of effective parameter updates, which may lead to a worse training result.

We used the principle of the Linear Scaling Rule [28] to guide the adjustment of the learning rate: if the training batch size is N times that of the PS/Worker mode, the learning rate is also enlarged by N times. This approach is simple and easy to apply, and works well in practice. Of course, note that if the learning rate of the original training setup was already aggressive, the scaling applied for large-batch training should be reduced appropriately, or more sophisticated strategies such as Warmup should be used [29]. We will further explore hyperparameter optimization for large-batch training in follow-up work.
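
A small numerical sketch of this rule, with warmup, is shown below; all of the numbers are illustrative, not the values used in production.

```python
base_lr, base_batch = 0.001, 1024      # hyperparameters of the old PS/Worker job (illustrative)
sync_batch = 8 * 4096                  # 8 cards x per-card batch size (illustrative)
n = sync_batch / base_batch            # batch size grew by a factor of n

def learning_rate(step, warmup_steps=1000):
    target = base_lr * n               # Linear Scaling Rule: scale the learning rate by n
    if step < warmup_steps:            # optional linear warmup from the base rate to the target
        return base_lr + (target - base_lr) * step / warmup_steps
    return target
```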

6 Summary and Outlook

In the training scenario of Meituan's recommendation systems, as models become more and more complex, the marginal benefit of CPU optimization gets lower and lower. Based on its internally deeply customized TensorFlow and NVIDIA HugeCTR, Meituan developed the Booster GPU training architecture. The overall design fully considers the characteristics of the algorithms, the architecture and the new hardware, and is deeply optimized from the data, computation and communication perspectives; compared with the previous CPU tasks, the cost performance has increased to 2~4 times. In terms of functionality and completeness, it supports all kinds of TensorFlow training interfaces (Train/Evaluate/Predict, etc.) and supports converting models between the CPU and GPU architectures. In terms of ease of use, a TensorFlow CPU task needs only one line of code to migrate to the GPU architecture. It has now reached large-scale production use in the Meituan Waimai recommendation scenario, and in the future we will promote it fully within the Daojia Search & Recommendation Technology Department and across all of Meituan's business lines.

Booster still has a lot of room for optimization, such as sample compression, serialization and feature parsing at the data layer; multi-graph operator scheduling and dynamic-shape operator compilation at the computation layer; and quantized communication at the communication layer. The next version of Booster will also support larger models and multi-machine multi-card GPU training to serve a wider range of business models within Meituan.

7 About the Authors

Jia Heng, Guoqing, Zheng Shao, Xiaoguang, Peng Peng, Yongyu, Junwen, Zhengyang, Ruidong, Xiangyu, Xiufeng, Wang Qing, Feng Yu, Shi Feng, Huang Jun, et al. are from the Booster joint project team of the Machine Learning Platform Training Engine of Meituan's Basic R&D Platform and the Search & Recommendation Technology Department of the Daojia R&D Platform.

8 References

  • [1] tech.meituan.com/2021/12/09/…
  • [2] images.nvidia.cn/aem-dam/en-…
  • [3] www.usenix.org/system/file…
  • [4] github.com/NVIDIA-Merl…
  • [5] en.wikipedia.org/wiki/Nvidia…
  • [6] www.nvidia.com/en-us/data-…
  • [7] github.com/horovod/hor…
  • [8] github.com/NVIDIA/nccl
  • [9] www.tensorflow.org/xla
  • [10] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1989.
  • [11] Kenji Suzuki, Isao Horiba, and Noboru Sugie. A simple neural network pruning algorithm with application to filter synthesis. Neural Processing Letters, 13(1):43–53, 2001.
  • [12] Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In BMVC, pp. 31.1–31.12. BMVA Press, 2015.
  • [13] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • [14] Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. Compositional embeddings using complementary partitions for memory-efficient recommendation systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 165-175. 2020.
  • [15] mp.weixin.qq.com/s/fOA_u3TYe…
  • [16] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170 (2018).
  • [17] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161-1170. 2019.
  • [18] Guorui Zhou, Weijie Bian, Kailun Wu, Lejian Ren, Qi Pi, Yujing Zhang, Can Xiao et al. CAN: revisiting feature co-action for click-through rate prediction. arXiv preprint arXiv:2011.05625 (2020).
  • [19] Chun-Hao Chang, Ladislav Rampasek, and Anna Goldenberg. Dropout feature ranking for deep learning models. arXiv preprint arXiv:1712.08645 (2017).
  • [20] Xu Ma, Pengjie Wang, Hui Zhao, Shaoguo Liu, Chuhan Zhao, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. Towards a Better Tradeoff between Effectiveness and Efficiency in Pre-Ranking: A Learnable Feature Selection based Approach. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2036-2040. 2021.
  • [21] Bencheng Yan, Pengjie Wang, Jinquan Liu, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. Binary Code based Hash Embedding for Web-scale Applications. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3563-3567. 2021.
  • [22] Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. Autodim: Field-aware embedding dimension searchin recommender systems. In Proceedings of the Web Conference 2021, pp. 3015-3022. 2021.
  • [23] Bencheng Yan, Pengjie Wang, Kai Zhang, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3568-3572. 2021.
  • [24] Ting Chen, Lala Li, and Yizhou Sun. Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pp. 1617-1626. PMLR, 2020.
  • [25] Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, and Ed H. Chi. Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. In Companion Proceedings of the Web Conference 2020, pp. 562-566. 2020.
  • [26] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).
  • [27] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems 30 (2017).
  • [28] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
  • [29] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6181-6189. 2018.


This article was produced by the Meituan technical team and its copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit "Content reprinted from the Meituan technical team". The article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to apply for authorization.