Angel is a high-performance distributed machine learning platform developed by Tencent that supports machine learning, deep learning, graph computing, federated learning, and other scenarios. Angel's deep learning platform has been applied in many scenarios inside Tencent. Today I will introduce the application practice of Angel's deep learning in Tencent's advertising recommendation system, focusing on the following points:

  • Angel machine learning platform

  • Advertising recommendation system and model

  • Model training and optimization

  • The optimization effect

01 Angel machine learning platform

1. Angel machine learning platform architecture

The Angel machine learning platform is a high-performance distributed machine learning platform developed by Tencent on the traditional Parameter Server architecture, as shown in Figure 1; the detailed architecture is shown in Figure 2. It is a full-stack machine learning platform that supports feature engineering, model training, model serving, parameter tuning, etc., as well as machine learning, deep learning, graph computing, federated learning, and other scenarios. It has been applied in various business scenarios such as Tencent's internal advertising, finance, and social networking, and has attracted users and developers from more than 100 external companies, including Huawei, Sina, and Xiaomi.

Fig1 Angel machine learning platform

Fig2 Angel machine learning platform architecture diagram

The design of the Angel machine learning platform takes many concerns into consideration. First is ease of use: the platform's programming interface is simple and can be picked up quickly. Next is scalability: Angel provides the PsFun interface, through which users can inherit specific classes to implement custom parameter update logic, custom data formats, and model sharding. Then comes flexibility: Angel implements two modes, ANGEL_PS_WORKER and ANGEL_PS_SERVICE. In ANGEL_PS_WORKER mode, model training and inference are completed by Angel's own PS and Workers, and the focus is on speed. In ANGEL_PS_SERVICE mode, Angel only starts the Master and PS, while other computing platforms (such as Spark and TensorFlow) are responsible for the actual computation; Angel only provides the Parameter Server function. This mode focuses on ecosystem integration and expands the ecological niche of the Angel platform. For communication, Angel supports BSP, SSP, ASP, and other synchronization protocols, which can meet the requirements of various complicated real-world communication environments. Finally, there is stability. For PS fault tolerance, Angel adopts a checkpoint mode: it periodically writes the parameters held by PS to a distributed storage system. For Worker fault tolerance, if a Worker hangs, the Master restarts a Worker instance, which retrieves from the Master the parameter iteration information at the time of failure. Angel's Master task information is also periodically stored in the distributed storage system; if the Master fails, a new Master is pulled up through Yarn's restart mechanism, and the stored information is loaded to resume the task from the previous breakpoint. Angel also has a slow-worker detection mechanism: if a Worker runs too slowly, its tasks are rescheduled to other Workers.
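To make the PsFun extension point above concrete, here is a minimal sketch of a user-defined parameter update function that a PS shard might apply when gradients arrive. The class and method names are purely illustrative, not Angel's actual (Java) API:

```python
# Hypothetical sketch of a PsFun-style extension point: a user-defined
# update function the parameter server applies when gradients are pushed.
# Names here are illustrative, not Angel's real interface.

class PsUpdateFunc:
    """Base class: subclass to customize how PS merges pushed gradients."""
    def update(self, param, grad):
        raise NotImplementedError

class ScaledSGDUpdate(PsUpdateFunc):
    """Example custom logic: plain SGD with a configurable learning rate."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, param, grad):
        return [p - self.lr * g for p, g in zip(param, grad)]

# A PS shard would invoke the function on its slice of the model:
func = ScaledSGDUpdate(lr=0.1)
shard = [1.0, 2.0, 3.0]
pushed_grad = [0.5, 0.5, 0.5]
shard = func.update(shard, pushed_grad)
print(shard)  # approximately [0.95, 1.95, 2.95]
```

The same hook is where custom model sharding or data formats would plug in, per the PsFun description above.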

Submitting and executing a task on the Angel machine learning platform is simple; the specific steps are shown in Figure 3. After entering the Client, a PS instance is started, which loads the model from the Client. The Client then starts multiple Workers, which load the training data and begin training; parameters are pulled and updated via push and pull operations, and the model is stored to the specified path after training.

Fig3 Angel machine learning platform submission execution flowchart

The code structure of the Angel machine learning platform is highly abstracted and highly scalable. The whole structure is divided into four layers, as shown in Figure 4. Angel-Core is the base layer, mainly including PSAgent, PSServer, Worker, Network, and Storage. The machine learning layer (Angel-ML) provides basic data types and methods, and users can plug in private models with their own update methods defined via PsFun. The interface layer (Angel-Client) is pluggable and supports a variety of uses, such as connecting to TensorFlow and PyTorch. The algorithm layer (Angel-MLlib) provides encapsulated algorithms such as GBDT and SVM.

Fig4 Angel machine learning code structure

2. Expansion and application of Angel machine learning platform in the direction of deep learning

There are two commonly used distributed computing paradigms for deep learning, namely MPI AllReduce (a communication mode based on the message model) and Parameter Server, as shown in Figure 5. Both paradigms are implemented on the Angel platform. The implementation of the Parameter Server paradigm is shown in Figure 6. An Angel Worker can access common deep learning OPs of PyTorch or TensorFlow through a native C++ API. At the beginning of training, Angel PS pushes the model to each Worker, and the Worker loads the model into the corresponding OP for training. After each training step, the gradient information is sent back to PS for fusion, and the updated parameters are computed by the optimizer and then distributed to each Worker. This process repeats until training ends, and the model is finally saved to the specified path. Angel PS provides a gradient PS controller that connects multiple distributed Workers, and each Worker can run instances of common deep learning frameworks. We have completed the PyTorch version of this solution and have open-sourced it as PyTorch on Angel. The other paradigm is MPI AllReduce, shown in Figure 7, in which gradient information is fused via AllReduce. In this implementation, Angel PS acts as a process controller that pulls up a process on each Worker; the process can be PyTorch or TensorFlow. This paradigm is less intrusive to users, and algorithms developed by users can be connected to the Angel platform for training without much modification.
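The Parameter Server loop described above (PS pushes the model, workers compute local gradients, PS fuses them and applies the optimizer) can be sketched in a single process. All names are illustrative; real Angel Workers would run PyTorch or TensorFlow OPs rather than this toy gradient:

```python
# Minimal single-process sketch of the Parameter Server paradigm:
# 1) PS pushes the model to each worker,
# 2) each worker computes a gradient on its data shard,
# 3) PS averages (fuses) the gradients and applies an SGD step.

def worker_gradient(model, data_batch):
    # Toy gradient for least squares on y = w*x: dL/dw = 2*x*(w*x - y)
    w = model[0]
    return [sum(2 * x * (w * x - y) for x, y in data_batch) / len(data_batch)]

def ps_train(model, shards, lr=0.05, steps=50):
    for _ in range(steps):
        # PS pushes the current model to every worker (here: a list copy)
        grads = [worker_gradient(list(model), shard) for shard in shards]
        # PS fuses gradients by averaging, then applies the optimizer
        fused = [sum(g[i] for g in grads) / len(grads) for i in range(len(model))]
        model = [m - lr * g for m, g in zip(model, fused)]
    return model

# Two workers, each holding a shard of (x, y) pairs drawn from y = 3*x
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
model = ps_train([0.0], shards)
print(model)  # converges toward w = 3.0
```

The AllReduce paradigm differs only in step 3: workers fuse gradients among themselves instead of sending them to a central PS.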

The MPI paradigm

The Parameter Server paradigm

Fig5 Two common paradigms of distributed computing in the field of deep learning

Fig6 Implementation of the Parameter Server paradigm on Angel

Fig7 Implementation of Allreduce paradigm on Angel

02 Advertising recommendation system and model

1. Tencent’s advertising recommendation system

Tencent's big data platform is shown in Figure 8. Online business data, such as from WeChat and games, is transferred to the middle platform through a real-time message middleware system. The middle platform includes real-time computing, offline computing, a scheduling system, and distributed storage; it performs both real-time and offline computation on the data, and applications obtain the data they need from the message middleware.

Fig8 Tencent Big data platform

The layered structure of the recommendation system for Tencent's advertising business is shown in Figure 9. When a user sends a request, the system pulls the user's portrait features and performs a preliminary ranking and scoring of the ads in the ad library; after scoring, it extracts the user's feature information and, within the range of the top-ranked candidate ad IDs, performs a fine ranking. When that is done, the ad is delivered to the user. The whole recommendation system faces the following major challenges. The first is diversified data sources: there is both online streaming data and historical landed data. The second is diversified data formats, including user information, item information, click-through rates, and image data. Then there is incremental data: user requests are frequent, and the ad library is constantly updated. Finally, training tasks are diversified: the whole recommendation system involves coarse ranking, fine ranking, image detection, OCR, and other tasks. To solve the above problems, we developed a software framework called "Zhiling" (based on TensorFlow) to meet the training needs.

Fig9 Tencent advertising recommendation system

The framework structure of "Zhiling" is shown in Figure 10. The lowest layer is a C++ core that encapsulates the MQ receiver and some OP classes of the deep learning framework, the most typical being TensorFlow's dataset class; by encapsulating it, the framework provides the ability to obtain data directly from the MQ. Data abstraction and processing are done in C++ and Python. Above that is the deep learning framework (TensorFlow) layer, which provides various deep learning libraries. At the top are the specific application models, such as DSSM, VLAD, and some image algorithm models. The "Zhiling" framework has the advantages of complete algorithm encapsulation, fast development of new models, good isolation and decoupling between data and algorithms, convenient modification and renewal of preprocessing logic, and good compatibility. At the same time, it has shortcomings: it requires intrusive modification of the TensorFlow framework, single-card performance is poor, multi-machine distributed training is not supported, and optimization at the algorithm and OP level is incomplete. Figure 11 is the training flow chart of "Zhiling" on basic data. As the figure shows, data is read from the message middleware into a local DataQueue, which distributes batch data to the model replica on each GPU for training. After each step, gradients are read back into the CPU for fusion and backup and then distributed to each GPU for the next step. This is a single-machine design in which the CPU implements gradient fusion and the optimizer, so CPU usage is very high; the design is unreasonable, and we have done a lot of optimization for this situation, which will be introduced later.
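The MQ-to-DataQueue path described above can be sketched with a producer/consumer pair; in the real framework this is done in C++ by wrapping TensorFlow's dataset class, so the names below are only illustrative:

```python
# Sketch of the "Zhiling" data path: a receiver thread pulls batches from
# a message queue (stood in for here by queue.Queue) into a local
# DataQueue, from which training steps consume.
import queue
import threading

mq = queue.Queue()          # stands in for the MQ connection
data_queue = queue.Queue()  # local DataQueue feeding the trainer

def receiver():
    while True:
        batch = mq.get()
        if batch is None:   # sentinel: upstream closed
            data_queue.put(None)
            break
        data_queue.put(batch)

threading.Thread(target=receiver, daemon=True).start()

for i in range(3):
    mq.put([i, i + 1])      # producer side: raw batches arrive from MQ
mq.put(None)

consumed = []
while (batch := data_queue.get()) is not None:
    consumed.append(batch)  # the trainer would run one step per batch
print(consumed)  # [[0, 1], [1, 2], [2, 3]]
```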

Fig10 "Zhiling" framework structure

Fig11 "Zhiling" training flow chart on basic data

2. The model in Tencent’s advertising recommendation system

The DSSM deep structured semantic model is shown in Figure 12. Here we use this model to calculate the correlation between users and recommendation IDs, and on that basis calculate the click-through rate of users for a given recommendation ID. The calculation formulas for the correlation and the click-through rate are as follows:

The DSSM model is relatively simple: the Query ID and Item ID are each represented as low-dimensional semantic vectors, and the distance between the two semantic vectors is then calculated using cosine similarity. The model thus computes the correlation between Query and Item, and the item with the highest score is the one we recommend. The DSSM model in the advertising recommendation system should support the following new requirements:

  • ID-class features with dimensionality at the hundred-million level;
  • Fast change: about 25% of items are new each week, so incremental training must be supported.

Fig12 DSSM model
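The relevance score just described, cosine similarity between the two tower outputs followed by a softmax over candidate items, can be illustrated in a few lines. This is a pure-Python sketch; real query and item towers would be DNNs, and the smoothing factor `gamma` is an assumed parameter:

```python
# Minimal sketch of DSSM scoring: relevance is the cosine similarity of the
# query and item semantic vectors; a softmax over candidates gives a
# click probability.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax_over_items(query_vec, item_vecs, gamma=1.0):
    # gamma is a smoothing factor (an assumption here, commonly used in DSSM)
    scores = [gamma * cosine(query_vec, v) for v in item_vecs]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

q = [1.0, 0.0]
items = [[1.0, 0.0], [0.0, 1.0]]  # first item is aligned with the query
probs = softmax_over_items(q, items)
print(probs[0] > probs[1])  # True: the aligned item scores higher
```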

VLAD/NetVLAD/NeXtVLAD and similar models are mainly used to judge the distance relationship between two advertisements. Traditional VLAD can be understood as a clustering model, and its vector calculation formula is as follows:

NeXtVLAD, shown in Figure 13, obtains a better distance effect by changing the hard assignment function a_k into a differentiable function. The vector calculation formula of NeXtVLAD is as follows:

where:

Fig13 NeXtVLAD model
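The aggregation idea behind these models can be illustrated with a toy sketch: classic VLAD hard-assigns each descriptor to its nearest cluster center and accumulates the residuals, while NetVLAD/NeXtVLAD replace the hard assignment with a differentiable softmax weight. Sizes and values below are purely illustrative:

```python
# Toy sketch of VLAD aggregation with an optional soft (differentiable)
# assignment in the NetVLAD style.
import math

def vlad(descriptors, centers, soft=False):
    K, D = len(centers), len(centers[0])
    V = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        if soft:  # differentiable assignment: softmax of negative distance
            w = [math.exp(-d) for d in d2]
            s = sum(w)
            a = [wi / s for wi in w]
        else:     # classic VLAD: hard assignment to the nearest center
            a = [0.0] * K
            a[d2.index(min(d2))] = 1.0
        for k in range(K):
            for j in range(D):
                V[k][j] += a[k] * (x[j] - centers[k][j])  # residual x - c_k
    return V

descs = [[1.1, 0.0], [0.0, 0.9]]
cents = [[1.0, 0.0], [0.0, 1.0]]
V = vlad(descs, cents)
print(V)  # per-center accumulated residuals
```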

Figure 14 shows the YOLO V3 image processing model, which is applied at the front end of the OCR business for preliminary detection. Its characteristic is a large image input size (from 608×608 up to 1024×1024), so the Loss part of the YOLO model accounts for a large share of the computation.

Fig14 YOLO V3 model

03 Model training and optimization

1. Data flow optimization

From the previous introduction we know that the "Zhiling" framework uses a single-pipeline data flow. We optimized it into a multi-pipeline design, as shown in Figure 15, that is, solving the single-machine IO bottleneck with multiple machines and multiple data pipelines. In the original single-pipeline design there is one DataQueue; if the data flow is very large, it puts great pressure on IO. After the multi-pipeline optimization, a DataQueue is defined for each GPU training process, and this distributed approach effectively resolves the IO bottleneck. Administration in this case is handled by the Angel PS (AllReduce version) process controller.
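The per-GPU DataQueue layout can be sketched as follows; the round-robin sharding and the names are illustrative assumptions, not the framework's actual routing policy:

```python
# Sketch of the multi-pipeline layout: instead of one shared DataQueue,
# each training process (one per GPU) owns its own queue fed by its own
# reader, so readers never contend on a single bottlenecked queue.
import queue

NUM_GPUS = 4
pipelines = [queue.Queue() for _ in range(NUM_GPUS)]  # one DataQueue per GPU

# Illustrative round-robin sharding: batch i goes to pipeline i % NUM_GPUS
for i in range(8):
    pipelines[i % NUM_GPUS].put(f"batch-{i}")

# Each GPU training process drains only its own queue
per_gpu = []
for g in range(NUM_GPUS):
    items = []
    while not pipelines[g].empty():
        items.append(pipelines[g].get())
    per_gpu.append(items)
print(per_gpu[0])  # ['batch-0', 'batch-4']
```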

Fig15 "Zhiling" multi-pipeline structure

2. Embedding calculation optimization

In the original implementation, SparseFillEmptyRows was applied before the embedding lookup, which inserted a default string value for every empty row and added a huge number of string operations. If SparseFillEmptyRows is instead applied after the embedding lookup, empty rows are filled with a default embedding value. Removing the time-consuming string operations (at the million level) saves CPU and improves QPS; this optimization improved single-card performance by about 6%.
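The reordering can be illustrated without TensorFlow (the real code uses TensorFlow's sparse ops): filling empty rows with a default string key before the lookup forces string work per empty row, whereas filling with a default vector after the lookup skips it entirely. A pure-Python sketch with an assumed toy embedding table:

```python
# Pure-Python illustration of "fill empty rows AFTER the embedding lookup":
# empty rows receive a default embedding vector directly, so no default
# string key ever needs to be hashed or looked up.
table = {"a": [1.0], "b": [2.0]}  # toy embedding table (assumption)
DEFAULT_VEC = [0.0]

def lookup_then_fill(rows):
    out = []
    for r in rows:
        if not r:                  # empty row: fill with the default vector
            out.append(DEFAULT_VEC)
        else:                      # non-empty row: normal embedding lookup
            out.append(table[r[0]])
    return out

rows = [["a"], [], ["b"]]
print(lookup_then_fill(rows))  # [[1.0], [0.0], [2.0]]
```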

3. Optimization of model algorithm level

The calculation of the Loss in YOLO is heavy, so we carried out a special optimization for it. The YOLO model has three feature map layers. In the traditional search for positive and negative samples, each ground-truth bounding box is compared against the feature map in a full traversal, first horizontally and then vertically, and the feature map point with the maximum IOU against the bounding box is taken as the positive sample. Since the images are large, the feature maps are also large, which makes the Loss calculation time-consuming. Our Loss optimization works as follows. Since the blocks in the X-axis direction and the blocks in the Y-axis direction are diagonally symmetric, we traverse the IOU of the feature map and the bounding box along the diagonal of the central axis, as shown in Figure 16: first the blocks in the diagonal direction are calculated, and then the blocks on both sides of each diagonal feature map block. This eliminates a lot of computation. In addition, there are computational tricks when traversing the feature map blocks on both sides of a given point: for example, when traversing up and to the right, the X-axis and Y-axis change symmetrically about the diagonal, so we can predict such changes, deliberately find the location of the largest Anchor, and discard the other information. This also greatly reduces the amount of computation. Through the above methods, we optimized the Loss calculation and improved single-card performance by about 10%.
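The positive-sample search above is dominated by repeated IOU computations between ground-truth boxes and feature-map anchor boxes; a minimal IOU helper (boxes as `(x1, y1, x2, y2)` corner pairs) is sketched below. The diagonal-traversal pruning itself is specific to the authors' implementation and is not reproduced here:

```python
# Intersection-over-Union of two axis-aligned boxes given as
# (x1, y1, x2, y2) corner coordinates.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1428
```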

Fig16 Schematic diagram of YOLO V3 optimization

04 The optimization effect

Through optimization at the model level and the data level, as well as putting the Angel platform in charge of the whole control process, the single-card performance of DSSM is improved 33x, VLAD 22x, and YOLO 2.5x, as shown in Figure 17. Figures 18, 19, and 20 show the detailed evaluation results. There are three test modes: online mode, where training data is pulled through TDBank (Tencent's in-house MQ system) and latency includes network transmission and data packing; local mode, where training data is stored on local disk in advance and latency includes disk IO and data mapping/preprocessing; and benchmark mode, where training data is stored in memory and latency only includes data mapping/preprocessing. As Figure 18 shows, benchmark mode, which excludes data-reading latency, can run the whole system almost at full utilization, and the improvement is quasi-linear. Considering that the actual data is read from the MQ, the TPS of one card is not improved as much: one card reaches a TPS above 3000, two cards above 4000, and two machines with two cards above 6000. Thus with more machines the training performance improves linearly, and the test results of the VLAD-CTR model show the same pattern. After the YOLO V3 optimization, the performance of one machine with one card is improved 2.5x, and one machine with eight cards 7.2x.

Fig17 Performance improvement results after optimization

Fig18 Optimized PERFORMANCE improvement results of DSSM-CVR model

Fig19 Performance improvement results of optimized VLAD-CTR model

Fig20 Optimized YOLO V3 model performance improvement results

05 Summary

Today I shared three parts of content. The first part introduced Tencent's Angel machine learning platform and its development and application in the direction of deep learning. The second part introduced the characteristics of Tencent's advertising recommendation system and its commonly used models. The last part introduced the application of Angel's deep learning to model training and optimization in Tencent's advertising recommendation system, and the results obtained.

Author:

Yuechao Guo

Tencent | Applied Researcher

Yuechao Guo graduated from Peking University. His research interests include heterogeneous accelerated computing; the design, development, and optimization of distributed systems; and algorithm optimization for speech/NLP. He is currently mainly responsible for the research and development of new deep learning technologies on the Angel platform and their implementation in Tencent's business scenarios.