Alimama has just released the open-source repository of X-Deep Learning (XDL). Developers can download it on GitHub.

Earlier, at the end of November, Alimama announced this open-source plan, which attracted extensive attention in the industry. XDL breaks through the limitation that most existing open-source deep learning frameworks are designed for low-dimensional dense data such as images and speech: it is deeply optimized for high-dimensional sparse data scenarios and has been applied at large scale in Alimama's business and production systems. This article introduces the design concepts and key technologies of XDL in detail.

Overview

Artificial intelligence technology with deep learning at its core has achieved great success in speech recognition, computer vision, natural language processing and other fields over the past few years. Hardware computing power, represented by GPUs, and excellent open-source deep learning frameworks have played a huge role in driving this progress.

Although open-source frameworks such as TensorFlow, PyTorch and MxNet have achieved great success, when we applied deep learning to large-scale industrial scenarios such as advertising, recommendation and search, we found that these frameworks did not meet our needs well. The gap is that most open-source frameworks are designed for low-dimensional continuous data such as images and speech, while many core Internet application scenarios (such as advertising, recommendation and search) face high-dimensional, sparse, discrete, heterogeneous data, with parameter scales often reaching tens or even hundreds of billions. Furthermore, many production applications require real-time training and updating of large-scale deep models, and existing open-source frameworks often cannot meet industrial needs in distributed performance, computing efficiency, horizontal scalability and real-time adaptation.

X-Deep Learning is an industrial-grade deep learning framework designed and optimized for such scenarios. Refined through Alibaba's advertising business, XDL excels in training scale, performance and horizontal scalability, and ships with a large number of industrial-grade algorithm solutions for advertising, recommendation and search.

Core system capabilities

1) Born for high-dimensional sparse data scenarios. XDL supports training large-scale deep models with hundreds of billions of parameters, in both batch learning and online learning modes.

2) Industrial-grade distributed training. XDL supports mixed CPU/GPU scheduling, provides complete distributed disaster-recovery (DR) semantics, scales out well horizontally, and can easily run training jobs with thousands of concurrent workers.

3) Efficient structured compression training. Based on the data characteristics of Internet samples, XDL introduces a structured computing model. Compared with traditional flat (tiled) sample training, it greatly reduces sample storage space, sample IO cost and the absolute amount of training computation; in recommendation scenarios, overall training efficiency can improve by more than 10x.

4) Mature back-end support. Dense network computation on a single machine reuses the capabilities of mature open-source frameworks. With only a small amount of distributed driver code changed, single-machine TensorFlow or MxNet code can run on XDL and gain XDL's distributed training and high-performance sparse computing capabilities.

Built-in industrial-grade algorithm solutions

1) The latest algorithms in CTR prediction, including the Deep Interest Network (DIN), Deep Interest Evolution Network (DIEN) and Cross Media Network (CMN).

2) The Entire Space Multi-Task Model (ESMM) for joint CTR and conversion-rate modeling.

3) The Tree-based Deep Match model (TDM), the latest algorithm in the field of candidate matching and recall.

4) The lightweight, general model compression algorithm Rocket Training.

System design and optimization

XDL-Flow: data flow and distributed runtime

XDL-Flow drives the generation and execution of the entire deep learning computational graph, including the sample pipeline, sparse representation learning and dense network learning. XDL-Flow is also responsible for the storage and exchange control logic of the distributed model, distributed disaster recovery, and other global consistency coordination.

In search, recommendation, advertising and similar scenarios, the sample volume is huge, usually tens to hundreds of terabytes. If the sample pipeline is not well optimized, sample IO easily becomes the bottleneck of the whole system, leaving the computing hardware underutilized. In large-scale sparse scenarios, sample reading is IO-intensive, sparse representation computation is network-communication-intensive (due to parameter exchange), and dense deep computation is compute-intensive.

XDL-Flow adapts to the different performance profiles of these three task types by running the three main stages as an asynchronous, parallel pipeline; in the best case, the latency of the first two stages is completely hidden. We are also working on automatically tuning the parameters of the asynchronous pipeline, including the parallelism and buffer size of each step, so that users need not care about the details of pipeline parallelism.
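As a rough illustration, the three-stage overlap can be sketched with bounded queues between stages (a minimal Python sketch of the general pattern, not XDL's actual implementation):

```python
import queue
import threading

def run_pipeline(batches, io_fn, sparse_fn, dense_fn, buffer_size=4):
    """Overlap sample IO, sparse parameter exchange, and dense compute.

    Each stage runs in its own thread and hands results downstream
    through a bounded queue, so the latency of the first two stages
    is hidden behind the dense computation."""
    q1, q2 = queue.Queue(buffer_size), queue.Queue(buffer_size)
    results = []

    def io_stage():
        for b in batches:
            q1.put(io_fn(b))          # stage 1: sample reading (IO-bound)
        q1.put(None)                  # sentinel: no more batches

    def sparse_stage():
        while (item := q1.get()) is not None:
            q2.put(sparse_fn(item))   # stage 2: parameter exchange (network-bound)
        q2.put(None)

    threads = [threading.Thread(target=io_stage),
               threading.Thread(target=sparse_stage)]
    for t in threads:
        t.start()
    while (item := q2.get()) is not None:
        results.append(dense_fn(item))  # stage 3: dense compute (CPU/GPU-bound)
    for t in threads:
        t.join()
    return results
```

The `buffer_size` argument plays the role of the per-step buffer sizes that XDL aims to tune automatically.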

AMS: efficient model server

AMS is a distributed model storage and exchange subsystem designed and optimized specifically for sparse scenarios. Through extensive software and hardware co-optimization covering small-packet network communication, parameter storage structures, parameter distribution strategies and more, AMS far exceeds traditional parameter servers in throughput and horizontal scalability. AMS also supports built-in deep network computation, so that representation sub-networks can be computed inside AMS itself.

1) AMS heavily optimizes the network communication layer through a combination of hardware and software, including Seastar, DPDK, CPU binding and zero-copy techniques, fully squeezing out hardware performance. In our tests, under large-scale concurrent training the packet throughput for parameter exchange is more than 5 times that of a traditional RPC framework.

2) A built-in dynamic parameter-balancing strategy finds the optimal sparse-parameter distribution at runtime, effectively solving the hot-spot problem caused by non-uniform parameter distribution in traditional parameter servers and greatly improving horizontal scalability under high concurrency.

3) AMS also supports GPU-accelerated sparse embedding computation, which provides good acceleration in large-batch-size scenarios.

4) AMS supports internally defined sub-networks. For example, in the cross-media modeling provided in our algorithm solutions, the image representation sub-network is defined to run inside AMS, greatly reducing redundant computation and network traffic.
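The idea behind dynamic parameter balancing can be sketched with a simple greedy placement (illustrative only; AMS's real strategy is adaptive and runs during training):

```python
def rebalance(key_loads, num_servers):
    """Greedy balancing of sparse-parameter shards: place the hottest
    keys first, each on the currently least-loaded server, so skewed
    key popularity does not create hot-spot servers."""
    loads = [0.0] * num_servers
    placement = {}
    # Sort keys by observed load (e.g. access frequency), hottest first.
    for key, load in sorted(key_loads.items(), key=lambda kv: -kv[1]):
        target = min(range(num_servers), key=loads.__getitem__)
        placement[key] = target
        loads[target] += load
    return placement, loads
```

For example, with two servers and one very hot key, the hot key and the long tail end up on different servers instead of all hashing onto one.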

Backend Engine: bridging the single-machine capabilities of mature frameworks

To take full advantage of existing open-source deep learning frameworks on dense deep networks, XDL uses a bridging technique, employing open-source frameworks (this open-source edition of XDL supports TensorFlow and MxNet) as the back-end computing engine for single-machine dense networks. Users can keep their TensorFlow or MxNet development habits and, with only a small amount of driver code changed, directly obtain XDL's distributed training capability for large-scale sparse computation. In other words, there is no need to learn a new framework language to use XDL. Another benefit is that XDL integrates seamlessly with the existing open-source communities: users can easily extend an open-source model from the TensorFlow community to industrial scenarios through XDL.
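A toy sketch of the bridging pattern follows. All names here (such as `SparseBridge`) are hypothetical illustrations, not XDL's actual API: the point is that the user's dense-network code stays single-machine while a thin driver routes sparse lookups to the distributed store.

```python
class SparseBridge:
    """Stands in for the distributed sparse side (AMS shards)."""
    def __init__(self, remote_tables):
        self.remote_tables = remote_tables  # table name -> {id: vector}

    def embedding(self, table, ids):
        # In a real deployment this would be a batched RPC to the
        # distributed model server, not a local dict lookup.
        return [self.remote_tables[table][i] for i in ids]

def user_dense_net(emb_vectors):
    """Unmodified 'single-machine' model code: here a toy sum pooling,
    in practice an ordinary TensorFlow/MxNet dense network."""
    return sum(sum(v) for v in emb_vectors)

def train_step(bridge, table, ids):
    # The driver glues the two together: distributed sparse lookup,
    # then the user's unchanged dense computation.
    return user_dense_net(bridge.embedding(table, ids))
```

The dense function never sees where the embeddings came from, which is what lets single-machine model code run on a distributed sparse backend with minimal changes.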

Compact Computation: structured computation improves training efficiency

Sample representation in industrial sparse scenarios often shows strong structure, for example user features, commodity features and scene features. This structure means certain features appear repeatedly across samples: a large proportion of user features are identical across the multiple samples belonging to the same user. Structured sample compression exploits this local feature repetition to compress features in storage and computation, saving storage, compute and communication bandwidth.

In the sample pre-processing stage, the features to be aggregated are sorted (for example, sorted by user ID to aggregate user features); in the batching stage, compression is applied at the tensor level; in the computing stage, the compressed features are expanded only at the last layer, which greatly reduces the computation of the deep network. Verification in a recommendation scenario shows that on typical production data, the AUC of models trained on aggregated, sorted samples is consistent with that of fully shuffled samples, while overall performance improves by more than 10x.
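A minimal sketch of the idea, assuming samples grouped by user ID (field names are hypothetical):

```python
from collections import defaultdict

def compress_batch(samples):
    """Group samples sharing a user so the identical user features are
    stored (and later embedded) once per group, not once per sample."""
    groups = defaultdict(list)
    user_feats = {}
    for s in samples:
        groups[s["user_id"]].append(s["item_feat"])
        user_feats[s["user_id"]] = s["user_feat"]
    return [{"user_feat": user_feats[u], "item_feats": items}
            for u, items in groups.items()]

def forward(batch, embed_user, embed_item):
    """The user-side representation is computed once per group and only
    expanded (paired with each item) at the last layer."""
    out = []
    for g in batch:
        u = embed_user(g["user_feat"])            # computed once per user
        out.extend(u + embed_item(i) for i in g["item_feats"])
    return out
```

With many samples per user, the user-side storage, IO and computation shrink roughly by the average group size, which is where the reported 10x-plus speedup comes from.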

Online Learning: large-scale online learning

In recent years, online learning has been widely applied in industry. It is a deep combination of engineering and algorithms that gives the model the ability to capture online traffic changes in real time, which is highly valuable in scenarios demanding high timeliness. For example, during e-commerce promotions, online learning can capture changes in user behavior in near real time and significantly improve the model's real-time effectiveness. XDL provides a complete online learning solution: it supports continuous real-time training on top of a full model, reads samples from a real-time message queue (with built-in support for Kafka and other message sources), and lets users control the model write-out cycle. In addition, to avoid unbounded model growth caused by an unlimited influx of new features, XDL has built-in automatic selection of real-time features and elimination of expired features, keeping online learning simple for users.
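Such a streaming training loop with a user-controlled write-out cycle can be sketched as follows (a Kafka consumer would plug in as `message_source`; all names are illustrative, not XDL's API):

```python
import time

def online_learning_loop(message_source, train_step, write_model,
                         write_interval_s=1.0, max_batches=None):
    """Continuously train from a real-time sample stream, writing the
    model out on a configured cycle rather than after every batch."""
    last_write = time.monotonic()
    seen = 0
    for batch in message_source:     # e.g. a Kafka consumer iterator
        train_step(batch)
        seen += 1
        now = time.monotonic()
        if now - last_write >= write_interval_s:
            write_model()            # periodic model write-out
            last_write = now
        if max_batches is not None and seen >= max_batches:
            break                    # bound the loop (for tests/demos)
    return seen
```

The write cycle decouples training throughput from model-serving update frequency, which is exactly the control knob the paragraph above describes.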

1) ID-free sparse feature learning. Traditional machine learning frameworks generally require sparse features to be re-encoded as IDs (compact codes starting from 0) to keep training efficient. XDL allows training directly on the original features, greatly simplifying feature engineering and increasing the efficiency of the full data pipeline. This is especially valuable in real-time online learning scenarios.

2) Real-time feature frequency control. Users can set a threshold for feature admission, for example admitting into training only features that appear more than N times. The system uses a probabilistic admission algorithm to select features, which greatly reduces the space occupied in the model by ineffective, ultra-low-frequency features.

3) Elimination of expired features. During long-running online learning, users can also enable expired-feature elimination, and the system automatically evicts weakly influential feature parameters that have not been touched for a long time.
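The last two mechanisms (probabilistic admission and expiry eviction) can be sketched together in a toy feature table; this is an illustration of the idea, not XDL's implementation:

```python
import random

class FeatureTable:
    """Toy sketch of two online-learning guards: probabilistic admission
    of new (low-frequency) features and eviction of features untouched
    for longer than a time-to-live."""
    def __init__(self, admit_prob=1.0, ttl=100, rng=None):
        self.params = {}        # feature id -> parameter value
        self.last_touch = {}    # feature id -> step of last update
        self.admit_prob = admit_prob
        self.ttl = ttl
        self.rng = rng or random.Random(0)

    def touch(self, feat, step):
        if feat not in self.params:
            if self.rng.random() >= self.admit_prob:
                return False    # probabilistically drop the new feature
            self.params[feat] = 0.0
        self.last_touch[feat] = step
        return True

    def evict_expired(self, step):
        expired = [f for f, t in self.last_touch.items()
                   if step - t > self.ttl]
        for f in expired:
            del self.params[f], self.last_touch[f]
        return expired
```

Admission probability and TTL correspond to the user-configurable frequency threshold and expiry window described above.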

X-Deep Learning algorithm solutions

Typical click-through rate (CTR) prediction models

DIN (Deep Interest Network)

Traditional Embedding&MLP models do little for user representation: a user's historical behaviors are projected into a fixed-length vector space by an embedding mechanism, and a fixed-length user vector is obtained by sum or average pooling. But user interests are diverse, and it is very difficult to express a user's different interests with one fixed vector. In fact, a user's interests manifest differently when facing different commodities; only the interests related to the candidate commodity affect the user's decision.

Therefore, when estimating the click-through rate for a specific product, we only need to express the interests associated with that product. In DIN we propose an interest activation mechanism that uses the candidate item to activate the relevant parts of the user's historical behavior, capturing the user's interest with respect to that specific item.

Address: arxiv.org/abs/1706.06…
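A toy sketch of the interest activation idea, using a dot product in place of the paper's small activation MLP:

```python
def din_attention(behavior_embs, candidate_emb):
    """DIN-style interest activation: weight each historical behavior
    by its relevance to the candidate item, then sum-pool the weighted
    behaviors into a candidate-specific user vector.

    A dot product stands in for DIN's activation unit (a small MLP);
    note DIN deliberately does not softmax-normalize the weights."""
    dim = len(candidate_emb)
    weights = [sum(b * c for b, c in zip(beh, candidate_emb))
               for beh in behavior_embs]
    return [sum(w * beh[d] for w, beh in zip(weights, behavior_embs))
            for d in range(dim)]
```

Behaviors orthogonal to the candidate get zero weight and contribute nothing, which is exactly the "only related interests affect the decision" intuition above.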

DIEN (Deep Interest Evolution Network)

DIEN mainly solves two problems: interest extraction and interest evolution. For interest extraction, traditional algorithms directly treat the user's historical behavior as the user's interest, and the supervision signal in the whole modeling process is focused on advertisement click samples. However, an ad click sample only reflects the user's interest at the moment of deciding whether to click; it is hard to model the user's interest at every behavioral moment in history.

For interest extraction, the paper proposes an auxiliary loss module that constrains the model's hidden representation at each moment of user behavior to predict the subsequent behavior, so that the hidden representation better reflects the user's interest at each behavioral moment. After the interest extraction module, we propose the interest evolution module. Traditional RNN-style methods can only model a single sequence, but a user's different interests actually follow different evolution processes. In this paper we propose the Activation Unit GRU (AUGRU), which makes the GRU's update gate depend on the candidate item. When modeling interest evolution, AUGRU builds different interest evolution paths for different prediction targets and infers the user's interest related to the candidate item.

Address: arxiv.org/abs/1809.03…
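A scalar sketch of one AUGRU step (the real cell is vector-valued and includes a reset gate, both omitted here for brevity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def augru_step(h, x, att, Wu, Wc):
    """One AUGRU step in scalar form: the attention score `att` for the
    candidate item rescales the GRU update gate, so behaviors that are
    irrelevant to the candidate barely move the hidden state."""
    u = sigmoid(Wu * x) * att     # attention-scaled update gate
    c = math.tanh(Wc * x)         # candidate state (reset gate omitted)
    return (1.0 - u) * h + u * c
```

With `att = 0` the hidden state passes through unchanged; with `att` near 1 the cell behaves like a standard GRU update, which is how different candidates carve out different evolution paths through the same behavior sequence.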

CMN (Cross Media Network)

CMN introduces additional modalities, such as image information, into the CTR prediction model. On top of the original ID features, image visual features are added and the advertising CTR model is trained jointly, achieving significant improvement on Alimama's large-scale data. CMN has several technical features: first, the image content feature extraction model and the main model are jointly trained and optimized; second, image information is used to represent both ads and users, where the user representation uses the images corresponding to the user's historical behavior; third, to handle the massive image data involved in training, the "Advanced Model Server" computing paradigm is proposed to effectively reduce computation, communication and storage load during training. CMN is not limited to images: text, video and other content features can be processed with the same model given appropriate feature extraction.

Address: arxiv.org/abs/1711.06…

Typical conversion rate (CVR) prediction model

Entire Space Multi-Task Model (ESMM)

The Entire Space Multi-Task Model (ESMM) is a new multi-task joint training paradigm developed by Alimama. ESMM first proposed learning CVR indirectly through the auxiliary tasks of learning CTR and CTCVR, and models user behavior sequence data over the complete sample space, avoiding the sample selection bias and training data sparsity problems that traditional CVR models often encounter, with remarkable results.

ESMM can easily be extended to predict user behaviors with sequential dependence (browse, click, add to cart, buy, etc.) to build a full-link multi-target prediction model. The base sub-network in ESMM can be replaced by any learning model, so the ESMM framework can easily integrate with other models to absorb their strengths and further improve learning, leaving huge room for extension.

Address: arxiv.org/abs/1804.07…
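The core of ESMM can be sketched as a loss over the entire impression space, with supervision placed on pCTR and on pCTCVR = pCTR × pCVR only, so the CVR tower is learned indirectly and never sees a click-biased sample selection:

```python
import math

def esmm_loss(p_ctr, p_cvr, click, buy, eps=1e-7):
    """Per-impression ESMM loss sketch. `p_ctr` and `p_cvr` are the two
    towers' outputs; `click` and `buy` are 0/1 labels defined over all
    impressions. Only pCTR and pCTCVR receive direct supervision."""
    p_ctcvr = p_ctr * p_cvr
    loss_ctr = -(click * math.log(p_ctr + eps)
                 + (1 - click) * math.log(1 - p_ctr + eps))
    loss_ctcvr = -(buy * math.log(p_ctcvr + eps)
                   + (1 - buy) * math.log(1 - p_ctcvr + eps))
    return loss_ctr + loss_ctcvr
```

Because both loss terms are computed over every impression (not just clicked ones), the CVR estimate is trained on the same distribution it is served on.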

Typical matching and recall model

TDM (Tree-based Deep Match)

TDM independently proposed a complete tree-based framework for deep learning recommendation matching, achieving efficient full-corpus retrieval by building a hierarchical tree structure over user interests. On this basis, advanced computing structures such as attention can be introduced to strengthen the deep model. Compared with traditional recommendation methods, accuracy, recall and novelty are all significantly improved.

Furthermore, TDM designs and implements a complete joint training and iteration framework of initial tree, model training, tree reconstruction and model retraining, which further improves effectiveness. Joint training endows the TDM algorithm framework with better generality and provides a good theoretical basis and strong engineering feasibility for transferring and extending TDM to new scenes and fields.

Address: arxiv.org/abs/1801.02…
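The retrieval side of TDM can be sketched as a layer-wise beam search over the interest tree (the `score` function stands in for the learned user-node preference model):

```python
def tdm_retrieve(tree_children, score, root=0, beam=2):
    """Layer-wise beam search over an interest tree: at each level keep
    the `beam` highest-scoring nodes and expand only their children, so
    retrieval cost is logarithmic in the item corpus, not linear."""
    frontier = [root]
    while True:
        children = [c for n in frontier
                    for c in tree_children.get(n, [])]
        if not children:            # frontier nodes are leaves (items)
            return sorted(frontier, key=score, reverse=True)[:beam]
        frontier = sorted(children, key=score, reverse=True)[:beam]
```

With a corpus of N items, only O(beam × log N) nodes are ever scored, which is what makes full-corpus retrieval with a complex deep preference model feasible.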

Typical model compression algorithm

Rocket Training

Real-time inference of online models in industry imposes strict requirements on response time, which limits model complexity to a certain extent; limiting model complexity may in turn reduce the model's learning ability and thus its effectiveness.

At present there are two lines of thinking for solving this problem. The first is, with the model structure and parameters fixed, to apply numerical compression to reduce inference time, or to design more streamlined models and computation schemes, such as MobileNet and ShuffleNet.

The second is to use a complex model to assist the training of a simplified model, and to use the learned small model for inference at test time. The two approaches do not conflict; in most cases the second can be further accelerated by the first. Moreover, compared with the strict online response-time budget, offline training time is relatively abundant, so we can afford to train a complex model. Rocket Training follows the second line of thinking. It is lightweight, elegant and highly general; model complexity can be tailored to system capability, providing a means of "stepless speed regulation". In Alimama's production practice, Rocket Training greatly saves online computing resources and significantly improves the system's ability to handle traffic peaks such as Double Eleven.

Address: arxiv.org/abs/1708.04…
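The joint training objective can be sketched as follows (a scalar toy; the paper additionally uses gradient blocking so the hint term only updates the light net, which is omitted here):

```python
import math

def rocket_losses(light_logit, booster_logit, label):
    """Rocket Training sketch: the light net and the booster (complex)
    net are trained jointly on the same label, and the light net also
    imitates the booster's logit through a hint term. Only the light
    net is kept for online inference."""
    def bce(logit, y):
        p = 1.0 / (1.0 + math.exp(-logit))
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))
    hint = (light_logit - booster_logit) ** 2   # imitation term
    return bce(light_logit, label) + bce(booster_logit, label) + hint
```

Because the booster is discarded after training, its complexity is bounded only by offline training capacity, not by the online response-time budget.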

Benchmark

We provide some benchmark data for reference, focusing on XDL's training performance and horizontal scalability in large-batch and small-batch scenarios, and on the speedup brought by structured compression training.

Deep CTR model trained on CPUs

The model structure is a sparse embedding DNN: N groups of sparse features are embedded separately, and several NFM-style features are obtained through BiInteraction. The total scale of sparse features is about 1 billion (corresponding to 10 billion parameters) to 10 billion (corresponding to 100 billion parameters), the dense part has several hundred dimensions, and the number of sparse feature IDs per sample is about 100+/300+.

Training configuration: batch size = 100, asynchronous SGD.

As the benchmark results show, XDL has a clear advantage in high-dimensional sparse scenarios and maintains good linear scalability under considerable concurrency.



Deep CTR model trained on GPUs



The original article was published on December 21, 2018

Author: XDL

This article is from the Alibaba Cloud community partner "Ali Technology"; you can follow "Ali Technology" for more.