Introduction: After six years of development and the efforts of many teams, DeepRec, Alibaba Group's large-scale sparse model training/prediction engine, has been officially open-sourced, helping developers improve the performance and effect of sparse model training.

Source: Alibaba Tech official account


What is DeepRec

DeepRec (PAI-TF) is Alibaba Group's unified engine for large-scale sparse model training and prediction. It is widely used across Taobao, Tmall, Alimama, AutoNavi, AliExpress, Lazada, and other businesses, supporting core scenarios such as search, recommendation, and advertising, and it supports ultra-large-scale sparse training with billions of features and trillions of samples.

DeepRec performs deep optimization for sparse models in terms of distribution, graph optimization, operators, and runtime, and also provides features unique to sparse scenarios.

The DeepRec project has been under development since 2016. It is jointly built by Alibaba Group's AOP team, XDL team, PAI team, RTP team, and Ant Group's AI Infra team, with support from the Taobao recommendation algorithm team and other business algorithm teams. DeepRec has also received support from Intel's CESG software team, Optane team, and PSU team, as well as NVIDIA's GPU computing experts and the Merlin HugeCTR team.

DeepRec architecture design principles

To support large-scale sparse features on the TensorFlow engine, there are many ways to implement a ParameterServer. The most common approach is to implement the ParameterServer and the optimizer independently of TensorFlow and then connect the two modules to TensorFlow through a bridge. This approach has advantages, such as a flexible PS implementation, but it also has limitations.

DeepRec takes a different approach to architecture, following the principle of "treating the training engine as a complete system". TensorFlow is a static-graph training engine with corresponding layers in its architecture: the API layer at the top, the graph optimization layer in the middle, and the operator layer at the bottom. TensorFlow uses this three-layer design to serve the business requirements and performance-optimization requirements of the upper layers.

DeepRec adheres to this design principle: it introduces the EmbeddingVariable functionality at the Graph level based on a storage/computation decoupling design, and realizes operator fusion based on the Graph. In this way, DeepRec lets users use the same optimizer implementations and the same set of EmbeddingVariables in both standalone and distributed scenarios. At the same time, a variety of optimization capabilities are introduced at the Graph level, achieving joint optimizations that an independent-module design cannot.

Three advantages of DeepRec

DeepRec is a sparse model training/prediction engine built on TensorFlow 1.15, Intel-TF, and NV-TF, with deep customized optimization for sparse model scenarios. The optimizations fall into the following three areas:

1. Model Effect

DeepRec provides rich support for sparse features, improving model effect while reducing the size of sparse models, and optimizes the behavior of the Optimizer at very large scale. The main Embedding and Optimizer features are as follows:

  • EmbeddingVariable (dynamic elastic feature):

1) EmbeddingVariable solves the problems of static-shape Variables: the vocabulary_size is hard to predict in advance, hashed features conflict with each other, and memory and IO are wasted on redundant entries. DeepRec also provides rich advanced functionality on top of EmbeddingVariable, such as different feature-admission methods and different feature-eviction strategies, which can noticeably improve the effect of sparse models.

2) In terms of access efficiency, to achieve better performance and a lower memory footprint, the HashTable underlying EmbeddingVariable is designed to be lock-free, with careful memory-layout optimization and a reduced number of HashTable accesses, so that the HashTable is accessed only once in each of the forward and backward passes during training.
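
The snippet below is a minimal sketch of how such a dynamic-shape embedding can be declared and looked up, assuming DeepRec's documented tf.get_embedding_variable API (the key_dtype/initializer argument names are taken from that documentation and may differ across versions); vanilla TensorFlow would instead require a fixed-shape Variable with a predeclared vocabulary_size.

```python
import tensorflow as tf

# Assumed DeepRec API: an EmbeddingVariable whose first dimension grows with
# the ids actually seen, instead of a fixed [vocabulary_size, dim] Variable.
ev = tf.get_embedding_variable(
    "item_id_emb",
    embedding_dim=16,
    key_dtype=tf.int64,
    initializer=tf.truncated_normal_initializer(stddev=0.02))

# Arbitrary 64-bit ids are looked up directly: no vocabulary size is declared
# up front and ids are not hashed into a fixed number of buckets.
ids = tf.constant([10000000001, 42, 987654321], dtype=tf.int64)
emb = tf.nn.embedding_lookup(ev, ids)
```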

  • DynamicDimensionEmbeddingVariable (dynamic elastic dimension):

In typical sparse scenarios, the frequency of features within the same feature column is often extremely uneven, yet features of the same feature column are usually given a uniform embedding dimension. If the dimension is set too high, low-frequency features tend to overfit and consume a large amount of extra memory; if it is set too low, high-frequency features may lose effect due to insufficient expressive capacity.

DynamicDimensionEmbeddingVariable allows different feature values of the same feature column to use different embedding dimensions, configured automatically according to how hot or cold each feature is. High-frequency features can be given higher dimensions to enhance their expressive capacity, while low-frequency features are given lower dimensions, which alleviates overfitting and greatly saves memory (low-frequency long-tail features dominate in number).
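
As a purely conceptual illustration of this idea (not DeepRec's API), the sketch below keeps a wide table for high-frequency ids and a narrow table for low-frequency ids, then lifts the narrow embeddings to the common dimension with a learned projection; which ids count as high-frequency is assumed to be decided by external frequency statistics.

```python
import tensorflow as tf

HIGH_DIM, LOW_DIM, BUCKETS = 64, 8, 100000
high_table = tf.get_variable("high_freq_emb", [BUCKETS, HIGH_DIM])
low_table = tf.get_variable("low_freq_emb", [BUCKETS, LOW_DIM])
lift = tf.get_variable("low_freq_proj", [LOW_DIM, HIGH_DIM])  # project LOW_DIM up to HIGH_DIM

def dynamic_dim_lookup(ids, is_high_freq):
    """ids: int32/int64 [N]; is_high_freq: bool [N] from external frequency stats."""
    wide = tf.nn.embedding_lookup(high_table, ids)
    narrow = tf.matmul(tf.nn.embedding_lookup(low_table, ids), lift)
    # In TF 1.x, a rank-1 condition selects whole rows of the rank-2 operands.
    return tf.where(is_high_freq, wide, narrow)
```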

  • Adaptive Embedding:

When the dynamic elastic feature function is used alone, low-frequency features tend to overfit: every feature in an EmbeddingVariable is learned from the initial value set by the initializer (usually 0), so a feature that grows from low frequency to high frequency must be learned gradually from scratch and cannot share what other features have already learned. AdaptiveEmbedding uses both a static-shape Variable and a dynamic EmbeddingVariable to store sparse features: newly appearing features are stored in the (conflicting) static Variable, and features that appear frequently enough are migrated into the conflict-free EmbeddingVariable, reusing the learning results accumulated in the conflicting static-shape Variable.
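
A conceptual sketch of this mechanism follows (DeepRec handles the routing and migration internally; this is not its actual interface): new or cold ids read from a hash-bucketed static table where collisions can happen, while ids flagged as hot by an external frequency counter read from the conflict-free EmbeddingVariable.

```python
import tensorflow as tf

NUM_BUCKETS, DIM = 1 << 16, 16
static_table = tf.get_variable("static_emb", [NUM_BUCKETS, DIM])    # collisions possible
dynamic_table = tf.get_embedding_variable("dynamic_emb", embedding_dim=DIM)  # assumed DeepRec EV

def adaptive_lookup(ids, is_hot):
    """ids: int64 [N]; is_hot: bool [N], maintained by external frequency statistics."""
    static_emb = tf.nn.embedding_lookup(static_table, tf.floormod(ids, NUM_BUCKETS))
    dynamic_emb = tf.nn.embedding_lookup(dynamic_table, ids)
    return tf.where(is_hot, dynamic_emb, static_emb)
```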

  • Adagrad Decay Optimizer:

An improved version of the Adagrad optimizer, proposed to support very large-scale training. When the training sample volume is large and incremental training lasts a long time, the Adagrad accumulator drives the effective gradient toward 0, so newly arriving data can no longer affect the model. Existing cumulative-discount schemes solve the vanishing-update problem, but they degrade model quality, because a per-iteration discount strategy does not reflect the actual characteristics of business scenarios. Adagrad Decay Optimizer instead discounts by period: samples within the same period are discounted with the same strength, which accounts for both the unbounded accumulation of data and the influence of sample order on the model.
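
To make the periodic discount concrete, here is a small numeric sketch of the idea (not DeepRec's implementation): inside a period the accumulator grows exactly as in plain Adagrad, and the accumulated history is discounted once when a new period starts, so old samples fade gradually instead of suppressing all future updates.

```python
import numpy as np

def adagrad_decay_step(w, g, acc, step, lr=0.01, rho=0.9999,
                       period=10000, eps=1e-8):
    """One parameter update; rho and period are illustrative values."""
    if step > 0 and step % period == 0:   # entering a new period: discount history once
        acc = rho * acc
    acc = acc + g * g                     # same accumulation as Adagrad within a period
    w = w - lr * g / (np.sqrt(acc) + eps)
    return w, acc
```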

DeepRec also provides features such as Multi-Hash Embedding and AdamAsyncOptimizer, which help businesses in terms of memory footprint, performance, and model effect.

2. Training Performance

DeepRec performs deep performance optimization for sparse model scenarios in terms of distribution, graph optimization, operators, and runtime. DeepRec optimizes different distributed strategies, including asynchronous training, synchronous training, and semi-synchronous training; GPU synchronous training supports HybridBackend and NVIDIA HugeCTR SOK. DeepRec provides a wealth of graph optimization features for sparse model training, including the SmartStage automatic pipeline, structured features, automatic graph fusion, and more. DeepRec optimizes dozens of common operators in sparse models and provides fused operators for common subgraphs such as Embedding and Attention. The CPUAllocator and GPUAllocator in DeepRec greatly reduce memory and GPU memory usage and accelerate end-to-end training. For thread scheduling and the execution engine, different scheduling strategies are provided for different scenarios. The following briefly introduces the distributed, graph-optimization, and runtime optimizations.

  • StarServer (asynchronous training framework):

In very large-scale tasks (hundreds or thousands of workers), some problems in the native open-source framework are exposed: inefficient thread-pool scheduling, lock overhead on the critical path, an inefficient execution engine, and the overhead of frequent small-packet RPCs, all of which create significant bottlenecks in ParameterServer-based distributed training. StarServer optimizes the graph, thread scheduling, the execution engine, and memory; it changes the framework's send/recv semantics to pull/push semantics, supports these semantics in subgraph partitioning, and makes the execution of ParameterServer-side graphs lock-free, which greatly improves the efficiency of concurrent subgraph execution. Compared with the native framework, training performance improves several times over, and linear distributed scaling at the scale of 3,000 workers is supported.
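
According to DeepRec's documentation, StarServer is selected through the server's protocol; the sketch below assumes the protocol string "star_server" (which may differ by version) and uses a one-PS/two-worker cluster purely for illustration.

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
# Assumption: DeepRec registers StarServer under the protocol name below.
# Workers then build and run their training graph against this server as usual.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="star_server")
```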

  • SmartStage (automatic pipeline):

Sparse model training usually includes stages such as sample data reading, Embedding lookup, and Attention/MLP computation. Sample reading and Embedding lookup are not compute-intensive and cannot efficiently use computing resources (CPU, GPU). The dataset.prefetch interface in the native framework can read samples asynchronously, but Embedding lookup involves complex steps such as feature padding and ID mapping that cannot be pipelined through prefetch alone. SmartStage automatically analyzes the graph and inserts asynchronous pipeline boundaries, maximizing pipelined concurrency.
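
A hedged sketch of enabling SmartStage: the do_smart_stage field on the session config is taken from DeepRec's documentation and is an assumption here; everything else is ordinary TensorFlow 1.x, and the tiny model exists only to give the graph optimizer something to work on.

```python
import tensorflow as tf

# Ordinary input pipeline plus a trivial model, for illustration only.
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 8])).batch(32)
features = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
w = tf.get_variable("w", [8, 1])
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    tf.reduce_mean(tf.matmul(features, w)))

# Assumed DeepRec switch: let the graph optimizer insert asynchronous stage boundaries.
sess_config = tf.ConfigProto()
sess_config.graph_options.optimizer_options.do_smart_stage = True

with tf.Session(config=sess_config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
```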

  • PRMalloc (memory allocator):

Using memory efficiently is critical for sparse model training. In sparse-scenario training, large memory allocations cause a large number of minor page faults, and multi-threaded allocation suffers from serious concurrency contention. Exploiting the fact that the forward and backward computation pattern of sparse model training is relatively fixed and repeated across iterations, DeepRec designed a memory-management scheme for deep learning tasks that improves memory-usage efficiency and system performance. Using the PRMalloc provided in DeepRec greatly reduces minor page faults during training and improves the efficiency of concurrent multi-threaded memory allocation and release.

  • PMEM Allocator:

The PMEM Allocator, built on the libpmem library of PMDK, divides the space mapped from PMEM into segments, and each segment is further divided into blocks; a block is the allocator's minimum allocation unit. To avoid thread contention, each thread caches some available space, consisting of a set of segments and free lists. For each record size (a multiple of the block size), the available space maintains its own free list and segment; the segment corresponding to a record size only allocates PMEM space of that size, and every pointer in that record size's free list points to free space of that size. In addition, to balance resources across thread caches, a background thread periodically moves free lists from the thread caches into a background pool shared by all foreground threads. Experiments show that for large-model training there is little performance difference between the persistent-memory-based allocator and DRAM, while the TCO advantage is significant.
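
The following is a purely conceptual Python sketch of the layout described above (the real allocator is C++ on top of PMDK): segments are carved into fixed-size blocks, each record size owns its own free list in a per-thread cache, and a shared pool stands in for the rebalancing done by the background thread.

```python
import collections
import threading

BLOCK = 64  # minimum allocation unit in bytes (illustrative value)

class ThreadCache:
    """Per-thread cache: one free list of offsets for each record size."""
    def __init__(self):
        self.free_lists = collections.defaultdict(list)

class PMemAllocatorSketch:
    def __init__(self, capacity, segment_size):
        # The mapped space is pre-split into segments; a segment serves one record size.
        self.segments = [(off, segment_size)
                         for off in range(0, capacity, segment_size)]
        self.background_pool = collections.defaultdict(list)  # shared by all threads
        self.lock = threading.Lock()

    def allocate(self, cache, size):
        record = -(-size // BLOCK) * BLOCK           # round up to whole blocks
        if not cache.free_lists[record]:
            self._refill(cache, record)
        return cache.free_lists[record].pop()        # offset into the PMEM mapping

    def _refill(self, cache, record):
        with self.lock:
            if self.background_pool[record]:         # reuse space returned by other threads
                cache.free_lists[record].append(self.background_pool[record].pop())
                return
            start, length = self.segments.pop()      # dedicate a fresh segment to this size
        cache.free_lists[record].extend(range(start, start + length, record))
```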

3. Deployment and Serving

  • Incremental model export and loading:

Businesses with high timeliness requirements need frequent online model updates, often at the level of minutes or even seconds. For TB to 10 TB super-large models, generating and bringing a full model online within minutes is difficult. In addition, training and serving such super-large models suffer from problems such as resource waste and increased multi-node serving latency. DeepRec provides incremental model export and loading capabilities, greatly accelerating the generation and loading of super-large models.

  • Embedding multi-level hybrid storage:

Features in sparse models are skewed between hot and cold: some unpopular features are rarely accessed or updated, so the memory/GPU memory they occupy is wasted, and large models may not fit into memory/GPU memory at all. DeepRec provides multi-level hybrid storage (up to four levels: HBM + DRAM + PMEM + SSD), which automatically places cold features on cheaper storage media and hot features on faster, more expensive storage media. With this capability, training and serving of TB to 10 TB models can be carried out on a single node.

Multi-level hybrid storage lets the GPU's capability for training sparse models be exploited more fully, reducing the computing resources wasted due to storage constraints: training of a given scale can be done with fewer machines, or larger-scale training with the same number of machines. For prediction, multi-level hybrid storage avoids the latency increase caused by distributed serving of large models, improving large-model prediction performance while reducing cost. The hybrid storage function also discovers feature access patterns automatically: based on an efficient heat-statistics strategy, hot features are placed on fast storage media, low-frequency features are offloaded to slower media, and features are moved asynchronously between the tiers.
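
A hedged configuration sketch, assuming the per-EmbeddingVariable storage options described in DeepRec's multi-level storage documentation: the option classes and the DRAM_SSDHASH tier name below are assumptions and may differ across versions, but the idea is that each EmbeddingVariable declares its storage tiers, capacity budget, and spill path.

```python
import tensorflow as tf

# Assumed DeepRec options: keep hot ids in DRAM and spill cold ids to SSD.
ev_opt = tf.EmbeddingVariableOption(
    storage_option=tf.StorageOption(
        storage_type=tf.StorageType.DRAM_SSDHASH,
        storage_path="/mnt/ssd/deeprec_ev",
        storage_size=[512 * 1024 * 1024]))   # DRAM budget before spilling to SSD

user_emb = tf.get_embedding_variable(
    "user_id_emb", embedding_dim=32, ev_option=ev_opt)
```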

Why open-source DeepRec

Open-source deep learning frameworks do not support sparse scenarios well in terms of sparse Embedding functionality, model training performance, deployment and iteration, and online serving. After being polished in Alibaba Group's core business scenarios such as search, recommendation, and advertising, as well as various business scenarios on the public cloud, DeepRec can support the training-effect and performance requirements of different types of sparse scenarios.

By establishing an open-source community and cooperating extensively with external developers, Alibaba hopes to further advance the development of sparse model training/prediction frameworks and bring improvements in business effect and performance to search, recommendation, and advertising model training and prediction in different business scenarios.

Open-sourcing DeepRec today is just a small first step for us, and we look forward to your feedback. If you are interested in DeepRec, we would be honored to have you drop by and contribute your code and ideas to the framework.


This article is original content from Alibaba Cloud and may not be reproduced without permission.