0x00 Abstract

In this series we will introduce NVIDIA's HugeCTR, an industry-oriented recommendation system training framework optimized for large-scale CTR models, with model-parallel embeddings and data-parallel dense networks.

This article is based on a translation of "Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems" and the project's GitHub… documentation, combined with source code analysis.

Thanks also to the excellent article "Read HugeCTR source code" (see the references); I hope to build on it and enrich my own understanding of HugeCTR.

0x01 Background

We will briefly discuss the role of CTR estimation in modern recommendation systems and the main challenges in training.

1.1 CTR estimation in recommendation systems

Recommendation systems are everywhere, from online advertising and e-commerce to streaming services, and have a huge impact on service providers' revenues. A recommendation system finds the items a given user is most likely to click, ranks them, and shows the user the top N items. To achieve this, the recommendation system must first estimate the likelihood that a particular user will click on an item. This task is often referred to as CTR estimation.

How do you estimate CTR? There is no magic here: it is generally a matter of taking a rich dataset containing user-item interactions and using it to train an ML model. Each record in the dataset can contain features of the user (age, job), of the item (type, price), and the user-item click label (0 or 1). For example, if user A buys or clicks on several biographies from a series of books, it makes sense for the model to assign high probability values to biographies.

The overall structure of a CTR system is roughly as follows:

The following figure shows the CTR inference flow.

Figure from HugeCTR_Webinar

1.2 Challenges of CTR estimation training

First of all, features in recommendation systems are high-dimensional and sparse. Large-scale recommendation systems face frequently changing users and items, so it is important to identify the implicit feature interactions behind user clicks so that the system can provide higher-quality and more general recommendations. For example, married people under the age of 30 with children under the age of 2 may be inclined to buy beer with a high ABV. Modeling these implicit feature interactions requires complex feature engineering by domain experts. To make matters worse, because the features are extremely complex and unintuitive, even human experts often fail to spot these interactions. Instead of relying on experts, deep-learning-based approaches such as Wide & Deep, DeepFM, and DLRM have been developed to capture these complex interactions.

Another challenge in training CTR estimation models is that users and items change almost daily, so the life cycle of a trained model can be short. In addition, due to growing dataset size, dimensionality, and sparsity, a CTR model usually contains a large embedding table, which may not fit into the memory of a single GPU or even of multiple GPUs. Therefore, data loading, embedding table lookup, and inter-GPU communication can take up a large part of the model training time.

These factors, combined with the lack of standardized modeling methods for CTR estimation, often result in services that achieve suboptimal performance in terms of throughput and latency. It is therefore very important to enable faster iterative training of the model on a single GPU or on multiple GPUs.

0x02 HugeCTR

HugeCTR is an open source framework for accelerating the training of CTR estimation models on NVIDIA GPUs. It is highly optimized for performance on NVIDIA GPUs while allowing users to customize models in JSON format. It is written in CUDA C++ and heavily leverages GPU-accelerated libraries such as cuBLAS, cuDNN, and NCCL. It was originally intended as an internal prototype to evaluate the GPU's potential for CTR estimation problems, but it quickly became a reference design for GPU-based recommendation systems. As it naturally grew into a more general framework dedicated to CTR estimation, NVIDIA open-sourced its initial version in September 2019 to gather external feedback while continuing to interact with some customers.

HugeCTR is also the backbone of NVIDIA Merlin, a framework and ecosystem for building large-scale recommendation systems that require massive datasets for training; Merlin is designed to facilitate all stages of recommendation system development, accelerated on NVIDIA GPUs.

Figure from source github.com/NVIDIA-Merl…

HugeCTR delivers up to 114x faster performance on a single NVIDIA V100 GPU than TensorFlow on a 40-core CPU node, and up to 8.3x faster performance than TensorFlow on the same V100 GPU. As hybrid models consisting of a linear model and a deep model became common, version 2.1 of the HugeCTR architecture was extended to support models such as Wide & Deep, DCN, and DeepFM. Updates include a new data reader that can read both continuous and categorical input data, and new layers, including the factorization machine and cross layers. Dropout, L1/L2 regularizers, and more have also been added for more flexible design-space exploration.

0x03 Architecture

3.1 CTR DL model

The following diagram depicts the steps of the DL model for CTR estimation:

  1. Data records are read batch by batch, and each record is composed of high-dimensional, very sparse (or categorical) features. Each record can also contain dense numerical features that can be fed directly to the fully connected layers.
  2. The input sparse features are compressed into low-dimensional, dense embedding vectors by the embedding layer. For example, if there are N sparse features and the embedding dimension is K, the embedding table produces N dense vectors of dimension K.
  3. A feedforward neural network is used to estimate the click-through rate.

The figure (from Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems) shows a typical CTR model consisting of a data reader, an embedding layer, and fully connected layers.
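As a toy illustration of these three steps, here is a minimal NumPy sketch (not HugeCTR code; the shapes, layer sizes, and random weights are made up for the example) that runs one batch through embedding lookup, concatenation with dense features, and a small feedforward network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: one batch of records, each with N sparse (categorical) feature slots
# plus a few dense numerical features.
batch, n_slots, vocab, emb_dim, n_dense = 4, 3, 1000, 8, 2
sparse_ids = rng.integers(0, vocab, size=(batch, n_slots))
dense_feats = rng.random((batch, n_dense)).astype(np.float32)

# Step 2: the embedding lookup turns each sparse id into a K-dim dense vector,
# producing N vectors of dimension K per record (flattened to N*K here).
embedding_table = rng.normal(size=(vocab, emb_dim)).astype(np.float32)
emb = embedding_table[sparse_ids].reshape(batch, n_slots * emb_dim)

# Step 3: a feedforward network on [embeddings, dense features] estimates CTR.
x = np.concatenate([emb, dense_feats], axis=1)
w1 = rng.normal(size=(x.shape[1], 16)).astype(np.float32)
w2 = rng.normal(size=(16, 1)).astype(np.float32)
hidden = np.maximum(x @ w1, 0.0)              # ReLU
ctr = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid -> click probability
print(ctr.ravel())
```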

3.2 HugeCTR architecture

HugeCTR not only supports all three steps of the CTR DL workflow but also enhances end-to-end performance. For example:

  • To prevent data loading from becoming a major bottleneck in training, it implements a dedicated data reader that is asynchronous and multithreaded. It reads batches of data records, each of which is composed of high-dimensional, extremely sparse (or categorical) features. Each record can also contain dense numerical features that can be fed directly into the fully connected layers.
  • The embedding layer compresses sparse input features into low-dimensional, dense embedding vectors. There are three GPU-accelerated embedding stages:
    • Table lookup
    • Weight reduction within each slot
    • Weight concatenation across slots
  • All layers in forward and backward propagation, as well as the optimizers and loss functions, are implemented in CUDA C++, leveraging efficient CUDA optimization techniques and CUDA-enabled libraries.

To train large-scale CTR estimation models, the embedding tables in HugeCTR are model parallel and distributed across all GPUs in a homogeneous cluster consisting of multiple nodes. Each GPU has its own:

  • Feedforward neural network (data parallel) to estimate the click-through rate.
  • Hash table that makes data preprocessing easier and enables dynamic insertion.

So the architecture of HugeCTR, which can scale to multiple GPUs and nodes, is summarized as follows:

3.3 GPU-based Parameter Server

HugeCTR implements a GPU-based parameter server that places the embeddings on GPUs; workers obtain embeddings by interacting with the parameter server.

Figure from HugeCTR_Webinar

0x04 Core Functions

In this section, we introduce the key features of HugeCTR that contribute to its high performance and usability. Note: multi-node training and mixed precision training can be used simultaneously.

4.1 Model parallel training

HugeCTR natively supports both model-parallel and data-parallel training, making it possible to train very large models on GPUs.

4.1.1 In-memory GPU hash table

Embedding is almost indispensable for obtaining good model accuracy in CTR estimation. It usually comes with high memory capacity and bandwidth requirements and a significant amount of parallelism. If the embedding is distributed over multiple GPUs or multiple nodes, the communication overhead can also be significant. Because of the large and growing number of users and items, large embedding tables are inevitable.

To overcome these challenges and enable faster training, HugeCTR implements its own embedding layer, which includes a GPU-accelerated hash table and uses NCCL as its inter-GPU communication primitive. The hash table implementation is based on RAPIDS cuDF, a GPU DataFrame library from NVIDIA. The cuDF GPU hash table can be up to 35 times faster than the Threading Building Blocks (TBB) concurrent_hash_map.

In summary, HugeCTR supports model-parallel embedding tables spread across multiple GPUs and multiple nodes in a homogeneous computing cluster. The embedding features and categories can be distributed across multiple GPUs and nodes. For example, if you have two nodes each with 8x A100 80GB GPUs (1,280 GB of aggregate GPU memory), you can train models of up to 1 TB entirely on GPUs. By using the embedding training cache, you can train even larger models on the same nodes.

4.1.2 Multi-slot embedding

An embedding table can be split into multiple slots (or feature fields). During the embedding lookup, the sparse feature inputs belonging to the same slot are converted into the corresponding dense embedding vectors and then reduced into a single embedding vector. The embedding vectors from different slots are then concatenated together (a small sketch follows the list below).

Multi-slot embedding improves inter-GPU bandwidth utilization in the following ways:

  • When there are many features in a dataset, it helps reduce the number of effective features in each slot to a manageable level.
  • By concatenating the outputs of different slots, it reduces the number of inter-GPU transactions, facilitating more efficient communication.
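Below is a minimal NumPy sketch of the lookup, per-slot reduction, and cross-slot concatenation described above (the sum pooling, the keys, and the sizes are illustrative assumptions, not HugeCTR internals):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim = 100, 4
table = rng.normal(size=(vocab, emb_dim)).astype(np.float32)

# One sample with two slots; each slot holds a variable number of sparse keys.
slots = [
    np.array([3, 17, 42]),   # slot 0: three keys
    np.array([7, 21]),       # slot 1: two keys
]

# Per-slot reduction: look up every key and reduce (sum-pool) within the slot.
slot_vectors = [table[keys].sum(axis=0) for keys in slots]

# Cross-slot concatenation: join the reduced vectors into one input for the MLP.
sample_embedding = np.concatenate(slot_vectors)   # shape: (num_slots * emb_dim,)
print(sample_embedding.shape)
```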

The following figure shows the sequence of operations and how the inter-GPU communication (all2all) occurs.

The figure shows a model-parallel embedding spread across four GPUs and how it interacts with the neural networks on those GPUs. It also shows how the input features of each slot are reduced and how the results are concatenated across two slots. (Figure from Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems.)

Multi-slot embedding is also useful for linear models, which are basically a weighted sum of features: simply set both the slot number and the embedding dimension to 1. See the Wide & Deep sample for more information.

4.1.3 Specific implementation

To get the best performance on different embeddings, you can choose different embedding layer implementations. Each of these implementations is targeted at a different real-world training case, for example:

  • LocalizedSlotEmbeddingHash: features belonging to the same slot (field) are stored on a single GPU, which is why it is called a "localized" slot; different slots may be placed on different GPUs according to the slot index. LocalizedSlotEmbedding is optimized for the case where each embedding is smaller than the GPU memory size. In LocalizedSlotEmbedding, each slot performs a local reduction on its own GPU: once the embedding vectors are looked up and pooled within the GPU card, they are ready to use. There is no global reduction between GPUs, so the overall amount of data transferred in LocalizedSlotEmbedding is much smaller than in DistributedSlotEmbedding.

    Note: Make sure there are no duplicate keys in the input dataset.

  • DistributedSlotEmbeddingHash: all features, no matter which slot (field) they belong to or what the slot index is, are distributed to different GPUs according to the feature index. This means that features in the same slot may be stored on different GPUs, which is why it is called a "distributed slot". Because a global reduction is required, DistributedSlotEmbedding is suited to the case where the embedding is larger than the GPU memory size, and it therefore involves more memory exchange between GPUs.

    Note: Make sure there are no duplicate keys in the input dataset.

  • LocalizedSlotEmbeddingOneHot: a special kind of LocalizedSlotEmbedding that requires one-hot data input. Each feature field must also be indexed from zero. For example, gender should be encoded as 0 and 1; encoding it as 1 and 2 is incorrect.

It is important to note that the difference between LocalizedSlotEmbeddingHash and DistributedSlotEmbeddingHash is whether features belonging to the same slot (field) are stored on the same GPU. For example, suppose there are 2 GPU cards and 4 slots:

  • Local mode: GPU 0 stores slot 0 and slot 1, and GPU 1 stores slot 2 and slot 3.
  • Distributed mode: each GPU stores part of the parameters of every slot, and a hash determines which GPU a given parameter is assigned to.
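A toy sketch of the two placement rules for the 2-GPU, 4-slot example above (the block assignment and the plain modulo hash are illustrative assumptions; the exact placement logic lives in the HugeCTR embedding source code):

```python
NUM_GPUS, NUM_SLOTS = 2, 4

def localized_gpu(slot_id: int) -> int:
    # Localized mode: the whole slot lives on one GPU, chosen from the slot index
    # (block assignment here to match the example above: GPU 0 gets slots 0-1,
    # GPU 1 gets slots 2-3).
    return slot_id // (NUM_SLOTS // NUM_GPUS)

def distributed_gpu(key: int) -> int:
    # Distributed mode: every GPU holds part of every slot; a hash of the key
    # (a plain modulo here, as an illustrative stand-in) decides which GPU
    # stores that parameter row.
    return key % NUM_GPUS

print([localized_gpu(s) for s in range(NUM_SLOTS)])   # -> [0, 0, 1, 1]
print([distributed_gpu(k) for k in (10, 11, 12, 13)]) # -> [0, 1, 0, 1]
```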

4.2 Multi-node training

Multi-node training makes it easy to train an embedding table of any size. In a multi-node solution, the sparse model (that is, the embedding layer) is distributed between the nodes. Meanwhile, the dense model (such as a DNN) is data parallel, with one copy of the dense model on each GPU (see the figure below). With our implementation, HugeCTR uses NCCL for high-speed, scalable inter-node and intra-node communication.

Figure from source code.

To run on multiple nodes, HugeCTR should be built using OpenMPI. GPUDirect RDMA support is recommended for high performance. For more information, see the DCN Multi-node training sample.

4.3 Mixed precision training

Mixed precision training has become a common technique to achieve further acceleration while maintaining model accuracy; it helps improve throughput and reduce memory footprint. In HugeCTR, the fully connected layers can be configured to take advantage of the Tensor Cores on the NVIDIA Volta architecture and its successors. They internally use FP16 for accelerated matrix multiplication, but their inputs and outputs are still FP32.

In this mode, Tensor Cores are used to improve the performance of layers based on matrix multiplication, such as FullyConnectedLayer and InteractionLayer, on the Volta, Turing, and Ampere architectures. For the other layers, including embedding, the data type is changed to FP16 to save memory bandwidth and capacity. To enable mixed precision mode, specify the mixed_precision option in the configuration file. When mixed_precision is set, the full FP16 pipeline is triggered, and loss scaling is applied to avoid arithmetic underflow (see the figure below).

Figure 5: Arithmetic underflow diagram from source code.
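To see why loss scaling matters, here is a tiny NumPy demonstration of FP16 underflow and how scaling by a factor (1024 here, an arbitrary illustrative value, not HugeCTR's default) preserves a small gradient:

```python
import numpy as np

S = 1024.0        # loss scaling factor (illustrative value)
tiny_grad = 1e-8  # a gradient magnitude that FP16 cannot represent

print(np.float16(tiny_grad))         # -> 0.0: the update is lost (underflow)
scaled = np.float16(tiny_grad * S)   # scale the loss/gradients before FP16 math
print(scaled)                        # -> a nonzero FP16 number
recovered = np.float32(scaled) / S   # unscale in FP32 before the weight update
print(recovered)                     # -> approximately 1e-8 again
```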

4.4 SGD optimizer and learning rate scheduling

Learning rate scheduling allows the user to configure its hyperparameters, including the following:

  • learning_rate: The base learning rate.
  • warmup_steps: The number of initial steps used for warmup.
  • decay_start: The step at which learning rate decay begins.
  • decay_steps: The decay period (the learning rate decays gradually over these steps).
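The sketch below shows how these four hyperparameters could combine into a per-step learning rate, assuming linear warmup and linear decay (the exact curve HugeCTR uses may differ; see Figure 6 and the source code):

```python
def scheduled_lr(step: int, learning_rate: float = 0.001,
                 warmup_steps: int = 1000, decay_start: int = 10000,
                 decay_steps: int = 40000) -> float:
    """Illustrative learning-rate schedule: warmup, plateau, then decay to 0."""
    if step < warmup_steps:                      # linear warmup from 0
        return learning_rate * (step + 1) / warmup_steps
    if step < decay_start:                       # hold at the base learning rate
        return learning_rate
    remaining = max(0.0, 1.0 - (step - decay_start) / decay_steps)
    return learning_rate * remaining             # linear decay down to zero

for s in (0, 500, 5000, 20000, 60000):
    print(s, round(scheduled_lr(s), 6))
```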

Figure 6 illustrates how these hyperparameters interact with the actual learning rate.

For more information, see Python Interfaces.

Figure 6: Learning rate scheduling diagram from source code.

4.5 Embedding training cache

The embedding training cache (model oversubscription) enables you to train large models of up to terabytes in size. It works by loading a subset of an embedding table that exceeds the aggregate GPU memory capacity into the GPUs in a coarse-grained, on-demand fashion during the training stage. To use this feature, you need to split the dataset into multiple sub-datasets and extract a unique keyset from each of them (see Figure 7).
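A conceptual sketch of keyset extraction for one sub-dataset, assuming its categorical columns are available as a NumPy integer array (the real tooling, such as Criteo2Hugectr mentioned below, writes the keyset in the binary format HugeCTR expects):

```python
import numpy as np

def extract_keyset(cat_features: np.ndarray) -> np.ndarray:
    """Return the sorted unique categorical keys appearing in one sub-dataset.

    cat_features: int array of shape (num_samples, num_categorical_slots).
    """
    return np.unique(cat_features.reshape(-1))

# Hypothetical usage: one keyset per sub-dataset, used later to load only the
# needed rows of the embedding table into GPU memory for that training pass.
sub_dataset = np.random.randint(0, 100_000, size=(1024, 26))  # toy stand-in data
keyset = extract_keyset(sub_dataset)
print(keyset.shape)
```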

This feature currently supports both single-node and multi-node training. It supports all embedding types and can be used with the Norm and Raw dataset formats. We modified our Criteo2Hugectr tool to support keyset extraction for the Criteo dataset. For more information, see our Python Jupyter Notebook to learn how to use this feature with the Criteo dataset.

Note: The Criteo dataset is a common use case, but model prefetching is not limited to this dataset.

Figure 7: Preprocessing of dataset for Model Oversubscription diagram from source code

4.6 HugeCTR to ONNX converter

The HugeCTR to Open Neural Network Exchange (ONNX) converter is provided as the hugectr2onnx Python package, which converts HugeCTR models to ONNX. It improves HugeCTR's compatibility with other deep learning frameworks, because ONNX is an open-source format for AI models.

After training with the HugeCTR Python API, you obtain the dense model, the sparse models, and the graph configuration files; these files are required as inputs to the hugectr2onnx.converter.convert method. Each HugeCTR layer corresponds to one or more ONNX operators, and the trained model weights are loaded into the ONNX graph as initializers. Optionally, you can convert the sparse embedding layers by using the convert_embedding flag.
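A hedged sketch of invoking the converter (the file names are hypothetical outputs of a previous training run, and the keyword argument names follow the HugeCTR documentation as I recall it; check the hugectr2onnx docs for the exact signature):

```python
from hugectr2onnx.converter import convert

# Hypothetical artifacts produced by a HugeCTR training run for a "wdl" model.
convert(onnx_model_path="wdl.onnx",            # output ONNX file
        graph_config="wdl.json",               # graph configuration from training
        dense_model="wdl_dense_2000.model",    # trained dense weights
        convert_embedding=True,                # also export the sparse embedding layers
        sparse_models=["wdl0_sparse_2000.model", "wdl1_sparse_2000.model"])
```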

4.7 Layered Parameter Server

The HugeCTR Layered Parameter Server (POC) implements a tiered storage mechanism between local SSDs and CPU memory. With this implementation, embedding tables no longer need to be stored entirely in local CPU memory. A distributed Redis cluster is added as a CPU cache to store larger embedding tables and to interact directly with the GPU embedding cache. A local RocksDB instance serves as a query engine that backs up the full embedding tables on local SSDs and answers lookups for embedding keys missed by the Redis cluster.
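A toy, in-memory illustration of this lookup cascade (plain dicts stand in for the GPU embedding cache, the Redis cluster, and the RocksDB store; none of this is HugeCTR's real API):

```python
# Three tiers, fastest to slowest; plain dicts as stand-ins.
gpu_cache = {1: [0.1, 0.2]}                                # hot embeddings on the GPU
cpu_cache = {1: [0.1, 0.2], 2: [0.3, 0.4]}                 # larger tier, e.g. Redis
ssd_store = {k: [k * 0.01, k * 0.02] for k in range(10)}   # full table, e.g. RocksDB on SSD

def lookup(key):
    """Return the embedding for `key`, falling back through the tiers."""
    for tier in (gpu_cache, cpu_cache, ssd_store):
        if key in tier:
            return tier[key]
    raise KeyError(key)

print(lookup(1))  # served from the GPU cache
print(lookup(2))  # missed on GPU, served from the CPU cache
print(lookup(7))  # missed on both caches, served from the SSD-backed store
```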

4.8 Asynchronous multithreaded data pipeline

Without an efficient data pipeline, even if the forward and backward passes run at the speed of light, the overall effect is like a trip where getting to the airport takes much longer than the flight itself. Also, when a dataset is large and changes frequently, it makes perfect sense to split it into multiple files.

To effectively hide this long delay in data acquisition, HugeCTR has a multithreaded data reader that overlaps data acquisition with actual model training. As shown in the figure below, the DataReader is a facade that consists of multiple parallel workers and a collector.

Each worker reads one batch at a time from the dataset file assigned to it. The collector distributes the collected data records to multiple GPUs. The workers, the collector, and the model training run concurrently as separate CPU threads.
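Here is a small, self-contained Python sketch of this worker/collector pattern (not HugeCTR's actual classes; the queue, counts, and print statement stand in for real batch buffers and GPU dispatch):

```python
import queue
import threading

batch_queue = queue.Queue(maxsize=8)   # shared buffer between workers and collector
NUM_WORKERS, BATCHES_PER_WORKER = 3, 4

def worker(worker_id: int) -> None:
    # Stand-in for "read one batch at a time from the assigned dataset file".
    for batch_id in range(BATCHES_PER_WORKER):
        batch_queue.put((worker_id, batch_id))

def collector() -> None:
    # Stand-in for "distribute collected records to multiple GPUs".
    for _ in range(NUM_WORKERS * BATCHES_PER_WORKER):
        worker_id, batch_id = batch_queue.get()
        print(f"dispatch batch {batch_id} from worker {worker_id} to GPUs")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
threads.append(threading.Thread(target=collector))
for t in threads:
    t.start()
for t in threads:
    t.join()
```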

Figure 4. HugeCTR multithreaded data reader.


The diagram below shows how the HugeCTR pipeline overlaps the three phases of “data read from disk to CPU memory,” “data transfer from CPU to GPU,” and “actual training across different batches on the GPU.”

Figure from Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems.

4.9 Flexible Model Configuration

Although there are some commonalities between CTR models, their details (including hyperparameters) may differ. For flexible customization of models, HugeCTR allows models to be configured intuitively in JSON format.

For example, to describe the hybrid model shown in the figure below, you would write the "layers" clause abstracted in figure (b). You can have multiple embeddings, and you can specify the batch size, optimizer, data paths, and so on. In the same configuration file, you can also specify the number and IDs of the GPUs used for training. For more information, see the HugeCTR User Guide and the sample configuration files.
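As a rough illustration of the kind of structure such a configuration carries, here is a simplified sketch written as a Python dict rather than raw JSON; the field names and values are illustrative only and do not reproduce the exact HugeCTR schema (see the User Guide and sample configuration files for the authoritative format):

```python
# Illustrative only: a simplified, hypothetical view of a model configuration
# with a solver section, an optimizer section, and a "layers" clause that mixes
# two embeddings with dense layers and a loss.
model_config = {
    "solver": {"batchsize": 16384, "gpu": [0, 1, 2, 3]},
    "optimizer": {"type": "Adam", "learning_rate": 0.001},
    "layers": [
        {"name": "data", "type": "Data", "source": "./train.txt"},
        {"name": "wide_embedding", "type": "DistributedSlotSparseEmbeddingHash",
         "embedding_vec_size": 1},
        {"name": "deep_embedding", "type": "DistributedSlotSparseEmbeddingHash",
         "embedding_vec_size": 16},
        {"name": "fc1", "type": "InnerProduct", "bottom": "deep_embedding",
         "num_output": 1024},
        {"name": "loss", "type": "BinaryCrossEntropyLoss", "bottom": "fc1"},
    ],
}
```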

  • Figure 6. A hybrid model with two embeddings and two different types of inputs. (a) An example model expressible by HugeCTR. (b) The corresponding config. Many details are omitted for simplicity.

Figure from Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

WeChat official account: Rosie's Thoughts

0xFF Reference

Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems

Announcing NVIDIA Merlin: An Application Framework for Deep Recommender Systems

developer.nvidia.com/blog/announ…

developer.nvidia.com/blog/accele…

Read HugeCTR source code

How does embedding propagate back

web.eecs.umich.edu/~justincj/t…

info.nvidia.com/235418-onde…

HugeCTR_Webinar

www.cnblogs.com/futurehau/p…