0x00 Introduction

After nine articles that have essentially covered the HugeCTR training process, it is worth looking at how HugeCTR performs inference so that we get a better grasp of the overall picture. And since we have been analyzing distributed training, this is also a good place to look at distributed inference.

Translated from github.com/triton-infe…

The HugeCTR Backend (github.com/triton-infe…) is a GPU-accelerated recommender model deployment framework designed to use GPU memory efficiently and speed up inference by decoupling the parameter server, the embedding cache, and the model weights. The HugeCTR backend supports concurrent model inference across multiple GPUs by sharing the embedding cache among multiple model instances.

Other articles in this series are as follows:

NVIDIA HugeCTR, GPU version parameter server – (1)

NVIDIA HugeCTR, GPU version parameter server – (2)

NVIDIA HugeCTR, GPU version parameter server – (3)

NVIDIA HugeCTR, GPU version parameter server – (4)

NVIDIA HugeCTR, GPU version parameter server – (5) Embedding hash table

NVIDIA HugeCTR, GPU version parameter server – (6) Distributed hash table

NVIDIA HugeCTR, GPU version parameter server – (7) Distributed hash table forward propagation

NVIDIA HugeCTR, GPU version parameter server – (8) Distributed hash table backward propagation

NVIDIA HugeCTR, GPU version parameter server – (9) Local hash table

0x01 Design

The HugeCTR Backend uses a layered framework that isolates the loading of embedding tables through the Parameter Server, so that models deployed across multiple GPUs do not affect one another, and achieves high service availability through the embedding cache. The GPU embedding cache is used to accelerate embedding vector lookups during inference.

The HugeCTR backend also provides the following capabilities:

  • Concurrent model execution: Multiple models, and multiple instances of the same model, can run simultaneously on the same GPU or on multiple GPUs.
  • Extensible backend: The inference interface provided by HugeCTR can be easily integrated with the backend API, which allows models to be extended with any execution logic written in Python or C++.
  • Easy deployment of new models: Updating a model should be as transparent as possible and should not affect inference performance. This means that no matter how many models need to be deployed, as long as they were trained with HugeCTR, they can all be loaded through the same HugeCTR backend API. Note: In some cases, the configuration files of each model may need to be updated accordingly.

0x02 HugeCTR Backend framework

The following components make up the HugeCTR backend framework:

  • The parameter server is responsible for loading and managing the large embedding tables that belong to different models. The embedding tables provide synchronization and update services for the embedding cache. The parameter server also ensures that the embedding tables are fully loaded and updated periodically.
  • The embedding cache can be loaded directly into GPU memory. It provides embedding vector lookup for the model, thus avoiding the relatively high latency of transferring data from the parameter server (between CPU and GPU). It also provides an update mechanism so that the latest cached embedding vectors are loaded in time, which ensures a high hit rate.
  • The model is much smaller than the embedding table, so it can usually be loaded directly into GPU memory to speed up inference. The model can interact directly with the embedding cache in GPU memory to obtain embedding vectors. Based on the layered design, multiple model instances share the embedding cache in GPU memory to enable concurrent model execution. Based on this hierarchical dependency, the embedding table can be decoupled from the model's lookup operations, which instead rely on the embedding cache for efficient, low-latency lookups. This makes it possible to implement the inference logic with interface-by-interface initialization and dependency injection (the lookup flow is sketched below).
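To make this layered lookup path concrete, here is a minimal illustrative sketch in Python. It is not HugeCTR source code: the classes ParameterServer and EmbeddingCache and their methods are hypothetical stand-ins for the real components, and the sketch only shows how a cache miss falls back to the parameter server.

# Illustrative sketch only: a toy model of the layered lookup path
# (model instance -> GPU embedding cache -> parameter server).
# Class and method names are hypothetical, not the actual HugeCTR API.
import numpy as np

class ParameterServer:
    """Holds the full embedding table (conceptually in CPU memory or on SSD)."""
    def __init__(self, table):
        self.table = table              # {embedding_key: embedding_vector}
        self.dim = len(next(iter(table.values())))

    def lookup(self, keys):
        # Keys absent from the table resolve to a default (zero) vector.
        return {k: self.table.get(k, np.zeros(self.dim)) for k in keys}

class EmbeddingCache:
    """A smaller cache (conceptually in GPU memory) shared by model instances."""
    def __init__(self, ps):
        self.ps = ps
        self.cache = {}

    def lookup(self, keys):
        missing = [k for k in keys if k not in self.cache]
        if missing:
            # Cache miss: fetch the missing vectors from the parameter server.
            self.cache.update(self.ps.lookup(missing))
        return np.stack([self.cache[k] for k in keys])

ps = ParameterServer({1: np.ones(4), 3: np.full(4, 3.0), 8: np.full(4, 8.0)})
cache = EmbeddingCache(ps)
print(cache.lookup([1, 3, 8]))          # model instances only query the cache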

Figure 1. HugeCTR inference interface design architecture

In practice, the parameter server loads the embedding tables of all models. Because different models are trained for different application scenarios, they produce different embedding tables, which leads to high memory overhead during inference. By introducing the Parameter Server, an embedding table can be loaded directly into GPU memory when it is small enough; if GPU resources are exhausted, it is loaded into CPU memory; and when the embedding table is too large even for that, it is loaded onto the local solid state disk (SSD). This ensures that embedding tables are isolated between different models while still allowing embedding tables to be shared among models.
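As a rough illustration of this placement policy, the following sketch (a hypothetical helper, not part of HugeCTR) picks a storage tier for an embedding table by comparing its size against the available GPU and CPU memory:

# Hypothetical sketch of the tiered placement policy described above; sizes in bytes.
def choose_storage_tier(table_bytes, free_gpu_bytes, free_cpu_bytes):
    if table_bytes <= free_gpu_bytes:
        return "GPU memory"   # small table: load it directly into GPU memory
    if table_bytes <= free_cpu_bytes:
        return "CPU memory"   # GPU resources exhausted: keep it in host memory
    return "SSD"              # too large even for CPU memory: fall back to local SSD

print(choose_storage_tier(2 << 30, 8 << 30, 64 << 30))    # GPU memory
print(choose_storage_tier(200 << 30, 8 << 30, 64 << 30))  # SSD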

Each embedding table creates its own embedding cache on each deployed GPU. The embedding cache treats the embedding table as the minimum granularity, which means that the embedding cache can look up and synchronize with its corresponding embedding table directly. This mechanism ensures that multiple instances of the same model can share the same embedding cache on the GPU node where they are deployed.

0x03 GPU Embedding Cache

3.1 Enabled

When the GPU embedding caching mechanism is enabled, the model looks up embedding vectors in the GPU embedding cache. If an embedding vector does not exist in the GPU embedding cache, the default embedding vector is returned; the default value is 0.

The following parameters need to be set in the HugeCTR backend's config.pbtxt file:

parameters [
  ...
  {
    key: "gpucache"
    value: { string_value: "true" }
  },
  {
    key: "gpucacheper"
    value: { string_value: "0.5" }
  },
  ...
]
  • gpucache: Use this option to enable the GPU embedding cache.
  • gpucacheper: Determines the percentage of embedding vectors that will be loaded from the embedding table into the GPU embedding cache. The default value is 0.5, so in the example above 50% of the embedding table will be loaded into the GPU embedding cache.
."inference": {
   "max_batchsize": 64."hit_rate_threshold": 0.6."dense_model_file": "/model/dcn/1/_dense_10000.model"."sparse_model_file": "/model/dcn/1/0_sparse_10000.model"."label": 1},... ]Copy the code
  • hit_rate_threshold: This option determines the update mechanism of the embedding cache and the parameter server based on the hit rate. If the hit rate of an embedding vector lookup falls below the configured threshold, the GPU embedding cache updates the missing vectors from the parameter server. The GPU embedding cache also reads embedding vectors from the parameter server for updates based on a fixed hit rate. The hit rate threshold must be set in the model inference configuration JSON file (for example, see dcn.json and deepfm.json). A short sketch of this decision logic follows.
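The sketch below illustrates this decision in Python. It is not HugeCTR source: the dictionaries stand in for the GPU embedding cache and the parameter server, and the threshold value mirrors the hit_rate_threshold of 0.6 from the example configuration above.

# Illustrative only: how a hit-rate threshold can gate a synchronous refresh of the
# GPU embedding cache from the parameter server. All names are hypothetical.
def lookup_with_threshold(cache, ps, keys, hit_rate_threshold=0.6):
    missing = [k for k in keys if k not in cache]
    hit_rate = (len(keys) - len(missing)) / max(len(keys), 1)
    if hit_rate < hit_rate_threshold and missing:
        # Hit rate too low: pull the missing vectors from the parameter server
        # synchronously and insert them into the cache.
        cache.update({k: ps[k] for k in missing if k in ps})
    # Keys that were not refreshed resolve to the default value (0).
    return [cache.get(k, 0.0) for k in keys], hit_rate

gpu_cache = {1: 0.1, 3: 0.3}
parameter_server = {1: 0.1, 3: 0.3, 8: 0.8, 9: 0.9}
vectors, rate = lookup_with_threshold(gpu_cache, parameter_server, [1, 3, 8, 9])
print(rate, vectors)  # hit rate 0.5 < 0.6, so keys 8 and 9 are fetched and cached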

3.2 Disabled

When the GPU embedding caching mechanism is disabled (i.e., "gpucache" is set to false), the model looks up embedding vectors directly from the parameter server. In this case, all other settings related to the GPU embedding cache are ignored.

0x04 Localized Deployment

The Parameter Server supports localized deployment, i.e., it is deployed on the same node or cluster as the GPUs, for example each node has only one GPU and the Parameter Server is deployed on that same node. Here are several deployment scenarios supported by HugeCTR:

  • Scenario 1: One GPU (Node 1) deploys one model, and the hit rate of the embedding cache is maximized by launching multiple parallel instances.

  • Scenario 2: One GPU (Node 2) deploys multiple models so as to maximize GPU utilization. This requires a balance between the number of concurrent instances and the multiple embedding caches to ensure efficient use of GPU memory. Data transfer between each embedding cache and the parameter server uses an independent CUDA stream.

    Note: In the scenarios below, multiple GPUs and one parameter server are deployed on each node.

  • Scenario 3: Multiple GPUs (Node 3) deploy a single model. In this case, the parameter server can help improve the hit rate of the embedding caches across the GPUs.

  • Scenario 4: Multiple GPUs (Node 4) deploy multiple models. This is the most complicated scenario for localized deployment. It must be ensured that different embedding caches can share the same Parameter Server and that different models can share embedding caches on the same node.

Figure 2. HugeCTR inference localized deployment architecture

0x05 Distributed deployment with hierarchical HugeCTR parameter servers

HugeCTR introduces a distributed Redis cluster as a CPU cache that stores larger embedding tables and interacts directly with the GPU embedding cache. A local RocksDB serves as a query engine that backs the complete embedding tables on the local SSD and assists the Redis cluster with lookups of missing embedding keys. To enable this hierarchical lookup service, the "db_type" configuration item must be set to "hierarchy" in ps.json:

{
  "supportlonglong": false,
  "db_type": "hierarchy",
  "models": [
    ...
  ]
}
  • Distributed Redis cluster, synchronous query: The embedding keys missed by the GPU embedding cache (keys not found in the GPU cache) are stored in a missing-keys buffer. The missing-keys buffer is exchanged synchronously with the Redis instances, which in turn perform the lookup of the missing embedding keys. The distributed Redis cluster thus acts as a second-level cache and can completely replace the localized parameter server for loading the complete embedding tables of all models.

    The user only needs to set the IP address and port of each node to enable the Redis cluster service in the HugeCTR hierarchical parameter server. However, the Redis cluster, as a distributed in-memory cache, is still limited by the CPU memory size of each node. In other words, the embedding tables of all models still cannot exceed the total CPU memory of the cluster. Therefore, the user can use "cache_size_percentage_redis" to control the portion of each model's embedding table that is loaded into the Redis cluster.

    To take advantage of a Redis cluster with HugeCTR, add the following configuration options to ps.json:

    { "supportlonglong": false, ... "db_type": "hierarchy", "redis_ip": "node1_ip:port,node2_ip:port,node3_ip:port,..." , "cache_size_percentage_redis" : "0.5",... "models": [ ... ] }Copy the code
  • Localized RocksDB: For very large embedding tables that still cannot be fully loaded into the Redis cluster, a local key-value store (RocksDB) is enabled on each node.

    Synchronous query of RocksDB: When the Redis cluster client looks up embedding keys for the distributed GPU caches, the embedding keys that are still missing (keys not found in the Redis cluster) are recorded and stored in a missing-keys buffer. This buffer is exchanged synchronously with the local RocksDB client, which then tries to find these keys on the local SSD. Finally, the SSD query engine performs this third-level lookup for all embedding keys still missing for the model.

    For model repositories already stored in the cloud, RocksDB serves as a local SSD cache that holds the remaining portion that the Redis cluster cannot load. Thus, in practice, the localized RocksDB instance acts as a third-level cache.

    The localized RocksDB configuration needs to be added to ps.json as shown below:

    {
      "supportlonglong": false,
      "db_type": "hierarchy",
      "rocksdb_path": "/current_node/rocksdb_path",
      "models": [
        ...
      ]
    }
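Putting the three levels together, here is a minimal sketch of the hierarchical lookup described above. It is illustrative only, not the HugeCTR implementation: each tier is modeled as a plain Python dictionary standing in for the GPU embedding cache, the distributed Redis cluster, and the local RocksDB on SSD, and each level passes only its missing keys down to the next.

# Illustrative three-level lookup: GPU embedding cache -> Redis cluster -> RocksDB.
# The dictionaries are stand-ins; the real system uses GPU memory, a distributed
# Redis cluster, and a local RocksDB instance on SSD.
def hierarchical_lookup(keys, gpu_cache, redis_cluster, rocksdb, default=0.0):
    result, missing = {}, []
    for k in keys:                        # level 1: GPU embedding cache
        if k in gpu_cache:
            result[k] = gpu_cache[k]
        else:
            missing.append(k)             # collected in the "missing keys buffer"

    still_missing = []
    for k in missing:                     # level 2: distributed Redis cluster
        if k in redis_cluster:
            result[k] = redis_cluster[k]
        else:
            still_missing.append(k)

    for k in still_missing:               # level 3: local RocksDB on SSD
        result[k] = rocksdb.get(k, default)
    return [result[k] for k in keys]

gpu_cache = {1: 0.1}
redis_cluster = {1: 0.1, 3: 0.3}
rocksdb = {1: 0.1, 3: 0.3, 8: 0.8, 9: 0.9}
print(hierarchical_lookup([1, 3, 8, 9], gpu_cache, redis_cluster, rocksdb))
# [0.1, 0.3, 0.8, 0.9]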

Figure 3. HugeCTR inference distributed deployment architecture

0x06 Variant Compressed Sparse Row Input

The variant compressed sparse row (VCSR) data format is commonly used as input for HugeCTR models. It allows the data to be read efficiently and the semantic information to be extracted from the raw data, while avoiding spending too much time on data parsing. NVTabular must output the corresponding slot information to indicate the feature files of the categorical data. Using this variant of the CSR data format, the model can capture the feature-field information when reading data from a request, and inference is also accelerated because excessive processing of the request data is avoided. For each sample, there are three main types of input data:

  • Dense Feature: represents the actual numerical data.
  • Column Indices: the upstream preprocessing tool NVTabular performs one-hot and multi-hot encoding on the categorical data and converts it to numerical data.
  • Row PTR: contains the number of categorical features in each slot.

Figure 4. HugeCTR inference VCSR input format

VCSR examples

A single embedding table per model

Take Row 0 in the figure above as an example. The input data contains four slots, and HugeCTR parses the Row 0 slot information according to the Row PTR. All embedding vectors are stored in a single embedding table (a short decoding sketch follows the list below).

Figure 5. HugeCTR inference VCSR example for a single embedding table per model

  • Slot 1: contains one categorical feature, and the embedding key is 1.
  • Slot 2: contains one categorical feature, and the embedding key is 3.
  • Slot 3: contains no categorical features.
  • Slot 4: contains two categorical features, and the embedding keys are 8 and 9. HugeCTR looks up two embedding vectors from the GPU embedding cache or the parameter server and reduces them to a single final embedding vector for Slot 4.
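The decoding of the Row PTR for this example can be written out as a short sketch. The arrays below are reconstructed from the text of the example (one key for slot 1, one for slot 2, none for slot 3, two for slot 4), not taken from the actual figure data:

# Illustrative VCSR decoding for the single-table example above.
keys = [1, 3, 8, 9]          # column indices (embedding keys) for Row 0
row_ptr = [0, 1, 2, 2, 4]    # cumulative number of keys per slot

for slot, (start, end) in enumerate(zip(row_ptr, row_ptr[1:]), start=1):
    slot_keys = keys[start:end]
    print(f"Slot {slot}: {len(slot_keys)} categorical feature(s), keys = {slot_keys}")
# Slot 1: 1 categorical feature(s), keys = [1]
# Slot 2: 1 categorical feature(s), keys = [3]
# Slot 3: 0 categorical feature(s), keys = []
# Slot 4: 2 categorical feature(s), keys = [8, 9]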

Multiple embedding tables per model

Again, take Row 0 in the figure above. This time, however, we assume that the input data consists of four slots, where the first two slots (slots 1 and 2) belong to the first embedding table and the last two slots (slots 3 and 4) belong to the second embedding table. Two independent Row PTRs are therefore needed to form the complete Row PTR of the input data (see the sketch after the following list).

Figure 6. HugeCTR inference VCSR example for multiple embedding tables per model

  • Slot 1: contains one categorical feature; the embedding key is 1 and the corresponding embedding vector is stored in embedding table 1.
  • Slot 2: contains one categorical feature; the embedding key is 3 and the corresponding embedding vector is stored in embedding table 1.
  • Slot 3: contains no categorical features.
  • Slot 4: contains two categorical features; the embedding keys are 8 and 9, and the corresponding embedding vectors are stored in embedding table 2. In this case, HugeCTR looks up two embedding vectors from the GPU embedding cache or the Parameter Server and obtains the final embedding vector for Slot 4.
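A matching sketch for the two-table case uses one Row PTR per embedding table; together the two pieces form the complete Row PTR of the input data. As before, the values are reconstructed from the text of the example, and the table names are only illustrative:

# Illustrative two-table VCSR decoding: slots 1-2 map to embedding table 1,
# slots 3-4 to embedding table 2, each with its own Row PTR and key list.
tables = {
    "embedding_table_1": {"keys": [1, 3], "row_ptr": [0, 1, 2]},  # slots 1, 2
    "embedding_table_2": {"keys": [8, 9], "row_ptr": [0, 0, 2]},  # slots 3, 4
}

slot_id = 1
for name, t in tables.items():
    for start, end in zip(t["row_ptr"], t["row_ptr"][1:]):
        print(f"Slot {slot_id} -> {name}, keys = {t['keys'][start:end]}")
        slot_id += 1
# Slot 1 -> embedding_table_1, keys = [1]
# Slot 2 -> embedding_table_1, keys = [3]
# Slot 3 -> embedding_table_2, keys = []
# Slot 4 -> embedding_table_2, keys = [8, 9]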

Now that the HugeCTR analysis is complete, we’ll look at distributed training for TensorFlow in the next series.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF Reference

github.com/triton-infe…