NVIDIA HugeCTR, GPU version parameter server -(3)

0x00 Abstract

In this series, we introduced HugeCTR, an industry-oriented recommendation system training framework optimized for large-scale CTR models with model parallel embedding and data-parallel dense networks.

This article focuses on the input data and some of the underlying data structures that HugeCTR relies on. It draws on the article "Read HugeCTR source code" (listed in the references), for which many thanks. Because HugeCTR is in effect a specialized deep learning system, it implements many basic facilities itself, which makes its source code worth reading if you want to study a deep learning framework in depth.

Other articles in this series are as follows:

NVIDIA HugeCTR, GPU version parameter server -(1)

NVIDIA HugeCTR, GPU version parameter server -(2)

0x01 Review

Let's first recall the previous articles. The logical relationships within the pipeline are as follows:

The training process is as follows:

Based on what we’ve learned, let’s look at how to manipulate data.

0x02 Datasets

HugeCTR currently supports three data set formats, namely Norm, Raw, and Parquet, as shown below:

Fig. 1: (a) Norm (b) Raw (c) Parquet Dataset Formats

2.1 Norm

To maximize data loading performance and minimize storage, the Norm dataset format consists of a set of binary data files and a list of files in ASCII format. The model file should specify the filename of the training and test (evaluation) set, the maximum number of elements (keys) in the sample, and the label dimension, as shown in Figure 1 (a).

2.1.1 Data files

A data file is the minimum read granularity for a reading thread, so at least 10 files per file list are required for optimal performance. The data file consists of header and actual tabular data.

The header definition:

typedef struct DataSetHeader_ {
  long long error_check;        // 0: no error check; 1: check_num
  long long number_of_records;  // The number of samples in this data file
  long long label_dim;          // The dimension of the label
  long long density_dim;        // The dimension of the dense features
  long long slot_num;           // slot_num for each embedding
  long long reserved[3];        // Reserved for future use
} DataSetHeader;

The data definition (per sample):

typedef struct Data_ {
  int length;                   // Number of bytes in this example (optional: only in check_sum mode)
  float label[label_dim];
  float dense[dense_dim];
  Slot slots[slot_num];
  char checkbits;               // Check bits of this sample (optional: only in checksum mode)
} Data;

typedef struct Slot_ {
  int nnz;
  unsigned int*  keys; // Can be changed to 'long long' using 'input_key_type' in the 'solver' object of the configuration file
} Slot;

The data field usually holds many samples. Each sample starts with the labels, formatted as integers, followed by nnz (the number of non-zero keys) and the input keys in long long (or unsigned int) format, as shown in Figure 1 (a).

The categorical input keys are distributed to slots without overlap. For example, slot[0] = {0,10,32,45}, slot[1] = {1,2,5,67}. Any overlap results in undefined behavior. For example, given slot[0] = {0,10,32,45} and slot[1] = {1,10,5,67}, looking up key 10 in the embedding table yields different results depending on how the slots are assigned to GPUs.
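Since overlapping keys lead to undefined behavior, it can be worth checking the slots before a dataset is written. The following is a minimal sketch in plain Python, not part of HugeCTR; the helper name find_overlapping_keys is hypothetical, and the input is simply a list of key lists, one per slot.

```python
from collections import defaultdict


def find_overlapping_keys(slots):
    """Return {key: [slot indices]} for every key that appears in more than one slot."""
    seen = defaultdict(set)
    for slot_id, keys in enumerate(slots):
        for key in keys:
            seen[key].add(slot_id)
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


# The two examples from the text above.
ok_slots = [[0, 10, 32, 45], [1, 2, 5, 67]]
bad_slots = [[0, 10, 32, 45], [1, 10, 5, 67]]

print(find_overlapping_keys(ok_slots))   # {}
print(find_overlapping_keys(bad_slots))  # {10: [0, 1]}
```

An empty result means the slots are disjoint; any entry names a key that would be ambiguous at lookup time.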

2.1.2 File List

The first line of the file list should be the number of data files in the dataset, followed by the path to those files, as shown below:

$ cat simple_sparse_embedding_file_list.txt
10
./simple_sparse_embedding/simple_sparse_embedding0.data
./simple_sparse_embedding/simple_sparse_embedding1.data
./simple_sparse_embedding/simple_sparse_embedding2.data
./simple_sparse_embedding/simple_sparse_embedding3.data
./simple_sparse_embedding/simple_sparse_embedding4.data
./simple_sparse_embedding/simple_sparse_embedding5.data
./simple_sparse_embedding/simple_sparse_embedding6.data
./simple_sparse_embedding/simple_sparse_embedding7.data
./simple_sparse_embedding/simple_sparse_embedding8.data
./simple_sparse_embedding/simple_sparse_embedding9.data

The following is an example:

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Norm,
                                  source = ["./wdl_norm/file_list.txt"],
                                  eval_source = "./wdl_norm/file_list_test.txt",
                                  check_type = hugectr.Check_t.Sum)

2.2 Raw

The Raw dataset format differs from the Norm dataset format in that the training data is stored in a single binary file and uses int32. Figure 1 (b) shows the structure of a Raw dataset sample.

Note: This format only accepts one-hot data.

The Raw dataset format should be used together with the LocalizedSlotSparseEmbeddingOneHot embedding type.

Example:

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Raw,
                                  source = ["./wdl_raw/train_data.bin"],
                                  eval_source = "./wdl_raw/validation_data.bin",
                                  check_type = hugectr.Check_t.Sum)

2.3 Parquet

Parquet is a column-oriented, open source data format. It can be used for any project in the Apache Hadoop ecosystem. To reduce file size, it supports compression and encoding. Figure 1 (c) shows an example Parquet dataset. Refer to the Parquet documentation for additional information.

Please note the following:

The Parquet data loader does not currently support nested column types.
No missing values are allowed in the column.
As with the Norm dataset format, labels and dense feature columns should be in floating-point format.
Slot feature columns should be in Int64 format.
The columns of data in the Parquet file can be arranged in any order.
A separate _metadata.json file is required. It provides the number of rows in each Parquet file and the column index mapping for each label, dense (numerical), and slot (categorical) feature.

Example _metadata.json:

{
"file_stats": [{"file_name": "file1.parquet", "num_rows": 6528076}, {"file_name": "file2.parquet", "num_rows": 6528076}],
"cats": [{"col_name": "C11", "index": 24}, {"col_name": "C24", "index": 37}, {"col_name": "C17", "index": 30}, {"col_name": "C7", "index": 20}, {"col_name": "C6", "index": 19}],
"conts": [{"col_name": "I5", "index": 5}, {"col_name": "I13", "index": 13}, {"col_name": "I2", "index": 2}, {"col_name": "I10", "index": 10}],
"labels": [{"col_name": "label", "index": 0}]
}

Use as follows:

reader = hugectr.DataReaderParams(data_reader_type = hugectr.DataReaderType_t.Parquet,
                                  source = ["./criteo_data/train/_file_list.txt"],
                                  eval_source = "./criteo_data/val/_file_list.txt",
                                  check_type = hugectr.Check_t.Non,
                                  slot_size_array = [278899, 355877, 203750, 18573, 14082, 7020, 18966, 4, 6382, 1246, 49, 185920, 71354, 67346, 11, 2166, 7340, 60, 4, 934, 15, 204208, 141572, 199066, 60940, 9115, 72, 34])

We provide the slot_size_array option to add an offset to the keys of each slot. slot_size_array is an array whose length equals the number of slots. To avoid duplicate keys after the offsets are added, make sure that the keys of slot i fall in the range [0, slot_size_array[i]). The offset is computed as follows: for the keys of slot i, we add slot_size_array[0] + slot_size_array[1] + ... + slot_size_array[i-1]. In the configuration snippet above, offset 0 is added for slot 0, offset 278899 for slot 1, and offset 634776 (278899 + 355877) for slot 2.
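The offset rule is an exclusive prefix sum over slot_size_array. As a rough illustration in plain Python (not HugeCTR code; the helper global_key is a hypothetical name), using only the first four slot sizes from the snippet above:

```python
from itertools import accumulate

# First four entries of slot_size_array from the configuration above.
slot_size_array = [278899, 355877, 203750, 18573]

# offsets[i] = slot_size_array[0] + ... + slot_size_array[i-1], with offsets[0] = 0.
offsets = [0] + list(accumulate(slot_size_array))[:-1]


def global_key(slot_id, local_key):
    """Map a per-slot key to a key that is unique across all slots."""
    if not 0 <= local_key < slot_size_array[slot_id]:
        raise ValueError("key is outside [0, slot_size_array[slot_id])")
    return offsets[slot_id] + local_key


print(offsets)            # [0, 278899, 634776, 838526]
print(global_key(1, 10))  # 278909
```

Because every slot's keys stay below its own size, the shifted ranges never collide.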

0x03 CSR Format

The embedding layer is built on top of the CSR format, so let’s look at the CSR format first.

3.1 What is CSR

A sparse matrix is a matrix in which most elements are 0. In practice, most large-scale matrices are sparse, so there are many storage formats designed specifically for them. Compressed Sparse Row (CSR) is one of them.

In the simplest format, each non-zero element is represented by a triple (row number, column number, value), as in the right-hand column of the figure above. This approach is simple, but it stores a lot of information (a row and a column for every element). Each triple locates its element on its own, so the space usage is not optimal.

CSR requires three kinds of data: values, column numbers, and row offsets. It does not encode each element as an independent triple; it encodes the matrix as a whole.

Value: the value of a non-zero element.
Column number: the column number of that element.
Row offset: the offset, within the values array, of the first element of each row.

In the figure above, the first element of the first row (1) is at offset 0, the first element of the second row (2) is at offset 2, the first element of the third row (5) is at offset 4, and the first element of the fourth row (6) is at offset 7. The total number of non-zero elements in the matrix, 9 in this example, is appended at the end of the row offsets.
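To make the three arrays concrete, here is a small sketch in plain Python. The 4x4 matrix is an assumption: the figure is not reproduced here, so the matrix was chosen only so that its row offsets match the ones described above (0, 2, 4, 7, and finally 9).

```python
def dense_to_csr(matrix):
    """Convert a dense matrix (list of rows) into CSR arrays."""
    values, col_indices, row_offsets = [], [], []
    for row in matrix:
        row_offsets.append(len(values))  # where this row starts inside values
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(col)
    row_offsets.append(len(values))  # final entry: total number of non-zeros
    return values, col_indices, row_offsets


# Assumed example matrix, consistent with the offsets in the text.
dense = [
    [1, 7, 0, 0],
    [0, 2, 8, 0],
    [5, 0, 3, 9],
    [0, 6, 0, 4],
]

values, col_indices, row_offsets = dense_to_csr(dense)
print(values)       # [1, 7, 2, 8, 5, 3, 9, 6, 4]
print(col_indices)  # [0, 1, 1, 2, 0, 2, 3, 1, 3]
print(row_offsets)  # [0, 2, 4, 7, 9]
```

Row i occupies values[row_offsets[i]:row_offsets[i + 1]], which is why the extra trailing offset is needed.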

3.2 CSR in HugeCTR

Let's look at an example. There are no column numbers here, because the CSR is used only to store the sparse keys of slots, and the keys in a slot can simply be stored in sequence.

* For example data:
*   4,5,1,2
*   3,5,1
*   3,2
* Will be converted to the form of:
* row offset: 0,4,7,9
* value: 4,5,1,2,3,5,1,3,2

Let's look at some more details in the source code of samples/ncf/preprocess-20m.py.

# Imports needed by this excerpt
import tqdm
from ctypes import c_longlong as ll, c_uint, c_float, c_int

def write_hugeCTR_data(huge_ctr_data, filename='huge_ctr_data.dat'):
    print("Writing %d samples"%huge_ctr_data.shape[0])
    with open(filename, 'wb') as f:
        #write header
        f.write(ll(0)) # 0: no error check; 1: check_num
        f.write(ll(huge_ctr_data.shape[0])) # the number of samples in this data file
        f.write(ll(1)) # dimension of label
        f.write(ll(1)) # dimension of dense feature
        f.write(ll(2)) # long long slot_num
        for _ in range(3): f.write(ll(0)) # reserved for future use

        for i in tqdm.tqdm(range(huge_ctr_data.shape[0])):
            f.write(c_float(huge_ctr_data[i,2])) # float label[label_dim];
            f.write(c_float(0)) # dummy dense feature
            f.write(c_int(1)) # slot 1 nnz: user ID
            f.write(c_uint(huge_ctr_data[i,0]))
            f.write(c_int(1)) # slot 2 nnz: item ID
            f.write(c_uint(huge_ctr_data[i,1]))
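Reading such a file back is a useful check on the layout. The sketch below is not HugeCTR code. It assumes a little-endian platform with 8-byte long long and 4-byte int and float, unsigned int keys, and a file written without the optional check fields (error_check == 0); the function name read_norm is hypothetical.

```python
import io
import struct

# 8 x long long: error_check, number_of_records, label_dim, dense_dim, slot_num, reserved[3]
HEADER = struct.Struct("<8q")


def read_norm(stream):
    """Parse a Norm data file written without check fields into a list of samples."""
    error_check, num_records, label_dim, dense_dim, slot_num, *_ = HEADER.unpack(
        stream.read(HEADER.size)
    )
    if error_check != 0:
        raise NotImplementedError("this sketch handles only error_check == 0")
    samples = []
    for _ in range(num_records):
        label = struct.unpack(f"<{label_dim}f", stream.read(4 * label_dim))
        dense = struct.unpack(f"<{dense_dim}f", stream.read(4 * dense_dim))
        slots = []
        for _ in range(slot_num):
            (nnz,) = struct.unpack("<i", stream.read(4))
            slots.append(list(struct.unpack(f"<{nnz}I", stream.read(4 * nnz))))
        samples.append((label, dense, slots))
    return samples


# Build a tiny two-sample file in memory with the same layout as the writer above:
# one label, one dummy dense feature, and two slots (user ID, item ID) with nnz = 1.
buf = io.BytesIO()
buf.write(HEADER.pack(0, 2, 1, 1, 2, 0, 0, 0))
for label, user, item in [(1.0, 7, 42), (0.0, 8, 43)]:
    buf.write(struct.pack("<ffiIiI", label, 0.0, 1, user, 1, item))
buf.seek(0)

samples = read_norm(buf)
print(samples[0])  # ((1.0,), (0.0,), [[7], [42]])
```

Each slot is stored as its nnz followed directly by that many keys, which is what makes the later conversion to CSR straightforward.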

3.3 The CSR class

3.3.1 Definition

Only member variables are given here, which can be verified with the above CSR format.

template <typename T>
class CSR {
 private:
  const size_t num_rows_;       /**< num rows. */
  const size_t max_value_size_; /**< number of element of value the CSR matrix will have for num_rows rows. */

  Tensor2<T> row_offset_tensor_;
  Tensor2<T> value_tensor_; /**< a unified buffer for row offset and value. */
  T* row_offset_ptr_;       /**< just offset on the buffer, note that the length of it is slot*batchsize+1. */
  T* value_ptr_;            /**< pointer of value buffer. */

  size_t size_of_row_offset_; /**< num of rows in this CSR buffer */
  size_t size_of_value_;      /**< num of values in this CSR buffer */

  size_t check_point_row_;   /**< check point of size_of_row_offset_. */
  size_t check_point_value_; /**< check point of size_of_value_. */
};

3.3.2 Constructors

The constructor allocates memory through CudaHostAllocator, that is, page-locked host memory obtained with cudaHostAlloc (see 4.2.1.2), rather than device memory.

/**
 * Ctor
 * @param num_rows num of rows is expected
 * @param max_value_size max size of value buffer.
 */
CSR(size_t num_rows, size_t max_value_size)
    : num_rows_(num_rows),
      max_value_size_(max_value_size),
      size_of_row_offset_(0),
      size_of_value_(0) {
  std::shared_ptr<GeneralBuffer2<CudaHostAllocator>> buff =
      GeneralBuffer2<CudaHostAllocator>::create();
  buff->reserve({num_rows + 1}, &row_offset_tensor_);
  buff->reserve({max_value_size}, &value_tensor_);
  buff->allocate();

  row_offset_ptr_ = row_offset_tensor_.get_ptr();
  value_ptr_ = value_tensor_.get_ptr();
}

3.3.3 Generate a new row

new_row starts a new row: it records the current total number of values as that row's entry in row_offset.

/**
 * Insert a new row to CSR
 * Whenever you want to add a new row, you need to call this.
 * When you have pushed back all the values, you need to call this method
 * again.
 */
inline void new_row() {  // call before push_back values in this line
  if (size_of_row_offset_ > num_rows_) CK_THROW_(Error_t::OutOfBound, "CSR out of bound");
  row_offset_ptr_[size_of_row_offset_] = static_cast<T>(size_of_value_);
  size_of_row_offset_++;
}

3.3.4 Inserting Data

push_back inserts a value and increases the total number of values.

/**
 * Push back a value to this object.
 * @param value the value to be pushed back.
 */
inline void push_back(const T& value) {
  if (size_of_value_ >= max_value_size_)
    CK_THROW_(Error_t::OutOfBound, "CSR out of bound " + std::to_string(max_value_size_) +
                                       "offset" + std::to_string(size_of_value_));
  value_ptr_[size_of_value_] = value;
  size_of_value_++;
}
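The calling protocol of new_row and push_back is easiest to see by replaying the earlier example. The class below is a hypothetical Python analogue written for illustration only; it mirrors the bounds checks of the C++ code but uses Python lists in place of pinned buffers.

```python
class CSRBuilder:
    """Illustrative stand-in for the new_row / push_back protocol of CSR."""

    def __init__(self, num_rows, max_value_size):
        self.num_rows = num_rows
        self.max_value_size = max_value_size
        self.row_offset = []
        self.value = []

    def new_row(self):
        # Same bound as the C++ code: at most num_rows + 1 offsets are stored.
        if len(self.row_offset) > self.num_rows:
            raise IndexError("CSR out of bound")
        self.row_offset.append(len(self.value))

    def push_back(self, v):
        if len(self.value) >= self.max_value_size:
            raise IndexError("CSR out of bound")
        self.value.append(v)


csr = CSRBuilder(num_rows=3, max_value_size=16)
for keys in [[4, 5, 1, 2], [3, 5, 1], [3, 2]]:
    csr.new_row()        # call before pushing the values of this row
    for key in keys:
        csr.push_back(key)
csr.new_row()            # closing call records the total number of values

print(csr.row_offset)  # [0, 4, 7, 9]
print(csr.value)       # [4, 5, 1, 2, 3, 5, 1, 3, 2]
```

The final new_row call is what produces the trailing offset 9, matching the comment in the C++ source that the method must be called once more after all values are pushed.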

0x04 Basic Data Structure

Because HugeCTR is in effect a specialized, compact deep learning system, it implements many basic facilities itself. To make the later analysis easier, we first introduce some basic data structures. Only the member variables and essential functions of each class are given below.

4.1 Tensors

First come the basic tensor concepts.

4.1.1 TensorBuffer2

TensorBuffer2 is the underlying data of a tensor. It is perhaps easiest to understand by analogy with PyTorch's data or storage.

class TensorBuffer2 {
 public:
  virtual ~TensorBuffer2() {}
  virtual bool allocated() const = 0;
  virtual void *get_ptr() = 0;
};

4.1.2 Tensor2

This corresponds to the tensor of TensorFlow (TF) or PyTorch.

template <typename T>
class Tensor2 {
  std::vector<size_t> dimensions_;
  size_t num_elements_;
  std::shared_ptr<TensorBuffer2> buffer_;
};

We introduce two of its member functions:

static Tensor2 stretch_from(const TensorBag2 &bag) {
  return Tensor2(bag.dimensions_, bag.buffer_);
}

TensorBag2 shrink() const {
  return TensorBag2(dimensions_, buffer_, TensorScalarTypeFunc<T>::get_type());
}

Details are as follows:

4.1.3 Tensors2

Tensors2 is a vector of Tensor2.

template <typename T> using Tensors2 = std::vector<Tensor2<T>>;

4.1.4 TensorBag2

PyTorch also has a class whose name carries the Bag suffix: alongside nn.Embedding there is nn.EmbeddingBag. When building a bag-of-words model, it is common for an Embedding to be followed by a Sum or Mean. nn.EmbeddingBag handles this case more efficiently, especially for variable-length sequences.

In HugeCTR, TensorBag2 can be thought of as a non-template class that puts a tensor in a bag: it holds the tensor's dimensions and buffer together with a tag for its scalar type.

class TensorBag2 {
  template <typename T>
  friend class Tensor2;

  std::vector<size_t> dimensions_;
  std::shared_ptr<TensorBuffer2> buffer_;
  TensorScalarType scalar_type_;
};

using TensorBags2 = std::vector<TensorBag2>;

The following functions show how tensors and bags are converted into each other.

template <typename T>
Tensors2<T> bags_to_tensors(const std::vector<TensorBag2> &bags) {
  Tensors2<T> tensors;
  for (const auto &bag : bags) {
    tensors.push_back(Tensor2<T>::stretch_from(bag));
  }
  return tensors;
}

template <typename T>
std::vector<TensorBag2> tensors_to_bags(const Tensors2<T> &tensors) {
  std::vector<TensorBag2> bags;
  for (const auto &tensor : tensors) {
    bags.push_back(tensor.shrink());
  }
  return bags;
}

4.1.5 SparseTensor

SparseTensor is a sparse tensor type, added in version 3.2 to unify the handling of the CSR format. Like a sparse matrix, it efficiently stores and processes tensors in which most elements are zero. We will analyze it further when we look at how data is read to the GPU. Comparing it with the CSR format, we can see that its internals correspond to the row offset and value of CSR. Its definition is as follows:

template <typename T>
class SparseTensor {
  std::vector<size_t> dimensions_;
  std::shared_ptr<TensorBuffer2> value_buffer_;
  std::shared_ptr<TensorBuffer2> rowoffset_buffer_;
  std::shared_ptr<size_t> nnz_;  // maybe size_t for FixedLengthSparseTensor
  size_t rowoffset_count_;
};

The schematic diagram is as follows:

Let's revisit the earlier example. There are no column numbers here, because the structure is used only to store the sparse keys of slots, and the keys in a slot can simply be stored in sequence.

* For example data:
*   4,5,1,2
*   3,5,1
*   3,2
* Will be converted to the form of:
* row offset: 0,4,7,9
* value: 4,5,1,2,3,5,1,3,2

Corresponding to the following figure:

Member functions are described as follows:

static SparseTensor stretch_from(const SparseTensorBag &bag) {
  return SparseTensor(bag.dimensions_, bag.value_buffer_, bag.rowoffset_buffer_, bag.nnz_,
                      bag.rowoffset_count_);
}

SparseTensorBag shrink() const {
  return SparseTensorBag(dimensions_, value_buffer_, rowoffset_buffer_, nnz_, rowoffset_count_,
                         TensorScalarTypeFunc<T>::get_type());
}

PyTorch

PyTorch has sparse_coo_tensor, which does something similar. PyTorch supports several tensor layouts, which can be found in torch/csrc/utils/tensor_layouts.cpp, for example at::Layout::Strided, at::Layout::Sparse, at::Layout::SparseCsr, at::Layout::Mkldnn, and so on. These correspond to different memory layouts.

When using a sparse tensor, a pair of dense tensors is provided, a values tensor and a two-dimensional indices tensor, along with other auxiliary parameters.

>>> i = [[1, 1]]
>>> v =  [3, 4]
>>> s = torch.sparse_coo_tensor(i, v, (3,))
>>> s
tensor(indices=tensor([[1, 1]]),
       values=tensor(  [3, 4]),
       size=(3,), nnz=2, layout=torch.sparse_coo)
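The coordinate (COO) style used by sparse_coo_tensor and the row-offset style used by HugeCTR carry the same information about rows. As a rough sketch in plain Python (no PyTorch or HugeCTR calls; the function name is hypothetical), the row coordinate of each stored key can be turned into CSR row offsets by counting keys per row and taking a running sum:

```python
def coo_rows_to_row_offsets(rows, num_rows):
    """Turn the row coordinate of each stored element into CSR row offsets."""
    counts = [0] * num_rows
    for r in rows:
        counts[r] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)  # running sum of per-row counts
    return offsets


# Row coordinate of each key in the running example (4, 3 and 2 keys per row).
rows = [0, 0, 0, 0, 1, 1, 1, 2, 2]
print(coo_rows_to_row_offsets(rows, num_rows=3))  # [0, 4, 7, 9]
```

Storing num_rows + 1 offsets takes less space than one row coordinate per element whenever rows hold more than one element on average.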

TensorFlow

TensorFlow also has the SparseTensor type to represent multidimensional sparse data. A SparseTensor uses three dense tensors:

Indices represents the coordinates of non-zero elements of a sparse tensor.
Values correspond to the value of each non-zero element.
Shape represents the shape of the sparse tensor after it is converted into a dense form.

For example:

import tensorflow as tf  # this example uses the TensorFlow 1.x API

indices = tf.constant([[0, 0], [1, 1], [2, 2]], dtype=tf.int64)
values = tf.constant([1, 2, 3], dtype=tf.float32)
shape = tf.constant([3, 3], dtype=tf.int64)
sparse = tf.SparseTensor(indices=indices,
                            values=values,
                            dense_shape=shape)
dense = tf.sparse_tensor_to_dense(sparse, default_value=0)
with tf.Session() as session:
    sparse, dense = session.run([sparse, dense])
    print('Sparse is :\n', sparse)
    print('Dense is :\n', dense)

Print it out as follows:

Sparse is :
 SparseTensorValue(indices=array([[0, 0],
       [1, 1],
       [2, 2]]), values=array([1., 2., 3.], dtype=float32), dense_shape=array([3, 3]))
Dense is :
 [[1. 0. 0.]
 [0. 2. 0.]
 [0. 0. 3.]]

4.1.6 SparseTensorBag

SparseTensorBag plays the same role for SparseTensor that TensorBag2 plays for Tensor2:

class SparseTensorBag {
  template <typename T>
  friend class SparseTensor;

  std::vector<size_t> dimensions_;
  std::shared_ptr<TensorBuffer2> value_buffer_;
  std::shared_ptr<TensorBuffer2> rowoffset_buffer_;
  std::shared_ptr<size_t> nnz_;
  size_t rowoffset_count_;
  TensorScalarType scalar_type_;

  SparseTensorBag(const std::vector<size_t> &dimensions,
                  const std::shared_ptr<TensorBuffer2> &value_buffer,
                  const std::shared_ptr<TensorBuffer2> &rowoffset_buffer,
                  const std::shared_ptr<size_t> &nnz, const size_t rowoffset_count,
                  TensorScalarType scalar_type)
      : dimensions_(dimensions),
        value_buffer_(value_buffer),
        rowoffset_buffer_(rowoffset_buffer),
        nnz_(nnz),
        rowoffset_count_(rowoffset_count),
        scalar_type_(scalar_type) {}

 public:
  SparseTensorBag() : scalar_type_(TensorScalarType::None) {}
  const std::vector<size_t> &get_dimensions() const { return dimensions_; }
};

4.1.7 Vector classes

Here are two vector classes for user convenience.

using TensorBags2 = std::vector<TensorBag2>;

template <typename T>
using SparseTensors = std::vector<SparseTensor<T>>;

4.2 Memory

Let's take a look at some memory-related classes.

4.2.1 Allocator

First, let's see how memory is allocated for objects such as tensors.

4.2.1.1 HostAllocator

HostAllocator manages memory on a host.

class HostAllocator {
 public:
  void *allocate(size_t size) const { return malloc(size); }
  void deallocate(void *ptr) const { free(ptr); }
};

The next few implementations call CUDA functions, such as cudaHostAlloc, to allocate memory. Interested readers can study them further.

4.2.1.2 CudaHostAllocator

CudaHostAllocator calls cudaHostAlloc to allocate page-locked memory on the host.

class CudaHostAllocator {
 public:
  void *allocate(size_t size) const {
    void *ptr;
    CK_CUDA_THROW_(cudaHostAlloc(&ptr, size, cudaHostAllocDefault));
    return ptr;
  }
  void deallocate(void *ptr) const { CK_CUDA_THROW_(cudaFreeHost(ptr)); }
};

4.2.1.3 CudaManagedAllocator

CudaManagedAllocator calls cudaMallocManaged, which allocates unified (managed) memory that can be used by both host and device code.

class CudaManagedAllocator {
 public:
  void *allocate(size_t size) const {
    void *ptr;
    CK_CUDA_THROW_(cudaMallocManaged(&ptr, size));
    return ptr;
  }
  void deallocate(void *ptr) const { CK_CUDA_THROW_(cudaFree(ptr)); }
};

4.2.1.4 CudaAllocator

This class allocates memory on the device.

class CudaAllocator {
 public:
  void *allocate(size_t size) const {
    void *ptr;
    CK_CUDA_THROW_(cudaMalloc(&ptr, size));
    return ptr;
  }
  void deallocate(void *ptr) const { CK_CUDA_THROW_(cudaFree(ptr)); }
};

4.2.2 GeneralBuffer2

Having seen how memory is allocated, let's look at how it is encapsulated, namely by GeneralBuffer2. GeneralBuffer2 can be thought of as a unified wrapper around one large chunk of memory, on top of which tensors can be placed.

4.2.2.1 Definition

Member functions are omitted here, including those of the inner classes.

allocator_: the concrete memory allocator, which also determines whether memory is allocated on the GPU or the CPU;
ptr_: points to the allocated memory;
total_size_in_bytes_: the memory size;
reserved_buffers_: the buffers reserved up front, before the memory is actually allocated.

The inner classes are:

BufferInternal is the interface.
TensorBufferImpl is the buffer implementation corresponding to Tensor2.
BufferBlockImpl is used when building a network.

The code is as follows:

template <typename Allocator>
class GeneralBuffer2 : public std::enable_shared_from_this<GeneralBuffer2<Allocator>> {
  
  class BufferInternal {
   public:
    virtual ~BufferInternal() {}
    virtual size_t get_size_in_bytes() const = 0;
    virtual void initialize(const std::shared_ptr<GeneralBuffer2> &buffer, size_t offset) = 0;
  };

  class TensorBufferImpl : public TensorBuffer2, public BufferInternal {
    size_t size_in_bytes_;
    std::shared_ptr<GeneralBuffer2> buffer_;
    size_t offset_;
  };

  template <typename T>
  class BufferBlockImpl : public BufferBlock2<T>, public BufferInternal {
    size_t total_num_elements_;
    std::shared_ptr<TensorBufferImpl> buffer_impl_;
    Tensor2<T> tensor_;
    bool finalized_;
    std::vector<std::shared_ptr<BufferInternal>> reserved_buffers_;
  };

  Allocator allocator_;
  void *ptr_;
  size_t total_size_in_bytes_;
  std::vector<std::shared_ptr<BufferInternal>> reserved_buffers_;
};

4.2.2.2 TensorBufferImpl

It points to a GeneralBuffer2 and sets its offset and size.

void initialize(const std::shared_ptr<GeneralBuffer2> &buffer, size_t offset) {
  buffer_ = buffer;
  offset_ = offset;
}

4.2.2.3 BufferBlockImpl key functions

BufferBlockImpl is best understood by comparing it with TensorBufferImpl.

BufferBlock2 is the interface class of BufferBlockImpl.

template <typename T>
class BufferBlock2 {
 public:
  virtual ~BufferBlock2() {}
  virtual void reserve(const std::vector<size_t> &dimensions, Tensor2<T> *tensor) = 0;
  virtual Tensor2<T> &as_tensor() = 0;
};

BufferBlockImpl is a group of tensors laid out contiguously. Some implementations require contiguous memory, for example the weights.

std::shared_ptr<BufferBlock2<float>> train_weight_buff = blobs_buff->create_block<float>();

// Omit other code......

network->train_weight_tensor_ = train_weight_buff->as_tensor();

BufferBlockImpl has a reserve method, which creates an internal tensor on top of its memory.

void reserve(const std::vector<size_t> &dimensions, Tensor2<T> *tensor) override {
  if (finalized_) {
    throw std::runtime_error(ErrorBase + "Buffer block is finalized.");
  }
  size_t num_elements = get_num_elements_from_dimensions(dimensions);
  size_t size_in_bytes = num_elements * TensorScalarSizeFunc<T>::get_element_size();
  std::shared_ptr<TensorBufferImpl> buffer_impl =
      std::make_shared<TensorBufferImpl>(size_in_bytes);
  reserved_buffers_.push_back(buffer_impl);

  *tensor = Tensor2<T>(dimensions, buffer_impl);

  total_num_elements_ += num_elements;
}

initialize configures the internal buffers: each reserved tensor gets its offset inside the block, and the block itself is exposed as one tensor spanning all of them.

void initialize(const std::shared_ptr<GeneralBuffer2> &buffer, size_t offset) {
  size_t local_offset = 0;
  for (const std::shared_ptr<BufferInternal> &buffer_impl : reserved_buffers_) {
    buffer_impl->initialize(buffer, offset + local_offset);
    local_offset += buffer_impl->get_size_in_bytes();
  }
  reserved_buffers_.clear();

  if (!finalized_) {
    buffer_impl_ = std::make_shared<TensorBufferImpl>(
        total_num_elements_ * TensorScalarSizeFunc<T>::get_element_size());
    tensor_ = Tensor2<T>({total_num_elements_}, buffer_impl_);
    finalized_ = true;
  }
  buffer_impl_->initialize(buffer, offset);
}
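The offset arithmetic of BufferBlockImpl::initialize can be replayed in a few lines. This is a plain Python sketch, not HugeCTR code; block_offsets and the byte sizes are made-up illustrations. It shows that tensors reserved inside one block are packed back to back, without padding between them, starting at the block's own offset.

```python
def block_offsets(block_start, sizes_in_bytes):
    """Offsets of tensors reserved inside one block: packed back to back."""
    offsets, local_offset = [], 0
    for size in sizes_in_bytes:
        offsets.append(block_start + local_offset)
        local_offset += size
    # local_offset is now the number of bytes spanned by the block as a whole.
    return offsets, local_offset


offsets, total = block_offsets(block_start=64, sizes_in_bytes=[40, 12, 20])
print(offsets, total)  # [64, 104, 116] 72
```

Because nothing is inserted between the sub-tensors, the block can also be viewed as a single tensor covering all of its elements, which is what as_tensor returns.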

4.2.2.4 GeneralBuffer2 key functions

The reserve method records the memory requirement of a tensor in reserved_buffers_ as a TensorBufferImpl, and then creates the tensor from that TensorBufferImpl.

template <typename T>
void reserve(const std::vector<size_t> &dimensions, Tensor2<T> *tensor) {
  if (allocated()) {
    throw std::runtime_error(ErrorBase + "General buffer is finalized.");
  }

  size_t size_in_bytes =
      get_num_elements_from_dimensions(dimensions) * TensorScalarSizeFunc<T>::get_element_size();
  std::shared_ptr<TensorBufferImpl> buffer_impl =
      std::make_shared<TensorBufferImpl>(size_in_bytes);
  reserved_buffers_.push_back(buffer_impl);

  *tensor = Tensor2<T>(dimensions, buffer_impl);
}

create_block creates a BufferBlock2 and registers it in reserved_buffers_.

template <typename T>
std::shared_ptr<BufferBlock2<T>> create_block() {
  if (allocated()) {
    throw std::runtime_error(ErrorBase + "General buffer is finalized.");
  }
  std::shared_ptr<BufferBlockImpl<T>> block_impl = std::make_shared<BufferBlockImpl<T>>();
  reserved_buffers_.push_back(block_impl);
  return block_impl;
}

allocate iterates over the registered BufferInternal objects, assigns each one its offset, accumulates their sizes (each rounded up to a multiple of 32 bytes), and finally calls allocator_ to allocate the memory.

void allocate() {
  if (ptr_ != nullptr) {
    throw std::runtime_error(ErrorBase + "Memory has already been allocated.");
  }

  size_t offset = 0;
  for (const std::shared_ptr<BufferInternal> &buffer : reserved_buffers_) {
    // Configure BufferInternal (e.g. TensorBufferImpl)
    buffer->initialize(this->shared_from_this(), offset);
    size_t size_in_bytes = buffer->get_size_in_bytes();
    if (size_in_bytes % 32 != 0) {
      size_in_bytes += (32 - size_in_bytes % 32);
    }
    offset += size_in_bytes;
  }
  reserved_buffers_.clear();
  total_size_in_bytes_ = offset;

  if (total_size_in_bytes_ != 0) {
    ptr_ = allocator_.allocate(total_size_in_bytes_);
  }
}
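The reserve-then-allocate pattern can be imitated with a toy class. The sketch below is not the HugeCTR implementation: ToyBuffer and its members are hypothetical names, a bytearray stands in for GPU or pinned memory, and only the 32-byte rounding is taken from the code above.

```python
ALIGNMENT = 32  # same rounding unit as in allocate above


class ToyBuffer:
    """Records size requests first, then makes one allocation for all of them."""

    def __init__(self):
        self.reserved = []   # sizes in bytes, in reservation order
        self.offsets = None  # filled in by allocate
        self.memory = None

    def reserve(self, num_elements, element_size):
        if self.memory is not None:
            raise RuntimeError("buffer is finalized")
        self.reserved.append(num_elements * element_size)
        return len(self.reserved) - 1  # handle used to look up the offset later

    def allocate(self):
        if self.memory is not None:
            raise RuntimeError("memory has already been allocated")
        offset, self.offsets = 0, []
        for size in self.reserved:
            self.offsets.append(offset)
            offset += -(-size // ALIGNMENT) * ALIGNMENT  # round up to a multiple of 32
        self.memory = bytearray(offset)  # a single allocation backs every tensor


buf = ToyBuffer()
row_offset = buf.reserve(num_elements=4, element_size=8)  # 32 bytes
values = buf.reserve(num_elements=9, element_size=8)      # 72 bytes, padded to 96
dense = buf.reserve(num_elements=5, element_size=4)       # 20 bytes, padded to 32
buf.allocate()

print(buf.offsets, len(buf.memory))  # [0, 32, 128] 160
```

Deferring the allocation means the total size is known before any memory is requested, so many small tensors cost only one call to the underlying allocator.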

4.2.3 Summary

At this point, the logic around Tensor has expanded as follows:

The buffer of TensorBufferImpl is a GeneralBuffer2;
With CudaAllocator, the ptr_ of GeneralBuffer2 is allocated in GPU memory. GeneralBuffer2 can be thought of as a unified wrapper around one large chunk of memory, on top of which several tensors can live. These tensors first reserve memory, and the memory is then allocated in one go;
The offset_ of TensorBufferImpl points to a specific offset within the memory at the ptr_ of GeneralBuffer2;
BufferBlockImpl is used to give a group of tensors contiguous memory.

If there is another Tensor2, the offset_ of its TensorBufferImpl points to a different offset within the same GPU memory, so that, for example, Tensor 1 and Tensor 2 share one GeneralBuffer2.

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF References

Developer.nvidia.com/blog/introd…

Developer.nvidia.com/blog/announ…

Developer.nvidia.com/blog/accele…

Read HugeCTR source code

How does embedding propagate back

Web.eecs.umich.edu/~justincj/t…

Sparse matrix storage format summary + storage efficiency comparison :COO,CSR,DIA,ELL,HYB
