0x00 Abstract

Horovod is an easy-to-use, high-performance distributed training framework released by Uber in 2017 that has been widely used in the industry.

This series takes you through the source code analysis of Horovod. This article, the second in a series of about 15 to 18, takes a look at Horovod from a user’s perspective.

See below for the previous one:

Horovod — (1) Basics

0x01 Horovod introduction

Horovod is an easy-to-use, high-performance distributed training framework released by Uber in 2017 that supports TensorFlow, Keras, PyTorch, and MXNet. Horovod takes its name from a traditional Russian folk dance in which dancers join hands and dance in a circle, much like the way distributed TensorFlow processes use Horovod to communicate with each other.

Because machine learning frameworks may use the underlying collective communication libraries (NCCL, OpenMPI, Gloo, etc.) at different levels, they may not take full advantage of what these libraries offer. Horovod therefore integrates with these frameworks to provide an easy-to-use and efficient solution.

Uber's engineers built on a Facebook paper, "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", and on Baidu's "Bringing HPC Techniques to Deep Learning", improved on them, and released the open-source framework Horovod.

Compared with Baidu's work, Horovod makes no new academic contribution, but its solid engineering implementation has brought it much more attention. Its biggest advantage is that it provides a higher-level abstraction over ring allreduce, which lets it support many different frameworks. It also introduces Nvidia NCCL, which is more GPU-friendly.

Horovod relies on Nvidia's NCCL2 for allreduce and on MPI for inter-process communication, which simplifies the development of synchronous multi-GPU or multi-node distributed training. Thanks to NCCL2, Horovod can also take advantage of NVLink, RDMA, and GPUDirect RDMA, automatically detect the communication topology, and fall back to PCIe and TCP/IP communication when necessary.

We use a few questions to guide the analysis:

  • How does Horovod split the data?
  • How does Horovod distribute the training code?
  • What do the Python and C++ parts do when Horovod starts?
  • How does Horovod ensure that the workers start up consistently?

0x02 Horovod mechanism overview

2.1 Horovod mechanism

Horovod uses a data-parallel strategy to distribute training across GPUs.

In data parallelization, each GPU in a job receives its own separate slice of the data batch, known as its “batch slice.” Each GPU uses its own assigned data to calculate independently and perform gradient updates.

If two GPUs are used and the batch size is 32, the first GPU processes the forward and backward passes for the first 16 records, and the second GPU processes the forward and backward passes for the last 16 records. The gradient updates are then averaged across GPUs and finally applied to the model.
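To make the arithmetic concrete, here is a minimal illustration (plain Python, not Horovod code) of how a batch could be sliced by rank:

import numpy as np

def batch_slice(batch, rank, num_workers):
    """Return the slice of `batch` that the worker with this rank processes."""
    per_worker = len(batch) // num_workers        # 32 // 2 = 16
    start = rank * per_worker
    return batch[start:start + per_worker]

batch = np.arange(32)                             # a fake batch of 32 records
print(batch_slice(batch, rank=0, num_workers=2))  # records 0..15
print(batch_slice(batch, rank=1, num_workers=2))  # records 16..31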

The operation method for each iteration is as follows:

  1. Each worker maintains its own copy of the model weights and its own copy of the dataset.

  2. Upon receiving the execution signal, each worker extracts a disjoint mini-batch from the dataset and computes the gradients for that batch.

  3. The workers use the ring allreduce algorithm to synchronize their gradients, so that every node locally ends up with the same averaged gradients (a toy sketch of the two phases follows this list):

    1. The gradient tensor on each device is split into num_devices fragments of roughly equal size, and in each communication round a device sends one of its fragments to its next neighbor (while receiving a new fragment from its previous neighbor).

    2. ScatterReduce phase: after num_devices - 1 rounds of communication and addition, each device holds the complete sum of one tensor fragment, i.e. for that fragment it has the sum of that fragment's values across all devices.

    3. AllGather phase: after another num_devices - 1 rounds of communication and copying, the completed fragments computed in the previous phase are propagated to all other devices, so that eventually every node holds all of the summed fragments.

    4. The fragments on each device are concatenated to obtain the summed gradient, which is divided by num_devices to get the averaged gradient.

  4. Each worker applies the gradient update to its local copy of the model.

  5. The next batch is processed.
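To make the two phases concrete, here is a toy single-process simulation of ring allreduce in plain Python/NumPy (devices are simulated as list entries; this only illustrates the algorithm and is not Horovod's implementation):

import numpy as np

def ring_allreduce(tensors):
    """Toy simulation of ring allreduce; returns the averaged tensor."""
    n = len(tensors)                                   # number of simulated devices
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # ScatterReduce phase: after n-1 rounds each device owns the full sum
    # of exactly one chunk.
    for step in range(n - 1):
        sends = [(dev, (dev - step) % n, chunks[dev][(dev - step) % n].copy())
                 for dev in range(n)]
        for dev, idx, payload in sends:
            chunks[(dev + 1) % n][idx] += payload      # neighbor adds the received chunk

    # AllGather phase: circulate the completed chunks so every device ends
    # up holding every summed chunk.
    for step in range(n - 1):
        sends = [(dev, (dev + 1 - step) % n, chunks[dev][(dev + 1 - step) % n].copy())
                 for dev in range(n)]
        for dev, idx, payload in sends:
            chunks[(dev + 1) % n][idx] = payload       # neighbor overwrites with the complete chunk

    # Every device now holds the same sums; divide by n to average.
    return np.concatenate(chunks[0]) / n

grads = [np.arange(8.0) * (r + 1) for r in range(4)]   # 4 fake workers
print(ring_allreduce(grads))                            # == np.arange(8.0) * 10 / 4

Note that each device sends 2 * (num_devices - 1) fragments in total, so the communication volume per device is independent of the number of devices, which is what makes ring allreduce bandwidth-efficient.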

0x03 Example code

3.1 Overall code

Here is an abridged version of the sample code from the official website; see the comments in the code below for details.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Horovod: initialize Horovod.
hvd.init()  # Initialize Horovod; this starts the Horovod background thread and the related MPI threads

# Horovod: pin GPU to be used to process local rank (one GPU per process)
# Assign a GPU to each process based on its local rank
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

(mnist_images, mnist_labels), _ = \
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

# Slice data
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
             tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    ...
    tf.keras.layers.Dense(10, activation='softmax')
])

# Horovod: adjust learning rate based on number of GPUs.
scaled_lr = 0.001 * hvd.size()  # Scale up the learning rate according to the number of workers
opt = tf.optimizers.Adam(scaled_lr)

# Horovod: add Horovod DistributedOptimizer.
# Wrap the regular TensorFlow Optimizer with Horovod and use Ring-AllReduce to get the average gradient
opt = hvd.DistributedOptimizer(
    opt, backward_passes_per_step=1, average_aggregated_gradients=True)

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt, metrics=['accuracy'],
                    experimental_run_tf_function=False)

callbacks = [
    # Broadcast initial variables: transmit the model parameters from rank 0 to
    # the other processes to ensure consistent initialization
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1),
]

# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

# Horovod: write logs on worker 0.
verbose = 1 if hvd.rank() == 0 else 0

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(dataset, steps_per_epoch=500 // hvd.size(), callbacks=callbacks, epochs=24, verbose=verbose)

3.2 horovodrun

A Horovod training script is not launched like an ordinary Python script; for example, you cannot run it with python train.py. You need to launch it with the dedicated CLI command horovodrun. (The training script train.py must be manually copied to the same directory on every node.)

$ horovodrun -np 4 -H localhost:4 python train.py
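To launch across several machines, the hosts and their slot counts are listed with -H. For example, following the Horovod documentation (the host names here are placeholders):

$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py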

0x04 Running logic

Let’s go through the sequence and see what’s going on behind the scenes during the initialization process.

4.1 Importing Python Files

The following code imports the relevant Python modules.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

4.2 Initializing in Python

The initialization of the Python world is in horovod-master/horovod/mxnet/mpi_ops.py

4.2.1 Loading the SO library

4.2.1.1 SO library

horovod/tensorflow/mpi_ops.py loads the SO library,

for example dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so.

This SO library is the compiled result of Horovod's C++ code.

def _load_library(name):
    """Loads a .so file containing the specified operators. """
    filename = resource_loader.get_path_to_datafile(name)
    library = load_library.load_op_library(filename)
    return library

# Check possible symbol not found error from tensorflow version mismatch
try:
    MPI_LIB = _load_library('mpi_lib' + get_ext_suffix())
except Exception as e:
    check_installed_version('tensorflow', tf.__version__, e)
    raise e
else:
    check_installed_version('tensorflow', tf.__version__)
4.2.1.2 Role of the SO library

The imported library wraps C++ functions for Python so that C++ code can be called from the Python world.

As you can see below, Python's _allreduce function forwards the call to C++; the work is done by MPI_LIB.horovod_allreduce.

def _allreduce(tensor, name=None, op=Sum, prescale_factor=1.0, postscale_factor=1.0,
               ignore_name_scope=False):
    if name is None and not _executing_eagerly():
        name = 'HorovodAllreduce_%s' % _normalize_name(tensor.name)
    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
                                     prescale_factor=prescale_factor,
                                     postscale_factor=postscale_factor,
                                     ignore_name_scope=ignore_name_scope)
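For reference, a minimal snippet that exercises this path through the user-facing hvd.allreduce op might look like the following (a sketch; it assumes horovod.tensorflow is installed and the script is launched with horovodrun):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Each worker contributes a different tensor; hvd.allreduce averages them
# across workers, internally forwarding to the C++ op shown above.
local = tf.constant([1.0, 2.0]) * float(hvd.rank() + 1)
averaged = hvd.allreduce(local)
print('rank', hvd.rank(), '->', averaged.numpy())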

4.2.2 Initial Configuration

The excerpt below shows the main part: a _HorovodBasics instance is created, and various functions, variables, and configuration flags are then obtained from it, for example whether MPI and Gloo were compiled in.

from horovod.common.basics import HorovodBasics as _HorovodBasics

_basics = _HorovodBasics(__file__, 'mpi_lib')

# import basic methods
init = _basics.init
size = _basics.size
local_size = _basics.local_size
rank = _basics.rank
local_rank = _basics.local_rank
mpi_built = _basics.mpi_built
gloo_enabled = _basics.gloo_enabled
......

4.2.3 Initialization of hvd.init()

You first need to initialize Horovod with hvd.init(); all the state that Horovod manages is held by the hvd object.

# Horovod: initialize Horovod.
hvd.init()

This calls the init function in HorovodBasics, so let's see what it does.

As you can see, this code goes straight into the C++ world via MPI_LIB_CTYPES calls, so next we will follow it into the C++ world.

def init(self, comm=None):
    """A function that initializes Horovod. """
    atexit.register(self.shutdown)

    if not isinstance(comm, list):
        mpi_built = self.MPI_LIB_CTYPES.horovod_mpi_built()

        from mpi4py import MPI
        if MPI._sizeof(MPI.Comm) == ctypes.sizeof(ctypes.c_int):
            MPI_Comm = ctypes.c_int
        else:
            MPI_Comm = ctypes.c_void_p
            self.MPI_LIB_CTYPES.horovod_init_comm.argtypes = [MPI_Comm]

        comm_obj = MPI_Comm.from_address(MPI._addressof(comm))
        self.MPI_LIB_CTYPES.horovod_init_comm(comm_obj)
    else:
        comm_size = len(comm)
        self.MPI_LIB_CTYPES.horovod_init(
            (ctypes.c_int * comm_size)(*comm), ctypes.c_int(comm_size))

The current logic is as follows:

           Import python files

                    +
                    |
                    |
                    v

           Import C++ SO files
                    |
                    |
                    |
                    v

           Create _HorovodBasics
                    +
                    |
                    |
                    v
                hvd.init()
                    +
Python              |
+------------------------------------------+
C++                 |
                    |
                    v

4.3 Initializing in C++

4.3.1 horovod_init_comm

At initialization time, Horovod will:

  • Call MPI_Comm_dup to get a Communicator, so you have the basis for MPI coordination.
  • Then InitializeHorovodOnce is called.
void horovod_init_comm(MPI_Comm comm) {
  MPI_Comm_dup(comm, &mpi_context.mpi_comm);
  InitializeHorovodOnce(nullptr, 0);
}

4.3.2 InitializeHorovodOnce

InitializeHorovodOnce does the main initialization work. It does the following:

  • Creates a Controller for the global state, depending on whether MPI or Gloo was compiled in;
  • Starts the background thread BackgroundThreadLoop, which coordinates the workers.
void horovod_init(const int* ranks, int nranks) {
  InitializeHorovodOnce(ranks, nranks);
}

void InitializeHorovodOnce(const int* ranks, int nranks) {
  // Ensure background thread is only started once.
  if (!horovod_global.initialize_flag.test_and_set()) {
    horovod_global.control_operation = ParseControllerOpsFromEnv();
    horovod_global.cpu_operation = ParseCPUOpsFromEnv();

#if HAVE_MPI // Depends on whether MPI was compiled in
    // Enable mpi if it's used either in cpu data transfer or controller
    if (horovod_global.cpu_operation == LibType::MPI ||
        horovod_global.control_operation == LibType::MPI) {
      mpi_context.Enable();
    }
    if (horovod_global.control_operation == LibType::MPI) {
      // Create an MPIController object
      horovod_global.controller.reset(new MPIController(
          horovod_global.response_cache,
          horovod_global.tensor_queue, horovod_global.timeline,
          horovod_global.parameter_manager, horovod_global.group_table,
          mpi_context));
      horovod_global.controller->SetRanks(ranks, nranks);
    }
#endif

#if HAVE_GLOO // Depends on whether Gloo was compiled in
    // Enable gloo if it's used either in cpu data transfer or controller
    if (horovod_global.cpu_operation == LibType::GLOO ||
        horovod_global.control_operation == LibType::GLOO) {
      gloo_context.Enable();
    }
    if (horovod_global.control_operation == LibType::GLOO) {
      horovod_global.controller.reset(new GlooController(
          horovod_global.response_cache,
          horovod_global.tensor_queue, horovod_global.timeline,
          horovod_global.parameter_manager, horovod_global.group_table,
          gloo_context));
    }
#endif

    // Reset initialization flag
    horovod_global.initialization_done = false;

    // Start the background thread
    horovod_global.background_thread = std::thread(
        BackgroundThreadLoop, std::ref(horovod_global));
  }

  // Wait to ensure that the background thread has finished initializing MPI.
  while (!horovod_global.initialization_done) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}

4.3.3 HorovodGlobalState

In the C++ world, HorovodGlobalState acts as the central store for global state.

HorovodGlobalState is a global variable in Horovod whose members can be accessed by different threads. It is created when the C++ code is loaded, together with the various contexts (mpi_context, nccl_context, gpu_context).

Horovod initializes the various members of HorovodGlobalState in BackgroundThreadLoop. The most important ones are:

  • Controller manages the overall communication control flow;
  • Tensor_queue handles communication requirements (AllReduce, broadcast, etc.) from the front end.
// All the Horovod state that must be stored globally per-process.
HorovodGlobalState horovod_global;

#if HAVE_MPI
MPIContext mpi_context;
#endif

#if HAVE_GLOO
GlooContext gloo_context;
#endif

......

std::unique_ptr<OperationManager> op_manager;

HorovodGlobalState is summarized as follows:

struct HorovodGlobalState {

  // Background thread running MPI communication.
  std::thread background_thread; // Background thread, used to coordinate between various workers

  ParameterManager parameter_manager; // Maintain the background overall parameter configuration

  // Encapsulates the fusion buffers, handles resizing and auto-tuning of buffer
  // size.
  FusionBufferManager fusion_buffer; // Combine tensor to reduce communication overhead

  std::shared_ptr<Controller> controller; // Manage the overall communication control flow

  TensorQueue tensor_queue; // Handle the communication requirements from the front end (AllReduce, broadcast, etc.)

  // Pointer to shared buffer for allgather
  void* shared_buffer = nullptr;

  // LRU cache of Responses
  ResponseCache response_cache;

  // Information on registered groups.
  GroupTable group_table;

  ~HorovodGlobalState() {
    // Make sure that the destructor of the background thread is safe to
    // call. If a thread is still joinable (not detached or complete) its
    // destructor cannot be called.
    if (background_thread.joinable()) {
      shut_down = true;
      background_thread.join();
    }
  }
};

The current logic is as follows:

           Import python files
                    +
                    |
                    v
           Import C++ SO files
                    +
                    |
                    v
          Create _HorovodBasics
                    +
                    |
                    v
                hvd.init()
                    +
 Python             |
+----------------------------------------------------------------------+
 C++                |
                    v                      +---------------------------+
           horovod_init_comm               |  HorovodGlobalState       |
                    +                      |                           |
                    |      horovod_global  |    TensorQueue            |
                    v      +-------------> |    background_thread      |
         InitializeHorovodOnce             |    ParameterManager       |
                    +                      |    FusionBufferManager    |
                    |                      |    Controller             |
                    v                      |    ResponseCache          |
           background_thread               |    shared_buffer          |
                                           +---------------------------+
     mpi_context    gloo_context    op_manager


At this point, Horovod initialization is complete and the user code can proceed.

4.4 Horovod concepts

Next in user code is the concept of rank.

hvd.local_rank()

hvd.rank()

Let’s introduce the following concepts:

  • Horovod launches one copy of the training script for each GPU on a machine. The local rank is the unique number (also the process number or GPU device ID) assigned to each training process on a given machine; it ranges from 0 to n-1, where n is the number of GPU devices on that machine.
  • The rank can be thought of as a unique global number (used for inter-process communication) that identifies one training process in the distributed job. Rank 0 usually has a special meaning in Horovod: it is the device responsible for the synchronization.
    • In Baidu's implementation, different ranks play different roles: rank 0 acts as a coordinator that coordinates the MPI requests from the other ranks. This is an engineering consideration, and the design was later adopted by Horovod.
    • Rank 0 is also used to broadcast parameters to the other processes and to store checkpoints.
  • world_size: the total number of processes. Training does not start until all world_size processes are ready.

The purpose of init is to let the parallel processes know which rank / local rank they have been assigned, so that each process can choose its GPU and set up the required GPU memory allocation based on its local rank (its index among the GPU cards on the node).
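A tiny script (a sketch, to be launched with horovodrun; host names and the file name are placeholders) makes these concepts visible:

import horovod.tensorflow as hvd

hvd.init()

# With e.g. `horovodrun -np 4 -H host1:2,host2:2 python rank_demo.py`,
# size() is 4, rank() runs 0..3 globally, and local_rank() runs 0..1 per host.
print('world size:', hvd.size(),
      '| rank:', hvd.rank(),
      '| local rank:', hvd.local_rank())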

4.5 Data Processing

Next comes the data processing.

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
             tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

Here are a few caveats:

  • First, the training data needs to be placed where any node can access it.

  • Second, the data needs to be sharded: different ranks on different machines must read different shards, so that the data each GPU process trains on is different.

  • For data parallelism, the dataset itself needs to be split into multiple shards, and each Horovod worker reads its own shard.

You can see this more clearly from the PyTorch example script.

# Horovod: use DistributedSampler to partition the training data.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs)
  • The sampler component of the DataLoader returns an iterable of indices of the records to draw from the dataset. The default sampler in PyTorch is sequential, returning the sequence 0, 1, 2, …, n. The example overrides this behavior with DistributedSampler, which partitions the dataset across machines. DistributedSampler takes two arguments: the total number of processes (hvd.size(), e.g. 16) and this process's index in the overall list (hvd.rank(), e.g. 0…15).

  • In PyTorch distributed data-parallel training, each process actually loads data independently, so every process loads the same dataset and then uses some rule to carve out a different subset according to its rank. DistributedSampler is the mechanism that ensures each DataLoader loads only a specific subset of the entire dataset (splitting the data into world_size subsets by rank yourself and then loading them would have the same effect).

  • Because the subsets must be carved out deterministically so that each process gets a different part of the data, the dataset cannot be shuffled arbitrarily at load time, which is why the parameter 'shuffle': False is used.
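Back in the TensorFlow example from section 3.1, a comparable effect can be obtained with tf.data's shard transformation, as in this sketch (shard before shuffle so that each worker sees a disjoint subset):

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
     tf.cast(mnist_labels, tf.int64))
)
# Keep only every hvd.size()-th record, starting at this worker's rank.
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())
dataset = dataset.repeat().shuffle(10000).batch(128)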

4.6 Broadcasting the initial variables

The following code performs the broadcast of the initial variables.

hvd.callbacks.BroadcastGlobalVariablesCallback(0)

This code guarantees that parameters are initialized only on rank 0 and are then broadcast to the other nodes; in other words, variables are propagated from the first process to all other processes, so that every process starts from the same initial model parameters.

Let's take a look at how broadcast is used in Horovod.

4.6.1 Broadcast definition

The concrete implementation of broadcasting is:

class BroadcastGlobalVariablesCallbackImpl(object):
    def __init__(self, backend, root_rank, device='', *args):
        super(BroadcastGlobalVariablesCallbackImpl, self).__init__(*args)
        self.backend = backend
        self.root_rank = root_rank
        self.device = device
        self.broadcast_done = False

    def on_batch_end(self, batch, logs=None):
        if self.broadcast_done:
            return

        with tf.device(self.device):
            if hvd._executing_eagerly() and hasattr(self.model, 'variables'):
                # TensorFlow 2.0 or TensorFlow eager
                hvd.broadcast_variables(self.model.variables,
                                        root_rank=self.root_rank)
                hvd.broadcast_variables(self.model.optimizer.variables(),
                                        root_rank=self.root_rank)
            else:
                bcast_op = hvd.broadcast_global_variables(self.root_rank)
                self.backend.get_session().run(bcast_op)

        self.broadcast_done = True

4.6.2 broadcast_variables

broadcast_variables calls _make_broadcast_group_fn to do the work; you can see that broadcast is called for every variable in the execution graph.

def broadcast_variables(variables, root_rank):
    """Broadcasts variables from root rank to all other processes.

    Arguments:
        variables: variables for broadcast
        root_rank: rank of the process from which global variables will be
                   broadcasted to all other processes.
    """
    broadcast_group = _make_broadcast_group_fn()
    return broadcast_group(variables, root_rank)

And _make_broadcast_group_fn is defined as:

@_cache
def _make_broadcast_group_fn():
    if _executing_eagerly():
        # Eager mode will parallelize independent control flow
        def broadcast_group(variables, root_rank):
            for var in variables:
                var.assign(broadcast(var, root_rank))

        return _make_subgraph(broadcast_group)
    else:
        # Graph mode requires an Op
        def broadcast_group(variables, root_rank):
            return tf.group(*[var.assign(broadcast(var, root_rank))
                              for var in variables])

        return broadcast_group

4.6.3 Calling MPI

broadcast simply calls the MPI function that actually does the work.

def broadcast(tensor, root_rank, name=None, ignore_name_scope=False):
    """An op which broadcasts the input tensor on root rank to the same input
    tensor on all other Horovod processes.

    The broadcast operation is keyed by the name of the op. The tensor type
    and shape must be the same on all Horovod processes for a given name. The
    broadcast will not start until all processes are ready to send and
    receive the tensor.

    Returns:
        A tensor of the same shape and type as `tensor`, with the value
        broadcasted from root rank.
    """
    if name is None and not _executing_eagerly():
        name = 'HorovodBroadcast_%s' % _normalize_name(tensor.name)
    return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
                                     ignore_name_scope=ignore_name_scope)

4.6.4 Synchronizing Parameters

In the background thread, parameters are synchronized periodically when appropriate.

bool RunLoopOnce(HorovodGlobalState& state) {
  // Business logic function

  if (state.parameter_manager.IsAutoTuning()) {
    bool should_sync =
        state.parameter_manager.Update(tensor_names, total_tensor_size);
    // Check whether synchronization is needed; if so, synchronize.
    if (should_sync) {
      state.controller->SynchronizeParameters();
    }
  }
  ......
}

Parameter synchronization is also done by calling the Bcast function.

void Controller::SynchronizeParameters() {
  ParameterManager::Params param;
  if (is_coordinator_) { // rank 0 performs this operation
    param = parameter_manager_.GetParams();
  }

  void* buffer = (void*)(&param);
  size_t param_size = sizeof(param);
  Bcast(buffer, param_size, 0, Communicator::GLOBAL);

  if (!is_coordinator_) { // the workers perform this operation
    parameter_manager_.SetParams(param);
  }
}

4.7 DistributedOptimizer

Finally, you need to configure the DistributedOptimizer, which is one of the key points.

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(
    opt, backward_passes_per_step=1, average_aggregated_gradients=True)

The TF Optimizer is the key API for model training: it takes each op's gradient and uses it to update the weights. Horovod wraps the original TF Optimizer with hvd.DistributedOptimizer.

The DistributedOptimizer wrapper delegates gradient computation to the original optimizer passed in as input; each machine in the cluster thus uses the original optimizer to compute its own local gradient.

The Horovod DistributedOptimizer then uses allreduce or allgather to merge the gradients globally, and finally applies the averaged gradients on all devices.

Let's trace the call chain:

  • hvd.DistributedOptimizer inherits from the Keras Optimizer; the gradient computation itself is still done by the original optimizer that is passed in.
  • After the gradients are computed, hvd.allreduce or hvd.allgather is called to average them across processes.
  • Finally, the averaged gradients are applied, achieving a cluster-wide gradient merge (a simplified sketch follows this list).
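As a highly simplified sketch of this idea (not Horovod's actual implementation; Horovod's own TF2 examples wrap the tape with hvd.DistributedGradientTape), a custom TF2 training step could compute local gradients, average them with hvd.allreduce, and then apply them:

import tensorflow as tf
import horovod.tensorflow as hvd

@tf.function
def train_step(model, opt, loss_fn, images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Local gradients, as the wrapped (original) optimizer would see them.
    local_grads = tape.gradient(loss, model.trainable_variables)
    # Cluster-wide merge: average each gradient tensor across all workers.
    avg_grads = [hvd.allreduce(g) for g in local_grads]
    # Apply the averaged gradients on every worker.
    opt.apply_gradients(zip(avg_grads, model.trainable_variables))
    return loss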

Details will be introduced later.

4.8 Possible future directions

Horovod's current architecture assumes that the model parameters fit on a single GPU.

In the future, it will be interesting to see whether this can be combined with model sharding.

In addition, if the model has many fully connected layers, the tight coupling of those layers combined with allreduce's BSP-style synchronization can still make network communication time a bottleneck. In a ring-allreduce setting it is therefore also worth exploring changes to the synchronization protocol, such as replacing BSP with SSP, or using gradient compression to speed up the allreduce process.

0x05 Summary

We can now answer the questions raised at the beginning of the article:

  • How does Horovod split the data?
    • Answer: some frameworks shard the data automatically. If the framework does not provide data sharding, the user needs to shard the data so that the data each GPU process trains on is different.
  • How does Horovod distribute the training code?
    • Answer: the user needs to manually copy the training code to every node.
  • What do the Python and C++ parts do when Horovod starts?
    • Answer: the Python part loads the C++ library and initializes various variables and configuration. The C++ part initializes the MPI and Gloo contexts and starts a background thread to handle internal communication.
  • How does Horovod ensure that the workers start up consistently?
    • Answer: parameters are initialized only on rank 0 and then broadcast to the other nodes, i.e. variables are propagated from the first process to all other processes, so that all processes start from the same initial parameters.

The next article will delve into the Python world.

0xEE Personal information

Thoughts on life and technology

Wechat public account: Rosie’s Thinking

If you want timely notifications of my articles, or want to see the technical material I recommend, please follow this account.

0xFF References

This is enough for you to know about The Pytorch distributed training!

Horovod usage: Distributed model training with Horovod

Spark’s new vision: Make deep learning easier to use

Scaling model training in PyTorch using distributed data parallel

Scaling model training in PyTorch using distributed data parallelism

A developer-friendly guide to mixed precision training with PyTorch

Developer-friendly PyTorch hybrid precision training guide

It’s 2020, why isn’t deep learning 100% on the cloud yet?

By 2020, why can’t we have 100% deep learning in the cloud?

Take you through the Horovod Distributed training framework

Using Horovod in Amazon SageMaker pipeline mode to implement multi-GPU distributed training

Kubernetes Training _ Distributed deep learning training using Horovod on Kubernetes

Horovod- Based on the TensorFlow distributed deep learning framework

Paracel: ten questions

PARACEL: Making distributed machine learning easy

Introduction to Spark on Angel large-scale distributed machine learning platform

Introduction to distributed TensorFlow

Parameter server — a new killer for distributed machine learning

NCCL– Collective Communication technology of GPU

Oar heterogeneous parameter server architecture

Discuss the network related problems in distributed machine learning system

Tencent large-scale distributed machine learning system is how to select technology

How to understand Nvidia multi-GPU multi-card communication framework NCCL?

Baidu introduces high performance computing into deep learning: it can efficiently implement large-scale expansion of models

4: AdamOptimizer

Parallel computing in machine learning

Distributed Machine Learning (1) – Parallel computing and machine learning

Distributed Machine Learning (MIDDLE) – Parallel computing and machine learning

Distributed Machine Learning (part 2) – Federated learning

[Distributed ML] Parameter Server & Ring All-Reduce

Distributed and parallel processing systems for deep learning

Parallel and distributed deep learning

Distributed Machine Learning, Federated Learning (Shusen Wang)