0x00 Abstract

In previous articles, we studied the basic modules of PyTorch distributed and walked through some official examples. Starting with this article, we will look at PyTorch elastic training. This first article introduces its history and design philosophy, and compares it with Horovod.

Note: a systematic review of Horovod will follow in later articles, and more comparative analysis will be added then.

0x01 Pain Points

Machine learning models keep getting larger, and a single GPU can no longer hold all the model parameters, so training generally uses a large number of nodes or an entire cluster. As the training scale grows, hardware weaknesses or design flaws make single-point failures increasingly likely, which brings a number of problems and shortcomings, such as:

  • Pain point 1: No fault tolerance (disaster recovery).

    • Problem: the failure of a single node often ends the entire training job. Although frameworks provide checkpoint functionality, calling it too frequently causes performance problems, so some training progress is still lost when a failure occurs, and the job has to queue again.
    • Ideal state: the failure of a single node does not affect the overall training. When a node fails, it is automatically removed and training continues smoothly.
  • Pain point 2: Lack of awareness of elastic compute capacity and of a mechanism for dynamically scaling training up and down.

    • Problem: users can only specify a fixed, static amount of resources when submitting a job and cannot detect cluster resources dynamically in real time. As a result, cluster resource utilization is low.
    • Ideal state: training should be able to start with just a few idle machines. When more resources become available, the elastic job cooperates with the upper-level scheduling system to detect these potential resources and automatically increase the number of workers during training. When the job no longer needs them, resources are released automatically. In addition, when the number of workers changes, the training job is not interrupted, achieving a smooth transition.
  • Pain point 3: Inflexible cluster resource configuration and scheduling.

    • Problem: dynamic worker configuration is not supported, nor is preemption by high-priority instances. Therefore, when resources are insufficient, they cannot be released on demand for other high-priority business, and the job can only be terminated voluntarily or fail with an error.

    • Ideal state: training jobs can be preempted, resources can be proactively released, and jobs can migrate between machines of different types and configurations.

0x02 Difficulties

Let’s take a look at the challenges and difficulties we need to face to achieve elastic training. This is only from an engineering perspective, without considering issues such as data partitioning, learning rate, or batch size adjustment.

  • Difficulty 1: A mechanism is needed for nodes/processes to discover each other.

How do the other nodes/training processes become aware when a node or training process joins or leaves?

  • Difficulty 2: How to handle membership changes.

How should the remaining workers react when membership changes?

  • Difficulty 3: How to catch the failure of a single training process.

How to manage all the training processes on a single node, so that when a process fails it can be caught and either retried or restarted?

  • Difficulty 4: How to integrate with existing training code.

How to integrate with existing code with minimal effort, without introducing complex abstractions, while shielding the underlying implementation from the user as much as possible?

0x03 TorchElastic

Let’s take a general look at PyTorch elasticity. The PyTorch elastic mechanism was incorporated from TorchElastic (github.com/pytorch/ela…), so let’s take a look at TorchElastic.

TorchElastic (TE) was officially merged into PyTorch in version 1.9. Let’s look at the history of elastic training at two points in time.

3.1 History

3.1.1 PyTorch 1.7

The release notes mention that TE has been bundled into the PyTorch Docker image:

  • [Stable] TorchElastic now bundled into PyTorch docker image

TorchElastic offers a strict superset of the torch.distributed.launch CLI, with added features for fault tolerance and elasticity. If users are not interested in fault tolerance, they can get exact functionality/behavior parity by setting max_restarts=0, with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specifying them in torch.distributed.launch).
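To make that parity concrete, here is a rough sketch of equivalent launch commands, using the torch.distributed.run module name from the later merge into PyTorch 1.9; the host address, port, job id, and the train.py script name are placeholder assumptions:

  # classic launcher: node rank and master address/port are specified manually
  python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=4 \
      --master_addr=10.0.0.1 --master_port=29500 train.py

  # elastic launcher: with max_restarts=0 the behavior matches the classic launcher,
  # but RANK and the master address/port are assigned automatically via rendezvous
  python -m torch.distributed.run --nnodes=2 --nproc_per_node=4 \
      --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29400 --rdzv_id=job_42 \
      --max_restarts=0 train.py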

3.1.2 PyTorch 1.9

As you can see from the TE repository, TE has now been incorporated into the main PyTorch 1.9 release.

IMPORTANT: This repository is deprecated.

  1. TorchElastic has been upstreamed to PyTorch 1.9 under torch.distributed.elastic. Please refer to the PyTorch documentation here.

3.2 Design Concept

PET has gone through two versions; its design philosophy can be seen at github.com/pytorch/ela…

3.2.1 Basic Functions

PyTorch Elastic Trainer (PET) provides a framework for training models across a cluster in a fault-tolerant and elastic manner. PET provides these capabilities in two ways:

  • When a PyTorch worker process throws a certain class of retriable errors, it is caught by PET and the training process is retried.
  • As long as the number of workers stays within the range specified when the job started, new workers can leave or join the process pool of an existing training job at any time. When membership changes, all workers re-rendezvous to establish a new process group, and training resumes from the previous good state.

To integrate with PET, PyTorch users need to make the following changes to their training logic (an illustrative sketch follows this list):

  • Users need to let PET control their training loop.
    • In essence, the user provides an “inner training” loop, which PET wraps in a retryable outer loop.
    • The PET loop is the retry loop: it is responsible for establishing or re-establishing the process group and restoring the user’s training to a good state.
  • When a new worker joins the process pool, the user needs to specify what the state is and how to apply that state to the new worker.
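The following is a purely illustrative sketch of that contract, not the actual PET v0.1 API: the names State, capture, apply, and inner_train_loop are invented here only to show what the user supplies (the state definition and an inner loop) versus what PET owns (the outer retry loop).

  # hypothetical sketch only -- not the real PET v0.1 interface
  class State:
      """Everything that must survive a membership change."""
      def __init__(self, model, optimizer, epoch=0):
          self.model, self.optimizer, self.epoch = model, optimizer, epoch

      def capture(self):
          # snapshot taken from a surviving worker
          return {"model": self.model.state_dict(),
                  "optim": self.optimizer.state_dict(),
                  "epoch": self.epoch}

      def apply(self, snapshot):
          # how a newly joined worker catches up with the group
          self.model.load_state_dict(snapshot["model"])
          self.optimizer.load_state_dict(snapshot["optim"])
          self.epoch = snapshot["epoch"]

  def inner_train_loop(state, batches, train_step):
      # the user's "inner training" loop; PET wraps it in a retry loop that
      # re-establishes the process group and re-applies the state on failure
      for batch in batches:
          train_step(state, batch)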

3.2.2 Overview of the New Design

PET v0.2 incorporates many lessons learned from v0.1. Below is the design concept of v0.2.

  • Dynamic range

In PET v0.2, we no longer attempt to recover errors inside the training function. Instead, PET tries to keep the number of worker processes within the [min, max] range required by the job. The application writer is responsible for loading from an existing checkpoint file and restarting from it. Unlike v0.1, PET v0.2 does not prescribe how checkpoints are managed; application writers can use torch.save and torch.load, or a higher-level framework such as PyTorch Lightning, to handle this themselves.

  • The local agent

PET v0.2 uses a new process called the elastic agent, with one separate elastic agent per node. Each agent process manages only the set of local worker processes on its node, and coordinates with the elastic agents on the other nodes of the job to determine changes in process-group membership. The details are shown in the figure below:

Image source: github.com/pytorch/ela…

  • Membership changes

Membership changes are handled as follows: when a worker process fails, the elastic agent that manages it kills all the workers on that node, establishes a new rendezvous with the other agents, and restarts the workers with the new rendezvous information. However, when an agent exits with a non-zero error code, it is up to an upper-level scheduling module, such as Kubernetes, to restart the agent (and that agent will then restart all the workers it is responsible for). The same recovery mechanism applies to node-level failures. Orchestration tools (such as Kubernetes) schedule jobs so that they run with the minimum required number of agent replicas, and each agent in turn orchestrates the user’s training script.

Image source: github.com/pytorch/ela…

  • Compatibility

To adopt PET v0.2, an application simply needs to make its entry point or main function compatible with the PyTorch distributed launcher. We expect distributed training jobs that are started via the distributed launcher to be started seamlessly via an elastic agent, with no changes or only minimal code changes. The only difference is that in the latter case, the application can keep making progress even when some failures occur.
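As a minimal sketch of such an entry point (the train.py script name and the gloo backend are assumptions made here for illustration), both launchers export RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the script can initialize its process group straight from the environment:

  import os
  import torch.distributed as dist

  def main():
      # both the classic launcher and the elastic agent export
      # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT for each worker
      dist.init_process_group(backend="gloo", init_method="env://")

      # LOCAL_RANK is exported by torch.distributed.run
      # (and by torch.distributed.launch when --use_env is passed)
      local_rank = int(os.environ.get("LOCAL_RANK", 0))
      print(f"rank={dist.get_rank()} local_rank={local_rank} "
            f"world_size={dist.get_world_size()}")

      # ... build the model, wrap it in DistributedDataParallel, run the training loop ...

      dist.destroy_process_group()

  if __name__ == "__main__":
      main()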

3.2.3 Bare-Bones

The new PET design is intentionally bare-bones: it trades the granularity with which an application can recover for simplicity and robustness.

In the future, TE hopes to provide more convenient APIs for the checkpoint mechanism, which developers can optionally use to get more efficient restart semantics.

Since PET is “bare-bones,” it gives some guidance on how users should handle things themselves, for example checkpointing.

In the event of a failure or membership change, all surviving workers are killed immediately. You therefore need to checkpoint manually and save your progress periodically to ensure that training can continue after a restart. The checkpoint frequency should depend on how much lost work the job can tolerate. The following structure is recommended for user scripts:

  def main():
    load_checkpoint(checkpoint_path)   # restore previous progress, if any
    initialize()
    train()

  def train():
    for batch in iter(dataset):
      train_step(batch)

      if should_checkpoint:
        save_checkpoint(checkpoint_path)
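For reference, here is one possible implementation of the save_checkpoint/load_checkpoint helpers above using torch.save/torch.load. PET does not mandate this scheme; the model and optimizer objects are assumed to be created by initialize(), and how the restored state is applied is left to initialize() as well:

  import os
  import torch

  _restored_state = None             # loaded here, consumed later by initialize()

  def load_checkpoint(path):
      global _restored_state
      if os.path.exists(path):       # very first start: nothing to restore
          _restored_state = torch.load(path, map_location="cpu")

  def save_checkpoint(path):
      # `model` and `optimizer` are assumed to be created by initialize()
      tmp = path + ".tmp"
      torch.save({"model": model.state_dict(),
                  "optimizer": optimizer.state_dict()}, tmp)
      os.replace(tmp, path)          # write-then-rename to avoid a torn checkpoint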

3.3 Summary

It is not hard to see that TE’s design concept mainly answers the four difficulties raised earlier.

  • Difficulty 1: A mechanism is needed for nodes/processes to discover each other.

TE’s answer: when membership changes, all workers re-rendezvous to establish a new process group. Rendezvous is exactly that discovery mechanism.

  • Difficulty 2: How to handle membership changes.

TE’s answer: when a worker process fails, the elastic agent that manages it kills all the workers on that node, establishes a new rendezvous with the other agents, and restarts the workers with the new rendezvous information. However, when an agent exits with a non-zero error code, it is up to an upper-level scheduling module, such as Kubernetes, to restart the agent (and that agent will then restart all the workers it is responsible for).

  • Difficulty 3: How to catch the failure of a single training process, and how to manage all training processes on a single node.

TE’s answer: each agent process manages only the set of local worker processes on its node, and coordinates with the elastic agents on the other nodes of the job to determine changes in process-group membership.

  • Difficulty 4: How to integrate with existing training code.

TE’s answer: an application simply needs to make its entry point or main function compatible with the PyTorch distributed launcher.

0x04 Questions

4.1 VS Horovod

Since we already have a foundation in Horovod elastic training, we use Horovod as a baseline, pose a series of questions, and then go to PyTorch to look for the answers.

  • How is the local training process managed?

    • Horovod manages the local training processes through a background Driver process.
    • TE manages the local training processes through a background Agent process.

  • How is state saved?

    • Horovod provides a built-in implementation that checkpoints in memory via state.commit(), called periodically between training steps (see the sketch after this list).
    • TE leaves saving/loading checkpoints to the user’s own implementation.

  • How are new nodes discovered?

    • Horovod lets users implement their own node-discovery logic: the user provides a discovery_hosts.sh script that lists the nodes participating in training, and Horovod executes it periodically to discover the current nodes.
    • TE uses the distributed consistency middleware etcd, or its own c10d backend (based on TCPStore), to solve the problem of nodes discovering one another.

  • How are exceptions caught?

    • Horovod catches collective-communication exceptions, node exceptions, and scaling events, converts them into Horovod’s own exceptions, and then, based on its configuration (for example, internally blacklisting the faulty nodes), rebuilds the ring and continues training.

    • TE defines a monitoring mechanism that periodically checks the local processes for failures and converts them into internal state values for handling. If a worker fails, the node’s Agent restarts all workers on that node and starts a new round of rendezvous; since it is a new rendezvous, the other nodes also restart their workers, and training then continues together.
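For comparison, here is a minimal sketch of Horovod elastic training; the model, data loader, and hyperparameters are placeholders. On a membership change, the surviving workers roll back to the last state.commit() and the decorated function is re-entered with the synchronized state:

  import torch
  import horovod.torch as hvd

  hvd.init()

  model = torch.nn.Linear(10, 1)
  optimizer = hvd.DistributedOptimizer(
      torch.optim.SGD(model.parameters(), lr=0.01),
      named_parameters=model.named_parameters())

  def make_batches():                      # placeholder data loader
      for _ in range(100):
          yield torch.randn(32, 10), torch.randn(32, 1)

  @hvd.elastic.run
  def train(state):
      for state.epoch in range(state.epoch, 10):
          for x, y in make_batches():
              optimizer.zero_grad()
              loss = torch.nn.functional.mse_loss(model(x), y)
              loss.backward()
              optimizer.step()
          state.commit()                   # in-memory checkpoint of model/optimizer/epoch

  state = hvd.elastic.TorchState(model, optimizer, epoch=0)
  train(state)

Such a script is typically launched with horovodrun, passing --min-np/--max-np for the elastic worker range and --host-discovery-script pointing at the discovery script mentioned above.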

4.2 Questions about TE

Below are some questions about TE internals, which will be answered step by step in subsequent analyses.

  • Do the RANK and WORLD_SIZE fields still need to be set manually?
  • How is RANK determined across different nodes? Does the RANK 0 instance act as a master?
  • How are workers restarted after a worker failure?
  • How does TE handle a new worker after discovering it?
  • Each agent contains a rendezvous; do these rendezvous have master/slave roles? Is there a master that keeps track of the current cluster state?
  • How can the number of workers participating in training be dynamically increased or decreased?

0x05 PyTorch Distributed series

Other articles in the PyTorch distributed series are as follows:

PyTorch distributed (1) —— History and Overview

PyTorch: How to use the GPU

PyTorch distributed (2) —— DataParallel (part 1)

PyTorch distributed (3) —— DataParallel (part 2)

PyTorch distributed (4) —— Distributed application concepts

PyTorch distributed (5) —— DistributedDataParallel – what it is and how to use it

PyTorch distributed (6) —— DistributedDataParallel

PyTorch distributed (7) —— DistributedDataParallel – process groups

PyTorch distributed (8) —— DistributedDataParallel

PyTorch distributed (9) —— DistributedDataParallel – initialization

PyTorch distributed (10) —— DistributedDataParallel – Reducer static architecture

PyTorch distributed (11) —— DistributedDataParallel – building the Reducer and the Join operation

PyTorch distributed (12) —— DistributedDataParallel – forward propagation

PyTorch distributed (13) —— DistributedDataParallel – backward propagation

PyTorch Distributed Autograd (1) —— design

PyTorch Distributed Autograd (2) —— RPC foundation

PyTorch Distributed Autograd (3) ——

PyTorch Distributed Autograd (4) ——

PyTorch Distributed Autograd (5) ——

PyTorch Distributed Autograd (6) ——

PyTorch Distributed Optimizer (1) —— cornerstone

PyTorch Distributed Optimizer (2) —— data-parallel optimizer

PyTorch Distributed Optimizer (3) —— model parallelism

PyTorch distributed (14) —— Using Distributed Autograd and the Distributed Optimizer

PyTorch distributed (15) —— Implementing a parameter server with the distributed RPC framework

PyTorch distributed (16) —— Implementing batch RPC with asynchronous execution

PyTorch distributed (17) —— Combining DDP with the distributed RPC framework

PyTorch distributed (18) —— Distributed pipeline parallelism using RPC

0xEE Personal information

★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF Reference

Exploration and practice of elastic distributed training for VIVO AI computing Platform

Pytorch/elastic analysis