DataParallel

One form of parallelism is data parallelism. DataParallel replicates the module onto multiple GPUs, splits each batch evenly across the cards, and each card runs the forward pass on its own share of the data independently. During the backward pass, the gradients from all cards are aggregated onto the original module, achieving parallel training.

However, the card holding the original module comes under higher memory pressure than the others; in other words, this approach suffers from load imbalance. For details, see the article PyTorch: An Attempt at Multi-GPU Training.
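For reference, a minimal sketch of the DataParallel usage described above; Model, data_loader, and the loss function are placeholders I have added, not code from the original article:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Wrap the module; batches fed to the wrapper are split across all visible GPUs
model = nn.DataParallel(Model()).cuda()
optimizer = torch.optim.Adam(model.parameters())

for inputs, labels in data_loader:
    inputs, labels = inputs.cuda(), labels.cuda()
    loss = F.cross_entropy(model(inputs), labels)  # forward runs on each replica
    optimizer.zero_grad()
    loss.backward()   # gradients are gathered back onto the default (first) GPU
    optimizer.step()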

Multiprocessing

Another approach is to use Python multiprocessing: each GPU runs its own process, each process holds its own copy of the model and its own share of the data, and the gradients from all cards are aggregated and sent back to every card, achieving parallel training.

This approach avoids the load imbalance of the first one, and using multiple processes also sidesteps Python's GIL.

However, the PyTorch documentation recommends using DataParallel rather than multiprocessing:

Use nn.DataParallel instead of multiprocessing

Most use cases involving batched inputs and multiple GPUs should default to using DataParallel to utilize more than one GPU. Even with the GIL, a single Python process can saturate multiple GPUs. Large numbers of GPUs (8+) might not be fully utilized. However, this is a known issue that is under active development. As always, test your use case. There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.

Example

Below is an example of multi-process, multi-GPU training.

import os
import torch.distributed as dist
import torch.multiprocessing as mp
...

def run(gpu_id):
    model = Model()
    model.cuda(gpu_id)
    for epoch in range(epochs):
        train(epoch, gpu_id, model, optimizer, data_loader)

        if gpu_id == 0:
            validate(...)
        
# Aggregate the gradients from all cards and average them before the update
def average_gradients(model):
    size = float(dist.get_world_size())
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= size
            
def train(epoch, gpu_id, model, optimizer, data_loader):
    model.train()
    
    for i, data in enumerate(data_loader):
        data = {key: value.to(gpu_id) for key, value in data.items()}
        model.zero_grad()
        
        loss = model.forward(data)
        
        loss.backward()
        
        if world_size > 1:
            average_gradients(model)
        optimizer.step()
        
def init_process(host, port, rank, fn, backend="nccl"):
    """ Host: STR port: STR rank: int fn: train "" Backend: PyTorch ""
    os.environ["MASTER_ADDR"] = host
    os.environ["MASTER_ADDR"] = port
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    fn(rank)
    
if __name__ == "__main__":
    mp.set_start_method("spawn")
    
    processes = []
    # ranks lists the ids of the available cards; one process per card
    for rank in ranks:
        p = mp.Process(target=init_process, args=(host, port, rank, run))
        p.start()
        processes.append(p)
    
    for p in processes:
        p.join()

Or refer directly to the official tutorial, Writing Distributed Applications with PyTorch.

Apex

NVIDIA has developed a set of helper utilities to support parallel and mixed-precision training. GitHub address: Apex.

It mainly wraps PyTorch's multi-process, multi-GPU training code, and, importantly, it supports FP16 and mixed-precision training.

But I haven't used Apex yet, so I won't cover it here.

Accumulating gradients

The accumulating gradients pattern is also used in Practical Tips for 1-GPU, Multi-GPU & Distributed Setups and in huggingface/pytorch-pretrained-BERT.

The method is very simple. Suppose batch_size=32, but a single GPU's memory cannot hold it. We can split the batch into 8 parts for the forward pass, feeding a mini-batch of size 4 each time: compute the loss on the mini-batch and run backward, but do not update the parameters yet. Then forward the second mini-batch, compute its loss and run backward, so its gradients are added to those of the first mini-batch. Once the gradients of all eight mini-batches have been accumulated, update the parameters.

This is easy to implement in PyTorch, because the cached gradients are only reset when we call model.zero_grad() or optimizer.zero_grad().
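For illustration, a minimal sketch of this pattern, reusing the model, optimizer, and data_loader placeholders from the example above and assuming an accumulation of 8 mini-batches:

accumulation_steps = 8            # 8 mini-batches of size 4 -> effective batch size 32

model.zero_grad()
for step, data in enumerate(data_loader):
    loss = model.forward(data)
    # Scale so the accumulated gradient equals the average over the full batch
    (loss / accumulation_steps).backward()   # gradients accumulate in p.grad

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()          # update with the accumulated gradients
        model.zero_grad()         # only now are the cached gradients reset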

Thomas Wolf gives an example in a gist, Accumulating Gradients. Or we can refer directly to huggingface/pytorch-pretrained-BERT.

Accumulating gradients can be combined with distributed training to increase the effective batch size. For large models such as BERT with additional downstream layers, the per-card batch size can often only be set to single digits, while the paper suggests a batch size of 16 or 32, and a batch size that is too small also hurts model performance. We therefore need a larger batch size to stabilize training.
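A sketch of combining the two, reusing the train and average_gradients functions from the distributed example above; accumulation_steps is a placeholder I have added, not part of the original code:

def train(epoch, gpu_id, model, optimizer, data_loader, accumulation_steps=8):
    model.train()
    model.zero_grad()

    for i, data in enumerate(data_loader):
        data = {key: value.to(gpu_id) for key, value in data.items()}
        loss = model.forward(data) / accumulation_steps
        loss.backward()

        # Only sync gradients across cards and update once per accumulated large batch
        if (i + 1) % accumulation_steps == 0:
            if world_size > 1:
                average_gradients(model)
            optimizer.step()
            model.zero_grad()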


Welcome to my personal blog: Alex Chiu's learning space.