Preface

This article brings you 28 PyTorch best practices, as many as I can muster. I have used and studied most of them, but a few (such as the distributed optimizations) I have not used myself; feel free to read those selectively, just enough to know where to find them when you need them.

Without further ado, the text begins below 👏👏👏

General optimization

The first two pieces of advice are about data loading. In PyTorch, the core component for data loading is the DataLoader, whose constructor takes the following arguments:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

1. Use multi-process data loading

The default setting for DataLoader is num_workers=0, which means data loading is synchronous and done in the main process: training has to wait for data to become available, so data loading can block computation. This default still makes sense in some cases, namely when resources for sharing data between processes (for example, shared memory or file descriptors) are limited, or when the entire dataset is small and can be loaded fully into memory. Furthermore, error traces in this mode are more readable, which is useful for debugging.

To prevent data loading from blocking the computation code, PyTorch provides a simple switch: set num_workers > 0 to enable multi-process data loading. In this mode, num_workers worker processes are created each time a DataLoader iterator is created (for example, when we call enumerate(dataloader)). Each worker receives the following initialization parameters: dataset, collate_fn, and worker_init_fn.

For map-style datasets, the main process uses the sampler to generate indices and sends them to the worker processes. So any shuffling randomization is done in the main process, which guides loading by assigning the indices.

For iterable-style datasets, multi-process loading often results in duplicated data, because each worker process gets its own copy of the dataset object. Users can configure each copy individually using torch.utils.data.get_worker_info() or worker_init_fn. For similar reasons, with multi-process loading, the drop_last parameter drops the last incomplete batch of each worker's dataset replica.

Extension: PyTorch divides datasets into two types: map-style and iterable-style. The difference is that the former represents a mapping from indices/keys to data samples, while the latter represents an iterable over data samples. See the official documentation for details.

Multi-process data loading has two advantages: first, because data fetching is asynchronous, the main process can keep training while new data batches are being prepared; second, loading large amounts of data becomes faster because multiple processes run in parallel. The drawback, however, is obvious: it can easily exhaust memory. So set the num_workers value carefully according to how much memory you have.

As a rule of thumb, num_workers is best set to 4 * num_GPUs.
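
For example, a minimal sketch of enabling worker processes (the dataset here is assumed to be defined elsewhere):

from torch.utils.data import DataLoader

# 4 workers per GPU is a reasonable starting point on a single-GPU machine
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for batch in loader:
    ...  # training step; the workers prefetch upcoming batches in parallel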

2. Use pinned memory

CUDA divides host-side memory into pageable memory and page-locked (pinned) memory. The differences are as follows:

Memory type | Allocation API | Can be paged out | DMA (Direct Memory Access)
Pageable memory | OS API (e.g. malloc) | Yes | No
Page-locked (pinned) memory | CUDA API (cudaMallocHost / cudaHostAlloc) | No | Yes

Compared with pageable memory, pinned memory has a faster transfer rate, typically about twice that of pageable memory.

By default, host (CPU) memory allocations are pageable. The GPU cannot access data directly in pageable memory, so when transferring data from pageable memory to device memory, the CUDA driver must first allocate a temporary pinned buffer, copy the host data into that pinned buffer, and then transfer the data from pinned memory to device memory, as shown in the figure below (Source).

As you can see, pinned memory here is only used as a staging area for transfers between host memory and device memory. We can speed things up by storing data directly in pinned memory, avoiding the extra copy from pageable memory to pinned memory. For data loading, simply set the DataLoader parameter pin_memory=True to place the fetched data tensors in pinned memory, which enables faster (asynchronous) data transfer to CUDA-enabled GPUs.

However, the default pinning logic only recognizes tensors, and maps and iterables containing tensors. So if the pinning logic sees a batch of a custom type (for example, when collate_fn returns a custom batch type), or if each element of the batch is of a custom type, it will not recognize them and will return that batch (or those elements) without pinning the memory. To enable memory pinning for a custom batch or data type, define a pin_memory() method on the custom type.

The following is an example:

import torch
from torch.utils.data import TensorDataset, DataLoader

class SimpleCustomBatch:
    def __init__(self, data):
        transposed_data = list(zip(*data))
        self.inp = torch.stack(transposed_data[0], 0)
        self.tgt = torch.stack(transposed_data[1], 0)

    # Custom memory-pinning method on the custom batch type
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                    pin_memory=True)

for batch_ndx, sample in enumerate(loader):
    print(sample.inp.is_pinned())
    print(sample.tgt.is_pinned())

For the performance gains from using multi-process data loading and pinned memory, see the following figure:

Strictly speaking, pinned memory belongs under GPU optimization, but since it relates to data loading it is covered here.

3. Disable gradient calculation for validation or inference

PyTorch saves intermediate buffers from all operations that involve tensors requiring gradients. Gradients are generally not needed for validation or inference. The context manager torch.no_grad() can be used to disable gradient calculation within a given block of code, which speeds up execution and reduces memory usage. torch.no_grad() can also be used as a function decorator.

Example:

>>> x = torch.tensor([1.], requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>> @torch.no_grad()
... def doubler(x):
...     return x * 2
>>> z = doubler(x)
>>> z.requires_grad
False

4. Disable bias in convolutional layers before BatchNorm

The bias argument of Conv2d() defaults to True (as do Conv1d and Conv3d).

If an nn.Conv2d layer is directly followed by an nn.BatchNorm2d layer, the bias in the convolution is not needed; that is, nn.Conv2d should be created with bias=False. Because BatchNorm subtracts the mean in its first step, the effect of the bias is effectively cancelled. Keeping the bias is redundant, and disabling it reduces the number of model parameters.

This also applies to 1D and 3D convolutions, as long as the BatchNorm (or other normalization layer) normalizes over the same dimension as the convolution's bias. If we use the models provided by torchvision, we do not need to worry about this, because they already implement this optimization.

This optimization is most relevant for convolutional layers, which often appear in blocks like Conv -> BatchNorm -> ReLU. For example, here is a block used by MobileNet:

nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False),
nn.BatchNorm2d(hidden_dim),
nn.ReLU6(inplace=True),

The following block does not apply to this optimization:

nn.Linear(64, 32),
nn.ReLU(),
nn.BatchNorm1d(32),

In this case, although the bias of nn.Linear and the mean subtraction of nn.BatchNorm1d do act on the same dimension, the bias passes through the ReLU nonlinearity before the normalization, so it is not simply cancelled and there is no actual duplication of work.

5. Set gradients to None instead of calling .zero_grad()

To zero the gradient, instead of using this method:

model.zero_grad()
# or
optimizer.zero_grad()

Instead, use the following method:

for param in model.parameters():
    param.grad = None

The second method does not write zeros to the memory of each individual parameter, and the subsequent backward pass uses assignment (write only) rather than accumulation (read and write) to store gradients, which reduces the number of memory operations.

Starting with PyTorch 1.7, we can achieve the same effect with zero_grad(set_to_none=True), which resets the gradients to None instead of zero.
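
A minimal illustration (model and optimizer are assumed to be defined):

# since PyTorch 1.7: reset gradients to None rather than writing zeros in place
optimizer.zero_grad(set_to_none=True)
# or, equivalently, on the module itself
model.zero_grad(set_to_none=True)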

6. Fuse pointwise operations

Pointwise operations (element-wise addition, multiplication, math functions such as sin(), cos(), sigmoid(), etc.) can be fused into a single kernel to amortize memory access time and kernel launch time.

The PyTorch JIT can fuse kernels automatically, although there may be additional fusion opportunities that the compiler has not yet implemented, and not all device types are supported equally.

Pointwise operations are memory-bound; PyTorch launches a separate kernel for each of them. Each kernel loads data from memory, performs the (usually cheap) computation, and stores the result back to memory.

A fused operator launches only one kernel for multiple pointwise operations and loads/stores the data to memory only once. This makes JIT fusion useful for activation functions, optimizers, custom RNN cells, and so on.

In the simplest case, we can enable fusion by applying the torch.jit.script decorator to a function definition, for example:

@torch.jit.script
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))

7. Enable channels_last memory format for CV models

PyTorch 1.5 introduced support for the channels_last memory format for convolutional networks. This format is intended to be used together with AMP to further accelerate convolutional neural networks with Tensor Cores.

The channels_last memory format is an alternative way of ordering NCHW tensors in memory; it only applies to 4D NCHW tensors.

The classical storage of the NCHW tensor (in the example, two 4×4 images with three color channels) is shown below:

The channels_last memory format looks like this:

channels_last support is still in beta, but it is expected to work with standard CV models (e.g. ResNet-50, SSD). To convert a model to the channels_last format, see the Channels Last Memory Format Tutorial.
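
A minimal sketch of the conversion (a torchvision ResNet-50 and random input are used here as stand-ins; combining it with AMP is optional):

import torch
import torchvision

model = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)
inputs = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():
    output = model(inputs)  # convolutions now run on the channels_last layout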

8. Checkpoint intermediate buffers

Buffer checkpointing is a technique for reducing the memory footprint of model training. Instead of storing the inputs of all layers needed to compute upstream gradients in backpropagation, it stores the inputs of only a few layers, and the others are recomputed during the backward pass. The reduced memory requirement allows larger batch sizes, which can improve utilization.

Checkpoint targets should be chosen carefully. Ideally, do not store large layer outputs that are cheap to recompute. Example target layers are activation functions (e.g. ReLU, Sigmoid, Tanh), up/down sampling, and matrix-vector operations with a small accumulation depth.

PyTorch provides the native torch.utils.checkpoint API, which handles the checkpointing and recomputation automatically.
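
A minimal sketch (the two-stage module below is hypothetical; only the first, cheap-to-recompute stage is checkpointed):

import torch
from torch.utils.checkpoint import checkpoint

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
        self.stage2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        # activations of stage1 are not stored; they are recomputed during backward
        x = checkpoint(self.stage1, x)
        return self.stage2(x)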

9. Disable debugging APIs

Many PyTorch APIs are intended for debugging and should be disabled during regular training runs (a short example follows the list):

  • Anomaly detection: torch.autograd.detect_anomaly, torch.autograd.set_detect_anomaly(True)
  • Profiler-related: torch.autograd.profiler.emit_nvtx, torch.autograd.profiler.profile
  • Autograd gradient checks: torch.autograd.gradcheck, torch.autograd.gradgradcheck
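
For example (a minimal sketch; model and data are assumed to be defined):

import torch

# leave anomaly detection off for regular training runs
torch.autograd.set_detect_anomaly(False)

# the profiler context managers accept enabled=False, which makes them no-ops
with torch.autograd.profiler.profile(enabled=False):
    output = model(data)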

10. Use multiple batches per gradient update

Running a forward pass through the network in training mode builds a computational graph. The backward pass then walks this graph to compute a gradient for each free parameter in the model, consuming the graph in the process.

However, not every backward pass has to be followed by an optimizer step. In other words, we can run model(batch) and loss.backward() as many times as we want before finally calling optimizer.step(); the gradients keep accumulating until we decide to apply them all at once.

For models whose performance is constrained by GPU memory, this simple technique provides an easy way to achieve “virtual” batch sizes larger than what fits in memory.

For example, if only 16 samples per batch fit in GPU memory, we can accumulate gradients over two batches before stepping the optimizer, for an effective batch size of 32; or accumulate over four batches for an effective batch size of 64.

The following code example illustrates how it works:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
    if (i + 1) % evaluation_steps == 0:             # Evaluate the model when we ...
        evaluate_model()                            # ... have no gradients accumulated

Note that we need to combine the losses from each batch in some way, usually by averaging.

There is one drawback to multi-batch gradient accumulation: any fixed per-batch costs incurred during training, such as the latency of transferring data between host and GPU memory, are paid for every accumulated batch rather than once per optimizer step.

Using accumulate_grad_batches in Lightning is quite simple:

trainer = Trainer(accumulate_grad_batches=16)
trainer.fit(model)

11. Use gradient clipping

Gradient clipping is a technique originally developed to deal with exploding gradients in RNNs: gradient values that have become too large are clipped back to a more realistic value. We set a max_grad, and PyTorch clips the gradients against it during backpropagation (note that clipping is bidirectional: a max_grad of 10 ensures that gradients fall within [-10, 10]).

Gradient clipping can speed up convergence and lets us choose a higher learning rate; see Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity.

Gradient clipping in PyTorch is provided via torch.nn.utils.clip_grad_norm_. It can be applied to individual parameter groups on a case-by-case basis, but the simplest and most common way is to apply clipping to the model as a whole:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad)

What is an appropriate value for max_grad? There is no fixed rule; it is best treated as a hyperparameter. In the paper Generating Sequences With Recurrent Neural Networks (2013), the value was set to 10 for the intermediate layers and 100 for the output head.

12. Use PyTorch Lightning

PyTorch Lightning is an abstraction and wrapper around PyTorch. Its advantages are strong reusability, easy maintenance, clear logic, and so on. The downside is obvious: there is quite a lot to learn and understand about the package itself. Whether it suits you varies from person to person, but it is worth a try.

13. Use profilers to analyze code

In Lightning, if we just want to analyze standard operations, we can set profiler=”simple”. It uses the built-in SimpleProfiler.

# by passing a string
trainer = Trainer(..., profiler="simple")

# or by passing an instance
from pytorch_lightning.profiler import SimpleProfiler
profiler = SimpleProfiler()
trainer = Trainer(..., profiler=profiler)

The profiler's results are printed when trainer.fit() completes, as shown below:

If we want to know more about the functions called during each event, we can use AdvancedProfiler. This option uses Python's cProfile to produce an in-depth report of the time spent in every function called in the code.

# by passing a string
trainer = Trainer(..., profiler="advanced")

# or by passing an instance
from pytorch_lightning.profiler import AdvancedProfiler
profiler = AdvancedProfiler()
trainer = Trainer(..., profiler=profiler)

This report can be quite long, so you can also specify a directory path dirpath and a filename to save the report instead of logging it to the terminal output. The following figure shows the profile of the get_train_batch operation:

Click on: link for details

CPU-specific optimization

14. Use non-uniform memory access (NUMA) controls

NUMA (non-uniform memory access) is a memory layout used in data center machines to exploit memory locality on multi-socket machines with multiple memory controllers and memory blocks. In general, all deep learning workloads (training or inference) perform better when they do not need to access hardware resources across NUMA nodes. Therefore, inference can be run with multiple instances, each bound to one socket, to improve throughput. For single-node training tasks, distributed training is recommended, with each training process bound to one socket.

In general, the following command runs the PyTorch script only on the cores of the Nth NUMA node and avoids cross-socket memory access, reducing memory access overhead.

# numactl --cpunodebind=N --membind=N python <pytorch_script>

Click on: link for details

15. Use OpenMP

OpenMP brings better performance to parallel computing tasks. OMP_NUM_THREADS is the simplest switch for speeding up computation: it determines the number of threads used for OpenMP computations. CPU affinity settings control how the workload is distributed across the cores; they affect communication overhead, cache line invalidation overhead, and page thrashing, so setting CPU affinity correctly brings performance benefits. GOMP_CPU_AFFINITY or KMP_AFFINITY determines how OpenMP threads are bound to physical processing units. Click on: link for details.

With the following command, PyTorch runs the task on N OpenMP threads.

# export OMP_NUM_THREADS=N

In general, the following environment variables are used to set CPU affinity with the GNU OpenMP implementation. OMP_PROC_BIND specifies whether threads may be moved between processors; setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions. OMP_SCHEDULE determines how OpenMP threads are scheduled. GOMP_CPU_AFFINITY binds threads to specific CPUs.

# export OMP_SCHEDULE=STATIC
# export OMP_PROC_BIND=CLOSE
# export GOMP_CPU_AFFINITY="N-M"

16. Intel OpenMP Runtime Library (Libiomp)

By default, PyTorch uses GNU OpenMP (GNU libgomp) for parallel computation. On Intel platforms, the Intel OpenMP Runtime Library (libiomp) provides support for the OpenMP API specification and can sometimes deliver better performance than libgomp. LD_PRELOAD can be used to switch the OpenMP library to libiomp:

# export LD_PRELOAD=<path>/libiomp5.so:$LD_PRELOAD

Similar to the CPU affinity settings in GNU OpenMP, libiomp provides environment variables to control CPU affinity. KMP_AFFINITY binds OpenMP threads to physical processing units. KMP_BLOCKTIME sets the time, in milliseconds, that a thread should wait after completing a parallel region before going to sleep. In most cases, setting KMP_BLOCKTIME to 1 or 0 yields good performance. The following commands show common settings for the Intel OpenMP runtime library.

# export KMP_AFFINITY=granularity=fine,compact,1,0
# export KMP_BLOCKTIME=1

17. Switch the memory allocator

For deep learning workloads, jemalloc or TCMalloc can achieve better performance than the default malloc by reusing memory as much as possible. jemalloc is a general-purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. TCMalloc also includes optimizations to speed up program execution; one of them is holding memory in caches to speed up access to frequently used objects, and keeping such caches even after deallocation helps avoid costly system calls when that memory is later reallocated. Use the environment variable LD_PRELOAD to take advantage of either of them.

# export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD

18. Train models on the CPU with DistributedDataParallel

Training on the CPU is also a good option for small-scale models or memory-bound models such as DLRM. On machines with multiple sockets, distributed training makes efficient use of the hardware resources and speeds up training. torch-ccl, optimized with Intel oneCCL (Collective Communications Library) for efficient distributed deep learning training with collectives such as allreduce, allgather, and alltoall, implements the PyTorch C10d ProcessGroup API and can be dynamically loaded as an external ProcessGroup. With the optimizations applied to the PyTorch DDP module, torch-ccl accelerates communication operations. Besides optimizing the communication kernels, torch-ccl can also overlap computation with communication.

GPU-specific optimization

19. Enable cuDNN benchmarking

This optimization only applies to models that use a large number of convolutional layers (for example, ordinary convolutional neural networks, or architectures with a CNN backbone).

The core of a convolutional layer is the convolution operation, a basic operation in image processing, signal processing, statistical modeling, compression, and other applications. Many different algorithms have been developed to compute convolutions efficiently for different array sizes and hardware platforms. In PyTorch, accelerated convolution is provided by NVIDIA's cuDNN library. cuDNN has a benchmarking API that runs a short program to select the best convolution algorithm for a given input size and hardware.

We can set torch.backends.cudnn.benchmark = True to enable benchmarking. When it is enabled, the first time a convolution of a particular size runs on our GPU, a quick benchmark is performed to determine the best cuDNN convolution implementation for that input size. After that, every convolution on a tensor of the same size will use that algorithm instead of the default one.
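
Enabling it is a one-liner, usually placed once before training starts:

import torch

# let cuDNN benchmark and cache the fastest convolution algorithms
torch.backends.cudnn.benchmark = True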

The performance gains from using the cuDNN benchmark can be seen below:

Note: the cuDNN benchmark improves speed only if the input size (that is, the shape of the batch tensor passed to the model) stays constant. Otherwise, the benchmark is re-triggered every time the input size changes. Fortunately, most models use fixed tensor shapes and batch sizes, so this is usually not a problem.

20. Avoid unnecessary CPU-GPU synchronization

Avoid unnecessary synchronization and keep the CPU ahead of the accelerator as much as possible to ensure that the accelerator work queue contains many operations.

Avoid operations that require synchronization, such as:

  • print(cuda_tensor)
  • cuda_tensor.item()
  • Memory copies: tensor.cuda(), cuda_tensor.cpu(), tensor.to(device)
  • cuda_tensor.nonzero()
  • Python control flow that depends on the results of operations performed on CUDA tensors, for example: if (cuda_tensor != 0).all()

If you are only trying to clear the attached computational graph, use .detach() instead. This does not copy the tensor to the CPU; it simply removes any computational graph attached to the variable.
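
For instance, a common pattern is to keep running metrics on the GPU and synchronize only occasionally (a minimal sketch; running_loss is assumed to be a CUDA tensor, and step, loss, and log_interval are assumed to exist):

running_loss += loss.detach()  # stays on the GPU, no synchronization

if step % log_interval == 0:
    # .item() forces a CPU-GPU sync, so do it only once in a while
    print(running_loss.item() / log_interval)
    running_loss.zero_()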

21. Create the tensor directly on the target device

Do not generate random tensors as follows:

torch.rand(size).cuda()

Because this first creates a CPU tensor and then transfers it to the GPU, which is really slow. We should instead create the tensor directly on the target device:

torch.rand(size, device=torch.device('cuda'))

If we use Lightning, it automatically puts the model and batches on the correct GPU. However, if we create a new tensor somewhere in the code (for example, when sampling random noise for a VAE, or something similar), then we must place the tensor ourselves:

torch.rand(size, device=self.device)

This applies to all functions that create a new tensor and accept device arguments: torch.rand(), torch.zeros(), torch.full(), and so on.

22. Use mixed precision and AMP

Mixed precision takes advantage of Tensor Cores and provides up to 3x overall speedup on Volta and newer GPU architectures. To use Tensor Cores, AMP should be enabled and the matrix/tensor dimensions should satisfy the requirements for calling the Tensor Core kernels.

To use Tensor Cores:

  • Set sizes to multiples of 8 (to map onto the dimensions of the Tensor Cores)
  • Enable AMP

It is recommended to move to 16-bit precision, which has two advantages (a native AMP sketch follows the list below):

  1. Memory usage is halved, since the default precision is generally 32-bit, which means we can double the batch size and cut training time.
  2. Some GPUs (V100, 2080Ti) offer automatic speedups (3 to 8 times faster) because they are optimized for 16-bit computation.
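
A minimal sketch of native AMP training with torch.cuda.amp (model, optimizer, loss_fn, and loader are assumed to be defined elsewhere):

import torch

scaler = torch.cuda.amp.GradScaler()

for data, target in loader:
    optimizer.zero_grad(set_to_none=True)
    # run the forward pass and the loss in mixed precision
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    # scale the loss to avoid underflow in fp16 gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()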

In Lightning, you can easily enable 16 bits:

Trainer(precision=16)

Note: prior to PyTorch 1.6, you also had to install Nvidia Apex; now 16-bit precision is native to PyTorch. If we use Lightning, it supports both and switches automatically depending on the detected PyTorch version.

23. Preallocate memory with variable input length

Models for speech recognition or NLP are often trained on input tensors with variable sequence lengths. Variable lengths can be problematic for the PyTorch caching allocator and can lead to performance degradation or unexpected out-of-memory errors. If a batch with a short sequence length is followed by one with a longer sequence length, PyTorch has to release the intermediate buffers from the previous iteration and reallocate new ones. This process is time-consuming and causes fragmentation in the caching allocator, which can result in out-of-memory errors.

A typical solution is to pre-allocate. It consists of the following steps (a minimal sketch follows the list):

  1. Generate a (usually random) batch of inputs with the maximum sequence length (either the maximum length in the training dataset or some predefined threshold)
  2. Run a forward and a backward pass with the generated batch, but do not run the optimizer or the learning-rate scheduler; this step preallocates buffers of the maximum size, which can be reused in subsequent training iterations
  3. Zero out the gradients
  4. Proceed to regular training
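
A minimal sketch of the warm-up pass (vocab_size, batch_size, max_seq_len, and model are placeholders for your own setup):

import torch

# steps 1-2: forward and backward with a dummy maximum-length batch, no optimizer step
dummy = torch.randint(0, vocab_size, (batch_size, max_seq_len), device="cuda")
loss = model(dummy).sum()  # a stand-in loss, just to trigger the backward pass
loss.backward()

# step 3: discard the warm-up gradients
model.zero_grad(set_to_none=True)

# step 4: start the regular training loop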

24. Use non-blocking device memory transfers

When transferring data to device memory, it helps to enable asynchronous (non-blocking) transfers via .to(non_blocking=True), especially when the data is loaded by a DataLoader with pin_memory=True. Low-level CUDA optimizations allow data in pinned memory to be copied to the GPU concurrently with kernel execution. The principle is shown below (Source):

In the first (sequential) example, the data transfer blocks kernel execution and vice versa. In the latter two (concurrent) examples, the transfer and execution work is first broken down into smaller subtasks and then pipelined.

PyTorch uses this feature to pipeline GPU data transfers with GPU code execution:

# assuming the loader call uses pinned memory
# e.g. it was DataLoader(..., pin_memory=True)
for data, target in loader:
    # these two calls are also nonblocking
    data = data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)
    # model(data) is blocking, so it's a synchronization point
    output = model(data)

In this case, model(data) is the first synchronization point: PyTorch and CUDA queue the asynchronous copies of the data and target tensors to the GPU, and the forward pass through the model begins once those transfers have completed.

Distributed optimization

25. Use DistributedDataParallel instead of DataParallel

PyTorch implements data parallel training in two ways:

  • torch.nn.DataParallel
  • torch.nn.parallel.DistributedDataParallel

DistributedDataParallel differs from DataParallel in that the former uses multiple processes, creating one process per GPU, while DataParallel uses multiple threads. With multiple processes, each GPU has its own dedicated process, which avoids the performance overhead of the Python interpreter's GIL. More
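
A minimal sketch of wrapping a model with DDP inside each spawned process (model, rank, world_size, and the process-launching code are assumed to be handled elsewhere):

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# one process per GPU; each process joins the same process group
dist.init_process_group("nccl", rank=rank, world_size=world_size)

model = model.to(rank)                     # move the model to this process's GPU
ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced automatically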

26. Skip unnecessary all-reduce operations when training with DistributedDataParallel and gradient accumulation

By default, torch.nn.parallel.DistributedDataParallel performs a gradient all-reduce after every backward pass to compute the average gradient over all workers participating in the training. If training uses gradient accumulation over N steps, the all-reduce is not needed after every training step; it is only required after the last call to backward, just before the optimizer step.

DistributedDataParallel provides the no_sync() context manager, which disables gradient all-reduce for a given iteration. no_sync() should be applied to the first N-1 iterations of gradient accumulation; the last iteration should perform the required all-reduce.
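
A minimal sketch (ddp_model, optimizer, loss_fn, loader, and accumulation_steps are assumed to be defined):

for i, (data, target) in enumerate(loader):
    if (i + 1) % accumulation_steps != 0:
        # first N-1 accumulation steps: skip the gradient all-reduce
        with ddp_model.no_sync():
            loss_fn(ddp_model(data), target).backward()
    else:
        # last step: this backward triggers the all-reduce, then we step
        loss_fn(ddp_model(data), target).backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)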

27. If using DistributedDataParallel(find_unused_parameters=True), match the order of layers in the constructor with the execution order

torch.nn.parallel.DistributedDataParallel with find_unused_parameters=True uses the order of layers and parameters from the model constructor to build the buckets used for DistributedDataParallel's gradient all-reduce. DistributedDataParallel overlaps the all-reduce with the backward pass. The all-reduce for a particular bucket is triggered asynchronously only when all gradients of the parameters in that bucket are available.

To maximize the amount of overlap, the order in the model constructor should roughly match the execution order. If the orders do not match, the all-reduce for an entire bucket has to wait for its last gradient to arrive, which reduces the overlap between the backward pass and the all-reduce; the all-reduce may end up exposed, slowing down training.

DistributedDataParallel with find_unused_parameters=False (the default) relies on automatic bucket formation based on the order of operations encountered during the backward pass. With find_unused_parameters=False, it is not necessary to reorder layers or parameters to achieve optimal performance.

28. Load balancing workloads in distributed Settings

Models that process sequential data (speech recognition, translation, language models, and so on) often suffer from load imbalance. If one device receives a batch whose sequences are longer than those on the other devices, all devices wait for the worker that finishes last. In a distributed setting, the backward pass (with its gradient all-reduce) acts as an implicit synchronization point.

There are several ways to address load balancing. The core idea is to distribute the workload as evenly as possible across all workers within each global batch. For example, Transformer addresses the imbalance by forming batches with a roughly constant number of tokens (and a variable number of sequences per batch), while other models address it by bucketing samples with similar sequence lengths or even by sorting the dataset by sequence length. A minimal sketch of length bucketing follows.
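
The sketch below assumes a list lengths of per-sample sequence lengths and a batch_size; the resulting index batches can be passed to a DataLoader via its batch_sampler argument:

import random

# sort sample indices by sequence length so each batch holds similar lengths
indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
random.shuffle(batches)  # shuffle the order of batches, not their contents

# loader = DataLoader(dataset, batch_sampler=batches, ...)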

That is the end of the main text; the references follow:

  • Performance Tuning Guide — PyTorch Tutorials on 1.10.1+ Documentation
  • 7-tips-to-maximize-pytorch-performance
  • Tricks for training PyTorch models to convergence more quickly (spell.ml)

Conclusion

Thank you all for reading this, if you find it helpful:

  1. Like it and make it available to more people.

  2. Share your thoughts with me in the comments section.

  3. You can also read some of my previous posts if you are interested:

    • 🐮! These 15 tips will directly take your Python performance off the ground 🚀 – Nuggets
    • After reading this article, I can't believe there are people who don't understand convolutional neural networks! – Nuggets (juejin.cn)

Thank you again for your encouragement and support 🌹🌹🌹