01 Consider using a different learning rate schedule

The choice of learning rate schedule has a large influence on the convergence speed and generalization ability of a model. Leslie N. Smith proposed the cyclical learning rate and the 1Cycle learning rate schedule in "Cyclical Learning Rates for Training Neural Networks" and "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates". They were later popularized by Jeremy Howard and Sylvain Gugger at fast.ai. The figure below shows the 1Cycle learning rate schedule:

Sylvain writes: 1Cycle consists of two steps of equal length, one going from a lower learning rate up to a higher one, and the other returning to the minimum. The maximum value comes from the learning rate finder, and the lower value can be about ten times smaller. The length of this cycle should be slightly less than the total number of epochs, and in the final stage of training the learning rate should be allowed to drop several orders of magnitude below the minimum. In the best case this schedule achieves a significant speedup (what Smith calls super-convergence) compared with a traditional learning rate schedule. For example, using the 1Cycle strategy to train ResNet-56 on the ImageNet dataset reduced the number of training iterations to roughly 1/10 of the original while still matching the performance reported in the original paper. This schedule seems to work well with common architectures and optimizers.

PyTorch already implements both methods: torch.optim.lr_scheduler.CyclicLR and torch.optim.lr_scheduler.OneCycleLR.
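Below is a minimal sketch of OneCycleLR usage; the toy model, data, and hyperparameter values are placeholders, not part of the original article:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# toy model and data, only to make the example self-contained
model = nn.Linear(10, 2)
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))),
    batch_size=32,
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

epochs = 5
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                        # peak learning rate, e.g. taken from an LR finder
    steps_per_epoch=len(train_loader),
    epochs=epochs,
)

for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()               # OneCycleLR is stepped once per batch
```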

Reference documents: pytorch.org/docs/stable…


02 Use multiple workers and pinned memory in DataLoader

When using torch.utils.data.DataLoader, set num_workers > 0 (rather than the default of 0) and pin_memory=True (rather than the default of False).

Reference documents: pytorch.org/docs/stable…

Szymon Micacz, a software engineer working on CUDA deep learning algorithms at NVIDIA, achieved a 2x speedup for a single epoch by using four workers and pinned memory. A common rule of thumb is to set the number of workers to four times the number of available GPUs; values much larger or smaller than this tend to slow training down. Note that increasing num_workers also increases CPU memory consumption.
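As a rough sketch (the toy dataset and batch size here are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # load batches in 4 worker processes instead of the main process
    pin_memory=True,   # put batches in page-locked memory for faster host-to-GPU copies
)
```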


03 Set the batch size to the maximum value

Pushing the batch size to the maximum is a somewhat controversial idea. In general, training is faster if you set the batch size to the largest value that GPU memory allows. However, you then also have to adjust other hyperparameters, such as the learning rate. A good rule of thumb is that doubling the batch size doubles the learning rate.
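As a sketch of that linear scaling rule (the base values below are only illustrative, not from the article):

```python
# linear scaling rule of thumb: scale the learning rate with the batch size
base_batch_size = 64
base_lr = 0.1

batch_size = 512                               # the largest batch that fits in GPU memory
lr = base_lr * batch_size / base_batch_size    # 0.1 * 512 / 64 = 0.8
```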

OpenAI's paper "An Empirical Model of Large-Batch Training" gives a good analysis of how many steps different batch sizes need to converge. In "How to Get 4X Speedup and Better Generalization Using the Right Batch Size", Daniel Huynh runs experiments with different batch sizes (also using the 1Cycle strategy discussed above).

Eventually, he increased the batch size from 64 to 512 and achieved a four-fold speedup. The downside of a large batch size, however, is that it may lead to solutions that generalize worse than those found with smaller batches.


04 Use Automatic Mixed Precision (AMP)

PyTorch 1.6 includes a native implementation of automatic mixed precision training. The key point is that some operations run faster in half precision (FP16) than in single precision (FP32) without loss of accuracy. AMP automatically determines which operations should run in which precision, which speeds up training and reduces memory usage.

In the best case, AMP can be used as follows:
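The original code listing was not preserved here; the following is a minimal sketch of the native torch.cuda.amp pattern (autocast plus GradScaler), with a toy model and data as placeholders and a CUDA device assumed:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")          # AMP as shown here assumes a CUDA device
model = nn.Linear(10, 2).to(device)
loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))),
    batch_size=32,
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid FP16 gradient underflow

for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # ops inside run in FP16 where it is safe to do so
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then calls optimizer.step()
    scaler.update()
```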


05 Consider using another optimizer

AdamW is Adam with weight decay (rather than L2 regularization), popularized by fast.ai and implemented in PyTorch as torch.optim.AdamW. AdamW seems to consistently outperform Adam in both error and training time. Both Adam and AdamW work well with the 1Cycle strategy described above.
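A minimal sketch (the lr and weight_decay values are only illustrative):

```python
import torch
from torch import nn

model = nn.Linear(10, 2)

# AdamW applies true weight decay rather than L2 regularization
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```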

There are also some non-native optimizers receiving a lot of attention, most notably LARS and LAMB. NVIDIA's APEX implements fused versions of some common optimizers, such as Adam. Compared with the Adam implementation in PyTorch, this avoids repeated passes to and from GPU memory and is about 5% faster.


06 Turn on cuDNN benchmarking

If your model architecture stays the same and the input size remains fixed, set torch.backends.cudnn.benchmark = True. This lets cuDNN benchmark and pick the fastest convolution algorithms for your configuration.
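A minimal sketch:

```python
import torch

# only worthwhile when the architecture and input sizes stay fixed;
# otherwise cuDNN re-benchmarks for every new shape
torch.backends.cudnn.benchmark = True
```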


07 Watch out for frequent data transfer between the CPU and GPU

Frequently moving tensors from the GPU to the CPU with tensor.cpu() (or from the CPU to the GPU with tensor.cuda()) is very expensive. The same applies to .item() and .numpy(); use .detach() instead where possible.

If you are creating a new tensor, you can allocate it directly on the GPU with the keyword argument device=torch.device('cuda:0').

If you do need to transfer data, use .to(device, non_blocking=True), as long as there are no synchronization points after the transfer.
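A small sketch of both points, assuming a CUDA device is available:

```python
import torch

device = torch.device("cuda:0")

# allocate directly on the GPU instead of creating on the CPU and copying
x = torch.zeros(1024, 1024, device=device)

# when a copy is unavoidable, pinned memory plus non_blocking=True lets the
# host-to-GPU transfer overlap with computation
cpu_tensor = torch.randn(1024, 1024).pin_memory()
gpu_tensor = cpu_tensor.to(device, non_blocking=True)
```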


08 Use gradient/activation checkpointing

Checkpointing trades computation for memory: instead of storing all the intermediate activations of the whole computation graph for the backward pass, it recomputes them. It can be applied to any part of the model.

Specifically, in the forward pass the checkpointed function is run under torch.no_grad(), so its intermediate activations are not stored; only the input tuple and the function itself are kept. In the backward pass, the inputs and the function are retrieved and the forward pass is recomputed on that function; this time the intermediate activations are tracked and used to compute the gradients.

So while this may slightly increase the run time for a given batch size, it significantly reduces the memory footprint. That in turn allows a further increase in batch size, improving GPU utilization.

Although checkpointing is easy to invoke via torch.utils.checkpoint, applying it correctly still requires some thought and effort. Priya Goyal has written an excellent tutorial covering the key aspects of checkpointing.

Priya Goyal tutorials: github.com/prigoyal/py…
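As a minimal sketch of the torch.utils.checkpoint API (the two-stage toy model is a placeholder):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# toy two-stage model, only to illustrate checkpointing one segment of a network
stage1 = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
stage2 = nn.Linear(256, 2)

x = torch.randn(32, 10, requires_grad=True)

# stage1's intermediate activations are not stored; they are recomputed during backward
h = checkpoint(stage1, x)
out = stage2(h)
out.sum().backward()
```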


09 Use gradient accumulation

Another way to effectively increase the batch size is to accumulate gradients over several .backward() passes before calling optimizer.step().

In "Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed Setups", Thomas Wolf of Hugging Face explains how to use gradient accumulation, which can be implemented roughly as follows:
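The original listing was not preserved here; this is a minimal sketch of the usual pattern (accumulation_steps and the toy model/data are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)
loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))),
    batch_size=16,
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4                 # effective batch size = 16 * 4 = 64

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    # scale the loss so the accumulated gradient matches one large batch
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()                    # gradients are summed into .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```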

This approach was developed primarily to circumvent GPU memory limitations.


10 Train on multiple GPUs with distributed data parallelism

There are many ways to speed up distributed training, but a simple one is to use torch.nn.parallel.DistributedDataParallel instead of torch.nn.DataParallel. That way, each GPU is driven by a dedicated CPU core, avoiding DataParallel's GIL problem.
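A minimal sketch of the DDP setup, assuming the script is launched with one process per GPU (for example via torchrun, which sets LOCAL_RANK); the model is a placeholder:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# e.g. launched with: torchrun --nproc_per_node=4 train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 2).to(local_rank)
model = DDP(model, device_ids=[local_rank])   # gradients are synchronized across processes during backward

# ... build a DataLoader with a DistributedSampler and train as usual ...

dist.destroy_process_group()
```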

Distributed training document address: pytorch.org/tutorials/b…


11 Set the gradient to None instead of 0

Call .zero_grad(set_to_none=True) instead of .zero_grad(). This lets the memory allocator handle the gradients rather than actively setting them to 0. As the documentation notes, setting gradients to None gives a moderate speedup, so don't expect miracles. Note that it also has downsides; see the documentation for details.

Document address: pytorch.org/docs/stable…


12 Use .as_tensor() instead of .tensor()

torch.tensor() always copies its data. If you are converting a NumPy array, use torch.as_tensor() or torch.from_numpy() to avoid the copy.
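A small sketch showing the difference:

```python
import numpy as np
import torch

arr = np.zeros(3)

copied = torch.tensor(arr)          # always copies the data
shared = torch.from_numpy(arr)      # shares memory with the NumPy array
also_shared = torch.as_tensor(arr)  # also avoids a copy for NumPy arrays

arr[0] = 1.0
print(copied[0].item())   # 0.0 -- holds its own copy
print(shared[0].item())   # 1.0 -- reflects the change, no copy was made
```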


13 Enable debugging tools only when necessary

PyTorch provides a number of debugging tools, such as autograd.profiler, autograd.gradcheck, and autograd.detect_anomaly. Make sure to turn them on only when you need them and to turn them off when you don't, as they will slow down your training.
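For example, anomaly detection can be enabled temporarily as a context manager (a minimal sketch):

```python
import torch

x = torch.randn(4, requires_grad=True)

# anomaly detection adds noticeable overhead, so enable it only while debugging
with torch.autograd.detect_anomaly():
    y = (x * 2).sum()
    y.backward()
```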


14 Use gradient clipping

To avoid exploding gradients in RNNs, experiments and theory both suggest that gradient clipping (gradient = min(gradient, threshold)) can accelerate convergence. HuggingFace's Transformers implementation is a very clear example of how to use gradient clipping, alongside other methods mentioned in this article such as AMP.

This can be done in PyTorch using torch.nn.utils.clip_grad_norm_.
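A minimal sketch (max_norm=1.0 is just an illustrative threshold):

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y)

optimizer.zero_grad()
loss.backward()
# clip after backward() and before step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```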


15 Turn bias off before BatchNorm

Turn off the bias in layers that come directly before a BatchNorm layer; the BatchNorm shift makes it redundant. For a 2-D convolutional layer, set the bias keyword to False: torch.nn.Conv2d(…, bias=False, …).
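A small sketch (the layer sizes are placeholders):

```python
from torch import nn

# BatchNorm's learnable shift makes the preceding conv bias redundant
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```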


16 Turn off gradient calculation during validation

To turn off gradient computation during validation, wrap the validation code in a torch.no_grad() context.
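A minimal sketch:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
val_x = torch.randn(64, 10)

model.eval()
with torch.no_grad():          # no autograd graph is built, saving memory and time
    preds = model(val_x).argmax(dim=1)
```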


17 Use input and batch normalization

Double-check that your inputs are normalized, and that batch normalization is being used.

Original link: efficientdl.com/faster-deep…