Introduction

On March 4, PyTorch announced version 1.8 on its official blog. According to the release notes, the new version mainly brings compiler and distributed training updates, along with several new mobile tutorials.

Overall, this release comprises more than 3,000 commits since 1.7, covering compilation, code optimization, scientific-computing front-end APIs, and binary support for AMD ROCm available via pytorch.org.

PyTorch 1.8 also adds improved features for large-scale training, such as pipeline and model parallelism, as well as gradient compression.

Some of the highlights include:

  1. torch.fx enables Python-to-Python functional transformation of code;

  2. Added and updated APIs, including support for FFTs (torch.fft), linear algebra functions (torch.linalg), autograd support for complex tensors, and improved performance for computing Hessians and Jacobians;

  3. Major updates and improvements to distributed training, including improved NCCL reliability, support for pipeline parallelism, RPC profiling, and communication hooks that enable gradient compression.

In PyTorch 1.8, several PyTorch libraries have been officially updated, including TorchCSPRNG, TorchVision, TorchText, and TorchAudio.

New and updated APIs

The new and updated APIs include additional NumPy-compatible APIs, as well as APIs that improve code performance during inference and training.

Here is a brief overview of the major updates to PyTorch 1.8.

[Stable] torch.fft: support for high-performance NumPy-style FFTs

The torch.fft module ships in PyTorch 1.8. It implements the functionality of NumPy's np.fft while supporting hardware acceleration and autograd.
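
For illustration, a minimal sketch of the NumPy-style API might look like this (the tensors and specific calls below are illustrative, not taken from the release notes):

import torch

t = torch.arange(4, dtype=torch.float32)   # tensor([0., 1., 2., 3.])
freq = torch.fft.fft(t)                    # complex-valued spectrum
recovered = torch.fft.ifft(freq)           # inverse transform, approximately equal to t
half_spec = torch.fft.rfft(t)              # FFT of a real signal (one-sided spectrum)
# The same calls also accept CUDA tensors and participate in autograd.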

[Beta] torch.linalg: NumPy-style linear algebra functions

Modeled on NumPy's np.linalg, torch.linalg provides NumPy-like support for common linear algebra operations, including Cholesky decompositions, determinants, eigenvalues, and more.
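
As a rough sketch, with the specific operations below chosen purely for illustration:

import torch

a = torch.randn(3, 3)
spd = a @ a.T + 3 * torch.eye(3)        # build a symmetric positive-definite matrix
L = torch.linalg.cholesky(spd)          # Cholesky factor: spd = L @ L.T
det = torch.linalg.det(spd)             # determinant
eigvals = torch.linalg.eigvalsh(spd)    # eigenvalues of a symmetric matrix
norm = torch.linalg.norm(a)             # NumPy-style matrix/vector norm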

[Beta] Python code transformation with FX

FX lets developers transform Python code by writing transformations of the form transform(input_module: nn.Module) -> nn.Module, which take in an nn.Module and return a modified nn.Module.
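
For example, a minimal transform might symbolically trace a module, rewrite its graph, and return a new module; the add-to-mul rewrite below is purely illustrative:

import torch
import torch.fx

def transform(m: torch.nn.Module) -> torch.nn.Module:
    traced = torch.fx.symbolic_trace(m)    # capture the module as a Graph
    for node in traced.graph.nodes:
        if node.op == "call_function" and node.target is torch.add:
            node.target = torch.mul        # swap the traced operation
    traced.graph.lint()                    # sanity-check the edited graph
    traced.recompile()                     # regenerate forward() from the graph
    return traced

class M(torch.nn.Module):
    def forward(self, x, y):
        return torch.add(x, y)

transformed = transform(M())               # now computes torch.mul(x, y)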

Distributed training

To improve NCCL reliability, PyTorch 1.8 adds stable asynchronous error/timeout handling, along with support for RPC profiling. It also adds support for pipeline parallelism, and gradient compression can now be performed through DDP communication hooks.

Details as follows:

[Beta] Pipeline parallelism

Provides an easy-to-use PyTorch API that applies pipeline parallelism as part of the training loop.
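
A minimal single-process sketch is shown below; it assumes two visible GPUs and a single-node setup, and note that the Pipe API requires the RPC framework to be initialized:

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

os.environ.setdefault("MASTER_ADDR", "localhost")   # assumed single-node settings
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)        # Pipe requires RPC to be initialized

stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).cuda(0)   # first stage on GPU 0
stage2 = nn.Sequential(nn.Linear(1024, 10)).cuda(1)                # second stage on GPU 1
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)              # split each batch into 8 micro-batches

x = torch.randn(64, 1024).cuda(0)
out = model(x).local_value()    # forward returns an RRef holding the output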

DDP communication hook

The DDP communication hook is a generic interface for controlling how gradients are communicated across workers by overriding the vanilla allreduce used by DistributedDataParallel.

PyTorch 1.8 adds several built-in communication hooks, such as PowerSGD, that users can call on demand. The communication hook interface also supports user-defined communication strategies.
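
As a rough sketch of registering the built-in PowerSGD hook: it assumes the default process group has already been initialized (e.g. via dist.init_process_group) and that each process drives the GPU matching its rank.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

rank = dist.get_rank()
model = DDP(nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])

state = powerSGD.PowerSGDState(
    process_group=None,           # None -> use the default process group
    matrix_approximation_rank=1,  # lower rank -> stronger gradient compression
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)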

Additional prototype features for distributed training

In addition to the distributed training features released as stable and beta, several more are available as prototypes in the nightly builds.

Details are as follows:

  • [Prototype] ZeroRedundancyOptimizer

  • [Prototype] CUDA support in RPC using TensorPipe

PyTorch Mobile

PyTorch 1.8 offers several new mobile tutorials for new users, designed to help them deploy PyTorch models on mobile devices more quickly.

At the same time, it provides tools that make it easier for experienced users to develop for mobile with PyTorch.

New tutorials for PyTorch mobile include:

  • Image segmentation with DeepLabV3 on iOS

  • Image segmentation with DeepLabV3 on Android

The new demo apps (iOS and Android) also cover image segmentation, object detection, machine translation, question answering, and more.

In addition to improving the CPU performance of models such as MobileNetV3, the official Android GPU backend prototype has been revamped to cover more models and deliver faster inference.

In addition, PyTorch 1.8 ships the PyTorch Mobile Lite Interpreter, which lets users reduce the size of the runtime binary.

[Prototype] PyTorch Mobile Lite Interpreter

The PyTorch Lite Interpreter is a stripped-down version of the PyTorch runtime that can execute PyTorch programs on storage-constrained devices with a reduced binary footprint.

This feature can reduce binary size by about 70% compared to the current on-device runtime.
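
A minimal export sketch is shown below; since the Lite Interpreter is a prototype in 1.8, the save method used here may change in later releases, and the model itself is a placeholder:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
scripted = torch.jit.script(model)                  # convert to TorchScript first
scripted._save_for_lite_interpreter("model.ptl")    # bytecode format consumed by the mobile runtime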

Performance optimization

PyTorch 1.8 adds Benchmark Utils to make it easier for users to monitor model performance, and opens up a new automated quantization API.

[Beta] Benchmark Utils

Benchmark Utils lets users take accurate performance measurements and provides composable tools for benchmarking and post-processing.

Code example:

from torch.utils.benchmark import Timer

results = []
for num_threads in [1, 2, 4]:
    timer = Timer(
        stmt="torch.add(x, y, out=out)",
        setup="""
            n = 1024
            x = torch.ones((n, n))
            y = torch.ones((n, 1))
            out = torch.empty((n, n))
        """,
        num_threads=num_threads,
    )
    results.append(timer.blocked_autorange(min_run_time=5))
    print(
        f"{num_threads} thread{'s' if num_threads > 1 else ' ':<4}"
        f"{results[-1].median * 1e6:>4.0f} us " +
        (f"({results[0].median / results[-1].median:.1f}x)"
         if num_threads > 1 else "")
    )

# Sample output (timings vary by machine):
# 1 thread     376 us
# 2 threads    (2.0x speedup)
# 4 threads    (3.0x speedup)

[Prototype] FX Graph Mode Quantization

FX Graph Mode Quantization is a new automated quantization API in PyTorch. It improves on Eager Mode Quantization by adding support for functionals and by automating the quantization process.
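
A minimal post-training static quantization sketch with the FX graph mode API might look like the following (the model and calibration data are placeholders):

import torch
import torch.nn as nn
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

float_model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
qconfig_dict = {"": get_default_qconfig("fbgemm")}   # apply one qconfig to the whole model

prepared = prepare_fx(float_model, qconfig_dict)     # trace the model and insert observers
with torch.no_grad():
    for _ in range(8):                               # calibrate with representative data
        prepared(torch.randn(32, 64))
quantized = convert_fx(prepared)                     # produce the quantized model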

Hardware support

[Beta] Enhanced PyTorch Dispatcher capabilities to improve the backend development experience in C++

PyTorch 1.8 lets users create new out-of-tree device backends outside the pytorch/pytorch repo and keep them in sync with the native, in-tree PyTorch devices.

[Beta] AMD GPU binaries are now available

PyTorch 1.8 adds support for ROCm wheels. Users can easily run on AMD GPUs by going to the standard PyTorch installation selector, choosing ROCm as the compute platform, and executing the generated command.

For tutorial and documentation details, please visit the following links:

Pytorch.org/blog/pytorc…