A sharp tool for deep learning: an introduction to GPUs

1 Deep learning and GPUs

In modern history, science and technology have developed at a breathtaking pace: Moore's Law stands out, and new foundational technologies emerge in an endless stream. Yet throughout the history of science and technology, one word lurks behind the rise of almost every emerging discipline: "money"!

As the hottest industry of recent years, artificial intelligence also burns money. As everyone knows, AI training and inference require vast amounts of high-performance computing. Anyone working in deep learning knows that today's SOTA models often demand huge amounts of memory, which forces researchers to configure ever more powerful GPUs or quickly run into shortages of memory and compute. Large enterprises and research institutions can afford to deploy HPC clusters for models such as GPT-3, which has been especially hot in recent years; for an ordinary setup, getting such a model to run at all, let alone run fast, is not even worth considering. Today's AI models face higher-level challenges such as conversational AI, leading to an explosion in complexity, and training them requires massive computing power and scalability.

Take NVIDIA's state-of-the-art A100 as an example. With Tensor Float 32 (TF32) precision, the A100's Tensor Cores deliver up to 20 times the performance of the previous-generation NVIDIA Volta without any code changes; with automatic mixed precision and FP16, performance can be improved by a further 2 times. 2,048 A100 GPUs can process a training workload such as BERT at scale in about a minute, a world record for training time.

For very large models with huge data tables (such as DLRM for recommendation systems), the A100 80GB provides up to 1.3 TB of unified memory per node and is up to 3 times faster than the A100 40GB. NVIDIA's product leadership has been confirmed in MLPerf, the industry-level AI training benchmark, where it has set performance records.

The A100 80GB uses the same GPU chip as the A100 40GB: the same A100 core, 6,912 CUDA cores, 1.41 GHz, 19.5 TFLOPS FP32, 9.7 TFLOPS FP64, 624 TOPS INT8, and a 400 W TDP. The main change is the video memory: capacity goes from 40GB to 80GB, the memory type moves from HBM2 to the more advanced HBM2e, the per-pin data rate rises from 2.4 Gbps to 3.2 Gbps, and the bandwidth increases from 1.6 TB/s to 2 TB/s.

2 CPU and GPU introduction

To most people, the CPU handles general computation while the GPU handles image rendering, and the CPU seems the more capable of the two. So why can the GPU accelerate deep learning and deliver such a huge performance gain? Make yourself a cup of tea and read on.

Broadly speaking, we can understand it through a workplace analogy many people will recognize. The CPU is the all-rounder who is proficient in everything; precisely because he can do everything, everything gets assigned to him, he is run off his feet, and nobody notices what he has done, only what he has not. The GPU, by contrast, is pickier: his abilities cover a narrower range and he chooses his work, taking on only the tasks where he can deliver results and ignoring everything else. To stretch the analogy, there are also people we might call "SBPUs": not especially capable, but first-rate at grabbing credit for other people's work through dishonorable means and then showing it off in slide decks. Er, off topic. Let's get back to business.

2.1 CPU introduction

The central processing unit (CPU) is one of the main pieces of electronic computing equipment and the core component of a computer. Its main function is to interpret computer instructions and process the data in computer software. As the computer's core component, the CPU is responsible for fetching, decoding, and executing instructions. It consists mainly of two parts, the control unit and the arithmetic unit, together with the cache and the data and control buses that connect them. The three core components of a computer are the CPU, main memory, and input/output devices, and the CPU's main tasks are to process instructions, perform operations, control timing, and process data.

In computer architecture, the CPU is the core hardware unit that controls and allocates all of the computer's hardware resources (such as memory and input/output units) and performs general-purpose computation. It is the computing and control center of the computer: every software-level operation in the system is ultimately mapped, through the instruction set, onto CPU operations.

In essence, the CPU is composed of a few cores optimized for sequential, serial processing. It can of course also run in parallel through multi-core designs and hyper-threading, but because of its complex control logic, power consumption, heat dissipation, and die size, a CPU cannot pack in very many cores, and this is exactly where the GPU has the advantage.

2.2 GPU introduction

A graphics processing unit (GPU), also known as a display core, visual processor, or display chip, is a microprocessor specialized for image and graphics operations in personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). The GPU reduces the graphics card's dependence on the CPU and takes over part of the work the CPU used to do, especially in 3D graphics. Core technologies adopted by GPUs include hardware T&L (geometric transform and lighting), cubic environment texture mapping and vertex blending, texture compression and bump mapping, and dual quad-pixel 256-bit rendering engines; hardware T&L in particular can be regarded as the hallmark of the GPU. The main GPU manufacturers are NVIDIA and ATI.

2.3 Differences between CPU and GPU

First, let's analyze the characteristics of the GPU and the CPU and explain why the GPU has an advantage over the CPU in parallel computing.

  • Task mode
    • The CPU consists of several cores optimized for sequential serial processing
    • The GPU has a massively parallel computing architecture made up of thousands of smaller, more efficient cores designed to handle many tasks at once. Moreover, the CPU spends a considerable part of its time on peripheral interrupts, process switches, and other chores, while the GPU can devote more of its time to parallel computing.
  • Function orientation
    • When rendering a frame, the GPU needs to process millions of vertices or triangles at the same time, so its design fully supports parallel computing.
    • The CPU is responsible not only for computing but also for logical control and other tasks.
  • System integration
    • As an add-in device, the GPU faces far looser constraints on size, power, heat dissipation, and compatibility than the CPU, so it can carry larger video memory and higher bandwidth.

2.4 CUDA: the key to GPUs in deep learning

GPUs have been around for a long time, so why have they only started to shine in recent years? In my opinion, the main reasons are as follows:

  1. In recent years, deep learning with neural networks has demanded ever more computing power, especially for the massive computations in the CV and NLP fields.
  2. In the past, NVIDIA and other vendors did not provide a bridge between GPUs and deep learning, so truly using GPUs for deep learning was very expensive. The advent of CUDA built that bridge and laid a solid foundation for the widespread adoption of GPUs.

At the software level, CUDA abstracts the hardware behind a unified programming interface and provides support for C/C++/Python/Java and other programming languages. This layer of abstraction is important because it allows developer code to migrate quickly across different hardware platforms. On top of CUDA sit middleware and libraries such as cuBLAS for scientific computing (compatible with the BLAS interface) and cuDNN for deep learning. These libraries matter a great deal for deep learning frameworks such as TensorFlow and PyTorch, making it much easier for them to use CUDA and the underlying hardware for machine learning computation.
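As a small illustration of this layering, the sketch below computes y = alpha*x + y through cuBLAS instead of a hand-written kernel. This is my own minimal example rather than code from this article; it assumes the CUDA toolkit and cuBLAS are installed (compile with nvcc and link -lcublas), and error checking is omitted for brevity.

    // Minimal cuBLAS sketch: y = alpha*x + y computed on the GPU via the middleware layer
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main() {
        const int n = 1024;
        const float alpha = 2.0f;
        float h_x[1024], h_y[1024];
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Move the vectors into device memory
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        // cuBLAS hides the kernel launch behind a BLAS-style call
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y
        cublasDestroy(handle);

        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %.1f\n", h_y[0]);                  // expect 4.0

        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }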

3 GPU architecture

3.1 GPU Physical Architecture

Again taking NVIDIA's A100 GPU as an example: the A100 has 6,912 cores in total, so you can think of it as a super processor that can execute on 6,912 cores in parallel, as shown in the figure below.

The basic unit of the GPU is the SM. In an NVIDIA GPU, the most basic processing element is the SP (Streaming Processor), and a single GPU contains a large number of SPs that can compute at the same time. Several SPs are grouped into an SM (Streaming Multiprocessor), and several SMs in turn form a TPC (Texture Processing Cluster).

For now we can treat the SM as the basic unit of GPU compute scheduling. Each SM has 64 cores for computation, and the A100 has 108 "heroes", that is, 108 SMs, on the whole GPU, so its total core count is 64 * 108 = 6,912. Powerful enough: one card can run 6,912 parallel computing tasks. A brief explanation of the architecture diagram:

  1. The top layer is the PCIe interface, through which the GPU plugs into the server as a peripheral device.
  2. The green part is the GPU's compute cores; the A100, for example, has 6,912 of them.
  3. The blue part in the middle is the L2 cache.
  4. NVLink is the interconnect between multiple GPUs. It optimizes GPU-to-GPU communication, reducing the CPU's burden and improving transfer efficiency, and is widely used in distributed training.
  5. The HBM2 stacks on both sides are the video memory. The A100 currently comes in 40GB and 80GB versions; if you can afford it, buy the 80GB one rather than making do with half the memory.
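These numbers can also be checked at runtime with a device query. The sketch below is my own minimal example (not from the article) using the standard CUDA runtime API; the values printed depend on the card actually installed.

    // Query the installed GPU and print a few of the figures discussed above
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                                 // properties of GPU 0
        printf("Device name        : %s\n", prop.name);
        printf("SM count           : %d\n", prop.multiProcessorCount);     // 108 on an A100
        printf("Global memory (GB) : %.1f\n", prop.totalGlobalMem / 1e9);  // 40 or 80 on an A100
        printf("L2 cache size (MB) : %.1f\n", prop.l2CacheSize / 1e6);
        return 0;
    }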

Since the SM is so important, let's take a closer look at its hardware.

Related structures of SM:

3.2 Logical architecture of GPU

If the CUDA Grid-Block-Thread hierarchy is mapped onto the actual hardware, it corresponds roughly to GPU, Streaming Multiprocessor, and Streaming Processor. An entire Grid is handed to the GPU for execution, Blocks roughly correspond to SMs, and Threads roughly correspond to SPs. Of course this mapping is not exact; it is just a simple analogy.

  1. Kernel: the code/function that each Thread executes.
  2. Thread: the smallest unit that executes a kernel; it is scheduled onto a CUDA Core.
  3. Warp: a group of 32 threads that an SM schedules and executes together in lockstep.
  4. Thread Block: a group of threads scheduled together onto an SM. An SM can execute multiple Thread Blocks at the same time, but a Thread Block can only be scheduled onto one SM; the GPU's Streaming Multiprocessors manage Thread Block scheduling.
  5. Grid: the collection of all Threads and Blocks of a launch, scheduled onto the entire GPU.

At the same time, Threads, Thread Blocks, and the Grid sit at different levels of the hierarchy and therefore have access to different storage resources: a thread can access only its own registers, a thread block can use the shared memory/L1 cache inside its SM, and the grid as a whole can access the L2 cache and the much larger HBM video memory.
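To make this concrete, the kernel below is my own small sketch (not from the original article) that touches every level of the hierarchy: per-thread indices in registers, a per-block tile in the SM's shared memory, and the input/output arrays in global HBM. It assumes it is launched with 256 threads per block (a power of two).

    // Each block sums 256 input elements and writes one partial sum
    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float tile[256];                        // per-block shared memory on the SM
        int i = blockDim.x * blockIdx.x + threadIdx.x;     // per-thread values live in registers
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;        // read from global memory (HBM)
        __syncthreads();

        // Tree reduction within the block, entirely in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];                     // one partial sum per block back to HBM
    }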

4 GPU in practice

A GPU computation follows four main steps:

  1. Copy the input data from the Host (CPU memory) to the Device (GPU memory)
  2. The CPU launches the computation, sending the kernel function that the Device needs to execute to the GPU
  3. The GPU executes the kernel in parallel across its cores
  4. Copy the results from the Device back to the Host

4.1 CUDA programming practice

    // Device code
    __global__ void VecAdd(float* A, float* B, float* C, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // Host code
    int main()
    {
        int N = ...;
        size_t size = N * sizeof(float);

        // Allocate input vectors h_A and h_B in host memory
        float* h_A = (float*)malloc(size);
        float* h_B = (float*)malloc(size);
        float* h_C = (float*)malloc(size);

        // Initialize input vectors
        ...

        // Allocate vectors in device memory
        float* d_A;
        cudaMalloc(&d_A, size);
        float* d_B;
        cudaMalloc(&d_B, size);
        float* d_C;
        cudaMalloc(&d_C, size);

        // Copy vectors from host memory to device memory
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Invoke kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

        // Copy result from device memory to host memory
        // h_C contains the result in host memory
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);

        // Free host memory
        ...
    }

Each thread on the GPU has a unique ID: for example, thread 0 computes C[0] = A[0] + B[0], while thread 127 computes C[127] = A[127] + B[127]. The line int i = blockDim.x * blockIdx.x + threadIdx.x; is what computes that thread ID.

So how are threads indexed? See the introduction below.

  • threadIdx is of type uint3 and represents the index of a thread within its block.
  • blockIdx is of type uint3 and represents the index of a thread block; a block usually contains multiple threads.
  • blockDim is of type dim3 and represents the size (number of threads) of a thread block.
  • gridDim is of type dim3 and represents the size (number of blocks) of the grid; a grid usually contains multiple thread blocks.

The following chart clearly shows the relationship between these concepts:
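As a quick hands-on check of these variables, the small sketch below (my own example, not from the article) launches 2 blocks of 4 threads each and has every thread print its indices together with the global ID computed by the formula above:

    #include <stdio.h>

    __global__ void showIndex() {
        int id = blockDim.x * blockIdx.x + threadIdx.x;    // global thread ID
        printf("blockIdx.x=%d threadIdx.x=%d blockDim.x=%d gridDim.x=%d -> id=%d\n",
               blockIdx.x, threadIdx.x, blockDim.x, gridDim.x, id);
    }

    int main() {
        showIndex<<<2, 4>>>();       // a grid of 2 blocks, 4 threads per block -> ids 0..7
        cudaDeviceSynchronize();     // wait so the device-side printf output is flushed
        return 0;
    }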

4.2 GPU-based Acceleration

There is much more to GPU acceleration than fits here, so I will only touch on it briefly and open a dedicated topic later. A few items, briefly listed:

  • Fusion operator
  • Spatial filtering
  • High Level Graph Opt
  • CPU & GPU Filter

5 Reference Materials

docs.nvidia.com/cuda/cuda-c…