What is parallel computing?

It is often said that the GPU is suited to parallel computing while the CPU is not. So how are parallel computing, serial computing, the CPU, and the GPU related?

Starting with the basics, what is serial computing?

Serial computing

Most computers today use the von Neumann architecture, in which the processor repeatedly fetches an instruction from memory, decodes it, and executes it. A CPU can therefore solve a problem only by executing one instruction at a time.

How quickly a problem can be solved thus depends on two factors: how fast the CPU executes instructions, and how fast data can be read from memory. Multicore processors with caches emerged in response to both problems.

However, the cache cannot grow indefinitely: beyond a certain size, a larger cache degrades performance, since lookup latency and cost grow faster than the hit rate improves.

Later inventions such as SMT (simultaneous multithreading, known commercially as hyper-threading) enable a processor to operate on multiple data streams at once. The relationship between instruction streams and data streams is classified into SISD, SIMD, MISD, and MIMD (Flynn's taxonomy).

  • SISD: single instruction, single data. Traditional serial processing falls into this category.
  • SIMD: single instruction, multiple data. One instruction stream is broadcast to multiple processing elements, each with its own data stream. Only one set of decode logic is needed, since every element executes the same instruction; there are no multiple instruction-decoding channels.
  • MISD: multiple instruction, single data. Exists mostly as a concept.
  • MIMD: multiple instruction, multiple data. Each processing unit has its own instruction stream operating on its own data stream. Most parallel systems are of this type.
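The SISD/SIMD distinction can be sketched in Python. This is an illustrative example (the function names are my own): the loop version handles one datum per instruction, while NumPy applies a single multiply across the whole array, which the library typically maps to vectorized (SIMD) instructions on the CPU.

```python
import numpy as np

def scale_sisd(data, factor):
    # SISD style: one instruction operates on one datum per iteration
    out = []
    for x in data:
        out.append(x * factor)
    return out

def scale_simd(data, factor):
    # SIMD style: one multiply is broadcast across all elements at once
    return np.asarray(data) * factor

print(scale_sisd([1, 2, 3], 10))           # [10, 20, 30]
print(scale_simd([1, 2, 3], 10).tolist())  # [10, 20, 30]
```

Both produce the same result; the difference is how many data elements each instruction touches.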

It is worth noting that multithreaded concurrency on a single von Neumann processor (running multiple threads/processes "at the same time") is parallel only at the macro level; at the micro level it is still serial, i.e. time-sharing multiplexing. True parallel computing relies on multiprocessor architectures, such as the familiar GPU.

Parallel computing

Improving application performance with multiple cores usually requires some manipulation of computationally intensive code:

  • Divide the code into blocks.
  • Execute these code blocks in parallel through multiple threads.
  • When the results become available, integrate them in a thread-safe and high-performance manner.
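The three steps above can be sketched with Python's standard thread pool. This is a minimal illustration (function names are my own, and note that CPython's GIL limits true parallelism for pure-Python CPU-bound work, so a process pool would usually be preferred in practice): the data is divided into blocks, each worker executes the same code on its block, and the partial results are integrated at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    # Step 2: every worker runs the same code on its own block of data
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # Step 1: divide the data into roughly equal blocks
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Step 3: integrate partial results; summing per-chunk values avoids
    # sharing mutable state (and locks) between threads
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))

print(parallel_sum_of_squares(list(range(1000))))  # 332833500
```

Because each worker returns its own partial result instead of updating a shared variable, no locking is needed in the integration step.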

Traditional multithreading can do all of this, but it is difficult and inconvenient, especially the partitioning and integration steps. The essence of the problem is that when multiple threads use the same data at the same time, the common strategy of locking for thread safety causes heavy contention.

Parallel frameworks are designed specifically to help in these scenarios.

The internal architecture differences between CPU and GPU are obvious:

The CPU devotes much of its die area to cache and control logic and has relatively few ALUs; the GPU has many ALUs and relatively little cache. In other words, the GPU has far more compute units, meaning more workers on the job. A more detailed view of the GPU parallel computing architecture follows:

The figure shows that the compute grid is composed of multiple stream multiprocessors, each of which runs a number of thread blocks. The main concepts in the GPU compute grid are analyzed in detail below:

1. Thread

A thread is the smallest execution unit on the GPU; it carries out one smallest logically meaningful operation.

2. Thread bundle (warp)

A thread bundle (warp) is the basic scheduling unit on the GPU. The GPU is essentially a collection of SIMD processors, so all threads in a bundle execute simultaneously. The concept was introduced to hide the latency of reading and writing video memory.

3. Thread block

A thread block contains multiple thread bundles. All threads within a block can communicate and synchronize through shared memory. The maximum number of threads a block can hold depends on the graphics card model.
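The grid/block/thread hierarchy determines which data element each thread works on. In CUDA, a thread's global index is conventionally computed as `blockIdx.x * blockDim.x + threadIdx.x`; the sketch below simulates that mapping in plain Python for a vector-add pattern (the function names are my own, and the loop stands in for what the hardware runs in parallel).

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    # CUDA convention: blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def simulated_kernel(a, b, block_dim=8):
    # Launch enough blocks to cover every element (ceiling division)
    n_blocks = (len(a) + block_dim - 1) // block_dim
    out = [0] * len(a)
    for block_idx in range(n_blocks):          # blocks run on SMs
        for thread_idx in range(block_dim):    # threads within a block
            i = global_thread_id(block_idx, block_dim, thread_idx)
            if i < len(a):                     # guard: some threads in the
                out[i] = a[i] + b[i]           # last block fall off the end
    return out

print(simulated_kernel([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

The bounds check is essential in real kernels too, since the grid size is rounded up to a whole number of blocks.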

4. Stream multiprocessor

The stream multiprocessor is the GPU's core and is responsible for executing thread bundles; it executes only one thread bundle at a time.

5. Stream processor

A stream processor is responsible only for executing threads and has a relatively simple structure.

Beyond this, the biggest difference between GPU and CPU architectures lies in their memory systems.

To learn more about Nvidia graphics cards:

Zhuanlan.zhihu.com/p/61358167?…

GPU parallel programming is also possible:

Blog.csdn.net/llsansun/ar…

Differences between CPU parallelism and GPU parallelism:

www.cnblogs.com/muchen/p/62…

(Emphasis) Optimization of GPU parallel programming:

www.cnblogs.com/muchen/p/62…

References

GPU parallel programming: www.cnblogs.com/muchen/p/61…

Blog.csdn.net/qq_25985027…

www.cnblogs.com/leokale-zz/…

www.cnblogs.com/jonins/p/95…