Deep into GPU Hardware Architecture and Operating Mechanism

Overall GPU module structure

(Figure: the full version of the GPU block diagram)

(Figure: the simplified version of the GPU block diagram)

  • Generally speaking, a GPU has three important parts:
    • Control module
    • Computing module (the GPCs in the figure)
    • Output module (the FBPs in the figure)

Control module

Pushbuffer

  • The program issues draw call commands through a graphics API (DX, GL, WebGL); these are pushed to the driver, which checks the commands for validity and then places them into a Pushbuffer that the GPU can read.
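
The same queue-then-flush behaviour can be observed from the compute side. Below is a minimal CUDA sketch, offered only as an analogy to the graphics path described above (the kernel and buffer names are made up): launches are recorded asynchronously into a command buffer by the driver, and synchronizing forces the queued work to be submitted to the GPU and completed.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int *out) {
    // Trivial work so the launch actually reaches the GPU.
    if (threadIdx.x == 0) out[blockIdx.x] = blockIdx.x;
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 64 * sizeof(int));

    // Kernel launches are asynchronous: the driver records them into a
    // command buffer and returns to the CPU immediately.
    for (int i = 0; i < 64; ++i) {
        dummyKernel<<<64, 32>>>(d_out);
    }

    // Synchronizing forces the queued commands to be flushed to the GPU
    // and waits until they have all been executed.
    cudaDeviceSynchronize();

    printf("queued work submitted and completed\n");
    cudaFree(d_out);
    return 0;
}
```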

Host Interface and Front End

  • After a period of time or an explicit call to flush, the driver sends the contents of the Pushbuffer to the GPU, which receives the commands through the Host Interface and processes them through the Front End.

Primitive Distributor

  • Work distribution starts in the Primitive Distributor, which processes the vertices in the IndexBuffer, produces batches of triangles, and sends them to multiple GPCs. In other words, the n submitted triangles are split up and assigned across the GPCs so that they can be processed simultaneously.

Computing module

The GPC and TPC

  • The GPC is the core computing component of the GPU; it is where shader computation takes place. Each GPC contains:
    • A Raster Engine, which performs rasterization
    • A number of Texture Processing Clusters (TPCs)

TPC

  • In general, a TPC contains:
    • A number of Texture Units for texture sampling
    • A Primitive Engine (PE) that receives data from the upstream Primitive Distributor (PD). As a fixed-function unit, the PE fetches the corresponding vertex attributes (Vertex Attribute Fetch) according to the vertex indices sent by the PD, and handles vertex attribute interpolation, vertex culling, and similar operations
    • A module responsible for loading shaders
    • A number of computing units, called Streaming Multiprocessors (SMs; AMD calls them CUs), that execute the shader programs

SM

  • As shown in the figure above, a single SM on some GPUs (for example, some Fermi models) contains the following (the sketch after this list shows how to query the corresponding figures for other GPUs):
    • 32 Cores (also called Stream Processors)
    • 16 LD/ST (Load/Store) units to load and store data
    • 4 SFUs (Special Function Units) that perform special mathematical operations (sin, cos, log, etc.)
    • A Register File (128 KB)
    • 64 KB of L1 cache
    • A Uniform Cache
    • Texture read units
    • A Texture Cache
    • A PolyMorph Engine: the polygon engine that handles Attribute Setup, Vertex Fetch, tessellation (surface subdivision), and the Viewport Transform (think of this module as dealing specifically with vertex-related work)
    • 2 Warp Schedulers: responsible for warp scheduling. A warp is made up of 32 threads; the instructions selected by a Warp Scheduler are sent to the Cores for execution by its Dispatch Units
    • An Instruction Cache
    • An Interconnect Network
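
The numbers above are for a Fermi-class SM. If you want the corresponding figures for the GPU you are actually running on, a minimal sketch like the one below can query them through the CUDA runtime (the fields shown are standard members of cudaDeviceProp; error checking is omitted):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("GPU:                  %s\n", prop.name);
    printf("SM count:             %d\n", prop.multiProcessorCount);
    printf("Warp size:            %d threads\n", prop.warpSize);
    printf("Registers per SM:     %d (32-bit)\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("L2 cache size:        %d bytes\n", prop.l2CacheSize);
    return 0;
}
```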

PolyMorph Engine

  • Within a GPC, the PolyMorph Engine in each SM is responsible for fetching vertex data according to the triangle indices (the Vertex Fetch module in the figure).

Warp scheduling: VS and PS

  • SIMD: Single Instruction, Multiple Data.
  • On NVIDIA these thread groups are called Warps and contain 32 threads; on AMD they are called Wavefronts and contain 64 threads.
  • In the vertex shader, threads are grouped 32 vertices at a time; in the pixel shader, they are grouped as 8 pixel quads (2 x 2 pixel blocks), i.e. 32 pixels at a time.
  • After the data has been fetched, a warp of 32 threads is scheduled on the SM to start processing the vertex data. A warp is the typical unit of Single Instruction, Multiple Threads (SIMT, an upgrade of SIMD's single instruction, multiple data): at any given moment all 32 threads execute the same instruction, only on different data. The benefit is that a warp needs just one set of instruction decode and dispatch logic, so the chip can be smaller and faster; this works because the tasks the GPU has to process are naturally parallel.
  • The SM's warp scheduler issues instructions to the whole warp in order, and the threads within a warp execute each instruction in lock-step; threads that are not active are masked out. A thread can be masked for several reasons: for example, the current instruction belongs to the if(true) branch but the thread's data makes the condition false, or the loop counts differ between threads (n is not constant, or one thread terminated early via break while others are still looping). This is why branches in a shader can increase execution time significantly: unless all 32 threads of a warp take the if path or all take the else path, the warp ends up executing both branches. Threads therefore do not execute instructions independently; warps do, and warps are independent of each other. (A minimal CUDA sketch of branch divergence follows below.)
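
As promised above, here is a minimal CUDA sketch of branch divergence. The kernel and buffer names are made up for illustration; the point is only that a data-dependent branch inside a warp makes the hardware run both paths with inactive threads masked out.

```cpp
#include <cuda_runtime.h>

__global__ void divergentKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads of the same 32-thread warp evaluate this condition on
    // different data. If some take the 'if' and others the 'else', the
    // warp executes BOTH paths, masking the inactive threads each time,
    // so the cost is roughly the sum of the two branches.
    if (in[i] > 0.0f) {
        out[i] = in[i] * 2.0f;     // path A
    } else {
        out[i] = -in[i] + 1.0f;    // path B
    }
}

// In contrast, branching on a per-warp quantity such as (threadIdx.x / 32)
// keeps every warp on a single path and causes no divergence.
```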

LD/ST Core SFU

  • An SM contains several floating-point arithmetic cores (Cores), several transcendental-function units (SFUs), and several read/write units (LD/ST); a small sketch of how the SFUs are reached from CUDA follows after this list.
  • The instructions of a warp may complete in a single issue or be scheduled over several issues; for example, the LD/ST (load/store) units in an SM are usually significantly fewer than the basic arithmetic units, so a memory instruction for a full warp has to be issued over multiple cycles.
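
To make the SFU concrete, here is a small hedged sketch in CUDA: the fast-math intrinsics __sinf and __logf map to single SFU instructions at reduced precision, while the standard library calls expand into longer instruction sequences executed mostly on the ordinary cores.

```cpp
#include <cuda_runtime.h>

__global__ void sfuDemo(const float *x, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Library calls: accurate, but compiled to longer sequences that run
    // mainly on the ordinary arithmetic cores.
    accurate[i] = sinf(x[i]) + logf(x[i] + 2.0f);

    // __sinf / __logf are hardware intrinsics executed on the SFUs:
    // roughly one fast instruction each, at reduced precision.
    fast[i] = __sinf(x[i]) + __logf(x[i] + 2.0f);
}
```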

Memory types of the GPU

  • Register File
    • Some instructions take much longer to complete than others, memory loads in particular. When a warp is waiting on memory, the warp scheduler can simply switch to another warp that is not waiting; this is the key to how the GPU overcomes memory read latency: it just switches the active thread group. To make this switch very fast, every warp managed by the scheduler keeps its own registers in the register file. There is a trade-off here: the more registers a shader needs, the less register-file space is left for other warps, so fewer warps can be resident; when a memory stall then hits, there may be no other warp to switch to and the SM simply waits. (A small sketch of this occupancy trade-off follows after this list.)
  • L1 Cache
    • Shared Memory (essentially a block of L1 Cache)
      • A block of Shared Memory that the compute pipeline can access directly (a small sketch of its use follows after this list)
    • Texture Cache (also essentially a block of L1 Cache)
    • Instruction Cache (essentially a block of L1 Cache)
  • L2 Cache
  • DRAM
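
The register/latency-hiding trade-off described under Register File can be inspected with the CUDA occupancy API. A minimal sketch (the kernel below is hypothetical; only the API call matters):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the more registers the compiler assigns to it, the
// fewer warps each SM can keep resident to hide memory latency.
__global__ void someShaderLikeKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    int blocksPerSM = 0;
    // Ask the runtime how many 256-thread blocks of this kernel fit on one
    // SM, given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, someShaderLikeKernel,
        256 /* threads per block */, 0 /* dynamic shared memory */);

    printf("Resident blocks per SM: %d (%d warps available for latency hiding)\n",
           blocksPerSM, blocksPerSM * 256 / 32);
    return 0;
}
```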
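
And here is the promised sketch of Shared Memory in use: the block stages a tile of the input in the software-managed part of the SM's L1 storage so that neighbouring threads can reuse it without going back to L2/DRAM. A simple 1D three-tap blur, assuming a block size of 256 threads.

```cpp
#include <cuda_runtime.h>

// Assumes a launch such as blur1D<<<gridSize, 256>>>(in, out, n).
__global__ void blur1D(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];        // 256 elements + one halo cell per side

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;             // +1 leaves room for the left halo cell

    // Stage this block's slice of the input (plus halos) in Shared Memory.
    if (gid < n) tile[lid] = in[gid];
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();                       // make the tile visible to the whole block

    // Each output reads three neighbouring values from Shared Memory
    // instead of issuing three separate loads to L2/DRAM.
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```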

Raster Engines

  • After the Viewport Transform, the triangles are split up and distributed to multiple GPCs. The screen area a triangle covers determines which Raster Engines it is assigned to; each Raster Engine covers multiple tiles of the screen. This effectively divides the rendering of a triangle across multiple tiles: the pixel stage switches from dividing work by triangle to dividing work by screen pixels.
  • The Raster Engines in a GPC work on the triangles they receive and are responsible for generating the pixel information for those triangles (they also handle clipping, back-face culling, and Early-Z culling). A simplified software sketch of the per-pixel coverage test follows below.
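
For intuition, here is a very simplified software sketch of the coverage test a Raster Engine performs per screen tile. Real hardware does this in fixed function, hierarchically and in 2x2 pixel quads; the code below only shows the basic edge-function idea and assumes a 16x16 thread block mapped to a 16x16 tile that lies fully inside the framebuffer.

```cpp
#include <cuda_runtime.h>

struct Tri { float2 a, b, c; };            // screen-space vertices

// Positive when p lies to the left of the directed edge v0 -> v1
// (counter-clockwise triangle winding assumed).
__device__ float edgeFn(float2 v0, float2 v1, float2 p) {
    return (v1.x - v0.x) * (p.y - v0.y) - (v1.y - v0.y) * (p.x - v0.x);
}

// One thread per pixel: test the pixel centre against the three triangle
// edges and write a coverage flag for this tile.
__global__ void rasterizeTile(Tri tri, int tileX, int tileY, int width,
                              unsigned char *coverage) {
    int px = tileX + threadIdx.x;
    int py = tileY + threadIdx.y;
    float2 p = make_float2(px + 0.5f, py + 0.5f);

    bool inside = edgeFn(tri.a, tri.b, p) >= 0.0f &&
                  edgeFn(tri.b, tri.c, p) >= 0.0f &&
                  edgeFn(tri.c, tri.a, p) >= 0.0f;

    coverage[py * width + px] = inside ? 1 : 0;
}
```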

Viewport Transform

  • Once a warp has completed all of its vertex-shader instructions, the results are processed by the Viewport Transform module: triangles are clipped and prepared for rasterization, and the GPU uses the L1 and L2 caches to pass data between the vertex shader and the pixel shader. (A sketch of the transform itself follows below.)
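
The Viewport Transform itself is just the mapping from clip space through normalized device coordinates to window coordinates. A sketch, using the common OpenGL-style convention (D3D differs in the y direction and depth range); the struct and function names are illustrative:

```cpp
#include <cuda_runtime.h>

struct Viewport { float x, y, width, height, zNear, zFar; };

__device__ float3 viewportTransform(float4 clipPos, Viewport vp) {
    // Perspective divide: clip space -> normalized device coordinates (NDC).
    float invW = 1.0f / clipPos.w;
    float xNdc = clipPos.x * invW;
    float yNdc = clipPos.y * invW;
    float zNdc = clipPos.z * invW;

    // NDC in [-1, 1] -> window coordinates inside the viewport rectangle.
    float xw = vp.x + (xNdc * 0.5f + 0.5f) * vp.width;
    float yw = vp.y + (yNdc * 0.5f + 0.5f) * vp.height;
    float zw = vp.zNear + (zNdc * 0.5f + 0.5f) * (vp.zFar - vp.zNear);

    return make_float3(xw, yw, zw);
}
```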

Output module

  • The Framebuffer Partition (FBP) is relatively simple. Its core component is the Render Output Unit (ROP), which contains two sub-units, CROP (Color ROP) and ZROP; a sketch of the per-pixel arithmetic they implement follows after this list.
    • CROP is responsible for alpha blending, MSAA resolve, and writing the final color to the color buffer.
    • ZROP is responsible for the stencil/Z test and for writing depth/stencil values to the Z buffer.
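
The per-pixel arithmetic behind CROP and ZROP is fixed function in hardware, but it is simple enough to sketch. Below is an illustrative (not a hardware API) version of a less-than depth test followed by standard "source alpha, one minus source alpha" blending:

```cpp
#include <cuda_runtime.h>

struct Fragment { float4 color; float depth; };

__device__ void ropWrite(Fragment frag, float4 *colorBuf, float *depthBuf, int idx) {
    // ZROP: depth test against the Z buffer (less-than convention).
    if (frag.depth >= depthBuf[idx]) return;   // fail: fragment is discarded
    depthBuf[idx] = frag.depth;                // pass: depth is written back

    // CROP: alpha blend the fragment with the existing color buffer contents.
    float a = frag.color.w;
    float4 dst = colorBuf[idx];
    colorBuf[idx] = make_float4(
        frag.color.x * a + dst.x * (1.0f - a),
        frag.color.y * a + dst.y * (1.0f - a),
        frag.color.z * a + dst.z * (1.0f - a),
        frag.color.w * a + dst.w * (1.0f - a));
}
```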