Welcome to follow my public account [Jizhi Vision]; reply 001 to get the Google coding style guide.


This article discusses how to use memory padding in GPU CUDA programming to avoid bank conflicts.

1. Shared memory

Shared memory is a small, low-latency on-chip memory that is hundreds of times faster than global memory. It can be used as a programmable cache, typically in three roles:

  • An intra-block communication channel between threads;
  • A program-managed cache for global memory data;
  • Scratch pad memory that reduces global memory traffic by caching data and improving access patterns.

Shared memory can be allocated statically or dynamically, and can be declared either inside a kernel or at file scope, using the following qualifier:

__shared__     // declaration qualifier

__shared__ float tile[_y][_x];     // static declaration of a 2D float array

// dynamic declaration; the size is supplied at kernel launch
extern __shared__ int tile[];

kernel<<<grid, block, isize * sizeof(int)>>>(...);

To achieve high bandwidth, shared memory is divided into 32 equally sized memory banks, corresponding to the threads of a warp, which can be accessed simultaneously.

2. Use memory padding to avoid bank conflicts

Shared memory is as fast as registers if there are no bank conflicts.

Fast cases:

  • All threads in a warp access different banks: no conflict;
  • All threads in a warp read the same address: the broadcast mechanism applies, no conflict.

Slow cases:

  • Bank conflict: multiple threads in a warp access different addresses in the same bank;
  • Such accesses must be serialized;
  • The cost is determined by the maximum number of threads accessing the same bank simultaneously.

For example, consider the following two ways of accessing a block of shared memory:

There is no bank conflict:

int x_id = blockDim.x * blockIdx.x + threadIdx.x;     // column index
int y_id = blockDim.y * blockIdx.y + threadIdx.y;     // row index
int index = y_id * col + x_id;

__shared__ float sData[BLOCKSIZE][BLOCKSIZE];

if(x_id < col && y_id < row)
{
    sData[threadIdx.y][threadIdx.x] = matrix[index];
    __syncthreads();
    matrixTest[index] = sData[threadIdx.y][threadIdx.x];
}

Bank conflict:

int x_id = blockDim.x * blockIdx.x + threadIdx.x;     // column index
int y_id = blockDim.y * blockIdx.y + threadIdx.y;     // row index
int index = y_id * col + x_id;

__shared__ float sData[BLOCKSIZE][BLOCKSIZE];

if(x_id < col && y_id < row)
{
    sData[threadIdx.x][threadIdx.y] = matrix[index];
    __syncthreads();
    matrixTest[index] = sData[threadIdx.x][threadIdx.y];
}

The only difference between the two examples is that the shared-memory indices are swapped, yet that small change is enough to turn conflict-free access into a bank conflict. Now let's see how to fix it.

This bank conflict can be avoided simply by padding the shared-memory array with one extra column:

Use memory padding to improve the performance of bank conflict code:

int x_id = blockDim.x * blockIdx.x + threadIdx.x;     // column index
int y_id = blockDim.y * blockIdx.y + threadIdx.y;     // row index
int index = y_id * col + x_id;

__shared__ float sData[BLOCKSIZE][BLOCKSIZE + 1];     // memory padding: one extra column per row

if(x_id < col && y_id < row)
{
    sData[threadIdx.x][threadIdx.y] = matrix[index];
    __syncthreads();
    matrixTest[index] = sData[threadIdx.x][threadIdx.y];
}

In short, memory padding is a simple and effective way to avoid bank conflicts in GPU CUDA programming.
