  • 1. Where is shared memory
  • 2. What are the features of shared memory
  • 3. Handling concurrent access between threads
    • 3.1. Concurrent access characteristics
    • 3.2. Size of a memory bank
    • 3.3. Inter-thread synchronization of memory access
    • 3.4. Declaring and using static or dynamic shared memory
  • 4. Data transfer between shared memory and global memory

Ref: docs.nvidia.com (shared memory)

1. Where is shared memory

The following figure shows where shared memory sits in a Turing-architecture GPU. It is one of the largest blocks of memory on the chip and exists in units of SMs: each streaming multiprocessor has its own partition.
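As a quick way to see how much shared memory a given device actually offers, the CUDA runtime reports the per-block and per-SM limits; a minimal sketch (not from the original post) is:

// Minimal sketch: query the shared-memory limits of device 0 via the
// standard CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // device 0
  printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
  printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
  return 0;
}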

2. What are the features of shared memory

  • Unlike global memory, shared memory resides on the chip, so access is much faster: up to roughly 100 times lower latency than uncached global memory.
  • In hardware, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Any load or store of n addresses that falls into n different banks can therefore be serviced at the same time, giving an effective bandwidth n times that of a single bank.

Using these characteristics properly makes it possible to implement specific access patterns much more efficiently.

3. Handling concurrent access between threads

3.1. Concurrent access characteristics

These properties determine how shared memory should be accessed. When the accesses of different threads are spread across different memory banks, the effective bandwidth reaches its maximum. Conversely, when different threads hit the same bank at different addresses, the accesses are serialized and the effective bandwidth drops by a corresponding factor. The one exception is when multiple threads of a warp access the same location within a bank: the value is broadcast, so those accesses still complete concurrently.
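To make the bank behavior concrete, here is a minimal sketch (the kernel names and the padding trick are illustrative additions, not from the original post), assuming a block of 32 threads: the unpadded column read puts every thread of the warp in the same bank and is serialized, while one extra float of padding per row spreads the same logical column across 32 different banks.

// columnRead: reading one column of a 32x32 tile means each thread's address
// is 32 floats apart, so all 32 accesses land in the same bank (32-way conflict).
__global__ void columnRead(const float *in, float *out) {
  __shared__ float tile[32][32];          // 32 banks of 4-byte words
  int t = threadIdx.x;
  for (int r = 0; r < 32; ++r)
    tile[r][t] = in[r * 32 + t];          // row-wise store: conflict-free
  __syncthreads();
  out[t] = tile[t][0];                    // column read: serialized
}

// paddedColumnRead: the +1 padding shifts each row by one bank, so the same
// column read now touches 32 distinct banks and is serviced in parallel.
__global__ void paddedColumnRead(const float *in, float *out) {
  __shared__ float tile[32][32 + 1];
  int t = threadIdx.x;
  for (int r = 0; r < 32; ++r)
    tile[r][t] = in[r * 32 + t];
  __syncthreads();
  out[t] = tile[t][0];                    // conflict-free
}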

3.2. Size of a memory bank

Devices of compute capability 3.x and above (the GeForce GTX 1060 used here is compute capability 6.1) expose a configurable bank size. The bank size is selected by passing cudaSharedMemBankSizeEightByte or cudaSharedMemBankSizeFourByte to cudaDeviceSetSharedMemConfig. Choosing cudaSharedMemBankSizeEightByte helps avoid bank conflicts when accessing double-precision data. This also shows why parallel accesses to different shared-memory addresses within a warp need not conflict: a warp issues 32 threads and the hardware provides 32 banks, so each thread can map to its own bank.
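As a minimal sketch (the helper function name is illustrative; the runtime calls themselves are the standard CUDA API), the bank size could be selected like this before launching kernels that operate on doubles, on devices where the bank size is actually configurable:

// Select 8-byte banks so that consecutive doubles fall into consecutive
// banks, avoiding the two-way conflicts that 4-byte banks would cause.
#include <cuda_runtime.h>

void configureBanksForDoubles() {
  cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}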

3.3. Inter-thread synchronization of memory access

Unsynchronized, competing access to shared memory results in undefined behavior. For example, a block appears to launch its threads in parallel, but the hardware actually issues them in batches (warps). So after thread A and thread B have each loaded data from global memory into shared memory, a race can occur if thread A reads the shared-memory element written by thread B before that write has completed. CUDA therefore provides the __syncthreads() synchronization primitive, which makes every thread of the block wait at that point until all threads of the block have reached it; the reverse kernels in section 3.4 show it in use.

3.4. Declaring and using static or dynamic shared memory

Static versus dynamic refers to whether the size of the shared-memory allocation is fixed at compile time or supplied at launch time. The differences in usage are as follows:

  • Static shared memory
    // Launch the CUDA kernel (n must not exceed 64, the static array size)
    staticReverse<<<1, n>>>(ptr, n);
    
    // Kernel: reverse an array of n ints using a statically sized shared array
    __global__ void staticReverse(int *d, int n) {
      __shared__ int s[64];        // size fixed at compile time
      int t = threadIdx.x;
      int tr = n - t - 1;
      s[t] = d[t];                 // stage the input in shared memory
      __syncthreads();             // wait until every thread has written s[]
      d[t] = s[tr];                // write back in reversed order
    }

    The __shared__ keyword is used here; the array size must be a compile-time constant.

  • Dynamic shared memory
    // Launch the CUDA kernel; the third launch parameter is the
    // dynamic shared-memory size in bytes
    dynamicReverse<<<1, n, n * sizeof(int)>>>(ptr, n);
    
    // Kernel: same reversal, but the shared array size is set at launch time
    __global__ void dynamicReverse(int *d, int n) {
      extern __shared__ int s[];   // size comes from the launch configuration
      int t = threadIdx.x;
      int tr = n - t - 1;
      s[t] = d[t];
      __syncthreads();
      d[t] = s[tr];
    }

    The extern __shared__ keyword is used here, and the size in bytes is supplied as the third launch-configuration parameter. Note that when a kernel needs several shared-memory arrays, extern __shared__ may only be declared once; the single block has to be split manually into the individual arrays, as in the sketch below.
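    As a minimal sketch (the kernel name and array sizes are illustrative, not from the original), two arrays can be carved out of the single dynamic block by hand:

    // One extern __shared__ block, manually split into an int array followed
    // by a float array.
    __global__ void twoSharedArrays(int nInt, int nFloat) {
      extern __shared__ char smem[];                       // one dynamic block
      int   *iArr = (int *)smem;                           // first nInt ints
      float *fArr = (float *)(smem + nInt * sizeof(int));  // then nFloat floats
      int t = threadIdx.x;
      if (t < nInt)   iArr[t] = t;
      if (t < nFloat) fArr[t] = 0.5f * t;
      __syncthreads();
      // ... use iArr and fArr ...
    }

    // Launch: the third parameter must cover both arrays
    // twoSharedArrays<<<1, 128, nInt * sizeof(int) + nFloat * sizeof(float)>>>(nInt, nFloat);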

4. Data transfer between shared memory and global memory

CUDA 11.0 introduces asynchronous copy. With an asynchronous copy, data can be transferred from global memory directly into shared memory, bypassing the register file and the L1 cache. Sample code is as follows:

// Each thread asynchronously copies copy_count elements from global to shared
// memory. __pipeline_memcpy_async (declared in <cuda_pipeline.h>) issues the
// copy without staging the data in registers.
for (size_t i = 0; i < copy_count; ++i) {
  __pipeline_memcpy_async(&shared[blockDim.x * i + threadIdx.x],
                          &global[blockDim.x * i + threadIdx.x], sizeof(T));
}
__pipeline_commit();        // submit the batch of asynchronous copies
__pipeline_wait_prior(0);   // block until the submitted copies have completed