• 1. The default stream
    • 1.1. Default stream behavior before CUDA 7
    • 1.2. Default stream behavior in CUDA 7
      • 1.2.1. How to enable the per-thread default stream
    • 1.3. Development tips for default streams
  • 2. Asynchronous streams
    • 2.1. Ways to synchronize streams
      • 2.1.1. Brute-force synchronization
      • 2.1.2. Synchronizing individual streams and events
    • 2.2. Asynchronous memory copies in streams

1. The default stream

Ref: GPU Pro Tip: CUDA 7 streams simplify concurrency – NVIDIA Developer Blog

The default stream is fine to use when a program is not performance-critical.

1.1. Default stream behavior before CUDA 7

Each device has a single default stream shared by all host threads, which causes implicit synchronization between them. Work in the default stream cannot begin until all previously issued work in the regular streams has completed, and work in regular streams cannot begin until the default stream is done: the default stream synchronizes with every regular stream on the device.

1.2. Default stream behavior in CUDA 7

With CUDA 7, each host thread can have its own default stream. This per-thread default stream behaves like a regular stream and does not implicitly synchronize with other streams.

1.2.1. How to enable the per-thread default stream

To enable the per-thread default stream in CUDA 7 and later, either pass the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including the CUDA headers (cuda.h or cuda_runtime.h). Note: when code is compiled by nvcc, you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.
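Both ways of enabling the behavior, shown as a minimal sketch (the file name my_app.cu is only a placeholder):

// Option 1: compile with the nvcc flag (no source changes needed):
//     nvcc --default-stream per-thread my_app.cu -o my_app

// Option 2: in a host source file compiled by the host compiler
// (not a .cu file built by nvcc), define the macro before the CUDA headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>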

The following figure shows NVVP timelines on a Kepler GPU before and after adding --default-stream per-thread to the nvcc command line for the code below. Without the option, the dummy kernels launched on the default stream cause the kernels in the regular streams to serialize.

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);

        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    cudaDeviceReset();

    return 0;
}
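For reference, a multi-threaded variant in the spirit of the referenced blog's second example (a sketch using std::thread rather than pthreads; with --default-stream per-thread, each host thread launches into its own default stream and the launches can run concurrently):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        x[i] = sqrt(pow(3.14159, i));
}

void launch_kernel()
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    // With --default-stream per-thread, this launch goes to this host
    // thread's own default stream and can overlap with other threads' work.
    kernel<<<1, 64>>>(data, N);

    cudaStreamSynchronize(0);  // synchronize only this thread's default stream
    cudaFree(data);
}

int main()
{
    const int num_threads = 8;
    std::vector<std::thread> threads;

    for (int i = 0; i < num_threads; i++)
        threads.emplace_back(launch_kernel);

    for (auto &t : threads)
        t.join();

    cudaDeviceReset();
    return 0;
}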

1.3. Development tips for default streams

  • Remember: with per-thread default streams, the default stream in each thread behaves the same as a regular stream as far as synchronization and concurrency go. With the legacy default stream this is not true.
  • The --default-stream option is applied per compilation unit, so be sure to apply it to all nvcc command lines that need it.
  • cudaDeviceSynchronize() continues to synchronize everything on the device, even with the per-thread default stream option. If you only want to synchronize a single stream, use cudaStreamSynchronize(cudaStream_t stream), as in the multi-threaded sketch above.
  • Starting with CUDA 7, you can also explicitly access the per-thread default stream with the handle cudaStreamPerThread, and the legacy default stream with the handle cudaStreamLegacy. Note that cudaStreamLegacy still synchronizes implicitly with the per-thread default streams if you happen to mix them in a program.
  • You can create non-blocking streams that do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreateWithFlags(); see the sketch after this list.
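A minimal sketch of the last three tips (error checking omitted); cudaStreamPerThread, cudaStreamLegacy, and cudaStreamNonBlocking are the CUDA runtime's own handles and flag:

#include <cuda_runtime.h>

__global__ void worker(float *x) { if (x) x[threadIdx.x] = threadIdx.x; }

int main()
{
    // A non-blocking stream does not synchronize with the legacy default stream.
    cudaStream_t nb;
    cudaStreamCreateWithFlags(&nb, cudaStreamNonBlocking);

    worker<<<1, 32, 0, nb>>>(nullptr);                   // regular (non-blocking) stream
    worker<<<1, 32, 0, cudaStreamPerThread>>>(nullptr);  // this thread's default stream
    worker<<<1, 32, 0, cudaStreamLegacy>>>(nullptr);     // legacy default stream

    // Synchronize just one stream instead of the whole device.
    cudaStreamSynchronize(nb);

    cudaStreamDestroy(nb);
    cudaDeviceSynchronize();  // still waits for everything on the device
    return 0;
}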


2. Asynchronous streams

2.1. Ways to synchronize streams

All of these approaches block the CPU thread until the GPU finishes the corresponding work; in most cases this severely limits host-thread performance, so use them sparingly.

2.1.1. Brute-force synchronization

cudaDeviceSynchronize();  // waits for all previously issued work on the device, in every stream

2.1.2. Synchronizing individual streams and events

cudaStreamWaitEvent(stream, event, 0);  // make a stream wait on an event recorded in another stream
cudaStreamSynchronize(stream);          // block the host until the stream's work is done
cudaStreamQuery(stream);                // non-blocking: has the stream's work completed?
cudaEventSynchronize(event);            // block the host until the event has occurred
cudaEventQuery(event);                  // non-blocking: has the event occurred?
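As a sketch of how these calls fit together, the pattern below makes one stream wait on the device for work issued in another by recording an event (the kernels and variable names are illustrative only):

#include <cuda_runtime.h>

// Hypothetical kernels for illustration only.
__global__ void producer(float *buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void consumer(float *buf) { buf[threadIdx.x] += 1.0f; }

int main()
{
    float *buf;
    cudaMalloc(&buf, 64 * sizeof(float));

    cudaStream_t s1, s2;
    cudaEvent_t done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&done);

    producer<<<1, 64, 0, s1>>>(buf);   // work issued in s1
    cudaEventRecord(done, s1);         // mark the point in s1 to wait for

    cudaStreamWaitEvent(s2, done, 0);  // s2 waits (on the device) until 'done' has fired
    consumer<<<1, 64, 0, s2>>>(buf);   // safe: runs after the producer has finished

    cudaStreamSynchronize(s2);         // block the host only until s2 is done

    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(buf);
    return 0;
}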

2.2. Asynchronous memory copies in streams

An asynchronous copy does not make an individual transfer any faster; its benefit is that the copy can overlap with kernel execution and with copies in other streams. Note that cudaMemcpyAsync is only truly asynchronous when the host buffer is page-locked (pinned) memory.
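A minimal sketch of the usual pattern, assuming pinned host memory allocated with cudaMallocHost so the copies can overlap with work in other streams (the kernel and names are illustrative):

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h, *d;
    cudaStream_t s;

    cudaStreamCreate(&s);
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory, required for true async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);  // H2D copy in stream s
    process<<<(n + 255) / 256, 256, 0, s>>>(d, n);                        // kernel in the same stream
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);  // D2H copy in stream s

    cudaStreamSynchronize(s);  // host waits only for this stream

    cudaFree(d);
    cudaFreeHost(h);
    cudaStreamDestroy(s);
    return 0;
}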