• 1. Why optimize Device ↔ Host memory transfer
  • 2. Several ways to optimize memory transfer
    • 2.1. Use pinned (page-locked) memory on the host for device copies
    • 2.2. Overlapping computation and data transfer
    • 2.3. Zero-copy memory
    • 2.4. Unified Virtual Addressing (UVA)
    • 2.5. Unified Memory (UM)

1. Why optimize Device ↔ Host memory transfer

The peak theoretical bandwidth between device memory and the GPU (for example, 898 GB/s on NVIDIA Tesla V100) is far higher than the peak theoretical bandwidth between host memory and device memory (16 GB/s on PCIe x16 Gen3). To get the best overall application performance, it is therefore important to minimize data transfer between the host and the device, even if that means running kernels on the GPU that show no speedup compared with running them on the host CPU.
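As a rough way to see the host-to-device figure on your own machine, here is a minimal sketch that times a pageable cudaMemcpy with CUDA events and prints the effective bandwidth; the 256 MB buffer size is arbitrary and error checking is omitted:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t nBytes = 256 * 1024 * 1024;             // arbitrary 256 MB test buffer
    float *h = (float*)malloc(nBytes);
    float *d;
    cudaMalloc(&d, nBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, nBytes, cudaMemcpyHostToDevice);    // pageable Host -> Device copy
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n", (nBytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    free(h);
    return 0;
}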

2. Several ways to optimize memory transfer

2.1. Use pinned (page-locked) memory on the host for device copies

Page-locked (pinned) host memory achieves higher transfer bandwidth because the GPU can DMA from it directly, without first staging the data through an internal pageable buffer. A minimal sketch follows the notes below.

Note:

  1. Pinned memory should not be overused. Overuse can degrade overall system performance because pinned memory is a scarce resource, but it is hard to know in advance how much is too much.
  2. Pinning system memory is a relatively heavyweight operation, so allocating pinned memory is itself slow.
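
A minimal sketch of the pinned-memory path, using cudaMallocHost for the host buffer (buffer size is illustrative, error checking omitted):

#include <cuda_runtime.h>

int main() {
    const size_t nBytes = 64 * 1024 * 1024;              // illustrative buffer size
    float *h_pinned, *d_data;

    cudaMallocHost(&h_pinned, nBytes);                   // pinned (page-locked) host allocation
    cudaMalloc(&d_data, nBytes);

    // The copy can use DMA directly from the pinned buffer,
    // instead of staging through an internal pageable-to-pinned copy.
    cudaMemcpy(d_data, h_pinned, nBytes, cudaMemcpyHostToDevice);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);                              // pinned memory is freed with cudaFreeHost
    return 0;
}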

2.2. Overlapping computation and data transfer

  • cudaMemcpyAsync plus GPU kernels can run in parallel with computation on the CPU
  • Multiple streams can overlap transfers with kernel execution (a sketch follows the notes below)

Note:

  1. The GPU must support concurrent copy and kernel execution (cudaDeviceProp::asyncEngineCount > 0)
  2. If Host-to-Device and Device-to-Host transfers run concurrently, check whether the GPU has a dedicated copy engine for each direction (asyncEngineCount == 2)
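
A minimal two-stream sketch along these lines, assuming pinned host memory (required for cudaMemcpyAsync to be truly asynchronous) and a hypothetical kernel named process:

#include <cuda_runtime.h>

__global__ void process(float *data, int n) {            // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, nStreams = 2;
    const int chunk = N / nStreams;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));                // pinned, so async copies do not block
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        // Copy-in, kernel, and copy-out of one chunk are queued in the same stream;
        // chunks in different streams can overlap with each other.
        cudaMemcpyAsync(d + offset, h + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + offset, chunk);
        cudaMemcpyAsync(h + offset, d + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}

Splitting the buffer into chunks is what allows the copy of one chunk to overlap with the kernel of another; a timeline in Nsight Systems shows whether the overlap actually occurs on a given GPU.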

2.3. Zero-copy memory

Note: Zero-copy maps a region of host memory into the device's address space. Every device access to that memory then goes over the PCIe bus rather than to device memory, so on current discrete GPUs it is generally a poor choice for data that kernels access repeatedly.

float *a_h, *a_map;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                  // Query device properties
if (!prop.canMapHostMemory) exit(0);                // Check whether the device supports mapped (zero-copy) host memory
cudaSetDeviceFlags(cudaDeviceMapHost);              // Allow pinned host allocations to be mapped into the device address space
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);   // Allocate pinned host memory; cudaHostAllocMapped makes it zero-copy
cudaHostGetDevicePointer(&a_map, a_h, 0);           // Get the device pointer that aliases the host allocation
kernel<<<gridSize, blockSize>>>(a_map);             // The kernel reads/writes the host memory directly over PCIe
cudaFreeHost(a_h);                                  // Free the zero-copy memory

2.4. Unified Virtual Addressing (UVA)

With UVA, host memory and the device memory of all installed, UVA-capable devices share a single virtual address space.

Note: UVA does not merge the memories themselves; an application still keeps separate host and device pointers for data it explicitly copies. What UVA adds is that the runtime can tell from a pointer which space it belongs to (so cudaMemcpy can use cudaMemcpyDefault), and mapped pinned memory can be passed to kernels directly without cudaHostGetDevicePointer().
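
A minimal sketch of what UVA changes in practice, reusing the names (nBytes, gridSize, blockSize, kernel) assumed in the zero-copy snippet above:

float *a_h, *a_d;
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);   // mapped pinned host memory
cudaMalloc(&a_d, nBytes);

// Under UVA the host pointer itself is valid inside kernels:
kernel<<<gridSize, blockSize>>>(a_h);               // no cudaHostGetDevicePointer() needed

// Under UVA cudaMemcpy can infer the copy direction from the pointers:
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyDefault);

cudaFree(a_d);
cudaFreeHost(a_h);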

2.5. Unified Memory (UM)

Unified Memory is covered in a separate feature article in this column.
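
As a quick orientation until then, a minimal managed-memory sketch, again reusing the nBytes/gridSize/blockSize/kernel names assumed in the zero-copy snippet: cudaMallocManaged returns a single pointer that is valid on both host and device, and the runtime migrates pages on demand.

float *data;
cudaMallocManaged(&data, nBytes);                   // single pointer valid on host and device

for (size_t i = 0; i < nBytes / sizeof(float); ++i) // initialize directly on the host
    data[i] = 1.0f;

kernel<<<gridSize, blockSize>>>(data);              // use the same pointer on the device
cudaDeviceSynchronize();                            // wait before touching data on the host again

cudaFree(data);                                     // managed memory is freed with cudaFree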