Unlike a cloud GPU, the Nvidia Jetson has a shared memory architecture in which the CPU and GPU use the same physical memory. To reduce the overhead of memory copies, here is a review of three memory management strategies, showing how Unified Memory benefits from the shared memory architecture.

Pinned Memory

In the CUDA architecture, host-side memory is divided into pageable memory and page-locked (pinned) memory. Pageable memory is allocated with the standard operating system allocator malloc(), while page-locked memory is allocated with the CUDA function cudaHostAlloc(). The important property of page-locked memory is that the host operating system will never page it out or swap it, ensuring that it always resides in physical memory.

The GPU knows the physical address of page-locked memory and can use Direct Memory Access (DMA) to copy data between the host and the GPU at a higher rate. Because physical memory is reserved for every page-locked allocation and can never be swapped to disk, page-locked memory consumes physical memory more aggressively than pageable memory allocated with standard malloc().
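Below is a minimal, self-contained sketch of pinned memory in practice. The buffer size and the names h_pinned and d_data are illustrative, not from the original; the point is that pinned memory comes from cudaHostAlloc(), is freed with cudaFreeHost(), and is what makes truly asynchronous copies via cudaMemcpyAsync() possible:

#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;
    float *h_pinned, *d_data;

    // Page-locked host allocation: the OS will not page it out, so the
    // GPU's DMA engine can read it directly at its physical address.
    cudaHostAlloc((void **)&h_pinned, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_data, N * sizeof(float));

    for (size_t i = 0; i < N; ++i) h_pinned[i] = 1.0f;

    // Asynchronous copies require pinned memory; with pageable memory the
    // runtime falls back to staging through an internal pinned buffer.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_data, h_pinned, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_pinned);  // pinned memory is freed with cudaFreeHost()
    return 0;
}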

Three memory management strategies

When the CPU and GPU share physical memory, as on Jetson, three strategies are available:

Traditional memory: in this strategy, CUDA programs must explicitly copy data between CPU and GPU memory.

Zero-copy memory: the CPU and GPU access the same memory region, avoiding both the GPU-side allocation and the copies between CPU and GPU (a minimal sketch appears after this list). However, under this strategy the GPU disables all caching of this memory, a behavior that is not described in the official CUDA documentation and has only been reported by Nvidia (due to their consistency model and concerns about maintaining cache coherence).

Unified Memory: similar to zero-copy, the CPU and GPU use the same pointer. On hardware where memory is physically shared, the pointer really does refer to the same memory; where it is not shared, Unified Memory is a layer of encapsulation over copies.
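To make the zero-copy strategy concrete, here is a minimal sketch (the kernel scale_kernel and all names are illustrative, not from the original). The host buffer is allocated as mapped pinned memory, the GPU receives an alias to it through cudaHostGetDevicePointer(), and no cudaMemcpy is ever issued:

#include <cuda_runtime.h>

__global__ void scale_kernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main(void) {
    const int N = 1024;
    float *h_data, *d_alias;

    // Must be set before the CUDA context is created to allow mapped memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Mapped, page-locked allocation: one region visible to CPU and GPU.
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    // Device-side pointer to the same physical memory; no copy is made.
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);
    scale_kernel<<<(N + 255) / 256, 256>>>(d_alias, N);
    cudaDeviceSynchronize();  // after this, the results are visible to the CPU

    cudaFreeHost(h_data);
    return 0;
}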

In summary, if you want to take advantage of the CPU/GPU shared memory feature, use Unified Memory.

// Explicit memory management
void *data, *d_data;
data = malloc(N);
cudaMalloc(&d_data, N);

cpu_func1(data, N);                                   // process on the CPU
cudaMemcpy(d_data, data, N, cudaMemcpyHostToDevice);  // copy input to the GPU
gpu_func2<<<...>>>(d_data, N);                        // process on the GPU
cudaMemcpy(data, d_data, N, cudaMemcpyDeviceToHost);  // copy the result back
cudaFree(d_data);
free(data);
// Unified Memory
void *data;
cudaMallocManaged(&data, N);  // one allocation, visible to both CPU and GPU

cpu_func1(data, N);           // process on the CPU
gpu_func2<<<...>>>(data, N);  // process on the GPU through the same pointer
cudaDeviceSynchronize();      // wait for the GPU before the CPU touches data again
cudaFree(data);               // managed memory is freed with cudaFree