Performance analysis of different GPU direct data transports #374

Let's build a benchmark that stresses the following cases. Do them one at a time, not all at once. You should be able to develop **most** of these benchmarks on your laptop, test that they work in simple cases, and then transfer the code to Delta. Let's put the benchmark code under `context-transfer-engine/benchmarks/gpu_baseline` for now.

# Latency / bandwidth of local pinned host memory

Delta has multiple GPUs per node. How much performance can we get from writing data to pinned host memory? A GPU kernel that performs a strided memcpy to a pinned host memory region of configurable size would be great. The kernel should take as input: the blocks/threads for the kernel, the amount of data to memcpy per warp (group of 32 threads), the NUMA node to allocate pinned memory from, and the ID of the GPU to launch the copy kernel on.
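To make "test that they work in simple cases" concrete, here is a minimal host-side sketch of the access pattern such a kernel might use. It is a CPU reference, not the GPU kernel itself. The `CopyParams` struct, its field names, and the choice that each lane copies every 32nd byte of its warp's chunk are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // threads per warp

// Hypothetical parameter bundle; the field names are illustrative only.
struct CopyParams {
  std::size_t blocks;             // number of thread blocks in the launch
  std::size_t threads_per_block;  // assumed to be a multiple of 32
  std::size_t bytes_per_warp;     // amount of data each warp copies
  int numa_node;                  // NUMA node backing the pinned buffer
  int gpu_id;                     // GPU that would launch the kernel
};

// Number of warps created by the launch configuration.
inline std::size_t TotalWarps(const CopyParams &p) {
  assert(p.threads_per_block % kWarpSize == 0);
  return p.blocks * (p.threads_per_block / kWarpSize);
}

// Total bytes moved by one launch; this is also the buffer size needed.
inline std::size_t TotalBytes(const CopyParams &p) {
  return TotalWarps(p) * p.bytes_per_warp;
}

// CPU emulation of the assumed pattern: warp w owns the byte range
// [w * bytes_per_warp, (w + 1) * bytes_per_warp), and lane l copies
// bytes l, l + 32, l + 64, ... inside that range.
inline void ReferenceStridedCopy(const CopyParams &p,
                                 const std::vector<std::uint8_t> &src,
                                 std::vector<std::uint8_t> &dst) {
  assert(src.size() >= TotalBytes(p) && dst.size() >= TotalBytes(p));
  for (std::size_t warp = 0; warp < TotalWarps(p); ++warp) {
    const std::size_t base = warp * p.bytes_per_warp;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
      for (std::size_t i = lane; i < p.bytes_per_warp; i += kWarpSize) {
        dst[base + i] = src[base + i];
      }
    }
  }
}
```

A GPU kernel with the same indexing could be checked against this reference by comparing the pinned buffer to `dst` after the launch. `numa_node` and `gpu_id` are carried in the struct but unused here, since the reference runs entirely on the CPU.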

NUMA stands for Non-Uniform Memory Access. DRAM is divided into banks (NUMA nodes), and on a multi-GPU machine each bank sits closer to some GPUs than to others. You should see significant performance differences from selecting different NUMA nodes. A laptop will probably have only one NUMA node, but Delta has 4.

# Latency / bandwidth of remote HBM, single node

Let's say we have two GPUs on the same node. What is the latency/bandwidth of communication between them? I.e., how quickly can I read from and write to the other GPU's HBM? Let's use NVSHMEM to stress this. The kernel should take as input: the blocks/threads for the kernel, the amount of data to memcpy per warp (group of 32 threads), and the source/destination NUMA nodes.
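Since every test here reports latency and bandwidth, a small host-side helper for turning a timed launch into those two numbers may be useful. This is a sketch under assumptions: the `TransferSample` struct and its fields are hypothetical, 1 GB is taken as 1e9 bytes, and the mean time per transfer is only a rough proxy for latency when many warps run concurrently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical record of one timed kernel launch.
struct TransferSample {
  std::size_t bytes;  // total bytes moved by the launch
  double seconds;     // elapsed wall-clock time for the launch
  std::size_t ops;    // number of per-warp transfers issued
};

// Achieved bandwidth in GB/s, with 1 GB = 1e9 bytes.
inline double BandwidthGBps(const TransferSample &s) {
  assert(s.seconds > 0.0);
  return static_cast<double>(s.bytes) / s.seconds / 1e9;
}

// Mean time per transfer in microseconds. With concurrent warps this
// understates true latency, which is better measured with a single
// small transfer.
inline double MeanLatencyUs(const TransferSample &s) {
  assert(s.ops > 0);
  return s.seconds / static_cast<double>(s.ops) * 1e6;
}
```

For example, a launch that moves 4e9 bytes in 2 seconds works out to 2 GB/s.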

# Latency / bandwidth of HBM + local pinned host memory at the same time

Let's say we have two GPUs on the same node. We run the previous two tests simultaneously. The PCIe bus may get saturated and cause contention between the two tests. The observed bandwidth may be halved for each workload.
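One way to quantify the contention is to compare each workload's bandwidth when run alone against its bandwidth when both run together. The sketch below assumes a hypothetical `ContentionSample` struct; the numbers in the usage note are placeholders, not measurements.

```cpp
#include <cassert>

// Hypothetical pair of measurements for one workload, in GB/s.
struct ContentionSample {
  double isolated_gbps;    // bandwidth when the test runs alone
  double concurrent_gbps;  // bandwidth when both tests run together
};

// Fraction of the isolated bandwidth kept under contention.
// A value of 0.5 means the workload's bandwidth was halved.
inline double RetainedFraction(const ContentionSample &s) {
  assert(s.isolated_gbps > 0.0);
  return s.concurrent_gbps / s.isolated_gbps;
}
```

With placeholder values of 20 GB/s alone and 10 GB/s under contention, `RetainedFraction` returns 0.5. Computing it once for the pinned-memory test and once for the HBM test shows whether both workloads suffer equally.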

# Latency / bandwidth of remote memory, two nodes

Let's say we have two GPUs, each on a different node. What is the latency/bandwidth of communication here? This will expose network performance bottlenecks. We should have parameters for the following: the blocks/threads for the kernel, the source GPU ID, the destination GPU ID, and the amount of data to transfer per warp.

# Latency / bandwidth of remote pinned memory, two nodes

DeltaAI has a special interconnect called NVLink. It allows remote regions of pinned host memory to be mapped as pointers and passed to the GPU, so a kernel can effectively write data to another node's pinned memory. I'm curious what the performance of this is. The kernel should take as input: the blocks/threads for the kernel, the source GPU ID and NUMA node, and the destination GPU ID and NUMA node.
