Performance analysis of different GPU direct data transports #374

@lukemartinlogan

Let's build a benchmark that stresses the following cases. Do them one at a time, not all at once. You should be able to develop most of these benchmarks on your laptop, test that they work in simple cases, and then transfer the code to Delta. Let's put the benchmark code under context-transfer-engine/benchmarks/gpu_baseline for now.

Latency / bandwidth of local pinned host memory

The Delta machine has multiple GPUs per node. How much performance can we get from writing data to pinned host memory? A GPU kernel that performs a strided memcpy to a pinned host memory region of configurable size would be great. The kernel should take as input: the blocks/threads for the kernel, the amount of data to memcpy per warp (group of 32 threads), the NUMA node to allocate pinned memory from, and the ID of the GPU on which to launch the copy kernel.
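As a starting point, here is a minimal sketch of the copy kernel and a timing harness. The names `warp_copy_kernel` and `run_pinned_copy` are placeholders, not existing code in the repo. It assumes the thread count is a multiple of 32 and the per-warp size is a multiple of 16 bytes, and it leaves out NUMA placement (the buffer comes from plain `cudaHostAlloc`).

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Each warp owns one contiguous slice of `vecs_per_warp` 16-byte vectors.
// Lanes stride through the slice 32 vectors at a time so accesses coalesce.
__global__ void warp_copy_kernel(uint4 *dst, const uint4 *src,
                                 size_t vecs_per_warp) {
  size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t warp_id = tid / 32;
  size_t lane = tid % 32;
  size_t base = warp_id * vecs_per_warp;
  for (size_t i = lane; i < vecs_per_warp; i += 32) {
    dst[base + i] = src[base + i];
  }
}

// Hypothetical harness: copies from device memory into mapped pinned host
// memory and returns the achieved GB/s, or a negative value on error.
double run_pinned_copy(int blocks, int threads, size_t bytes_per_warp,
                       int gpu_id) {
  if (blocks <= 0 || threads <= 0 || threads % 32 != 0) return -1.0;
  if (bytes_per_warp == 0 || bytes_per_warp % sizeof(uint4) != 0) return -1.0;
  if (cudaSetDevice(gpu_id) != cudaSuccess) return -1.0;

  size_t num_warps = (size_t)blocks * threads / 32;
  size_t total = num_warps * bytes_per_warp;
  size_t vecs_per_warp = bytes_per_warp / sizeof(uint4);

  uint4 *d_src = nullptr, *h_dst = nullptr, *d_dst = nullptr;
  if (cudaMalloc((void **)&d_src, total) != cudaSuccess) return -1.0;
  if (cudaHostAlloc((void **)&h_dst, total, cudaHostAllocMapped) !=
      cudaSuccess) {
    cudaFree(d_src);
    return -1.0;
  }
  cudaMemset(d_src, 0xAB, total);  // recognizable byte pattern to verify later
  cudaHostGetDevicePointer((void **)&d_dst, h_dst, 0);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Warm-up launch so the timed run excludes one-time startup costs.
  warp_copy_kernel<<<blocks, threads>>>(d_dst, d_src, vecs_per_warp);
  cudaEventRecord(start);
  warp_copy_kernel<<<blocks, threads>>>(d_dst, d_src, vecs_per_warp);
  cudaEventRecord(stop);
  cudaError_t err = cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  // Spot-check the first and last byte that landed in host memory.
  const unsigned char *bytes = (const unsigned char *)h_dst;
  bool ok = err == cudaSuccess && bytes[0] == 0xAB && bytes[total - 1] == 0xAB;

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFreeHost(h_dst);
  cudaFree(d_src);
  return (ok && ms > 0.0f) ? (double)total / (ms * 1e6) : -1.0;
}
```

For example, `run_pinned_copy(64, 256, 1 << 20, 0)` would copy 1 MiB per warp from GPU 0. A single timed launch is noisy, so a real benchmark would probably average over several iterations.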

NUMA stands for Non-Uniform Memory Access. DRAM is divided into banks (NUMA nodes), and on a machine with multiple GPUs each GPU sits closest to one of them. You should see significant performance differences depending on which NUMA node you select. A laptop will probably have only one NUMA node, but Delta has 4.
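One possible way to get pinned memory on a specific NUMA node is to allocate with libnuma and then pin the region with `cudaHostRegister`. This is only a sketch: the helper names `alloc_pinned_on_node` and `free_pinned_on_node` are made up for illustration, and it assumes libnuma is installed (link with `-lnuma`).

```cuda
#include <cuda_runtime.h>
#include <numa.h>  // libnuma; link with -lnuma
#include <cstddef>
#include <cstring>

// Hypothetical helper: allocate `bytes` on NUMA node `node`, pin the region,
// and store a device-visible alias in *dev_ptr.
// Returns the host pointer, or nullptr on failure.
void *alloc_pinned_on_node(size_t bytes, int node, void **dev_ptr) {
  if (numa_available() < 0) return nullptr;  // no NUMA support on this system
  if (node < 0 || node > numa_max_node()) return nullptr;

  void *host = numa_alloc_onnode(bytes, node);
  if (host == nullptr) return nullptr;

  // Touch every page so it is physically placed on `node` before pinning.
  memset(host, 0, bytes);

  if (cudaHostRegister(host, bytes, cudaHostRegisterMapped) != cudaSuccess) {
    numa_free(host, bytes);
    return nullptr;
  }
  if (cudaHostGetDevicePointer(dev_ptr, host, 0) != cudaSuccess) {
    cudaHostUnregister(host);
    numa_free(host, bytes);
    return nullptr;
  }
  return host;
}

// Undo alloc_pinned_on_node: unpin first, then release the NUMA allocation.
void free_pinned_on_node(void *host, size_t bytes) {
  cudaHostUnregister(host);
  numa_free(host, bytes);
}
```

The returned device pointer can then be passed to the copy kernel as its destination. Sweeping `node` over all NUMA nodes for a fixed GPU should expose the near/far difference described above.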

Latency / bandwidth of remote HBM, single node

Let's say we have two GPUs, both on the same node. What is the latency/bandwidth of communication here? I.e., how much data can I read from and write to the other GPU's HBM? Let's use nvshmem to stress this. The kernel should take as input: the blocks/threads for the kernel, the amount of data to memcpy per warp (group of 32 threads), and the source/destination NUMA nodes.
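A rough sketch of what the nvshmem version might look like, with each warp issuing one put into the peer GPU's HBM. The names `warp_put_kernel`, `peer_pe`, and `run_nvshmem_put` are placeholders. It assumes one PE per GPU on the node, picks the peer as the next PE in ring order rather than by NUMA node, and needs relocatable device code (`nvcc -rdc=true`) plus the launcher your NVSHMEM build supports.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cstddef>

// Each warp writes its own slice of the local buffer into the same offset of
// the peer's symmetric buffer. All 32 lanes must make the call together.
__global__ void warp_put_kernel(char *remote_dst, const char *local_src,
                                size_t bytes_per_warp, int peer) {
  size_t warp_id = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) / 32;
  size_t off = warp_id * bytes_per_warp;
  nvshmemx_putmem_warp(remote_dst + off, local_src + off, bytes_per_warp, peer);
  nvshmem_quiet();  // wait for outstanding puts to complete
}

// Assumption for this sketch: talk to the next PE in ring order.
int peer_pe(int my_pe, int n_pes) { return (my_pe + 1) % n_pes; }

// Hypothetical harness: returns GB/s seen by this PE, or a negative value.
// Assumes threads is a multiple of 32.
double run_nvshmem_put(int blocks, int threads, size_t bytes_per_warp) {
  nvshmem_init();
  int my_pe = nvshmem_my_pe();
  int n_pes = nvshmem_n_pes();
  // Bind this PE to the GPU matching its rank within the node.
  cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

  size_t total = (size_t)blocks * threads / 32 * bytes_per_warp;
  // Symmetric allocations: every PE gets buffers of the same size.
  char *src = (char *)nvshmem_malloc(total);
  char *dst = (char *)nvshmem_malloc(total);
  cudaMemset(src, my_pe + 1, total);
  cudaDeviceSynchronize();
  nvshmem_barrier_all();  // all PEs have initialized their buffers

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  warp_put_kernel<<<blocks, threads>>>(dst, src, bytes_per_warp,
                                       peer_pe(my_pe, n_pes));
  cudaEventRecord(stop);
  cudaError_t err = cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  nvshmem_barrier_all();  // all PEs have finished writing

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  nvshmem_free(src);
  nvshmem_free(dst);
  nvshmem_finalize();
  return (err == cudaSuccess && ms > 0.0f) ? (double)total / (ms * 1e6) : -1.0;
}
```

Reads from the other GPU could presumably follow the same shape with the matching get call in place of the put. Small per-warp sizes would give a latency figure, large ones a bandwidth figure.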

Latency / bandwidth of HBM + local pinned host memory at the same time

Let's say we have two GPUs, both on the same node. We run the previous two tests simultaneously. The PCIe bus may become saturated and cause contention between the two tests. The bandwidth each workload observes may be halved.

Latency / bandwidth of remote memory, two nodes

Let's say we have two GPUs, each on a different node. What is the latency/bandwidth of communication here? This will stress network performance bottlenecks. We should have parameters for the following: the blocks/threads for the kernel, the source GPU ID, the destination GPU ID, and the amount of data to transfer per warp.

Latency / bandwidth of remote pinned memory, two nodes

The DeltaAI machine has a special interconnect called NVLink. It allows remote regions of data stored in pinned host memory to be mapped as pointers and passed to the GPU, so a kernel can effectively write data to another node's pinned memory. I'm curious what the performance of this is. The kernel should take as input: the blocks/threads for the kernel, the source GPU ID and NUMA node, and the destination GPU ID and NUMA node.
