Let's build a benchmark that stresses the following cases. Do each of them one at a time, not all at once. You should be able to develop most of these benchmarks on your laptop, test that they work in simple cases, and then transfer the code to Delta. Let's put the benchmark code under context-transfer-engine/benchmarks/gpu_baseline for now.
Latency / bandwidth of local pinned host memory
Delta has multiple GPUs per node. How much performance can we get from writing data to pinned host memory? A GPU kernel that performs a strided memcpy to a pinned host memory region of configurable size would be great. The kernel should take as input: the blocks/threads for the kernel, the amount of data to memcpy per warp (group of 32 threads), the NUMA node to allocate pinned memory from, and the ID of the GPU to launch the copy kernel on.
NUMA stands for Non-Uniform Memory Access. DRAM is divided into banks (NUMA nodes), and on a multi-GPU machine each GPU sits closer to one of them than to the others. You should see significant performance differences from selecting different NUMA nodes. A laptop would probably have only one NUMA node, but Delta has 4.
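A minimal sketch of what this first benchmark could look like, assuming libnuma is available (link with -lnuma) and that "strided" means each warp owns one contiguous slice and its 32 lanes stride through it. The names warp_copy_kernel and CUDA_CHECK, the argument order, and the 16-byte copy granularity are all made up for illustration, not an existing API in the repo.

```cuda
#include <cuda_runtime.h>
#include <numa.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,                  \
              cudaGetErrorString(err_));                                  \
      exit(1);                                                            \
    }                                                                     \
  } while (0)

// Each warp owns one contiguous slice of vecs_per_warp 16-byte vectors.
// Lanes step through the slice 32 vectors at a time, so neighbouring lanes
// touch neighbouring addresses.
__global__ void warp_copy_kernel(uint4 *dst, const uint4 *src,
                                 size_t vecs_per_warp) {
  size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t base = (tid / 32) * vecs_per_warp;
  unsigned lane = threadIdx.x % 32;
  for (size_t i = lane; i < vecs_per_warp; i += 32) dst[base + i] = src[base + i];
}

int main(int argc, char **argv) {
  if (argc != 6) {
    fprintf(stderr, "usage: %s blocks threads bytes_per_warp numa_node gpu_id\n",
            argv[0]);
    return 1;
  }
  int blocks = atoi(argv[1]), threads = atoi(argv[2]);
  size_t bytes_per_warp = strtoull(argv[3], nullptr, 10);
  int numa_node = atoi(argv[4]), gpu_id = atoi(argv[5]);
  if (blocks <= 0 || threads <= 0 || threads % 32 != 0 ||
      bytes_per_warp == 0 || bytes_per_warp % sizeof(uint4) != 0) {
    fprintf(stderr, "threads must be a multiple of 32, bytes_per_warp of 16\n");
    return 1;
  }
  if (numa_available() < 0 || numa_node < 0 || numa_node > numa_max_node()) {
    fprintf(stderr, "NUMA node %d is not available\n", numa_node);
    return 1;
  }
  CUDA_CHECK(cudaSetDevice(gpu_id));

  size_t total = ((size_t)blocks * threads / 32) * bytes_per_warp;

  // Allocate on the requested NUMA node, touch the pages so they are
  // physically placed, then pin and map them for the GPU.
  void *host = numa_alloc_onnode(total, numa_node);
  if (!host) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }
  memset(host, 0, total);
  CUDA_CHECK(cudaHostRegister(host, total, cudaHostRegisterMapped));
  uint4 *host_dev = nullptr;
  CUDA_CHECK(cudaHostGetDevicePointer((void **)&host_dev, host, 0));

  uint4 *src = nullptr;
  CUDA_CHECK(cudaMalloc((void **)&src, total));
  CUDA_CHECK(cudaMemset(src, 0xAB, total));

  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  size_t vecs = bytes_per_warp / sizeof(uint4);

  warp_copy_kernel<<<blocks, threads>>>(host_dev, src, vecs);  // warm-up
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaEventRecord(start));
  warp_copy_kernel<<<blocks, threads>>>(host_dev, src, vecs);  // timed run
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));

  float ms = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
  const unsigned char *bytes = (const unsigned char *)host;
  bool ok = bytes[0] == 0xAB && bytes[total - 1] == 0xAB;
  printf("%zu bytes in %.3f ms (%.2f GB/s), check %s\n", total, ms,
         total / (ms * 1e6), ok ? "ok" : "FAILED");

  CUDA_CHECK(cudaHostUnregister(host));
  numa_free(host, total);
  CUDA_CHECK(cudaFree(src));
  return ok ? 0 : 1;
}
```

On a laptop this should run with numa_node 0 and gpu_id 0. With a very small bytes_per_warp and a single block the elapsed time is closer to a latency number than a bandwidth number, so sweeping that parameter gives both ends.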
Latency / bandwidth of remote HBM, single node
Let's say we have two GPUs, both on the same node. What is the latency/bandwidth of communication here? I.e., how much data can I read and write from the other GPU's HBM? Let's use NVSHMEM to stress this. The kernel should take as input: the blocks/threads for the kernel, the amount of data to memcpy per warp (group of 32 threads), and the source/destination NUMA nodes.
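A rough sketch of the NVSHMEM version, assuming one PE per GPU, at least two PEs in the job, and a build with relocatable device code (-rdc=true) linked against the local NVSHMEM install. PE 0 writes into PE 1's symmetric buffer with one warp-level put per warp. The kernel name warp_put_kernel and the argument order are made up for illustration, and the source/destination NUMA node parameters are left out here.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                  \
  do {                                                                    \
    cudaError_t err_ = (call);                                            \
    if (err_ != cudaSuccess) {                                            \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,                  \
              cudaGetErrorString(err_));                                  \
      exit(1);                                                            \
    }                                                                     \
  } while (0)

// Every lane of a warp calls the put with identical arguments; the warp
// cooperatively copies its slice into the peer's symmetric buffer.
__global__ void warp_put_kernel(char *dst, const char *src,
                                size_t bytes_per_warp, int peer) {
  size_t tid = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  size_t off = (tid / 32) * bytes_per_warp;
  nvshmemx_putmem_warp(dst + off, src + off, bytes_per_warp, peer);
}

int main(int argc, char **argv) {
  if (argc != 4) {
    fprintf(stderr, "usage: %s blocks threads bytes_per_warp\n", argv[0]);
    return 1;
  }
  int blocks = atoi(argv[1]), threads = atoi(argv[2]);
  size_t bytes_per_warp = strtoull(argv[3], nullptr, 10);
  if (blocks <= 0 || threads <= 0 || threads % 32 != 0 || bytes_per_warp == 0) {
    fprintf(stderr, "threads must be a positive multiple of 32\n");
    return 1;
  }

  nvshmem_init();
  int mype = nvshmem_my_pe();
  int npes = nvshmem_n_pes();
  // One GPU per PE, chosen by the PE's rank within its node.
  CUDA_CHECK(cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE)));
  if (npes < 2) {
    fprintf(stderr, "need at least 2 PEs\n");
    nvshmem_finalize();
    return 1;
  }

  size_t total = ((size_t)blocks * threads / 32) * bytes_per_warp;
  char *src = (char *)nvshmem_malloc(total);  // symmetric, collective call
  char *dst = (char *)nvshmem_malloc(total);
  CUDA_CHECK(cudaMemset(src, 1 + mype, total));  // PE 0 sends bytes of value 1
  CUDA_CHECK(cudaMemset(dst, 0, total));
  CUDA_CHECK(cudaDeviceSynchronize());
  nvshmem_barrier_all();

  if (mype == 0) {
    cudaStream_t stream = 0;
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    warp_put_kernel<<<blocks, threads, 0, stream>>>(dst, src, bytes_per_warp, 1);
    nvshmemx_quiet_on_stream(stream);  // warm-up, wait for remote completion
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(start, stream));
    warp_put_kernel<<<blocks, threads, 0, stream>>>(dst, src, bytes_per_warp, 1);
    nvshmemx_quiet_on_stream(stream);  // include completion in the timing
    CUDA_CHECK(cudaEventRecord(stop, stream));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("PE 0 -> PE 1: %zu bytes in %.3f ms (%.2f GB/s)\n", total, ms,
           total / (ms * 1e6));
  }
  nvshmem_barrier_all();

  if (mype == 1) {  // spot-check the last byte written by PE 0
    unsigned char last = 0;
    CUDA_CHECK(cudaMemcpy(&last, dst + total - 1, 1, cudaMemcpyDeviceToHost));
    printf("PE 1 check %s\n", last == 1 ? "ok" : "FAILED");
  }

  nvshmem_free(src);
  nvshmem_free(dst);
  nvshmem_finalize();
  return 0;
}
```

Reads from the other GPU's HBM could be measured the same way by swapping the put for nvshmemx_getmem_warp. Both buffers come from nvshmem_malloc so the sketch does not depend on whether a non-symmetric source is acceptable to the transport in use.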
Latency / bandwidth of HBM + local pinned host memory at the same time
Let's say we have two GPUs, both on the same node. We run the previous two tests simultaneously. The PCIe bus may get saturated and cause contention between the two tests. The observed speed of each workload may be halved.
Latency / bandwidth of remote memory, two nodes
Let's say we have two GPUs, each on a different node. What is the latency and bandwidth of communication here? This will stress network performance bottlenecks. We should have parameters for the following: the blocks/threads for the kernel, the source GPU ID, the destination GPU ID, and the amount of data to transfer per warp.
Latency / bandwidth of remote pinned memory, two nodes
DeltaAI has a special interconnect called NVLink. This allows remote regions of data stored in pinned host memory to be mapped as pointers and passed to the GPU, effectively writing data to another node's pinned memory. I'm curious how this performs. The kernel should take as input: the blocks/threads for the kernel, the source GPU ID and NUMA node, and the destination GPU ID and NUMA node.