Perf/nvl72 multigpu tpch sf1000 #885

Draft
felipeblazing wants to merge 5 commits into sirius-db:dev from felipeblazing:perf/nvl72-multigpu-tpch-sf1000

@felipeblazing (Collaborator)

No description provided.

felipeblazing and others added 5 commits June 5, 2026 04:24
cuDF worker-thread allocations were hitting raw cudaMalloc instead of cuCascade's
stream-ordered, reservation-aware pool. cudf::set_current_device_resource_ref had no
effect on the resource cuDF actually reads: cuDF 26.4 reads the legacy per-device
resource map, while the ref setter writes only the ref map. Install a per-device
forwarding resource via the legacy set_current_device_resource so that cuDF allocations
forward into cuCascade's pool, and restore the prior legacy resource on teardown.
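
The install-and-restore pattern described above can be sketched without any GPU libraries. This is a toy model only: `MemoryResource`, `CountingPool`, `ForwardingResource`, `device_resources()`, and `ScopedForwarding` are hypothetical names invented for illustration, host `malloc` stands in for device allocation, and none of this is the RMM, cuDF, or cuCascade API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

// Minimal stand-in for a device memory resource interface (NOT the RMM API).
struct MemoryResource {
    virtual ~MemoryResource() = default;
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void deallocate(void* ptr, std::size_t bytes) = 0;
};

// Stand-in for the pool that should serve every allocation; counts requests.
struct CountingPool : MemoryResource {
    std::size_t allocations = 0;
    void* allocate(std::size_t bytes) override { ++allocations; return std::malloc(bytes); }
    void deallocate(void* ptr, std::size_t) override { std::free(ptr); }
};

// Forwards every request to an upstream resource that it does not own.
struct ForwardingResource : MemoryResource {
    explicit ForwardingResource(MemoryResource* upstream) : upstream_(upstream) {
        assert(upstream_ != nullptr);
    }
    void* allocate(std::size_t bytes) override { return upstream_->allocate(bytes); }
    void deallocate(void* ptr, std::size_t bytes) override { upstream_->deallocate(ptr, bytes); }
private:
    MemoryResource* upstream_;
};

// Toy per-device registry standing in for the map that the library reads.
inline std::map<int, MemoryResource*>& device_resources() {
    static std::map<int, MemoryResource*> registry;
    return registry;
}

// Installs a forwarding resource for one device and restores the previous
// registry entry on destruction (the "teardown" step).
class ScopedForwarding {
public:
    ScopedForwarding(int device, MemoryResource* pool)
        : device_(device), forwarder_(pool), previous_(device_resources()[device]) {
        device_resources()[device_] = &forwarder_;
    }
    ~ScopedForwarding() { device_resources()[device_] = previous_; }
    ScopedForwarding(const ScopedForwarding&) = delete;
    ScopedForwarding& operator=(const ScopedForwarding&) = delete;
private:
    int device_;
    ForwardingResource forwarder_;
    MemoryResource* previous_;
};
```

While a `ScopedForwarding` guard is alive, anything that looks up the registry entry for that device allocates from the pool; once the guard is destroyed, the previous entry is back in place. The key point from the commit is that the override must be written to the same map the consumer reads.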

TPC-H SF1000 q9 (host-pin, 4x GB200): ~2x faster on top of the 5 GiB scan-batch config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Two changes to the multi-GPU hash-partition shuffle:

1. Coalesce the cross-GPU shuffle: bucket the N fine partitions by target GPU slot
   (p % num_gpus, matching task_creator routing) and emit a small number of large
   per-slot batches instead of N tiny ones. This collapses hundreds of thousands of tiny
   cudaMemcpyPeerAsync calls into ~num_gpus large transfers while preserving the
   fine-partition count (parallelism) and build/probe slot consistency.

2. View-fusion: hash_partition_sliced() returns the single reordered table plus
   zero-copy per-partition views; execute() concatenates straight from those views
   into one contiguous per-slot table, materializing exactly once instead of
   deep-copying every fine partition and then concatenating again. Views never cross
   the task boundary (only the materialized, downgradable per-slot tables do).
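
A minimal host-side sketch of the two ideas combined, assuming a table can be modeled as a single vector of rows. `SlicedPartitions` and `coalesce_by_slot` are hypothetical names for illustration, not the signatures used in this PR; the real code operates on cuDF tables and device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a table: one column of row values.
using Rows = std::vector<int>;

// Hypothetical result shape: rows reordered so each fine partition is
// contiguous, plus N+1 offsets. Partition p is the half-open range
// [offsets[p], offsets[p + 1]) of `reordered`, i.e. a zero-copy view.
struct SlicedPartitions {
    Rows reordered;
    std::vector<std::size_t> offsets;
};

// Coalesce N fine partitions into one contiguous batch per target GPU slot.
// Partition p is routed to slot p % num_gpus. Each batch is written once,
// directly from the views, with no intermediate per-partition copies.
std::vector<Rows> coalesce_by_slot(const SlicedPartitions& parts, std::size_t num_gpus) {
    assert(num_gpus > 0);
    assert(!parts.offsets.empty());
    const std::size_t num_partitions = parts.offsets.size() - 1;

    // First pass: compute each slot's total row count so it is allocated once.
    std::vector<std::size_t> slot_rows(num_gpus, 0);
    for (std::size_t p = 0; p < num_partitions; ++p) {
        slot_rows[p % num_gpus] += parts.offsets[p + 1] - parts.offsets[p];
    }

    std::vector<Rows> batches(num_gpus);
    for (std::size_t s = 0; s < num_gpus; ++s) {
        batches[s].reserve(slot_rows[s]);
    }

    // Second pass: append each partition's view to its slot's batch.
    for (std::size_t p = 0; p < num_partitions; ++p) {
        auto first = parts.reordered.begin() + static_cast<std::ptrdiff_t>(parts.offsets[p]);
        auto last = parts.reordered.begin() + static_cast<std::ptrdiff_t>(parts.offsets[p + 1]);
        Rows& batch = batches[p % num_gpus];
        batch.insert(batch.end(), first, last);
    }
    return batches;
}
```

With four fine partitions and two GPUs, partitions 0 and 2 land in slot 0 and partitions 1 and 3 in slot 1, so the number of cross-GPU transfers depends on `num_gpus` rather than on the fine-partition count.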

TPC-H SF1000 q9 (host-pin, 4x GB200): 16.7s -> ~5s combined.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Host-tier pinned scans routed every chunk to whatever device cudaGetDevice()
returned, funneling all scan tasks onto GPU 0 and OOMing heavy queries at any GPU
count. Precompute a per-chunk target GPU (round-robin within the chunk's NUMA node's
GPU set) and bind it explicitly, so host-pinned scan work is balanced across GPUs.
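
The per-chunk assignment above could look roughly like the following. This is a sketch under assumptions: `assign_chunk_gpus` and both of its parameters are hypothetical names, and the real code presumably derives the NUMA-to-GPU mapping from the system topology rather than taking it as an argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Precompute a target GPU for every chunk by cycling round-robin through the
// GPUs attached to that chunk's NUMA node.
//   chunk_numa_node[i]   : NUMA node that holds chunk i's pinned host memory
//   gpus_by_numa_node[n] : GPU ids local to NUMA node n
std::vector<int> assign_chunk_gpus(const std::vector<int>& chunk_numa_node,
                                   const std::vector<std::vector<int>>& gpus_by_numa_node) {
    // One round-robin cursor per NUMA node.
    std::vector<std::size_t> next(gpus_by_numa_node.size(), 0);

    std::vector<int> target;
    target.reserve(chunk_numa_node.size());
    for (int node : chunk_numa_node) {
        assert(node >= 0);
        const std::size_t n = static_cast<std::size_t>(node);
        assert(n < gpus_by_numa_node.size());
        const std::vector<int>& gpus = gpus_by_numa_node[n];
        assert(!gpus.empty());

        target.push_back(gpus[next[n] % gpus.size()]);
        ++next[n];
    }
    return target;
}
```

Because the target is computed up front, the scan task can bind to it explicitly instead of inheriting whichever device happens to be current on the calling thread.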

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

GPFS rejects io_uring registered-buffer reads (IORING_OP_READ_FIXED) on O_DIRECT fds
with -ENOMEM, breaking the Sirius datasource on GPFS mounts (e.g. /scratch). Use plain
io_uring_prep_read into the same pinned bounce buffer so SF1000 can be read off GPFS.

TEMPORARY: this unconditionally drops the registered-buffer fast path (costs
throughput on local/NVMe). See the TODO(Amin, William) for the proper fix (detect the
filesystem / fall back only on -ENOMEM).
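
One possible shape for the "fall back only on -ENOMEM" option mentioned in the TODO, sketched with the io_uring submission paths abstracted behind callables. `ReadFn` and `FallbackReader` are hypothetical names; the sketch assumes each callable returns bytes read or a negative errno, as io_uring does in `cqe->res`, and it is not the fix planned for this PR.

```cpp
#include <cerrno>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical reader signature: returns bytes read (>= 0) or a negative errno.
using ReadFn = std::function<long(void* buf, std::size_t len, long long offset)>;

// Tries the registered-buffer path first and permanently switches to the plain
// path the first time the registered-buffer path fails with -ENOMEM.
class FallbackReader {
public:
    FallbackReader(ReadFn fixed_read, ReadFn plain_read)
        : fixed_read_(std::move(fixed_read)), plain_read_(std::move(plain_read)) {}

    long read(void* buf, std::size_t len, long long offset) {
        if (use_fixed_) {
            const long res = fixed_read_(buf, len, offset);
            if (res != -ENOMEM) {
                return res;  // success or an unrelated error: report it as-is
            }
            use_fixed_ = false;  // remember the rejection; skip this path from now on
        }
        return plain_read_(buf, len, offset);
    }

    bool using_fixed() const { return use_fixed_; }

private:
    ReadFn fixed_read_;
    ReadFn plain_read_;
    bool use_fixed_ = true;
};
```

This keeps the fast path on filesystems that accept registered-buffer reads and pays for one failed attempt on those that reject them. Detecting the filesystem up front, the other option named in the TODO, would avoid even that one attempt.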

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reproducible TPC-H SF1000 benchmarking for Super Sirius on GB200 NVL72: per-query
isolated processes with host/gpu-tier column pinning across 1/2/4 GPUs, timeout and
status classification, CSV + summary output. Includes the tuned 1/2/4-GPU configs
(5 GiB scan batches, per-Grace host capacity), an nsys per-query profiler, and a
standalone NUMA-aware H2D bandwidth probe (h2d_multigpu_bw.cu).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@felipeblazing felipeblazing marked this pull request as draft June 5, 2026 04:26