Perf/nvl72 multigpu tpch sf1000 by felipeblazing · Pull Request #885 · sirius-db/sirius

felipeblazing · 2026-06-05T04:26:04Z

No description provided.

cuDF worker-thread allocations were hitting raw cudaMalloc instead of cuCascade's stream-ordered, reservation-aware pool. cudf::set_current_device_resource_ref had no effect on the resource cuDF actually reads: cuDF 26.4 reads the legacy per-device resource map, while the ref setter writes only the ref map. Install a per-device forwarding resource via the legacy set_current_device_resource so that cuDF allocations forward into cuCascade's pool, and restore the prior legacy resource on teardown. TPC-H SF1000 q9 (host-pin, 4x GB200): ~2x on top of the 5 GiB scan-batch config. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
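The install-and-restore pattern described in this commit can be sketched without any GPU library. The following is a toy model only: `memory_resource`, `pool_resource`, `forwarding_resource`, `legacy_resource_map`, `set_legacy_resource`, and `scoped_forwarding_install` are all hypothetical names invented for illustration, and none of them is the RMM, cuDF, or cuCascade API. The sketch assumes a per-device map of raw resource pointers and uses host `malloc` in place of device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Toy stand-in for a device memory resource; NOT the RMM/cuDF interface.
struct memory_resource {
  virtual ~memory_resource() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* ptr, std::size_t bytes) = 0;
};

// Stand-in for the pool; tracks bytes in use so forwarding is observable.
struct pool_resource final : memory_resource {
  std::size_t bytes_in_use = 0;
  void* allocate(std::size_t bytes) override {
    bytes_in_use += bytes;
    return std::malloc(bytes);
  }
  void deallocate(void* ptr, std::size_t bytes) override {
    bytes_in_use -= bytes;
    std::free(ptr);
  }
};

// Forwards every request to an upstream resource that it does not own.
struct forwarding_resource final : memory_resource {
  explicit forwarding_resource(memory_resource* upstream) : upstream_(upstream) {
    assert(upstream_ != nullptr);
  }
  void* allocate(std::size_t bytes) override { return upstream_->allocate(bytes); }
  void deallocate(void* ptr, std::size_t bytes) override {
    upstream_->deallocate(ptr, bytes);
  }

 private:
  memory_resource* upstream_;
};

// Hypothetical per-device "legacy" map: the one the consumer actually reads.
inline std::unordered_map<int, memory_resource*>& legacy_resource_map() {
  static std::unordered_map<int, memory_resource*> map;
  return map;
}

// Installs `mr` for `device` and returns whatever was installed before.
inline memory_resource* set_legacy_resource(int device, memory_resource* mr) {
  memory_resource*& slot = legacy_resource_map()[device];
  memory_resource* previous = slot;
  slot = mr;
  return previous;
}

// RAII guard: installs a forwarder on construction and restores the prior
// resource on destruction (the "teardown" step).
class scoped_forwarding_install {
 public:
  scoped_forwarding_install(int device, memory_resource* pool)
      : device_(device),
        forwarder_(pool),
        previous_(set_legacy_resource(device, &forwarder_)) {}
  ~scoped_forwarding_install() { set_legacy_resource(device_, previous_); }
  scoped_forwarding_install(const scoped_forwarding_install&) = delete;
  scoped_forwarding_install& operator=(const scoped_forwarding_install&) = delete;

 private:
  int device_;
  forwarding_resource forwarder_;
  memory_resource* previous_;
};
```

While a guard is alive, anything that looks up the device's entry in the map allocates through the pool; once the guard is destroyed, the map holds the previous entry again.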

Two changes to the multi-GPU hash-partition shuffle:

1. Coalesce the cross-GPU shuffle: bucket the N fine partitions by target GPU slot (p % num_gpus, matching task_creator routing) and emit a small number of large per-slot batches instead of N tiny ones. This collapses hundreds of thousands of tiny cudaMemcpyPeerAsync calls into roughly num_gpus large transfers while preserving the fine-partition count (parallelism) and build/probe slot consistency.
2. View-fusion: hash_partition_sliced() returns the single reordered table plus zero-copy per-partition views; execute() concatenates straight from those views into one contiguous per-slot table, materializing exactly once instead of deep-copying every fine partition and then concatenating again. Views never cross the task boundary (only the materialized, downgradable per-slot tables do).

TPC-H SF1000 q9 (host-pin, 4x GB200): 16.7s -> ~5s combined. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
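The slot bucketing in the first change can be illustrated with a small host-side sketch. It only shows the `p % num_gpus` grouping of fine-partition indices; the function name `bucket_partitions_by_slot` is hypothetical, and the actual table concatenation and peer copies are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: group fine-partition indices by target GPU slot using
// the routing rule quoted in the commit message (p % num_gpus). Each inner
// vector lists the fine partitions that would be concatenated into one large
// per-slot batch and shipped with a single transfer.
std::vector<std::vector<std::size_t>> bucket_partitions_by_slot(
    std::size_t num_partitions, std::size_t num_gpus) {
  assert(num_gpus > 0);
  std::vector<std::vector<std::size_t>> slots(num_gpus);
  for (std::size_t p = 0; p < num_partitions; ++p) {
    slots[p % num_gpus].push_back(p);
  }
  return slots;
}
```

With 10 fine partitions and 4 GPUs, slot 0 receives partitions 0, 4, and 8, so the number of outgoing batches is bounded by the GPU count rather than by the partition count.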

Host-tier pinned scans routed every chunk to whatever device cudaGetDevice() returned, funneling all scan tasks onto GPU 0 and OOMing heavy queries at any GPU count. Precompute a per-chunk target GPU (round-robin within the chunk's NUMA node's GPU set) and bind it explicitly, so host-pinned scan work is balanced across GPUs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
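A minimal sketch of the per-chunk assignment, assuming each chunk already knows its NUMA node and each node has a known list of attached GPU ids. The function name `assign_chunk_gpus` and both input layouts are illustrative assumptions, not the PR's actual data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: precompute a target GPU for every chunk by cycling
// round-robin through the GPUs attached to that chunk's NUMA node.
//   chunk_numa_node[i]    : NUMA node that holds chunk i's pinned memory
//   gpus_per_numa_node[n] : GPU ids attached to NUMA node n (non-empty)
std::vector<int> assign_chunk_gpus(
    const std::vector<int>& chunk_numa_node,
    const std::vector<std::vector<int>>& gpus_per_numa_node) {
  // One independent round-robin cursor per NUMA node.
  std::vector<std::size_t> cursor(gpus_per_numa_node.size(), 0);
  std::vector<int> target_gpu;
  target_gpu.reserve(chunk_numa_node.size());
  for (int node : chunk_numa_node) {
    assert(node >= 0 &&
           static_cast<std::size_t>(node) < gpus_per_numa_node.size());
    const std::vector<int>& gpus = gpus_per_numa_node[node];
    assert(!gpus.empty());
    target_gpu.push_back(gpus[cursor[node] % gpus.size()]);
    ++cursor[node];
  }
  return target_gpu;
}
```

Because the result is computed up front, a scan task can bind its chunk to the recorded GPU explicitly instead of depending on whichever device happens to be current on the calling thread.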

GPFS rejects io_uring registered-buffer reads (IORING_OP_READ_FIXED) on O_DIRECT fds with -ENOMEM, breaking the Sirius datasource on GPFS mounts (e.g. /scratch). Use plain io_uring_prep_read into the same pinned bounce buffer so SF1000 can be read off GPFS. TEMPORARY: this unconditionally drops the registered-buffer fast path (costs throughput on local/NVMe). See the TODO(Amin, William) for the proper fix (detect the filesystem / fall back only on -ENOMEM). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
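The "fall back only on -ENOMEM" option mentioned in the TODO could look roughly like the sketch below. It is not what this commit does (the commit drops the fast path unconditionally), and it does not call liburing: the two read paths are passed in as callables so that only the decision logic is shown. `read_policy` and `read_with_fallback` are hypothetical names; results follow the io_uring completion convention of bytes read on success and a negative errno on failure.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Result convention: bytes read on success, negative errno on failure.
using read_fn = std::function<long()>;

// Hypothetical sticky state, e.g. one instance per file or per mount.
struct read_policy {
  bool fixed_disabled = false;
};

// Try the registered-buffer read first. If it fails with -ENOMEM, remember
// that and use the plain read from then on. Any other result, including
// other errors, is returned to the caller unchanged.
inline long read_with_fallback(read_policy& policy,
                               const read_fn& read_fixed,
                               const read_fn& read_plain) {
  assert(read_fixed && read_plain);
  if (!policy.fixed_disabled) {
    const long res = read_fixed();
    if (res != -ENOMEM) {
      return res;
    }
    policy.fixed_disabled = true;  // stop retrying the rejected path
  }
  return read_plain();
}
```

Under this scheme, local and NVMe files would keep the registered-buffer path, while a filesystem that rejects it pays for one failed attempt before switching to plain reads.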

Reproducible TPC-H SF1000 benchmarking for Super Sirius on GB200 NVL72: per-query isolated processes with host/gpu-tier column pinning across 1/2/4 GPUs, timeout and status classification, CSV + summary output. Includes the tuned 1/2/4-GPU configs (5 GiB scan batches, per-Grace host capacity), an nsys per-query profiler, and a standalone NUMA-aware H2D bandwidth probe (h2d_multigpu_bw.cu). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

felipeblazing and others added 5 commits June 5, 2026 04:24

felipeblazing marked this pull request as draft June 5, 2026 04:26

felipeblazing wants to merge 5 commits into sirius-db:dev from felipeblazing:perf/nvl72-multigpu-tpch-sf1000
