Perf/nvl72 multigpu tpch sf1000 #885
Draft
felipeblazing wants to merge 5 commits into
cuDF worker-thread allocations were hitting raw cudaMalloc instead of cuCascade's stream-ordered, reservation-aware pool: cudf::set_current_device_resource_ref was a no-op for what cuDF reads (cuDF 26.4 reads the legacy per-device resource map, while the ref setter only writes the ref map).

Install a per-device forwarding resource via the legacy set_current_device_resource so cuDF allocations forward into cuCascade's pool; restore the prior legacy resource on teardown.

TPC-H SF1000 q9 (host-pin, 4x GB200): ~2x on top of the 5 GiB scan-batch config.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
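The install-then-restore pattern described above can be illustrated with a minimal sketch. It uses the C++17 standard library's `std::pmr` as a stand-in and is not the RMM or cuDF API; `forwarding_resource` and `scoped_default_resource` are hypothetical names chosen for illustration. The idea is the same: a thin resource forwards every allocation to an upstream pool, and a guard puts back whatever resource was installed before.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Hypothetical stand-in: forwards every allocation to an upstream pool and
// counts the calls so the routing can be observed.
class forwarding_resource final : public std::pmr::memory_resource {
 public:
  explicit forwarding_resource(std::pmr::memory_resource* upstream)
      : upstream_(upstream) {
    assert(upstream_ != nullptr);
  }
  std::size_t allocations() const { return allocations_; }

 private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    ++allocations_;
    return upstream_->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    upstream_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::size_t allocations_ = 0;
};

// RAII guard: installs a resource as the process default on construction and
// restores the previously installed one on destruction (the "teardown" step).
class scoped_default_resource {
 public:
  explicit scoped_default_resource(std::pmr::memory_resource* r)
      : previous_(std::pmr::set_default_resource(r)) {}
  ~scoped_default_resource() { std::pmr::set_default_resource(previous_); }
  scoped_default_resource(const scoped_default_resource&) = delete;
  scoped_default_resource& operator=(const scoped_default_resource&) = delete;

 private:
  std::pmr::memory_resource* previous_;
};
```

Any code that allocates through the default resource while the guard is alive lands in the upstream pool. Once the guard goes out of scope, the earlier default is back in place, so nothing else in the process is left pointing at a destroyed pool.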
Two changes to the multi-GPU hash-partition shuffle:

1. Coalesce the cross-GPU shuffle: bucket the N fine partitions by target GPU slot (p % num_gpus, matching task_creator routing) and emit a small number of large per-slot batches instead of N tiny ones. Collapses ~hundreds of thousands of tiny cudaMemcpyPeerAsync calls into ~num_gpus large transfers while preserving fine-partition count (parallelism) and build/probe slot consistency.

2. View-fusion: hash_partition_sliced() returns the single reordered table plus zero-copy per-partition views; execute() concatenates straight from those views into one contiguous per-slot table, materializing exactly once instead of deep-copying every fine partition and then concatenating again. Views never cross the task boundary (only the materialized, downgradable per-slot tables do).

TPC-H SF1000 q9 (host-pin, 4x GB200): 16.7s -> ~5s combined.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
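The slot bucketing in change 1 can be sketched in plain C++ as below. This is only an illustration of the `p % num_gpus` grouping, not the code in this PR: `slot_batch` and `coalesce_by_slot` are hypothetical names, and row counts stand in for the actual table data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One coalesced batch per GPU slot: the fine-partition ids it carries and the
// total number of rows that would be shipped in a single transfer.
struct slot_batch {
  std::vector<std::size_t> fine_partitions;
  std::size_t total_rows = 0;
};

// Groups fine partition p into slot p % num_gpus. The fine partitions keep
// their identity inside each batch, so downstream parallelism is unchanged.
std::vector<slot_batch> coalesce_by_slot(
    const std::vector<std::size_t>& rows_per_partition, std::size_t num_gpus) {
  assert(num_gpus > 0);
  std::vector<slot_batch> slots(num_gpus);
  for (std::size_t p = 0; p < rows_per_partition.size(); ++p) {
    slot_batch& s = slots[p % num_gpus];
    s.fine_partitions.push_back(p);
    s.total_rows += rows_per_partition[p];
  }
  return slots;
}
```

With six fine partitions and four GPUs, partitions 0 and 4 share slot 0 and partitions 1 and 5 share slot 1, so four transfers replace six. Because build and probe sides use the same modulo rule, matching keys still land on the same slot.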
Host-tier pinned scans routed every chunk to whatever device cudaGetDevice() returned, funneling all scan tasks onto GPU 0 and OOMing heavy queries at any GPU count. Precompute a per-chunk target GPU (round-robin within the chunk's NUMA node's GPU set) and bind it explicitly, so host-pinned scan work is balanced across GPUs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
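A minimal sketch of the per-chunk assignment, assuming each chunk's NUMA node and each node's local GPU ids are already known. `assign_chunk_gpus` and its parameters are hypothetical names for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// chunk_numa_node[i] is the NUMA node that holds chunk i.
// gpus_per_node[n] lists the GPU ids considered local to NUMA node n.
// Returns one target GPU per chunk, cycling through the node's GPU set.
std::vector<int> assign_chunk_gpus(
    const std::vector<std::size_t>& chunk_numa_node,
    const std::vector<std::vector<int>>& gpus_per_node) {
  std::vector<std::size_t> next(gpus_per_node.size(), 0);  // per-node cursor
  std::vector<int> target;
  target.reserve(chunk_numa_node.size());
  for (std::size_t node : chunk_numa_node) {
    assert(node < gpus_per_node.size());
    const std::vector<int>& gpus = gpus_per_node[node];
    assert(!gpus.empty());
    target.push_back(gpus[next[node] % gpus.size()]);
    ++next[node];
  }
  return target;
}
```

Computing the targets up front means the scan task binds to an explicit device id and no longer depends on whichever device happens to be current on the calling thread.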
GPFS rejects io_uring registered-buffer reads (IORING_OP_READ_FIXED) on O_DIRECT fds with -ENOMEM, breaking the Sirius datasource on GPFS mounts (e.g. /scratch). Use plain io_uring_prep_read into the same pinned bounce buffer so SF1000 can be read off GPFS.

TEMPORARY: this unconditionally drops the registered-buffer fast path (costs throughput on local/NVMe). See the TODO(Amin, William) for the proper fix (detect the filesystem / fall back only on -ENOMEM).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
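One possible shape for the "fall back only on -ENOMEM" option mentioned in the TODO is sketched below. It is not what this commit does, and it does not call liburing: the two read paths are passed in as callables so the control flow can be shown in isolation. `fallback_reader` and `read_fn` are hypothetical names; the only assumption borrowed from io_uring is that a completion reports a non-negative byte count on success and a negative errno on failure.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <utility>

// Result convention: >= 0 is bytes read, < 0 is a negative errno value.
using read_fn = std::function<long()>;

class fallback_reader {
 public:
  fallback_reader(read_fn fixed, read_fn plain)
      : fixed_(std::move(fixed)), plain_(std::move(plain)) {
    assert(fixed_ && plain_);
  }

  long read() {
    if (use_fixed_) {
      long res = fixed_();
      if (res != -ENOMEM) return res;  // success or an unrelated error
      use_fixed_ = false;  // remember the rejection, skip this path from now on
    }
    return plain_();
  }

  bool using_fixed() const { return use_fixed_; }

 private:
  read_fn fixed_;
  read_fn plain_;
  bool use_fixed_ = true;
};
```

In this sketch the fast path is kept for files where registered-buffer reads work, and the first -ENOMEM costs one failed submission before the reader switches permanently to the plain path. Whether a per-file or per-filesystem switch is the right granularity is left open here.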
Reproducible TPC-H SF1000 benchmarking for Super Sirius on GB200 NVL72: per-query isolated processes with host/gpu-tier column pinning across 1/2/4 GPUs, timeout and status classification, CSV + summary output. Includes the tuned 1/2/4-GPU configs (5 GiB scan batches, per-Grace host capacity), an nsys per-query profiler, and a standalone NUMA-aware H2D bandwidth probe (h2d_multigpu_bw.cu). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No description provided.