Make GPU-native DuckDB scan the default#893
Draft
kevkrist wants to merge 3 commits into
Draft
Conversation
Flip enable_gpu_duckdb_native_scan to true so DuckDB seq_scan routes to the
GPU-native scan operator by default (no flag/SET needed), and point the perf
rigging at it.
Engine:
- src/include/sirius_config.hpp: default enable_gpu_duckdb_native_scan = true
- src/sirius_extension.cpp: update the SET-option description (default on)
- src/pipeline/sirius_pipeline_converter.cpp: only route seq_scan to native when
bind_data is a TableScanBindData; otherwise fall through to the CPU scan
- src/sirius_context.cpp: stop force-enabling use_sirius_datasource when native is
on (the io_uring/O_DIRECT datasource is now opt-in; native reads via DuckDB
BufferManager by default, which works everywhere incl. hosts without io_uring)
- src/scan_manager/duckdb_native_split_provider.cpp: skip the uring io_object for
":memory:" databases and catch create_io_object failures -> BufferManager fallback
- src/op/scan/duckdb_native_metadata.cpp: mark tables with transient/in-memory
segments (non-CONSTANT block_id<0) non-viable
Rigging / docs:
- test/tpch_performance/{run_tpch_duckdb.sh,benchmark_and_validate.sh,CLAUDE.md}:
the duckdb data source now uses GPU-native scan by default; --gpu-native-scan and
--data-source duckdb-native are redundant aliases kept for compatibility
- test/cpp/integration/integration-gb10.yaml: config tuned for the GB10 unified-memory box
- docs/super-sirius/native-duckdb-scan-perf-findings.md: SF30 native-vs-parquet analysis
KNOWN WIP (this is a checkpoint, not green):
- In-memory (":memory:") / transient-segment tables are correctly detected as
non-viable, but the scan-manager's non-viable throw happens during execution and
does NOT trigger the transparent CPU fallback, so it surfaces as a hard error.
The [transparent] integration tests (which use :memory: tables) currently fail.
Next step: move the viability check to plan-conversion time in the converter so
non-viable tables fall through to the CPU scan before execution commits.
- Native-scan component tests ([duckdb_native_walker], [duckdb_native_split_provider])
pass; SF30 file-backed benchmarks validate 22/22 against DuckDB CPU.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… tests
Make insert_duckdb_native_scan_operator a try-insert that returns false (caller
falls through to the CPU duckdb scan) instead of throwing, and decide viability at
plan-conversion time so non-viable tables never hard-fail mid-execution (the
scan-manager's runtime non-viable throw does NOT trigger transparent fallback).
Converter (src/pipeline/sirius_pipeline_converter.cpp + header):
- insert_duckdb_native_scan_operator now returns bool; split_table_scan_source falls
through to the CPU scan when it returns false.
- Fall back to CPU for: non-base-table bind data, null client_context, in-memory
databases (StorageManager::InMemory() — transaction-free, so it also covers
plan generation paths without an active transaction, e.g. plan printing/EXPLAIN),
metadata-walk failure (wrapped in try/catch), walker-rejected tables (unsupported
types/codecs), and tables with unreadable transient segments (non-CONSTANT
block_id<0, e.g. uncheckpointed data).
Walker (src/op/scan/duckdb_native_metadata.cpp):
- Revert the block_id<0 rejection added in the checkpoint commit. The walker is a
pure metadata describer (its unit tests build in-memory fixtures and expect
viable=true); the "is this readable" decision belongs in the converter, which now
scans the walked descriptors for non-CONSTANT block_id<0 segments.
Verified on the GB10 box:
- Full unit suite: 1555/1556 pass. The one failure ("Task scheduler dispatches tasks
with device preference") is a pre-existing 2-GPU test on this 1-GPU host
("Requested number of GPUs exceeds available GPUs"), unrelated to this change.
- [transparent], [plan_printer], [duckdb_native_walker], [duckdb_native_split_provider]
all green.
- SF30 benchmark_and_validate: --data-source duckdb (native by default, no flag) and
--data-source parquet both validate 22/22 vs DuckDB CPU; sirius uses
GPU_DUCKDB_NATIVE_SCAN and GPU_PARQUET_SCAN respectively.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ents Addresses findings from a review of the native-scan-default change. Perf (high): the plan-time viability gate added in this PR walked the DuckDB segment metadata (DataTable::GetColumnSegmentInfo over every segment — the dominant cold-scan cost) and discarded the result, then the split provider re-walked the identical metadata at execution. Cache the converter's validated walk on duckdb_native_scan_info::prewalked_metadata; the split provider moves it out (zero-copy) and only walks when absent (direct construction / unit tests). The ctor now seals the const _scan_info in the body, after the cached metadata is moved out of the by-value param, to avoid a deep copy. Docs/comments: - sirius_context.cpp: the comment claimed the native scan reads via DuckDB's BufferManager by default. It does not — gpu_ioctxs_ is built unconditionally, the scan gets one, and the decoder reads via host_read by default; BufferManager is only the per-block fallback when the io_object can't be opened. Also drop the false claim that use_sirius_datasource gates the native path (it is read in no live code path). - test/tpch_performance/CLAUDE.md: fix stale line ref (:384 catch block -> :451 routing gate), name the function to resist drift. - run_tpch_duckdb.sh: --gpu-native-scan and SIRIUS_NATIVE_SCAN_VERIFY are independent; stop implying the flag is needed for verify. Note: deliberately did NOT fold the block_id<0 readability check into the walker (a suggested but rejected option) — the walker is a pure metadata describer whose unit tests expect viable=true on in-memory fixtures; folding it would re-break them. Verified: builds clean in pixi; [duckdb_native_split_provider], [duckdb_native_walker], [plan_printer], [transparent] unit suites all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Turns on the GPU-native duckdb scan by default so seq_scan over duckdb tables runs on the GPU without a flag. Falls back to the CPU scan for in-memory tables and anything the native path can't read, so nothing should break. Also points the TPC-H rigging at it and drops a short SF30 perf writeup in docs/. Draft for now to get some eyes on it.