Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK#806
Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK#806ran-yuan-rui wants to merge 10 commits into
Conversation
Add tests for GPU window ranking (row_number / rank / dense_rank): - test/cpp/operator/test_physical_window.cpp: plan-translation coverage — accepts ranking expressions sharing PARTITION BY / ORDER BY (with and without partitioning); rejects unsupported shapes (aggregate-over, frame EXCLUDE, heterogeneous partitions, non-column keys, HUGEINT columns, rank/dense_rank without ORDER BY) with NotImplementedException. - test/sql/window/ranking.test: ranking semantics via gpu_execution (ASC/DESC, NULLS FIRST/LAST, multi-key, ties, NULL partitions). - test/cpp/integration/test_gpu_execution_tpcds.cpp: GPU-vs-CPU equality for row_number / rank / dense_rank over store_sales and customer. - test/cpp/integration/test_tpcds_plan_translation.cpp: assert a TPC-DS ranking query translates to a Sirius physical plan. - test/cpp/integration/data/duckdb/: committed TPC-DS sf=0.01 fixture (tpcds.duckdb) plus its generator script. Build: - CMakeLists.txt: compile the new tests, and the previously unbuilt test_tpcds_plan_translation.cpp, into sirius_unittest. - .pre-commit-config.yaml: allow committed duckdb fixtures under test/cpp/integration/data/duckdb/ past the large-file check.
Implement sirius_physical_window, a partition-consumer operator that evaluates ROW_NUMBER, RANK, and DENSE_RANK over an optional PARTITION BY and ORDER BY shared by the expressions of one LogicalWindow. The pipeline converter hash-partitions the input by the PARTITION BY columns (a single global partition when absent) and feeds whole partitions to the operator, which concatenates them, stably sorts by (PARTITION BY, ORDER BY), and computes ranks with a cuDF grouped rank scan. Output is the child columns followed by one BIGINT rank column per window expression. - Enable the LOGICAL_WINDOW dispatch in the Sirius physical plan generator. create_plan(LogicalWindow) guards unsupported shapes (aggregate-over, FILTER, EXCLUDE, non-column keys, mixed framings, HUGEINT child columns, RANK/DENSE_RANK without ORDER BY) with NotImplementedException so such queries fall back to DuckDB. - sirius_physical_partition hash-partitions on the window's PARTITION BY columns for the new WINDOW consumer. - Honor ORDER BY ASC/DESC and NULLS FIRST/LAST; the cuDF null order is flipped for descending keys to match SQL semantics. - Null-check the SiriusContext in comparison-join planning so a missing context raises a clear error instead of a raw null dereference. Add window operator tests, TPC-DS Q44 ranking execution and plan-translation coverage, and SQL ranking-semantics tests.
COALESCE was unimplemented in the column expression executor (it threw NotImplementedException), forcing any query using it off the GPU. Implement it by chaining cudf::replace_nulls — start from the first argument and fill the remaining NULLs from each subsequent argument, yielding the first non-NULL per row. Add a GPU-vs-CPU test covering multi-arg COALESCE with different NULL distributions and types (int / bigint / double).
cuDF applies null_order in the ascending frame and reverses it for a descending column, so the naive NULLS_FIRST?BEFORE:AFTER mapping mis-placed NULLs for every DESC sort key (e.g. ORDER BY x DESC NULLS LAST put NULLs first, disagreeing with DuckDB CPU). The mapping was duplicated across the order, sort-sample, sort-partition, merge-sort and top-n operators. Centralize the contract in op/cudf_sort_order.hpp (to_cudf_order / to_cudf_null_order, which flips BEFORE<->AFTER for descending keys) and route every sort/top-n/window path through it. Also fix single-key top-n with a nullable key: cudf::top_k_order takes no null_order and cannot honor NULLS FIRST/LAST when selecting the top k, so a nullable key now falls back to a full sort that honors null placement. Add test/cpp/integration/test_gpu_execution_order_nulls.cpp covering the four ASC/DESC x NULLS FIRST/LAST quadrants across distributed ORDER BY and single/multi-key TOP-N.
The window GPU semantic tests only exercised INTEGER partition/order keys. The guard imposes no type restriction (only HUGEINT and non-column keys are rejected), so STRING/DATE/TIMESTAMP/DOUBLE keys already run through the same stable-sort + grouped rank scan path. Add a default-run case covering them so that path — including the DESC NULLS FIRST/LAST handling and the rank/dense_rank tie-detection value struct built from non-integer order columns — is verified GPU == DuckDB CPU. Shapes: VARCHAR partition + DOUBLE DESC NULLS LAST (row_number); INTEGER partition + DATE DESC NULLS LAST (rank/dense_rank); global VARCHAR DESC NULLS LAST + TIMESTAMP ASC NULLS FIRST (rank); VARCHAR partition + TIMESTAMP ASC NULLS FIRST (row_number).
The Q67 translation case still said GPU execution was deferred "until COALESCE is supported", but COALESCE now runs on GPU in this branch. Reword to: Q67 is translation-only here; end-to-end GPU execution is deferred until the remaining non-window TPC-DS gaps (e.g. the ROLLUP projection path) are covered.
Resolve conflicts from the upstream refactors (wire_data_repositories → descriptors, batch API, sirius exceptions, expression executor rework): - CMakeLists / TEST_SOURCES: union of both sides' new sources/tests. - compute_repository_wiring: adopt upstream's emit/descriptor form and add WINDOW to the FULL-barrier sink list (PARTITION→WINDOW is wired by the existing PARTITION branch). - sort_sample: keep upstream's restructured boundary computation, re-apply the shared to_cudf_order/to_cudf_null_order helper for DESC NULLS correctness. - gpu_execute_operator: take upstream's COALESCE (it now implements the same replace_nulls chaining), dropping the now-redundant local COALESCE branch. - plan_comparison_join: keep the sirius_state null-check hardening.
Adapt the window ranking code to the APIs changed by the merged upstream dev: - types parameter: duckdb::LogicalType -> sirius::logical_type (base ctor); create_plan now wraps op.types via sirius::from_duckdb_vec. - source_order(): duckdb::OrderPreservationType -> sirius::OrderPreservationType (operator + plan test). - get_next_task_input_data: repo->pop_data_batch(state, idx) -> pop_next_data_batch(idx) (batch_state::task_created removed). - execute(): use read-only batches (data_batch::get_memory_space is now private); make_data_batch now requires the writer stream. - split_window_sink: engine_ -> build_ctx_, pipeline_breakers_ -> inserted_operators_ (converter member renames). - physical_plan_generator.hpp: add LogicalWindow forward declaration. Compiles clean; single-partition window/order/nulls cases pass. Multi-partition cases under g_integration_env still under investigation (works via CLI).
After the DuckDB bump in the merge, current_setting('scan_task_batch_size')
(and hash_partition_bytes / max_sort_partition_bytes) returns an empty result
set rather than the value, so the runtime-setting RAII guards and the benchmark
config note threw "Attempted to access index 0 within vector of size 0" when
reading the current value. Read these extension options via
`SELECT value FROM duckdb_settings() WHERE name = '...'`, which returns the
value. No operator changes — the window/sort logic itself was unaffected
(verified GPU == CPU via the CLI).
… example The "transparent execution: fallback for unsupported (window)" case used ROW_NUMBER() OVER (...) as its "unsupported operator" example, but this PR makes ranking windows (ROW_NUMBER/RANK/DENSE_RANK) GPU-supported, so that query no longer falls back and the rebind accounting changed. Switch the example to an aggregate-over-window (SUM(...) OVER (...)), which Phase 1 still does not support, so the test keeps exercising the silent CPU fallback path.
|
@ran-yuan-rui Since this PR introduces a bug fix and a 12 MB TPC-DS fixture , could you split those two off into a separate PR, that way we get that merged sooner and the larger piece of work involved here for the WIndow function support does not delay it. |
| // Phase 1 supports only ROW_NUMBER / RANK / DENSE_RANK, sharing one PARTITION BY / ORDER BY, with | ||
| // column-reference keys, no FILTER / EXCLUDE, and no HUGEINT child columns. Anything else throws | ||
| // NotImplementedException so the whole query falls back to DuckDB CPU. | ||
| void guard_phase1(duckdb::LogicalWindow& op) |
There was a problem hiding this comment.
i am assuming that we will only support one window in the query for now, correct? Like, you can have multiple aggregators in the same window, but if you have different windows, then the partitioning and ordering would need to be different, and that should be outside the scope of phase 1
Thx - will do |
## Summary Split out from sirius-db#806 per @wmalpica review comment. This PR contains two independent pieces: 1. Fix GPU `ORDER BY` / `TOP-N` correctness for `DESC ... NULLS FIRST/LAST`. 2. Add a reusable SF0.01 TPC-DS DuckDB fixture for integration / plan-translation tests. ## RCA: DESC NULL ordering bug DuckDB SQL treats `NULLS FIRST` / `NULLS LAST` as the final, absolute placement of NULLs in the result: - `DESC NULLS FIRST` => NULLs first, then values descending - `DESC NULLS LAST` => values descending, then NULLs last Sirius previously mapped SQL null ordering directly to cuDF: ```cpp NULLS_FIRST -> cudf::null_order::BEFORE NULLS_LAST -> cudf::null_order::AFTER ``` That is correct for `ASC`, but wrong for `DESC`: cuDF applies `null_order` in its sort-direction frame, so DESC effectively reverses the NULL placement. As a result, GPU `ORDER BY k DESC NULLS LAST` could place NULLs first, disagreeing with DuckDB CPU. The same naive mapping was duplicated across the distributed sort path: - `ORDER_BY` - `SORT_SAMPLE` - `SORT_PARTITION` - `MERGE_SORT` - `TOP_N` There was one additional TOP-N issue: `cudf::top_k_order` has no `null_order` parameter, so single-key TOP-N on a nullable key could select the wrong top-k rows before any final sorting step. ## Fix - Added `src/include/op/cudf_sort_order.hpp` as the single source of truth: - `to_cudf_order(...)` - `to_cudf_null_order(...)` - `to_cudf_null_order(...)` flips `BEFORE` / `AFTER` for descending keys so SQL `NULLS FIRST/LAST` matches DuckDB semantics. - Routed all sort/top-n operators through the helper. - For single-key TOP-N: - keep `cudf::top_k_order` fast path when the key has no NULLs - use `sort_by_key + slice` when the key is nullable, so NULL placement is honored ## TPC-DS fixture Update (reworked per @mbrobbel's review): Dropped the committed 12 MB tpcds.duckdb binary. The fixture is now generated at build time via an opt-in CMake target tpcds-fixture (runs generate_tpcds_duckdb.sh into ${CMAKE_BINARY_DIR}/test-fixtures/, SF 0.01 via DuckDB's dsdgen). Run cmake --build . --target tpcds-fixture once before the TPC-DS integration tests. ## Tests Added `test_gpu_execution_order_nulls.cpp`: - `ORDER BY` - `TOP-N single-key` - `TOP-N multi-key` - `ASC/DESC × NULLS FIRST/LAST` The test materializes GPU output once, adds `observed_position` to make ordering differences visible, and compares against DuckDB CPU with bidirectional `EXCEPT ALL`. Validation: ```bash pre-commit run --all-files pixi run -e default make release build/release/extension/sirius/test/cpp/sirius_unittest "[nulls]" ```
|
Marking as draft (there is a conflict). |
Talked with @wmalpica and he point me that this uses hash partitioning + in-window cuDF sort/rank for single-GPU in-core ranking. It is not yet the sort-partition architecture proposed for scalable/out-of-core windows.
So I will redo the PR.
Summary
Adds GPU support for the ranking window functions
ROW_NUMBER/RANK/DENSE_RANKto Sirius, so supported ranking-window shapes run on GPU viagpu_executioninstead of forcing the whole query back to DuckDB CPU (out-of-scope shapes go through the existing fallback/error path — see Design).This is Phase 1 of a staged roadmap toward broader window support (see below) — the planned phases cover the main useful window families, not full DuckDB/SQL window parity. It also includes an engine-wide DESC NULLS FIRST/LAST correctness fix in sort/top-n that the window pre-sort depends on. The logical areas are split across commits and can be reviewed independently.
Design
Operator. New
sirius_physical_window, a partition-consumer (same base as hash-join build / aggregate-merge — both a sink and a source). Ranking is not decomposable, so it consumes already hash-partitioned data rather than streaming batch-by-batch.Pipeline shape — reuses the existing partition framework, no new machinery:
sirius_physical_partition(HASH by thePARTITION BYcolumns; one global partition when there's noPARTITION BY).PARTITION → WINDOWis a FULL barrier; the window operator drains one whole partition per task. This is HASH_GROUP_BY'sPARTITION → MERGE_GROUP_BYshape minus the local partial stage, and noCONCAT(that's join-only).Compute is pure cuDF. Per partition: stable-sort by
(PARTITION BY, ORDER BY dirs)→ grouped rank scan viacudf::groupby(…, sorted::YES).scan(make_rank_aggregation(...))(FIRST=row_number,MIN=rank,DENSE=dense_rank). Under the presorted (sorted::YES) path the rank order comes entirely from the pre-sort, so mixed ASC/DESC, multi-column keys, and all three methods are a single code path.Why this matters: we reuse DuckDB's bound expressions + the existing partition/task framework and do not port DuckDB's CPU window machinery. As a result, the design covers multi-batch / multi-partition / NULL groups / ASC·DESC·NULLS with the existing framework plus the window operator's per-partition drain/concat path (validated GPU == DuckDB CPU). Out-of-scope shapes (aggregate-over-window, FILTER, EXCLUDE, non-column keys, heterogeneous windows, HUGEINT child cols) throw
NotImplementedExceptionand use the existing DuckDB fallback path when fallback is enabled.Roadmap
A staged roadmap toward broader window support — not a claim of full DuckDB/SQL window parity. P1 is ranking; P2/P3 target the main remaining useful window families:
ROW_NUMBER/RANK/DENSE_RANKSUM/AVG/COUNT/MIN/MAX OVER) + default &ROWSframesRANGEframe,LEAD/LAG,FIRST/LAST/NTH_VALUE,NTILE, heterogeneous-window splittingcuDF already has the primitives for almost all of P2/P3 (
grouped_rolling_window,grouped_range_rolling_window,nth_element/lead/lag); onlyNTILEhas no direct kernel (composable fromrow_number+ partition count). So later phases are mostly Sirius-side orchestration.Beyond this roadmap (follow-up, not yet planned):
FILTER/EXCLUDE,GROUPSframe,DISTINCTaggregate windows,PERCENT_RANK/CUME_DIST, expressionPARTITION BY/ORDER BYkeys, richer types (DECIMAL/HUGEINT/ interval range bounds), collation / string-ordering details, multi-frame splitting & reuse optimization, and out-of-core / spill / multi-GPU semantics. Full DuckDB corner-case parity is explicitly out of scope here.Also fixed (engine-level, surfaced by this work)
DESC NULLS FIRST/LASTplacement (correctness). cuDF appliesnull_orderin the ascending frame and reverses it for a DESCENDING column, so the naiveNULLS_FIRST?BEFORE:AFTERmapping mis-placed NULLs for every descending sort key — e.g.ORDER BY x DESC NULLS LASTput NULLs first on GPU, disagreeing with DuckDB CPU. The mapping was duplicated acrossorder/sort_sample/sort_partition/merge_sort/top_n. Centralized intoop/cudf_sort_order.hpp(to_cudf_null_order, which flipsBEFORE/AFTERfor descending keys) and applied everywhere, including the window pre-sort. Also fixes single-keyTOP-Non a nullable key:cudf::top_k_ordertakes nonull_orderand ignored NULLS placement when selecting the top-k, so a nullable key now falls back to a full sort. Pre-existing and undetected because benchmark sort keys are non-null / NULL-filtered (it needsDESC∧ NULLs present ∧ order-sensitive comparison).Tests
All Catch2, one
sirius_unittestbinary (tag-filtered). Default regression includes window semantics; TPC-DS + perf are hidden[.]and run on demand.test_physical_window.cpp— plan/guard ([window][plan])test_gpu_execution_window.cpp— GPU semantics incl. multi-batch/multi-partition and non-integer (VARCHAR/DATE/TIMESTAMP/DOUBLE) keys ([integration][window][gpu])test_gpu_execution_order_nulls.cpp— the 4 ASC/DESC × NULLS FIRST/LAST quadrants for ORDER BY + TOP-N ([integration][gpu_execution][nulls])test_gpu_execution_tpcds.cpp(Q44),test_tpcds_plan_translation.cpp(Q44 + Q67 translation),test_window_benchmark.cpp— hidden[.]Known limitations
ORDER BY, hash-join build and aggregate-merge OOM instead of spilling once the working set exceeds the GPUusage_limit(even at ~1.04× overflow with host memory free). Root cause: the downgrade path only spillsidlebatches, but partition-consume batches sit intask_created/processing, so it finds 0 spill candidates → OOM after retries. not introduced or worsened by this PR (plainORDER BYOOMs identically).Input welcome now in this PR we just fallback.
Notes
test/cpp/integration/data/duckdb/tpcds.duckdb(same precedent as the existing TPC-Hintegration.duckdb) + agenerate_tpcds_duckdb.shto reproduce it;.pre-commit-config.yamllarge-file check widened to coverdata/duckdb/*.duckdb.