Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK by ran-yuan-rui · Pull Request #806 · sirius-db/sirius

ran-yuan-rui · 2026-05-25T12:53:13Z

Talked with @wmalpica and he point me that this uses hash partitioning + in-window cuDF sort/rank for single-GPU in-core ranking. It is not yet the sort-partition architecture proposed for scalable/out-of-core windows.
So I will redo the PR.

Summary

Adds GPU support for the ranking window functions ROW_NUMBER / RANK / DENSE_RANK to Sirius, so supported ranking-window shapes run on GPU via gpu_execution instead of forcing the whole query back to DuckDB CPU (out-of-scope shapes go through the existing fallback/error path — see Design).

This is Phase 1 of a staged roadmap toward broader window support (see below) — the planned phases cover the main useful window families, not full DuckDB/SQL window parity. It also includes an engine-wide DESC NULLS FIRST/LAST correctness fix in sort/top-n that the window pre-sort depends on. The logical areas are split across commits and can be reviewed independently.

Design

Operator. New sirius_physical_window, a partition-consumer (same base as hash-join build / aggregate-merge — both a sink and a source). Ranking is not decomposable, so it consumes already hash-partitioned data rather than streaming batch-by-batch.

Pipeline shape — reuses the existing partition framework, no new machinery:

The converter inserts a sirius_physical_partition (HASH by the PARTITION BY columns; one global partition when there's no PARTITION BY).
PARTITION → WINDOW is a FULL barrier; the window operator drains one whole partition per task. This is HASH_GROUP_BY's PARTITION → MERGE_GROUP_BY shape minus the local partial stage, and no CONCAT (that's join-only).

Compute is pure cuDF. Per partition: stable-sort by (PARTITION BY, ORDER BY dirs) → grouped rank scan via cudf::groupby(…, sorted::YES).scan(make_rank_aggregation(...)) (FIRST=row_number, MIN=rank, DENSE=dense_rank). Under the presorted (sorted::YES) path the rank order comes entirely from the pre-sort, so mixed ASC/DESC, multi-column keys, and all three methods are a single code path.

Why this matters: we reuse DuckDB's bound expressions + the existing partition/task framework and do not port DuckDB's CPU window machinery. As a result, the design covers multi-batch / multi-partition / NULL groups / ASC·DESC·NULLS with the existing framework plus the window operator's per-partition drain/concat path (validated GPU == DuckDB CPU). Out-of-scope shapes (aggregate-over-window, FILTER, EXCLUDE, non-column keys, heterogeneous windows, HUGEINT child cols) throw NotImplementedException and use the existing DuckDB fallback path when fallback is enabled.

Roadmap

A staged roadmap toward broader window support — not a claim of full DuckDB/SQL window parity. P1 is ranking; P2/P3 target the main remaining useful window families:

Phase	Scope	Status
P1	`ROW_NUMBER` / `RANK` / `DENSE_RANK`	this PR
P2	aggregate-over-window (`SUM/AVG/COUNT/MIN/MAX OVER`) + default & `ROWS` frames	planned
P3	`RANGE` frame, `LEAD`/`LAG`, `FIRST/LAST/NTH_VALUE`, `NTILE`, heterogeneous-window splitting	planned

cuDF already has the primitives for almost all of P2/P3 (grouped_rolling_window, grouped_range_rolling_window, nth_element / lead / lag); only NTILE has no direct kernel (composable from row_number + partition count). So later phases are mostly Sirius-side orchestration.

Beyond this roadmap (follow-up, not yet planned): FILTER / EXCLUDE, GROUPS frame, DISTINCT aggregate windows, PERCENT_RANK / CUME_DIST, expression PARTITION BY / ORDER BY keys, richer types (DECIMAL / HUGEINT / interval range bounds), collation / string-ordering details, multi-frame splitting & reuse optimization, and out-of-core / spill / multi-GPU semantics. Full DuckDB corner-case parity is explicitly out of scope here.

Also fixed (engine-level, surfaced by this work)

DESC NULLS FIRST/LAST placement (correctness). cuDF applies null_order in the ascending frame and reverses it for a DESCENDING column, so the naive NULLS_FIRST?BEFORE:AFTER mapping mis-placed NULLs for every descending sort key — e.g. ORDER BY x DESC NULLS LAST put NULLs first on GPU, disagreeing with DuckDB CPU. The mapping was duplicated across order / sort_sample / sort_partition / merge_sort / top_n. Centralized into op/cudf_sort_order.hpp (to_cudf_null_order, which flips BEFORE/AFTER for descending keys) and applied everywhere, including the window pre-sort. Also fixes single-key TOP-N on a nullable key: cudf::top_k_order takes no null_order and ignored NULLS placement when selecting the top-k, so a nullable key now falls back to a full sort. Pre-existing and undetected because benchmark sort keys are non-null / NULL-filtered (it needs DESC ∧ NULLs present ∧ order-sensitive comparison).

Tests

All Catch2, one sirius_unittest binary (tag-filtered). Default regression includes window semantics; TPC-DS + perf are hidden [.] and run on demand.

test_physical_window.cpp — plan/guard ([window][plan])
test_gpu_execution_window.cpp — GPU semantics incl. multi-batch/multi-partition and non-integer (VARCHAR/DATE/TIMESTAMP/DOUBLE) keys ([integration][window][gpu])
test_gpu_execution_order_nulls.cpp — the 4 ASC/DESC × NULLS FIRST/LAST quadrants for ORDER BY + TOP-N ([integration][gpu_execution][nulls])
COALESCE GPU==CPU coverage case in test_gpu_execution_tpch.cpp (the COALESCE impl itself now comes from upstream dev)
test_gpu_execution_tpcds.cpp (Q44), test_tpcds_plan_translation.cpp (Q44 + Q67 translation), test_window_benchmark.cpp — hidden [.]

Known limitations

Single-GPU validation only. All testing is on one GPU. Multi-GPU is not exercised: partitions are independent so the framework could in principle schedule them across GPUs, but that path is untested here and not claimed.
No out-of-core / spill for blocking operators. Window, ORDER BY, hash-join build and aggregate-merge OOM instead of spilling once the working set exceeds the GPU usage_limit (even at ~1.04× overflow with host memory free). Root cause: the downgrade path only spills idle batches, but partition-consume batches sit in task_created/processing, so it finds 0 spill candidates → OOM after retries. not introduced or worsened by this PR (plain ORDER BY OOMs identically).
Input welcome now in this PR we just fallback.

Notes

Adds a committed 12 MB TPC-DS fixture test/cpp/integration/data/duckdb/tpcds.duckdb (same precedent as the existing TPC-H integration.duckdb) + a generate_tpcds_duckdb.sh to reproduce it; .pre-commit-config.yaml large-file check widened to cover data/duckdb/*.duckdb.

Add tests for GPU window ranking (row_number / rank / dense_rank): - test/cpp/operator/test_physical_window.cpp: plan-translation coverage — accepts ranking expressions sharing PARTITION BY / ORDER BY (with and without partitioning); rejects unsupported shapes (aggregate-over, frame EXCLUDE, heterogeneous partitions, non-column keys, HUGEINT columns, rank/dense_rank without ORDER BY) with NotImplementedException. - test/sql/window/ranking.test: ranking semantics via gpu_execution (ASC/DESC, NULLS FIRST/LAST, multi-key, ties, NULL partitions). - test/cpp/integration/test_gpu_execution_tpcds.cpp: GPU-vs-CPU equality for row_number / rank / dense_rank over store_sales and customer. - test/cpp/integration/test_tpcds_plan_translation.cpp: assert a TPC-DS ranking query translates to a Sirius physical plan. - test/cpp/integration/data/duckdb/: committed TPC-DS sf=0.01 fixture (tpcds.duckdb) plus its generator script. Build: - CMakeLists.txt: compile the new tests, and the previously unbuilt test_tpcds_plan_translation.cpp, into sirius_unittest. - .pre-commit-config.yaml: allow committed duckdb fixtures under test/cpp/integration/data/duckdb/ past the large-file check.

Implement sirius_physical_window, a partition-consumer operator that evaluates ROW_NUMBER, RANK, and DENSE_RANK over an optional PARTITION BY and ORDER BY shared by the expressions of one LogicalWindow. The pipeline converter hash-partitions the input by the PARTITION BY columns (a single global partition when absent) and feeds whole partitions to the operator, which concatenates them, stably sorts by (PARTITION BY, ORDER BY), and computes ranks with a cuDF grouped rank scan. Output is the child columns followed by one BIGINT rank column per window expression. - Enable the LOGICAL_WINDOW dispatch in the Sirius physical plan generator. create_plan(LogicalWindow) guards unsupported shapes (aggregate-over, FILTER, EXCLUDE, non-column keys, mixed framings, HUGEINT child columns, RANK/DENSE_RANK without ORDER BY) with NotImplementedException so such queries fall back to DuckDB. - sirius_physical_partition hash-partitions on the window's PARTITION BY columns for the new WINDOW consumer. - Honor ORDER BY ASC/DESC and NULLS FIRST/LAST; the cuDF null order is flipped for descending keys to match SQL semantics. - Null-check the SiriusContext in comparison-join planning so a missing context raises a clear error instead of a raw null dereference. Add window operator tests, TPC-DS Q44 ranking execution and plan-translation coverage, and SQL ranking-semantics tests.

COALESCE was unimplemented in the column expression executor (it threw NotImplementedException), forcing any query using it off the GPU. Implement it by chaining cudf::replace_nulls — start from the first argument and fill the remaining NULLs from each subsequent argument, yielding the first non-NULL per row. Add a GPU-vs-CPU test covering multi-arg COALESCE with different NULL distributions and types (int / bigint / double).

cuDF applies null_order in the ascending frame and reverses it for a descending column, so the naive NULLS_FIRST?BEFORE:AFTER mapping mis-placed NULLs for every DESC sort key (e.g. ORDER BY x DESC NULLS LAST put NULLs first, disagreeing with DuckDB CPU). The mapping was duplicated across the order, sort-sample, sort-partition, merge-sort and top-n operators. Centralize the contract in op/cudf_sort_order.hpp (to_cudf_order / to_cudf_null_order, which flips BEFORE<->AFTER for descending keys) and route every sort/top-n/window path through it. Also fix single-key top-n with a nullable key: cudf::top_k_order takes no null_order and cannot honor NULLS FIRST/LAST when selecting the top k, so a nullable key now falls back to a full sort that honors null placement. Add test/cpp/integration/test_gpu_execution_order_nulls.cpp covering the four ASC/DESC x NULLS FIRST/LAST quadrants across distributed ORDER BY and single/multi-key TOP-N.

The window GPU semantic tests only exercised INTEGER partition/order keys. The guard imposes no type restriction (only HUGEINT and non-column keys are rejected), so STRING/DATE/TIMESTAMP/DOUBLE keys already run through the same stable-sort + grouped rank scan path. Add a default-run case covering them so that path — including the DESC NULLS FIRST/LAST handling and the rank/dense_rank tie-detection value struct built from non-integer order columns — is verified GPU == DuckDB CPU. Shapes: VARCHAR partition + DOUBLE DESC NULLS LAST (row_number); INTEGER partition + DATE DESC NULLS LAST (rank/dense_rank); global VARCHAR DESC NULLS LAST + TIMESTAMP ASC NULLS FIRST (rank); VARCHAR partition + TIMESTAMP ASC NULLS FIRST (row_number).

The Q67 translation case still said GPU execution was deferred "until COALESCE is supported", but COALESCE now runs on GPU in this branch. Reword to: Q67 is translation-only here; end-to-end GPU execution is deferred until the remaining non-window TPC-DS gaps (e.g. the ROLLUP projection path) are covered.

Resolve conflicts from the upstream refactors (wire_data_repositories → descriptors, batch API, sirius exceptions, expression executor rework): - CMakeLists / TEST_SOURCES: union of both sides' new sources/tests. - compute_repository_wiring: adopt upstream's emit/descriptor form and add WINDOW to the FULL-barrier sink list (PARTITION→WINDOW is wired by the existing PARTITION branch). - sort_sample: keep upstream's restructured boundary computation, re-apply the shared to_cudf_order/to_cudf_null_order helper for DESC NULLS correctness. - gpu_execute_operator: take upstream's COALESCE (it now implements the same replace_nulls chaining), dropping the now-redundant local COALESCE branch. - plan_comparison_join: keep the sirius_state null-check hardening.

Adapt the window ranking code to the APIs changed by the merged upstream dev: - types parameter: duckdb::LogicalType -> sirius::logical_type (base ctor); create_plan now wraps op.types via sirius::from_duckdb_vec. - source_order(): duckdb::OrderPreservationType -> sirius::OrderPreservationType (operator + plan test). - get_next_task_input_data: repo->pop_data_batch(state, idx) -> pop_next_data_batch(idx) (batch_state::task_created removed). - execute(): use read-only batches (data_batch::get_memory_space is now private); make_data_batch now requires the writer stream. - split_window_sink: engine_ -> build_ctx_, pipeline_breakers_ -> inserted_operators_ (converter member renames). - physical_plan_generator.hpp: add LogicalWindow forward declaration. Compiles clean; single-partition window/order/nulls cases pass. Multi-partition cases under g_integration_env still under investigation (works via CLI).

After the DuckDB bump in the merge, current_setting('scan_task_batch_size') (and hash_partition_bytes / max_sort_partition_bytes) returns an empty result set rather than the value, so the runtime-setting RAII guards and the benchmark config note threw "Attempted to access index 0 within vector of size 0" when reading the current value. Read these extension options via `SELECT value FROM duckdb_settings() WHERE name = '...'`, which returns the value. No operator changes — the window/sort logic itself was unaffected (verified GPU == CPU via the CLI).

… example The "transparent execution: fallback for unsupported (window)" case used ROW_NUMBER() OVER (...) as its "unsupported operator" example, but this PR makes ranking windows (ROW_NUMBER/RANK/DENSE_RANK) GPU-supported, so that query no longer falls back and the rebind accounting changed. Switch the example to an aggregate-over-window (SUM(...) OVER (...)), which Phase 1 still does not support, so the test keeps exercising the silent CPU fallback path.

wmalpica · 2026-05-26T19:43:07Z

@ran-yuan-rui Since this PR introduces a bug fix and a 12 MB TPC-DS fixture , could you split those two off into a separate PR, that way we get that merged sooner and the larger piece of work involved here for the WIndow function support does not delay it.

wmalpica · 2026-05-26T21:56:23Z

+// Phase 1 supports only ROW_NUMBER / RANK / DENSE_RANK, sharing one PARTITION BY / ORDER BY, with
+// column-reference keys, no FILTER / EXCLUDE, and no HUGEINT child columns. Anything else throws
+// NotImplementedException so the whole query falls back to DuckDB CPU.
+void guard_phase1(duckdb::LogicalWindow& op)


i am assuming that we will only support one window in the query for now, correct? Like, you can have multiple aggregators in the same window, but if you have different windows, then the partitioning and ordering would need to be different, and that should be outside the scope of phase 1

ran-yuan-rui · 2026-05-26T23:49:40Z

@ran-yuan-rui Since this PR introduces a bug fix and a 12 MB TPC-DS fixture , could you split those two off into a separate PR, that way we get that merged sooner and the larger piece of work involved here for the WIndow function support does not delay it.

Thx - will do

@wmalpica

## Summary Split out from sirius-db#806 per @wmalpica review comment. This PR contains two independent pieces: 1. Fix GPU `ORDER BY` / `TOP-N` correctness for `DESC ... NULLS FIRST/LAST`. 2. Add a reusable SF0.01 TPC-DS DuckDB fixture for integration / plan-translation tests. ## RCA: DESC NULL ordering bug DuckDB SQL treats `NULLS FIRST` / `NULLS LAST` as the final, absolute placement of NULLs in the result: - `DESC NULLS FIRST` => NULLs first, then values descending - `DESC NULLS LAST` => values descending, then NULLs last Sirius previously mapped SQL null ordering directly to cuDF: ```cpp NULLS_FIRST -> cudf::null_order::BEFORE NULLS_LAST -> cudf::null_order::AFTER ``` That is correct for `ASC`, but wrong for `DESC`: cuDF applies `null_order` in its sort-direction frame, so DESC effectively reverses the NULL placement. As a result, GPU `ORDER BY k DESC NULLS LAST` could place NULLs first, disagreeing with DuckDB CPU. The same naive mapping was duplicated across the distributed sort path: - `ORDER_BY` - `SORT_SAMPLE` - `SORT_PARTITION` - `MERGE_SORT` - `TOP_N` There was one additional TOP-N issue: `cudf::top_k_order` has no `null_order` parameter, so single-key TOP-N on a nullable key could select the wrong top-k rows before any final sorting step. ## Fix - Added `src/include/op/cudf_sort_order.hpp` as the single source of truth: - `to_cudf_order(...)` - `to_cudf_null_order(...)` - `to_cudf_null_order(...)` flips `BEFORE` / `AFTER` for descending keys so SQL `NULLS FIRST/LAST` matches DuckDB semantics. - Routed all sort/top-n operators through the helper. - For single-key TOP-N: - keep `cudf::top_k_order` fast path when the key has no NULLs - use `sort_by_key + slice` when the key is nullable, so NULL placement is honored ## TPC-DS fixture Update (reworked per @mbrobbel's review): Dropped the committed 12 MB tpcds.duckdb binary. The fixture is now generated at build time via an opt-in CMake target tpcds-fixture (runs generate_tpcds_duckdb.sh into ${CMAKE_BINARY_DIR}/test-fixtures/, SF 0.01 via DuckDB's dsdgen). Run cmake --build . --target tpcds-fixture once before the TPC-DS integration tests. ## Tests Added `test_gpu_execution_order_nulls.cpp`: - `ORDER BY` - `TOP-N single-key` - `TOP-N multi-key` - `ASC/DESC × NULLS FIRST/LAST` The test materializes GPU output once, adds `observed_position` to make ordering differences visible, and compares against DuckDB CPU with bidirectional `EXCEPT ALL`. Validation: ```bash pre-commit run --all-files pixi run -e default make release build/release/extension/sirius/test/cpp/sirius_unittest "[nulls]" ```

mbrobbel · 2026-06-04T20:43:49Z

Marking as draft (there is a conflict).

ran-yuan-rui added 10 commits May 24, 2026 10:18

ran-yuan-rui changed the title ~~Add window function to Sirius (1/3) -ROW_NUMBER / RANK / DENSE_RANK~~ Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK May 25, 2026

ran-yuan-rui requested review from bwyogatama, felipeblazing and wmalpica and removed request for felipeblazing May 25, 2026 14:55

wmalpica reviewed May 26, 2026

View reviewed changes

ran-yuan-rui mentioned this pull request May 27, 2026

Bugfix desc null fix + add tpcds fixture #810

Merged

mbrobbel marked this pull request as draft June 4, 2026 20:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK#806

Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK#806
ran-yuan-rui wants to merge 10 commits into
sirius-db:devfrom
ran-yuan-rui:feature-window-function

ran-yuan-rui commented May 25, 2026 •

edited

Loading

Uh oh!

wmalpica commented May 26, 2026

Uh oh!

wmalpica May 26, 2026

Uh oh!

ran-yuan-rui commented May 26, 2026

Uh oh!

mbrobbel commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ran-yuan-rui commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Roadmap

Also fixed (engine-level, surfaced by this work)

Tests

Known limitations

Notes

Uh oh!

wmalpica commented May 26, 2026

Uh oh!

wmalpica May 26, 2026

Choose a reason for hiding this comment

Uh oh!

ran-yuan-rui commented May 26, 2026

Uh oh!

mbrobbel commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ran-yuan-rui commented May 25, 2026 •

edited

Loading