Skip to content

Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK#806

Draft
ran-yuan-rui wants to merge 10 commits into
sirius-db:devfrom
ran-yuan-rui:feature-window-function
Draft

Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK#806
ran-yuan-rui wants to merge 10 commits into
sirius-db:devfrom
ran-yuan-rui:feature-window-function

Conversation

@ran-yuan-rui

@ran-yuan-rui ran-yuan-rui commented May 25, 2026

Copy link
Copy Markdown
Contributor

Talked with @wmalpica and he point me that this uses hash partitioning + in-window cuDF sort/rank for single-GPU in-core ranking. It is not yet the sort-partition architecture proposed for scalable/out-of-core windows.
So I will  redo the PR.

Summary

Adds GPU support for the ranking window functions ROW_NUMBER / RANK / DENSE_RANK to Sirius, so supported ranking-window shapes run on GPU via gpu_execution instead of forcing the whole query back to DuckDB CPU (out-of-scope shapes go through the existing fallback/error path — see Design).

This is Phase 1 of a staged roadmap toward broader window support (see below) — the planned phases cover the main useful window families, not full DuckDB/SQL window parity. It also includes an engine-wide DESC NULLS FIRST/LAST correctness fix in sort/top-n that the window pre-sort depends on. The logical areas are split across commits and can be reviewed independently.

Design

Operator. New sirius_physical_window, a partition-consumer (same base as hash-join build / aggregate-merge — both a sink and a source). Ranking is not decomposable, so it consumes already hash-partitioned data rather than streaming batch-by-batch.

Pipeline shape — reuses the existing partition framework, no new machinery:

  • The converter inserts a sirius_physical_partition (HASH by the PARTITION BY columns; one global partition when there's no PARTITION BY).
  • PARTITION → WINDOW is a FULL barrier; the window operator drains one whole partition per task. This is HASH_GROUP_BY's PARTITION → MERGE_GROUP_BY shape minus the local partial stage, and no CONCAT (that's join-only).

Compute is pure cuDF. Per partition: stable-sort by (PARTITION BY, ORDER BY dirs) → grouped rank scan via cudf::groupby(…, sorted::YES).scan(make_rank_aggregation(...)) (FIRST=row_number, MIN=rank, DENSE=dense_rank). Under the presorted (sorted::YES) path the rank order comes entirely from the pre-sort, so mixed ASC/DESC, multi-column keys, and all three methods are a single code path.

Why this matters: we reuse DuckDB's bound expressions + the existing partition/task framework and do not port DuckDB's CPU window machinery. As a result, the design covers multi-batch / multi-partition / NULL groups / ASC·DESC·NULLS with the existing framework plus the window operator's per-partition drain/concat path (validated GPU == DuckDB CPU). Out-of-scope shapes (aggregate-over-window, FILTER, EXCLUDE, non-column keys, heterogeneous windows, HUGEINT child cols) throw NotImplementedException and use the existing DuckDB fallback path when fallback is enabled.

Roadmap

A staged roadmap toward broader window support — not a claim of full DuckDB/SQL window parity. P1 is ranking; P2/P3 target the main remaining useful window families:

Phase Scope Status
P1 ROW_NUMBER / RANK / DENSE_RANK this PR
P2 aggregate-over-window (SUM/AVG/COUNT/MIN/MAX OVER) + default & ROWS frames planned
P3 RANGE frame, LEAD/LAG, FIRST/LAST/NTH_VALUE, NTILE, heterogeneous-window splitting planned

cuDF already has the primitives for almost all of P2/P3 (grouped_rolling_window, grouped_range_rolling_window, nth_element / lead / lag); only NTILE has no direct kernel (composable from row_number + partition count). So later phases are mostly Sirius-side orchestration.

Beyond this roadmap (follow-up, not yet planned): FILTER / EXCLUDE, GROUPS frame, DISTINCT aggregate windows, PERCENT_RANK / CUME_DIST, expression PARTITION BY / ORDER BY keys, richer types (DECIMAL / HUGEINT / interval range bounds), collation / string-ordering details, multi-frame splitting & reuse optimization, and out-of-core / spill / multi-GPU semantics. Full DuckDB corner-case parity is explicitly out of scope here.

Also fixed (engine-level, surfaced by this work)

DESC NULLS FIRST/LAST placement (correctness). cuDF applies null_order in the ascending frame and reverses it for a DESCENDING column, so the naive NULLS_FIRST?BEFORE:AFTER mapping mis-placed NULLs for every descending sort key — e.g. ORDER BY x DESC NULLS LAST put NULLs first on GPU, disagreeing with DuckDB CPU. The mapping was duplicated across order / sort_sample / sort_partition / merge_sort / top_n. Centralized into op/cudf_sort_order.hpp (to_cudf_null_order, which flips BEFORE/AFTER for descending keys) and applied everywhere, including the window pre-sort. Also fixes single-key TOP-N on a nullable key: cudf::top_k_order takes no null_order and ignored NULLS placement when selecting the top-k, so a nullable key now falls back to a full sort. Pre-existing and undetected because benchmark sort keys are non-null / NULL-filtered (it needs DESC ∧ NULLs present ∧ order-sensitive comparison).

Tests

All Catch2, one sirius_unittest binary (tag-filtered). Default regression includes window semantics; TPC-DS + perf are hidden [.] and run on demand.

  • test_physical_window.cpp — plan/guard ([window][plan])
  • test_gpu_execution_window.cpp — GPU semantics incl. multi-batch/multi-partition and non-integer (VARCHAR/DATE/TIMESTAMP/DOUBLE) keys ([integration][window][gpu])
  • test_gpu_execution_order_nulls.cpp — the 4 ASC/DESC × NULLS FIRST/LAST quadrants for ORDER BY + TOP-N ([integration][gpu_execution][nulls])
  • COALESCE GPU==CPU coverage case in test_gpu_execution_tpch.cpp (the COALESCE impl itself now comes from upstream dev)
  • test_gpu_execution_tpcds.cpp (Q44), test_tpcds_plan_translation.cpp (Q44 + Q67 translation), test_window_benchmark.cpp — hidden [.]

Known limitations

  • Single-GPU validation only. All testing is on one GPU. Multi-GPU is not exercised: partitions are independent so the framework could in principle schedule them across GPUs, but that path is untested here and not claimed.
  • No out-of-core / spill for blocking operators. Window, ORDER BY, hash-join build and aggregate-merge OOM instead of spilling once the working set exceeds the GPU usage_limit (even at ~1.04× overflow with host memory free). Root cause: the downgrade path only spills idle batches, but partition-consume batches sit in task_created/processing, so it finds 0 spill candidates → OOM after retries. not introduced or worsened by this PR (plain ORDER BY OOMs identically).
    Input welcome now in this PR we just fallback.

Notes

  • Adds a committed 12 MB TPC-DS fixture test/cpp/integration/data/duckdb/tpcds.duckdb (same precedent as the existing TPC-H integration.duckdb) + a generate_tpcds_duckdb.sh to reproduce it; .pre-commit-config.yaml large-file check widened to cover data/duckdb/*.duckdb.

Add tests for GPU window ranking (row_number / rank / dense_rank):
- test/cpp/operator/test_physical_window.cpp: plan-translation coverage —
  accepts ranking expressions sharing PARTITION BY / ORDER BY (with and
  without partitioning); rejects unsupported shapes (aggregate-over,
  frame EXCLUDE, heterogeneous partitions, non-column keys, HUGEINT
  columns, rank/dense_rank without ORDER BY) with NotImplementedException.
- test/sql/window/ranking.test: ranking semantics via gpu_execution
  (ASC/DESC, NULLS FIRST/LAST, multi-key, ties, NULL partitions).
- test/cpp/integration/test_gpu_execution_tpcds.cpp: GPU-vs-CPU equality
  for row_number / rank / dense_rank over store_sales and customer.
- test/cpp/integration/test_tpcds_plan_translation.cpp: assert a TPC-DS
  ranking query translates to a Sirius physical plan.
- test/cpp/integration/data/duckdb/: committed TPC-DS sf=0.01 fixture
  (tpcds.duckdb) plus its generator script.

Build:
- CMakeLists.txt: compile the new tests, and the previously unbuilt
  test_tpcds_plan_translation.cpp, into sirius_unittest.
- .pre-commit-config.yaml: allow committed duckdb fixtures under
  test/cpp/integration/data/duckdb/ past the large-file check.
Implement sirius_physical_window, a partition-consumer operator that
evaluates ROW_NUMBER, RANK, and DENSE_RANK over an optional PARTITION BY
and ORDER BY shared by the expressions of one LogicalWindow.

The pipeline converter hash-partitions the input by the PARTITION BY columns
(a single global partition when absent) and feeds whole partitions to the
operator, which concatenates them, stably sorts by (PARTITION BY, ORDER BY),
and computes ranks with a cuDF grouped rank scan. Output is the child columns
followed by one BIGINT rank column per window expression.

- Enable the LOGICAL_WINDOW dispatch in the Sirius physical plan generator.
  create_plan(LogicalWindow) guards unsupported shapes (aggregate-over,
  FILTER, EXCLUDE, non-column keys, mixed framings, HUGEINT child columns,
  RANK/DENSE_RANK without ORDER BY) with NotImplementedException so such
  queries fall back to DuckDB.
- sirius_physical_partition hash-partitions on the window's PARTITION BY
  columns for the new WINDOW consumer.
- Honor ORDER BY ASC/DESC and NULLS FIRST/LAST; the cuDF null order is
  flipped for descending keys to match SQL semantics.
- Null-check the SiriusContext in comparison-join planning so a missing
  context raises a clear error instead of a raw null dereference.

Add window operator tests, TPC-DS Q44 ranking execution and plan-translation
coverage, and SQL ranking-semantics tests.
COALESCE was unimplemented in the column expression executor (it threw
NotImplementedException), forcing any query using it off the GPU. Implement it
by chaining cudf::replace_nulls — start from the first argument and fill the
remaining NULLs from each subsequent argument, yielding the first non-NULL per
row. Add a GPU-vs-CPU test covering multi-arg COALESCE with different NULL
distributions and types (int / bigint / double).
cuDF applies null_order in the ascending frame and reverses it for a
descending column, so the naive NULLS_FIRST?BEFORE:AFTER mapping mis-placed
NULLs for every DESC sort key (e.g. ORDER BY x DESC NULLS LAST put NULLs
first, disagreeing with DuckDB CPU). The mapping was duplicated across the
order, sort-sample, sort-partition, merge-sort and top-n operators.

Centralize the contract in op/cudf_sort_order.hpp (to_cudf_order /
to_cudf_null_order, which flips BEFORE<->AFTER for descending keys) and route
every sort/top-n/window path through it.

Also fix single-key top-n with a nullable key: cudf::top_k_order takes no
null_order and cannot honor NULLS FIRST/LAST when selecting the top k, so a
nullable key now falls back to a full sort that honors null placement.

Add test/cpp/integration/test_gpu_execution_order_nulls.cpp covering the four
ASC/DESC x NULLS FIRST/LAST quadrants across distributed ORDER BY and
single/multi-key TOP-N.
The window GPU semantic tests only exercised INTEGER partition/order keys.
The guard imposes no type restriction (only HUGEINT and non-column keys are
rejected), so STRING/DATE/TIMESTAMP/DOUBLE keys already run through the same
stable-sort + grouped rank scan path. Add a default-run case covering them so
that path — including the DESC NULLS FIRST/LAST handling and the rank/dense_rank
tie-detection value struct built from non-integer order columns — is verified
GPU == DuckDB CPU.

Shapes: VARCHAR partition + DOUBLE DESC NULLS LAST (row_number); INTEGER
partition + DATE DESC NULLS LAST (rank/dense_rank); global VARCHAR DESC NULLS
LAST + TIMESTAMP ASC NULLS FIRST (rank); VARCHAR partition + TIMESTAMP ASC NULLS
FIRST (row_number).
The Q67 translation case still said GPU execution was deferred "until COALESCE
is supported", but COALESCE now runs on GPU in this branch. Reword to: Q67 is
translation-only here; end-to-end GPU execution is deferred until the remaining
non-window TPC-DS gaps (e.g. the ROLLUP projection path) are covered.
Resolve conflicts from the upstream refactors (wire_data_repositories →
descriptors, batch API, sirius exceptions, expression executor rework):

- CMakeLists / TEST_SOURCES: union of both sides' new sources/tests.
- compute_repository_wiring: adopt upstream's emit/descriptor form and add
  WINDOW to the FULL-barrier sink list (PARTITION→WINDOW is wired by the
  existing PARTITION branch).
- sort_sample: keep upstream's restructured boundary computation, re-apply the
  shared to_cudf_order/to_cudf_null_order helper for DESC NULLS correctness.
- gpu_execute_operator: take upstream's COALESCE (it now implements the same
  replace_nulls chaining), dropping the now-redundant local COALESCE branch.
- plan_comparison_join: keep the sirius_state null-check hardening.
Adapt the window ranking code to the APIs changed by the merged upstream dev:

- types parameter: duckdb::LogicalType -> sirius::logical_type (base ctor);
  create_plan now wraps op.types via sirius::from_duckdb_vec.
- source_order(): duckdb::OrderPreservationType -> sirius::OrderPreservationType
  (operator + plan test).
- get_next_task_input_data: repo->pop_data_batch(state, idx) ->
  pop_next_data_batch(idx) (batch_state::task_created removed).
- execute(): use read-only batches (data_batch::get_memory_space is now private);
  make_data_batch now requires the writer stream.
- split_window_sink: engine_ -> build_ctx_, pipeline_breakers_ ->
  inserted_operators_ (converter member renames).
- physical_plan_generator.hpp: add LogicalWindow forward declaration.

Compiles clean; single-partition window/order/nulls cases pass. Multi-partition
cases under g_integration_env still under investigation (works via CLI).
After the DuckDB bump in the merge, current_setting('scan_task_batch_size')
(and hash_partition_bytes / max_sort_partition_bytes) returns an empty result
set rather than the value, so the runtime-setting RAII guards and the benchmark
config note threw "Attempted to access index 0 within vector of size 0" when
reading the current value. Read these extension options via
`SELECT value FROM duckdb_settings() WHERE name = '...'`, which returns the
value. No operator changes — the window/sort logic itself was unaffected
(verified GPU == CPU via the CLI).
… example

The "transparent execution: fallback for unsupported (window)" case used
ROW_NUMBER() OVER (...) as its "unsupported operator" example, but this PR makes
ranking windows (ROW_NUMBER/RANK/DENSE_RANK) GPU-supported, so that query no
longer falls back and the rebind accounting changed. Switch the example to an
aggregate-over-window (SUM(...) OVER (...)), which Phase 1 still does not support,
so the test keeps exercising the silent CPU fallback path.
@ran-yuan-rui ran-yuan-rui changed the title Add window function to Sirius (1/3) -ROW_NUMBER / RANK / DENSE_RANK Add window functions to Sirius (1/3): ROW_NUMBER / RANK / DENSE_RANK May 25, 2026
@ran-yuan-rui ran-yuan-rui requested review from bwyogatama, felipeblazing and wmalpica and removed request for felipeblazing May 25, 2026 14:55
@wmalpica

Copy link
Copy Markdown
Collaborator

@ran-yuan-rui Since this PR introduces a bug fix and a 12 MB TPC-DS fixture , could you split those two off into a separate PR, that way we get that merged sooner and the larger piece of work involved here for the WIndow function support does not delay it.

// Phase 1 supports only ROW_NUMBER / RANK / DENSE_RANK, sharing one PARTITION BY / ORDER BY, with
// column-reference keys, no FILTER / EXCLUDE, and no HUGEINT child columns. Anything else throws
// NotImplementedException so the whole query falls back to DuckDB CPU.
void guard_phase1(duckdb::LogicalWindow& op)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am assuming that we will only support one window in the query for now, correct? Like, you can have multiple aggregators in the same window, but if you have different windows, then the partitioning and ordering would need to be different, and that should be outside the scope of phase 1

@ran-yuan-rui

Copy link
Copy Markdown
Contributor Author

@ran-yuan-rui Since this PR introduces a bug fix and a 12 MB TPC-DS fixture , could you split those two off into a separate PR, that way we get that merged sooner and the larger piece of work involved here for the WIndow function support does not delay it.

Thx - will do

pull Bot pushed a commit to muzammilar/sirius that referenced this pull request May 30, 2026
## Summary

Split out from sirius-db#806 per @wmalpica review comment.


This PR contains two independent pieces:

1. Fix GPU `ORDER BY` / `TOP-N` correctness for `DESC ... NULLS
FIRST/LAST`.
2. Add a reusable SF0.01 TPC-DS DuckDB fixture for integration /
plan-translation tests.

## RCA: DESC NULL ordering bug

DuckDB SQL treats `NULLS FIRST` / `NULLS LAST` as the final, absolute
placement of NULLs in the result:

- `DESC NULLS FIRST` => NULLs first, then values descending
- `DESC NULLS LAST` => values descending, then NULLs last

Sirius previously mapped SQL null ordering directly to cuDF:

```cpp
NULLS_FIRST -> cudf::null_order::BEFORE
NULLS_LAST  -> cudf::null_order::AFTER
```

That is correct for `ASC`, but wrong for `DESC`: cuDF applies
`null_order` in its sort-direction frame, so DESC effectively reverses
the NULL placement. As a result, GPU `ORDER BY k DESC NULLS LAST` could
place NULLs first, disagreeing with DuckDB CPU.

The same naive mapping was duplicated across the distributed sort path:

- `ORDER_BY`
- `SORT_SAMPLE`
- `SORT_PARTITION`
- `MERGE_SORT`
- `TOP_N`

There was one additional TOP-N issue: `cudf::top_k_order` has no
`null_order` parameter, so single-key TOP-N on a nullable key could
select the wrong top-k rows before any final sorting step.

## Fix

- Added `src/include/op/cudf_sort_order.hpp` as the single source of
truth:
  - `to_cudf_order(...)`
  - `to_cudf_null_order(...)`
- `to_cudf_null_order(...)` flips `BEFORE` / `AFTER` for descending keys
so SQL `NULLS FIRST/LAST` matches DuckDB semantics.
- Routed all sort/top-n operators through the helper.
- For single-key TOP-N:
  - keep `cudf::top_k_order` fast path when the key has no NULLs
- use `sort_by_key + slice` when the key is nullable, so NULL placement
is honored

## TPC-DS fixture

Update (reworked per @mbrobbel's review): Dropped the committed 12 MB
tpcds.duckdb binary. The fixture is now generated at build time via an
opt-in CMake target tpcds-fixture (runs generate_tpcds_duckdb.sh into
${CMAKE_BINARY_DIR}/test-fixtures/, SF 0.01 via DuckDB's dsdgen). Run
cmake --build . --target tpcds-fixture once before the TPC-DS
integration tests.


## Tests

Added `test_gpu_execution_order_nulls.cpp`:

- `ORDER BY`
- `TOP-N single-key`
- `TOP-N multi-key`
- `ASC/DESC × NULLS FIRST/LAST`

The test materializes GPU output once, adds `observed_position` to make
ordering differences visible, and compares against DuckDB CPU with
bidirectional `EXCEPT ALL`.

Validation:

```bash
pre-commit run --all-files
pixi run -e default make release
build/release/extension/sirius/test/cpp/sirius_unittest "[nulls]"
```
@mbrobbel

mbrobbel commented Jun 4, 2026

Copy link
Copy Markdown
Member

Marking as draft (there is a conflict).

@mbrobbel mbrobbel marked this pull request as draft June 4, 2026 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants