feat[gpu]: sliced validity in Arrow device export#8318

Merged
0ax1 merged 3 commits into develop from ad/sliced-varbinview-e2e on Jun 10, 2026

Conversation

@0ax1 0ax1 commented Jun 9, 2026

Contributor

No description provided.

@0ax1 0ax1 force-pushed the ad/sliced-varbinview-e2e branch from cdee93f to da7278c Compare June 9, 2026 15:29

codspeed-hq Bot commented Jun 9, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
❌ 3 regressed benchmarks
✅ 1521 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 216.9 ns | 275.3 ns | -21.19% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 278.6 ns | 336.9 ns | -17.31% |
| Simulation | bitwise_not_vortex_buffer_mut[2048] | 342.2 ns | 400.6 ns | -14.56% |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 46.8 µs | 31.9 µs | +46.83% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 198.2 µs | 162 µs | +22.31% |
| Simulation | chunked_varbinview_into_canonical[(1000, 10)] | 213.4 µs | 177.2 µs | +20.41% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ad/sliced-varbinview-e2e (f8d19e7) with develop (031fb76)

@0ax1 0ax1 force-pushed the ad/sliced-varbinview-e2e branch from da7278c to 795fe55 Compare June 9, 2026 15:31
@0ax1 0ax1 changed the title test[gpu]: cover sliced utf8 Arrow device export test[gpu]: sliced utf8 Arrow device export Jun 9, 2026
@0ax1 0ax1 changed the title test[gpu]: sliced utf8 Arrow device export feat[gpu]: cover sliced validity in Arrow device export Jun 9, 2026
@0ax1 0ax1 added the changelog/feature A new feature label Jun 9, 2026
@0ax1 0ax1 changed the title feat[gpu]: cover sliced validity in Arrow device export feat[gpu]: sliced validity in Arrow device export Jun 9, 2026
@0ax1 0ax1 marked this pull request as ready for review June 9, 2026 15:34
@0ax1 0ax1 requested a review from a team June 9, 2026 15:34
Add cuDF e2e coverage for sliced and multi-buffer Utf8View arrays, including non-ASCII values and sliced null validity.

Keep bit-offset validity repacking on the CUDA stream for Arrow Device export, with focused tests and a CUDA benchmark for the repack path.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
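
As a rough illustration of what "bit-offset validity repacking" means, here is a host-side Rust sketch, not the CUDA code from this PR. It assumes Arrow's LSB-first bit order within each byte; the names `get_bit` and `repack_bit_by_bit` are hypothetical and do not come from the Vortex codebase. A sliced array's validity bits may start in the middle of a byte, so they are copied into a fresh bitmap that starts at bit 0.

```rust
/// Reads bit `i` of an LSB-first bitmap (the bit order Arrow uses).
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] >> (i % 8)) & 1 == 1
}

/// Hypothetical helper: copies `len` validity bits starting at bit
/// `src_offset` of `src` into a new bitmap whose first bit is bit 0.
/// Bits past `len` in the last output byte stay zero.
fn repack_bit_by_bit(src: &[u8], src_offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if get_bit(src, src_offset + i) {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

Testing one bit per iteration is easy to reason about, which makes a loop like this a convenient reference when checking a faster word-based variant.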
@0ax1 0ax1 force-pushed the ad/sliced-varbinview-e2e branch from 795fe55 to 6ce4bd7 Compare June 9, 2026 15:39
@0ax1 0ax1 enabled auto-merge (squash) June 9, 2026 15:41
@0ax1 0ax1 requested review from onursatici and robert3005 June 9, 2026 15:41
Comment thread vortex-cuda/kernels/src/arrow_validity.cu Outdated
0ax1 and others added 2 commits June 10, 2026 14:08
Rebuild the validity bitmap 64 bits at a time with a funnel shift over
two adjacent input words, masking the leading offset bits and the
trailing length bits, instead of testing bits one by one. Launch one
word per thread with a grid-stride loop so warp accesses coalesce.

Repack of 100M bits on GH200 drops from 140us to 21us (6.7x).

Also derive the output size from len + arrow_offset instead of taking a
redundant output_bytes parameter, drop the now-unneeded output memset
(every word is written, edge masks zero the padding), bound the
host-to-device copy to the slice's bytes via shrink_offset, and cover
negative-shift and multi-word offsets in the repack tests.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
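
The word-at-a-time repack described in this commit message can be sketched on the host in Rust. This is an interpretation under stated assumptions, not the kernel from `arrow_validity.cu`: the names `repack_words`, `load_word`, `low_bits`, and the `src_offset` parameter are hypothetical, input bits `[src_offset, src_offset + len)` are assumed to land at output bits `[arrow_offset, arrow_offset + len)`, and out-of-range input words are read as zero. Each iteration of the `map` stands in for one GPU thread producing one output word.

```rust
/// Loads input word `idx`, treating out-of-range indices as zero.
fn load_word(src: &[u64], idx: i64) -> u64 {
    if idx < 0 { 0 } else { src.get(idx as usize).copied().unwrap_or(0) }
}

/// Mask with the low `n` bits set; saturates to all ones for `n >= 64`.
fn low_bits(n: u64) -> u64 {
    if n >= 64 { u64::MAX } else { (1u64 << n) - 1 }
}

/// Hypothetical sketch: rebuilds a bitmap one 64-bit word at a time.
fn repack_words(src: &[u64], src_offset: u64, arrow_offset: u64, len: u64) -> Vec<u64> {
    // Output size is derived from len + arrow_offset, no extra parameter.
    let end = arrow_offset + len;
    let n_words = (end + 63) / 64;
    // Signed distance from an output bit position to its input bit position.
    // Negative when the output offset exceeds the input offset.
    let shift = src_offset as i64 - arrow_offset as i64;

    (0..n_words)
        .map(|w| {
            let base = w * 64;
            let start = base as i64 + shift;
            let idx = start.div_euclid(64);
            let r = start.rem_euclid(64) as u32;

            // Funnel shift: take 64 bits spanning two adjacent input words.
            let lo = load_word(src, idx);
            let hi = load_word(src, idx + 1);
            let word = if r == 0 { lo } else { (lo >> r) | (hi << (64 - r)) };

            // Zero the leading offset bits and the trailing bits past `end`.
            let first = arrow_offset.max(base) - base;
            let last = end.min(base + 64) - base;
            word & low_bits(last) & !low_bits(first)
        })
        .collect()
}
```

Because every output word is written and the edge masks zero the padding, a sketch like this needs no separate zero-fill of the output, which matches the reasoning the commit message gives for dropping the memset.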

0ax1 commented Jun 10, 2026

Contributor Author

GH200:

┌───────────────────────────────────────────────────────────────┬───────────────┬────────────┬─────────────┬─────────────┐
│                        Kernel variant                         │ Time (median) │ Throughput │ Step change │ vs baseline │
├───────────────────────────────────────────────────────────────┼───────────────┼────────────┼─────────────┼─────────────┤
│ Bit-by-bit, byte writes (PR baseline)                         │      140.3 µs │   178 GB/s │           — │        1.0× │
├───────────────────────────────────────────────────────────────┼───────────────┼────────────┼─────────────┼─────────────┤
│ u64 funnel-shift words, blocked ranges (start_elem/stop_elem) │       39.0 µs │   641 GB/s │      −72.2% │        3.6× │
├───────────────────────────────────────────────────────────────┼───────────────┼────────────┼─────────────┼─────────────┤
│ + grid-stride loop (coalesced warp accesses)                  │       26.9 µs │   929 GB/s │      −31.2% │        5.2× │
├───────────────────────────────────────────────────────────────┼───────────────┼────────────┼─────────────┼─────────────┤
│ + 256 threads/block, one word per thread                      │       21.0 µs │  1.19 TB/s │      −21.8% │        6.7× │
└───────────────────────────────────────────────────────────────┴───────────────┴────────────┴─────────────┴─────────────┘
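
The throughput column appears consistent with counting each of the 100M bits once as read and once as written (25 MB moved in total). That accounting is an inference from the numbers, not something the PR states, and the function names below are hypothetical. A small Rust sketch of the arithmetic:

```rust
/// Effective throughput in GB/s, assuming every bit is read once and
/// written once (an assumption inferred from the table, not stated in it).
fn throughput_gb_per_s(bits: u64, micros: f64) -> f64 {
    let bytes_moved = 2.0 * (bits as f64) / 8.0;
    bytes_moved / (micros * 1e-6) / 1e9
}

/// Speedup of a variant relative to the baseline time.
fn speedup(baseline_micros: f64, micros: f64) -> f64 {
    baseline_micros / micros
}
```

With 100,000,000 bits, 140.3 µs gives roughly 178 GB/s and 21.0 µs gives roughly 1.19 TB/s, and `speedup(140.3, 21.0)` is roughly 6.7, in line with the first and last rows.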

@0ax1 0ax1 requested a review from robert3005 June 10, 2026 14:13
@0ax1 0ax1 merged commit f46621d into develop Jun 10, 2026
78 of 81 checks passed
@0ax1 0ax1 deleted the ad/sliced-varbinview-e2e branch June 10, 2026 14:30
Labels

changelog/feature A new feature

2 participants