ci(gpu): profile CUDA micro-benchmarks with Polar Signals PC sampling by joseph-isaacs · Pull Request #8338 · vortex-data/vortex

joseph-isaacs · 2026-06-10T15:05:35Z

Summary

Adds a GPU CI workflow that runs the full vortex-cuda benchmark suite directly (no CodSpeed) under NVIDIA CUDA PC sampling, with the resulting profiles collected by Polar Signals.

This complements the existing bench-codspeed-cuda job in codspeed.yml: that job measures wall-time regressions, whereas this one captures GPU PC-sampling profiles (kernel-level hotspots and stall reasons) in Polar Signals Cloud.

How it works

The mechanism described in the blog post has three parts, all wired up here:

* `-lineinfo` kernels: vortex-cuda/build.rs now adds -lineinfo to release/bench nvcc builds when VORTEX_CUDA_LINEINFO is set. This lets Polar Signals symbolize PC samples back to source. It's opt-in so default release builds are byte-for-byte unchanged, and -lineinfo has no runtime cost.
* parcagpu CUPTI shim: the workflow builds parca-dev/parcagpu v0.3.0 (libparcagpucupti.so) and injects it into every CUDA process via CUDA_INJECTION64_PATH. The shim exposes PC samples as USDT probes.
* parca-agent: reuses our existing polarsignals/gh-actions-ps-profiling action (same POLAR_SIGNALS_API_KEY secret and project_uuid as bench.yml), pinned to parca-agent 0.48.0, the release that added GPU PC-sampling support. It consumes the parcagpu USDT probes over eBPF and ships the profiles.
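The opt-in `-lineinfo` behavior from the first item can be sketched roughly as below. This is a minimal illustration, not the actual contents of vortex-cuda/build.rs: the helper name `extra_nvcc_flags` and the way the profile is detected are assumptions. Only the VORTEX_CUDA_LINEINFO variable, the `-lineinfo` flag, and the use of `env::var_os` come from this PR.

```rust
use std::env;

/// Hypothetical helper (not taken from the real build.rs): returns the extra
/// nvcc flags for a kernel build. `-lineinfo` is added only when the build is
/// optimized AND the opt-in variable is set, so default builds are unchanged.
fn extra_nvcc_flags(optimized_build: bool, lineinfo_requested: bool) -> Vec<&'static str> {
    if optimized_build && lineinfo_requested {
        vec!["-lineinfo"]
    } else {
        Vec::new()
    }
}

fn main() {
    // Cargo reports PROFILE=release to build scripts for profiles that inherit
    // from release (which includes bench); anything else is treated as debug.
    let optimized = env::var("PROFILE").map(|p| p == "release").unwrap_or(false);

    // Presence of the variable is the opt-in signal; its value is ignored here
    // (an assumption of this sketch).
    let requested = env::var_os("VORTEX_CUDA_LINEINFO").is_some();

    // Ask Cargo to re-run the build script when the opt-in variable changes.
    println!("cargo:rerun-if-env-changed=VORTEX_CUDA_LINEINFO");

    println!("extra nvcc flags: {:?}", extra_nvcc_flags(optimized, requested));
}
```

Keeping the decision in a small pure function makes the "default builds are unchanged" property easy to check without a CUDA toolkit.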

PC sampling is enabled at ~100 samples/sec with the blog's recommended sampling factor of 20 (PARCAGPU_PC_SAMPLING_RATE / PARCAGPU_SAMPLING_FACTOR).
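For illustration, the environment a profiled benchmark process receives might look like the sketch below. The three variable names and the `cargo bench -p vortex-cuda` invocation come from this PR. The literal values `100` and `20` are inferred from the sentence above, the shim path is a made-up placeholder, and the helper names are hypothetical. The workflow itself sets these in YAML rather than in Rust.

```rust
use std::process::Command;

/// Hypothetical helper: environment variables that inject the parcagpu shim
/// and enable PC sampling. Values are assumptions based on the PR text.
fn profiling_env(shim_path: &str) -> Vec<(&'static str, String)> {
    vec![
        // The CUDA runtime loads the library named by this variable into the process.
        ("CUDA_INJECTION64_PATH", shim_path.to_string()),
        // "~100 samples/sec" in the PR description; exact value assumed.
        ("PARCAGPU_PC_SAMPLING_RATE", "100".to_string()),
        // "sampling factor of 20" in the PR description.
        ("PARCAGPU_SAMPLING_FACTOR", "20".to_string()),
    ]
}

/// Builds (but does not run) the benchmark command with the profiling environment.
fn bench_command(shim_path: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["bench", "-p", "vortex-cuda"]);
    cmd.envs(profiling_env(shim_path));
    cmd
}

fn main() {
    // Placeholder path; the real location depends on where the shim is built.
    let cmd = bench_command("/path/to/libparcagpucupti.so");
    println!("{cmd:?}");
}
```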

Workflow

* Triggers: every PR that touches vortex-cuda/** (or the workflow file), plus workflow_dispatch.
* Runner: the g5 GPU runner used by the existing CUDA CodSpeed job.
* Forks are skipped (no access to the Polar Signals secret), matching bench-pr.yml.
* Runs all CUDA benchmarks in one job via cargo bench -p vortex-cuda (built once with --no-run).

Notes / things to confirm

* Requires the POLAR_SIGNALS_API_KEY secret to be available to this workflow (already used by bench.yml/bench-pr.yml/sql-benchmarks.yml).
* I bumped parca_agent_version to 0.48.0 (the action defaults to 0.45.0, which predates GPU PC sampling). If a specific newer flag is needed to enable GPU collection on the agent side, it can be passed via the action's extra_args.

Checks run

* yamllint --strict -c .yamllint.yaml .github/workflows/cuda-pc-sampling.yml: passes.
* Verified env::var_os usage in build.rs against the existing std::env import.
* Could not run the workflow itself (needs the GPU runner + Polar Signals secret) or a CUDA build locally (no GPU/CUDA toolkit in this environment).

https://claude.ai/code/session_01G2ZPy2t8iuZHF4HyQFYSYD

Generated by Claude Code

Add a GPU CI workflow that runs the full vortex-cuda benchmark suite directly (no Codspeed) under NVIDIA CUDA PC sampling, collected by Polar Signals via the parca-agent and the parcagpu CUPTI shim.

- New `.github/workflows/cuda-pc-sampling.yml` runs on every GPU PR (vortex-cuda changes) and on demand. It builds the parcagpu shim, reuses the existing Polar Signals action/token, injects the shim via CUDA_INJECTION64_PATH, enables PC sampling, and runs `cargo bench`.
- `vortex-cuda/build.rs` gains an opt-in `-lineinfo` for release/bench kernel builds (VORTEX_CUDA_LINEINFO) so PC samples symbolize to source. Default release builds are unchanged; `-lineinfo` has no runtime cost.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The parcagpu CMake build requires the `dtrace` probe generator to emit its USDT probes; on Ubuntu that ships in systemtap-sdt-dev. Without it CMake configure failed with "Could not find DTRACE".

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq · 2026-06-10T15:10:04Z

Merging this PR will not alter performance

⚡ 3 improved benchmarks
❌ 5 regressed benchmarks
✅ 915 untouched benchmarks
⏩ 604 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[128]` | 216.9 ns | 275.3 ns | -21.19% |
| ❌ | Simulation | `decompress_rd[f64, (100000, 0.0)]` | 845.9 µs | 1,024.6 µs | -17.44% |
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[1024]` | 278.6 ns | 336.9 ns | -17.31% |
| ❌ | Simulation | `decompress_rd[f32, (100000, 0.0)]` | 499.6 µs | 587 µs | -14.9% |
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[2048]` | 342.2 ns | 400.6 ns | -14.56% |
| ⚡ | Simulation | `decompress_rd[f64, (100000, 0.01)]` | 981.7 µs | 846 µs | +16.04% |
| ⚡ | Simulation | `decompress_rd[f64, (100000, 0.1)]` | 981.6 µs | 846 µs | +16.04% |
| ⚡ | Simulation | `baseline_lt[16, 65536]` | 246.7 µs | 219.1 µs | +12.59% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing claude/eloquent-davinci-nvq8zx (b6f36b2) with develop (f46621d)

¹ 604 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

parcagpu vendors `proton` as a git submodule; the shallow clone left it empty, so CMake failed with "Cannot find source file ... proton/.../CuptiApi.cpp". Clone recursively (shallow) so the submodule sources are present.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
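As a rough sketch of what a "recursive but still shallow" clone looks like, the snippet below assembles a `git clone` argument list. The git flags are standard. The repository URL, the tag spelling `v0.3.0`, the destination directory, and the helper name `clone_args` are assumptions for illustration; the exact command in the workflow is not shown in this PR page.

```rust
use std::process::Command;

/// Hypothetical helper: arguments for a shallow clone that also fetches
/// submodules shallowly, so vendored sources such as `proton` are present.
fn clone_args(repo_url: &str, tag: &str, dest: &str) -> Vec<String> {
    [
        "clone",
        "--depth", "1",            // shallow clone of the main repository
        "--branch", tag,           // check out the pinned tag
        "--recurse-submodules",    // populate submodules
        "--shallow-submodules",    // keep submodule clones shallow as well
        repo_url,
        dest,
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}

fn main() {
    // URL and tag are assumed from "parca-dev/parcagpu v0.3.0" in the PR description.
    let args = clone_args("https://github.com/parca-dev/parcagpu.git", "v0.3.0", "parcagpu");
    let mut cmd = Command::new("git");
    cmd.args(&args);
    // Printed rather than executed, so the sketch runs without network access.
    println!("{cmd:?}");
}
```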

0ax1 · 2026-06-10T15:12:35Z

@claude concise pr desc please

github-actions · 2026-06-10T15:13:00Z

Claude review automation is disabled for pull requests that modify .github/ files.

Why:

* workflow and action files are part of the automation policy
* this review workflow refuses to evaluate automation changes from the same PR

Ask a human reviewer to inspect workflow changes directly.

0ax1 · 2026-06-10T15:48:01Z

Claude review automation is disabled for pull requests that modify .github/ files.

Why:

* workflow and action files are part of the automation policy
* this review workflow refuses to evaluate automation changes from the same PR

Ask a human reviewer to inspect workflow changes directly.

@joseph-isaacs human, please

joseph-isaacs requested a review from a team June 10, 2026 15:05


joseph-isaacs wants to merge 3 commits into develop from claude/eloquent-davinci-nvq8zx


3 participants
