ci(gpu): profile CUDA micro-benchmarks with Polar Signals PC sampling #8338

Open
joseph-isaacs wants to merge 3 commits into develop from claude/eloquent-davinci-nvq8zx

@joseph-isaacs

Contributor

Summary

Adds a GPU CI workflow that runs the full vortex-cuda benchmark suite directly (no Codspeed) under NVIDIA CUDA PC sampling, with the resulting profiles collected by Polar Signals.

This complements the existing bench-codspeed-cuda job in codspeed.yml: that job measures wall-time regressions, whereas this one captures GPU PC-sampling profiles (kernel-level hotspots and stall reasons) in Polar Signals Cloud.

How it works

The mechanism described in the blog post has three parts, all wired up here:

  1. -lineinfo kernels: vortex-cuda/build.rs now adds -lineinfo to release/bench nvcc builds when VORTEX_CUDA_LINEINFO is set. This lets Polar Signals symbolize PC samples back to source. It's opt-in so default release builds are byte-for-byte unchanged, and -lineinfo has no runtime cost.
  2. parcagpu CUPTI shim — the workflow builds parca-dev/parcagpu v0.3.0 (libparcagpucupti.so) and injects it into every CUDA process via CUDA_INJECTION64_PATH. The shim exposes PC samples as USDT probes.
  3. parca-agent — reuses our existing polarsignals/gh-actions-ps-profiling action (same POLAR_SIGNALS_API_KEY secret and project_uuid as bench.yml), pinned to parca-agent 0.48.0, the release that added GPU PC-sampling support. It consumes the parcagpu USDT probes over eBPF and ships the profiles.
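The opt-in in step 1 can be sketched roughly as follows. This is an illustration only, not the actual contents of vortex-cuda/build.rs: the helper names `extra_nvcc_flags` and `lineinfo_requested` are hypothetical, and it assumes the build script decides based on Cargo's `PROFILE` variable, which reports `release` for bench builds as well.

```rust
use std::env;

/// Hypothetical helper: extra nvcc flags for a given Cargo profile.
/// `lineinfo_requested` mirrors whether VORTEX_CUDA_LINEINFO is set.
fn extra_nvcc_flags(profile: &str, lineinfo_requested: bool) -> Vec<&'static str> {
    let mut flags = Vec::new();
    // Cargo's PROFILE is "release" for both release and bench builds,
    // so a single comparison covers the release/bench case.
    if profile == "release" && lineinfo_requested {
        flags.push("-lineinfo");
    }
    flags
}

/// Hypothetical helper: the flag is opt-in, so any value (even empty) enables it.
fn lineinfo_requested() -> bool {
    env::var_os("VORTEX_CUDA_LINEINFO").is_some()
}
```

With the variable unset, the returned list is empty, so the nvcc command line for a default release build stays as it was.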

PC sampling is enabled at ~100 samples/sec with the blog's recommended sampling factor of 20 (PARCAGPU_PC_SAMPLING_RATE / PARCAGPU_SAMPLING_FACTOR).
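As a rough sketch of the environment the benchmark processes would see: the variable names come from this PR, but the exact value encoding (plain integers `100` and `20`) and the shim path passed in are assumptions, and `pc_sampling_env` is a hypothetical helper rather than anything in the workflow.

```rust
/// Hypothetical helper: environment variables applied to every CUDA process.
/// `shim_path` is wherever the workflow placed libparcagpucupti.so (assumed).
fn pc_sampling_env(shim_path: &str) -> Vec<(&'static str, String)> {
    vec![
        // CUDA loads the library named here into each CUDA process.
        ("CUDA_INJECTION64_PATH", shim_path.to_string()),
        // Assumed encoding: samples per second as a plain integer.
        ("PARCAGPU_PC_SAMPLING_RATE", "100".to_string()),
        // Assumed encoding: the sampling factor of 20 as a plain integer.
        ("PARCAGPU_SAMPLING_FACTOR", "20".to_string()),
    ]
}
```

The pairs can be handed to `std::process::Command::envs` when spawning the benchmark binary.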

Workflow

  • Triggers: every PR that touches vortex-cuda/** (or the workflow file), plus workflow_dispatch.
  • Runner: the g5 GPU runner used by the existing CUDA Codspeed job.
  • Forks are skipped (no access to the Polar Signals secret), matching bench-pr.yml.
  • Runs all CUDA benchmarks in one job via cargo bench -p vortex-cuda (built once with --no-run).
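The build-once-then-run split in the last bullet might look like the sketch below. The workflow itself uses shell steps; this Rust version only illustrates the two invocations. Setting `VORTEX_CUDA_LINEINFO=1` on both steps is an assumption (the PR only says the variable must be set), and `bench_steps` is a hypothetical name.

```rust
use std::process::Command;

/// Hypothetical helper: a cargo bench invocation for vortex-cuda.
fn cargo_bench(extra: &[&str]) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["bench", "-p", "vortex-cuda"]).args(extra);
    // Assumed value "1"; the same environment on both steps avoids the
    // second invocation seeing a different build configuration.
    cmd.env("VORTEX_CUDA_LINEINFO", "1");
    cmd
}

/// Compile-only step first, then the actual benchmark run.
fn bench_steps() -> [Command; 2] {
    [cargo_bench(&["--no-run"]), cargo_bench(&[])]
}
```

Each command would then be executed in order with `status()`, stopping if the compile-only step fails.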

Notes / things to confirm

  • Requires the POLAR_SIGNALS_API_KEY secret to be available to this workflow (already used by bench.yml/bench-pr.yml/sql-benchmarks.yml).
  • I bumped parca_agent_version to 0.48.0 (the action defaults to 0.45.0, which predates GPU PC sampling). If a specific newer flag is needed to enable GPU collection on the agent side, it can be passed via the action's extra_args.

Checks run

  • yamllint --strict -c .yamllint.yaml .github/workflows/cuda-pc-sampling.yml — passes.
  • Verified env::var_os usage in build.rs against the existing std::env import.
  • Could not run the workflow itself (needs the GPU runner + Polar Signals secret) or a CUDA build locally (no GPU/CUDA toolkit in this environment).

https://claude.ai/code/session_01G2ZPy2t8iuZHF4HyQFYSYD


Generated by Claude Code

Add a GPU CI workflow that runs the full vortex-cuda benchmark suite
directly (no Codspeed) under NVIDIA CUDA PC sampling, collected by Polar
Signals via the parca-agent and the parcagpu CUPTI shim.

- New `.github/workflows/cuda-pc-sampling.yml` runs on every GPU PR
  (vortex-cuda changes) and on demand. It builds the parcagpu shim,
  reuses the existing Polar Signals action/token, injects the shim via
  CUDA_INJECTION64_PATH, enables PC sampling, and runs `cargo bench`.
- `vortex-cuda/build.rs` gains an opt-in `-lineinfo` for release/bench
  kernel builds (VORTEX_CUDA_LINEINFO) so PC samples symbolize to source.
  Default release builds are unchanged; `-lineinfo` has no runtime cost.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

joseph-isaacs requested a review from a team June 10, 2026 15:05

The parcagpu CMake build requires the `dtrace` probe generator to emit its
USDT probes; on Ubuntu that ships in systemtap-sdt-dev. Without it CMake
configure failed with "Could not find DTRACE".

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented Jun 10, 2026


Merging this PR will not alter performance

⚡ 3 improved benchmarks
❌ 5 regressed benchmarks
✅ 915 untouched benchmarks
⏩ 604 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 216.9 ns | 275.3 ns | -21.19% |
| Simulation | decompress_rd[f64, (100000, 0.0)] | 845.9 µs | 1,024.6 µs | -17.44% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 278.6 ns | 336.9 ns | -17.31% |
| Simulation | decompress_rd[f32, (100000, 0.0)] | 499.6 µs | 587 µs | -14.9% |
| Simulation | bitwise_not_vortex_buffer_mut[2048] | 342.2 ns | 400.6 ns | -14.56% |
| Simulation | decompress_rd[f64, (100000, 0.01)] | 981.7 µs | 846 µs | +16.04% |
| Simulation | decompress_rd[f64, (100000, 0.1)] | 981.6 µs | 846 µs | +16.04% |
| Simulation | baseline_lt[16, 65536] | 246.7 µs | 219.1 µs | +12.59% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/eloquent-davinci-nvq8zx (b6f36b2) with develop (f46621d)

Footnotes

  1. 604 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

parcagpu vendors `proton` as a git submodule; the shallow clone left it
empty, so CMake failed with "Cannot find source file ... proton/.../CuptiApi.cpp".
Clone recursively (shallow) so the submodule sources are present.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@0ax1

0ax1 commented Jun 10, 2026

Contributor

@claude concise pr desc please

@github-actions

Contributor

Claude review automation is disabled for pull requests that modify .github/ files.

Why:

  • workflow and action files are part of the automation policy
  • this review workflow refuses to evaluate automation changes from the same PR

Ask a human reviewer to inspect workflow changes directly.

@0ax1

0ax1 commented Jun 10, 2026

Contributor

Claude review automation is disabled for pull requests that modify .github/ files.

Why:

* workflow and action files are part of the automation policy

* this review workflow refuses to evaluate automation changes from the same PR

Ask a human reviewer to inspect workflow changes directly.

@joseph-isaacs human, please


3 participants