ci(gpu): profile CUDA micro-benchmarks with Polar Signals PC sampling #8338
joseph-isaacs wants to merge 3 commits into
Add a GPU CI workflow that runs the full vortex-cuda benchmark suite directly (no Codspeed) under NVIDIA CUDA PC sampling, collected by Polar Signals via the parca-agent and the parcagpu CUPTI shim.

- New `.github/workflows/cuda-pc-sampling.yml` runs on every GPU PR (vortex-cuda changes) and on demand. It builds the parcagpu shim, reuses the existing Polar Signals action/token, injects the shim via CUDA_INJECTION64_PATH, enables PC sampling, and runs `cargo bench`.
- `vortex-cuda/build.rs` gains an opt-in `-lineinfo` for release/bench kernel builds (VORTEX_CUDA_LINEINFO) so PC samples symbolize to source. Default release builds are unchanged; `-lineinfo` has no runtime cost.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
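As a rough illustration of the opt-in `-lineinfo` gate this commit describes, a build script might decide its extra `nvcc` flags along the following lines. This is a minimal sketch, not the actual `vortex-cuda/build.rs`: the helper name `extra_nvcc_flags` is hypothetical, and it assumes the release/bench distinction is read from Cargo's `PROFILE` build-script variable, which reports `release` for profiles that inherit from release (including bench).

```rust
use std::env;

/// Hypothetical helper: pick extra nvcc flags for a Cargo profile.
/// `profile` mirrors Cargo's `PROFILE` variable ("debug" or "release");
/// `lineinfo_requested` is whether `VORTEX_CUDA_LINEINFO` is set.
fn extra_nvcc_flags(profile: &str, lineinfo_requested: bool) -> Vec<&'static str> {
    let mut flags = Vec::new();
    // Only release-like builds opt in, and only when explicitly requested,
    // so default release builds keep their existing flag set.
    if profile == "release" && lineinfo_requested {
        flags.push("-lineinfo");
    }
    flags
}

fn main() {
    // Assumption: the real script reads the profile in a similar way.
    let profile = env::var("PROFILE").unwrap_or_else(|_| "debug".to_string());
    // `var_os` treats any value, even a non-UTF-8 one, as "set".
    let lineinfo_requested = env::var_os("VORTEX_CUDA_LINEINFO").is_some();

    // Ask Cargo to re-run the build script when the switch is toggled.
    println!("cargo:rerun-if-env-changed=VORTEX_CUDA_LINEINFO");

    for flag in extra_nvcc_flags(&profile, lineinfo_requested) {
        // A real script would append this to the nvcc invocation instead.
        println!("extra nvcc flag: {flag}");
    }
}
```

Keeping the decision in a pure function makes the default-unchanged property easy to check: with the variable unset, the returned flag list is empty for every profile.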
The parcagpu CMake build requires the `dtrace` probe generator to emit its USDT probes; on Ubuntu that ships in systemtap-sdt-dev. Without it CMake configure failed with "Could not find DTRACE".

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will not alter performance
parcagpu vendors `proton` as a git submodule; the shallow clone left it empty, so CMake failed with "Cannot find source file ... proton/.../CuptiApi.cpp". Clone recursively (shallow) so the submodule sources are present.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@claude concise pr desc please
Claude review automation is disabled for pull requests that modify Why:
Ask a human reviewer to inspect workflow changes directly.
@joseph-isaacs human, please
Summary

Adds a GPU CI workflow that runs the full `vortex-cuda` benchmark suite directly (no Codspeed) under NVIDIA CUDA PC sampling, with the resulting profiles collected by Polar Signals.

This complements the existing `bench-codspeed-cuda` job in `codspeed.yml`: that job measures wall-time regressions, whereas this one captures GPU PC-sampling profiles (kernel-level hotspots and stall reasons) in Polar Signals Cloud.

How it works

The mechanism described in the blog post has three parts, all wired up here:

- `-lineinfo` kernels: `vortex-cuda/build.rs` now adds `-lineinfo` to release/bench `nvcc` builds when `VORTEX_CUDA_LINEINFO` is set. This lets Polar Signals symbolize PC samples back to source. It's opt-in, so default release builds are byte-for-byte unchanged, and `-lineinfo` has no runtime cost.
- parcagpu shim: the workflow builds `parca-dev/parcagpu` `v0.3.0` (`libparcagpucupti.so`) and injects it into every CUDA process via `CUDA_INJECTION64_PATH`. The shim exposes PC samples as USDT probes.
- parca-agent: the workflow reuses the `polarsignals/gh-actions-ps-profiling` action (same `POLAR_SIGNALS_API_KEY` secret and `project_uuid` as `bench.yml`), pinned to parca-agent `0.48.0`, the release that added GPU PC-sampling support. It consumes the parcagpu USDT probes over eBPF and ships the profiles.

PC sampling is enabled at ~100 samples/sec with the blog's recommended sampling factor of 20 (`PARCAGPU_PC_SAMPLING_RATE` / `PARCAGPU_SAMPLING_FACTOR`).

Workflow

- Runs on PRs that touch `vortex-cuda/**` (or the workflow file), plus `workflow_dispatch`.
- Runs on the `g5` GPU runner used by the existing CUDA Codspeed job.
- … `bench-pr.yml`.
- Runs `cargo bench -p vortex-cuda` (built once with `--no-run`).

Notes / things to confirm

- Requires the `POLAR_SIGNALS_API_KEY` secret to be available to this workflow (already used by `bench.yml` / `bench-pr.yml` / `sql-benchmarks.yml`).
- Pins `parca_agent_version` to `0.48.0` (the action defaults to `0.45.0`, which predates GPU PC sampling). If a specific newer flag is needed to enable GPU collection on the agent side, it can be passed via the action's `extra_args`.

Checks run

- `yamllint --strict -c .yamllint.yaml .github/workflows/cuda-pc-sampling.yml`: passes.
- Checked `env::var_os` usage in `build.rs` against the existing `std::env` import.

https://claude.ai/code/session_01G2ZPy2t8iuZHF4HyQFYSYD
Generated by Claude Code
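
For readers who want to reproduce the injection step outside CI, the environment the workflow sets up could be approximated with a small launcher like the one below. It is only a sketch under stated assumptions: the shim path is hypothetical, the helper names `profiling_env` and `bench_command` are invented for illustration, and the literal values `100` and `20` assume the `PARCAGPU_*` variables take the sample rate and sampling factor as plain integers, which this PR does not spell out.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: environment for a PC-sampled benchmark run.
fn profiling_env(shim: &Path) -> Vec<(&'static str, String)> {
    vec![
        // CUDA loads the library named here into every CUDA process.
        ("CUDA_INJECTION64_PATH", shim.display().to_string()),
        // Assumption: ~100 samples/sec is written as the integer 100.
        ("PARCAGPU_PC_SAMPLING_RATE", "100".to_string()),
        // Assumption: the sampling factor of 20 is passed verbatim.
        ("PARCAGPU_SAMPLING_FACTOR", "20".to_string()),
    ]
}

/// Build (but do not run) the benchmark command with the shim injected.
fn bench_command(shim: &Path) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["bench", "-p", "vortex-cuda"]);
    cmd.envs(profiling_env(shim));
    cmd
}

fn main() {
    // Hypothetical install location for the built shim.
    let shim = Path::new("/opt/parcagpu/libparcagpucupti.so");
    let cmd = bench_command(shim);
    // Print the command for inspection; call `cmd.status()` to execute it.
    println!("{cmd:?}");
}
```

The sketch only constructs the command, so it can be inspected on a machine without a GPU; the profiles themselves still depend on a running parca-agent consuming the shim's USDT probes.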