CUPTI Profiler Host, Range Profiler, and PM Sampling APIs#3059

Draft
gbaraldi wants to merge 12 commits into JuliaGPU:master from gbaraldi:cupti-profiler-host

Conversation

@gbaraldi
Member

Summary

Adds support for the modern CUPTI profiling APIs: Profiler Host (metric enumeration), Range Profiler (per-kernel hardware counters), and PM Sampling (periodic counter collection).

Addresses #2694.

Changes

Phase 1: Low-level bindings

  • Added cupti_profiler_host.h, cupti_range_profiler.h, cupti_pmsampling.h to code generation
  • 37 new API functions generated via Clang.jl

Phase 2: Metric enumeration

  • CUPTI.supported_chips() — list supported GPU chip names
  • CUPTI.list_metrics() — enumerate available PM sampling metrics with descriptions
  • CUPTI.metric_info("metric_name") — detailed metric information with sub-metrics
  • CUPTI.check_profiling_permissions() — check NVreg_RestrictProfilingToAdminUsers

Phase 3: Range profiler

  • CUPTI.range_profile(f, metrics) — per-kernel hardware counter collection
  • Multi-pass support for metrics requiring multiple passes

Phase 4: PM sampling

  • CUPTI.pm_sample(f, metrics) — periodic hardware counter sampling
  • Configurable sampling interval, trigger mode, buffer size

Phase 5: @profile integration

  • CUDA.@profile counters=["metric1", "metric2"] code — hardware counter profiling via the existing macro
  • Pretty-printed table output

Example

julia> CUDA.@profile counters=["sm__cycles_active.avg", "dram__throughput.avg.pct_of_peak_sustained_elapsed"] begin
           @cuda threads=256 blocks=4096 vadd_kernel(a, b, c)
           CUDA.synchronize()
       end
Hardware counter profiling: 1 kernel(s), 2 metric(s)

┌────────┬────────────────────────────────────────────────────┬──────────┐
│ Kernel │ Metric                                             │    Value │
├────────┼────────────────────────────────────────────────────┼──────────┤
│ 0      │ sm__cycles_active.avg                              │  11056.8 │
│ 0      │ dram__throughput.avg.pct_of_peak_sustained_elapsed │   40.240 │
└────────┴────────────────────────────────────────────────────┴──────────┘

Test plan

  • 25/25 tests pass locally on GH100
  • CI

🤖 Generated with Claude Code

gbaraldi and others added 8 commits March 23, 2026 19:40
Phase 1: Code generation
- Add cupti_profiler_host.h, cupti_range_profiler.h, cupti_pmsampling.h
  to the Clang.jl wrapper generation
- Mark all cuptiProfilerHost* functions as needs_context=false
- Regenerate libcupti.jl with 37 new API functions

Phase 2: High-level wrappers for metric enumeration
- ProfilerHostContext: manages CUPTI profiler host object lifecycle
- supported_chips(): list all supported GPU chip names
- base_metrics()/sub_metrics()/metric_properties(): enumerate metrics
- single_pass_sets(): list single-pass metric set names
- list_metrics(): high-level metric listing with descriptions
- metric_info(): detailed metric information with sub-metrics
- check_profiling_permissions(): warn about NVreg_RestrictProfilingToAdminUsers
- chip_name(): auto-detect chip name from CUDA device

Addresses JuliaGPU#2694.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Phase 3: Range profiler
- range_profile(f, metrics): profile hardware counters per-kernel
- Multi-pass loop for metrics requiring multiple passes
- Returns RangeProfileResult with range names and metric values

Phase 4: PM sampling
- pm_sample(f, metrics): periodic hardware counter sampling
- Configurable sampling interval, trigger mode, buffer size
- Counter availability image handling for forward compatibility
- Returns PmSamplingResult with timestamped samples

Shared infrastructure:
- _profiler_initialize(): global CUPTI profiler init
- config_add_metrics!/get_config_image/get_num_passes: config helpers
- evaluate_metrics(): evaluate counter data to metric values
- profiler_lock: prevents concurrent profiling sessions

Tested on GH100: range profiling returns real SM/DRAM metrics,
PM sampling collects 1024 timestamped samples.
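
The `profiler_lock` item above can be pictured with a minimal sketch. It assumes a plain `ReentrantLock` that serializes sessions; the names `EXAMPLE_PROFILER_LOCK` and `with_exclusive_profiling` are hypothetical, and the PR's actual lock may behave differently (for example, by erroring instead of blocking).

```julia
# Hypothetical stand-in for the PR's profiler_lock.
const EXAMPLE_PROFILER_LOCK = ReentrantLock()

# Run `f` while holding the lock, so that two tasks cannot profile at the same time.
# The lock is released even if `f` throws.
function with_exclusive_profiling(f)
    lock(EXAMPLE_PROFILER_LOCK) do
        f()
    end
end

# Usage: the return value of the wrapped function is passed through.
result = with_exclusive_profiling() do
    sum(1:10)
end
```

A second task calling `with_exclusive_profiling` while the first is still inside would wait until the lock is released.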

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tests cover:
- supported_chips(): chip enumeration
- chip_name(): auto-detection from CUDA device
- single_pass_sets(): single-pass metric set listing
- ProfilerHostContext: lifecycle, base_metrics, sub_metrics, metric_properties
- list_metrics(): high-level metric listing
- check_profiling_permissions(): NVreg_RestrictProfilingToAdminUsers check
- range_profile(): hardware counter collection per-kernel
- pm_sample(): periodic counter sampling

All 25 tests pass on GH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CUDA.@profile counters=["metric1", "metric2"] begin ... end

Uses the CUPTI Range Profiler API to collect per-kernel hardware
counter values and pretty-prints a table with the results.

Integrates with the existing @profile macro rather than adding
separate macros: the counters keyword triggers counter profiling
mode while the default behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use the CUPTI callback API (which coexists with the range profiler)
to capture kernel symbol names during the first profiling pass.
Names are demangled via demumble and stripped to function name only.
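
As a rough illustration of the "stripped to function name only" step, here is a sketch that operates on an already-demangled signature. The function `strip_to_function_name` is hypothetical and is not the PR's `short_kernel_name`; it assumes a C++-style demangled string and ignores corner cases such as operators.

```julia
# Hypothetical helper: reduce a demangled signature to its bare function name.
function strip_to_function_name(demangled::AbstractString)
    # Cut at the first '(' or '<' to drop argument and template lists.
    head = first(split(demangled, r"[(<]"; limit=2))
    # Drop a leading return type, if present ("void foo" -> "foo").
    name = last(split(strip(head), ' '))
    # Drop namespace qualifiers ("ns::foo" -> "foo").
    return String(last(split(name, "::")))
end

strip_to_function_name("vadd_kernel(float*, float*, float*)")
```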

The @profile counters=... output now shows actual kernel names:

  ┌─────────────┬───────────────────┬─────────────────────┐
  │ Kernel      │ cycles_active.avg │ throughput.pct_peak │
  ├─────────────┼───────────────────┼─────────────────────┤
  │ vadd_kernel │           11478.3 │              40.234 │
  │ vmul_kernel │            9591.1 │              43.726 │
  └─────────────┴───────────────────┴─────────────────────┘

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Test that metrics requiring multiple passes (7 metrics = 4 passes
  on GH100) produce correct results. KernelReplay mode handles
  multi-pass internally so f() is only called once.
- Test that kernel names are captured via the callback API and
  contain the expected function name.

32/32 tests pass on GH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Core Profiler Host/Range Profiler/PM Sampling APIs: require CUDA >= 12.6
- cuptiProfilerHostGetSinglePassSets: require CUDA >= 13.2
- Tests skip entirely on CUDA < 12.6

Fixes CI failure on CUDA 13.0 where cuptiProfilerHostGetSinglePassSets
was not yet available.
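
The version gating described here can be sketched with plain `VersionNumber` comparisons. The constant and function names below are hypothetical, and the CUDA version is passed in explicitly because this thread does not show how the PR obtains it.

```julia
# Minimum CUDA versions quoted in this commit message.
const MIN_PROFILER_VERSION = v"12.6"
const MIN_SINGLE_PASS_SETS_VERSION = v"13.2"

# Hypothetical predicates; the PR's own checks may be structured differently.
profiler_apis_available(cuda_version::VersionNumber) =
    cuda_version >= MIN_PROFILER_VERSION
single_pass_sets_available(cuda_version::VersionNumber) =
    cuda_version >= MIN_SINGLE_PASS_SETS_VERSION

# CUDA 13.0 has the core APIs but not cuptiProfilerHostGetSinglePassSets,
# which matches the CI failure described above.
core_ok = profiler_apis_available(v"13.0")
sets_ok = single_pass_sets_available(v"13.0")
```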

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CUPTI profiling is not supported on MIG partitions, vGPU, WSL,
confidential compute, or CMP devices. The CI runs on A100 MIG
which caused all ProfilerHostContext tests to fail with
CUPTI_ERROR_INVALID_PARAMETER.

Add profiler_device_supported() that calls cuptiProfilerDeviceSupported
to check at runtime, and skip tests when profiling is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vchuravy
Member

x-ref #1826 and #1823

gbaraldi and others added 2 commits March 24, 2026 12:24
On MIG devices, even cuptiProfilerDeviceSupported throws
CUPTI_ERROR_INVALID_PARAMETER. Catch the error and return false
instead of propagating the exception.
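
A minimal sketch of this catch-and-return-false pattern follows. `FakeCuptiError` and `supported_or_false` are hypothetical stand-ins; the exception type that CUDA.jl actually throws for CUPTI errors is not shown in this thread.

```julia
# Stand-in for the error raised by a failing CUPTI call (assumption, not the real type).
struct FakeCuptiError <: Exception
    code::Symbol
end

# Treat a failing support query as "profiling not supported" instead of propagating it.
function supported_or_false(query::Function)
    try
        return query()::Bool
    catch err
        # Only swallow the expected error type; anything else is rethrown.
        err isa FakeCuptiError || rethrow()
        return false
    end
end

# Simulates the MIG case, where the query itself fails.
mig_result = supported_or_false() do
    throw(FakeCuptiError(:CUPTI_ERROR_INVALID_PARAMETER))
end
```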

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Extract _with_profiler_host() — shared setup for both APIs:
  CUPTI init, ProfilerHostContext creation, metric config, config image
- Extract _get_counter_availability() — PM sampling counter availability query
- Extract demangle_names!() and short_kernel_name() — shared with profile_internally
- Remove dead multi-pass loop from range_profile (KernelReplay handles it)
- Remove replay_mode parameter (always KernelReplay)
- Add detailed comments explaining the range profiling flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vchuravy
Member

@gbaraldi NVPerf(Works?) ought to help with the translation from raw metrics to more human-readable ones.

gbaraldi and others added 2 commits March 24, 2026 13:23
METRICS: Dict mapping human-readable names to bare CUPTI metric strings
that work across Turing, Ampere, Ada, Hopper, and Blackwell (including
GB202 consumer chips).

METRIC_ALIASES: Preset groups (:memory, :compute, :overview, :tensor)
for common profiling scenarios.

Uses fbpa__dram_read/write_bytes for DRAM read/write since
dram__bytes_read/write don't exist on GB202 (renamed to
dram__bytes_op_read/write). The fbpa__ prefix is stable across
all architectures.
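
To make the two tables concrete, here is a sketch of how a friendly-name dictionary and preset groups could be resolved into raw metric strings. All names (`EXAMPLE_METRICS`, `EXAMPLE_PRESETS`, `resolve_metrics`) and the dictionary keys are hypothetical; only the `fbpa__dram_read_bytes` and `fbpa__dram_write_bytes` strings are taken from the text above, and the real contents of `METRICS` and `METRIC_ALIASES` may differ.

```julia
# Illustrative entries only; the PR's actual METRICS keys are not shown here.
const EXAMPLE_METRICS = Dict(
    "dram_read_bytes"  => "fbpa__dram_read_bytes",
    "dram_write_bytes" => "fbpa__dram_write_bytes",
)

# Illustrative preset; the real :memory group may contain other metrics.
const EXAMPLE_PRESETS = Dict(
    :memory => ["dram_read_bytes", "dram_write_bytes"],
)

# Expand presets (Symbols) and friendly names into raw metric strings.
# Unknown strings pass through unchanged, so raw metric names still work.
function resolve_metrics(requested)
    resolved = String[]
    for item in requested
        names = item isa Symbol ? EXAMPLE_PRESETS[item] : [String(item)]
        for name in names
            push!(resolved, get(EXAMPLE_METRICS, name, name))
        end
    end
    return unique(resolved)
end

raw = resolve_metrics([:memory, "sm__cycles_active.avg"])
```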

Verified on TU102, GA100, GA102, AD102, GH100, GB100, GB202.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove dead multi-pass loop (KernelReplay handles it internally)
- Extract _with_profiler_host() shared between range profiler and PM sampling
- Extract _get_counter_availability(), demangle_names!(), short_kernel_name()
- Add METRICS dict: human-readable aliases for bare CUPTI metric names
  verified across TU102, GA100, GA102, AD102, GH100, GB100, GB202
- Add METRIC_ALIASES presets: :overview, :memory, :compute, :tensor
- Use fbpa__dram_read/write_bytes (stable across all architectures
  including GB202 where dram__bytes_read was renamed)
- Fix short metric name collisions by keeping unit prefix
- Add detailed comments explaining range profiling flow
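
The short-name collision fix can be illustrated as follows. This sketch keeps the unit prefix only when the stripped name is ambiguous, which is one possible approach; the PR may simply always keep the prefix. The function names and the `unitA__`/`unitB__` metrics are placeholders.

```julia
# Drop the "unit__" prefix: "sm__cycles_active.avg" -> "cycles_active.avg".
strip_unit(metric::AbstractString) = String(last(split(metric, "__"; limit=2)))

# Use the stripped name for display only when it is unique within the requested set;
# otherwise fall back to the full metric name, which includes the unit prefix.
function display_names(metrics::Vector{String})
    stripped = strip_unit.(metrics)
    counts = Dict{String,Int}()
    for s in stripped
        counts[s] = get(counts, s, 0) + 1
    end
    return [counts[s] == 1 ? s : m for (m, s) in zip(metrics, stripped)]
end

labels = display_names(["sm__cycles_active.avg", "unitA__x.sum", "unitB__x.sum"])
```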

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 6.79887% with 329 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.40%. Comparing base (9f56ee2) to head (eb37566).

Files with missing lines    Patch %    Lines
lib/cupti/wrappers.jl       3.58%      269 Missing ⚠️
src/profile.jl              18.91%     60 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3059      +/-   ##
==========================================
- Coverage   76.94%   74.40%   -2.55%     
==========================================
  Files         148      148              
  Lines       12984    13329     +345     
==========================================
- Hits         9991     9917      -74     
- Misses       2993     3412     +419     
