CUPTI Profiler Host, Range Profiler, and PM Sampling APIs by gbaraldi · Pull Request #3059 · JuliaGPU/CUDA.jl

gbaraldi · 2026-03-23T19:58:19Z

Summary

Adds support for the modern CUPTI profiling APIs: Profiler Host (metric enumeration), Range Profiler (per-kernel hardware counters), and PM Sampling (periodic counter collection).

Addresses #2694.

Changes

Phase 1: Low-level bindings

Added cupti_profiler_host.h, cupti_range_profiler.h, cupti_pmsampling.h to code generation
37 new API functions generated via Clang.jl

Phase 2: Metric enumeration

CUPTI.supported_chips() — list supported GPU chip names
CUPTI.list_metrics() — enumerate available PM sampling metrics with descriptions
CUPTI.metric_info("metric_name") — detailed metric information with sub-metrics
CUPTI.check_profiling_permissions() — check NVreg_RestrictProfilingToAdminUsers

Phase 3: Range profiler

CUPTI.range_profile(f, metrics) — per-kernel hardware counter collection
Multi-pass support for metrics requiring multiple passes

Phase 4: PM sampling

CUPTI.pm_sample(f, metrics) — periodic hardware counter sampling
Configurable sampling interval, trigger mode, buffer size

Phase 5: `@profile` integration

CUDA.@profile counters=["metric1", "metric2"] code — hardware counter profiling via the existing macro
Pretty-printed table output

Example

julia> CUDA.@profile counters=["sm__cycles_active.avg", "dram__throughput.avg.pct_of_peak_sustained_elapsed"] begin
           @cuda threads=256 blocks=4096 vadd_kernel(a, b, c)
           CUDA.synchronize()
       end
Hardware counter profiling: 1 kernel(s), 2 metric(s)

┌────────┬────────────────────────────────────────────────────┬──────────┐
│ Kernel │ Metric                                             │    Value │
├────────┼────────────────────────────────────────────────────┼──────────┤
│ 0      │ sm__cycles_active.avg                              │  11056.8 │
│ 0      │ dram__throughput.avg.pct_of_peak_sustained_elapsed │   40.240 │
└────────┴────────────────────────────────────────────────────┴──────────┘

Test plan

25/25 tests pass locally on GH100
CI

🤖 Generated with Claude Code

Phase 1: Code generation
- Add cupti_profiler_host.h, cupti_range_profiler.h, cupti_pmsampling.h to the Clang.jl wrapper generation
- Mark all cuptiProfilerHost* functions as needs_context=false
- Regenerate libcupti.jl with 37 new API functions

Phase 2: High-level wrappers for metric enumeration
- ProfilerHostContext: manages CUPTI profiler host object lifecycle
- supported_chips(): list all supported GPU chip names
- base_metrics()/sub_metrics()/metric_properties(): enumerate metrics
- single_pass_sets(): list single-pass metric set names
- list_metrics(): high-level metric listing with descriptions
- metric_info(): detailed metric information with sub-metrics
- check_profiling_permissions(): warn about NVreg_RestrictProfilingToAdminUsers
- chip_name(): auto-detect chip name from CUDA device

Addresses JuliaGPU#2694.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
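As a rough illustration of the kind of check `check_profiling_permissions()` performs, the sketch below parses driver parameter text for the admin-only profiling flag. It is not the PR's implementation: the helper name `profiling_restricted` is hypothetical, and it assumes the Linux driver reports the `NVreg_RestrictProfilingToAdminUsers` setting as an `RmProfilingAdminOnly: <0|1>` line in `/proc/driver/nvidia/params`.

```julia
# Hypothetical helper, not part of CUDA.jl.
# Returns true or false when the flag is present, and nothing when it is absent.
function profiling_restricted(params_text::AbstractString)
    for line in split(params_text, '\n')
        # Assumed line format: "RmProfilingAdminOnly: 1"
        m = match(r"^\s*RmProfilingAdminOnly\s*:\s*(\d+)\s*$", line)
        m === nothing && continue
        return parse(Int, m.captures[1]) != 0
    end
    return nothing
end

# Illustrative input; a real system would supply the contents of the params file.
sample = "ResmanDebugLevel: 4294967295\nRmProfilingAdminOnly: 1\n"
println(profiling_restricted(sample))
```

On a machine with the NVIDIA driver loaded, the same function could presumably be fed `read("/proc/driver/nvidia/params", String)`. Returning `nothing` for a missing line keeps "unknown" distinct from "unrestricted".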

Phase 3: Range profiler
- range_profile(f, metrics): profile hardware counters per-kernel
- Multi-pass loop for metrics requiring multiple passes
- Returns RangeProfileResult with range names and metric values

Phase 4: PM sampling
- pm_sample(f, metrics): periodic hardware counter sampling
- Configurable sampling interval, trigger mode, buffer size
- Counter availability image handling for forward compatibility
- Returns PmSamplingResult with timestamped samples

Shared infrastructure:
- _profiler_initialize(): global CUPTI profiler init
- config_add_metrics!/get_config_image/get_num_passes: config helpers
- evaluate_metrics(): evaluate counter data to metric values
- profiler_lock: prevents concurrent profiling sessions

Tested on GH100: range profiling returns real SM/DRAM metrics, PM sampling collects 1024 timestamped samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
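The `profiler_lock` item can be pictured with a small sketch of a session guard. The names `SESSION_LOCK` and `with_profiling_session` are hypothetical, and the commit message does not say whether the PR blocks or errors on contention; this version errors.

```julia
# Hypothetical stand-in for the PR's profiler_lock.
const SESSION_LOCK = ReentrantLock()

# Run f() as the only active profiling session.
function with_profiling_session(f)
    # trylock returns false immediately if another task holds the lock.
    trylock(SESSION_LOCK) || error("another profiling session is already active")
    try
        return f()
    finally
        # Always release, even if f() throws.
        unlock(SESSION_LOCK)
    end
end

println(with_profiling_session(() -> "profiled"))
```

Because `ReentrantLock` is reentrant, a nested call from the same task would still succeed; only a second task is rejected. A non-reentrant guard would need a different primitive.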

Tests cover:
- supported_chips(): chip enumeration
- chip_name(): auto-detection from CUDA device
- single_pass_sets(): single-pass metric set listing
- ProfilerHostContext: lifecycle, base_metrics, sub_metrics, metric_properties
- list_metrics(): high-level metric listing
- check_profiling_permissions(): NVreg_RestrictProfilingToAdminUsers check
- range_profile(): hardware counter collection per-kernel
- pm_sample(): periodic counter sampling

All 25 tests pass on GH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@profile

CUDA.@profile counters=["metric1", "metric2"] begin ... end

Uses the CUPTI Range Profiler API to collect per-kernel hardware counter values and pretty-prints a table with the results. Integrates with the existing @profile macro rather than adding separate macros: the counters keyword triggers counter profiling mode while the default behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@profile

Use the CUPTI callback API (which coexists with the range profiler) to capture kernel symbol names during the first profiling pass. Names are demangled via demumble and stripped to function name only.

The @profile counters=... output now shows actual kernel names:

┌─────────────┬───────────────────┬─────────────────────┐
│ Kernel      │ cycles_active.avg │ throughput.pct_peak │
├─────────────┼───────────────────┼─────────────────────┤
│ vadd_kernel │           11478.3 │              40.234 │
│ vmul_kernel │            9591.1 │              43.726 │
└─────────────┴───────────────────┴─────────────────────┘

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
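The "stripped to function name only" step might look roughly like the following. This is a sketch, not the PR's code: `function_name_only` is a hypothetical name, and the input is an illustrative demangled signature rather than captured output.

```julia
# Hypothetical helper: keep only the function name of a demangled kernel signature
# by dropping everything from the first opening parenthesis onward.
function function_name_only(demangled::AbstractString)
    paren = findfirst(==('('), demangled)
    head = paren === nothing ? demangled : demangled[1:prevind(demangled, paren)]
    return String(strip(head))
end

# Illustrative input, assumed to resemble demumble output for a Julia kernel.
println(function_name_only("vadd_kernel(CuDeviceArray<Float32, 1, 1>, CuDeviceArray<Float32, 1, 1>)"))
```

Names without a parameter list pass through unchanged, so the helper is safe to apply to symbols that were never mangled.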

- Test that metrics requiring multiple passes (7 metrics = 4 passes on GH100) produce correct results. KernelReplay mode handles multi-pass internally so f() is only called once.
- Test that kernel names are captured via the callback API and contain the expected function name.

32/32 tests pass on GH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Core Profiler Host/Range Profiler/PM Sampling APIs: require CUDA >= 12.6
- cuptiProfilerHostGetSinglePassSets: require CUDA >= 13.2
- Tests skip entirely on CUDA < 12.6

Fixes CI failure on CUDA 13.0 where cuptiProfilerHostGetSinglePassSets was not yet available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
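The two version gates can be expressed as plain `VersionNumber` comparisons. The constant and function names below are hypothetical, and how the PR actually obtains the CUDA version is not shown in the commit message; this sketch takes it as an argument.

```julia
# Thresholds taken from the commit message; names are illustrative.
const PROFILER_MIN_CUDA = v"12.6"
const SINGLE_PASS_SETS_MIN_CUDA = v"13.2"

profiler_apis_available(cuda::VersionNumber) = cuda >= PROFILER_MIN_CUDA
single_pass_sets_available(cuda::VersionNumber) = cuda >= SINGLE_PASS_SETS_MIN_CUDA

for cuda in (v"12.4", v"13.0", v"13.2")
    println(cuda, ": profiler=", profiler_apis_available(cuda),
            ", single_pass_sets=", single_pass_sets_available(cuda))
end
```

CUDA 13.0 passes the first gate but not the second, which matches the CI failure described above.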

CUPTI profiling is not supported on MIG partitions, vGPU, WSL, confidential compute, or CMP devices. The CI runs on an A100 MIG partition, which caused all ProfilerHostContext tests to fail with CUPTI_ERROR_INVALID_PARAMETER. Add profiler_device_supported(), which calls cuptiProfilerDeviceSupported to check at runtime, and skip tests when profiling is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vchuravy · 2026-03-24T08:46:27Z

x-ref #1826 and #1823

On MIG devices, even cuptiProfilerDeviceSupported throws CUPTI_ERROR_INVALID_PARAMETER. Catch the error and return false instead of propagating the exception.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
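The catch-and-return-false pattern can be sketched without a GPU by using stand-ins. Everything here is hypothetical: `StubCUPTIError` and `stub_device_supported` merely imitate a query that throws on MIG, and whether the PR filters on the error type is not stated.

```julia
# Stand-in for a CUPTI error; not the real CUDA.jl error type.
struct StubCUPTIError <: Exception
    code::Symbol
end

# Stand-in for the raw support query, which throws on MIG in this sketch.
stub_device_supported(on_mig::Bool) =
    on_mig ? throw(StubCUPTIError(:CUPTI_ERROR_INVALID_PARAMETER)) : true

# Treat a CUPTI-style failure as "not supported" instead of propagating it.
function supported_or_false(query)
    try
        return query()
    catch err
        # Design choice of this sketch: unrelated errors are still rethrown.
        err isa StubCUPTIError || rethrow()
        return false
    end
end

println(supported_or_false(() -> stub_device_supported(true)))
println(supported_or_false(() -> stub_device_supported(false)))
```

A test suite can then branch on the boolean result and skip the profiler tests rather than fail during setup.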

- Extract _with_profiler_host(): shared setup for both APIs (CUPTI init, ProfilerHostContext creation, metric config, config image)
- Extract _get_counter_availability(): PM sampling counter availability query
- Extract demangle_names!() and short_kernel_name(): shared with profile_internally
- Remove dead multi-pass loop from range_profile (KernelReplay handles it)
- Remove replay_mode parameter (always KernelReplay)
- Add detailed comments explaining the range profiling flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vchuravy · 2026-03-24T12:56:25Z

@gbaraldi NVPerf(Works?) ought to help with the translation from raw metrics to more humane ones.

METRICS: Dict mapping human-readable names to bare CUPTI metric strings that work across Turing, Ampere, Ada, Hopper, and Blackwell (including GB202 consumer chips).

METRIC_ALIASES: Preset groups (:memory, :compute, :overview, :tensor) for common profiling scenarios.

Uses fbpa__dram_read/write_bytes for DRAM read/write since dram__bytes_read/write don't exist on GB202 (renamed to dram__bytes_op_read/write). The fbpa__ prefix is stable across all architectures.

Verified on TU102, GA100, GA102, AD102, GH100, GB100, GB202.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
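A minimal sketch of how such a name table and preset groups could be resolved to raw metric strings follows. The table contents, the human-readable keys, the preset membership, and the `resolve_metrics` function are all assumptions for illustration; only the `fbpa__dram_read_bytes`/`fbpa__dram_write_bytes` and `sm__cycles_active` metric names come from this PR, and the real METRICS and METRIC_ALIASES may be shaped differently.

```julia
# Hypothetical stand-in for METRICS: readable name => bare CUPTI metric string.
const DEMO_METRICS = Dict(
    "dram_read_bytes"  => "fbpa__dram_read_bytes",
    "dram_write_bytes" => "fbpa__dram_write_bytes",
    "sm_cycles_active" => "sm__cycles_active",
)

# Hypothetical stand-in for METRIC_ALIASES: preset => readable names.
const DEMO_PRESETS = Dict(
    :memory   => ["dram_read_bytes", "dram_write_bytes"],
    :overview => ["sm_cycles_active", "dram_read_bytes", "dram_write_bytes"],
)

# Names missing from the table are passed through as raw metric strings.
resolve_metrics(names::AbstractVector{<:AbstractString}) =
    [get(DEMO_METRICS, name, name) for name in names]
resolve_metrics(preset::Symbol) = resolve_metrics(DEMO_PRESETS[preset])

println(resolve_metrics(:memory))
println(resolve_metrics(["sm_cycles_active", "dram__throughput"]))
```

Passing unknown names through unchanged lets callers mix readable aliases with raw CUPTI metric names in one list.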

- Remove dead multi-pass loop (KernelReplay handles it internally)
- Extract _with_profiler_host() shared between range profiler and PM sampling
- Extract _get_counter_availability(), demangle_names!(), short_kernel_name()
- Add METRICS dict: human-readable aliases for bare CUPTI metric names verified across TU102, GA100, GA102, AD102, GH100, GB100, GB202
- Add METRIC_ALIASES presets: :overview, :memory, :compute, :tensor
- Use fbpa__dram_read/write_bytes (stable across all architectures including GB202 where dram__bytes_read was renamed)
- Fix short metric name collisions by keeping unit prefix
- Add detailed comments explaining range profiling flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov · 2026-03-24T14:29:26Z

Codecov Report

❌ Patch coverage is 6.79887% with 329 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.40%. Comparing base (9f56ee2) to head (eb37566).

Files with missing lines	Patch %	Lines
lib/cupti/wrappers.jl	3.58%	269 Missing ⚠️
src/profile.jl	18.91%	60 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3059      +/-   ##
==========================================
- Coverage   76.94%   74.40%   -2.55%     
==========================================
  Files         148      148              
  Lines       12984    13329     +345     
==========================================
- Hits         9991     9917      -74     
- Misses       2993     3412     +419


gbaraldi and others added 8 commits March 23, 2026 19:40

gbaraldi and others added 2 commits March 24, 2026 12:24

gbaraldi and others added 2 commits March 24, 2026 13:23

maleadt force-pushed the master branch from f1e7455 to 5a6f767 Compare March 26, 2026 08:13

gbaraldi wants to merge 12 commits into JuliaGPU:master from gbaraldi:cupti-profiler-host
