CUPTI Profiler Host, Range Profiler, and PM Sampling APIs #3059
Draft
gbaraldi wants to merge 12 commits into JuliaGPU:master from
Phase 1: Code generation
- Add cupti_profiler_host.h, cupti_range_profiler.h, cupti_pmsampling.h to the Clang.jl wrapper generation
- Mark all cuptiProfilerHost* functions as needs_context=false
- Regenerate libcupti.jl with 37 new API functions

Phase 2: High-level wrappers for metric enumeration
- ProfilerHostContext: manages CUPTI profiler host object lifecycle
- supported_chips(): list all supported GPU chip names
- base_metrics()/sub_metrics()/metric_properties(): enumerate metrics
- single_pass_sets(): list single-pass metric set names
- list_metrics(): high-level metric listing with descriptions
- metric_info(): detailed metric information with sub-metrics
- check_profiling_permissions(): warn about NVreg_RestrictProfilingToAdminUsers
- chip_name(): auto-detect chip name from CUDA device

Addresses JuliaGPU#2694.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 3: Range profiler
- range_profile(f, metrics): profile hardware counters per-kernel
- Multi-pass loop for metrics requiring multiple passes
- Returns RangeProfileResult with range names and metric values

Phase 4: PM sampling
- pm_sample(f, metrics): periodic hardware counter sampling
- Configurable sampling interval, trigger mode, buffer size
- Counter availability image handling for forward compatibility
- Returns PmSamplingResult with timestamped samples

Shared infrastructure:
- _profiler_initialize(): global CUPTI profiler init
- config_add_metrics!/get_config_image/get_num_passes: config helpers
- evaluate_metrics(): evaluate counter data to metric values
- profiler_lock: prevents concurrent profiling sessions

Tested on GH100: range profiling returns real SM/DRAM metrics, PM sampling collects 1024 timestamped samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests cover:
- supported_chips(): chip enumeration
- chip_name(): auto-detection from CUDA device
- single_pass_sets(): single-pass metric set listing
- ProfilerHostContext: lifecycle, base_metrics, sub_metrics, metric_properties
- list_metrics(): high-level metric listing
- check_profiling_permissions(): NVreg_RestrictProfilingToAdminUsers check
- range_profile(): hardware counter collection per-kernel
- pm_sample(): periodic counter sampling

All 25 tests pass on GH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA.@profile counters=["metric1", "metric2"] begin ... end

Uses the CUPTI Range Profiler API to collect per-kernel hardware counter values and pretty-prints a table with the results. Integrates with the existing @profile macro rather than adding separate macros: the counters keyword triggers counter profiling mode, while the default behavior is unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the CUPTI callback API (which coexists with the range profiler) to capture kernel symbol names during the first profiling pass. Names are demangled via demumble and stripped to function name only.

The @profile counters=... output now shows actual kernel names:

┌─────────────┬───────────────────┬─────────────────────┐
│ Kernel      │ cycles_active.avg │ throughput.pct_peak │
├─────────────┼───────────────────┼─────────────────────┤
│ vadd_kernel │           11478.3 │              40.234 │
│ vmul_kernel │            9591.1 │              43.726 │
└─────────────┴───────────────────┴─────────────────────┘

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
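The stripping step described in the commit above can be illustrated with a small pure-Julia sketch. It assumes the demangled name has the usual `name<template args>(argument list)` shape that demumble produces. `strip_to_function_name` is a hypothetical helper written for illustration; it is not the PR's `short_kernel_name`, whose implementation is not shown here.

```julia
# Hypothetical helper (not from the PR): reduce a demangled kernel symbol
# to its bare function name.
function strip_to_function_name(demangled::AbstractString)
    # Drop the argument list, i.e. everything from the first '(' onwards.
    name = first(split(demangled, '('; limit=2))
    # Drop template parameters, i.e. everything from the first '<' onwards.
    name = first(split(name, '<'; limit=2))
    # Drop a leading return type such as "void ", if present.
    name = last(split(strip(name), ' '))
    # Drop namespace qualifiers such as "ns::".
    return String(last(split(name, "::")))
end

println(strip_to_function_name("vadd_kernel(CuDeviceArray<Float32, 1, 1>)"))
println(strip_to_function_name("void ns::vmul_kernel<float>(float*, int)"))
```

A name that is already bare passes through unchanged, so the helper is safe to apply to every captured symbol.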
- Test that metrics requiring multiple passes (7 metrics = 4 passes on GH100) produce correct results. KernelReplay mode handles multi-pass internally so f() is only called once.
- Test that kernel names are captured via the callback API and contain the expected function name.

32/32 tests pass on GH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Core Profiler Host/Range Profiler/PM Sampling APIs: require CUDA >= 12.6
- cuptiProfilerHostGetSinglePassSets: require CUDA >= 13.2
- Tests skip entirely on CUDA < 12.6

Fixes CI failure on CUDA 13.0 where cuptiProfilerHostGetSinglePassSets was not yet available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
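The version gates listed above can be expressed with plain `VersionNumber` comparisons. The following is only a sketch of that logic: the two thresholds come from the commit message, while the constant and function names are hypothetical, and the way the PR obtains the CUDA version at runtime is not shown.

```julia
# Thresholds taken from the commit message; all names here are hypothetical.
const MIN_PROFILER_APIS = v"12.6"      # Profiler Host, Range Profiler, PM Sampling
const MIN_SINGLE_PASS_SETS = v"13.2"   # cuptiProfilerHostGetSinglePassSets

# True when the core profiling APIs are expected to exist for this CUDA version.
profiler_apis_available(cuda::VersionNumber) = cuda >= MIN_PROFILER_APIS

# True when single-pass set enumeration is expected to exist.
single_pass_sets_available(cuda::VersionNumber) = cuda >= MIN_SINGLE_PASS_SETS

for cuda in (v"12.4", v"12.6", v"13.0", v"13.2")
    println(cuda, ": core APIs = ", profiler_apis_available(cuda),
            ", single-pass sets = ", single_pass_sets_available(cuda))
end
```

CUDA 13.0 is the case behind the CI failure: the core APIs are present, but single-pass set enumeration is not, so the two checks have to be separate.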
CUPTI profiling is not supported on MIG partitions, vGPU, WSL, confidential compute, or CMP devices. The CI runs on A100 MIG, which caused all ProfilerHostContext tests to fail with CUPTI_ERROR_INVALID_PARAMETER.

Add profiler_device_supported(), which calls cuptiProfilerDeviceSupported to check at runtime, and skip tests when profiling is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On MIG devices, even cuptiProfilerDeviceSupported throws CUPTI_ERROR_INVALID_PARAMETER. Catch the error and return false instead of propagating the exception.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
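The "catch and return false" behaviour described above follows a common pattern, sketched here in pure Julia. `safe_device_supported` and the `query` argument are hypothetical stand-ins: in the PR the query is the call to cuptiProfilerDeviceSupported, and the real code presumably catches the specific CUPTI error rather than every exception as this sketch does.

```julia
# Hypothetical wrapper: treat a throwing support query as "not supported".
# `query` stands in for the real CUPTI call and must return a Bool.
function safe_device_supported(query::Function)
    try
        return query()::Bool
    catch
        # Assumption for this sketch: any failure means profiling is unavailable.
        return false
    end
end

# Stand-ins for a normal device and for a MIG partition where the query throws.
normal_device() = true
mig_device() = error("CUPTI_ERROR_INVALID_PARAMETER")

println(safe_device_supported(normal_device))
println(safe_device_supported(mig_device))
```

With a wrapper like this, a test suite can skip the profiling tests instead of failing when the support query itself errors.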
- Extract _with_profiler_host(): shared setup for both APIs (CUPTI init, ProfilerHostContext creation, metric config, config image)
- Extract _get_counter_availability(): PM sampling counter availability query
- Extract demangle_names!() and short_kernel_name(): shared with profile_internally
- Remove dead multi-pass loop from range_profile (KernelReplay handles it)
- Remove replay_mode parameter (always KernelReplay)
- Add detailed comments explaining the range profiling flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gbaraldi NVPerf(Works?) ought to help with the translation from raw metrics to more humane ones.
METRICS: Dict mapping human-readable names to bare CUPTI metric strings that work across Turing, Ampere, Ada, Hopper, and Blackwell (including GB202 consumer chips).

METRIC_ALIASES: Preset groups (:memory, :compute, :overview, :tensor) for common profiling scenarios.

Uses fbpa__dram_read/write_bytes for DRAM read/write since dram__bytes_read/write don't exist on GB202 (renamed to dram__bytes_op_read/write). The fbpa__ prefix is stable across all architectures.

Verified on TU102, GA100, GA102, AD102, GH100, GB100, GB202.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
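A minimal sketch of how such a name table and preset groups could resolve to raw metric strings. Only the two fbpa__ metric strings and the :memory group name come from the commit message. The human-readable keys, the `EXAMPLE_` constants, and the `resolve_metric`/`resolve_metrics` functions are hypothetical, and the PR's actual METRICS and METRIC_ALIASES contents may differ.

```julia
# Hypothetical miniature of the METRICS table: readable name => bare CUPTI metric.
const EXAMPLE_METRICS = Dict(
    "dram_read_bytes"  => "fbpa__dram_read_bytes",
    "dram_write_bytes" => "fbpa__dram_write_bytes",
)

# Hypothetical miniature of METRIC_ALIASES: preset group => readable names.
const EXAMPLE_ALIASES = Dict(
    :memory => ["dram_read_bytes", "dram_write_bytes"],
)

# Map a readable name to its raw metric; unknown names are assumed to be
# raw CUPTI metric strings already and pass through unchanged.
resolve_metric(name::AbstractString) = get(EXAMPLE_METRICS, name, name)

# Expand a preset group, or resolve an explicit list of names.
resolve_metrics(group::Symbol) = [resolve_metric(n) for n in EXAMPLE_ALIASES[group]]
resolve_metrics(names::AbstractVector) = [resolve_metric(n) for n in names]

println(resolve_metrics(:memory))
println(resolve_metrics(["dram_read_bytes", "sm__cycles_active.avg"]))
```

Letting unknown names pass through means callers can mix readable names with raw metric strings in a single list.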
- Remove dead multi-pass loop (KernelReplay handles it internally)
- Extract _with_profiler_host() shared between range profiler and PM sampling
- Extract _get_counter_availability(), demangle_names!(), short_kernel_name()
- Add METRICS dict: human-readable aliases for bare CUPTI metric names verified across TU102, GA100, GA102, AD102, GH100, GB100, GB202
- Add METRIC_ALIASES presets: :overview, :memory, :compute, :tensor
- Use fbpa__dram_read/write_bytes (stable across all architectures including GB202 where dram__bytes_read was renamed)
- Fix short metric name collisions by keeping unit prefix
- Add detailed comments explaining range profiling flow

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #3059 +/- ##
==========================================
- Coverage 76.94% 74.40% -2.55%
==========================================
Files 148 148
Lines 12984 13329 +345
==========================================
- Hits 9991 9917 -74
- Misses 2993 3412 +419
Summary
Adds support for the modern CUPTI profiling APIs: Profiler Host (metric enumeration), Range Profiler (per-kernel hardware counters), and PM Sampling (periodic counter collection).
Addresses #2694.
Changes
Phase 1: Low-level bindings
- Add cupti_profiler_host.h, cupti_range_profiler.h, cupti_pmsampling.h to code generation

Phase 2: Metric enumeration
- CUPTI.supported_chips(): list supported GPU chip names
- CUPTI.list_metrics(): enumerate available PM sampling metrics with descriptions
- CUPTI.metric_info("metric_name"): detailed metric information with sub-metrics
- CUPTI.check_profiling_permissions(): check NVreg_RestrictProfilingToAdminUsers

Phase 3: Range profiler
- CUPTI.range_profile(f, metrics): per-kernel hardware counter collection

Phase 4: PM sampling
- CUPTI.pm_sample(f, metrics): periodic hardware counter sampling

Phase 5: @profile integration
- CUDA.@profile counters=["metric1", "metric2"] code: hardware counter profiling via the existing macro

Example
Test plan
🤖 Generated with Claude Code