Fast static PTX analysis for CUDA teams that want actionable performance feedback before runtime profiling.
cuda-sage is intentionally focused. It parses PTX text, estimates occupancy, flags likely warp divergence sites, identifies memory-risk patterns, and compares PTX revisions for regression direction. The tool is designed to run in local development, code review, and CI environments where speed and repeatability matter.
Important
cuda-sage is a static analysis tool. It should guide optimization choices early, but it does not replace runtime validation with Nsight tools on target hardware.
- What It Is, What It Does, And Why It Is Needed
- Why This Project Exists
- What You Get
- Architecture At A Glance
- Quick Start
- Desktop GUI (PyQt6)
- CLI Reference
- JSON Output For Automation
- How To Read Results
- Architecture Support
- Python API Usage
- Project Structure
- Development Workflow
- CI Integration Example
- Troubleshooting
- Limitations
- Documentation
- License
cuda-sage is a static performance-analysis assistant for CUDA teams. It is not a profiler, compiler replacement, or runtime benchmarking framework. Instead, it operates one step earlier in the workflow and turns PTX into fast optimization guidance that can be reviewed during development and in pull requests.
It does three things very consistently: it parses PTX kernel metadata, computes architecture-aware occupancy estimates, and highlights control-flow and memory risks that commonly lead to performance loss. Because this process is static, the same analysis can run locally and in CI without requiring access to a GPU node.
This matters most when teams need rapid iteration loops. Performance regressions are often discovered late, after integration and benchmark setup. cuda-sage closes that gap by giving reviewers and kernel authors a common signal set before runtime tuning starts, which reduces churn and makes optimization discussions concrete.
Important
cuda-sage is designed to reduce late surprises, not replace final runtime validation. Use it to prioritize work early, then confirm impact with Nsight tools on target hardware.
Many CUDA performance issues are discovered too late. In common workflows, developers only see occupancy bottlenecks, divergence-heavy branches, or spill pressure after kernels are already integrated and benchmarked on specific machines. That creates long feedback loops and makes it harder to isolate regressions in pull requests.
cuda-sage shifts that feedback to an earlier stage by analyzing PTX directly. Because the analysis is static, teams can evaluate likely performance risks on any machine, including CI runners that have no attached GPU. This enables a more consistent review process and helps catch expensive mistakes before runtime optimization begins.
Static PTX analysis is also useful for collaboration. Kernel authors, reviewers, and platform engineers can discuss concrete signals, such as register pressure, divergence predicates, and memory warnings, from a shared report format rather than relying only on local benchmark anecdotes.
Note
PTX is an intermediate representation and not final machine code. Results are best used as early guidance and should be paired with targeted runtime profiling for production sign-off.
| Capability | What It Detects | Why It Matters | Typical Follow-Up |
|---|---|---|---|
| PTX parser | Kernel boundaries, register declarations, instruction categories, shared memory usage | Builds a consistent analysis model without executing kernels | Use parsed metadata for analysis and CI reports |
| Occupancy analyzer | Thread, register, shared-memory, and hardware block limits | Low occupancy can reduce latency hiding and throughput | Reduce register pressure, adjust launch size, trim shared memory |
| Divergence analyzer | Thread-varying branch predicates and high-risk split patterns | Divergent branches serialize warp execution | Refactor branch conditions, prefer predication where practical |
| Memory analyzer | Spill activity, bank-conflict risk hints, missing sync patterns, intensity proxy | Memory pressure often dominates end-to-end performance | Rework data layout, reduce spills, add synchronization where needed |
| PTX diff mode | Delta in occupancy, registers, spills, and divergence sites | Detect directional regressions earlier in review | Gate merges or trigger optimization follow-up |
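As a rough illustration of the memory analyzer row above, the following sketch counts local-memory loads and stores as a spill proxy and flags shared-memory stores that appear without any barrier. It is not cuda-sage's implementation: the function name `memory_signals` and the matching rules are assumptions made for this example, and the scan deliberately ignores cases such as `st.volatile.shared` or local arrays that are not spills.

```python
import re


def memory_signals(ptx: str) -> dict:
    """Scan PTX text for two simple memory-risk signals (illustrative only)."""
    spill_ops = 0
    saw_shared_store = False
    saw_barrier = False
    for raw in ptx.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("//")[0].strip()
        # Drop an optional guard predicate such as "@%p1" or "@!%p1".
        line = re.sub(r"^@!?%\w+\s+", "", line)
        if line.startswith(("ld.local", "st.local")):
            # Local-space traffic is treated as a spill proxy here.
            spill_ops += 1
        elif line.startswith("st.shared"):
            saw_shared_store = True
        elif line.startswith(("bar.sync", "barrier.sync")):
            saw_barrier = True
    return {
        "spill_ops": spill_ops,
        "possible_missing_sync": saw_shared_store and not saw_barrier,
    }


SAMPLE_PTX = """\
st.local.u32 [%rd1], %r4;   // spill store
ld.local.u32 %r5, [%rd1];   // spill reload
st.shared.f32 [%r6], %f1;
"""

print(memory_signals(SAMPLE_PTX))
```

A text-level scan like this is cheap enough to run on every commit, which is the trade-off the table describes: fast directional signals rather than measured memory behavior.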
- Keep the scope narrow enough to remain reliable and maintainable.
- Prefer fast feedback over expensive infrastructure requirements.
- Produce outputs that are useful to humans and machines.
Tip
Teams usually get the most value when they run analyze and diff on every PR that changes kernel code or build output.
flowchart LR
A[CUDA source .cu] --> B[nvcc -ptx]
B --> C[PTX file]
C --> D[PTXParser]
D --> E[OccupancyAnalyzer]
D --> F[DivergenceAnalyzer]
D --> G[MemoryAnalyzer]
E --> H[Reporter]
F --> H
G --> H
H --> I[Rich terminal output]
H --> J[JSON output]
C --> K[diff baseline vs optimized]
K --> L[Regression verdict]
flowchart TD
A[Kernel change lands] --> B[Run analyze]
B --> C{High severity findings?}
C -->|Yes| D[Fix source and rebuild PTX]
D --> B
C -->|No| E[Run diff against baseline PTX]
E --> F{Regression?}
F -->|Yes| G[Investigate and optimize]
F -->|No| H[Proceed to runtime profiling]
git clone https://github.com/hkevin01/cuda-sage
cd cuda-sage
python -m venv .venv
source .venv/bin/activate
pip install -e .
nvcc -ptx -arch=sm_86 mykernel.cu -o mykernel.ptx
cuda-sage analyze mykernel.ptx --arch sm_86 --threads 256 --curve
This command runs occupancy, divergence, and memory analysis on each kernel entry in the PTX file. The report includes metrics, limiting factors, and recommendations that can be reviewed in local shells or attached to CI artifacts.
Tip
If your team targets multiple GPU generations, run separate reports per architecture to avoid false confidence from a single target assumption.
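One way to follow the per-architecture advice is to generate one `analyze` invocation per target. The sketch below only builds the argument lists, using flags documented in this README (`--arch`, `--format`, `--output`). The helper name `build_analyze_commands` and the `reports/` output layout are assumptions made for this example.

```python
from pathlib import Path


def build_analyze_commands(ptx_path: str, archs: list[str], out_dir: str = "reports") -> list[list[str]]:
    """Return one cuda-sage analyze command (as an argv list) per architecture."""
    stem = Path(ptx_path).stem
    commands = []
    for arch in archs:
        # Hypothetical naming scheme: <kernel file stem>.<arch>.json
        report = Path(out_dir) / f"{stem}.{arch}.json"
        commands.append(
            ["cuda-sage", "analyze", ptx_path, "--arch", arch,
             "--format", "json", "--output", str(report)]
        )
    return commands


for cmd in build_analyze_commands("mykernel.ptx", ["sm_80", "sm_86", "sm_90"]):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in an environment where cuda-sage is installed.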
If you prefer a native desktop app over command-line usage, cuda-sage now includes a PyQt6 GUI.
The GUI is intentionally built on the same parser and analyzer pipeline as the CLI, so it is a different interface, not a different engine. The current desktop release adds a material-style presentation, theme switching, summary cards, and a dedicated analysis workspace, so it works as a standalone application and not as a thin wrapper around command-line flags.
This matters because CUDA review work is often collaborative. A polished desktop surface makes it easier to walk through PTX findings in pair sessions, performance reviews, demos, and onboarding without forcing every participant to stay in a terminal-centric workflow.
- It lowers the friction for teams that do not want to memorize CLI flags.
- It makes report interpretation faster with a card-based summary, tabbed results, and clearer visual separation between inputs and findings.
- It helps cross-functional reviews where not every participant is comfortable with terminal tooling.
- It preserves reproducibility by exporting the same machine-consumable JSON schema used in automation.
- It supports runtime theme switching, which helps the tool feel native in both dark and light desktop environments.
pip install -e ".[gui]"
cuda-sage-gui
| Section | Capability |
|---|---|
| Hero header | Project overview, runtime theme switcher, and a cleaner desktop entry point |
| Analyze panel | Pick a PTX file, choose architecture and threads/block, optional kernel filter, optional occupancy curve |
| Summary tab | KPI cards for kernel count, occupancy, divergence, and spills plus rich narrative recommendations |
| Metrics tab | Per-kernel occupancy, bottleneck, register count, spill count, divergence count, sync-risk flag |
| JSON tab | Full JSON report preview matching CLI schema |
| Action Plan tab | Plain-language explanation of each metric, where to tune next, and concrete code modifications suggested by the analysis |
| Diff tab | Baseline vs optimized PTX comparison with per-kernel deltas and verdict rendered in a more readable report layout |
| Export | Save current analysis JSON report directly from the desktop app |
Note
The desktop app uses the same parser and analyzer pipeline as CLI mode, so results remain consistent across interfaces even though the presentation is now significantly more polished.
Note
All controls use QFormLayout so labels align right and every input field - including architecture and thread-count dropdowns - stretches to fill the available width. Keyboard shortcuts (Alt+R, Alt+S, Alt+B, Alt+D), tooltips, and accessible names are applied to every widget.
KPI cards at the top surface kernels analyzed, average occupancy, divergence sites, and spill count at a glance. The panel below explains what those numbers mean and lists the top recommended modifications.
The metrics tab shows per-kernel register count, spills, divergence sites, limiter, and occupancy in a sortable table. Column headers carry tooltips that explain each metric.
The JSON tab exposes the full structured report that CI automation can consume, keeping desktop review and pipeline behavior in sync.
The action-plan tab explains what each value means, identifies the primary bottleneck, lists concrete tuning levers, and suggests specific code modifications to try next.
The diff tab compares two PTX files and reports occupancy delta, register delta, spill delta, and divergence delta per kernel with a clear IMPROVED / REGRESSION / NEUTRAL verdict.
| Command | Purpose | Common Options | Example |
|---|---|---|---|
| `analyze` | Analyze kernels in a PTX file | `--arch`, `--threads`, `--curve`, `--kernel`, `--format`, `--output` | `cuda-sage analyze kernel.ptx --arch sm_80 --curve` |
| `diff` | Compare baseline vs optimized PTX | `--arch`, `--threads` | `cuda-sage diff base.ptx opt.ptx --arch sm_80` |
| `list-archs` | Print supported architecture table | none | `cuda-sage list-archs` |
| `--version` | Print version and exit | `-V` | `cuda-sage --version` |
# Text report
cuda-sage analyze kernel.ptx --arch sm_80
# Kernel filter
cuda-sage analyze all_kernels.ptx --kernel matmul --arch sm_90
# JSON output to file
cuda-sage analyze kernel.ptx --arch sm_80 --format json --output report.json
cuda-sage diff baseline.ptx optimized.ptx --arch sm_80 --threads 256
Diff compares kernels by name and reports metric deltas with a simple verdict model. This helps reviewers see whether a change appears to improve, regress, or remain neutral before deeper benchmarking.
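The exact verdict rule is not documented in this README, so the following is only a sketch of how a simple verdict model over metric deltas could work. The `KernelMetrics` dataclass, the scoring rule, and the `eps` tolerance are all assumptions for illustration. Register count is carried along but not scored here, on the assumption that its effect already shows up in occupancy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelMetrics:
    occupancy: float        # estimated occupancy in [0.0, 1.0]
    registers: int          # registers per thread (reported, not scored)
    spill_ops: int          # local load/store spill operations
    divergence_sites: int   # detected divergence sites


def verdict(base: KernelMetrics, cand: KernelMetrics, eps: float = 0.01) -> str:
    """Hypothetical rule: sum the direction of each delta into a score."""
    score = 0
    # Higher occupancy is better; ignore changes smaller than eps.
    d_occ = cand.occupancy - base.occupancy
    if d_occ > eps:
        score += 1
    elif d_occ < -eps:
        score -= 1
    # Fewer spills and fewer divergence sites are better.
    for before, after in ((base.spill_ops, cand.spill_ops),
                          (base.divergence_sites, cand.divergence_sites)):
        if after < before:
            score += 1
        elif after > before:
            score -= 1
    if score > 0:
        return "IMPROVED"
    if score < 0:
        return "REGRESSION"
    return "NEUTRAL"


baseline = KernelMetrics(occupancy=0.50, registers=64, spill_ops=4, divergence_sites=2)
optimized = KernelMetrics(occupancy=0.75, registers=48, spill_ops=0, divergence_sites=2)
print(verdict(baseline, optimized))
```

A direction-only score keeps the verdict easy to explain in review, at the cost of ignoring the magnitude of each change.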
JSON mode allows policy checks and dashboards without parsing terminal output.
cuda-sage analyze kernel.ptx --arch sm_80 --format json --output report.json
Typical machine-consumed fields include:
| JSON Path | Meaning |
|---|---|
| `occupancy.value` | Occupancy in range [0.0, 1.0] |
| `occupancy.limiting_factor` | Dominant occupancy bottleneck |
| `divergence.site_count` | Number of detected divergence sites |
| `divergence.high_severity_count` | Count of high-risk divergence sites |
| `memory.spill_ops` | Total local load/store spill operations |
| `memory.possible_missing_sync` | Shared write without observed barrier hint |
Important
JSON values are analysis signals, not hardware measurements. Keep thresholds practical and validate strict failures with runtime profiling.
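A policy check over these fields might look like the sketch below. It assumes the JSON paths in the table map to nested keys inside one per-kernel object. The top-level layout of `report.json` is not documented here, so the example works on a hand-written dictionary. The function name `check_kernel` and the default thresholds are hypothetical.

```python
def check_kernel(kernel: dict, min_occupancy: float = 0.25, max_spill_ops: int = 0) -> list[str]:
    """Return human-readable policy failures for one kernel's report object."""
    failures = []
    occ = kernel["occupancy"]
    if occ["value"] < min_occupancy:
        failures.append(
            f"occupancy {occ['value']:.2f} is below {min_occupancy:.2f} "
            f"(limited by {occ['limiting_factor']})"
        )
    if kernel["divergence"]["high_severity_count"] > 0:
        failures.append("high-severity divergence sites present")
    if kernel["memory"]["spill_ops"] > max_spill_ops:
        failures.append(f"{kernel['memory']['spill_ops']} spill operations")
    if kernel["memory"]["possible_missing_sync"]:
        failures.append("shared write without an observed barrier")
    return failures


# Hand-written sample shaped after the JSON paths listed above (assumed nesting).
sample_kernel = {
    "occupancy": {"value": 0.19, "limiting_factor": "registers"},
    "divergence": {"site_count": 3, "high_severity_count": 0},
    "memory": {"spill_ops": 2, "possible_missing_sync": False},
}

for failure in check_kernel(sample_kernel):
    print("POLICY:", failure)
```

Returning a list of messages, not raising on the first problem, lets a CI step print every finding before deciding whether to fail the job.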
- Check occupancy and identify the limiting factor.
- Inspect divergence findings, especially high-severity sites.
- Review memory warnings for spills, sync risk, and conflict hints.
- Compare baseline and candidate PTX with `diff` to quantify direction.
This sequence reduces rework by solving broad resource constraints before fine-grained micro-optimizations.
| Severity | Interpretation | Typical Priority |
|---|---|---|
| High | Strong signal of likely performance impact | Address before merge for critical kernels |
| Medium | Context-dependent risk worth investigation | Address in current optimization cycle |
| Low | Informational guidance | Track and batch with related work |
- Keep baseline PTX snapshots for critical kernels so `diff` remains meaningful.
- Align `--threads` with production launch assumptions when possible.
- Avoid overreacting to single low-severity signals; look for repeated patterns across kernels.
- Re-run analysis after each optimization pass to validate direction.
Note
Heuristic warnings are intended to be conservative. They are most useful when treated as prioritization hints rather than absolute truth.
| SM Target | Representative GPU | Max Warps/SM | Max Threads/SM | Regs/SM | Shared Mem/SM |
|---|---|---|---|---|---|
| `sm_70` | Volta V100 | 64 | 2048 | 65536 | 96 KB |
| `sm_75` | Turing T4 / RTX 2080 | 32 | 1024 | 65536 | 64 KB |
| `sm_80` | Ampere A100 | 64 | 2048 | 65536 | 164 KB |
| `sm_86` | Ampere RTX 3080/3090 | 48 | 1536 | 65536 | 100 KB |
| `sm_89` | Ada RTX 4090 class | 48 | 1536 | 65536 | 100 KB |
| `sm_90` | Hopper H100 | 64 | 2048 | 65536 | 228 KB |
Use cuda-sage list-archs for runtime display of supported architecture models.
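To show how the limits in the table turn into an occupancy estimate, here is a simplified sketch using the `sm_80` row. It is not the project's `OccupancyAnalyzer`: the `ArchLimits` dataclass and `estimate_occupancy` function are hypothetical, the 32-blocks-per-SM cap is an assumption not listed in the table, and register allocation granularity is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchLimits:
    max_warps_per_sm: int
    regs_per_sm: int
    smem_per_sm: int          # bytes
    max_blocks_per_sm: int    # assumption: not part of the table above
    warp_size: int = 32


# Values from the sm_80 row; the block cap of 32 is an assumed figure.
SM_80 = ArchLimits(max_warps_per_sm=64, regs_per_sm=65536,
                   smem_per_sm=164 * 1024, max_blocks_per_sm=32)


def estimate_occupancy(arch: ArchLimits, threads_per_block: int,
                       regs_per_thread: int, smem_per_block: int) -> tuple[float, str]:
    """Return (occupancy, limiting_factor) from per-resource block limits."""
    warps_per_block = -(-threads_per_block // arch.warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * arch.warp_size
    # Resident blocks per SM allowed by each resource.
    limits = {
        "warps": arch.max_warps_per_sm // warps_per_block,
        "registers": arch.regs_per_sm // regs_per_block if regs_per_block else arch.max_blocks_per_sm,
        "shared_memory": arch.smem_per_sm // smem_per_block if smem_per_block else arch.max_blocks_per_sm,
        "blocks": arch.max_blocks_per_sm,
    }
    factor = min(limits, key=limits.get)  # tightest limit wins
    resident_warps = limits[factor] * warps_per_block
    return resident_warps / arch.max_warps_per_sm, factor


# 256 threads/block at 64 registers/thread: registers cap the SM at 4 blocks.
print(estimate_occupancy(SM_80, 256, 64, 0))  # (0.5, 'registers')

# A small occupancy curve over block sizes, similar in spirit to --curve.
for threads in (64, 128, 256, 512, 1024):
    print(threads, estimate_occupancy(SM_80, threads, 64, 0))
```

Taking the minimum over per-resource block limits is what makes the limiting factor explicit, which is the value the reports surface alongside the occupancy number.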
Each component is importable and composable.
from cudasage import PTXParser, OccupancyAnalyzer, DivergenceAnalyzer, MemoryAnalyzer
from cudasage import get_arch
kernels = PTXParser().parse_file("kernel.ptx")
arch = get_arch("sm_80")
occ = OccupancyAnalyzer().analyze(kernels[0], arch, threads_per_block=256)
div = DivergenceAnalyzer().analyze(kernels[0])
mem = MemoryAnalyzer().analyze(kernels[0])
print("Occupancy:", occ.occupancy, occ.limiting_factor)
print("Divergence sites:", len(div.sites))
print("Spill ops:", mem.spill_ops)
curve = OccupancyAnalyzer().occupancy_curve(kernels[0], arch)
for pt in curve:
    print(pt.threads_per_block, pt.occupancy, pt.limiting_factor)
src/cudasage/
├── __init__.py
├── cli.py
├── reporter.py
├── analyzers/
│   ├── occupancy.py
│   ├── divergence.py
│   └── memory.py
├── models/
│   └── architectures.py
└── parsers/
    └── ptx_parser.py
tests/
├── fixtures/
│   ├── vecadd.ptx
│   ├── divergent_kernel.ptx
│   ├── matmul.ptx
│   └── reduction.ptx
├── test_cli.py
├── test_parser.py
├── test_occupancy.py
├── test_divergence.py
├── test_memory.py
├── test_reporter.py
├── test_fixtures.py
└── test_public_api.py
The codebase is intentionally compact so contributors can reason about the entire system without crossing multiple product domains.
pip install -e ".[dev]"
pytest -v
- Added or updated tests for changed analysis behavior
- Validated CLI behavior for new options or error paths
- Reviewed JSON output impact for downstream automation
- Confirmed architecture assumptions in test fixtures
name: ptx-static-analysis
on:
  pull_request:
  push:
    branches: [main]
jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install
        run: |
          python -m venv .venv
          source .venv/bin/activate
          pip install -e .
      - name: Analyze fixture
        run: |
          source .venv/bin/activate
          cuda-sage analyze tests/fixtures/matmul.ptx --arch sm_80 --format json --output report.json
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: cuda-sage-report
          path: report.json
Note
The parser expects PTX .entry kernels. Verify your compile step produced a PTX file with visible entry functions.
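A quick way to verify that a PTX file contains entry functions before running the tool is a small regular-expression scan. This is an independent check written for this example, not part of cuda-sage. The helper name `list_entry_kernels` is hypothetical.

```python
import re

# Matches ".entry <name>", with or without a leading ".visible".
ENTRY_RE = re.compile(r"\.entry\s+([A-Za-z_$%][\w$]*)")


def list_entry_kernels(ptx: str) -> list[str]:
    """Return the names of all .entry functions found in PTX text."""
    return ENTRY_RE.findall(ptx)


SAMPLE_PTX = """\
.version 7.8
.target sm_80
.func helper(.param .b32 x)
{
    ret;
}
.visible .entry _Z6vecAddPKfS0_Pfi(
    .param .u64 a
)
{
    ret;
}
"""

names = list_entry_kernels(SAMPLE_PTX)
print(names if names else "no .entry kernels found; check the nvcc -ptx step")
```

If the list is empty, the file likely contains only device functions, or the compile step did not emit PTX for the expected translation unit.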
Important
Unknown architecture values fall back to the nearest supported model. Run cuda-sage list-archs to confirm available targets.
Tip
Use --output with a writable path in the runner workspace. The CLI creates parent directories when needed, but the target location still must be writable.
Note
Some patterns are flagged intentionally to reduce false negatives for high-cost branch behavior. Use branch-efficiency runtime metrics to prioritize what to fix first.
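The conservative flagging described above can be illustrated with a small taint-propagation pass over PTX text: any register derived from a thread-varying special register is marked, and a branch guarded by a marked predicate is reported. This is a sketch of the general technique, not cuda-sage's `DivergenceAnalyzer`. The function name and the single forward pass (no control-flow graph, no loops) are simplifying assumptions.

```python
import re

# Special registers that differ between threads in a warp.
THREAD_VARYING = ("%tid.", "%laneid")


def find_divergent_branches(ptx: str) -> list[tuple[int, str]]:
    """Return (line_number, branch_target) for branches on thread-varying predicates."""
    tainted: set[str] = set()
    sites = []
    for lineno, raw in enumerate(ptx.splitlines(), start=1):
        line = raw.split("//")[0].strip().rstrip(";").strip()
        # Skip blanks, labels, directives, and braces.
        if not line or line.endswith(":") or line.startswith((".", "{", "}")):
            continue
        guard = None
        match = re.match(r"@!?(%\w+)\s+(.*)", line)
        if match:
            guard, line = match.group(1), match.group(2)
        parts = line.split(None, 1)
        opcode = parts[0]
        operands = [op.strip() for op in parts[1].split(",")] if len(parts) > 1 else []
        if opcode.startswith("bra"):
            if guard in tainted:
                sites.append((lineno, operands[0] if operands else ""))
            continue
        if len(operands) >= 2:
            dest, sources = operands[0], operands[1:]
            if any(s in tainted or s.startswith(THREAD_VARYING) for s in sources):
                tainted.add(dest)      # value now depends on the thread index
            else:
                tainted.discard(dest)  # overwritten with a warp-uniform value
    return sites


SAMPLE_PTX = """\
mov.u32 %r1, %tid.x;
and.b32 %r2, %r1, 1;
setp.eq.s32 %p1, %r2, 0;
@%p1 bra LBB0_2;
mov.u32 %r3, %ctaid.x;
setp.eq.s32 %p2, %r3, 0;
@%p2 bra LBB0_3;
"""

print(find_divergent_branches(SAMPLE_PTX))  # [(4, 'LBB0_2')]
```

The branch on `%p2` is not reported because `%ctaid.x` is uniform within a block. A pass like this over-approximates by design: a predicate such as `tid.x < 1024` would still be flagged even when every thread in a warp takes the same path, which is why runtime branch-efficiency metrics remain the tie-breaker.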
- PTX-only static analysis. No SASS parsing and no direct runtime timing.
- Divergence and bank-conflict checks are heuristic by design.
- Reports describe likely risk and optimization direction, not guaranteed speedup.
This trade-off is intentional so the tool remains fast, portable, and suitable for early-stage gating.
MIT. See LICENSE.




