cuda-sage

Fast static PTX analysis for CUDA teams that want actionable performance feedback before runtime profiling.

cuda-sage is intentionally focused. It parses PTX text, estimates occupancy, flags likely warp divergence sites, identifies memory-risk patterns, and compares PTX revisions for regression direction. The tool is designed to run in local development, code review, and CI environments where speed and repeatability matter.

Important

cuda-sage is a static analysis tool. It should guide optimization choices early, but it does not replace runtime validation with Nsight tools on target hardware.

Table Of Contents

What It Is, What It Does, And Why It Is Needed
Why This Project Exists
What You Get
Architecture At A Glance
Quick Start
Desktop GUI (PyQt6)
CLI Reference
JSON Output For Automation
How To Read Results
Architecture Support
Python API Usage
Project Structure
Development Workflow
CI Integration Example
Troubleshooting
Limitations
Documentation
License

What It Is, What It Does, And Why It Is Needed

cuda-sage is a static performance-analysis assistant for CUDA teams. It is not a profiler, compiler replacement, or runtime benchmarking framework. Instead, it operates one step earlier in the workflow and turns PTX into fast optimization guidance that can be reviewed during development and in pull requests.

It does three things consistently: it parses PTX kernel metadata, computes architecture-aware occupancy estimates, and highlights control-flow and memory risks that commonly lead to performance loss. Because this process is static, the same analysis can run locally and in CI without requiring access to a GPU node.

This matters most when teams need rapid iteration loops. Performance regressions are often discovered late, after integration and benchmark setup. cuda-sage closes that gap by giving reviewers and kernel authors a common signal set before runtime tuning starts, which reduces churn and makes optimization discussions concrete.

Important

cuda-sage is designed to reduce late surprises, not replace final runtime validation. Use it to prioritize work early, then confirm impact with Nsight tools on target hardware.

Why This Project Exists

Many CUDA performance issues are discovered too late. In common workflows, developers only see occupancy bottlenecks, divergence-heavy branches, or spill pressure after kernels are already integrated and benchmarked on specific machines. That creates long feedback loops and makes it harder to isolate regressions in pull requests.

cuda-sage shifts that feedback to an earlier stage by analyzing PTX directly. Because the analysis is static, teams can evaluate likely performance risks on any machine, including CI runners that have no attached GPU. This enables a more consistent review process and helps catch expensive mistakes before runtime optimization begins.

Static PTX analysis is also useful for collaboration. Kernel authors, reviewers, and platform engineers can discuss concrete signals, such as register pressure, divergence predicates, and memory warnings, from a shared report format rather than relying only on local benchmark anecdotes.

Note

PTX is an intermediate representation and not final machine code. Results are best used as early guidance and should be paired with targeted runtime profiling for production sign-off.

What You Get

| Capability | What It Detects | Why It Matters | Typical Follow-Up |
| --- | --- | --- | --- |
| PTX parser | Kernel boundaries, register declarations, instruction categories, shared memory usage | Builds a consistent analysis model without executing kernels | Use parsed metadata for analysis and CI reports |
| Occupancy analyzer | Thread, register, shared-memory, and hardware block limits | Low occupancy can reduce latency hiding and throughput | Reduce register pressure, adjust launch size, trim shared memory |
| Divergence analyzer | Thread-varying branch predicates and high-risk split patterns | Divergent branches serialize warp execution | Refactor branch conditions, prefer predication where practical |
| Memory analyzer | Spill activity, bank-conflict risk hints, missing sync patterns, intensity proxy | Memory pressure often dominates end-to-end performance | Rework data layout, reduce spills, add synchronization where needed |
| PTX diff mode | Delta in occupancy, registers, spills, and divergence sites | Detect directional regressions earlier in review | Gate merges or trigger optimization follow-up |
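
To make the "spill activity" signal concrete, here is a minimal sketch of how local-memory traffic can be counted from PTX text. It is not cuda-sage's implementation: the function name `count_local_ops` and the sample PTX are hypothetical, and the sketch assumes one instruction per line. Note that `ld.local` / `st.local` can also come from stack arrays, so the count is a proxy for spills rather than a measurement.

```python
import re

# Matches local-space loads and stores such as "ld.local.u32" or
# "st.local.f32", with an optional guard predicate like "@%p1" in front.
LOCAL_OP = re.compile(r"^\s*(?:@!?%\w+\s+)?(ld|st)\.local\b")


def count_local_ops(ptx_text: str) -> dict[str, int]:
    """Count ld.local / st.local instructions in a PTX string (hypothetical helper)."""
    counts = {"ld": 0, "st": 0}
    for line in ptx_text.splitlines():
        code = line.split("//", 1)[0]  # drop trailing PTX line comments
        match = LOCAL_OP.match(code)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hand-written sample for illustration only, not real compiler output.
sample = """
.visible .entry demo()
{
    .local .align 4 .b8 __local_depot0[16];
    st.local.u32 [%rd1], %r1;   // store to local memory
    ld.local.u32 %r2, [%rd1];   // reload from local memory
    ld.global.f32 %f1, [%rd2];
    @%p1 st.local.u32 [%rd1+4], %r3;
}
"""
print(count_local_ops(sample))
```

The `.local` declaration and the global load are ignored; only the three local-space instructions are counted.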

Focus Principles

Keep the scope narrow enough to remain reliable and maintainable.
Prefer fast feedback over expensive infrastructure requirements.
Produce outputs that are useful to humans and machines.

Tip

Teams usually get the most value when they run analyze and diff on every PR that changes kernel code or build output.

Architecture At A Glance

flowchart LR
    A[CUDA source .cu] --> B[nvcc -ptx]
    B --> C[PTX file]
    C --> D[PTXParser]
    D --> E[OccupancyAnalyzer]
    D --> F[DivergenceAnalyzer]
    D --> G[MemoryAnalyzer]
    E --> H[Reporter]
    F --> H
    G --> H
    H --> I[Rich terminal output]
    H --> J[JSON output]
    C --> K[diff baseline vs optimized]
    K --> L[Regression verdict]

flowchart TD
    A[Kernel change lands] --> B[Run analyze]
    B --> C{High severity findings?}
    C -->|Yes| D[Fix source and rebuild PTX]
    D --> B
    C -->|No| E[Run diff against baseline PTX]
    E --> F{Regression?}
    F -->|Yes| G[Investigate and optimize]
    F -->|No| H[Proceed to runtime profiling]

Quick Start

1. Install

git clone https://github.com/hkevin01/cuda-sage
cd cuda-sage
python -m venv .venv
source .venv/bin/activate
pip install -e .

2. Generate PTX

nvcc -ptx -arch=sm_86 mykernel.cu -o mykernel.ptx

3. Analyze

cuda-sage analyze mykernel.ptx --arch sm_86 --threads 256 --curve

This command runs occupancy, divergence, and memory analysis on each kernel entry in the PTX file. The report includes metrics, limiting factors, and recommendations that can be reviewed in local shells or attached to CI artifacts.

Tip

If your team targets multiple GPU generations, run separate reports per architecture to avoid false confidence from a single target assumption.
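
One way to follow this tip is to generate one `analyze` invocation per target. The sketch below only builds the argument lists, using flags documented in the CLI reference; the helper name `analyze_commands` and the `report_<arch>.json` file names are assumptions made for illustration.

```python
import shlex


def analyze_commands(ptx_path, archs, threads=256):
    """Build one cuda-sage analyze command per architecture (hypothetical helper)."""
    commands = []
    for arch in archs:
        commands.append([
            "cuda-sage", "analyze", ptx_path,
            "--arch", arch,
            "--threads", str(threads),
            "--format", "json",
            "--output", f"report_{arch}.json",  # assumed naming scheme
        ])
    return commands


for cmd in analyze_commands("mykernel.ptx", ["sm_80", "sm_86", "sm_90"]):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once cuda-sage is installed in the active environment.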

Desktop GUI (PyQt6)

If you prefer a native desktop app over command-line usage, cuda-sage now includes a PyQt6 GUI.

The GUI is intentionally built on the same parser and analyzer pipeline as the CLI, so it is a different interface, not a different engine. It adds a material-style presentation, theme switching, summary cards, and a dedicated analysis workspace on top of that pipeline.

This matters because CUDA review work is often collaborative. A polished desktop surface makes it easier to walk through PTX findings in pair sessions, performance reviews, demos, and onboarding without forcing every participant to stay in a terminal-centric workflow.

Why A GUI Is Useful

It lowers the friction for teams that do not want to memorize CLI flags.
It makes report interpretation faster with a card-based summary, tabbed results, and clearer visual separation between inputs and findings.
It helps cross-functional reviews where not every participant is comfortable with terminal tooling.
It preserves reproducibility by exporting the same machine-consumable JSON schema used in automation.
It supports runtime theme switching, which helps the tool feel native in both dark and light desktop environments.

Install GUI Dependencies

pip install -e ".[gui]"

Launch The Desktop App

cuda-sage-gui

What The GUI Supports

| Section | Capability |
| --- | --- |
| Hero header | Project overview, runtime theme switcher, and a cleaner desktop entry point |
| Analyze panel | Pick a PTX file, choose architecture and threads/block, optional kernel filter, optional occupancy curve |
| Summary tab | KPI cards for kernel count, occupancy, divergence, and spills plus rich narrative recommendations |
| Metrics tab | Per-kernel occupancy, bottleneck, register count, spill count, divergence count, sync-risk flag |
| JSON tab | Full JSON report preview matching CLI schema |
| Action Plan tab | Plain-language explanation of each metric, where to tune next, and concrete code modifications suggested by the analysis |
| Diff tab | Baseline vs optimized PTX comparison with per-kernel deltas and verdict rendered in a more readable report layout |
| Export | Save current analysis JSON report directly from the desktop app |

Note

The desktop app uses the same parser and analyzer pipeline as CLI mode, so results remain consistent across interfaces.

GUI Screenshots

Note

All controls use QFormLayout so labels align right and every input field, including the architecture and thread-count dropdowns, stretches to fill the available width. Keyboard shortcuts (Alt+R, Alt+S, Alt+B, Alt+D), tooltips, and accessible names are applied to every widget.

Summary View

KPI cards at the top surface kernels analyzed, average occupancy, divergence sites, and spill count at a glance. The panel below explains what those numbers mean and lists the top recommended modifications.

Metrics Table

The metrics tab shows per-kernel register count, spills, divergence sites, limiter, and occupancy in a sortable table. Column headers carry tooltips that explain each metric.

JSON Preview

The JSON tab exposes the full structured report that CI automation can consume, keeping desktop review and pipeline behavior in sync.

Action Plan

The action-plan tab explains what each value means, identifies the primary bottleneck, lists concrete tuning levers, and suggests specific code modifications to try next.

Diff View

The diff tab compares two PTX files and reports occupancy delta, register delta, spill delta, and divergence delta per kernel with a clear IMPROVED / REGRESSION / NEUTRAL verdict.

CLI Reference

| Command | Purpose | Common Options | Example |
| --- | --- | --- | --- |
| analyze | Analyze kernels in a PTX file | --arch, --threads, --curve, --kernel, --format, --output | cuda-sage analyze kernel.ptx --arch sm_80 --curve |
| diff | Compare baseline vs optimized PTX | --arch, --threads | cuda-sage diff base.ptx opt.ptx --arch sm_80 |
| list-archs | Print supported architecture table | none | cuda-sage list-archs |
| --version | Print version and exit | -V | cuda-sage --version |

Analyze Examples

# Text report
cuda-sage analyze kernel.ptx --arch sm_80

# Kernel filter
cuda-sage analyze all_kernels.ptx --kernel matmul --arch sm_90

# JSON output to file
cuda-sage analyze kernel.ptx --arch sm_80 --format json --output report.json

Diff Example

cuda-sage diff baseline.ptx optimized.ptx --arch sm_80 --threads 256

Diff compares kernels by name and reports metric deltas with a simple verdict model. This helps reviewers see whether a change appears to improve, regress, or remain neutral before deeper benchmarking.
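
As an illustration of what a simple verdict model can look like, the sketch below compares two metric snapshots for one kernel. The rule it applies (any worsening signal wins over improvements, with a small occupancy tolerance) is an assumption made for this example and may differ from the rule cuda-sage actually uses; `KernelMetrics`, `deltas`, `verdict`, and `eps` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class KernelMetrics:
    occupancy: float        # fraction in [0.0, 1.0]
    registers: int
    spill_ops: int
    divergence_sites: int


def deltas(base: KernelMetrics, new: KernelMetrics) -> dict:
    """Per-metric change from baseline to candidate (new minus base)."""
    return {
        "occupancy": new.occupancy - base.occupancy,
        "registers": new.registers - base.registers,
        "spill_ops": new.spill_ops - base.spill_ops,
        "divergence_sites": new.divergence_sites - base.divergence_sites,
    }


def verdict(base: KernelMetrics, new: KernelMetrics, eps: float = 0.01) -> str:
    """Assumed rule: any worsening signal outweighs any improvement."""
    d = deltas(base, new)
    worse = d["occupancy"] < -eps or d["spill_ops"] > 0 or d["divergence_sites"] > 0
    better = d["occupancy"] > eps or d["spill_ops"] < 0 or d["divergence_sites"] < 0
    if worse:
        return "REGRESSION"
    if better:
        return "IMPROVED"
    return "NEUTRAL"


# Illustrative numbers, not output from a real kernel.
baseline = KernelMetrics(occupancy=0.50, registers=64, spill_ops=4, divergence_sites=2)
candidate = KernelMetrics(occupancy=0.67, registers=48, spill_ops=0, divergence_sites=2)
print(verdict(baseline, candidate))
```

Register count is reported as a delta but left out of the verdict here, on the assumption that its effect already shows up through occupancy.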

JSON Output For Automation

JSON mode allows policy checks and dashboards without parsing terminal output.

cuda-sage analyze kernel.ptx --arch sm_80 --format json --output report.json

Typical machine-consumed fields include:

| JSON Path | Meaning |
| --- | --- |
| occupancy.value | Occupancy in range [0.0, 1.0] |
| occupancy.limiting_factor | Dominant occupancy bottleneck |
| divergence.site_count | Number of detected divergence sites |
| divergence.high_severity_count | Count of high-risk divergence sites |
| memory.spill_ops | Total local load/store spill operations |
| memory.possible_missing_sync | Shared write without observed barrier hint |

Important

JSON values are analysis signals, not hardware measurements. Keep thresholds practical and validate strict failures with runtime profiling.
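
A policy check over these fields might look like the sketch below. It assumes a single per-kernel object whose nesting follows the dotted paths in the table; the surrounding report layout, the sample values, and the thresholds are all assumptions made for illustration, and `get_path` / `check_kernel` are hypothetical helpers.

```python
import json


def get_path(obj, dotted_path):
    """Walk a nested dict using a dotted path such as 'occupancy.value'."""
    for key in dotted_path.split("."):
        obj = obj[key]
    return obj


def check_kernel(kernel_report, min_occupancy=0.25,
                 max_high_divergence=0, max_spill_ops=0):
    """Return a list of human-readable policy violations (empty means pass)."""
    problems = []
    occ = get_path(kernel_report, "occupancy.value")
    if occ < min_occupancy:
        problems.append(f"occupancy {occ:.2f} below {min_occupancy:.2f}")
    high = get_path(kernel_report, "divergence.high_severity_count")
    if high > max_high_divergence:
        problems.append(f"{high} high-severity divergence site(s)")
    spills = get_path(kernel_report, "memory.spill_ops")
    if spills > max_spill_ops:
        problems.append(f"{spills} spill op(s)")
    if get_path(kernel_report, "memory.possible_missing_sync"):
        problems.append("possible missing synchronization")
    return problems


# Hand-written sample shaped after the table above, not real tool output.
sample = json.loads("""
{
  "occupancy": {"value": 0.5, "limiting_factor": "registers"},
  "divergence": {"site_count": 3, "high_severity_count": 1},
  "memory": {"spill_ops": 0, "possible_missing_sync": false}
}
""")
for problem in check_kernel(sample):
    print("FAIL:", problem)
```

Returning a list instead of raising lets a CI step print every violation before deciding whether to fail the job.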

How To Read Results

Suggested Triage Order

1. Check occupancy and identify the limiting factor.
2. Inspect divergence findings, especially high-severity sites.
3. Review memory warnings for spills, sync risk, and conflict hints.
4. Compare baseline and candidate PTX with diff to quantify direction.

This sequence reduces rework by solving broad resource constraints before fine-grained micro-optimizations.

Severity Interpretation

| Severity | Interpretation | Typical Priority |
| --- | --- | --- |
| High | Strong signal of likely performance impact | Address before merge for critical kernels |
| Medium | Context-dependent risk worth investigation | Address in current optimization cycle |
| Low | Informational guidance | Track and batch with related work |

Practical Tips

Keep baseline PTX snapshots for critical kernels so diff remains meaningful.
Align --threads with production launch assumptions when possible.
Avoid overreacting to single low-severity signals; look for repeated patterns across kernels.
Re-run analysis after each optimization pass to validate direction.

Note

Heuristic warnings are intended to be conservative. They are most useful when treated as prioritization hints rather than absolute truth.

Architecture Support

| SM Target | Representative GPU | Max Warps/SM | Max Threads/SM | Regs/SM | Shared Mem/SM |
| --- | --- | --- | --- | --- | --- |
| sm_70 | Volta V100 | 64 | 2048 | 65536 | 96 KB |
| sm_75 | Turing T4 / RTX 2080 | 32 | 1024 | 65536 | 64 KB |
| sm_80 | Ampere A100 | 64 | 2048 | 65536 | 164 KB |
| sm_86 | Ampere RTX 3080/3090 | 48 | 1536 | 65536 | 100 KB |
| sm_89 | Ada RTX 4090 class | 48 | 1536 | 65536 | 100 KB |
| sm_90 | Hopper H100 | 64 | 2048 | 65536 | 228 KB |

Use cuda-sage list-archs for runtime display of supported architecture models.
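
To show how these per-SM limits turn into an occupancy number, here is a simplified worked sketch using the sm_86 row. It is not cuda-sage's model: it ignores register and shared-memory allocation granularity, `estimate_occupancy` is a hypothetical function, and the block limit of 16 per SM is an assumption taken from NVIDIA's compute-capability tables rather than from the table in this README.

```python
import math

WARP_SIZE = 32


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       max_warps, regs_per_sm, smem_per_sm, max_blocks):
    """Return (occupancy, limiting_factor) under a simplified resource model."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # How many blocks fit on one SM according to each resource.
    limits = {
        "warps": max_warps // warps_per_block,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * WARP_SIZE),
        "blocks": max_blocks,
    }
    if smem_per_block > 0:
        limits["shared_memory"] = smem_per_sm // smem_per_block
    factor = min(limits, key=limits.get)   # tightest resource
    active_warps = limits[factor] * warps_per_block
    return active_warps / max_warps, factor


# sm_86 row: 48 warps/SM, 65536 registers/SM, 100 KB shared memory/SM.
occ, factor = estimate_occupancy(
    threads_per_block=256, regs_per_thread=64, smem_per_block=0,
    max_warps=48, regs_per_sm=65536, smem_per_sm=100 * 1024,
    max_blocks=16,  # assumption, not listed in the table above
)
print(f"{occ:.2f}", factor)
```

With 64 registers per thread, each 256-thread block needs 16384 registers, so only 4 blocks (32 warps) fit and registers become the limiter at 32/48 occupancy. Dropping to 32 registers per thread moves the limit to the warp count instead.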

Python API Usage

Each component is importable and composable.

from cudasage import PTXParser, OccupancyAnalyzer, DivergenceAnalyzer, MemoryAnalyzer
from cudasage import get_arch

kernels = PTXParser().parse_file("kernel.ptx")
arch = get_arch("sm_80")

occ = OccupancyAnalyzer().analyze(kernels[0], arch, threads_per_block=256)
div = DivergenceAnalyzer().analyze(kernels[0])
mem = MemoryAnalyzer().analyze(kernels[0])

print("Occupancy:", occ.occupancy, occ.limiting_factor)
print("Divergence sites:", len(div.sites))
print("Spill ops:", mem.spill_ops)

Occupancy Curve Example

curve = OccupancyAnalyzer().occupancy_curve(kernels[0], arch)
for pt in curve:
    print(pt.threads_per_block, pt.occupancy, pt.limiting_factor)

Project Structure

src/cudasage/
├── __init__.py
├── cli.py
├── reporter.py
├── analyzers/
│   ├── occupancy.py
│   ├── divergence.py
│   └── memory.py
├── models/
│   └── architectures.py
└── parsers/
    └── ptx_parser.py

tests/
├── fixtures/
│   ├── vecadd.ptx
│   ├── divergent_kernel.ptx
│   ├── matmul.ptx
│   └── reduction.ptx
├── test_cli.py
├── test_parser.py
├── test_occupancy.py
├── test_divergence.py
├── test_memory.py
├── test_reporter.py
├── test_fixtures.py
└── test_public_api.py

The codebase is intentionally compact so contributors can reason about the entire system without crossing multiple product domains.

Development Workflow

pip install -e ".[dev]"
pytest -v

Suggested PR Checklist

Added or updated tests for changed analysis behavior
Validated CLI behavior for new options or error paths
Reviewed JSON output impact for downstream automation
Confirmed architecture assumptions in test fixtures

CI Integration Example

name: ptx-static-analysis

on:
  pull_request:
  push:
    branches: [main]

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install
        run: |
          python -m venv .venv
          source .venv/bin/activate
          pip install -e .
      - name: Analyze fixture
        run: |
          source .venv/bin/activate
          cuda-sage analyze tests/fixtures/matmul.ptx --arch sm_80 --format json --output report.json
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: cuda-sage-report
          path: report.json

Troubleshooting

Command Reports No Kernels Found

Note

The parser expects PTX .entry kernels. Verify your compile step produced a PTX file with visible entry functions.
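
A quick way to check a PTX file for entry functions, independent of cuda-sage, is to scan for the `.entry` directive. The sketch below is a hypothetical helper (`list_entries`) run on a hand-written sample; a file that contains only `.func` definitions, for example one compiled from `__device__` functions with no `__global__` kernel, yields an empty list.

```python
import re

# ".entry" is followed by the kernel name; PTX identifiers may contain
# letters, digits, underscores, and dollar signs.
ENTRY = re.compile(r"\.entry\s+([A-Za-z_$%][\w$]*)")


def list_entries(ptx_text: str) -> list[str]:
    """Return the names of all .entry functions found in a PTX string."""
    names = []
    for line in ptx_text.splitlines():
        code = line.split("//", 1)[0]  # ignore PTX line comments
        names.extend(ENTRY.findall(code))
    return names


# Hand-written sample for illustration only.
sample = """
.visible .func helper(.param .b32 helper_param_0)
{
    ret;
}
.visible .entry _Z6vecAddPKfS0_Pfi(
    .param .u64 _Z6vecAddPKfS0_Pfi_param_0
)
{
    ret;
}
"""
print(list_entries(sample))
```

To check a real file, pass the result of `open("mykernel.ptx").read()` to `list_entries`.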

Architecture String Looks Ignored

Important

Unknown architecture values fall back to the nearest supported model. Run cuda-sage list-archs to confirm available targets.

JSON Output File Path Fails In CI

Tip

Use --output with a writable path in the runner workspace. The CLI creates parent directories when needed, but the target location still must be writable.

Divergence Warning Seems Too Conservative

Note

Some patterns are flagged intentionally to reduce false negatives for high-cost branch behavior. Use branch-efficiency runtime metrics to prioritize what to fix first.
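
To see why such heuristics lean conservative, consider a minimal sketch of thread-varying predicate detection. It is not cuda-sage's algorithm: it makes a single forward pass, marks any register computed from `%tid` or `%laneid` as thread-varying, and flags branches guarded by a thread-varying predicate. It never clears a mark and does not reason about values, so a branch that is uniform in practice can still be flagged. The function name and sample PTX are hypothetical.

```python
THREAD_VARYING = ("%tid", "%laneid")  # special registers that differ per thread


def find_divergent_branches(ptx_text: str) -> list[str]:
    """Return guard predicates of branches that depend on the thread index."""
    tainted = set()   # registers derived from a thread-varying source
    sites = []
    for raw in ptx_text.splitlines():
        code = raw.split("//", 1)[0].strip().rstrip(";")
        if not code or code.startswith("."):
            continue  # blank line or directive
        guard = None
        if code.startswith("@"):
            guard_token, _, code = code.partition(" ")
            guard = guard_token.lstrip("@!")
            code = code.strip()
        opcode, _, rest = code.partition(" ")
        base = opcode.split(".")[0]
        if base == "bra":
            if guard in tainted:
                sites.append(guard)
            continue
        if base == "st":
            continue  # stores write memory, not a destination register
        operands = [tok.strip() for tok in rest.split(",") if tok.strip()]
        if len(operands) < 2:
            continue  # labels, braces, ret, and similar
        dest, sources = operands[0], operands[1:]
        for src in sources:
            reg = src.strip("[]").split("+")[0]
            if reg.startswith(THREAD_VARYING) or reg in tainted:
                tainted.add(dest)
                break
    return sites


# Hand-written sample for illustration only.
sample = """
    mov.u32 %r1, %tid.x;
    and.b32 %r2, %r1, 1;
    setp.eq.s32 %p1, %r2, 0;
    @%p1 bra $L__BB0_2;
    setp.lt.s32 %p2, %r9, %r10;
    @%p2 bra $L__BB0_3;
"""
print(find_divergent_branches(sample))
```

Only the first branch is reported, because `%p1` is derived from `%tid.x` while `%p2` compares two registers with no thread-varying source in this sample.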

Limitations

PTX-only static analysis. No SASS parsing and no direct runtime timing.
Divergence and bank-conflict checks are heuristic by design.
Reports describe likely risk and optimization direction, not guaranteed speedup.

This trade-off is intentional so the tool remains fast, portable, and suitable for early-stage gating.

Documentation

License

MIT. See LICENSE.
