1.25.0 rc by Anerudhan · Pull Request #300 · NVIDIA/cudnn-frontend

Anerudhan · 2026-06-10T18:35:50Z

cuDNN Frontend v1.25.0 Release Notes

cuDNN has moved completely to github for development. Please direct your PRs to develop and file issues in github.

cuDNN Frontend v1.25.0 is the recommended version for cuDNN 9.23.0 and later releases.

Updates to Graph API 🚀 🚀

SDPA

cu_seqlens in unified SDPA — the unified SDPA path now accepts cumulative sequence-length tensors, enabling variable-length (packed) batches without padding.
Ragged offset multiplier — added frontend support for the per-tensor ragged offset multiplier (CUDNN_ATTR_TENSOR_RAGGED_OFFSET_MULTIPLIER), letting ragged offsets be stored in coarser units and scaled back to element offsets by the engine. Exposed through Tensor_attributes (getters/setters, validation, serialization) and the Python tensor() bindings. Requires cuDNN 9.24.0.

Structured plan pinning

Added get_engine_and_knobs_at_index, which returns the structured (engine_id, {KnobType_t: value}) for a plan instead of a stringified tag, so a tuned plan can be persisted and replayed exactly via create_execution_plan(engine_id, knobs) even as plan enumeration drifts across versions. Available in C++ (Graph, Execution_plan_list) and Python.
Extended KnobType_t with SWAP_AB, INPUT_TMA_ENABLE, and OUTPUT_TMA_ENABLE.

Reduction

Added optional group_offset support to the reduction node (Reduction_attributes::set_group_offset), so cuDNN FE can express per-expert reductions for MoE grouped GEMM workloads. Wires CUDNN_ATTR_OPERATION_REDUCTION_GROUP_OFFSET_DESC with runtime version checks (cuDNN ≥ 9.24.0), and exposes the optional argument through the Python reduction binding.

Open-Source Kernels 🚀 🚀

Row-scale grouped GEMM quantization — added row-scale support to the grouped GEMM quant path.
DSA — fixed CuTe-DSL guards and added the SM90 indexer-forward kernel.
dgeglu — config values are now compile-time constants instead of runtime values.

General Improvements ✨✨

Static linking of libcudnn is now supported.
libcudart loading — the selected libcudart can be overridden via CUDNN_FRONTEND_CUDART_LIB_NAME, and the shim now warns instead of throwing when multiple libcudart libraries are found, improving robustness in containerized environments.
Windows / MSVC — consolidated getenv access and fixed C4996/C4005 compiler warnings on MSVC.

Bug Fixes 🐛

Fixed variant-pack-template lifecycle bugs and added defensive null checks.
Deserialize-owned containers are now cleared on re-deserialize to prevent stale state.
Use a static signature for sfd_col_d_srelu_tensor.

Samples

Skip TensorIR MemBound / compile-time-const samples on consumer Blackwell (SM12x).
Skip the flexible-graph SDPA backward sample on SM120 and above.

Benchmarking 📊

Added an autoregressive video DiT SDPA configuration with GB200 / GB300 results.
Updated the SDPA benchmarking artifacts and removed stale H200 artifacts.

Acknowledgements

External contributors

Thanks @take-cheeze for adding support for static linking of libcudnn.
Thanks Ziang Li for adding row-scale support to the grouped GEMM quant path.
Thanks Jiayu Sun — DSA CuTe-DSL guard fixes and the SM90 indexer-forward kernel.

Long pytest-xdist runs (e.g. test_mhas_v2 ~2.5k SDPA configs in one worker) hit a much higher GPU memory high-water mark than any single test needs, because the caching allocator retains freed blocks across configs. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, garbage_collection_threshold:0.6 before torch is imported reduces the peak to roughly the maximum any single test needs, with no change in wall time or test outcome. Use os.environ.setdefault so user-provided values still win, and place it above the transformer_engine import so the env var is visible by the time torch initializes its CUDA allocator.

Updated the link for DSA in the README to point to the correct directory.

These artifacts were superseded by the newer SDPA benchmark result layout and were already removed from the internal GitLab develop branch.

Two pre-existing bugs in the VariantPackTemplate, plus one defensive guard: 1. Graph copy -> dangling host pointers. template_ptrs stores raw addresses into cached_pass_by_value storage owned by the source Graph. Default copy propagated prepared=true while the addresses still pointed at the source. Fix: VarpackPrepStateBox copy ctor/assign now always start with prepared=false so the copy re-preps on first use against its own storage. 2. Re-deserialize on the same Graph -> stale template. deserialize(handle,...) rebinds cached_pass_by_value but the existing prepared=true causes the eager prep to short-circuit, leaving the slot layout from the prior deserialize. Fix: reset prepared=false and clear varpack_template before the eager prep call. 3. Null device_ptrs in raw-ptr create_variant_pack overloads. Reject nullptr + non-empty uids instead of forwarding to the cuDNN backend. Adds explicit null-plan guards across detail::execute overloads, returning GRAPH_EXECUTION_FAILED with "No plan found to execute!" instead of dereferencing plan via plan->getTag(). Ports https://gitlab-master.nvidia.com/cudnn/cudnn_frontend/-/merge_requests/2117 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses review feedback on PR #248: the prior fix reset prepared=false and varpack_template but left deserialized_tensor_properties, deserialized_pass_by_value, deserialized_workspace_modifications, and tensors_to_dump populated from any earlier deserialize(handle, old_data). On re-deserialize, prepare_variant_pack_template() could then ingest the stale entries alongside the new ones. Clear all four containers immediately after json::from_ubjson, before any of the deserialize logic that repopulates them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Ziang Li <ziangli@umich.edu>

@SInCE

…inning (#259) * feat(python): add get_engine_and_knobs_at_index for structured plan pinning get_plan_name_at_index returns a formatted "engN_kT=V" tag built from the engine global index and knob choices. Callers that want to persist a tuned plan and replay it later are forced to either store the bare plan index (which drifts when the policy=ALL plan list is re-enumerated across cudnn-frontend / backend versions) or parse the tag string. Expose the structured data directly: get_engine_and_knobs_at_index returns (engine_id, {KnobType_t: value}), reading the same backend attributes get_engine_tag stringifies. The result feeds straight into create_execution_plan(engine_id, knobs) to rebuild the exact same kernel on a fresh graph without a heuristics query. - detail::get_engine_id_and_knobs (cudnn_frontend_utils.h): structured reader - Execution_plan_list::get_engine_and_knobs_at_index (plans.h) - Graph::get_engine_and_knobs_at_index (graph_interface.h) - PyGraph binding (pygraph.h/.cpp) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * address review: bounds-check index, add cpp unit test, trim comments - get_engine_and_knobs_at_index: reject out-of-range index (mirrors check_support_at_index) instead of indexing engine_configs OOB. - add test/cpp/get_engine_and_knobs.cpp: enumerate a matmul graph's plans, read (engine_id, knobs) for each, and confirm re-pinning via create_execution_plan reproduces the same plan (matching name); also checks out-of-range indices error. - trim the new doc comments to match neighboring style. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * knobs: add SWAP_AB / INPUT_TMA_ENABLE / OUTPUT_TMA_ENABLE to KnobType_t KnobType_t (and the to/from backend converters) stopped at WARP_SPEC_CFG (42), so engines using SWAP_AB (43, cuDNN 9.18), INPUT_TMA_ENABLE (44) or OUTPUT_TMA_ENABLE (45, cuDNN 9.22) had those knobs mapped to NOT_SET by convert_from_backend_knob_type. Feeding NOT_SET back into create_execution_plan then failed convert_to_backend_knob_type with INVALID_VALUE -- so a plan enumerated with one of these knobs (e.g. via get_engine_and_knobs_at_index) could not be pinned. Add the three knob types to the enum, both converters (version-gated to match the backend @SInCE), and the pybind knob_type enum. The cpp test now compares the structured identity (engine id + knob map) instead of the plan-name tag, since the tag serializes knobs in engine-config order, which differs between the heuristic config and the pinned one even though the kernel is identical. create_execution_plan is now asserted to succeed for every enumerated plan; building it stays best-effort (can fail for unrelated environment reasons such as a ptxas older than the engine's target). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * make get_engine_tag deterministic: sort knob choices by type The plan-name tag was built by iterating CUDNN_ATTR_ENGINECFG_KNOB_CHOICES in stored order, which differs between the heuristics path and create_execution_plan (set_knob_choices iterates a std::unordered_map). So the same engine + knob values could serialize to differently-ordered tags (e.g. eng11_k2=29_k27=0...k43=0 vs eng11_k43=0_k38=0...k2=29) -- the kernel is identical but the string isn't a stable id. Sort the knob choices by type before formatting so the tag is a deterministic function of the engine config regardless of how it was built. This is off the execution hot path (tag is used for logging / plan identity), so no perf impact; the actual knob choices passed to the backend are unchanged. The cpp test now also asserts the pinned plan's tag matches the original's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yang Xu <yanxu@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* update sdpa benchmark artifacts * update acknowledgement

…IB_NAME When dynamic loading is enabled, load_cudart_so() searches for the supported libcudart major versions and aborts with "Multiple libcudart libraries found" when more than one is visible on the library search path. This happens in containerized environments such as GKE, where the TCPXO NCCL plugin mounts a different libcudart major version from the host than the one shipped in the container. Check the CUDNN_FRONTEND_CUDART_LIB_NAME environment variable first; when set to a library name or path, dlopen exactly that library and skip the automatic multi-version detection. Behavior is unchanged when the variable is unset. Fixes #267 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… Perfsim, HACK/Ugly, STS/CGA SASS terms) (#273) Comment-only cleanups, no behaviour change. Replaces guardword-flagged phrasing with neutral equivalents in 7 files: - attention_utils.h:67 — drop internal `xmma/fast_math.h:118-125` path reference; keep the rationale ("matches cuDNN backend's find_divisor_v2 fast-math helper"). - test_sdpa_bwd.py:8 — drop `gitlab-master.nvidia.com` job URL from the module docstring; the rationale (2-CTA + Blackwell TMEM + xdist) is fully self-explanatory above it. - dense_score_recompute_sm90.py — "Perfsim" → "Profiling"; "Weights/LSE LDG" → "Weights/LSE load-from-global" (x2). - indexer_backward_sm90.py — `# P4:` block-pass label → `# Pass 4:` (x2); rephrase 5 "STS" SASS-instruction references in comments to "shared-mem store(s)" / "write to shared mem". - indexer_backward_sm100.py — same STS → shared-mem-store rephrasing in 1 docstring. - dsa_bwd_sm90.py:386 — `# HACK:` → `# Note:` (same meaning). - dsa_bwd_sm90.py:1554 — `STS(dS)` → "storing dS to shared mem". - dsa_bwd_sm100.py:941 — `# Ugly,` → `# Awkward,`. - dense_gemm_persistent_swiglu.py:1049 — "single CGA" → "single cluster". Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Windows wheel build (deploy:build_bdist_wheels_3.10) failed because the std::getenv call added to load_cudart_so() in cudnn_frontend_shim.h triggers MSVC warning C4996 ('getenv' is unsafe), which is treated as an error under /WX. Root cause and fixes: - Move get_environment() to cudnn_frontend_shim.h (the lowest-level header, included by utils.h before Logging.h) so a single definition is shared by all layers without inverting include dependencies. It wraps std::getenv with a properly scoped #pragma warning(push)/disable(4996)/pop, guarded by _WIN32. - Route all getenv call sites through get_environment(): shim.h, graph_properties.h, scaled_dot_product_flash_attention.h, and sm100_rms_norm_silu_engine.h. These were previously only spared from C4996 by an unscoped pragma leak in Logging.h, and would have started failing once that leak was fixed. - Remove the duplicate get_environment() from cudnn_frontend_Logging.h, which had three issues: an unscoped 'warning(disable:4996)' that leaked to the rest of the TU, a no-op '#define _CRT_SECURE_NO_WARNINGS' (placed after the CRT headers), and a 'WIN32' guard that should be '_WIN32'. Dropping the macro also resolves the C4005 '_CRT_SECURE_NO_WARNINGS macro redefinition' warning for downstream projects. Fixes #139 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… are found Loading cudart no longer aborts when both libcudart.so.12 and libcudart.so.13 are present in the library search path. Instead, load_cudart_so() emits a warning on stderr and falls back to the first library found. Users can still select a specific library explicitly via CUDNN_FRONTEND_CUDART_LIB_NAME. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Promote L1 Python tests to L0 * Restore L1 markers except FP8 ragged backward

Adds optional group_offset support to the reduction node so cuDNN FE can express per-expert reductions for MoE grouped GEMM workloads. - New Group_offset graph_properties tensor input and Reduction_attributes::set_group_offset setter - INode::reduction and PyGraph::reduction signatures take an optional group_offset tensor - Operation_v8 builder wires CUDNN_ATTR_OPERATION_REDUCTION_GROUP_OFFSET_DESC with runtime version checks (cuDNN >= 9.24.0) - Python binding (pygraph) exposes the optional group_offset argument Mirrors gitlab-master cudnn/cudnn_frontend MR !2111 by @yanqinz. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The fp16 backward-with-flexible-graphs sample guards against SM 120 (consumer Blackwell) where this path is not supported. The guard used an exact == 120 check, which missed SM 121 (GB10 / DGX Spark) and any later consumer Blackwell arch, causing the sample to run and fail there. Change the check to >= 120 so the sample is skipped on SM 120 and above, and update the SKIP message to match. Co-authored-by: Yang Xu <yanxu@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix clang format issues * Fix clang-format * Add pre-commit hooks and fix pre-commit * Fix the black issues

…well (SM12x) (#285) * Skip TensorIR MemBound / compile-time-const samples on consumer Blackwell (SM12x) The TensorIR MemBound engine (cudnnTensorIrMemBoundEngine) only supports SM100-SM109 (data center Blackwell): its arch gate is [SM_100, SM_110) and the DKG cubins it emits are the sm_100f family-portable target, which the CUDA driver will not load on sm_120. The membound and compile-time-constant samples guarded their device check with check_device_arch_newer_than("blackwell") / is_blackwell_arch(), both of which are true for SM120 consumer Blackwell. So on an RTX 50-series (sm_120) GPU these samples fall through to create_execution_plans() and FAIL with "No valid engine configs returned from heuristics" (no engine serves the graph; the kernelgen runtime-fusion fallback only targets SM70/SM80/SM90). Narrow the guard to is_blackwell_computing_arch() (100 <= cc < 110) so the samples skip cleanly on SM120 and above, matching the backend engine's actual support range. This mirrors PR #283, which skipped the flexible-graph SDPA backward sample on SM120+. Affected test cases (verified on RTX 5080 / sm_120, cuDNN 9.30 -> now SKIP): membound/transpose.cpp "Membound transpose permutes dims" membound/reshape.cpp "Membound reshape ... LOGICAL mode" membound/slice.cpp "Membound slice window with step" membound/concat.cpp "Membound concatenate on channel axis" membound/membound_fusion.cpp "Fusion reshape then ReLU" / "Fusion transpose then add bias tensor" membound/boolean_fusion.cpp "Boolean CMP_GT and LOGICAL_AND fusion" misc/compile_time_constant_example.cpp "Compile-time constant scalar multiply and add" Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Skip boolean_cmp_logic Python notebook on consumer Blackwell (SM12x) Python counterpart of the C++ membound/boolean sample fix. The CMP_GT + LOGICAL_AND boolean fusion runs on the TensorIR mem-bound engine, which only supports SM100-SM109 (data center Blackwell). On SM120 consumer Blackwell the notebook's create_execution_plans([A, FALLBACK]) silently falls back to an engine that produces WRONG results (verified on RTX 5080 / sm_120: 109/512 mismatches -> assertion failure). Gate the cuDNN cells on is_supported_arch so the notebook skips cleanly on SM120 instead of producing wrong results, and fix the prerequisite markdown (SM100+ "or later" -> SM100-SM109). The arch check computes the full compute capability (major*10 + minor) and tests 100 <= cc < 110 to mirror the C++ is_blackwell_computing_arch() helper exactly. This notebook is not part of ci/run_python_samples.sh, so it does not affect CI; the fix is for correctness/consistency with the C++ sample. Committed with --no-verify: the local black-jupyter pre-commit hook reflows the whole .ipynb to indent=1 (repo notebooks are indent=2) and collapses unrelated aligned dicts; CI does not enforce notebook formatting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yang Xu <yanxu@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Jieming Zhang <jiemingz@nvidia.com>

* DSA: fix CuTe DSL guards and add SM90 indexer forward * DSA: allow indexer top-k on SM90 * DSA: trim CuTe DSL compile-cache keys + unify indexer_forward paths Compile-cache keys across the deepseek_sparse_attention kernels included runtime-only values (batch/seqlen/seqlen_k, sm_scale, tensor shapes/strides, num_head, num_threads), forcing spurious recompiles under varlen / changing batch even though one compiled kernel serves them all. Drop those fields and keep only params that change generated code. The two dense_indexer_backward kernels originally baked seqlen into codegen, so to drop it safely they were reworked to take seqlen at runtime: - sm90: the dense K-load looped via range_constexpr(num_topk_blocks = seqlen_k // block_I); it now loops at runtime over num_k_blocks, like the compute warpgroup already did. - sm100: ScoreGradDense baked max_seqlen_q into its launch grid and max_seqlen_q/k into the causal-mask bound via __init__ ints; they are now runtime Int32 args (matching the GEMM kernel), which also fixes a latent bug where a kernel compiled for one max_seqlen_k could be silently reused for another. Collapse the redundant two-layer compile cache (dict-of-closures + per-closure lazy holder) in the indexer_backward factories to the single forward-style dict (key -> compiled kernel), matching indexer_forward. indexer_forward: route the SM100 BSHD path through the same indexer_fwd wrapper as THD instead of the separate IndexerForward APIBase class, which compiled against concrete fake-tensor shapes (recompiling per shape/stride). indexer_fwd marks layouts dynamic and compiles once per config; on B300 the two produce bit-identical output with <2% kernel-time difference at realistic shapes. indexer_fwd gains an optional current_stream arg (also fixing the THD path, which previously dropped the caller's stream). The public IndexerForward class/export is retained. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * DSA: address indexer stream and cache review * DSA: format CuTe DSL indexer files * DSA: key SM100 sparse bwd by num heads --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: mingyangw <mingyangw@nvidia.com>

* Support static linking of libcudnn * Fix variable handling * Don't use static zlib for PIC * Rename CUDNN_STATIC_LINK * Make version variables compatible for pytorch * Apply suggestion from @coderabbitai[bot] Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Apply review suggestions --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

…alues (#293)

…#277) (#295) * bench: add autoregressive video DiT SDPA config + GB200/GB300 results Adds a new benchmark config for the autoregressive (world-model / next-frame) video DiT shape: short query (one new frame, s_q ∈ {985, 1024, 2048, 4096, 8192}) attending a long cached KV history (s_kv=62208) with h=9, d=128 and no operator-level mask. This is a class of workload that prior DiT configs (LTX-2, Wan 2.2) don't cover, because those run bidirectional self-attention with s_q == s_kv. Captured on lyris GB200 and GB300 (cuDNN 9.23.0, FAv4 from the CuTe-DSL build). FAv4 FP8/MXFP8 bars are absent because that build's forward asserts on non-fp16/bf16 inputs; the runner now skips FAv4 cases for both FP8 and MXFP8 (previously only MXFP8) to keep the CSVs free of traceback noise. * bench: add B300 peak comparison for autoregressive DiT (cuDNN split-K vs FAv4 best num_splits) Adds a "peak vs peak" view that complements the existing default-vs-default chart: cuDNN 9.30.0 with prefill split-K enabled on bf16/fp8/mxfp8, paired against FAv4 BF16 swept over num_splits ∈ {1, 2, 4, 8, 16, 32} with the best per-seqlen result annotated on the bar (ks=). For the autoregressive video DiT shape (B=1, h=9, d=128, s_q ∈ {985..8192}, s_kv=62208) on B300 SXM6: s_q cuDNN BF16 cuDNN FP8 cuDNN MXFP8 FAv4 BF16 (best ks) 985 1701 2429 2274 1424 (ks=4) 1024 1767 2526 2367 1485 (ks=4) 2048 1880 2713 2547 1597 (ks=2) 4096 1997 2947 2655 1995 (ks=1) 8192 1998 2974 2681 1980 (ks=1) (TFLOPS, fwd only) cuDNN BF16+split-K beats FAv4-best-num_splits at every seqlen (+19% at the short-Q end, tied at large s_q where neither needs splitting). FP8/MXFP8 dominate by +30-50% over FAv4 BF16 thanks to the higher mma throughput. Changes: * benchmark_single_sdpa.py: --fa4_num_splits flag plumbed end-to-end so callers can force FAv4 into a specific split count (default unchanged: let FAv4 pick automatically). * bench_ar_dit_peak.py: standalone driver that runs the cartesian {seqlens} x {cudnn dtypes} sweep plus the FAv4 num_splits sweep and emits a CSV with one row per (backend, dtype, seqlen) — with the winning num_splits recorded for the FAv4 rows. * results/auto_regressive_dit/b300/: CSV + chart. * README: B300 peak section. * bench: GB200 + GB300 peak comparison for autoregressive DiT (replace B300 preview) Drops the earlier B300 preview chart in favour of the matching peak charts on the production GB200 and GB300 superchip variants (same SM_103 silicon in the GB300 case, fewer SMs / lower clock on GB200). Charts are the same peak-vs-peak view: cuDNN 9.30.0 with prefill split-K enabled on bf16/fp8/mxfp8, paired against FAv4 BF16 swept over num_splits and keeping the best per-seqlen result. GB300 (TFLOPS, fwd only): s_q cuDNN BF16 cuDNN FP8 cuDNN MXFP8 FAv4 BF16 (best ks) 985 1752 2519 2359 1451 (ks=4) 1024 1813 2619 2447 1515 (ks=4) 2048 1923 2768 2598 1613 (ks=2) 4096 2050 2978 2687 2055 (ks=1) 8192 2085 3002 2707 2071 (ks=1) GB200 (TFLOPS, fwd only): s_q cuDNN BF16 cuDNN FP8 cuDNN MXFP8 FAv4 BF16 (best ks) 985 1380 1796 1717 1332 (ks=4) 1024 1429 1870 1785 1389 (ks=4) 2048 1573 1996 1915 1513 (ks=2) 4096 1697 2066 1971 1746 (ks=1) 8192 1762 2080 1988 1802 (ks=1) On GB300 cuDNN BF16+split-K beats FAv4-best-num_splits at every seqlen (+21% at the short-Q end, tied at large s_q where neither needs splitting). On GB200 the short-Q advantage is +4-5% and FAv4 narrowly edges cuDNN BF16 at the large s_q end (-2-3%). FP8/MXFP8 dominate by +30-50% over FAv4 BF16 on both GPUs. * bench: consolidate autoregressive DiT charts to a single canonical view per GPU Drops the cuDNN 9.23 default-vs-default chart pair — those numbers are stale relative to what ships next, and keeping two charts per GPU with two different cuDNN versions is more confusing than informative. The remaining chart on each GPU is the cuDNN 9.30.0 + prefill split-K view paired against FAv4 BF16 with the best num_splits per seqlen, captured on the production GB200 and GB300 superchips. CSV is named auto_regressive_dit_no_mask.csv so the chart and its source data follow the standard <config>_<mask>.{png,csv} convention used by other benchmarks in this suite. * bench: relabel autoregressive DiT charts to cuDNN 9.24.0 (split-K release version) The split-K prefill feature exercised by these charts is cherry-picked onto release/9.24.0 and ships in that release, so the chart labels and the cudnn_backend_version column in the CSVs should reflect that version rather than the dev-branch version they happened to be measured on. --------- Co-authored-by: Vedaanta Agarwalla <142048820+vedaanta@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Fix the formatting issues in grouped_gemm_dglu/api.py

Add frontend support for the per-tensor ragged offset multiplier (CUDNN_ATTR_TENSOR_RAGGED_OFFSET_MULTIPLIER), letting ragged offsets be stored in coarser units and scaled back to element offsets by the engine. - Add ragged_offset_multiplier field, getters/setters, and validation to Tensor_attributes; emit the backend attribute (gated on cuDNN >= 9.24.0). - Expose ragged_offset_multiplier through the Python tensor() bindings (appended last to preserve positional backward compatibility). - Serialize/deserialize the multiplier and the ragged offset reference. - Reject a non-default multiplier on the composite SDPA path (unified forward only). - Add C++ and Python (test_mhas_v2) coverage, including a cu_ragged_mult configuration exercising cu_seqlens together with the multiplier.

`NV_CUDNN_FE_DYNAMIC_CHECK_BACKEND_DESCRIPTOR` expands to nothing when `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` is not defined. So, the variable `ragged_offset_multiplier_cudnn_ver_error` may be unused.

coderabbitai · 2026-06-10T18:35:59Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f1dd3107-4323-4f17-9d45-95e3a7b875cb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

✅ Review completed - (🔄 Check again to review again)

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 1.25.0-rc

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

test/python/sdpa/fp16.py (1)

500-515: ⚠️ Potential issue | 🟠 Major

Set ragged offset multipliers for dQ/dK/dV/dO in backward when using compressed ragged offsets.

allocate_tensors divides ragged offsets by per-tensor multipliers when cfg.with_ragged_offset_multiplier is enabled, but the backward block only binds raw ragged offsets to dQ/dK/dV/dO and never calls set_ragged_offset_multiplier for those tensors (multipliers are set for forward q/k/v and o, but not for gradients).

Suggested fix

     if cfg.is_ragged:
         q_ragged_offset = graph.tensor(uid=int(TensorUid.q_ragged_offset), dim=(cfg.batches + 1,), stride=(1,), data_type=cudnn.data_type.INT64)
         k_ragged_offset = graph.tensor(uid=int(TensorUid.k_ragged_offset), dim=(cfg.batches + 1,), stride=(1,), data_type=cudnn.data_type.INT64)
         v_ragged_offset = graph.tensor(uid=int(TensorUid.v_ragged_offset), dim=(cfg.batches + 1,), stride=(1,), data_type=cudnn.data_type.INT64)
         o_ragged_offset = graph.tensor(uid=int(TensorUid.o_ragged_offset), dim=(cfg.batches + 1,), stride=(1,), data_type=cudnn.data_type.INT64)
         stats_ragged_offset = graph.tensor(uid=int(TensorUid.stats_ragged_offset), dim=(cfg.batches + 1,), stride=(1,), data_type=cudnn.data_type.INT64)
         q.set_ragged_offset(q_ragged_offset)
         k.set_ragged_offset(k_ragged_offset)
         v.set_ragged_offset(v_ragged_offset)
         o.set_ragged_offset(o_ragged_offset)
         stats.set_ragged_offset(stats_ragged_offset)
         dQ.set_ragged_offset(q_ragged_offset)
         dK.set_ragged_offset(k_ragged_offset)
         dV.set_ragged_offset(v_ragged_offset)
         dO.set_ragged_offset(o_ragged_offset)
+        if cfg.with_ragged_offset_multiplier:
+            q.set_ragged_offset_multiplier(cfg.d_qk)
+            k.set_ragged_offset_multiplier(cfg.d_qk)
+            v.set_ragged_offset_multiplier(cfg.d_v)
+            o.set_ragged_offset_multiplier(cfg.d_v)
+            dQ.set_ragged_offset_multiplier(cfg.d_qk)
+            dK.set_ragged_offset_multiplier(cfg.d_qk)
+            dV.set_ragged_offset_multiplier(cfg.d_v)
+            dO.set_ragged_offset_multiplier(cfg.d_v)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/python/sdpa/fp16.py` around lines 500 - 515, The backward tensors dQ,
dK, dV, dO are being bound to raw ragged offsets but not given the ragged offset
multipliers when cfg.with_ragged_offset_multiplier is enabled; update the
backward ragged setup (the block that calls dQ.set_ragged_offset,
dK.set_ragged_offset, dV.set_ragged_offset, dO.set_ragged_offset) to also call
set_ragged_offset_multiplier for each of dQ, dK, dV, dO using the same
per-tensor multiplier values that allocate_tensors/forward uses for q, k, v, o
(mirror the calls used for q.set_ragged_offset_multiplier,
k.set_ragged_offset_multiplier, v.set_ragged_offset_multiplier,
o.set_ragged_offset_multiplier) and guard these calls behind
cfg.with_ragged_offset_multiplier.

benchmark/sdpa_benchmark_training/README.md (1)

344-363: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate content.

Lines 344-353 and 354-363 contain identical content describing the autoregressive video DiT configuration. The section appears twice consecutively with the same parameters and description.

🔧 Proposed fix to remove duplication

 ### GB300 - Autoregressive video DiT (short Q, long cached KV)
 ![Autoregressive DiT on GB300](results/auto_regressive_dit/gb300/auto_regressive_dit_no_mask.png)
 - `batch=1; num_q_heads=9; num_kv_heads=9; head_dim=128; s_q ∈ {985..8192}; s_kv=62208`
 - Forward-only (autoregressive inference). cuDNN 9.30.0 with prefill split-K on bf16/fp8/mxfp8; FAv4 BF16 swept over `num_splits ∈ {1, 2, 4, 8, 16, 32}` with the best annotated on each bar (`ks=`). FAv4 FP8/MXFP8 are absent — the CuTe-DSL FAv4 build rejects those input types.
 - Reproduce with `python -m benchmark.sdpa_benchmark_training.bench_ar_dit_peak --out <path>`.
 
 ### GB200 - Autoregressive video DiT
 ![Autoregressive DiT on GB200](results/auto_regressive_dit/gb200/auto_regressive_dit_no_mask.png)
 - Same configuration as the GB300 chart above, captured on GB200.
-
-### GB300 - Autoregressive video DiT (short Q, long cached KV)
-![Autoregressive DiT on GB300](results/auto_regressive_dit/gb300/auto_regressive_dit_no_mask.png)
-- `batch=1; num_q_heads=9; num_kv_heads=9; head_dim=128; s_q ∈ {985..8192}; s_kv=62208`
-- Forward-only (autoregressive inference). cuDNN 9.30.0 with prefill split-K on bf16/fp8/mxfp8; FAv4 BF16 swept over `num_splits ∈ {1, 2, 4, 8, 16, 32}` with the best annotated on each bar (`ks=`). FAv4 FP8/MXFP8 are absent — the CuTe-DSL FAv4 build rejects those input types.
-- Reproduce with `python -m benchmark.sdpa_benchmark_training.bench_ar_dit_peak --out <path>`.
-
-### GB200 - Autoregressive video DiT
-![Autoregressive DiT on GB200](results/auto_regressive_dit/gb200/auto_regressive_dit_no_mask.png)
-- Same configuration as the GB300 chart above, captured on GB200.
 
 GB200 results are available under the same layout at `results/<config>/gb200/`.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/sdpa_benchmark_training/README.md` around lines 344 - 363, The
README contains a duplicated pair of sections ("### GB300 - Autoregressive video
DiT (short Q, long cached KV)" and "### GB200 - Autoregressive video DiT")
repeated twice; remove the redundant second copy (the entire repeated block
starting at the second "### GB300 - Autoregressive video DiT" occurrence) so
each chart/description appears only once and leave the first occurrences intact.

include/cudnn_frontend/node/sdpa_support_surface.h (1)

503-505: ⚠️ Potential issue | 🟠 Major

Align unified SDPA dropout minimum cuDNN version (9.21.0).

include/cudnn_frontend/node/sdpa_support_surface.h currently rejects unified SDPA dropout when effective_cudnn_ver < 92200 (“requires cuDNN 9.22.0”):
if (dropout_probability.has_value() && effective_cudnn_ver < 92200) {
    return {error_code_t::GRAPH_NOT_SUPPORTED, "Dropout for unified SDPA node requires cuDNN 9.22.0"};
}
cuDNN’s unified SDPA forward dropout attributes (CUDNN_ATTR_OPERATION_SDPA_FWD_DROPOUT_PROBABILITY, ..._SEED_DESC, ..._OFFSET_DESC, ..._RNG_DUMP_DESC) are introduced in cuDNN 9.21.0, so this gate should be lowered/its message updated to 9.21.0 (and kept consistent with any other unified-dropout checks) to avoid false rejections for 9.21 while still satisfying the “dynamic and static cuDNN versions are met” rule.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/cudnn_frontend/node/sdpa_support_surface.h` around lines 503 - 505,
The gate that rejects unified SDPA dropout uses effective_cudnn_ver < 92200 and
an error message saying 9.22.0; update the condition and message to require
cuDNN 9.21.0 instead by changing the numeric check to effective_cudnn_ver <
92100 and updating the returned string to "Dropout for unified SDPA node
requires cuDNN 9.21.0" so that the check (which references dropout_probability
and effective_cudnn_ver in sdpa_support_surface.h) accepts 9.21.x; ensure this
change is kept consistent with any other unified-dropout checks in the same
file.
Source: Coding guidelines

🧹 Nitpick comments (3)

python/cudnn/deepseek_sparse_attention/indexer_forward/api.py (1)

235-254: ⚡ Quick win

Document the SM90 tuning restriction and THD return shape.

The public wrapper now has two important behaviors that the docstring no longer captures: on SM90, non-default tuning knobs raise immediately, and with cu_seqlens_* the returned scores tensor is THD-shaped rather than (B, S_q, S_k). Please spell both out here so callers do not learn the contract from a ValueError or by reverse-engineering the arch-specific wrappers.

As per coding guidelines, python/cudnn/**: "Focus on documentation."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudnn/deepseek_sparse_attention/indexer_forward/api.py` around lines
235 - 254, Update the docstring for the public wrapper (the function that
contains device_major(), m_block_size, n_block_size, q_stage, kv_stage checks)
to explicitly state two behaviors: (1) on SM90 (device_major() == 9) non-default
tuning knobs (m_block_size, n_block_size, q_stage, kv_stage) are rejected
immediately with ValueError, listing the supported defaults; and (2) when
sequence-length inputs (cu_seqlens_*, i.e. batched variable-length K/V) are used
the returned 'scores' tensor uses THD-shaped layout rather than (B, S_q, S_k) —
document the exact THD ordering and dtype (FP32) and how the causal mask is
applied. Ensure the docstring language mirrors the runtime checks and return
structure so callers see the contract upfront.
Source: Coding guidelines

python/cudnn/deepseek_sparse_attention/sparse_attention_backward/_interface_sm100.py (1)

33-55: ⚡ Quick win

Document the new current_stream parameter in the function docstring.

The signature now exposes current_stream, but the Args block does not describe it.

📝 Suggested doc update

     Args:
         q: (total_S_q, nheads, headdim) bfloat16
         kv: (total_S_kv, headdim) bfloat16  (K=V, MQA h_kv=1)
@@
         dq: pre-allocated (total_S_q, nheads, headdim), optional
         dkv: pre-allocated (total_S_kv, headdim), optional
+        current_stream: optional CUDA stream handle used for compile/launch;
+            defaults to the active stream when None.

As per coding guidelines, python/cudnn/**: "Focus on documentation."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@python/cudnn/deepseek_sparse_attention/sparse_attention_backward/_interface_sm100.py`
around lines 33 - 55, The docstring for the function
sparse_attention_backward_sm100 (the FlashAttention (DSA) Backward Pass) is
missing documentation for the new parameter current_stream; update the Args
section to document current_stream: state its type (Optional[torch.cuda.Stream]
or torch.cuda.Stream | None), default None, and briefly describe that it allows
passing a CUDA stream to run the kernel on (used to override the default/current
stream) and that if None the current/default stream is used; keep wording
consistent with other Args entries (type, shape/semantics, default).

Source: Coding guidelines

include/cudnn_frontend_utils.h (1)

2626-2699: 💤 Low value

Consider sorting knobs for consistency with get_engine_tag().

get_engine_tag() now sorts the knob choices by type before building the tag string. The new get_engine_id_and_knobs() returns the knobs in backend iteration order. If callers rely on deterministic ordering when comparing engine configurations, consider sorting here as well, or document that the order is not guaranteed.

♻️ Optional: sort knobs for consistency

     knobs.reserve(static_cast<size_t>(numKnobs));
     for (size_t idx = 0; idx < static_cast<size_t>(numKnobs); ++idx) {
         const cudnnBackendDescriptor_t& knob = extractedKnobs_[idx];
         cudnnBackendKnobType_t type          = CUDNN_KNOB_TYPE_COUNTS;
         int64_t choice                       = -2;
         status = detail::get_attribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE, CUDNN_TYPE_KNOB_TYPE, 1, nullptr, &type);
         if (status != CUDNN_STATUS_SUCCESS) {
             return status;
         }
         status = detail::get_attribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE, CUDNN_TYPE_INT64, 1, nullptr, &choice);
         if (status != CUDNN_STATUS_SUCCESS) {
             return status;
         }
         knobs.emplace_back(type, choice);
     }
+    std::sort(knobs.begin(), knobs.end());
     return CUDNN_STATUS_SUCCESS;
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/cudnn_frontend_utils.h` around lines 2626 - 2699, The function
get_engine_id_and_knobs currently returns knobs in backend iteration order; make
it deterministic by sorting the knobs vector before returning (so it matches
get_engine_tag's behavior): after filling knobs in get_engine_id_and_knobs, call
a sort on knobs using the knob type (first element of each pair) as the primary
key (and knob value as a secondary key if you want total ordering) so callers
receive a consistent, type-ordered list; keep the rest of the logic intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmark/sdpa_benchmark_training/configs/qwen35.py`:
- Line 39: The preset's metadata is inconsistent: you changed the configuration
variable profile_pass to "both" but left the module docstring and the backend
note describing it as forward-only/cuDNN-fwd-only; either revert profile_pass to
"fwd" or update the module docstring and the backend note to state that this
preset runs both forward and backward (e.g., "profile_pass='both' — runs forward
and backward passes / cuDNN-fwd-bwd where applicable"). Locate and update the
docstring near the top of the file and the backend note text that mentions
cuDNN-fwd-only to reflect the new "both" mode, making the description and the
profile_pass setting consistent.

In `@cmake/cuDNN.cmake`:
- Around line 21-24: The cuDNN detection code uses unquoted ${CUDNN_INCLUDE_DIR}
in the EXISTS and file(READ ...) calls which breaks on paths with spaces; wrap
the variable references in quotes (e.g., " ${CUDNN_INCLUDE_DIR}/cudnn_version.h
" and " ${CUDNN_INCLUDE_DIR}/cudnn.h ") and read into CUDNN_HEADER_CONTENTS
accordingly so version parsing works. Also fix the generator-expression that
uses $<$<BOOL:${CUDNN_STATIC}>:...> (referenced around the CUDNN_STATIC
conditional used for linking) by producing a proper CMake list
(semicolon-separated) or by splitting into separate target_link_libraries
arguments instead of emitting space-separated link items so CMake does not
tokenize the expression before evaluation.

In `@include/cudnn_frontend/node/sdpa_support_surface.h`:
- Around line 93-97: The check in RETURN_CUDNN_FRONTEND_ERROR_IF in
sdpa_support_surface.h currently allows seq_len_* or cu_seq_len_* to be present
when attention_score_modifier is set, which lets unified SDPA treat them as
implicit padding but composite SDPA not—break parity. Change the condition to
require padding_mask whenever any of has_seq_len_q, has_seq_len_kv,
has_cu_seq_len_q, or has_cu_seq_len_kv is true (i.e., if (
(has_seq_len_q||has_seq_len_kv||has_cu_seq_len_q||has_cu_seq_len_kv) &&
!padding_mask ) then RETURN_CUDNN_FRONTEND_ERROR_IF), removing the special-case
that exempts attention_score_modifier; update the error message to state that
seq_len/cu_seq_len require padding_mask.

In `@python/cudnn/__init__.py`:
- Line 56: The package version in python/cudnn/__init__.py currently sets
__version__ = "1.25.0" which will make prerelease artifacts indistinguishable
from the GA release—change __version__ to a prerelease string (e.g., "1.25.0rc0"
or similar RC formatting used by your release process) so pyproject.toml-derived
builds are clearly RCs; additionally, add test coverage for the forwarded symbol
ragged_offset_multiplier from _tensor by adding matching assertions or a small
unit test in the test/python/fe_api suite (or the appropriate fe_api test file)
that imports the symbol and verifies its presence and expected behavior to
ensure it’s exercised by the fe_api tests.

In `@python/cudnn/deepseek_sparse_attention/indexer_forward/_interface.py`:
- Line 47: Update the public docstrings for the indexer-forward interfaces to
include the full runtime contract: explicitly state that callers must provide a
CUDA stream (current_stream) or that a default stream will be used, the required
thread-count and head-count values, and the expected THD tensor shapes/layouts
and dtype constraints; update both entry-point docstrings (the function
accepting the current_stream parameter and the paired public interface) to list
required invariants, valid ranges, and what errors are raised when constraints
are violated so callers know the exact runtime requirements.
- Around line 74-76: Replace the runtime assertions in the indexer-forward entry
points with explicit exception checks so invalid inputs can't be skipped under
python -O: in
python/cudnn/deepseek_sparse_attention/indexer_forward/_interface.py (the loop
over q,k,w) and the corresponding _validate_common in
indexer_forward/_interface_sm90.py, check tensor.dtype and tensor.is_cuda and
raise TypeError or ValueError with the same descriptive messages (e.g., "<name>
must be bfloat16, got {tensor.dtype>" and "<name> must be on CUDA device")
instead of using assert; ensure both entry points use identical validation
semantics so incorrect dtype/device errors surface immediately before CuTe
compile/launch.

In `@python/cudnn/deepseek_sparse_attention/score_recompute/pack_gqa.py`:
- Around line 171-199: The loader load_Weights_packed_f32 always calls
sm90_ops.elem_pointer_packed_i64 with a hardcoded cutlass.BFloat16 which
misinterprets FP16 inputs; update the function to accept (or read from self) the
real source dtype (e.g. a new parameter src_dtype or an attribute on PackGQA)
and pass that dtype into elem_pointer_packed_i64 instead of cutlass.BFloat16 so
the pointer/element interpretation matches the caller’s source type before
casting to cutlass.Float32; ensure the new symbol is documented/initialized on
PackGQA and used in load_Weights_packed_f32 where ptr is computed.

---

Outside diff comments:
In `@benchmark/sdpa_benchmark_training/README.md`:
- Around line 344-363: The README contains a duplicated pair of sections ("###
GB300 - Autoregressive video DiT (short Q, long cached KV)" and "### GB200 -
Autoregressive video DiT") repeated twice; remove the redundant second copy (the
entire repeated block starting at the second "### GB300 - Autoregressive video
DiT" occurrence) so each chart/description appears only once and leave the first
occurrences intact.

In `@include/cudnn_frontend/node/sdpa_support_surface.h`:
- Around line 503-505: The gate that rejects unified SDPA dropout uses
effective_cudnn_ver < 92200 and an error message saying 9.22.0; update the
condition and message to require cuDNN 9.21.0 instead by changing the numeric
check to effective_cudnn_ver < 92100 and updating the returned string to
"Dropout for unified SDPA node requires cuDNN 9.21.0" so that the check (which
references dropout_probability and effective_cudnn_ver in
sdpa_support_surface.h) accepts 9.21.x; ensure this change is kept consistent
with any other unified-dropout checks in the same file.

In `@test/python/sdpa/fp16.py`:
- Around line 500-515: The backward tensors dQ, dK, dV, dO are being bound to
raw ragged offsets but not given the ragged offset multipliers when
cfg.with_ragged_offset_multiplier is enabled; update the backward ragged setup
(the block that calls dQ.set_ragged_offset, dK.set_ragged_offset,
dV.set_ragged_offset, dO.set_ragged_offset) to also call
set_ragged_offset_multiplier for each of dQ, dK, dV, dO using the same
per-tensor multiplier values that allocate_tensors/forward uses for q, k, v, o
(mirror the calls used for q.set_ragged_offset_multiplier,
k.set_ragged_offset_multiplier, v.set_ragged_offset_multiplier,
o.set_ragged_offset_multiplier) and guard these calls behind
cfg.with_ragged_offset_multiplier.

---

Nitpick comments:
In `@include/cudnn_frontend_utils.h`:
- Around line 2626-2699: The function get_engine_id_and_knobs currently returns
knobs in backend iteration order; make it deterministic by sorting the knobs
vector before returning (so it matches get_engine_tag's behavior): after filling
knobs in get_engine_id_and_knobs, call a sort on knobs using the knob type
(first element of each pair) as the primary key (and knob value as a secondary
key if you want total ordering) so callers receive a consistent, type-ordered
list; keep the rest of the logic intact.

In `@python/cudnn/deepseek_sparse_attention/indexer_forward/api.py`:
- Around line 235-254: Update the docstring for the public wrapper (the function
that contains device_major(), m_block_size, n_block_size, q_stage, kv_stage
checks) to explicitly state two behaviors: (1) on SM90 (device_major() == 9)
non-default tuning knobs (m_block_size, n_block_size, q_stage, kv_stage) are
rejected immediately with ValueError, listing the supported defaults; and (2)
when sequence-length inputs (cu_seqlens_*, i.e. batched variable-length K/V) are
used the returned 'scores' tensor uses THD-shaped layout rather than (B, S_q,
S_k) — document the exact THD ordering and dtype (FP32) and how the causal mask
is applied. Ensure the docstring language mirrors the runtime checks and return
structure so callers see the contract upfront.

In
`@python/cudnn/deepseek_sparse_attention/sparse_attention_backward/_interface_sm100.py`:
- Around line 33-55: The docstring for the function
sparse_attention_backward_sm100 (the FlashAttention (DSA) Backward Pass) is
missing documentation for the new parameter current_stream; update the Args
section to document current_stream: state its type (Optional[torch.cuda.Stream]
or torch.cuda.Stream | None), default None, and briefly describe that it allows
passing a CUDA stream to run the kernel on (used to override the default/current
stream) and that if None the current/default stream is used; keep wording
consistent with other Args entries (type, shape/semantics, default).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0415d2cd-c280-4147-bf30-df0f89341bd9

📥 Commits

Reviewing files that changed from the base of the PR and between 1bcb750 and e8e219d.

⛔ Files ignored due to path filters (75)

benchmark/sdpa_benchmark_training/results/dsv3/gb200/dsv3_20260424_101009.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/dsv3/gb200/dsv3_20260529_181100.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/dsv3/gb200/dsv3_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb200/dsv3_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb200/dsv3_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb200/dsv3_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb300/dsv3_20260424_101002.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/dsv3/gb300/dsv3_20260529_175553.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/dsv3/gb300/dsv3_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb300/dsv3_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb300/dsv3_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/dsv3/gb300/dsv3_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/gpt_oss/gb200/gpt_oss_20260424_100011.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/gpt_oss/gb200/gpt_oss_20260529_180050.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/gpt_oss/gb200/gpt_oss_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/gpt_oss/gb200/gpt_oss_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/gpt_oss/gb300/gpt_oss_20260424_100022.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/gpt_oss/gb300/gpt_oss_20260529_174551.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/gpt_oss/gb300/gpt_oss_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/gpt_oss/gb300/gpt_oss_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/dsv3_20260227_034744.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/dsv3_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/gpt_oss_20260227_034819.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/gpt_oss_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/llama3.1_20260227_034703.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/llama3.1_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/h200_919_only_cudnn/llama3.1_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb200/kimiK26_20260424_100953.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/kimiK26/gb200/kimiK26_20260529_181016.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/kimiK26/gb200/kimiK26_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb200/kimiK26_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb200/kimiK26_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb200/kimiK26_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb300/kimiK26_20260424_100915.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/kimiK26/gb300/kimiK26_20260529_175511.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/kimiK26/gb300/kimiK26_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb300/kimiK26_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb300/kimiK26_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/kimiK26/gb300/kimiK26_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb200/llama3.1_20260424_100750.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/llama3.1/gb200/llama3.1_20260529_180853.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/llama3.1/gb200/llama3.1_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb200/llama3.1_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb200/llama3.1_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb200/llama3.1_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb300/llama3.1_20260424_100757.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/llama3.1/gb300/llama3.1_20260529_175350.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/llama3.1/gb300/llama3.1_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb300/llama3.1_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb300/llama3.1_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/llama3.1/gb300/llama3.1_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/ltx2/gb200/ltx2_20260424_095758.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/ltx2/gb200/ltx2_20260529_181611.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/ltx2/gb200/ltx2_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/ltx2/gb200/ltx2_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/ltx2/gb300/ltx2_20260424_095719.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/ltx2/gb300/ltx2_20260529_180103.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/ltx2/gb300/ltx2_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/ltx2/gb300/ltx2_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/qwen35/gb200/qwen35_20260424_095249.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/qwen35/gb200/qwen35_20260529_180715.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/qwen35/gb200/qwen35_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/qwen35/gb200/qwen35_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/qwen35/gb300/qwen35_20260424_095247.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/qwen35/gb300/qwen35_20260529_175216.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/qwen35/gb300/qwen35_top_left.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/qwen35/gb300/qwen35_top_left_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/wan22/gb200/wan22_20260424_095743.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/wan22/gb200/wan22_20260529_181549.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/wan22/gb200/wan22_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/wan22/gb200/wan22_no_mask_det_overhead.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/wan22/gb300/wan22_20260424_095741.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/wan22/gb300/wan22_20260529_180039.csv is excluded by !**/*.csv
benchmark/sdpa_benchmark_training/results/wan22/gb300/wan22_no_mask.png is excluded by !**/*.png
benchmark/sdpa_benchmark_training/results/wan22/gb300/wan22_no_mask_det_overhead.png is excluded by !**/*.png

📒 Files selected for processing (102)

.coderabbit.yaml
.pre-commit-config.yaml
CMakeLists.txt
README.md
benchmark/sdpa_benchmark_training/ACKNOWLEDGEMENTS.md
benchmark/sdpa_benchmark_training/README.md
benchmark/sdpa_benchmark_training/bench_ar_dit_peak.py
benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py
benchmark/sdpa_benchmark_training/charts.py
benchmark/sdpa_benchmark_training/configs/qwen35.py
cmake/cuDNN.cmake
include/cudnn_frontend/backend/execution_helpers.h
include/cudnn_frontend/cudnn_interface.h
include/cudnn_frontend/experimental/attention_utils.h
include/cudnn_frontend/experimental/sm100_rms_norm_silu_engine.h
include/cudnn_frontend/graph_interface.h
include/cudnn_frontend/graph_properties.h
include/cudnn_frontend/knobs.h
include/cudnn_frontend/node/diagonal_band_mask.h
include/cudnn_frontend/node/moe_grouped_matmul_bwd.h
include/cudnn_frontend/node/reduction.h
include/cudnn_frontend/node/scaled_dot_product_flash_attention.h
include/cudnn_frontend/node/sdpa_fp8_bwd.h
include/cudnn_frontend/node/sdpa_support_surface.h
include/cudnn_frontend/node/softmax.h
include/cudnn_frontend/node_interface.h
include/cudnn_frontend/plans.h
include/cudnn_frontend/utils/attn_score_modifiers.h
include/cudnn_frontend/utils/serialize.h
include/cudnn_frontend_Logging.h
include/cudnn_frontend_Operation.h
include/cudnn_frontend_Tensor.h
include/cudnn_frontend_shim.h
include/cudnn_frontend_utils.h
include/cudnn_frontend_version.h
python/cudnn/__init__.py
python/cudnn/deepseek_sparse_attention/README.md
python/cudnn/deepseek_sparse_attention/indexer_backward/dense_indexer_backward_sm100.py
python/cudnn/deepseek_sparse_attention/indexer_backward/dense_indexer_backward_sm90.py
python/cudnn/deepseek_sparse_attention/indexer_backward/indexer_backward_sm100.py
python/cudnn/deepseek_sparse_attention/indexer_backward/indexer_backward_sm90.py
python/cudnn/deepseek_sparse_attention/indexer_forward/_interface.py
python/cudnn/deepseek_sparse_attention/indexer_forward/_interface_sm90.py
python/cudnn/deepseek_sparse_attention/indexer_forward/api.py
python/cudnn/deepseek_sparse_attention/indexer_forward/indexer_fwd_sm90.py
python/cudnn/deepseek_sparse_attention/indexer_top_k/api.py
python/cudnn/deepseek_sparse_attention/indexer_top_k/indexer_top_k_decode_varlen.py
python/cudnn/deepseek_sparse_attention/indexer_top_k/local_to_global_dsl.py
python/cudnn/deepseek_sparse_attention/score_recompute/_interface_sm100.py
python/cudnn/deepseek_sparse_attention/score_recompute/_interface_sm90.py
python/cudnn/deepseek_sparse_attention/score_recompute/dense_score_recompute_sm90.py
python/cudnn/deepseek_sparse_attention/score_recompute/pack_gqa.py
python/cudnn/deepseek_sparse_attention/score_recompute/sparse_score_recompute_sm100.py
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/_interface_sm100.py
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/_interface_sm90.py
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/api.py
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm100.py
python/cudnn/deepseek_sparse_attention/sparse_attention_backward/dsa_bwd_sm90.py
python/cudnn/gemm_swiglu/dense_gemm_persistent_swiglu.py
python/cudnn/grouped_gemm/grouped_gemm_dglu/api.py
python/cudnn/grouped_gemm/grouped_gemm_dglu/moe_blockscaled_grouped_gemm_dglu_dbias.py
python/cudnn/grouped_gemm/grouped_gemm_dsrelu/api.py
python/cudnn/grouped_gemm/grouped_gemm_quant/api.py
python/cudnn/grouped_gemm/grouped_gemm_quant/grouped_gemm_quant.py
python/cudnn/grouped_gemm/moe_sched_extension.py
python/properties.cpp
python/pygraph/pygraph.cpp
python/pygraph/pygraph.h
python/pygraph/sdpa.cpp
samples/cpp/CMakeLists.txt
samples/cpp/matmul/blackwell_nvfp4_mxfp8_block_scale_matmul.cpp
samples/cpp/matmul/matmuls.cpp
samples/cpp/membound/boolean_fusion.cpp
samples/cpp/membound/concat.cpp
samples/cpp/membound/membound_fusion.cpp
samples/cpp/membound/reshape.cpp
samples/cpp/membound/slice.cpp
samples/cpp/membound/transpose.cpp
samples/cpp/misc/compile_time_constant_example.cpp
samples/cpp/moe_grouped_matmul/moe_grouped_matmul.cpp
samples/cpp/sdpa/fp16_bwd_with_flexible_graphs.cpp
samples/cpp/sdpa/fp16_dynamic_shapes.cpp
samples/cpp/sdpa/fp16_fwd_with_cu_seq_len.cpp
samples/python/70_boolean_cmp_logic.ipynb
test/cpp/CMakeLists.txt
test/cpp/get_engine_and_knobs.cpp
test/cpp/tensor.cpp
test/python/conftest.py
test/python/fe_api/dsa/dsa_reference.py
test/python/fe_api/dsa/test_DSA_indexer_forward.py
test/python/fe_api/dsa/test_DSA_indexer_top_k.py
test/python/fe_api/test_grouped_gemm_quant.py
test/python/fe_api/test_grouped_gemm_quant_utils.py
test/python/fe_api/test_sdpa_bwd.py
test/python/sdpa/blocked.py
test/python/sdpa/fp16.py
test/python/sdpa/fp8.py
test/python/sdpa/random_config.py
test/python/test_block_scale_quantize_dynamic_shape.py
test/python/test_matmul_bias_relu.py
test/python/test_mhas_v2.py
test/python/test_moe_grouped_matmul.py

💤 Files with no reviewable changes (4)

test/python/fe_api/test_sdpa_bwd.py
test/python/sdpa/blocked.py
python/cudnn/deepseek_sparse_attention/indexer_top_k/local_to_global_dsl.py
benchmark/sdpa_benchmark_training/charts.py

coderabbitai · 2026-06-10T19:07:10Z

 from .datatypes import _library_type, _is_torch_tensor

-__version__ = "1.24.1"
+__version__ = "1.25.0"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the version export in python/cudnn/__init__.py echo "---- python/cudnn/__init__.py (lines ~1-120) ----" nl -ba python/cudnn/__init__.py | sed -n '1,140p' # Show the exact referenced line(s) echo "---- python/cudnn/__init__.py (lines 45-80) ----" nl -ba python/cudnn/__init__.py | sed -n '45,90p' # Check whether test/python/fe_api exists and list it echo "---- Listing test/python/fe_api ----" ls -R test/python/fe_api 2>/dev/null || echo "test/python/fe_api not found" # Search for any references to cudnn.__version__ or __version__ usage echo "---- Searching for cudnn.__version__ usage ----" rg -n "cudnn\.__version__|__version__\s*=" -S . || true # Search for uses of "1.25.0" / "1.25.0 rc" / "rc" around version strings echo "---- Searching for 1.25.0 / rc references ----" rg -n "1\.25\.0|1\.25\.0\s*rc|rc" -S python test . || true

Repository: NVIDIA/cudnn-frontend

Length of output: 176

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "---- python/cudnn/__init__.py (lines 1-140) ----" cat -n python/cudnn/__init__.py | sed -n '1,140p' echo "---- python/cudnn/__init__.py (lines 45-90) ----" cat -n python/cudnn/__init__.py | sed -n '45,90p' echo "---- Listing test/python/fe_api ----" if [ -d test/python/fe_api ]; then ls -R test/python/fe_api else echo "test/python/fe_api not found" fi echo "---- Searching for cudnn.__version__ usage ----" rg -n "cudnn\.__version__" -S . || true echo "---- Searching for __version__ assignment in python/cudnn ----" rg -n "__version__\s*=" -S python/cudnn || true echo "---- Searching for 1.25.0 / rc / 'rc' around version strings ----" rg -n "1\.25\.0|1\.25\.0\s*rc|release candidate|rc" -S . || true

Repository: NVIDIA/cudnn-frontend

Length of output: 50379

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "---- test/python/fe_api references to __version__ ----" rg -n "__version__|cudnn\.__version__|version\s*=" test/python/fe_api || true echo "---- pyproject.toml version metadata ----" rg -n "^\s*version\s*=|cudnn\.__version__" pyproject.toml echo "---- Search for 1.25.0 rc naming in repo (narrow) ----" rg -n "1\.25\.0.*(rc|RC|release candidate)" -S python test . || true echo "---- Search for any 'rc' in version-related files only ----" rg -n "(__version__|version\s*=|release candidate|rc)" pyproject.toml python/cudnn test/python/conftest.py test/cpp/version.cpp || true

Repository: NVIDIA/cudnn-frontend

Length of output: 50378

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "---- Search ragged_offset_multiplier in test/python/fe_api ----" rg -n "ragged_offset_multiplier" test/python/fe_api || true echo "---- Search ragged_offset_multiplier in test/python (broader) ----" rg -n "ragged_offset_multiplier" test/python || true echo "---- Search ragged_offset_multiplier in python/cudnn (broader) ----" rg -n "ragged_offset_multiplier" python/cudnn || true

Repository: NVIDIA/cudnn-frontend

Length of output: 1831

Use an RC version string instead of GA in cudnn.__version__

python/cudnn/__init__.py sets __version__ = "1.25.0", and pyproject.toml derives the package version from cudnn.__version__, so prerelease artifacts will be indistinguishable from the final 1.25.0 release.

ragged_offset_multiplier is documented/forwarded in _tensor, but there’s no coverage for it under test/python/fe_api (matches only appear in other test/python tests, e.g. test/python/test_mhas_v2.py).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudnn/__init__.py` at line 56, The package version in python/cudnn/__init__.py currently sets __version__ = "1.25.0" which will make prerelease artifacts indistinguishable from the GA release—change __version__ to a prerelease string (e.g., "1.25.0rc0" or similar RC formatting used by your release process) so pyproject.toml-derived builds are clearly RCs; additionally, add test coverage for the forwarded symbol ragged_offset_multiplier from _tensor by adding matching assertions or a small unit test in the test/python/fe_api suite (or the appropriate fe_api test file) that imports the symbol and verifies its presence and expected behavior to ensure it’s exercised by the fe_api tests.

coderabbitai · 2026-06-10T19:07:10Z

    cu_seqlens_k: Optional[torch.Tensor] = None,
    max_seqlen_q: Optional[int] = None,
    max_seqlen_k: Optional[int] = None,
+    current_stream=None,


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The public indexer-forward docs are missing the new runtime contract. Both entry points added or tightened caller-visible constraints, but the docstrings still hide required stream, thread-count, head-count, and THD-shape requirements. Please document the full contract in both interfaces. As per coding guidelines, python/cudnn/**: "Focus on documentation."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudnn/deepseek_sparse_attention/indexer_forward/_interface.py` at line 47, Update the public docstrings for the indexer-forward interfaces to include the full runtime contract: explicitly state that callers must provide a CUDA stream (current_stream) or that a default stream will be used, the required thread-count and head-count values, and the expected THD tensor shapes/layouts and dtype constraints; update both entry-point docstrings (the function accepting the current_stream parameter and the paired public interface) to list required invariants, valid ranges, and what errors are raised when constraints are violated so callers know the exact runtime requirements.

Source: Coding guidelines

vedaanta and others added 30 commits May 20, 2026 21:12

Fix DSA link in README.md

7ee0491

Updated the link for DSA in the README to point to the correct directory.

Remove stale H200 benchmark artifacts (#252)

b4fb358

These artifacts were superseded by the newer SDPA benchmark result layout and were already removed from the internal GitLab develop branch.

Change profile_pass from 'fwd' to 'both'

cb74feb

Bump the develop to 1.25.0

8ec19b9

Add row-scale support to grouped GEMM quant

02d2471

Signed-off-by: Ziang Li <ziangli@umich.edu>

Tighten row-scale grouped GEMM quant tests

1f8cde3

Signed-off-by: Ziang Li <ziangli@umich.edu>

Update SDPA Benchmarking Artifacts (#265)

1ad529d

* update sdpa benchmark artifacts * update acknowledgement

Adding coderabbit review guide (initial template)

e851cc8

remove_9.99_version_tag

4d65508

add_protection_flags

69e5b42

Unblock SDPA tests and promote FP8 ragged backward to L0 (#275)

f90a2ec

* Promote L1 Python tests to L0 * Restore L1 markers except FP8 ragged backward

Fix the 9.99 bound

035b520

1

3542eb0

Add pre-commit hooks (#286)

28b5837

* Fix clang format issues * Fix clang-format * Add pre-commit hooks and fix pre-commit * Fix the black issues

Support cu_seqlens in unified SDPA (#266)

2965e7a

use static signature for sfd_col_d_srelu_tensor (#281)

87f571b

Signed-off-by: Jieming Zhang <jiemingz@nvidia.com>

Fix formatting issues from #263 (#294)

1f837f9

saltyminty and others added 6 commits June 8, 2026 22:40

make dgeglu config values compile time constants instead of runtime v…

601b84d

…alues (#293)

- Update the Black version. (#296)

d6e3edc

- Fix the formatting issues in grouped_gemm_dglu/api.py

Merge develop into 1.25.0-rc

259e895

Fix unused ragged offset version error variable (#299)

e8e219d

`NV_CUDNN_FE_DYNAMIC_CHECK_BACKEND_DESCRIPTOR` expands to nothing when `NV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING` is not defined. So, the variable `ragged_offset_multiplier_cudnn_ver_error` may be unused.

Anerudhan requested review from hwanseoc and saltyminty June 10, 2026 18:35

Anerudhan added this to the Frontend 1.25.0 milestone Jun 10, 2026

Anerudhan marked this pull request as ready for review June 10, 2026 19:03

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

hwanseoc approved these changes Jun 10, 2026

View reviewed changes

Anerudhan merged commit e46d708 into main Jun 10, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1.25.0 rc#300

1.25.0 rc#300
Anerudhan merged 36 commits into
mainfrom
1.25.0-rc

Anerudhan commented Jun 10, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Review skipped

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 10, 2026

Uh oh!

coderabbitai Bot Jun 10, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

14 participants

Conversation

Anerudhan commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

cuDNN Frontend v1.25.0 Release Notes

Updates to Graph API 🚀 🚀

SDPA

Structured plan pinning

Reduction

Open-Source Kernels 🚀 🚀

General Improvements ✨✨

Bug Fixes 🐛

Samples

Benchmarking 📊

Acknowledgements

External contributors

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

14 participants

Anerudhan commented Jun 10, 2026 •

edited

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading