Sync origin/main with upstream/main #16

Open

murphymatt wants to merge 137 commits into main from sync-with-upstream-main

Conversation

@murphymatt (Contributor)

Rebases origin/main onto upstream/main to resync with the flashinfer-ai/flashinfer repository.

bkryu and others added 30 commits November 12, 2025 20:17
<!-- .github/pull_request_template.md -->

## 📌 Description
Brings in some changes to `test_hopper.py` to pass more unit tests

* `test_deepseek_prefill` --> Raise tolerance for bf16 inputs
* Others: the argument
```
token_pos_in_items_len=torch.tensor(token_pos_in_items_len)
.to(dtype=torch.uint32)
.to(0),
```
is incorrect API usage and results in invalid-input errors. Change it to
`token_pos_in_items_len=token_pos_in_items_len,` so that it matches the
correct usage in e.g.
[test_batch_prefill_kernels.py](https://github.com/flashinfer-ai/flashinfer/blob/6765cadd14fbedc9ffab428a87149a7d3f5d69f1/tests/attention/test_batch_prefill_kernels.py#L890)
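
For illustration, a minimal sketch of the type difference between the two
call styles (the value `128` is made up; only how the argument is built
matters, and the original code also moved the tensor to device 0):

```
import torch

token_pos_in_items_len = 128  # illustrative value; the tests pass a plain int

# Old (incorrect) style: wraps the scalar into a uint32 tensor, which the
# kernel rejects as invalid input.
wrong = torch.tensor(token_pos_in_items_len).to(dtype=torch.uint32)
print(type(wrong))  # <class 'torch.Tensor'>

# Fixed style: forward the plain int, matching test_batch_prefill_kernels.py.
right = token_pos_in_items_len
print(type(right))  # <class 'int'>
```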

After this, the `test_hopper.py` results improve to `3 failed, 2865 passed,
1320 skipped in 65.26s (0:01:05)`.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
<!-- .github/pull_request_template.md -->

## 📌 Description

Add Thor and Spark support when generating wheels.

## 🔍 Related Issues

The output reports that these architectures are not compatible; they currently work only with JIT.



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Broadened GPU architecture support to include additional newer
architectures.

* **Documentation**
* Updated README and installation docs to show the revised CUDA
architecture example list.

* **Chores**
* Adjusted release/nightly workflows and build scripts to select
architectures using an expanded CUDA-version threshold and branching
logic.

* **Performance**
* Extended architecture-specific build/runtime handling to cover an
additional GPU architecture affecting memory-related behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description
Deprecate `tile_token_dim` in trtllm_moe. It is already unused and is now
marked with a deprecation warning; the plan is to remove it entirely in the
next major release.
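
A minimal sketch of the deprecation pattern described above, under a
hypothetical wrapper name (the real function and its other arguments live in
flashinfer):

```
import warnings
from typing import Optional

def trtllm_moe_sketch(hidden_states, tile_token_dim: Optional[int] = None, **kwargs):
    # Hypothetical wrapper: only the deprecation handling is sketched here.
    if tile_token_dim is not None:
        warnings.warn(
            "tile_token_dim is unused and will be removed in the next major release",
            DeprecationWarning,
            stacklevel=2,
        )
    # ... kernel dispatch continues without tile_token_dim
```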
<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Removed the deprecated `tile_tokens_dim` parameter from MOE benchmarks
and kernel functions, streamlining API calls and eliminating associated
deprecation warnings.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

Co-authored-by: @Edenzzzz 

## 📌 Description
Fixes flashinfer-ai/flashinfer#1022. Unlike
flashinfer-ai/flashinfer#1231, this splits the inputs into separate
prefill and decode inputs. It should probably be possible to handle this
splitting automatically in Python, so that callers can simply provide a
single batch of requests? (A rough sketch of the splitting follows the
benchmark command below.)

To run the benchmark: `python benchmarks/bench_mixed_attention.py`
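
A rough sketch of the splitting idea mentioned above, assuming requests are
represented as (qo_len, kv_len) pairs (the helper is illustrative, not the
benchmark's actual API):

```
def split_mixed_batch(requests):
    """Split a mixed batch into prefill and decode sub-batches.

    A request with a single new query token is treated as decode; anything
    longer is treated as prefill. `requests` is assumed to be a list of
    (qo_len, kv_len) pairs, as in the benchmark scenarios below.
    """
    prefill = [r for r in requests if r[0] > 1]
    decode = [r for r in requests if r[0] == 1]
    return prefill, decode

# Example mirroring Benchmark 1: 2 prefill requests + 128 decode requests.
requests = [(2048, 2048)] * 2 + [(1, 2048)] * 128
prefill, decode = split_mixed_batch(requests)
assert len(prefill) == 2 and len(decode) == 128
```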

Performance:
===== Benchmark 1: (kv_len, qo_len) set =====
Prefill = 2 requests, 2048 Q len, 2048 KV len
Decode = 128 requests, 2048 KV len
Elapsed time (Batched Prefill): 0.65 ms
Elapsed time (Batched POD Attention): 0.46 ms
Elapsed time (Persistent BatchAttention): 0.56 ms
**Batch POD speedup over Persistent BatchAttention: 1.22x**

===== Benchmark 2: (kv_len, qo_len) set =====
Prefill = 1 request, 2048 Q len, 2048 KV len
Decode = 128 requests, 2048 KV len
Elapsed time (Batched Prefill): 0.55 ms
Elapsed time (Batched POD Attention): 0.41 ms
Elapsed time (POD Attention): 0.41 ms
Elapsed time (Sequential two kernels): 0.51 ms
Elapsed time (Persistent BatchAttention): 0.45 ms
**Batch POD speedup over Persistent BatchAttention: 1.11x**

===== Benchmark 3: (kv_len, qo_len) set =====
Prefill = 1 request, 4096 Q len, 4096 KV len
Decode = 128 requests, 4096 KV len
Elapsed time (Batched Prefill): 1.27 ms
Elapsed time (Batched POD Attention): 0.86 ms
Elapsed time (POD Attention): 0.82 ms
Elapsed time (Sequential two kernels): 1.15 ms
Elapsed time (Persistent BatchAttention): 1.08 ms
**Batch POD speedup over Persistent BatchAttention: 1.26x**

===== Benchmark 4: (kv_len, qo_len) set =====
Prefill = 1 request, 4096 Q len, 4096 KV len
Decode = 128 requests, 8192 KV len
Elapsed time (Batched Prefill): 2.15 ms
Elapsed time (Batched POD Attention): 1.52 ms
Elapsed time (POD Attention): 1.54 ms
Elapsed time (Sequential two kernels): 1.82 ms
Elapsed time (Persistent BatchAttention): 1.76 ms
**Batch POD speedup over Persistent BatchAttention: 1.16x**

===== Benchmark 5: (kv_len, qo_len) set =====
Prefill = 1 request, 6000 Q len, 7000 KV len
Decode = 128 requests, 8192 KV len
Elapsed time (Batched Prefill): 2.86 ms
Elapsed time (Batched POD Attention): 2.03 ms
Elapsed time (POD Attention): 1.95 ms
Elapsed time (Sequential two kernels): 2.52 ms
Elapsed time (Persistent BatchAttention): 2.45 ms
**Batch POD speedup over Persistent BatchAttention: 1.20x**


## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a batched prefill+decode attention path with a public
batch-oriented POD wrapper and JIT module export.

* **Performance**
* Benchmarks extended to include batched-path timings, memory bandwidth,
elapsed-time and comparative speedup metrics across expanded
prefill/decode scenarios.

* **API**
* Runtime binding for batched KV‑cache execution added; planning APIs
now accept an optional colocated-CTA parameter that influences
scheduling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aditya K Kamath <akamath1997@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
<!-- .github/pull_request_template.md -->

## 📌 Description

Patch sm103 for 3xfp4 moe generation

## 🔍 Related Issues

Follow-up to #2020 and #1925.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

```
$ ls csrc/nv_internal/tensorrt_llm/cutlass_instantiations/103/gemm_grouped
100  103  80

$ pytest tests/moe/test_trtllm_cutlass_fused_moe.py
22 passed, 3 skipped, 1 warning in 771.89s (0:12:51)
```


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added support for Blackwell (SM103) GPU architecture in MOE (Mixture
of Experts) operations with specialized CUTLASS-optimized modules.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

This PR does two things (both sketched below):
* Adds a check on the number of tokens and raises an exception if the max
token count is exceeded
* Adds an optional parameter to allow users to dial in an arbitrary
workspace size
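
A minimal sketch of both changes under assumed names (`MAX_TOKEN_NUM`, the
default size, and the function name are illustrative, not the PR's actual
constants or entry point):

```
import torch

MAX_TOKEN_NUM = 2048                        # illustrative limit
DEFAULT_WORKSPACE_BYTES = 16 * 1024 * 1024  # illustrative default

def all_reduce_sketch(inp: torch.Tensor,
                      workspace_bytes: int = DEFAULT_WORKSPACE_BYTES):
    # New check: enforce a 2D [tokens, hidden] input and a token-count limit.
    if inp.dim() != 2:
        raise ValueError(f"expected a 2D [tokens, hidden] tensor, got {inp.dim()}D")
    if inp.shape[0] > MAX_TOKEN_NUM:
        raise ValueError(
            f"token count {inp.shape[0]} exceeds the maximum of {MAX_TOKEN_NUM}; "
            "reduce the batch or size the workspace accordingly"
        )
    # New optional parameter: callers may dial in an arbitrary workspace size.
    workspace = torch.empty(workspace_bytes, dtype=torch.uint8, device=inp.device)
    # ... kernel launch would use `workspace` here
```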

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added an optional configurable workspace buffer size for all-reduce
operations with a sensible default to preserve backwards compatibility.
* Runtime input validation now enforces 2D inputs and token-count
limits, with clearer error messages guiding corrective actions.

* **Tests**
* Expanded test coverage for workspace behavior: default sizing,
explicit sizing, and negative tests for insufficient workspace.
* Tests now allow supplying an explicit workspace size to validate
allocation and reuse scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

- Small optimization for TRT-LLM Gen MoE finalize kernel

TopK=8, NumExperts=128, HiddenSize=4096

| BS | Baseline, us | Optimized, us | Speed-up |
| ------------- | ------------- | ------------- | ------------- |
| 256  | 11  | 6  | 1.83 |
| 512  | 12  | 7  | 1.71 |
| 1024 | 16  | 15  | 1.06 |
| 4096  | 55 | 49  | 1.12 |
| 8192 | 107 | 95  | 1.13 |
| 16384  | 205  | 183  | 1.12 |

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Enabled vectorized, Top-K unrolled finalize path for MOE (Mixture of
Experts) kernel operations with improved performance.
* Added support for multiple data types (bfloat16, float, half) with
enhanced type specialization and packing.
* Introduced runtime validation for TopK configurations (≤ 64) to ensure
optimal vectorized execution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Refactor fused_moe test.

Split the tests by model and precision.

Part [1]:
- test deepseek (kimi, lite) fp8 block-scaled fused moe
- default TP8
- PDL enabled
- MajorK weight layout
- higher tolerance and matching percentage

Next Part [2]:
- add BlockMajorK weight layout

Next Part [x]:
- Per-tensor FP8 MoE, FP4 MoE

Later:
- refactor llama4, topk?, renormalize? routing tests

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added a comprehensive FP8 block-scale fused Mixture-of-Experts test
validating end-to-end correctness across many routing, expert and
precision configurations. Includes randomized inputs,
per-token/per-expert workflows, extensive parameterizations, diagnostic
statistics, autotune-path checks, and a minimal sanity run.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Duplicate of #2091; created this PR from flashinfer-ai to enable the workflow.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Corrected CUDA compute capability targeting from 11.0f to 11.0a for
improved compatibility across build configurations.

* **Documentation**
* Updated installation and build documentation to reflect updated CUDA
architecture configurations for both older and newer CUDA versions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

The `enablePDL` flag was set to false; this PR turns it on. It is set to
true for both sm_100 and sm_120, since both architectures should support
PDL.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Updated runtime configuration for FP4 GEMM operations to enhance
execution performance on SM100 and SM120 GPU architectures.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

This PR updates the CODEOWNERS file based on git commit history analysis
from the last 180 days.

## Changes

- Updated `.github/CODEOWNERS` with current code ownership based on:
  - Commit frequency
  - File coverage
  - Commit recency

## How to Review

1. Review the changes to `.github/CODEOWNERS`
2. Verify that the assigned owners are appropriate for each module
3. Make manual adjustments if needed before merging

## Notes

- This is an automated PR generated weekly
- Minimum commits threshold: 1
- Analysis period: 180 days
- Directory depth: 3 levels
- Top N owners per module: 5

---

🤖 This PR was automatically generated by the [update-codeowners
workflow](.github/workflows/update-codeowners.yml)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Chores**
  * Internal maintenance updates to code ownership mappings.

---

**Note:** This release contains no user-facing changes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: flashinfer-bot <flashinfer-bot@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
…sed RoPE + Q + KV cache, supports MLA/GQA/MHA) (#2037)

<!-- .github/pull_request_template.md -->

## 📌 Description

Add `flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache`, which
runs a fused RoPE + Quantization (16 -> 8) + append KV Cache operation
kernel.

Note that quantization here is not optional (there is no fused "RoPE +
append KV Cache" operation without quantization).
Tested on NVIDIA H100 NVL + flashinfer/flashinfer-ci-cu130:latest for
MLA/MHA/GQA problem sizes for decode and prefill cases.
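
As a point of reference, the unfused sequence the kernel replaces can be
written in plain PyTorch; this sketch simplifies shapes and the paged layout
and is not the kernel's actual signature:

```
import torch

def rope_quantize_append_reference(q, k, v, cos, sin, k_cache_fp8, v_cache_fp8, pos):
    """Unfused reference of RoPE -> FP8 quantize -> KV-cache append.

    Shapes and the paged layout are simplified to flat [tokens, dim] caches;
    the real kernel fuses these three passes to avoid extra global-memory
    round trips.
    """
    def rope(x):
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        return x * cos + rotated * sin

    q_rot, k_rot = rope(q), rope(k)
    # 16-bit -> 8-bit quantization (per-tensor scale of 1.0 for brevity).
    q_fp8 = q_rot.to(torch.float8_e4m3fn)
    k_fp8 = k_rot.to(torch.float8_e4m3fn)
    v_fp8 = v.to(torch.float8_e4m3fn)
    # Append the new tokens into the (flattened) caches.
    k_cache_fp8[pos : pos + k.shape[0]] = k_fp8
    v_cache_fp8[pos : pos + v.shape[0]] = v_fp8
    return q_fp8
```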

## 🔍 Related Issues

"[Model Optimization] Add RoPE, RoPE+Q, RoPE+Q+KVCacheUpdate fused
kernels for MLA/GQA/MHA" item from Q4 roadmap:
flashinfer-ai/flashinfer#1770.

This PR is part 2 to earlier PR for RoPE + Q:
flashinfer-ai/flashinfer#1924

FW Stakeholders: @nvpohanh @pavanimajety 

## 🧪 Test results

```
$ pytest tests/attention/test_rope.py::test_rope_quantize_fp8_append_paged_kv_cache_decode -s
======================================================== test session starts =========================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /workspace/flashinfer
configfile: pytest.ini
collected 384 items

tests/attention/test_rope.py ................................................................................................................................................................................................................................................................................................................................................................................................

======================================================== 384 passed in 35.22s ========================================================
```

```
$ pytest tests/attention/test_rope.py::test_generalized_rope_quantize_append_kv_cache -s
======================================================== test session starts =========================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /workspace/flashinfer
configfile: pytest.ini
collected 1248 items

tests/attention/test_rope.py .........................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
.......................................................................

================================================== 1248 passed in 63.07s (0:01:03) ===================================================
```

```
$ python benchmarks/bench_rope_quantize_fp8_append_cache.py

Detected GPU: NVIDIA GB200
Theoretical Peak Memory Bandwidth: 7928.06 GB/s


====================================================================================================
  MLA: 128 Q heads, 1 K head, 64+512 dims (DeepSeek-style)
====================================================================================================
Tokens     Time (ms)    BW (GB/s)    BW% (Peak)     TFLOPs
----------------------------------------------------------------------
1          0.00258      86.53        1.1            0.010
32         0.00381      1873.82      23.6           0.208
128        0.00763      3744.50      47.2           0.416
384        0.01848      4637.34      58.5           0.515
768        0.03694      4639.75      58.5           0.515
1024       0.04879      4683.57      59.1           0.520
2048       0.09590      4766.09      60.1           0.529
4096       0.19031      4803.27      60.6           0.533
8192       0.38523      4745.78      59.9           0.527

====================================================================================================
  GQA: 32 Q heads, 8 K heads, 64+64 dims (Llama-style)
====================================================================================================
Tokens     Time (ms)    BW (GB/s)    BW% (Peak)     TFLOPs
----------------------------------------------------------------------
1          0.00294      6.36         0.1            0.003
32         0.00316      189.48       2.4            0.078
128        0.00317      755.23       9.5            0.310
384        0.00398      1803.09      22.7           0.741
768        0.00522      2750.51      34.7           1.130
1024       0.00617      3100.80      39.1           1.274
2048       0.00927      4130.83      52.1           1.697
4096       0.01631      4695.01      59.2           1.929
8192       0.03466      4418.01      55.7           1.815

====================================================================================================
  MHA: 32 Q heads, 32 K heads, 64+64 dims (Standard)
====================================================================================================
Tokens     Time (ms)    BW (GB/s)    BW% (Peak)     TFLOPs
----------------------------------------------------------------------
1          0.00293      12.68        0.2            0.004
32         0.00313      379.98       4.8            0.126
128        0.00357      1331.80      16.8           0.441
384        0.00517      2756.73      34.8           0.912
768        0.00742      3840.41      48.4           1.271
1024       0.00887      4287.15      54.1           1.419
2048       0.01504      5055.18      63.8           1.673
4096       0.03343      4548.12      57.4           1.505
8192       0.06410      4744.76      59.8           1.571

====================================================================================================
Configuration details:
  Page size: 32, Batch size: 4
  Token range: 1 (single decode) → 8192 (large prefill)
  GPU: NVIDIA GB200
  Theoretical Peak Memory Bandwidth: 7928.06 GB/s
  BW% calculated as: (achieved_bandwidth / peak_bandwidth) * 100
====================================================================================================

```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Fused RoPE + FP8 quantize-and-append for paged KV caches (MLA,
GQA/MHA) with layout, page-size, interleave and PDL options; returns
quantized Q outputs and writes K/V into paged caches; public ops and
high-level API added.

* **Tests**
* Deterministic, parameterized tests for append and decode/continuation
across attention types, layouts, dtypes and quant settings with
reference validation.

* **Benchmarks**
* New benchmark script for performance, bandwidth and Nsight profiling
of the paged-KV quantize+append path.

* **Chores**
  * Added cached GPU memory-bandwidth utility for benchmarks.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
…tion (#2084)

<!-- .github/pull_request_template.md -->

## 📌 Description

- Change `bmm1_scale` and `bmm2_scale` to `Union[float, torch.Tensor]`.
Note that when a tensor is used, the log2e factor must already be applied
to it.
- **remove the `bmm1_scale_log2_tensor` and `bmm2_scale_tensor` in the
`xqa_batch_decode_with_kv_cache_mla`**
- update trtllm-gen FMHA kernels

TODO: do the same refactor for the xqa kernels. Support for the device-side
scales was removed in #2033.
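
A small sketch of preparing a tensor-valued scale under the new convention
(values illustrative; per the note above, the log2e factor must be folded in
when passing a tensor):

```
import math
import torch

LOG2E = math.log2(math.e)

bmm1_scale = 1.0 / math.sqrt(128)  # e.g. 1/sqrt(head_dim)

# Float path: pass the raw scale directly.
scale_as_float = bmm1_scale

# Tensor path: log2e must already be applied (a 1-element device tensor in
# real use; kept on CPU here so the sketch runs anywhere).
scale_as_tensor = torch.tensor([bmm1_scale * LOG2E], dtype=torch.float32)
```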

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Attention scale parameters now accept either floats or 1-element
tensors across prefill, decode and runtime; tensor scales are validated
and applied on-device and pointer-backed scale paths are supported.

* **Chores**
* Updated FMHA artifact path and checksum constants; added a public
utility import and removed an obsolete inline comment.

* **Tests**
* Updated tests to exercise device/tensor-or-scalar scale flows, removed
legacy per-tensor call-site args, and added device-scale parametrization
for several test variants.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Add shuffling and the blockmajorK weight layout to the dpskv3 fused_moe
fp8_blockscaled tests.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Expanded MoE test suite with per-expert weight shuffling, optional
block-layout conversion, selectable weight-processing modes, and dynamic
kernel flags.
* Added a reference FP8 block-scale validation path and centralized
accuracy checks for clearer correctness verification.
* **Refactor**
* Centralized test utilities: quantization mode and test-skip logic
moved into shared helpers for consistent gating across MoE tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Added a DSR1 MLA test and split up the trtllm_batch_decode_mla function.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Improved test suite for batch decoding by making maximum sequence
length configurable, adding parameterized runs across short and long
lengths, and introducing a compatibility wrapper to preserve legacy
behavior. This enhances coverage and validation across varied
sequence-length scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

In #1898, it was raised that trtllm-gen's attention kernels fail for
batch size 1. The prefill kernel was fixed in #1912 and prefill tests
have been enabled.

Further updates to the trtllm-gen kernels have also fixed the decode batch
size 1 issue. This PR re-enables that testing.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Expanded batch_decode test scenarios to cover additional small-batch
and page-size combinations.
* Increased coverage for max_in_kv_len by testing multiple length
options instead of a single value.
* Restored previously marked-as-expected-failure case to run normally,
improving overall test pass coverage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

This PR adds a parameter `return_lse_base_on_e` to control the base of the
LSE returned by MLA. It defaults to `False`, which matches the current
implementation. If `return_lse_base_on_e` is `True`, the final LSE is
multiplied by `loge2` to maintain consistency with the standard softmax and
FA3.
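
For context, an LSE accumulated in base 2 converts to the natural base by
multiplying by ln(2), which is presumably what the `loge2` factor denotes; a
quick numerical check:

```
import math
import torch

scores = torch.randn(8)

# Natural-base LSE, as in standard softmax / FA3.
lse_e = torch.logsumexp(scores, dim=0)

# Base-2 LSE, as attention kernels often accumulate it internally.
lse_2 = lse_e / math.log(2)

# Multiplying by ln(2) recovers the natural-base value.
assert torch.allclose(lse_2 * math.log(2), lse_e)
```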

## 🔍 Related Issues

#2113 

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a run-time option to control whether returned log‑sum‑exp (LSE)
baselines are scaled by ln(2) (default: disabled).

* **Bug Fixes**
* Conditional scaling ensures returned LSE values are consistent when
the option is enabled, improving numerical consistency.

* **Chores**
* The new option is exposed in public APIs and bindings and is
propagated through the execution path.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Update xqa license based on
NVIDIA/TensorRT-LLM#8807
<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues
flashinfer-ai/flashinfer#1977
<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated project licensing to Apache License 2.0 with extended
copyright years through 2025.


<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Attention ops now accept tensor-based per-head scaling (q/kv) in C++
and Python paths, enabling dynamic or per-tensor quantization scales.
  * Python APIs and docs updated to accept float or tensor scales.

* **Tests**
* Batch-decode tests adjusted to use per-sequence cache/block sizing for
more accurate memory dimensioning.

* **Documentation**
  * Docstrings updated to describe tensor-or-scalar scale inputs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

9.0a was accidentally removed from the installation documentation in some
recent PRs.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
…cubin (#2123)

<!-- .github/pull_request_template.md -->

## 📌 Description

The flashinfer-cubin package build failed because flashinfer/utils.py
relies on nvidia-ml-py, which was not specified as part of the package's
build-system requirements.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Added a new build system dependency to support enhanced system
functionality.


<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…udnn' (#1979)

<!-- .github/pull_request_template.md -->

## 📌 Description
Current PR:
* Introduces an `auto` backend to `mm_fp4` that can be autotuned. **It
replaces `cudnn` as the default.**
  * Implementation matches `bmm_fp8`'s auto backend support.
* Allows the `cudnn` backend to be autotuned.
* Added unit test cases for `backend=auto`
 
Behavior of `auto` backend:
* Examines the CUDA and cuDNN versions and calls either the `cutlass` or
`cudnn` kernel backend. The `trtllm` kernel is not considered due to a
non-interchangeable interface with the other backends.
* The `auto` backend therefore only supports inputs runnable by `cutlass`
and/or `cudnn`.
* Non-autotuned behavior (sketched after this list):
* Constructs an ordered list of backends, (cudnn, cutlass) or (cutlass,
cudnn), where the ordering is based on previous microbenchmark study results.
    * If CUDA 12 --> cutlass comes first.
    * If CUDA 13 and cuDNN version < 9.15 --> cutlass comes first.
    * If CUDA 13 and cuDNN version >= 9.15 --> cudnn comes first.
* If a kernel fails its support check, it is removed from the list.
* Autotune behavior:
* If a backend is explicitly provided --> autotunes within that backend.
Same as previous behavior, but autotuning is now also supported for cudnn.
* If `backend='auto'` --> autotunes within and across backends (cudnn &
cutlass) and chooses the best config of the best backend. The `trtllm`
kernel is not considered.
* Many of `mm_fp4`'s helper functions were refactored to enable
cross-backend autotuning, using the cross-backend autotune-enabled
`bmm_fp8` as a reference.
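
A sketch of the non-autotuned ordering heuristic from the list above, with
versions reduced to simple comparable values (illustrative, not the PR's
actual helper):

```
def order_fp4_backends(cuda_major: int, cudnn_version: tuple) -> list:
    """Return candidate backends in preference order.

    Unsupported backends would subsequently be filtered out by support
    checks. Versions are simplified to a major int and a (major, minor)
    tuple for illustration.
    """
    if cuda_major <= 12:
        return ["cutlass", "cudnn"]
    if cudnn_version < (9, 15):
        return ["cutlass", "cudnn"]
    return ["cudnn", "cutlass"]

assert order_fp4_backends(12, (9, 15)) == ["cutlass", "cudnn"]
assert order_fp4_backends(13, (9, 14)) == ["cutlass", "cudnn"]
assert order_fp4_backends(13, (9, 15)) == ["cudnn", "cutlass"]
```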

### Pytest outputs
`pytest tests/gemm/test_mm_fp4.py`
* SM100 (B200) CUDA 13 & cuDNN 9.15: `900 passed, 2532 skipped in
125.19s (0:02:05)`
* SM100 (B200) CUDA 12 & cuDNN 9.15: `900 passed, 2532 skipped in
125.67s (0:02:05)`
* SM120 (RTX 5090) CUDA 13 & cuDNN 9.15: `720 passed, 2712 skipped in
76.50s (0:01:16)`

### Example microbenchmark outputs:
On SM100 (B200) CUDA 13 & cuDNN 9.15
```
flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck
[PERF] cudnn          :: median time 0.018 ms; std 0.000 ms; achieved tflops 3797.932 TFLOPs/sec; achieved tb_per_sec 1.884 TB/sec
[PERF] cutlass        :: median time 0.020 ms; std 0.000 ms; achieved tflops 3440.640 TFLOPs/sec; achieved tb_per_sec 1.707 TB/sec
[PERF] trtllm         :: median time 0.031 ms; std 0.000 ms; achieved tflops 2187.427 TFLOPs/sec; achieved tb_per_sec 1.085 TB/sec
[PERF] auto           :: median time 0.018 ms; std 0.000 ms; achieved tflops 3840.714 TFLOPs/sec; achieved tb_per_sec 1.905 TB/sec
/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[PERF] cudnn          :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
[PERF] auto           :: median time 0.021 ms; std 0.000 ms; achieved tflops 3237.753 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec

## Autotune
/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck --autotune
2025-11-11 23:43:23,715 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:25,789 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:25,790 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:26,251 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:26,251 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:26,327 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:26,327 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:26,335 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.016 ms; std 0.000 ms; achieved tflops 4129.171 TFLOPs/sec; achieved tb_per_sec 2.048 TB/sec
[PERF] cutlass_autotun:: median time 0.019 ms; std 0.000 ms; achieved tflops 3513.845 TFLOPs/sec; achieved tb_per_sec 1.743 TB/sec
[PERF] trtllm_autotune:: median time 0.026 ms; std 0.000 ms; achieved tflops 2613.338 TFLOPs/sec; achieved tb_per_sec 1.296 TB/sec
[PERF] auto_autotune  :: median time 0.016 ms; std 0.000 ms; achieved tflops 4128.768 TFLOPs/sec; achieved tb_per_sec 2.048 TB/sec

/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck --autotune
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
2025-11-11 23:43:37,942 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:43,116 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:43,116 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:43,124 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.020 ms; std 0.000 ms; achieved tflops 3370.154 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
[PERF] auto_autotune  :: median time 0.020 ms; std 0.000 ms; achieved tflops 3370.692 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
```

On SM100 (B200) CUDA 12 & cuDNN 9.15
```
flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck
[PERF] cudnn          :: median time 0.023 ms; std 0.001 ms; achieved tflops 2975.898 TFLOPs/sec; achieved tb_per_sec 1.476 TB/sec
[PERF] cutlass        :: median time 0.020 ms; std 0.000 ms; achieved tflops 3370.423 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
[PERF] trtllm         :: median time 0.031 ms; std 0.000 ms; achieved tflops 2187.427 TFLOPs/sec; achieved tb_per_sec 1.085 TB/sec
[PERF] auto           :: median time 0.020 ms; std 0.000 ms; achieved tflops 3371.229 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
(py312) root@84ef83abb1b5:/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[PERF] cudnn          :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
[PERF] auto           :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec

## Autotune
/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck --autotune
2025-11-11 23:42:43,378 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,451 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:45,451 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,910 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:45,910 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,986 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:45,986 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,993 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.021 ms; std 0.000 ms; achieved tflops 3190.355 TFLOPs/sec; achieved tb_per_sec 1.583 TB/sec
[PERF] cutlass_autotun:: median time 0.019 ms; std 0.000 ms; achieved tflops 3551.330 TFLOPs/sec; achieved tb_per_sec 1.762 TB/sec
[PERF] trtllm_autotune:: median time 0.026 ms; std 0.000 ms; achieved tflops 2621.440 TFLOPs/sec; achieved tb_per_sec 1.300 TB/sec
[PERF] auto_autotune  :: median time 0.019 ms; std 0.000 ms; achieved tflops 3551.628 TFLOPs/sec; achieved tb_per_sec 1.762 TB/sec
flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck --autotune
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
2025-11-11 23:42:55,176 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:58,600 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:58,601 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:58,608 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
[PERF] auto_autotune  :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
```

On SM120 (RTX 5090) CUDA 13 & cuDNN 9.15
```
/flashinfer/benchmarks$ python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck
[INFO] trtllm backend does not support this configuration: BackendSupportedError: mm_fp4 does not support backend 'trtllm' with capability 120
[PERF] cudnn          :: median time 0.058 ms; std 0.000 ms; achieved tflops 1167.143 TFLOPs/sec; achieved tb_per_sec 0.579 TB/sec
[PERF] cutlass        :: median time 0.060 ms; std 0.000 ms; achieved tflops 1135.056 TFLOPs/sec; achieved tb_per_sec 0.563 TB/sec
[PERF] auto           :: median time 0.058 ms; std 0.000 ms; achieved tflops 1158.952 TFLOPs/sec; achieved tb_per_sec 0.575 TB/sec
/flashinfer/benchmarks$ python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: BackendSupportedError: mm_fp4 does not support backend 'trtllm' with capability 120
[PERF] cudnn          :: median time 0.054 ms; std 0.000 ms; achieved tflops 1241.735 TFLOPs/sec; achieved tb_per_sec 0.616 TB/sec
[PERF] auto           :: median time 0.054 ms; std 0.000 ms; achieved tflops 1241.735 TFLOPs/sec; achieved tb_per_sec 0.616 TB/sec
```

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

#1722 
<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * "auto" backend selection for FP4 ops to choose backend at runtime
  * cuDNN, CUTLASS and TRTLLM selectable as FP4 GEMM backends
  * CUDA/cuDNN version awareness to guide auto-backend heuristics

* **Improvements**
* Runtime capability checks replace static backend lists; unsupported
backends are removed dynamically
  * Heuristic-driven auto-backend selection required for automatic mode
* Expanded autotuning/warmup across backends and relaxed FP4 validation
tolerance

* **Tests**
* Tests updated and added to exercise auto-backend scenarios and relaxed
constraints

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

`bench_mm_fp8.py` was not functioning because `res` was being passed as the
fourth positional argument when it should have been given as `out=res`:
```
def mm_fp8(
    a: torch.Tensor,
    b: torch.Tensor,
    alpha: Optional[torch.Tensor] = None,
    out_dtype: torch.dtype = torch.bfloat16,
    out: Optional[torch.Tensor] = None,
    backend: Literal["trtllm_low_latency"] = "trtllm_low_latency",
):
```

Output after fix:
```
flashinfer$ python3 benchmarks/bench_mm_fp8.py 
2025-11-21 09:38:10,084 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,328 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=2560 k=16384 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 6.36 TFLOPs/s over 0.013199 ms, 3.18 TB/s
2025-11-21 09:38:10,551 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,573 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=2560 k=32768 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 7.28 TFLOPs/s over 0.023040 ms, 3.64 TB/s
2025-11-21 09:38:10,671 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,692 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=5120 k=16384 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 8.31 TFLOPs/s over 0.020191 ms, 4.16 TB/s
2025-11-21 09:38:10,789 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,813 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=5120 k=32768 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 9.40 TFLOPs/s over 0.035696 ms, 4.70 TB/s
2025-11-21 09:38:10,918 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,941 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=8192 k=16384 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 9.16 TFLOPs/s over 0.029312 ms, 4.58 TB/s
2025-11-21 09:38:11,045 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:11,072 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=8192 k=32768 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 10.14 TFLOPs/s over 0.052959 ms, 5.07 TB/s
...
```
Also changed the measurement methodology slightly to use CUPTI. The
previous methodology inflated performance numbers because it neither
flushed the L2 cache nor used a rotating buffer to start from a cold
cache. With `enable_cupti=True`, the benchmark should now produce much
more accurate numbers thanks to the L2 flush.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Adjusted benchmark timing settings to shorten warm-up and measurement
durations for faster test runs.
* Enabled CUPTI profiling for more detailed GPU performance metrics in
FP8 matrix-multiplication benchmarks.
* Made non-functional parameter/argument updates and clarifying
comments; no changes to core computation logic.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description
tl;dr: this PR adds a logging system for input/output tracking to
aid debugging of FlashInfer APIs via a `@flashinfer_api` decorator.

**This PR does not label `@flashinfer_api` to every FlashInfer API --
many operations are missing labels. Further labeling is left for
subsequent work.**

This PR introduces a production-ready API logging infrastructure that
tracks function calls, arguments, and return values via a simple
one-line decorator. Any function can be decorated with `@flashinfer_api`
to have its input/output values tracked in the API log.

Key Features:
* Logging level controlled by `FLASHINFER_LOGLEVEL`
* Log destination set by `FLASHINFER_LOGDEST`; defaults to `stdout`
* Zero overhead when disabled (level 0 returns original function) as
seen from `benchmarks/bench_logging_overhead.py`
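
For illustration, a minimal sketch of decorating a function (the import path below is an assumption, not confirmed by this PR):
```python
import torch

# import path is an assumption for illustration
from flashinfer.api_logging import flashinfer_api


@flashinfer_api
def scale_tensor(x: torch.Tensor, factor: float) -> torch.Tensor:
    # inputs and the return value are recorded when FLASHINFER_LOGLEVEL >= 1
    return x * factor
```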

Example usage
```
export FLASHINFER_LOGLEVEL=1
export FLASHINFER_LOGDEST="./flashinfer_api.log"

python3 benchmarks/flashinfer_benchmark.py --routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 fa2_tc cudnn trtllm-gen trtllm-gen-native --page_size 16 --batch_size 1 --s_qo 1 --s_kv 1024 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --random_actual_seq_len -vv --refcheck --q_dtype bfloat16 --kv_dtype bfloat16
```
produces log
```
================================================================================
[2025-11-20 17:51:18] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.5.2
CUDA toolkit version: 13.0
cuDNN version: 91600
Number of GPUs: 1
  GPU 0: NVIDIA B200
    Compute capability: 10.0 (SM100)
PyTorch version: 2.9.0+cu130
================================================================================

[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.plan
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.plan
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.plan
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.run
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.run
...

```

`export FLASHINFER_LOGLEVEL=3` produces:
```
(System Info same as above)
================================================================================
[2025-11-20 17:51:58] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
--------------------------------------------------------------------------------
Positional input arguments:
  arg[0]:
    <flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper object at 0x1234399e3410>
  arg[1]:
    Tensor(
      shape=(134217728,)
      stride=(1,)
      dtype=torch.int8
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  arg[2]:
    'HND'
Keyword input arguments:
  use_cuda_graph=
    True
  use_tensor_cores=
    False
  paged_kv_indptr_buffer=
    Tensor(
      shape=(2,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  paged_kv_indices_buffer=
    Tensor(
      shape=(6,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  paged_kv_last_page_len_buffer=
    Tensor(
      shape=(1,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  backend=
    'fa2'
Default parameters (not explicitly provided):
  jit_args= [DEFAULT]
    None
Output value:
  None
================================================================================
...
```
`export FLASHINFER_LOGLEVEL=5` produces:
```
(System Info same as above)
================================================================================
[2025-11-20 17:52:23] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
--------------------------------------------------------------------------------
Positional input arguments:
  arg[0]:
    <flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper object at 0x7a9fd9a88c0>
  arg[1]:
    Tensor(
      shape=(134217728,)
      stride=(1,)
      dtype=torch.int8
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=0
      max=0
      mean=0.000000
    )
  arg[2]:
    'HND'
Keyword input arguments:
  use_cuda_graph=
    True
  use_tensor_cores=
    False
  paged_kv_indptr_buffer=
    Tensor(
      shape=(2,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=0
      max=6
      mean=3.000000
    )
  paged_kv_indices_buffer=
    Tensor(
      shape=(6,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=0
      max=5
      mean=2.500000
    )
  paged_kv_last_page_len_buffer=
    Tensor(
      shape=(1,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=4
      max=4
      mean=4.000000
    )
  backend=
    'fa2'
Default parameters (not explicitly provided):
  jit_args= [DEFAULT]
    None
Output value:
  None
================================================================================
...
```

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added API logging feature configurable via environment variables
(FLASHINFER_LOGLEVEL for level control, FLASHINFER_LOGDEST for
destination)
* Supports five verbosity levels with function names, inputs, outputs,
metadata, and tensor statistics
  * Zero-overhead operation when disabled

* **Tests**
  * Added comprehensive logging test suite

* **Documentation**
  * Added logging configuration and usage documentation

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

New function to validate that the indices dtype, when provided, is
`int32`. To close
flashinfer-ai/flashinfer#2115.
There are now two separate validation functions in this file. I will
move them to the C++ side later when I have more bandwidth, probably
after Thanksgiving. Just a short fix for now; you can close this if
you'd rather wait for that.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

flashinfer-ai/flashinfer#2115
<!-- Link any related issues here -->

Relevant to the issue. Running the reproducer code from the issue now produces:
```
(flashinfer) raayan@uril-1:~/projects/flashinfer$ python test.py 
tensor([1, 1, 0, 0], device='cuda:0', dtype=torch.int32)
Traceback (most recent call last):
  File "/home/raayan/projects/flashinfer/test.py", line 15, in <module>
    incorrect_samples = flashinfer.sampling.top_k_top_p_sampling_from_logits(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raayan/projects/flashinfer/flashinfer/sampling.py", line 1031, in top_k_top_p_sampling_from_logits
    _check_indices_dtype(indices)
  File "/home/raayan/projects/flashinfer/flashinfer/sampling.py", line 487, in _check_indices_dtype
    raise ValueError(f"indices must have dtype torch.int32, got {indices.dtype}")
ValueError: indices must have dtype torch.int32, got torch.int64
```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Improvements**
* Enforced that indices passed to sampling operations must use int32,
adding runtime validation before sampling.

* **Documentation**
* Clarified docstrings to state the int32 requirement for indices
parameters.

* **Tests**
* Updated and expanded tests to cover the new dtype validation paths and
related error cases.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Raayan Dhar <raayan.dhar@gmail.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Update the autotuner's input tensor random range from [0, 1) to [-5, 5),
giving a larger range that is closer to real tensor values. A minimal
sketch of the new range is shown below.
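
A minimal sketch of drawing uniform values in [-5, 5) with torch (the exact distribution details in the PR may differ):
```python
import torch

shape = (1024, 1024)
# uniform in [-5, 5) instead of torch.rand's default range of [0, 1)
x = torch.rand(shape, device="cuda") * 10.0 - 5.0
```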

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved tensor initialization used during autotuning: values are now
drawn from a symmetric range around zero ([-5, 5]) with a more
uniform-like distribution, yielding more consistent and stable parameter
tuning results.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Enable XQA with speculative decoding and add a mask tensor argument to
`trtllm_batch_decode_with_kv_cache`. A sketch of building such a mask is
shown below.
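
For illustration, one way to build a causal mask for `q_seq_len` speculative tokens (helper name and mask convention are assumptions, not necessarily the PR's exact helper):
```python
import torch

def spec_dec_causal_mask(q_len: int, kv_len: int, device: str = "cuda") -> torch.Tensor:
    """Bool mask of shape (q_len, kv_len): each draft token sees the whole
    existing KV cache plus itself and earlier draft tokens."""
    q_pos = torch.arange(kv_len - q_len, kv_len, device=device).unsqueeze(1)
    kv_pos = torch.arange(kv_len, device=device).unsqueeze(0)
    return kv_pos <= q_pos

mask = spec_dec_causal_mask(q_len=4, kv_len=1024)  # 4 speculative tokens
```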

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Speculative decoding: multi-token query support (q_seq_len) with
optional attention mask threaded end-to-end.

* **API**
* Public APIs updated to accept q_seq_len and an optional mask;
automatic reshaping and runtime checks for multi-token decoding.

* **JIT / Build**
* JIT now emits SPEC_DEC-enabled variants and includes spec-dec flags in
generated specs.

* **Backend / Runtime**
* Mask propagation and architecture-aware backend selection improved for
compatible kernels.

* **Tests**
* Added helpers and tests to generate causal masks and validate
multi-token speculative decoding.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added optional communication-backend parameter for multi-node memory
and buffer allocation to allow using a provided communicator for handle
transfer.

* **Bug Fixes / Reliability**
* Multi-node synchronization now uses the provided communicator's
barrier when available, preserving previous behavior otherwise.

* **Tests**
* Added end-to-end tests covering custom communication backends and
multi-node all-reduce synchronization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Anerudhan and others added 23 commits December 22, 2025 16:43
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Lowered minimum cuDNN version requirement for FP8 support from 9.18.0
to 9.17.1, enabling FP8 functionality on earlier cuDNN versions.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

This is a port of NVIDIA/TensorRT-LLM#9822, which was done by @bobboli.

This feature is necessary for SGLang integration because some DP workers
may have 0 tokens; the workaround of using a dummy token is quite messy
and brittle.

## 🔍 Related Issues

Follow up to flashinfer-ai/flashinfer#2102

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved robustness of mixture-of-experts all-to-all communication to
gracefully handle scenarios with zero local tokens, preventing
synchronization failures and ensuring stable operation in edge cases.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…#2257)

<!-- .github/pull_request_template.md -->

## 📌 Description

Support in-place output updates for `get_batch_indices_positions`. Users
can pre-allocate `batch_indices` and `positions` to avoid additional
copies; see the sketch below.
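
A sketch of the intended usage (the output-buffer keyword names are assumptions about the new API, not confirmed by this PR):
```python
import torch
import flashinfer

seq_lens = torch.tensor([5, 3], dtype=torch.int32, device="cuda")
append_indptr = torch.tensor([0, 5, 8], dtype=torch.int32, device="cuda")
nnz = 8  # total number of appended tokens

# allocate once, reuse across decode steps; keyword names are assumptions
batch_indices = torch.empty(nnz, dtype=torch.int32, device="cuda")
positions = torch.empty(nnz, dtype=torch.int32, device="cuda")
flashinfer.get_batch_indices_positions(
    append_indptr, seq_lens, nnz,
    batch_indices=batch_indices, positions=positions,
)
```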

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Public API now accepts optional pre-allocated output buffers for batch
indices and positions, enabling memory reuse while preserving previous
behavior.
* Pre-allocated buffers are validated for compatibility; automatic
allocation remains the fallback.

* **Documentation**
* Docstrings clarified to describe the new optional outputs, validation
rules (shape/dtype/device), and allocation behavior.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e N is not divisible by ScaleGranularityN. (#2261)

<!-- .github/pull_request_template.md -->

## 📌 Description

The SM120 CUTLASS blockwise gemm kernel requires dimensions like N to be
multiples of 128 due to hardware constraints
(https://github.com/NVIDIA/cutlass/blob/3f4c086d09bd1dc55defb955862f333893bbb28b/include/cutlass/gemm/collective/sm120_mma_tma_blockwise_scaling.hpp#L345C5-L346).

We encountered the shape `a: torch.Size([1, 1, 2688]), b: torch.Size([1, 2688,
10304]), scale_a: torch.Size([]), scale_b: torch.Size([]), out:
torch.Size([1, 1, 10304]), workspace_buffer: torch.Size([33554432])`
from Nemotron-Nano-v3, where 10304 is not a multiple of 128, so the
CUTLASS GEMM does not handle it properly. In this PR, we pad N and slice
the result to make it work, as sketched below.
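
A minimal standalone sketch of the pad-and-slice idea (the actual change lives inside the GEMM wrapper; `a @ b` stands in for the blockwise FP8 kernel):
```python
import torch
import torch.nn.functional as F

GRAN_N = 128  # the SM120 blockwise kernel requires N % 128 == 0

def padded_gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pad N up to a multiple of GRAN_N, run the GEMM, slice the result back."""
    n = b.shape[-1]
    pad = (-n) % GRAN_N
    if pad:
        b = F.pad(b, (0, pad))  # zero-pad the N dimension
    out = a @ b                 # stand-in for the blockwise FP8 GEMM
    return out[..., :n] if pad else out
```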


## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* FP8 matrix operations on SM120/SM121 GPUs now support arbitrary input
dimensions, removing the previous K dimension minimum requirement and
enabling broader use cases.

* **Tests**
* Expanded test coverage for FP8 matrix operations with additional
parameter combinations and improved hardware compatibility validation.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…8_append_paged_kv_cache` (#2255)

<!-- .github/pull_request_template.md -->

## 📌 Description

`rope_quantize_fp8_append_paged_kv_cache` is a merged API combining
`rope_quantize` and `append_paged_kv_cache` (#2037). However, the
`typename IdType` from `RopeQuantize` and `AppendPagedKVCache` should
not be merged into a single parameter, since the two can have different
dtypes: `AppendPagedKVCache`'s `IdType` is hardcoded to `int32`, but
`RopeQuantize`'s `IdType` may be `int64` in frameworks.

This PR splits `typename IdType` into separate `typename RoPEIdType` and
`typename PagedKVIdType` parameters, which fixes the accuracy issue when
passing int64 `pos_ids` (the RoPE-side argument typed with `RoPEIdType`)
to the API.

cc @kahyunnam @yzh119

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Generalized position-encoding and paged KV cache handling to support
multiple integer identifier dtypes and improve type consistency.
* **Bug Fixes**
* Enforced/validated consistent integer dtype for index tensors before
processing to reduce dtype-mismatch errors.
* **Tests**
* Expanded tests to cover different integer index dtypes (e.g., int32
and int64) for ROPE and paged KV workflows.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->
In the QKNorm kernel with small batch sizes, we can reduce the number of
blocks launched. This cuts block-launch overhead, especially in the
decode stage; the sketch below illustrates the grid-size idea.

An example result on B200 where (batch_size, num_heads, head_dim) = (128,
8, 128), which is common in the Qwen3 model decode stage:

Before this PR: 2.448us
After this PR: 1.584us
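
A rough sketch of the launch-count idea (the occupancy numbers and names below are illustrative assumptions, not the kernel's actual code):
```python
# one work item per (token, head) pair; previously the grid launched one
# block per work item, even when that far exceeds what the GPU runs at once
batch_size, num_heads = 128, 8
num_work_items = batch_size * num_heads  # 1024 blocks before this PR

num_sms, blocks_per_sm = 148, 2          # illustrative occupancy figures
grid_size = min(num_work_items, num_sms * blocks_per_sm)
# each block then loops over work items with stride `grid_size`
print(f"launch {grid_size} blocks instead of {num_work_items}")
```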

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Optimized GPU kernel grid size calculation to reduce unnecessary block
launches and improve overall performance efficiency.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Saw some [test
failures](https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/247866505)
on Blackwell boards after #2261; all the failed assertions involve the
large dimension 10304.

Use `.float()` to reduce precision loss during the `cosine_similarity`
(`dot(x, y) / (||x|| * ||y||)`) check; a sketch follows the log below.

```
FAILED tests/gemm/test_bmm_fp8.py::test_bmm_fp8[True-cutlass-res_dtype1-mat2_dtype0-input_dtype0-256-10304-128-16] - AssertionError: assert tensor(0., device='cuda:0') > 0.99
2025-12-24T07:00:08.299846Z 01O FAILED tests/gemm/test_bmm_fp8.py::test_bmm_fp8[False-cudnn-res_dtype1-mat2_dtype0-input_dtype1-256-10304-128-16] - AssertionError: assert tensor(0., device='cuda:0') > 0.99
... # the failure occurs for all backends (cutlass, cudnn, etc.)
```
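
A minimal sketch of this style of check (helper name hypothetical; the key point is casting to fp32 before the reduction):
```python
import torch
import torch.nn.functional as F

def refcheck_cosine(out: torch.Tensor, ref: torch.Tensor, threshold: float = 0.99):
    # cast to fp32 first: accumulating the dot product and norms in low
    # precision loses accuracy when the reduced dimension is large (e.g. 10304)
    sim = F.cosine_similarity(out.float().reshape(-1), ref.float().reshape(-1), dim=0)
    assert sim > threshold, f"cosine similarity {sim:.4f} <= {threshold}"
```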

cc: @zihaoye @bkryu 


## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Improved test accuracy by ensuring tensor comparisons use
floating-point precision for cosine similarity calculations.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Add support for GEMM with MXFP8 (`bmm_mxfp8`).

At this time only cuDNN is supported.

Added test `tests/gemm/test_bmm_mxfp8.py`

Added routine `bmm_mxfp8` to `flashinfer_benchmark`.

Benchmark results for `bmm_mxfp8` (on B200 GPU):
```
python benchmarks/flashinfer_benchmark.py \
--routine bmm_mxfp8 -vv \
--num_iters 30 \
--batch_size 128 \
--m 512 --n 512 --k 4096 \
--out_dtype bfloat16 \
--backends cudnn \
--refcheck

[PERF] cudnn          :: median time 0.117 ms; std 0.001 ms; achieved tflops 2347.650 TFLOPs/sec; achieved tb_per_sec 0.040 TB/sec
```

And `bmm_fp8` for comparison:
```
python benchmarks/flashinfer_benchmark.py \
--routine bmm_fp8 -vv \
--num_iters 30 \
--batch_size 128 \
--m 512 --n 512 --k 4096 \
--input_dtype fp8_e4m3 \
--mat2_dtype fp8_e4m3 \
--out_dtype bfloat16 \
--backends cudnn \
--refcheck

[PERF] cudnn          :: median time 0.116 ms; std 0.001 ms; achieved tflops 2369.049 TFLOPs/sec; achieved tb_per_sec 0.041 TB/sec
```

When profiling with `ncu`, the kernel
`nvjet_sm100_qqtst_128x256_128x6_2x1_2cta_v_bz_Avec32UE8M0_Bvec32UE8M0_NNT`
appears to be the one dispatched.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

flashinfer-ai/flashinfer#2209

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added MXFP8 (mixed 8-bit float) batched matrix multiplication with
cuDNN acceleration and package-level export.

* **Tests**
* Added parameterized tests validating MXFP8 BMM against reference
results across shapes, dtypes, layouts, backends, and autotune modes.

* **Chores**
* Updated benchmark catalog and backend-support mappings to include
MXFP8 BMM.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

I found that the NVFP4 implementation achieved only a 1.3-1.4x speedup
over FP8 on the DeepSeek-V3-0324 model. Since FP4 peak PFLOPS is twice
that of FP8, there should be room for further optimization.

After applying this PR, we get an extra 10-15% speedup on FP4:
1369.89/1192.91 = 1.148, i.e. roughly a 15% speedup.

Test command:
```
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000 --max-concurrency 60 --port 30000 --host 0.0.0.0
```

accuracy
```
+------------------+-----------+----------+----------+-------+---------+---------+
| Model            | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+==================+===========+==========+==========+=======+=========+=========+
| DeepSeek-V3-0324 | aime24    | mean_acc | default  |   300 |  0.5467 | default |
+------------------+-----------+----------+----------+-------+---------+---------+ 
```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Optimized FP4/FP8 quantization paths with improved register efficiency
* Enhanced kernel launch configuration to improve GPU occupancy and
performance
  * Streamlined accumulation processes to reduce memory footprint

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: bruce.xu <bruce.xu@gmicloud.ai>
Co-authored-by: bruce.xu <bruce.xu@gmicloud.ai>
<!-- .github/pull_request_template.md -->

## 📌 Description

Add CLAUDE.md as a contribution guide for agents (and humans).
Add several skills (adding a CUDA operator to FlashInfer, debugging,
profiling); this list will grow in the future.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: aleozlx <aleyang@nvidia.com>
Co-Authored-By: bkryu <bkryu@nvidia.com>
Co-Authored-By: nvmbreughe <nvmbreughe@nvidia.com>
Co-Authored-By: jimmyzho <jimmzhou@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Element-wise tensor scaling (in-place/out-of-place) with JIT-backed
modules and a simple Python API supporting FP16, BF16, and FP32; AOT
pre-generation support.

* **Tests**
* Unit tests covering FP16, BF16, FP32 across sizes, in-place outputs,
and invalid-input handling.

* **Chores**
* Build integration to pre-generate scale modules and package export of
the new API.

* **Documentation**
* Kernel benchmarking tutorial; CUDA crash debugging guide;
comprehensive developer workflow doc.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Alex Yang <aleozlx@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: jimmzhou <jimmzhou@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: bkryu <bkryu@nvidia.com>
Co-authored-by: nvmbreughe <nvmbreughe@nvidia.com>
…orm+FP4Quant fusion kernels (#2260)

<!-- .github/pull_request_template.md -->

## 📌 Description

This PR enhances the `rmsnorm_fp4quant` and `add_rmsnorm_fp4quant`
CuTe-DSL kernels with two key improvements:
* **Optional output allocation**:` y_fp4` and `block_scale` outputs can
now be either provided for in-place update or omitted for automatic
allocation and return
* **Global scale support**: Both fusion patterns now accept an optional
`global_scale` tensor (torch.Tensor | None, shape [1], dtype float32)
for NVFP4 quantization, enabling proper dynamic-range scaling when
`global_scale` is pre-computed. It should not be provided for MXFP4.

File Changes:
* `rmsnorm_fp4quant.py` / `add_rmsnorm_fp4quant.py`: Added global_scale:
torch.Tensor | None = None parameter; kernel now reads global scale from
device memory and incorporates it into block scale computation
* `bench_cute_dsl_rmsnorm_fp4quant.py` /
`bench_cute_dsl_add_rmsnorm_fp4quant.py`: Updated unfused baseline to
measure time for (add +) rmsnorm + fp4 quant, instead of measuring
separately.
* `test_rmsnorm_fp4_quant_cute_dsl.py` /
`test_add_rmsnorm_fp4_quant_cute_dsl.py`: Added auto-allocation tests,
global scale verification tests, and fused-vs-separate comparison tests.

API Changes:
```
# Before: outputs required
rmsnorm_fp4quant(x, weight, y_fp4, block_scale, ...)

# After: outputs optional, global_scale supported
y_fp4, block_scale = rmsnorm_fp4quant(x, weight, global_scale=gs, ...)  # auto-allocate
rmsnorm_fp4quant(x, weight, y_fp4, block_scale, global_scale=gs, ...)   # in-place
```

<details>
<summary>B200 (SM100) Benchmarks</summary>

```
$ python3 bench_cute_dsl_rmsnorm_fp4quant.py 
================================================================================
Fused RMSNorm + FP4 Quantization Benchmark
================================================================================
GPU Compute Capability: SM100

Running sanity check...
  OK: (128, 256) - FP4 match 99.8%
  OK: (512, 1024) - FP4 match 99.8%
  OK: (1024, 2048) - FP4 match 99.8%
✓ Confirmed: CuTe-DSL output is equivalent to RMSNorm + fp4_quantize


Batch    Hidden   Fused (µs)   BW (GB/s)  Unfused (µs)   Speedup   
-------------------------------------------------------------------
1000     1536     4.4          898.5      6.8            1.54x     
1000     2048     5.2          1019.4     7.4            1.43x     
1000     4096     6.7          1563.1     12.1           1.80x     
1000     8192     9.2          2291.5     20.2           2.20x     
1000     16384    22.1         1897.4     31.5           1.42x     
1000     32768    31.6         2663.3     52.0           1.65x     
1024     1536     4.4          920.1      6.8            1.55x     
1024     2048     5.1          1050.4     7.4            1.44x     
1024     4096     6.8          1593.1     12.2           1.80x     
1024     8192     9.2          2342.4     20.3           2.21x     
1024     16384    22.9         1880.4     31.8           1.39x     
1024     32768    31.9         2697.1     51.9           1.63x     
2048     1536     5.5          1465.1     9.9            1.80x     
2048     2048     6.5          1663.4     11.6           1.80x     
2048     4096     9.1          2357.9     20.1           2.20x     
2048     8192     16.8         2562.4     34.6           2.06x     
2048     16384    36.5         2357.9     57.3           1.57x     
2048     32768    53.5         3217.2     94.1           1.76x     
3000     1536     6.5          1818.2     12.7           1.96x     
3000     2048     7.7          2033.6     15.2           1.97x     
3000     4096     12.3         2563.2     26.9           2.19x     
3000     8192     22.4         2816.3     50.4           2.25x     
3000     16384    49.0         2569.9     83.1           1.70x     
3000     32768    73.2         3443.0     130.5          1.78x     
4096     1536     7.5          2153.4     15.4           2.05x     
4096     2048     8.8          2434.3     19.3           2.19x     
4096     4096     16.5         2606.7     35.4           2.14x     
4096     8192     29.2         2943.6     66.8           2.29x     
4096     16384    61.3         2803.8     109.1          1.78x     
4096     32768    95.8         3591.7     173.8          1.81x     
5000     1536     8.5          2312.4     18.2           2.14x     
5000     2048     10.4         2531.3     22.9           2.21x     
5000     4096     18.7         2803.9     42.3           2.26x     
5000     8192     35.2         2982.3     80.0           2.27x     
5000     16384    72.7         2889.0     130.0          1.79x     
5000     32768    114.1        3680.8     206.1          1.81x     
8192     1536     11.6         2776.2     27.1           2.33x     
8192     2048     15.6         2747.7     34.3           2.19x     
8192     4096     28.6         3002.4     67.6           2.36x     
8192     8192     52.4         3279.1     127.2          2.42x     
8192     16384    113.9        3021.1     209.4          1.84x     
8192     32768    178.5        3854.4     332.1          1.86x     
10000    1536     14.1         2783.0     31.6           2.23x     
10000    2048     17.8         2944.7     40.3           2.26x     
10000    4096     34.5         3038.7     81.3           2.35x     
10000    8192     62.1         3380.8     153.1          2.46x     
10000    16384    135.2        3106.7     252.2          1.87x     
10000    32768    214.7        3911.2     401.1          1.87x     
15000    1536     19.4         3044.7     45.8           2.36x     
15000    2048     25.2         3126.0     59.7           2.37x     
15000    4096     47.4         3322.2     118.1          2.49x     
15000    8192     89.0         3539.8     224.8          2.53x     
15000    16384    192.3        3274.4     373.5          1.94x     
15000    32768    315.1        3997.2     592.1          1.88x     
16384    1536     20.9         3086.3     50.2           2.40x     
16384    2048     27.2         3165.0     64.8           2.39x     
16384    4096     51.0         3371.5     128.2          2.51x     
16384    8192     96.3         3570.9     245.7          2.55x     
16384    16384    210.2        3272.7     407.1          1.94x     
16384    32768    342.7        4014.3     646.6          1.89x     
25000    1536     30.4         3231.8     75.1           2.47x     
25000    2048     38.7         3392.7     96.8           2.50x     
25000    4096     73.0         3596.6     191.8          2.63x     
25000    8192     142.4        3686.3     369.4          2.59x     
25000    16384    310.0        3386.3     614.6          1.98x     
25000    32768    515.6        4071.7     976.8          1.89x     
32768    1536     38.2         3378.5     96.8           2.53x     
32768    2048     48.2         3568.4     124.3          2.58x     
32768    4096     92.8         3705.0     249.0          2.68x     
32768    8192     184.0        3739.5     482.0          2.62x     
32768    16384    401.8        3424.1     799.3          1.99x     
32768    32768    672.9        4088.8     1312.0         1.95x     
60000    1536     64.1         3682.7     171.8          2.68x     
60000    2048     81.5         3863.4     222.0          2.72x     
60000    4096     162.3        3880.2     449.3          2.77x     
60000    8192     329.5        3822.1     873.7          2.65x     
60000    16384    719.2        3502.5     1458.1         2.03x     
60000    32768    1265.2       3982.2     2440.1         1.93x     
65536    1536     69.3         3723.3     187.5          2.71x     
65536    2048     88.3         3895.6     242.6          2.75x     
65536    4096     176.5        3896.3     489.2          2.77x     
65536    8192     359.2        3830.4     953.7          2.66x     
65536    16384    783.9        3510.1     1590.3         2.03x     
65536    32768    1341.8       4101.3     2705.2         2.02x     

================================================================================
Geomean speedup vs Unfused (rmsnorm + fp4_quantize): 2.10x
================================================================================
Benchmark Complete
================================================================================

$ python3 bench_cute_dsl_add_rmsnorm_fp4quant.py
================================================================================
Fused Add + RMSNorm + FP4 Quantization Benchmark
================================================================================
GPU Compute Capability: SM100

Running sanity check...
  OK: (128, 256) - FP4 match 99.9%
  OK: (512, 1024) - FP4 match 99.9%
  OK: (1024, 2048) - FP4 match 99.9%
✓ Confirmed: CuTe-DSL output is equivalent to torch.add + RMSNorm + fp4_quantize


Batch    Hidden   Fused (µs)   BW (GB/s)  Unfused (µs)   Speedup   
-------------------------------------------------------------------
1000     1536     5.0          1413.5     9.7            1.96x     
1000     2048     5.5          1708.4     10.7           1.95x     
1000     4096     8.9          2094.1     16.4           1.84x     
1000     8192     13.1         2864.0     27.5           2.11x     
1000     16384    33.5         2232.1     44.4           1.33x     
1000     32768    66.7         2243.9     83.8           1.26x     
1024     1536     5.0          1438.2     9.8            1.96x     
1024     2048     5.5          1729.1     10.8           1.95x     
1024     4096     9.0          2121.5     16.6           1.84x     
1024     8192     13.2         2890.1     27.8           2.10x     
1024     16384    34.5         2220.9     45.0           1.31x     
1024     32768    67.4         2272.1     85.5           1.27x     
2048     1536     7.1          2020.8     13.9           1.96x     
2048     2048     8.7          2211.3     16.6           1.92x     
2048     4096     13.1         2928.4     27.4           2.09x     
2048     8192     22.2         3447.5     49.0           2.20x     
2048     16384    61.7         2481.9     89.9           1.46x     
2048     32768    121.2        2525.8     155.4          1.28x     
3000     1536     9.9          2130.1     18.0           1.82x     
3000     2048     10.6         2638.8     21.4           2.02x     
3000     4096     17.1         3275.2     38.1           2.23x     
3000     8192     30.5         3675.4     73.7           2.42x     
3000     16384    86.7         2587.8     128.4          1.48x     
3000     32768    170.0        2639.4     218.1          1.28x     
4096     1536     11.2         2555.9     22.0           1.96x     
4096     2048     12.5         3067.1     27.2           2.18x     
4096     4096     22.1         3462.1     49.6           2.24x     
4096     8192     39.1         3915.4     98.9           2.53x     
4096     16384    115.7        2646.4     170.0          1.47x     
4096     32768    224.7        2725.9     291.4          1.30x     
5000     1536     13.5         2598.1     25.8           1.91x     
5000     2048     14.6         3209.1     32.2           2.21x     
5000     4096     25.9         3609.7     60.3           2.33x     
5000     8192     45.9         4068.6     118.5          2.58x     
5000     16384    137.7        2714.0     202.8          1.47x     
5000     32768    269.2        2777.2     349.1          1.30x     
8192     1536     19.7         2917.3     38.9           1.97x     
8192     2048     20.8         3680.3     49.4           2.38x     
8192     4096     38.8         3941.0     100.4          2.58x     
8192     8192     70.5         4343.5     188.2          2.67x     
8192     16384    220.1        2782.6     326.6          1.48x     
8192     32768    427.1        2867.7     563.7          1.32x     
10000    1536     23.3         3004.2     45.4           1.95x     
10000    2048     24.5         3819.6     59.4           2.43x     
10000    4096     45.4         4112.9     120.5          2.65x     
10000    8192     84.3         4432.8     226.4          2.69x     
10000    16384    267.8        2791.0     393.8          1.47x     
10000    32768    517.6        2888.4     683.8          1.32x     
15000    1536     33.2         3167.9     67.5           2.03x     
15000    2048     34.3         4085.9     90.2           2.63x     
15000    4096     64.7         4334.6     174.9          2.70x     
15000    8192     122.2        4587.7     333.1          2.73x     
15000    16384    397.2        2823.4     582.8          1.47x     
15000    32768    766.5        2925.8     1014.2         1.32x     
16384    1536     36.0         3192.3     74.9           2.08x     
16384    2048     36.9         4145.8     98.1           2.66x     
16384    4096     69.9         4379.2     189.6          2.71x     
16384    8192     132.8        4609.6     363.1          2.73x     
16384    16384    433.0        2828.8     635.6          1.47x     
16384    32768    837.1        2926.2     1113.5         1.33x     
25000    1536     51.3         3417.7     112.4          2.19x     
25000    2048     52.0         4496.5     145.2          2.79x     
25000    4096     102.8        4546.2     283.1          2.75x     
25000    8192     197.7        4725.8     547.1          2.77x     
25000    16384    653.6        2859.5     962.6          1.47x     
25000    32768    1266.7       2950.7     1726.5         1.36x     
32768    1536     64.7         3547.3     144.4          2.23x     
32768    2048     66.0         4639.2     186.7          2.83x     
32768    4096     132.4        4625.2     367.1          2.77x     
32768    8192     256.2        4779.7     713.6          2.78x     
32768    16384    856.9        2858.6     1259.6         1.47x     
32768    32768    1652.6       2964.4     2267.0         1.37x     
60000    1536     112.7        3729.8     255.0          2.26x     
60000    2048     115.0        4876.2     331.3          2.88x     
60000    4096     235.2        4767.4     662.0          2.81x     
60000    8192     462.3        4851.2     1294.6         2.80x     
60000    16384    1560.6       2873.9     2311.2         1.48x     
60000    32768    3008.9       2981.2     4225.7         1.40x     
65536    1536     122.4        3751.8     277.6          2.27x     
65536    2048     124.9        4901.8     361.0          2.89x     
65536    4096     256.2        4780.9     721.2          2.82x     
65536    8192     503.5        4864.7     1412.8         2.81x     
65536    16384    1703.0       2876.7     2508.2         1.47x     
65536    32768    3288.8       2979.2     4617.6         1.40x     

================================================================================
Geomean speedup vs Unfused (add + rmsnorm + fp4_quantize): 1.96x
================================================================================
Benchmark Complete
================================================================================
```
</details>

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added optional global scaling for FP4 quantization; quantization APIs
now return quantized output plus block scales and support
auto-allocation.

* **Benchmark Improvements**
* Benchmarks now propagate global_scale, report fused vs unfused
timings, and show a single speedup metric versus the unfused path with
simplified output formatting.

* **Testing**
* Expanded tests to cover global-scale paths, auto-allocation, swizzled
layouts, large sizes, and introduced two-tier tolerance assertions.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Adds a workflow that runs weekly on Mondays (and can be triggered
manually) to execute `scripts/xfails_tracker.py`, which scans the test
suite for xfail markers and writes a comprehensive report to
`reports/xfails_report.txt`. If the report has changed, the workflow
automatically opens a pull request committing the updated report to the
repository. A sketch of the scanning step follows below.
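
As a rough illustration of the scanning step (the actual
`scripts/xfails_tracker.py` may work differently, e.g. via pytest collection;
everything below is an assumption):

```python
# Hypothetical sketch of an xfail scanner; not the actual xfails_tracker.py.
import pathlib

def collect_xfails(root: str = "tests") -> list[str]:
    entries = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "pytest.mark.xfail" in line:  # decorator or parametrize mark
                entries.append(f"{path}:{lineno}: {line.strip()}")
    return entries

if __name__ == "__main__":
    pathlib.Path("reports").mkdir(exist_ok=True)
    report = "\n".join(collect_xfails()) + "\n"
    pathlib.Path("reports/xfails_report.txt").write_text(report)
```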

## 🔍 Related Issues

Continuation of the testing item from the November roadmap.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Added an automated weekly report generation that detects changes and
opens pull requests to incorporate updates.
* Updated automation to remove generated-attribution boilerplate from
automated commit/PR messages.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Based on a review comment in flashinfer-ai/flashinfer#2127, we can add
support for Int64 indices as well. I did this with `IdType`, following
the pattern used in other files; a hedged usage sketch follows below.
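
A minimal usage sketch of what this enables, assuming the sampling API accepts
an optional `indices` tensor as in the existing tests; the exact signature is
from memory and may differ:

```python
import torch
import flashinfer

probs = torch.rand(2, 32000, device="cuda")
probs = probs / probs.sum(dim=-1, keepdim=True)

# Four requests sharing two probability rows; int64 indices now work
# alongside int32 thanks to the IdType-based dispatch.
indices = torch.tensor([0, 0, 1, 1], dtype=torch.int64, device="cuda")
samples = flashinfer.sampling.sampling_from_probs(probs, indices=indices)
print(samples.shape)  # one sampled token id per index entry
```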

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Test results:

```
(flashinfer) raayan@uril-1:~/projects/flashinfer$ pytest tests/utils/test_sampling.py
============================================================= test session starts =============================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/raayan/projects/flashinfer
configfile: pytest.ini
collected 1884 items

tests/utils/test_sampling.py .......................................................................................................... [  5%]
....................................................................................................................................... [ 12%]
....................................................................................................................................... [ 19%]
....................s..s..s..........................................................................sss........................sss.... [ 27%]
....................................................................................................................................... [ 34%]
..........................ssss................................ssss................................ssss................................s [ 41%]
sss................................ssss................................ssss................................ssss........................ [ 48%]
........ssss................................ssss................................ssss................................ssss............... [ 55%]
.................ssss................................ssss................................ssss................................ssss...... [ 62%]
..........................ssss................................ssss................................ssss................................s [ 70%]
sss................................ssss................................ssss................................ssss........................ [ 77%]
........ssss................................ssss................................ssss................................ssss............... [ 84%]
.................ssss.................................................................................................................. [ 91%]
........................................................sss............................................................................ [ 98%]
.......................                                                                                                                 [100%]

================================================ 1764 passed, 120 skipped in 546.33s (0:09:06) ================================================
(flashinfer) raayan@uril-1:~/projects/flashinfer$
```


## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

---------

Signed-off-by: raayandhar <raayan.dhar@gmail.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

This PR adds an implementation of the Gated Delta Rule (or Gated Delta
Net) on the Hopper architecture to better support Qwen-Next-like
architectures. A reference sketch of the recurrence follows below.
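
For reviewers unfamiliar with the recurrence, here is a naive single-head
reference in PyTorch. It sketches one common formulation of the math only; the
kernel's exact gating convention, chunking, and memory layouts may differ.

```python
import torch

def gated_delta_rule_ref(q, k, v, alpha, beta):
    """Naive reference of one common gated delta rule formulation:
        S_t = alpha_t * (I - beta_t * k_t k_t^T) @ S_{t-1} + beta_t * k_t v_t^T
        o_t = S_t^T @ q_t
    Shapes: q, k: [T, dk]; v: [T, dv]; alpha, beta: [T]."""
    T, dk = k.shape
    dv = v.shape[-1]
    S = torch.zeros(dk, dv, dtype=torch.float32, device=q.device)
    out = torch.empty(T, dv, dtype=torch.float32, device=q.device)
    for t in range(T):
        kt, vt = k[t].float(), v[t].float()
        # delta-rule update with decay gate alpha and write strength beta
        S = alpha[t] * (S - beta[t] * torch.outer(kt, kt @ S)) + beta[t] * torch.outer(kt, vt)
        out[t] = S.t() @ q[t].float()
    return out, S  # per-token outputs and the final state
```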

## 🔍 Related Issues

#1690 

## 🚀 Pull Request Checklist

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

Thanks @jiahanc for initiating the kernel integration and implementing
the API.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* SM90-optimized Gated Delta Rule (GDN) prefill: Python API
(chunk_gated_delta_rule), host launcher, and FFI export; supports
optional alpha/beta gating and returns output and final state.

* **Benchmarks & Tests**
* New GPU benchmark for GDN prefill reporting runtime, TFLOPs and
bandwidth.
* Added reference implementations and comprehensive tests validating
prefill, chunked prefill, and delta-rule behavior.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

2025 -> 2026.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated copyright year ranges to 2026 in project headers and
documentation.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Updates the minimum version requirement of nvidia-cutlass-dsl to 4.3.4,
which should resolve the ARM issue in
flashinfer-ai/flashinfer#2279.

## 🔍 Related Issues

#2279 

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Updated internal dependencies to improve stability and compatibility.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

flashinfer-ai/flashinfer#2111 already enabled Hopper FA3 FP8 attention
in `prefill.py`. This is a follow-up PR that makes the same change in
`decode.py`, since `decode.py` actually uses prefill kernels under the
hood. A hedged usage sketch follows below.
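
A rough usage sketch of the path this enables. The dtype-related keyword names
below are assumptions based on this PR's summary and may not match the current
API exactly:

```python
import torch
import flashinfer

# Toy page-table setup: 4 requests, 4 full pages each.
batch_size, page_size, head_dim = 4, 16, 128
num_qo_heads, num_kv_heads = 32, 8
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * 4
kv_indices = torch.arange(batch_size * 4, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float8_e4m3fn,   # assumed kwarg
    kv_data_type=torch.float8_e4m3fn,  # assumed kwarg
)
q = torch.randn(batch_size, num_qo_heads, head_dim, device="cuda").to(torch.float8_e4m3fn)
kv_cache = torch.randn(
    batch_size * 4, 2, page_size, num_kv_heads, head_dim, device="cuda"
).to(torch.float8_e4m3fn)
out = wrapper.run(q, kv_cache)  # fp16/bf16 output per the updated FP8 guidance
```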

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added selectable backend support (including a new backend option) and
explicit output-dtype control for decode/prefill workflows.

* **Improvements**
* Improved FP8 handling and propagation of scales; runtime checks
enforce output-dtype consistency and avoid unnecessary scaling when
scale == 1.0.
  * Backend auto-selection logic enhanced to consider output dtype.

* **Documentation**
  * FP8 guidance updated to allow float16 and bfloat16 outputs.

* **Tests**
  * Added tests validating FP8 paged decoding with the new backend.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
This PR updates the Docker CI image tags to the latest version:
`20260105-a97b5d7`

Updated images:
- flashinfer/flashinfer-ci-cu126:20260105-a97b5d7
- flashinfer/flashinfer-ci-cu128:20260105-a97b5d7
- flashinfer/flashinfer-ci-cu129:20260105-a97b5d7
- flashinfer/flashinfer-ci-cu130:20260105-a97b5d7

Auto-generated by [release-ci-docker
workflow](https://github.com/flashinfer-ai/flashinfer/actions/runs/20731143681)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Updated Docker image tags in CI/CD pipeline configuration

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: yzh119 <11773619+yzh119@users.noreply.github.com>
…-ffi for cute-dsl kernels (#2279)

<!-- .github/pull_request_template.md -->

## 📌 Description

cute-dsl has supported compilation with tvm-ffi since the 4.3 release
(https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/compile_with_tvm_ffi.html),
which lets users pass torch tensors directly with negligible DLPack
conversion cost, without manually creating cute tensors from raw
pointers.

In this PR we refactor the existing cute-dsl kernels to enable tvm-ffi
and simplify the torch -> cute-dsl boilerplate; the DLPack handoff this
builds on is sketched below.
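
The DLPack handoff can be illustrated with plain torch (generic code, not the
cute-dsl integration itself):

```python
import torch
from torch.utils import dlpack

x = torch.randn(8, 8, device="cuda")
capsule = dlpack.to_dlpack(x)        # export: metadata only, no data copy
y = dlpack.from_dlpack(capsule)      # import on the consumer side
assert y.data_ptr() == x.data_ptr()  # same underlying GPU buffer
```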

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* FP4 quant kernels (RMSNorm and Add+RMSNorm) now accept TVM-FFI tensors
and a generic stream instead of raw pointers, simplifying invocation and
runtime flow and improving handling of swizzled vs non‑swizzled scale
layouts.
* Compilation path updated to use TVM-FFI-friendly scaffolding with
symbolic/fake tensors and streams.
* **Documentation**
* Docstrings and user-facing notes updated to describe the tensor-based
inputs, TVM-FFI usage, and swizzle-dependent layout behavior.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

In `flashinfer_benchmark.py`'s attention benchmark, selecting `fa2_tc`
("FlashAttention2 with tensor cores enabled") passed the incorrect
backend name "fa2_tc" to the wrapper when it should have been "fa2".
The bug used to be harmless, but recent commits have caused it to
surface.

This PR changes the benchmark code to fix the issue, as sketched below.

**No library code or unit test code changes, so this will not trigger
unit tests.**
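
A sketch of the kind of mapping the fix performs; the helper name is
hypothetical and the real benchmark code may differ:

```python
def resolve_backend(name: str) -> tuple[str, bool]:
    """Translate the benchmark-level alias "fa2_tc" into the real wrapper
    backend "fa2" plus a tensor-cores flag."""
    if name == "fa2_tc":
        return "fa2", True
    return name, False

backend, use_tensor_cores = resolve_backend("fa2_tc")  # -> ("fa2", True)
```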

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Improved backend configuration handling in batch decoding benchmarks
to ensure correct parameter mapping during wrapper instantiation.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@murphymatt murphymatt requested review from a team and divchenko January 7, 2026 21:54
cyx-6 and others added 5 commits January 7, 2026 15:45
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

Fixes #2284. Before #1641, flashinfer used the torch default generator
`at::cuda::detail::getDefaultCUDAGenerator()`, whereas #1641 creates a
new generator instance on every call. This PR restores the use of
torch's default generator (see the Python-side sketch below).
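
The actual change is on the C++ side, but a Python-side analogue of the
behavior looks like this (the helper name is hypothetical):

```python
import torch

def resolve_generator(device: torch.device, generator: torch.Generator | None = None):
    """Fall back to the device's shared default generator instead of
    constructing a fresh torch.Generator on every call."""
    if generator is not None:
        return generator
    if device.type == "cuda":
        idx = device.index if device.index is not None else torch.cuda.current_device()
        return torch.cuda.default_generators[idx]
    return torch.default_generator
```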

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Sampling now uses a device-aware default random generator, ensuring
consistent and correct sampling behavior across CPU and GPU when no
generator is provided.

* **Chores**
* Small public API update to accept a device context so sampling
routines derive RNG state from the correct device.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…roughput + speculative decoding (#2265)

<!-- .github/pull_request_template.md -->

## 📌 Description

This PR adds optimized decode attention kernels for high throughput
(large batch size) combined with speculative decoding (seqlen_q > 1).

See below for speedups, collected with
`benchmarks/flashinfer_benchmark.py`. The KV sequence length is 16K in
all cases.

| test case | median_time_ms | median_time_ms (opt) | speedup |
|------------------------------------------|----------------|----------------------|----------|
| Qwen3-480B-fp8_e4m3-batchSize8-seqLenQ2 | 0.057 | 0.046 | 1.24 |
| Qwen3-480B-fp8_e4m3-batchSize16-seqLenQ2 | 0.11 | 0.083 | 1.33 |
| Qwen3-480B-fp8_e4m3-batchSize32-seqLenQ2 | 0.213 | 0.168 | 1.27 |
| Qwen3-480B-fp8_e4m3-batchSize40-seqLenQ2 | 0.266 | 0.241 | 1.10 |
| Qwen3-480B-fp8_e4m3-batchSize64-seqLenQ2 | 0.432 | 0.336 | 1.29 |
| Qwen3-480B-fp8_e4m3-batchSize8-seqLenQ4 | 0.109 | 0.048 | 2.27 |
| Qwen3-480B-fp8_e4m3-batchSize16-seqLenQ4 | 0.212 | 0.083 | 2.55 |
| Qwen3-480B-fp8_e4m3-batchSize32-seqLenQ4 | 0.371 | 0.168 | 2.21 |
| Qwen3-480B-fp8_e4m3-batchSize40-seqLenQ4 | 0.472 | 0.245 | 1.93 |
| Qwen3-480B-fp8_e4m3-batchSize64-seqLenQ4 | 0.736 | 0.348 | 2.11 |
| Qwen3-480B-fp8_e4m3-batchSize8-seqLenQ8 | 0.212 | 0.061 | 3.48 |
| Qwen3-480B-fp8_e4m3-batchSize16-seqLenQ8 | 0.37 | 0.106 | 3.49 |
| Qwen3-480B-fp8_e4m3-batchSize32-seqLenQ8 | 0.732 | 0.239 | 3.06 |
| Qwen3-480B-fp8_e4m3-batchSize40-seqLenQ8 | 0.937 | 0.321 | 2.92 |
| Qwen3-480B-fp8_e4m3-batchSize64-seqLenQ8 | 1.456 | 0.484 | 3.01 |
| GPT-OSS-fp8_e4m3-batchSize8-seqLenQ2 | 0.051 | 0.03 | 1.70 |
| GPT-OSS-fp8_e4m3-batchSize16-seqLenQ2 | 0.098 | 0.054 | 1.81 |
| GPT-OSS-fp8_e4m3-batchSize32-seqLenQ2 | 0.188 | 0.104 | 1.81 |
| GPT-OSS-fp8_e4m3-batchSize40-seqLenQ2 | 0.234 | 0.15 | 1.56 |
| GPT-OSS-fp8_e4m3-batchSize64-seqLenQ2 | 0.332 | 0.199 | 1.67 |
| GPT-OSS-fp8_e4m3-batchSize8-seqLenQ4 | 0.099 | 0.038 | 2.61 |
| GPT-OSS-fp8_e4m3-batchSize16-seqLenQ4 | 0.188 | 0.07 | 2.69 |
| GPT-OSS-fp8_e4m3-batchSize32-seqLenQ4 | 0.332 | 0.136 | 2.44 |
| GPT-OSS-fp8_e4m3-batchSize40-seqLenQ4 | 0.418 | 0.2 | 2.09 |
| GPT-OSS-fp8_e4m3-batchSize64-seqLenQ4 | 0.647 | 0.265 | 2.44 |
| GPT-OSS-fp8_e4m3-batchSize8-seqLenQ8 | 0.188 | 0.039 | 4.82 |
| GPT-OSS-fp8_e4m3-batchSize16-seqLenQ8 | 0.332 | 0.065 | 5.11 |
| GPT-OSS-fp8_e4m3-batchSize32-seqLenQ8 | 0.647 | 0.126 | 5.13 |
| GPT-OSS-fp8_e4m3-batchSize40-seqLenQ8 | 0.83 | 0.185 | 4.49 |
| GPT-OSS-fp8_e4m3-batchSize64-seqLenQ8 | 1.29 | 0.245 | 5.27 |

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Generation attention now enforces causal masking during token
generation.

* **Performance / Refactor**
* Improved kernel selection and on-demand loading for better performance
and GPU compatibility.
* Added finer-grained tuning parameters for tile/grouping,
tokens-per-CTA and inflation to enable more optimal kernel choices.

* **Chores**
  * Updated FMHA artifact paths and checksums.

* **Tests**
* Expanded parameterized tests to cover larger batch decoding scenarios.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: yzh119 <zihaoy@nvidia.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Version bumped to 0.6.0 with no functional changes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## 🤖 Installing Claude Code GitHub App

This PR adds a GitHub Actions workflow that enables Claude Code
integration in our repository.

### What is Claude Code?

[Claude Code](https://claude.com/claude-code) is an AI coding agent that
can help with:
- Bug fixes and improvements  
- Documentation updates
- Implementing new features
- Code reviews and suggestions
- Writing tests
- And more!

### How it works

Once this PR is merged, we'll be able to interact with Claude by
mentioning @claude in a pull request or issue comment.
Once the workflow is triggered, Claude will analyze the comment and
surrounding context, and execute on the request in a GitHub action.

### Important Notes

- **This workflow won't take effect until this PR is merged**
- **@claude mentions won't work until after the merge is complete**
- The workflow runs automatically whenever Claude is mentioned in PR or
issue comments
- Claude gets access to the entire PR or issue context including files,
diffs, and previous comments

### Security

- Only approved team members can use this feature.
- Our Anthropic API key is securely stored as a GitHub Actions secret
- Only users with write access to the repository can trigger the
workflow
- All Claude runs are stored in the GitHub Actions run history
- Claude's default tools are limited to reading/writing files and
interacting with our repo by creating comments, branches, and commits.
- We can add more allowed tools by adding them to the workflow file
like:

```
allowed_tools: Bash(npm install),Bash(npm run build),Bash(npm run lint),Bash(npm run test)
```

There's more information in the [Claude Code action
repo](https://github.com/anthropics/claude-code-action).

After merging this PR, let's try mentioning @claude in a comment on any
PR to get started!

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added automated AI code review workflows that run on pull requests and
when the bot is mentioned in comments.
* Reviews can post feedback as PR comments and are configurable with
optional prompts.

* **Chores**
* Workflows verify contributor authorization and only run reviews for
authorized users.
* Reviews assess quality, bugs, performance, security, and test
coverage.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Improved test suite with a refined hardware check: an FP8-related test
now requires a specific GPU compute capability so it only runs on
compatible hardware, reducing false skips and improving reliability
(sketched below).

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
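
A sketch of the kind of gate described, assuming the common FP8 threshold of
SM89 (Ada); the capability the actual test checks may differ:

```python
import pytest
import torch

major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 9):  # FP8 (e4m3/e5m2) needs SM89 or newer
    pytest.skip("FP8 test requires compute capability >= 8.9", allow_module_level=True)
```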