Sync origin/main with upstream/main #16

Open

murphymatt wants to merge 137 commits into main from sync-with-upstream-main

Conversation

@murphymatt (Contributor)

Rebases origin/main onto upstream/main to resync with the flashinfer-ai/flashinfer repository.

bkryu and others added 30 commits November 12, 2025 20:17
<!-- .github/pull_request_template.md -->

## 📌 Description
Brings in some changes to `test_hopper.py` to pass more unit tests

* `test_deepseek_prefill` --> Raise tolerance for bf16 inputs
* Others: the argument
```
token_pos_in_items_len=torch.tensor(token_pos_in_items_len)
.to(dtype=torch.uint32)
.to(0),
```
is incorrect API usage and results in invalid-input errors. Change it to
`token_pos_in_items_len=token_pos_in_items_len,` so that it matches the
correct usage in e.g.
[test_batch_prefill_kernels.py](https://github.com/flashinfer-ai/flashinfer/blob/6765cadd14fbedc9ffab428a87149a7d3f5d69f1/tests/attention/test_batch_prefill_kernels.py#L890)
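
For illustration, a minimal sketch of the type difference between the two
call styles (the value `128` is made up; only how the argument is built
matters, and the original code also moved the tensor to device 0):

```
import torch

token_pos_in_items_len = 128  # illustrative value; the tests pass a plain int

# Old (incorrect) style: wraps the scalar into a uint32 tensor, which the
# kernel rejects as invalid input.
wrong = torch.tensor(token_pos_in_items_len).to(dtype=torch.uint32)
print(type(wrong))  # <class 'torch.Tensor'>

# Fixed style: forward the plain int, matching test_batch_prefill_kernels.py.
right = token_pos_in_items_len
print(type(right))  # <class 'int'>
```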

After this, the `test_hopper.py` results improve to `3 failed, 2865 passed,
1320 skipped in 65.26s (0:01:05)`.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
<!-- .github/pull_request_template.md -->

## 📌 Description

Add Thor and Spark support when generating wheels.

## 🔍 Related Issues

The output reports that these architectures are not compatible; they currently work only with JIT.



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Broadened GPU architecture support to include additional newer
architectures.

* **Documentation**
* Updated README and installation docs to show the revised CUDA
architecture example list.

* **Chores**
* Adjusted release/nightly workflows and build scripts to select
architectures using an expanded CUDA-version threshold and branching
logic.

* **Performance**
* Extended architecture-specific build/runtime handling to cover an
additional GPU architecture affecting memory-related behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description
Deprecate `tile_token_dim` in trtllm_moe. It is already unused and is now
marked with a deprecation warning; the plan is to remove it entirely in the
next major release.
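
A minimal sketch of the deprecation pattern described above, under a
hypothetical wrapper name (the real function and its other arguments live in
flashinfer):

```
import warnings
from typing import Optional

def trtllm_moe_sketch(hidden_states, tile_token_dim: Optional[int] = None, **kwargs):
    # Hypothetical wrapper: only the deprecation handling is sketched here.
    if tile_token_dim is not None:
        warnings.warn(
            "tile_token_dim is unused and will be removed in the next major release",
            DeprecationWarning,
            stacklevel=2,
        )
    # ... kernel dispatch continues without tile_token_dim
```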
<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Removed the deprecated `tile_tokens_dim` parameter from MOE benchmarks
and kernel functions, streamlining API calls and eliminating associated
deprecation warnings.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

Co-authored-by: @Edenzzzz 

## 📌 Description
Fixes flashinfer-ai/flashinfer#1022. Unlike
flashinfer-ai/flashinfer#1231, this splits the inputs into separate
prefill and decode inputs. It should probably be possible to handle this
splitting automatically in Python, so that callers can simply provide a
single batch of requests? (A rough sketch of the splitting follows the
benchmark command below.)

To run the benchmark: `python benchmarks/bench_mixed_attention.py`
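
A rough sketch of the splitting idea mentioned above, assuming requests are
represented as (qo_len, kv_len) pairs (the helper is illustrative, not the
benchmark's actual API):

```
def split_mixed_batch(requests):
    """Split a mixed batch into prefill and decode sub-batches.

    A request with a single new query token is treated as decode; anything
    longer is treated as prefill. `requests` is assumed to be a list of
    (qo_len, kv_len) pairs, as in the benchmark scenarios below.
    """
    prefill = [r for r in requests if r[0] > 1]
    decode = [r for r in requests if r[0] == 1]
    return prefill, decode

# Example mirroring Benchmark 1: 2 prefill requests + 128 decode requests.
requests = [(2048, 2048)] * 2 + [(1, 2048)] * 128
prefill, decode = split_mixed_batch(requests)
assert len(prefill) == 2 and len(decode) == 128
```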

Performance:
===== Benchmark 1: (kv_len, qo_len) set =====
Prefill = 2 requests, 2048 Q len, 2048 KV len
Decode = 128 requests, 2048 KV len
Elapsed time (Batched Prefill): 0.65 ms
Elapsed time (Batched POD Attention): 0.46 ms
Elapsed time (Persistent BatchAttention): 0.56 ms
**Batch POD speedup over Persistent BatchAttention: 1.22x**

===== Benchmark 2: (kv_len, qo_len) set =====
Prefill = 1 request, 2048 Q len, 2048 KV len
Decode = 128 requests, 2048 KV len
Elapsed time (Batched Prefill): 0.55 ms
Elapsed time (Batched POD Attention): 0.41 ms
Elapsed time (POD Attention): 0.41 ms
Elapsed time (Sequential two kernels): 0.51 ms
Elapsed time (Persistent BatchAttention): 0.45 ms
**Batch POD speedup over Persistent BatchAttention: 1.11x**

===== Benchmark 3: (kv_len, qo_len) set =====
Prefill = 1 request, 4096 Q len, 4096 KV len
Decode = 128 requests, 4096 KV len
Elapsed time (Batched Prefill): 1.27 ms
Elapsed time (Batched POD Attention): 0.86 ms
Elapsed time (POD Attention): 0.82 ms
Elapsed time (Sequential two kernels): 1.15 ms
Elapsed time (Persistent BatchAttention): 1.08 ms
**Batch POD speedup over Persistent BatchAttention: 1.26x**

===== Benchmark 4: (kv_len, qo_len) set =====
Prefill = 1 request, 4096 Q len, 4096 KV len
Decode = 128 requests, 8192 KV len
Elapsed time (Batched Prefill): 2.15 ms
Elapsed time (Batched POD Attention): 1.52 ms
Elapsed time (POD Attention): 1.54 ms
Elapsed time (Sequential two kernels): 1.82 ms
Elapsed time (Persistent BatchAttention): 1.76 ms
**Batch POD speedup over Persistent BatchAttention: 1.16x**

===== Benchmark 5: (kv_len, qo_len) set =====
Prefill = 1 request, 6000 Q len, 7000 KV len
Decode = 128 requests, 8192 KV len
Elapsed time (Batched Prefill): 2.86 ms
Elapsed time (Batched POD Attention): 2.03 ms
Elapsed time (POD Attention): 1.95 ms
Elapsed time (Sequential two kernels): 2.52 ms
Elapsed time (Persistent BatchAttention): 2.45 ms
**Batch POD speedup over Persistent BatchAttention: 1.20x**


## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a batched prefill+decode attention path with a public
batch-oriented POD wrapper and JIT module export.

* **Performance**
* Benchmarks extended to include batched-path timings, memory bandwidth,
elapsed-time and comparative speedup metrics across expanded
prefill/decode scenarios.

* **API**
* Runtime binding for batched KV‑cache execution added; planning APIs
now accept an optional colocated-CTA parameter that influences
scheduling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Aditya K Kamath <akamath1997@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
<!-- .github/pull_request_template.md -->

## 📌 Description

Patch sm103 for 3xfp4 moe generation

## 🔍 Related Issues

Follow-up to #2020 and #1925.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

```
$ ls csrc/nv_internal/tensorrt_llm/cutlass_instantiations/103/gemm_grouped
100  103  80

$ pytest tests/moe/test_trtllm_cutlass_fused_moe.py
22 passed, 3 skipped, 1 warning in 771.89s (0:12:51)
```


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added support for Blackwell (SM103) GPU architecture in MOE (Mixture
of Experts) operations with specialized CUTLASS-optimized modules.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

This PR does two things (both sketched below):
* Adds a check on the number of tokens and raises an exception if the max
token count is exceeded
* Adds an optional parameter to allow users to dial in an arbitrary
workspace size
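
A minimal sketch of both changes under assumed names (`MAX_TOKEN_NUM`, the
default size, and the function name are illustrative, not the PR's actual
constants or entry point):

```
import torch

MAX_TOKEN_NUM = 2048                        # illustrative limit
DEFAULT_WORKSPACE_BYTES = 16 * 1024 * 1024  # illustrative default

def all_reduce_sketch(inp: torch.Tensor,
                      workspace_bytes: int = DEFAULT_WORKSPACE_BYTES):
    # New check: enforce a 2D [tokens, hidden] input and a token-count limit.
    if inp.dim() != 2:
        raise ValueError(f"expected a 2D [tokens, hidden] tensor, got {inp.dim()}D")
    if inp.shape[0] > MAX_TOKEN_NUM:
        raise ValueError(
            f"token count {inp.shape[0]} exceeds the maximum of {MAX_TOKEN_NUM}; "
            "reduce the batch or size the workspace accordingly"
        )
    # New optional parameter: callers may dial in an arbitrary workspace size.
    workspace = torch.empty(workspace_bytes, dtype=torch.uint8, device=inp.device)
    # ... kernel launch would use `workspace` here
```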

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added an optional configurable workspace buffer size for all-reduce
operations with a sensible default to preserve backwards compatibility.
* Runtime input validation now enforces 2D inputs and token-count
limits, with clearer error messages guiding corrective actions.

* **Tests**
* Expanded test coverage for workspace behavior: default sizing,
explicit sizing, and negative tests for insufficient workspace.
* Tests now allow supplying an explicit workspace size to validate
allocation and reuse scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

- Small optimization for TRT-LLM Gen MoE finalize kernel

TopK=8, NumExperts=128, HiddenSize=4096

| BS | Baseline, us | Optimized, us | Speed-up |
| ------------- | ------------- | ------------- | ------------- |
| 256  | 11  | 6  | 1.83 |
| 512  | 12  | 7  | 1.71 |
| 1024 | 16  | 15  | 1.06 |
| 4096  | 55 | 49  | 1.12 |
| 8192 | 107 | 95  | 1.13 |
| 16384  | 205  | 183  | 1.12 |

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Enabled vectorized, Top-K unrolled finalize path for MOE (Mixture of
Experts) kernel operations with improved performance.
* Added support for multiple data types (bfloat16, float, half) with
enhanced type specialization and packing.
* Introduced runtime validation for TopK configurations (≤ 64) to ensure
optimal vectorized execution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Refactor fused_moe test.

Split the tests by model and precision.

Part [1]:
- test deepseek (kimi, lite) fp8 block-scaled fused moe
- default TP8
- PDL enabled
- MajorK weight layout
- higher tolerance and matching percentage

Next Part [2]:
- add BlockMajorK weight layout

Next Part [x]:
- Per-tensor FP8 MoE, FP4 MoE

Later:
- refactor llama4, topk?, renormalize? routing tests

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added a comprehensive FP8 block-scale fused Mixture-of-Experts test
validating end-to-end correctness across many routing, expert and
precision configurations. Includes randomized inputs,
per-token/per-expert workflows, extensive parameterizations, diagnostic
statistics, autotune-path checks, and a minimal sanity run.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Duplicate of #2091; created this PR from flashinfer-ai to enable the workflow.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Corrected CUDA compute capability targeting from 11.0f to 11.0a for
improved compatibility across build configurations.

* **Documentation**
* Updated installation and build documentation to reflect updated CUDA
architecture configurations for both older and newer CUDA versions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

The `enablePDL` flag was set to false; this PR turns it on. It is set to
true for both sm_100 and sm_120, since both architectures should support
PDL.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Updated runtime configuration for FP4 GEMM operations to enhance
execution performance on SM100 and SM120 GPU architectures.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

This PR updates the CODEOWNERS file based on git commit history analysis
from the last 180 days.

## Changes

- Updated `.github/CODEOWNERS` with current code ownership based on:
  - Commit frequency
  - File coverage
  - Commit recency

## How to Review

1. Review the changes to `.github/CODEOWNERS`
2. Verify that the assigned owners are appropriate for each module
3. Make manual adjustments if needed before merging

## Notes

- This is an automated PR generated weekly
- Minimum commits threshold: 1
- Analysis period: 180 days
- Directory depth: 3 levels
- Top N owners per module: 5

---

🤖 This PR was automatically generated by the [update-codeowners
workflow](.github/workflows/update-codeowners.yml)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Chores**
  * Internal maintenance updates to code ownership mappings.

---

**Note:** This release contains no user-facing changes.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: flashinfer-bot <flashinfer-bot@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
…sed RoPE + Q + KV cache, supports MLA/GQA/MHA) (#2037)

<!-- .github/pull_request_template.md -->

## 📌 Description

Add `flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache`, which
runs a fused RoPE + Quantization (16 -> 8) + append KV Cache operation
kernel.

Note that quantization here is not optional (there is no fused "RoPE +
append KV Cache" operation without quantization).
Tested on NVIDIA H100 NVL + flashinfer/flashinfer-ci-cu130:latest for
MLA/MHA/GQA problem sizes for decode and prefill cases.
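
As a point of reference, the unfused sequence the kernel replaces can be
written in plain PyTorch; this sketch simplifies shapes and the paged layout
and is not the kernel's actual signature:

```
import torch

def rope_quantize_append_reference(q, k, v, cos, sin, k_cache_fp8, v_cache_fp8, pos):
    """Unfused reference of RoPE -> FP8 quantize -> KV-cache append.

    Shapes and the paged layout are simplified to flat [tokens, dim] caches;
    the real kernel fuses these three passes to avoid extra global-memory
    round trips.
    """
    def rope(x):
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        return x * cos + rotated * sin

    q_rot, k_rot = rope(q), rope(k)
    # 16-bit -> 8-bit quantization (per-tensor scale of 1.0 for brevity).
    q_fp8 = q_rot.to(torch.float8_e4m3fn)
    k_fp8 = k_rot.to(torch.float8_e4m3fn)
    v_fp8 = v.to(torch.float8_e4m3fn)
    # Append the new tokens into the (flattened) caches.
    k_cache_fp8[pos : pos + k.shape[0]] = k_fp8
    v_cache_fp8[pos : pos + v.shape[0]] = v_fp8
    return q_fp8
```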

## 🔍 Related Issues

"[Model Optimization] Add RoPE, RoPE+Q, RoPE+Q+KVCacheUpdate fused
kernels for MLA/GQA/MHA" item from Q4 roadmap:
flashinfer-ai/flashinfer#1770.

This PR is part 2 to earlier PR for RoPE + Q:
flashinfer-ai/flashinfer#1924

FW Stakeholders: @nvpohanh @pavanimajety 

## 🧪 Test results

```
$ pytest tests/attention/test_rope.py::test_rope_quantize_fp8_append_paged_kv_cache_decode -s
======================================================== test session starts =========================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /workspace/flashinfer
configfile: pytest.ini
collected 384 items

tests/attention/test_rope.py ................................................................................................................................................................................................................................................................................................................................................................................................

======================================================== 384 passed in 35.22s ========================================================
```

```
$ pytest tests/attention/test_rope.py::test_generalized_rope_quantize_append_kv_cache -s
======================================================== test session starts =========================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /workspace/flashinfer
configfile: pytest.ini
collected 1248 items

tests/attention/test_rope.py .........................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
......................................................................................................................................
.......................................................................

================================================== 1248 passed in 63.07s (0:01:03) ===================================================
```

```
$ python benchmarks/bench_rope_quantize_fp8_append_cache.py

Detected GPU: NVIDIA GB200
Theoretical Peak Memory Bandwidth: 7928.06 GB/s


====================================================================================================
  MLA: 128 Q heads, 1 K head, 64+512 dims (DeepSeek-style)
====================================================================================================
Tokens     Time (ms)    BW (GB/s)    BW% (Peak)     TFLOPs
----------------------------------------------------------------------
1          0.00258      86.53        1.1            0.010
32         0.00381      1873.82      23.6           0.208
128        0.00763      3744.50      47.2           0.416
384        0.01848      4637.34      58.5           0.515
768        0.03694      4639.75      58.5           0.515
1024       0.04879      4683.57      59.1           0.520
2048       0.09590      4766.09      60.1           0.529
4096       0.19031      4803.27      60.6           0.533
8192       0.38523      4745.78      59.9           0.527

====================================================================================================
  GQA: 32 Q heads, 8 K heads, 64+64 dims (Llama-style)
====================================================================================================
Tokens     Time (ms)    BW (GB/s)    BW% (Peak)     TFLOPs
----------------------------------------------------------------------
1          0.00294      6.36         0.1            0.003
32         0.00316      189.48       2.4            0.078
128        0.00317      755.23       9.5            0.310
384        0.00398      1803.09      22.7           0.741
768        0.00522      2750.51      34.7           1.130
1024       0.00617      3100.80      39.1           1.274
2048       0.00927      4130.83      52.1           1.697
4096       0.01631      4695.01      59.2           1.929
8192       0.03466      4418.01      55.7           1.815

====================================================================================================
  MHA: 32 Q heads, 32 K heads, 64+64 dims (Standard)
====================================================================================================
Tokens     Time (ms)    BW (GB/s)    BW% (Peak)     TFLOPs
----------------------------------------------------------------------
1          0.00293      12.68        0.2            0.004
32         0.00313      379.98       4.8            0.126
128        0.00357      1331.80      16.8           0.441
384        0.00517      2756.73      34.8           0.912
768        0.00742      3840.41      48.4           1.271
1024       0.00887      4287.15      54.1           1.419
2048       0.01504      5055.18      63.8           1.673
4096       0.03343      4548.12      57.4           1.505
8192       0.06410      4744.76      59.8           1.571

====================================================================================================
Configuration details:
  Page size: 32, Batch size: 4
  Token range: 1 (single decode) → 8192 (large prefill)
  GPU: NVIDIA GB200
  Theoretical Peak Memory Bandwidth: 7928.06 GB/s
  BW% calculated as: (achieved_bandwidth / peak_bandwidth) * 100
====================================================================================================

```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Fused RoPE + FP8 quantize-and-append for paged KV caches (MLA,
GQA/MHA) with layout, page-size, interleave and PDL options; returns
quantized Q outputs and writes K/V into paged caches; public ops and
high-level API added.

* **Tests**
* Deterministic, parameterized tests for append and decode/continuation
across attention types, layouts, dtypes and quant settings with
reference validation.

* **Benchmarks**
* New benchmark script for performance, bandwidth and Nsight profiling
of the paged-KV quantize+append path.

* **Chores**
  * Added cached GPU memory-bandwidth utility for benchmarks.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
…tion (#2084)

<!-- .github/pull_request_template.md -->

## 📌 Description

- Change `bmm1_scale` and `bmm2_scale` to `Union[float, torch.Tensor]`.
Note that when a tensor is used, the log2e factor must already be applied
to it.
- **remove the `bmm1_scale_log2_tensor` and `bmm2_scale_tensor` in the
`xqa_batch_decode_with_kv_cache_mla`**
- update trtllm-gen FMHA kernels

TODO: do the same refactor for the xqa kernels. Support for the device-side
scales was removed in #2033.
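
A small sketch of preparing a tensor-valued scale under the new convention
(values illustrative; per the note above, the log2e factor must be folded in
when passing a tensor):

```
import math
import torch

LOG2E = math.log2(math.e)

bmm1_scale = 1.0 / math.sqrt(128)  # e.g. 1/sqrt(head_dim)

# Float path: pass the raw scale directly.
scale_as_float = bmm1_scale

# Tensor path: log2e must already be applied (a 1-element device tensor in
# real use; kept on CPU here so the sketch runs anywhere).
scale_as_tensor = torch.tensor([bmm1_scale * LOG2E], dtype=torch.float32)
```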

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Attention scale parameters now accept either floats or 1-element
tensors across prefill, decode and runtime; tensor scales are validated
and applied on-device and pointer-backed scale paths are supported.

* **Chores**
* Updated FMHA artifact path and checksum constants; added a public
utility import and removed an obsolete inline comment.

* **Tests**
* Updated tests to exercise device/tensor-or-scalar scale flows, removed
legacy per-tensor call-site args, and added device-scale parametrization
for several test variants.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Add shuffling and the blockmajorK weight layout to the dpskv3 fused_moe
fp8_blockscaled tests.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Expanded MoE test suite with per-expert weight shuffling, optional
block-layout conversion, selectable weight-processing modes, and dynamic
kernel flags.
* Added a reference FP8 block-scale validation path and centralized
accuracy checks for clearer correctness verification.
* **Refactor**
* Centralized test utilities: quantization mode and test-skip logic
moved into shared helpers for consistent gating across MoE tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Added a DSR1 MLA test and split up the trtllm_batch_decode_mla function.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Improved test suite for batch decoding by making maximum sequence
length configurable, adding parameterized runs across short and long
lengths, and introducing a compatibility wrapper to preserve legacy
behavior. This enhances coverage and validation across varied
sequence-length scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

In #1898, it was raised that trtllm-gen's attention kernels fail for
batch size 1. The prefill kernel was fixed in #1912 and prefill tests
have been enabled.

Further updates to the trtllm-gen kernels have also fixed the decode batch
size 1 issue. This PR re-enables that testing.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Expanded batch_decode test scenarios to cover additional small-batch
and page-size combinations.
* Increased coverage for max_in_kv_len by testing multiple length
options instead of a single value.
* Restored previously marked-as-expected-failure case to run normally,
improving overall test pass coverage.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

This PR adds a parameter `return_lse_base_on_e` to control the base of the
LSE returned by MLA. It defaults to `False`, which matches the current
implementation. If `return_lse_base_on_e` is `True`, the final LSE is
multiplied by `loge2` to maintain consistency with the standard softmax and
FA3.
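
For context, an LSE accumulated in base 2 converts to the natural base by
multiplying by ln(2), which is presumably what the `loge2` factor denotes; a
quick numerical check:

```
import math
import torch

scores = torch.randn(8)

# Natural-base LSE, as in standard softmax / FA3.
lse_e = torch.logsumexp(scores, dim=0)

# Base-2 LSE, as attention kernels often accumulate it internally.
lse_2 = lse_e / math.log(2)

# Multiplying by ln(2) recovers the natural-base value.
assert torch.allclose(lse_2 * math.log(2), lse_e)
```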

## 🔍 Related Issues

#2113 

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a run-time option to control whether returned log‑sum‑exp (LSE)
baselines are scaled by ln(2) (default: disabled).

* **Bug Fixes**
* Conditional scaling ensures returned LSE values are consistent when
the option is enabled, improving numerical consistency.

* **Chores**
* The new option is exposed in public APIs and bindings and is
propagated through the execution path.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Update xqa license based on
NVIDIA/TensorRT-LLM#8807
<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues
flashinfer-ai/flashinfer#1977
<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated project licensing to Apache License 2.0 with extended
copyright years through 2025.


<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Attention ops now accept tensor-based per-head scaling (q/kv) in C++
and Python paths, enabling dynamic or per-tensor quantization scales.
  * Python APIs and docs updated to accept float or tensor scales.

* **Tests**
* Batch-decode tests adjusted to use per-sequence cache/block sizing for
more accurate memory dimensioning.

* **Documentation**
  * Docstrings updated to describe tensor-or-scalar scale inputs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

9.0a was accidentally removed from the installation documentation in some
recent PRs.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
…cubin (#2123)

<!-- .github/pull_request_template.md -->

## 📌 Description

The flashinfer-cubin package build failed because flashinfer/utils.py
relies on nvidia-ml-py, which was not specified as part of the package's
build-system requirements.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Added a new build system dependency to support enhanced system
functionality.


<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…udnn' (#1979)

<!-- .github/pull_request_template.md -->

## 📌 Description
Current PR:
* Introduces an `auto` backend to `mm_fp4` that can be autotuned. **It
replaces `cudnn` as the default.**
  * Implementation matches `bmm_fp8`'s auto backend support.
* Allows the `cudnn` backend to be autotuned.
* Added unit test cases for `backend=auto`
 
Behavior of `auto` backend:
* Examines the CUDA and cuDNN versions and calls either the `cutlass` or
`cudnn` kernel backend. The `trtllm` kernel is not considered due to a
non-interchangeable interface with the other backends.
* The `auto` backend therefore only supports inputs runnable by `cutlass`
and/or `cudnn`.
* Non-autotuned behavior (sketched after this list):
* Constructs an ordered list of backends, (cudnn, cutlass) or (cutlass,
cudnn), where the ordering is based on previous microbenchmark study results.
    * If CUDA 12 --> cutlass comes first.
    * If CUDA 13 and cuDNN version < 9.15 --> cutlass comes first.
    * If CUDA 13 and cuDNN version >= 9.15 --> cudnn comes first.
* If a kernel fails its support check, it is removed from the list.
* Autotune behavior:
* If a backend is explicitly provided --> autotunes within that backend.
Same as previous behavior, but autotuning is now also supported for cudnn.
* If `backend='auto'` --> autotunes within and across backends (cudnn &
cutlass) and chooses the best config of the best backend. The `trtllm`
kernel is not considered.
* Many of `mm_fp4`'s helper functions were refactored to enable
cross-backend autotuning, using the cross-backend autotune-enabled
`bmm_fp8` as a reference.
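
A sketch of the non-autotuned ordering heuristic from the list above, with
versions reduced to simple comparable values (illustrative, not the PR's
actual helper):

```
def order_fp4_backends(cuda_major: int, cudnn_version: tuple) -> list:
    """Return candidate backends in preference order.

    Unsupported backends would subsequently be filtered out by support
    checks. Versions are simplified to a major int and a (major, minor)
    tuple for illustration.
    """
    if cuda_major <= 12:
        return ["cutlass", "cudnn"]
    if cudnn_version < (9, 15):
        return ["cutlass", "cudnn"]
    return ["cudnn", "cutlass"]

assert order_fp4_backends(12, (9, 15)) == ["cutlass", "cudnn"]
assert order_fp4_backends(13, (9, 14)) == ["cutlass", "cudnn"]
assert order_fp4_backends(13, (9, 15)) == ["cudnn", "cutlass"]
```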

### Pytest outputs
`pytest tests/gemm/test_mm_fp4.py`
* SM100 (B200) CUDA 13 & cuDNN 9.15: `900 passed, 2532 skipped in
125.19s (0:02:05)`
* SM100 (B200) CUDA 12 & cuDNN 9.15: `900 passed, 2532 skipped in
125.67s (0:02:05)`
* SM120 (RTX 5090) CUDA 13 & cuDNN 9.15: `720 passed, 2712 skipped in
76.50s (0:01:16)`

### Example microbenchmark outputs:
On SM100 (B200) CUDA 13 & cuDNN 9.15
```
flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck
[PERF] cudnn          :: median time 0.018 ms; std 0.000 ms; achieved tflops 3797.932 TFLOPs/sec; achieved tb_per_sec 1.884 TB/sec
[PERF] cutlass        :: median time 0.020 ms; std 0.000 ms; achieved tflops 3440.640 TFLOPs/sec; achieved tb_per_sec 1.707 TB/sec
[PERF] trtllm         :: median time 0.031 ms; std 0.000 ms; achieved tflops 2187.427 TFLOPs/sec; achieved tb_per_sec 1.085 TB/sec
[PERF] auto           :: median time 0.018 ms; std 0.000 ms; achieved tflops 3840.714 TFLOPs/sec; achieved tb_per_sec 1.905 TB/sec
/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[PERF] cudnn          :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
[PERF] auto           :: median time 0.021 ms; std 0.000 ms; achieved tflops 3237.753 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec

## Autotune
/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck --autotune
2025-11-11 23:43:23,715 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:25,789 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:25,790 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:26,251 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:26,251 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:26,327 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:26,327 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:26,335 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.016 ms; std 0.000 ms; achieved tflops 4129.171 TFLOPs/sec; achieved tb_per_sec 2.048 TB/sec
[PERF] cutlass_autotun:: median time 0.019 ms; std 0.000 ms; achieved tflops 3513.845 TFLOPs/sec; achieved tb_per_sec 1.743 TB/sec
[PERF] trtllm_autotune:: median time 0.026 ms; std 0.000 ms; achieved tflops 2613.338 TFLOPs/sec; achieved tb_per_sec 1.296 TB/sec
[PERF] auto_autotune  :: median time 0.016 ms; std 0.000 ms; achieved tflops 4128.768 TFLOPs/sec; achieved tb_per_sec 2.048 TB/sec

/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck --autotune
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
2025-11-11 23:43:37,942 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:43,116 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:43:43,116 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:43:43,124 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.020 ms; std 0.000 ms; achieved tflops 3370.154 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
[PERF] auto_autotune  :: median time 0.020 ms; std 0.000 ms; achieved tflops 3370.692 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
```

On SM100 (B200) CUDA 12 & cuDNN 9.15
```
flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck
[PERF] cudnn          :: median time 0.023 ms; std 0.001 ms; achieved tflops 2975.898 TFLOPs/sec; achieved tb_per_sec 1.476 TB/sec
[PERF] cutlass        :: median time 0.020 ms; std 0.000 ms; achieved tflops 3370.423 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
[PERF] trtllm         :: median time 0.031 ms; std 0.000 ms; achieved tflops 2187.427 TFLOPs/sec; achieved tb_per_sec 1.085 TB/sec
[PERF] auto           :: median time 0.020 ms; std 0.000 ms; achieved tflops 3371.229 TFLOPs/sec; achieved tb_per_sec 1.672 TB/sec
(py312) root@84ef83abb1b5:/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[PERF] cudnn          :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
[PERF] auto           :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec

## Autotune
/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck --autotune
2025-11-11 23:42:43,378 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,451 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:45,451 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,910 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:45,910 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,986 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:45,986 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:45,993 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.021 ms; std 0.000 ms; achieved tflops 3190.355 TFLOPs/sec; achieved tb_per_sec 1.583 TB/sec
[PERF] cutlass_autotun:: median time 0.019 ms; std 0.000 ms; achieved tflops 3551.330 TFLOPs/sec; achieved tb_per_sec 1.762 TB/sec
[PERF] trtllm_autotune:: median time 0.026 ms; std 0.000 ms; achieved tflops 2621.440 TFLOPs/sec; achieved tb_per_sec 1.300 TB/sec
[PERF] auto_autotune  :: median time 0.019 ms; std 0.000 ms; achieved tflops 3551.628 TFLOPs/sec; achieved tb_per_sec 1.762 TB/sec
flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck --autotune
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
2025-11-11 23:42:55,176 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:58,600 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
2025-11-11 23:42:58,601 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-11 23:42:58,608 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
[PERF] cudnn_autotune :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
[PERF] auto_autotune  :: median time 0.021 ms; std 0.000 ms; achieved tflops 3238.249 TFLOPs/sec; achieved tb_per_sec 1.606 TB/sec
```

On SM120 (RTX 5090) CUDA 13 & cuDNN 9.15
```
/flashinfer/benchmarks$ python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --use_nvfp4 --refcheck
[INFO] trtllm backend does not support this configuration: BackendSupportedError: mm_fp4 does not support backend 'trtllm' with capability 120
[PERF] cudnn          :: median time 0.058 ms; std 0.000 ms; achieved tflops 1167.143 TFLOPs/sec; achieved tb_per_sec 0.579 TB/sec
[PERF] cutlass        :: median time 0.060 ms; std 0.000 ms; achieved tflops 1135.056 TFLOPs/sec; achieved tb_per_sec 0.563 TB/sec
[PERF] auto           :: median time 0.058 ms; std 0.000 ms; achieved tflops 1158.952 TFLOPs/sec; achieved tb_per_sec 0.575 TB/sec
/flashinfer/benchmarks$ python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 4608 --out_dtype bfloat16 --backends cudnn cutlass trtllm auto --use_128x4_sf_layout --refcheck
[INFO] cutlass backend does not support this configuration: ValueError: Only cudnn and auto FP4 GEMM supports mxfp4 quantization.
[INFO] trtllm backend does not support this configuration: BackendSupportedError: mm_fp4 does not support backend 'trtllm' with capability 120
[PERF] cudnn          :: median time 0.054 ms; std 0.000 ms; achieved tflops 1241.735 TFLOPs/sec; achieved tb_per_sec 0.616 TB/sec
[PERF] auto           :: median time 0.054 ms; std 0.000 ms; achieved tflops 1241.735 TFLOPs/sec; achieved tb_per_sec 0.616 TB/sec
```

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

#1722 
<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * "auto" backend selection for FP4 ops to choose backend at runtime
  * cuDNN, CUTLASS and TRTLLM selectable as FP4 GEMM backends
  * CUDA/cuDNN version awareness to guide auto-backend heuristics

* **Improvements**
* Runtime capability checks replace static backend lists; unsupported
backends are removed dynamically
  * Heuristic-driven auto-backend selection required for automatic mode
* Expanded autotuning/warmup across backends and relaxed FP4 validation
tolerance

* **Tests**
* Tests updated and added to exercise auto-backend scenarios and relaxed
constraints

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

`bench_mm_fp8.py` was not functioning because `res` was being passed as the
fourth positional argument when it should have been given as `out=res`:
```
def mm_fp8(
    a: torch.Tensor,
    b: torch.Tensor,
    alpha: Optional[torch.Tensor] = None,
    out_dtype: torch.dtype = torch.bfloat16,
    out: Optional[torch.Tensor] = None,
    backend: Literal["trtllm_low_latency"] = "trtllm_low_latency",
):
```

Output after fix:
```
flashinfer$ python3 benchmarks/bench_mm_fp8.py 
2025-11-21 09:38:10,084 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,328 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=2560 k=16384 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 6.36 TFLOPs/s over 0.013199 ms, 3.18 TB/s
2025-11-21 09:38:10,551 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,573 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=2560 k=32768 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 7.28 TFLOPs/s over 0.023040 ms, 3.64 TB/s
2025-11-21 09:38:10,671 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,692 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=5120 k=16384 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 8.31 TFLOPs/s over 0.020191 ms, 4.16 TB/s
2025-11-21 09:38:10,789 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,813 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=5120 k=32768 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 9.40 TFLOPs/s over 0.035696 ms, 4.70 TB/s
2025-11-21 09:38:10,918 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:10,941 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=8192 k=16384 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 9.16 TFLOPs/s over 0.029312 ms, 4.58 TB/s
2025-11-21 09:38:11,045 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-11-21 09:38:11,072 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
mm_fp8 m=1 n=8192 k=32768 in_dtype=torch.float8_e4m3fn out_dtype=torch.bfloat16: 10.14 TFLOPs/s over 0.052959 ms, 5.07 TB/s
...
```
Also changed the measurement methodology slightly to use CUPTI. The
previous methodology inflated performance numbers because it neither
flushed the L2 cache nor used a rotating buffer to start from a cold
cache. With `enable_cupti=True`, the benchmark should now produce much
more accurate numbers thanks to the L2 flush.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Adjusted benchmark timing settings to shorten warm-up and measurement
durations for faster test runs.
* Enabled CUPTI profiling for more detailed GPU performance metrics in
FP8 matrix-multiplication benchmarks.
* Made non-functional parameter/argument updates and clarifying
comments; no changes to core computation logic.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description
tl;dr: this PR adds a logging system for input/output tracking to
aid debugging of FlashInfer APIs via a `@flashinfer_api` decorator.

**This PR does not label `@flashinfer_api` to every FlashInfer API --
many operations are missing labels. Further labeling is left for
subsequent work.**

This PR introduces a production-ready API logging infrastructure that
tracks function calls, arguments, and return values via a simple
one-line decorator. Any function can be decorated with `@flashinfer_api`
to have its input/output values tracked in the API log.

Key Features:
* Logging level controlled by `FLASHINFER_LOGLEVEL`
* Log destination set by `FLASHINFER_LOGDEST`; defaults to `stdout`
* Zero overhead when disabled (level 0 returns original function) as
seen from `benchmarks/bench_logging_overhead.py`
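
For illustration, a minimal sketch of decorating a function (the import path below is an assumption, not confirmed by this PR):
```python
import torch

# import path is an assumption for illustration
from flashinfer.api_logging import flashinfer_api


@flashinfer_api
def scale_tensor(x: torch.Tensor, factor: float) -> torch.Tensor:
    # inputs and the return value are recorded when FLASHINFER_LOGLEVEL >= 1
    return x * factor
```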

Example usage
```
export FLASHINFER_LOGLEVEL=1
export FLASHINFER_LOGDEST="./flashinfer_api.log"

python3 benchmarks/flashinfer_benchmark.py --routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 fa2_tc cudnn trtllm-gen trtllm-gen-native --page_size 16 --batch_size 1 --s_qo 1 --s_kv 1024 --num_qo_heads 64 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128 --random_actual_seq_len -vv --refcheck --q_dtype bfloat16 --kv_dtype bfloat16
```
produces log
```
================================================================================
[2025-11-20 17:51:18] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.5.2
CUDA toolkit version: 13.0
cuDNN version: 91600
Number of GPUs: 1
  GPU 0: NVIDIA B200
    Compute capability: 10.0 (SM100)
PyTorch version: 2.9.0+cu130
================================================================================

[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.plan
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.plan
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.plan
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.run
[2025-11-20 17:51:19] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.run
...

```

`export FLASHINFER_LOGLEVEL=3` produces:
```
(System Info same as above)
================================================================================
[2025-11-20 17:51:58] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
--------------------------------------------------------------------------------
Positional input arguments:
  arg[0]:
    <flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper object at 0x1234399e3410>
  arg[1]:
    Tensor(
      shape=(134217728,)
      stride=(1,)
      dtype=torch.int8
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  arg[2]:
    'HND'
Keyword input arguments:
  use_cuda_graph=
    True
  use_tensor_cores=
    False
  paged_kv_indptr_buffer=
    Tensor(
      shape=(2,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  paged_kv_indices_buffer=
    Tensor(
      shape=(6,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  paged_kv_last_page_len_buffer=
    Tensor(
      shape=(1,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
    )
  backend=
    'fa2'
Default parameters (not explicitly provided):
  jit_args= [DEFAULT]
    None
Output value:
  None
================================================================================
...
```
`export FLASHINFER_LOGLEVEL=5` produces:
```
(System Info same as above)
================================================================================
[2025-11-20 17:52:23] FlashInfer API Call: BatchDecodeWithPagedKVCacheWrapper.__init__
--------------------------------------------------------------------------------
Positional input arguments:
  arg[0]:
    <flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper object at 0x7a9fd9a88c0>
  arg[1]:
    Tensor(
      shape=(134217728,)
      stride=(1,)
      dtype=torch.int8
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=0
      max=0
      mean=0.000000
    )
  arg[2]:
    'HND'
Keyword input arguments:
  use_cuda_graph=
    True
  use_tensor_cores=
    False
  paged_kv_indptr_buffer=
    Tensor(
      shape=(2,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=0
      max=6
      mean=3.000000
    )
  paged_kv_indices_buffer=
    Tensor(
      shape=(6,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=0
      max=5
      mean=2.500000
    )
  paged_kv_last_page_len_buffer=
    Tensor(
      shape=(1,)
      stride=(1,)
      dtype=torch.int32
      device=cuda:0
      requires_grad=False
      is_contiguous=True
      min=4
      max=4
      mean=4.000000
    )
  backend=
    'fa2'
Default parameters (not explicitly provided):
  jit_args= [DEFAULT]
    None
Output value:
  None
================================================================================
...
```

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added API logging feature configurable via environment variables
(FLASHINFER_LOGLEVEL for level control, FLASHINFER_LOGDEST for
destination)
* Supports five verbosity levels with function names, inputs, outputs,
metadata, and tensor statistics
  * Zero-overhead operation when disabled

* **Tests**
  * Added comprehensive logging test suite

* **Documentation**
  * Added logging configuration and usage documentation

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

New function to validate that the indices dtype, when provided, is
`int32`. To close
flashinfer-ai/flashinfer#2115.
There are now two separate validation functions in this file. I will
move them to the C++ side later when I have more bandwidth, probably
after Thanksgiving. Just a short fix for now; you can close this if
you'd rather wait for that.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

flashinfer-ai/flashinfer#2115
<!-- Link any related issues here -->

Relevant to the issue. Running the reproducer code from the issue now produces:
```
(flashinfer) raayan@uril-1:~/projects/flashinfer$ python test.py 
tensor([1, 1, 0, 0], device='cuda:0', dtype=torch.int32)
Traceback (most recent call last):
  File "/home/raayan/projects/flashinfer/test.py", line 15, in <module>
    incorrect_samples = flashinfer.sampling.top_k_top_p_sampling_from_logits(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/raayan/projects/flashinfer/flashinfer/sampling.py", line 1031, in top_k_top_p_sampling_from_logits
    _check_indices_dtype(indices)
  File "/home/raayan/projects/flashinfer/flashinfer/sampling.py", line 487, in _check_indices_dtype
    raise ValueError(f"indices must have dtype torch.int32, got {indices.dtype}")
ValueError: indices must have dtype torch.int32, got torch.int64
```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Improvements**
* Enforced that indices passed to sampling operations must use int32,
adding runtime validation before sampling.

* **Documentation**
* Clarified docstrings to state the int32 requirement for indices
parameters.

* **Tests**
* Updated and expanded tests to cover the new dtype validation paths and
related error cases.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Raayan Dhar <raayan.dhar@gmail.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Update the autotuner's input tensor random range from [0, 1) to [-5, 5),
giving a larger range that is closer to real tensor values. A minimal
sketch of the new range is shown below.
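
A minimal sketch of drawing uniform values in [-5, 5) with torch (the exact distribution details in the PR may differ):
```python
import torch

shape = (1024, 1024)
# uniform in [-5, 5) instead of torch.rand's default range of [0, 1)
x = torch.rand(shape, device="cuda") * 10.0 - 5.0
```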

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved tensor initialization used during autotuning: values are now
drawn from a symmetric range around zero ([-5, 5]) with a more
uniform-like distribution, yielding more consistent and stable parameter
tuning results.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

Enable XQA with speculative decoding and add a mask tensor argument to
`trtllm_batch_decode_with_kv_cache`. A sketch of building such a mask is
shown below.
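
For illustration, one way to build a causal mask for `q_seq_len` speculative tokens (helper name and mask convention are assumptions, not necessarily the PR's exact helper):
```python
import torch

def spec_dec_causal_mask(q_len: int, kv_len: int, device: str = "cuda") -> torch.Tensor:
    """Bool mask of shape (q_len, kv_len): each draft token sees the whole
    existing KV cache plus itself and earlier draft tokens."""
    q_pos = torch.arange(kv_len - q_len, kv_len, device=device).unsqueeze(1)
    kv_pos = torch.arange(kv_len, device=device).unsqueeze(0)
    return kv_pos <= q_pos

mask = spec_dec_causal_mask(q_len=4, kv_len=1024)  # 4 speculative tokens
```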

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Speculative decoding: multi-token query support (q_seq_len) with
optional attention mask threaded end-to-end.

* **API**
* Public APIs updated to accept q_seq_len and an optional mask;
automatic reshaping and runtime checks for multi-token decoding.

* **JIT / Build**
* JIT now emits SPEC_DEC-enabled variants and includes spec-dec flags in
generated specs.

* **Backend / Runtime**
* Mask propagation and architecture-aware backend selection improved for
compatible kernels.

* **Tests**
* Added helpers and tests to generate causal masks and validate
multi-token speculative decoding.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Qidi Sang <200703406+qsang-nv@users.noreply.github.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added optional communication-backend parameter for multi-node memory
and buffer allocation to allow using a provided communicator for handle
transfer.

* **Bug Fixes / Reliability**
* Multi-node synchronization now uses the provided communicator's
barrier when available, preserving previous behavior otherwise.

* **Tests**
* Added end-to-end tests covering custom communication backends and
multi-node all-reduce synchronization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Anerudhan and others added 23 commits December 22, 2025 16:43
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Lowered minimum cuDNN version requirement for FP8 support from 9.18.0
to 9.17.1, enabling FP8 functionality on earlier cuDNN versions.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

This is a port of NVIDIA/TensorRT-LLM#9822, which was done by @bobboli.

This feature is necessary for SGLang integration because some DP workers
may have 0 tokens; the workaround of using a dummy token is quite messy
and brittle.

## 🔍 Related Issues

Follow up to flashinfer-ai/flashinfer#2102

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved robustness of mixture-of-experts all-to-all communication to
gracefully handle scenarios with zero local tokens, preventing
synchronization failures and ensuring stable operation in edge cases.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…#2257)

<!-- .github/pull_request_template.md -->

## 📌 Description

Support in-place output updates for `get_batch_indices_positions`. Users
can pre-allocate `batch_indices` and `positions` to avoid additional
copies; see the sketch below.
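
A sketch of the intended usage (the output-buffer keyword names are assumptions about the new API, not confirmed by this PR):
```python
import torch
import flashinfer

seq_lens = torch.tensor([5, 3], dtype=torch.int32, device="cuda")
append_indptr = torch.tensor([0, 5, 8], dtype=torch.int32, device="cuda")
nnz = 8  # total number of appended tokens

# allocate once, reuse across decode steps; keyword names are assumptions
batch_indices = torch.empty(nnz, dtype=torch.int32, device="cuda")
positions = torch.empty(nnz, dtype=torch.int32, device="cuda")
flashinfer.get_batch_indices_positions(
    append_indptr, seq_lens, nnz,
    batch_indices=batch_indices, positions=positions,
)
```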

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Public API now accepts optional pre-allocated output buffers for batch
indices and positions, enabling memory reuse while preserving previous
behavior.
* Pre-allocated buffers are validated for compatibility; automatic
allocation remains the fallback.

* **Documentation**
* Docstrings clarified to describe the new optional outputs, validation
rules (shape/dtype/device), and allocation behavior.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e N is not divisible by ScaleGranularityN. (#2261)

<!-- .github/pull_request_template.md -->

## 📌 Description

The SM120 CUTLASS blockwise gemm kernel requires dimensions like N to be
multiples of 128 due to hardware constraints
(https://github.com/NVIDIA/cutlass/blob/3f4c086d09bd1dc55defb955862f333893bbb28b/include/cutlass/gemm/collective/sm120_mma_tma_blockwise_scaling.hpp#L345C5-L346).

We encountered the shape `a: torch.Size([1, 1, 2688]), b: torch.Size([1, 2688,
10304]), scale_a: torch.Size([]), scale_b: torch.Size([]), out:
torch.Size([1, 1, 10304]), workspace_buffer: torch.Size([33554432])`
from Nemotron-Nano-v3, where 10304 is not a multiple of 128, so the
CUTLASS GEMM does not handle it properly. In this PR, we pad N and slice
the result to make it work, as sketched below.
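
A minimal standalone sketch of the pad-and-slice idea (the actual change lives inside the GEMM wrapper; `a @ b` stands in for the blockwise FP8 kernel):
```python
import torch
import torch.nn.functional as F

GRAN_N = 128  # the SM120 blockwise kernel requires N % 128 == 0

def padded_gemm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pad N up to a multiple of GRAN_N, run the GEMM, slice the result back."""
    n = b.shape[-1]
    pad = (-n) % GRAN_N
    if pad:
        b = F.pad(b, (0, pad))  # zero-pad the N dimension
    out = a @ b                 # stand-in for the blockwise FP8 GEMM
    return out[..., :n] if pad else out
```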


## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* FP8 matrix operations on SM120/SM121 GPUs now support arbitrary input
dimensions, removing the previous K dimension minimum requirement and
enabling broader use cases.

* **Tests**
* Expanded test coverage for FP8 matrix operations with additional
parameter combinations and improved hardware compatibility validation.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…8_append_paged_kv_cache` (#2255)

<!-- .github/pull_request_template.md -->

## 📌 Description

`rope_quantize_fp8_append_paged_kv_cache` is a merged API combining
`rope_quantize` and `append_paged_kv_cache` (#2037). However, the
`typename IdType` from `RopeQuantize` and `AppendPagedKVCache` should
not be merged into a single parameter, since the two can have different
dtypes: `AppendPagedKVCache`'s `IdType` is hardcoded to `int32`, but
`RopeQuantize`'s `IdType` may be `int64` in frameworks.

This PR splits `typename IdType` into separate `typename RoPEIdType` and
`typename PagedKVIdType` parameters, which fixes the accuracy issue when
passing int64 `pos_ids` (the RoPE-side argument typed with `RoPEIdType`)
to the API.

cc @kahyunnam @yzh119

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Generalized position-encoding and paged KV cache handling to support
multiple integer identifier dtypes and improve type consistency.
* **Bug Fixes**
* Enforced/validated consistent integer dtype for index tensors before
processing to reduce dtype-mismatch errors.
* **Tests**
* Expanded tests to cover different integer index dtypes (e.g., int32
and int64) for ROPE and paged KV workflows.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->
In the QKNorm kernel with small batch sizes, we can reduce the number of
blocks launched. This cuts block-launch overhead, especially in the
decode stage; the sketch below illustrates the grid-size idea.

An example result on B200 where (batch_size, num_heads, head_dim) = (128,
8, 128), which is common in the Qwen3 model decode stage:

Before this PR: 2.448us
After this PR: 1.584us
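
A rough sketch of the launch-count idea (the occupancy numbers and names below are illustrative assumptions, not the kernel's actual code):
```python
# one work item per (token, head) pair; previously the grid launched one
# block per work item, even when that far exceeds what the GPU runs at once
batch_size, num_heads = 128, 8
num_work_items = batch_size * num_heads  # 1024 blocks before this PR

num_sms, blocks_per_sm = 148, 2          # illustrative occupancy figures
grid_size = min(num_work_items, num_sms * blocks_per_sm)
# each block then loops over work items with stride `grid_size`
print(f"launch {grid_size} blocks instead of {num_work_items}")
```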

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Optimized GPU kernel grid size calculation to reduce unnecessary block
launches and improve overall performance efficiency.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Saw some [test
failures](https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/247866505)
on Blackwell boards after #2261; all the failed assertions involve the
large dimension 10304.

Use `.float()` to reduce precision loss during the `cosine_similarity`
(`dot(x, y) / (||x|| * ||y||)`) check; a sketch follows the log below.

```
FAILED tests/gemm/test_bmm_fp8.py::test_bmm_fp8[True-cutlass-res_dtype1-mat2_dtype0-input_dtype0-256-10304-128-16] - AssertionError: assert tensor(0., device='cuda:0') > 0.99
2025-12-24T07:00:08.299846Z 01O FAILED tests/gemm/test_bmm_fp8.py::test_bmm_fp8[False-cudnn-res_dtype1-mat2_dtype0-input_dtype1-256-10304-128-16] - AssertionError: assert tensor(0., device='cuda:0') > 0.99
... # the failure occurs for all backends (cutlass, cudnn, etc.)
```
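
A minimal sketch of this style of check (helper name hypothetical; the key point is casting to fp32 before the reduction):
```python
import torch
import torch.nn.functional as F

def refcheck_cosine(out: torch.Tensor, ref: torch.Tensor, threshold: float = 0.99):
    # cast to fp32 first: accumulating the dot product and norms in low
    # precision loses accuracy when the reduced dimension is large (e.g. 10304)
    sim = F.cosine_similarity(out.float().reshape(-1), ref.float().reshape(-1), dim=0)
    assert sim > threshold, f"cosine similarity {sim:.4f} <= {threshold}"
```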

cc: @zihaoye @bkryu 


## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Improved test accuracy by ensuring tensor comparisons use
floating-point precision for cosine similarity calculations.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Add support for GEMM with MXFP8 (`bmm_mxfp8`).

At this time only cuDNN is supported.

Added test `tests/gemm/test_bmm_mxfp8.py`

Added routine `bmm_mxfp8` to `flashinfer_benchmark`.

Benchmark results for `bmm_mxfp8` (on B200 GPU):
```
python benchmarks/flashinfer_benchmark.py \
--routine bmm_mxfp8 -vv \
--num_iters 30 \
--batch_size 128 \
--m 512 --n 512 --k 4096 \
--out_dtype bfloat16 \
--backends cudnn \
--refcheck

[PERF] cudnn          :: median time 0.117 ms; std 0.001 ms; achieved tflops 2347.650 TFLOPs/sec; achieved tb_per_sec 0.040 TB/sec
```

And `bmm_fp8` for comparison:
```
python benchmarks/flashinfer_benchmark.py \
--routine bmm_fp8 -vv \
--num_iters 30 \
--batch_size 128 \
--m 512 --n 512 --k 4096 \
--input_dtype fp8_e4m3 \
--mat2_dtype fp8_e4m3 \
--out_dtype bfloat16 \
--backends cudnn \
--refcheck

[PERF] cudnn          :: median time 0.116 ms; std 0.001 ms; achieved tflops 2369.049 TFLOPs/sec; achieved tb_per_sec 0.041 TB/sec
```

When profiling with `ncu`, the kernel
`nvjet_sm100_qqtst_128x256_128x6_2x1_2cta_v_bz_Avec32UE8M0_Bvec32UE8M0_NNT`
appears to be the one dispatched.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

flashinfer-ai/flashinfer#2209

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added MXFP8 (mixed 8-bit float) batched matrix multiplication with
cuDNN acceleration and package-level export.

* **Tests**
* Added parameterized tests validating MXFP8 BMM against reference
results across shapes, dtypes, layouts, backends, and autotune modes.

* **Chores**
* Updated benchmark catalog and backend-support mappings to include
MXFP8 BMM.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

I found that the NVFP4 implementation achieved only a 1.3-1.4x speedup
over FP8 on the DeepSeek-V3-0324 model. Since FP4 peak PFLOPS is twice
that of FP8, there should be room for further optimization.

After applying this PR, we get an extra 10-15% speedup on FP4:
1369.89/1192.91 = 1.148, i.e. roughly a 15% speedup.

Test command:
```
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 1000 --random-input 1000 --random-output 1000 --max-concurrency 60 --port 30000 --host 0.0.0.0
```

accuracy
```
+------------------+-----------+----------+----------+-------+---------+---------+
| Model            | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+==================+===========+==========+==========+=======+=========+=========+
| DeepSeek-V3-0324 | aime24    | mean_acc | default  |   300 |  0.5467 | default |
+------------------+-----------+----------+----------+-------+---------+---------+ 
```

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Optimized FP4/FP8 quantization paths with improved register efficiency
* Enhanced kernel launch configuration to improve GPU occupancy and
performance
  * Streamlined accumulation processes to reduce memory footprint

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: bruce.xu <bruce.xu@gmicloud.ai>
Co-authored-by: bruce.xu <bruce.xu@gmicloud.ai>
<!-- .github/pull_request_template.md -->

## 📌 Description

Add CLAUDE.md as a contribution guide for agents (and humans).
Add several skills (adding a CUDA operator to FlashInfer, debugging,
profiling); this list will grow in the future.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: aleozlx <aleyang@nvidia.com>
Co-Authored-By: bkryu <bkryu@nvidia.com>
Co-Authored-By: nvmbreughe <nvmbreughe@nvidia.com>
Co-Authored-By: jimmyzho <jimmzhou@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Element-wise tensor scaling (in-place/out-of-place) with JIT-backed
modules and a simple Python API supporting FP16, BF16, and FP32; AOT
pre-generation support.

* **Tests**
* Unit tests covering FP16, BF16, FP32 across sizes, in-place outputs,
and invalid-input handling.

* **Chores**
* Build integration to pre-generate scale modules and package export of
the new API.

* **Documentation**
* Kernel benchmarking tutorial; CUDA crash debugging guide;
comprehensive developer workflow doc.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Alex Yang <aleozlx@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: jimmzhou <jimmzhou@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: bkryu <bkryu@nvidia.com>
Co-authored-by: nvmbreughe <nvmbreughe@nvidia.com>
…orm+FP4Quant fusion kernels (#2260)

<!-- .github/pull_request_template.md -->

## 📌 Description

This PR enhances the `rmsnorm_fp4quant` and `add_rmsnorm_fp4quant`
CuTe-DSL kernels with two key improvements:
* **Optional output allocation**:` y_fp4` and `block_scale` outputs can
now be either provided for in-place update or omitted for automatic
allocation and return
* **Global scale support**: Both fusion patterns now accept an optional
`global_scale` tensor (torch.Tensor | None, shape [1], dtype float32)
for NVFP4 quantization, enabling proper dynamic-range scaling when
`global_scale` is pre-computed. It should not be provided for MXFP4.

File Changes:
* `rmsnorm_fp4quant.py` / `add_rmsnorm_fp4quant.py`: Added global_scale:
torch.Tensor | None = None parameter; kernel now reads global scale from
device memory and incorporates it into block scale computation
* `bench_cute_dsl_rmsnorm_fp4quant.py` /
`bench_cute_dsl_add_rmsnorm_fp4quant.py`: Updated unfused baseline to
measure time for (add +) rmsnorm + fp4 quant, instead of measuring
separately.
* `test_rmsnorm_fp4_quant_cute_dsl.py` /
`test_add_rmsnorm_fp4_quant_cute_dsl.py`: Added auto-allocation tests,
global scale verification tests, and fused-vs-separate comparison tests.

API Changes:
```
# Before: outputs required
rmsnorm_fp4quant(x, weight, y_fp4, block_scale, ...)

# After: outputs optional, global_scale supported
y_fp4, block_scale = rmsnorm_fp4quant(x, weight, global_scale=gs, ...)  # auto-allocate
rmsnorm_fp4quant(x, weight, y_fp4, block_scale, global_scale=gs, ...)   # in-place
```

<details>
<summary>B200 (SM100) Benchmarks</summary>

```
$ python3 bench_cute_dsl_rmsnorm_fp4quant.py 
================================================================================
Fused RMSNorm + FP4 Quantization Benchmark
================================================================================
GPU Compute Capability: SM100

Running sanity check...
  OK: (128, 256) - FP4 match 99.8%
  OK: (512, 1024) - FP4 match 99.8%
  OK: (1024, 2048) - FP4 match 99.8%
✓ Confirmed: CuTe-DSL output is equivalent to RMSNorm + fp4_quantize


Batch    Hidden   Fused (µs)   BW (GB/s)  Unfused (µs)   Speedup   
-------------------------------------------------------------------
1000     1536     4.4          898.5      6.8            1.54x     
1000     2048     5.2          1019.4     7.4            1.43x     
1000     4096     6.7          1563.1     12.1           1.80x     
1000     8192     9.2          2291.5     20.2           2.20x     
1000     16384    22.1         1897.4     31.5           1.42x     
1000     32768    31.6         2663.3     52.0           1.65x     
1024     1536     4.4          920.1      6.8            1.55x     
1024     2048     5.1          1050.4     7.4            1.44x     
1024     4096     6.8          1593.1     12.2           1.80x     
1024     8192     9.2          2342.4     20.3           2.21x     
1024     16384    22.9         1880.4     31.8           1.39x     
1024     32768    31.9         2697.1     51.9           1.63x     
2048     1536     5.5          1465.1     9.9            1.80x     
2048     2048     6.5          1663.4     11.6           1.80x     
2048     4096     9.1          2357.9     20.1           2.20x     
2048     8192     16.8         2562.4     34.6           2.06x     
2048     16384    36.5         2357.9     57.3           1.57x     
2048     32768    53.5         3217.2     94.1           1.76x     
3000     1536     6.5          1818.2     12.7           1.96x     
3000     2048     7.7          2033.6     15.2           1.97x     
3000     4096     12.3         2563.2     26.9           2.19x     
3000     8192     22.4         2816.3     50.4           2.25x     
3000     16384    49.0         2569.9     83.1           1.70x     
3000     32768    73.2         3443.0     130.5          1.78x     
4096     1536     7.5          2153.4     15.4           2.05x     
4096     2048     8.8          2434.3     19.3           2.19x     
4096     4096     16.5         2606.7     35.4           2.14x     
4096     8192     29.2         2943.6     66.8           2.29x     
4096     16384    61.3         2803.8     109.1          1.78x     
4096     32768    95.8         3591.7     173.8          1.81x     
5000     1536     8.5          2312.4     18.2           2.14x     
5000     2048     10.4         2531.3     22.9           2.21x     
5000     4096     18.7         2803.9     42.3           2.26x     
5000     8192     35.2         2982.3     80.0           2.27x     
5000     16384    72.7         2889.0     130.0          1.79x     
5000     32768    114.1        3680.8     206.1          1.81x     
8192     1536     11.6         2776.2     27.1           2.33x     
8192     2048     15.6         2747.7     34.3           2.19x     
8192     4096     28.6         3002.4     67.6           2.36x     
8192     8192     52.4         3279.1     127.2          2.42x     
8192     16384    113.9        3021.1     209.4          1.84x     
8192     32768    178.5        3854.4     332.1          1.86x     
10000    1536     14.1         2783.0     31.6           2.23x     
10000    2048     17.8         2944.7     40.3           2.26x     
10000    4096     34.5         3038.7     81.3           2.35x     
10000    8192     62.1         3380.8     153.1          2.46x     
10000    16384    135.2        3106.7     252.2          1.87x     
10000    32768    214.7        3911.2     401.1          1.87x     
15000    1536     19.4         3044.7     45.8           2.36x     
15000    2048     25.2         3126.0     59.7           2.37x     
15000    4096     47.4         3322.2     118.1          2.49x     
15000    8192     89.0         3539.8     224.8          2.53x     
15000    16384    192.3        3274.4     373.5          1.94x     
15000    32768    315.1        3997.2     592.1          1.88x     
16384    1536     20.9         3086.3     50.2           2.40x     
16384    2048     27.2         3165.0     64.8           2.39x     
16384    4096     51.0         3371.5     128.2          2.51x     
16384    8192     96.3         3570.9     245.7          2.55x     
16384    16384    210.2        3272.7     407.1          1.94x     
16384    32768    342.7        4014.3     646.6          1.89x     
25000    1536     30.4         3231.8     75.1           2.47x     
25000    2048     38.7         3392.7     96.8           2.50x     
25000    4096     73.0         3596.6     191.8          2.63x     
25000    8192     142.4        3686.3     369.4          2.59x     
25000    16384    310.0        3386.3     614.6          1.98x     
25000    32768    515.6        4071.7     976.8          1.89x     
32768    1536     38.2         3378.5     96.8           2.53x     
32768    2048     48.2         3568.4     124.3          2.58x     
32768    4096     92.8         3705.0     249.0          2.68x     
32768    8192     184.0        3739.5     482.0          2.62x     
32768    16384    401.8        3424.1     799.3          1.99x     
32768    32768    672.9        4088.8     1312.0         1.95x     
60000    1536     64.1         3682.7     171.8          2.68x     
60000    2048     81.5         3863.4     222.0          2.72x     
60000    4096     162.3        3880.2     449.3          2.77x     
60000    8192     329.5        3822.1     873.7          2.65x     
60000    16384    719.2        3502.5     1458.1         2.03x     
60000    32768    1265.2       3982.2     2440.1         1.93x     
65536    1536     69.3         3723.3     187.5          2.71x     
65536    2048     88.3         3895.6     242.6          2.75x     
65536    4096     176.5        3896.3     489.2          2.77x     
65536    8192     359.2        3830.4     953.7          2.66x     
65536    16384    783.9        3510.1     1590.3         2.03x     
65536    32768    1341.8       4101.3     2705.2         2.02x     

================================================================================
Geomean speedup vs Unfused (rmsnorm + fp4_quantize): 2.10x
================================================================================
Benchmark Complete
================================================================================

$ python3 bench_cute_dsl_add_rmsnorm_fp4quant.py
================================================================================
Fused Add + RMSNorm + FP4 Quantization Benchmark
================================================================================
GPU Compute Capability: SM100

Running sanity check...
  OK: (128, 256) - FP4 match 99.9%
  OK: (512, 1024) - FP4 match 99.9%
  OK: (1024, 2048) - FP4 match 99.9%
✓ Confirmed: CuTe-DSL output is equivalent to torch.add + RMSNorm + fp4_quantize


Batch    Hidden   Fused (µs)   BW (GB/s)  Unfused (µs)   Speedup   
-------------------------------------------------------------------
1000     1536     5.0          1413.5     9.7            1.96x     
1000     2048     5.5          1708.4     10.7           1.95x     
1000     4096     8.9          2094.1     16.4           1.84x     
1000     8192     13.1         2864.0     27.5           2.11x     
1000     16384    33.5         2232.1     44.4           1.33x     
1000     32768    66.7         2243.9     83.8           1.26x     
1024     1536     5.0          1438.2     9.8            1.96x     
1024     2048     5.5          1729.1     10.8           1.95x     
1024     4096     9.0          2121.5     16.6           1.84x     
1024     8192     13.2         2890.1     27.8           2.10x     
1024     16384    34.5         2220.9     45.0           1.31x     
1024     32768    67.4         2272.1     85.5           1.27x     
2048     1536     7.1          2020.8     13.9           1.96x     
2048     2048     8.7          2211.3     16.6           1.92x     
2048     4096     13.1         2928.4     27.4           2.09x     
2048     8192     22.2         3447.5     49.0           2.20x     
2048     16384    61.7         2481.9     89.9           1.46x     
2048     32768    121.2        2525.8     155.4          1.28x     
3000     1536     9.9          2130.1     18.0           1.82x     
3000     2048     10.6         2638.8     21.4           2.02x     
3000     4096     17.1         3275.2     38.1           2.23x     
3000     8192     30.5         3675.4     73.7           2.42x     
3000     16384    86.7         2587.8     128.4          1.48x     
3000     32768    170.0        2639.4     218.1          1.28x     
4096     1536     11.2         2555.9     22.0           1.96x     
4096     2048     12.5         3067.1     27.2           2.18x     
4096     4096     22.1         3462.1     49.6           2.24x     
4096     8192     39.1         3915.4     98.9           2.53x     
4096     16384    115.7        2646.4     170.0          1.47x     
4096     32768    224.7        2725.9     291.4          1.30x     
5000     1536     13.5         2598.1     25.8           1.91x     
5000     2048     14.6         3209.1     32.2           2.21x     
5000     4096     25.9         3609.7     60.3           2.33x     
5000     8192     45.9         4068.6     118.5          2.58x     
5000     16384    137.7        2714.0     202.8          1.47x     
5000     32768    269.2        2777.2     349.1          1.30x     
8192     1536     19.7         2917.3     38.9           1.97x     
8192     2048     20.8         3680.3     49.4           2.38x     
8192     4096     38.8         3941.0     100.4          2.58x     
8192     8192     70.5         4343.5     188.2          2.67x     
8192     16384    220.1        2782.6     326.6          1.48x     
8192     32768    427.1        2867.7     563.7          1.32x     
10000    1536     23.3         3004.2     45.4           1.95x     
10000    2048     24.5         3819.6     59.4           2.43x     
10000    4096     45.4         4112.9     120.5          2.65x     
10000    8192     84.3         4432.8     226.4          2.69x     
10000    16384    267.8        2791.0     393.8          1.47x     
10000    32768    517.6        2888.4     683.8          1.32x     
15000    1536     33.2         3167.9     67.5           2.03x     
15000    2048     34.3         4085.9     90.2           2.63x     
15000    4096     64.7         4334.6     174.9          2.70x     
15000    8192     122.2        4587.7     333.1          2.73x     
15000    16384    397.2        2823.4     582.8          1.47x     
15000    32768    766.5        2925.8     1014.2         1.32x     
16384    1536     36.0         3192.3     74.9           2.08x     
16384    2048     36.9         4145.8     98.1           2.66x     
16384    4096     69.9         4379.2     189.6          2.71x     
16384    8192     132.8        4609.6     363.1          2.73x     
16384    16384    433.0        2828.8     635.6          1.47x     
16384    32768    837.1        2926.2     1113.5         1.33x     
25000    1536     51.3         3417.7     112.4          2.19x     
25000    2048     52.0         4496.5     145.2          2.79x     
25000    4096     102.8        4546.2     283.1          2.75x     
25000    8192     197.7        4725.8     547.1          2.77x     
25000    16384    653.6        2859.5     962.6          1.47x     
25000    32768    1266.7       2950.7     1726.5         1.36x     
32768    1536     64.7         3547.3     144.4          2.23x     
32768    2048     66.0         4639.2     186.7          2.83x     
32768    4096     132.4        4625.2     367.1          2.77x     
32768    8192     256.2        4779.7     713.6          2.78x     
32768    16384    856.9        2858.6     1259.6         1.47x     
32768    32768    1652.6       2964.4     2267.0         1.37x     
60000    1536     112.7        3729.8     255.0          2.26x     
60000    2048     115.0        4876.2     331.3          2.88x     
60000    4096     235.2        4767.4     662.0          2.81x     
60000    8192     462.3        4851.2     1294.6         2.80x     
60000    16384    1560.6       2873.9     2311.2         1.48x     
60000    32768    3008.9       2981.2     4225.7         1.40x     
65536    1536     122.4        3751.8     277.6          2.27x     
65536    2048     124.9        4901.8     361.0          2.89x     
65536    4096     256.2        4780.9     721.2          2.82x     
65536    8192     503.5        4864.7     1412.8         2.81x     
65536    16384    1703.0       2876.7     2508.2         1.47x     
65536    32768    3288.8       2979.2     4617.6         1.40x     

================================================================================
Geomean speedup vs Unfused (add + rmsnorm + fp4_quantize): 1.96x
================================================================================
Benchmark Complete
================================================================================
```
</details>

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added optional global scaling for FP4 quantization; quantization APIs
now return quantized output plus block scales and support
auto-allocation.

* **Benchmark Improvements**
* Benchmarks now propagate global_scale, report fused vs unfused
timings, and show a single speedup metric versus the unfused path with
simplified output formatting.

* **Testing**
* Expanded tests to cover global-scale paths, auto-allocation, swizzled
layouts, large sizes, and introduced two-tier tolerance assertions.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Adds a workflow that runs weekly on Mondays (and can be triggered
manually) to execute `scripts/xfails_tracker.py`, which scans the test
suite for xfail markers and writes a comprehensive report to
`reports/xfails_report.txt`. If the report has changed, the workflow
automatically opens a pull request committing the updated report to the
repository. A sketch of the scanning step follows below.
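
As a rough illustration of the scanning step (the actual
`scripts/xfails_tracker.py` may work differently, e.g. via pytest collection;
everything below is an assumption):

```python
# Hypothetical sketch of an xfail scanner; not the actual xfails_tracker.py.
import pathlib

def collect_xfails(root: str = "tests") -> list[str]:
    entries = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "pytest.mark.xfail" in line:  # decorator or parametrize mark
                entries.append(f"{path}:{lineno}: {line.strip()}")
    return entries

if __name__ == "__main__":
    pathlib.Path("reports").mkdir(exist_ok=True)
    report = "\n".join(collect_xfails()) + "\n"
    pathlib.Path("reports/xfails_report.txt").write_text(report)
```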

## 🔍 Related Issues

Continuation of the testing item from the November roadmap.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Added an automated weekly report generation that detects changes and
opens pull requests to incorporate updates.
* Updated automation to remove generated-attribution boilerplate from
automated commit/PR messages.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Based on a review comment in flashinfer-ai/flashinfer#2127, we can add
support for Int64 indices as well. I did this with `IdType`, following
the pattern used in other files; a hedged usage sketch follows below.
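
A minimal usage sketch of what this enables, assuming the sampling API accepts
an optional `indices` tensor as in the existing tests; the exact signature is
from memory and may differ:

```python
import torch
import flashinfer

probs = torch.rand(2, 32000, device="cuda")
probs = probs / probs.sum(dim=-1, keepdim=True)

# Four requests sharing two probability rows; int64 indices now work
# alongside int32 thanks to the IdType-based dispatch.
indices = torch.tensor([0, 0, 1, 1], dtype=torch.int64, device="cuda")
samples = flashinfer.sampling.sampling_from_probs(probs, indices=indices)
print(samples.shape)  # one sampled token id per index entry
```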

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Test results:

```
(flashinfer) raayan@uril-1:~/projects/flashinfer$ pytest tests/utils/test_sampling.py
============================================================= test session starts =============================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/raayan/projects/flashinfer
configfile: pytest.ini
collected 1884 items

tests/utils/test_sampling.py .......................................................................................................... [  5%]
....................................................................................................................................... [ 12%]
....................................................................................................................................... [ 19%]
....................s..s..s..........................................................................sss........................sss.... [ 27%]
....................................................................................................................................... [ 34%]
..........................ssss................................ssss................................ssss................................s [ 41%]
sss................................ssss................................ssss................................ssss........................ [ 48%]
........ssss................................ssss................................ssss................................ssss............... [ 55%]
.................ssss................................ssss................................ssss................................ssss...... [ 62%]
..........................ssss................................ssss................................ssss................................s [ 70%]
sss................................ssss................................ssss................................ssss........................ [ 77%]
........ssss................................ssss................................ssss................................ssss............... [ 84%]
.................ssss.................................................................................................................. [ 91%]
........................................................sss............................................................................ [ 98%]
.......................                                                                                                                 [100%]

================================================ 1764 passed, 120 skipped in 546.33s (0:09:06) ================================================
(flashinfer) raayan@uril-1:~/projects/flashinfer$
```


## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

---------

Signed-off-by: raayandhar <raayan.dhar@gmail.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

This PR adds an implementation of the Gated Delta Rule (or Gated Delta
Net) on the Hopper architecture to better support Qwen-Next-like
architectures. A reference sketch of the recurrence follows below.
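
For reviewers unfamiliar with the recurrence, here is a naive single-head
reference in PyTorch. It sketches one common formulation of the math only; the
kernel's exact gating convention, chunking, and memory layouts may differ.

```python
import torch

def gated_delta_rule_ref(q, k, v, alpha, beta):
    """Naive reference of one common gated delta rule formulation:
        S_t = alpha_t * (I - beta_t * k_t k_t^T) @ S_{t-1} + beta_t * k_t v_t^T
        o_t = S_t^T @ q_t
    Shapes: q, k: [T, dk]; v: [T, dv]; alpha, beta: [T]."""
    T, dk = k.shape
    dv = v.shape[-1]
    S = torch.zeros(dk, dv, dtype=torch.float32, device=q.device)
    out = torch.empty(T, dv, dtype=torch.float32, device=q.device)
    for t in range(T):
        kt, vt = k[t].float(), v[t].float()
        # delta-rule update with decay gate alpha and write strength beta
        S = alpha[t] * (S - beta[t] * torch.outer(kt, kt @ S)) + beta[t] * torch.outer(kt, vt)
        out[t] = S.t() @ q[t].float()
    return out, S  # per-token outputs and the final state
```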

## 🔍 Related Issues

#1690 

## 🚀 Pull Request Checklist

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

Thanks @jiahanc for initiating the kernel integration and implementing
the API.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* SM90-optimized Gated Delta Rule (GDN) prefill: Python API
(chunk_gated_delta_rule), host launcher, and FFI export; supports
optional alpha/beta gating and returns output and final state.

* **Benchmarks & Tests**
* New GPU benchmark for GDN prefill reporting runtime, TFLOPs and
bandwidth.
* Added reference implementations and comprehensive tests validating
prefill, chunked prefill, and delta-rule behavior.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

2025 -> 2026.

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated copyright year ranges to 2026 in project headers and
documentation.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

Updates the minimum version requirement of nvidia-cutlass-dsl to 4.3.4,
which should resolve the ARM issue in
flashinfer-ai/flashinfer#2279.

## 🔍 Related Issues

#2279 

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Updated internal dependencies to improve stability and compatibility.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

flashinfer-ai/flashinfer#2111 already enabled Hopper FA3 FP8 attention
in `prefill.py`. This is a follow-up PR that makes the same change in
`decode.py`, since `decode.py` actually uses prefill kernels under the
hood. A hedged usage sketch follows below.
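
A rough usage sketch of the path this enables. The dtype-related keyword names
below are assumptions based on this PR's summary and may not match the current
API exactly:

```python
import torch
import flashinfer

# Toy page-table setup: 4 requests, 4 full pages each.
batch_size, page_size, head_dim = 4, 16, 128
num_qo_heads, num_kv_heads = 32, 8
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * 4
kv_indices = torch.arange(batch_size * 4, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float8_e4m3fn,   # assumed kwarg
    kv_data_type=torch.float8_e4m3fn,  # assumed kwarg
)
q = torch.randn(batch_size, num_qo_heads, head_dim, device="cuda").to(torch.float8_e4m3fn)
kv_cache = torch.randn(
    batch_size * 4, 2, page_size, num_kv_heads, head_dim, device="cuda"
).to(torch.float8_e4m3fn)
out = wrapper.run(q, kv_cache)  # fp16/bf16 output per the updated FP8 guidance
```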

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added selectable backend support (including a new backend option) and
explicit output-dtype control for decode/prefill workflows.

* **Improvements**
* Improved FP8 handling and propagation of scales; runtime checks
enforce output-dtype consistency and avoid unnecessary scaling when
scale == 1.0.
  * Backend auto-selection logic enhanced to consider output dtype.

* **Documentation**
  * FP8 guidance updated to allow float16 and bfloat16 outputs.

* **Tests**
  * Added tests validating FP8 paged decoding with the new backend.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
This PR updates the Docker CI image tags to the latest version:
`20260105-a97b5d7`

Updated images:
- flashinfer/flashinfer-ci-cu126:20260105-a97b5d7
- flashinfer/flashinfer-ci-cu128:20260105-a97b5d7
- flashinfer/flashinfer-ci-cu129:20260105-a97b5d7
- flashinfer/flashinfer-ci-cu130:20260105-a97b5d7

Auto-generated by [release-ci-docker
workflow](https://github.com/flashinfer-ai/flashinfer/actions/runs/20731143681)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Updated Docker image tags in CI/CD pipeline configuration

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: yzh119 <11773619+yzh119@users.noreply.github.com>
…-ffi for cute-dsl kernels (#2279)

<!-- .github/pull_request_template.md -->

## 📌 Description

cute-dsl has supported compilation with tvm-ffi since the 4.3 release
(https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/cute_dsl_general/compile_with_tvm_ffi.html),
which lets users pass torch tensors directly with negligible DLPack
conversion cost, without manually creating cute tensors from raw
pointers.

In this PR we refactor the existing cute-dsl kernels to enable tvm-ffi
and simplify the torch -> cute-dsl boilerplate; the DLPack handoff this
builds on is sketched below.
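
The DLPack handoff can be illustrated with plain torch (generic code, not the
cute-dsl integration itself):

```python
import torch
from torch.utils import dlpack

x = torch.randn(8, 8, device="cuda")
capsule = dlpack.to_dlpack(x)        # export: metadata only, no data copy
y = dlpack.from_dlpack(capsule)      # import on the consumer side
assert y.data_ptr() == x.data_ptr()  # same underlying GPU buffer
```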

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* FP4 quant kernels (RMSNorm and Add+RMSNorm) now accept TVM-FFI tensors
and a generic stream instead of raw pointers, simplifying invocation and
runtime flow and improving handling of swizzled vs non‑swizzled scale
layouts.
* Compilation path updated to use TVM-FFI-friendly scaffolding with
symbolic/fake tensors and streams.
* **Documentation**
* Docstrings and user-facing notes updated to describe the tensor-based
inputs, TVM-FFI usage, and swizzle-dependent layout behavior.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

In `flashinfer_benchmark.py`'s attention benchmark, selecting `fa2_tc`
("FlashAttention2 with tensor cores enabled") passed the incorrect
backend name "fa2_tc" to the wrapper when it should have been "fa2".
The bug used to be harmless, but recent commits have caused it to
surface.

This PR changes the benchmark code to fix the issue, as sketched below.

**No library code or unit test code changes, so this will not trigger
unit tests.**
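
A sketch of the kind of mapping the fix performs; the helper name is
hypothetical and the real benchmark code may differ:

```python
def resolve_backend(name: str) -> tuple[str, bool]:
    """Translate the benchmark-level alias "fa2_tc" into the real wrapper
    backend "fa2" plus a tensor-cores flag."""
    if name == "fa2_tc":
        return "fa2", True
    return name, False

backend, use_tensor_cores = resolve_backend("fa2_tc")  # -> ("fa2", True)
```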

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Improved backend configuration handling in batch decoding benchmarks
to ensure correct parameter mapping during wrapper instantiation.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@murphymatt murphymatt requested review from a team and divchenko January 7, 2026 21:54
cyx-6 and others added 5 commits January 7, 2026 15:45
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

Fixes #2284. Before #1641, flashinfer used the torch default generator
`at::cuda::detail::getDefaultCUDAGenerator()`, whereas #1641 creates a
new generator instance on every call. This PR restores the use of
torch's default generator (see the Python-side sketch below).
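
The actual change is on the C++ side, but a Python-side analogue of the
behavior looks like this (the helper name is hypothetical):

```python
import torch

def resolve_generator(device: torch.device, generator: torch.Generator | None = None):
    """Fall back to the device's shared default generator instead of
    constructing a fresh torch.Generator on every call."""
    if generator is not None:
        return generator
    if device.type == "cuda":
        idx = device.index if device.index is not None else torch.cuda.current_device()
        return torch.cuda.default_generators[idx]
    return torch.default_generator
```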

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Sampling now uses a device-aware default random generator, ensuring
consistent and correct sampling behavior across CPU and GPU when no
generator is provided.

* **Chores**
* Small public API update to accept a device context so sampling
routines derive RNG state from the correct device.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…roughput + speculative decoding (#2265)

<!-- .github/pull_request_template.md -->

## 📌 Description

This PR adds optimized decode attention kernels for high throughput
(large batch size) combined with speculative decoding (seqlen_q > 1).

See below for speedups, collected with
`benchmarks/flashinfer_benchmark.py`. The KV sequence length is 16K in
all cases.

| test case | median_time_ms | median_time_ms (opt) | speedup |
|------------------------------------------|----------------|----------------------|----------|
| Qwen3-480B-fp8_e4m3-batchSize8-seqLenQ2 | 0.057 | 0.046 | 1.24 |
| Qwen3-480B-fp8_e4m3-batchSize16-seqLenQ2 | 0.11 | 0.083 | 1.33 |
| Qwen3-480B-fp8_e4m3-batchSize32-seqLenQ2 | 0.213 | 0.168 | 1.27 |
| Qwen3-480B-fp8_e4m3-batchSize40-seqLenQ2 | 0.266 | 0.241 | 1.10 |
| Qwen3-480B-fp8_e4m3-batchSize64-seqLenQ2 | 0.432 | 0.336 | 1.29 |
| Qwen3-480B-fp8_e4m3-batchSize8-seqLenQ4 | 0.109 | 0.048 | 2.27 |
| Qwen3-480B-fp8_e4m3-batchSize16-seqLenQ4 | 0.212 | 0.083 | 2.55 |
| Qwen3-480B-fp8_e4m3-batchSize32-seqLenQ4 | 0.371 | 0.168 | 2.21 |
| Qwen3-480B-fp8_e4m3-batchSize40-seqLenQ4 | 0.472 | 0.245 | 1.93 |
| Qwen3-480B-fp8_e4m3-batchSize64-seqLenQ4 | 0.736 | 0.348 | 2.11 |
| Qwen3-480B-fp8_e4m3-batchSize8-seqLenQ8 | 0.212 | 0.061 | 3.48 |
| Qwen3-480B-fp8_e4m3-batchSize16-seqLenQ8 | 0.37 | 0.106 | 3.49 |
| Qwen3-480B-fp8_e4m3-batchSize32-seqLenQ8 | 0.732 | 0.239 | 3.06 |
| Qwen3-480B-fp8_e4m3-batchSize40-seqLenQ8 | 0.937 | 0.321 | 2.92 |
| Qwen3-480B-fp8_e4m3-batchSize64-seqLenQ8 | 1.456 | 0.484 | 3.01 |
| GPT-OSS-fp8_e4m3-batchSize8-seqLenQ2 | 0.051 | 0.03 | 1.70 |
| GPT-OSS-fp8_e4m3-batchSize16-seqLenQ2 | 0.098 | 0.054 | 1.81 |
| GPT-OSS-fp8_e4m3-batchSize32-seqLenQ2 | 0.188 | 0.104 | 1.81 |
| GPT-OSS-fp8_e4m3-batchSize40-seqLenQ2 | 0.234 | 0.15 | 1.56 |
| GPT-OSS-fp8_e4m3-batchSize64-seqLenQ2 | 0.332 | 0.199 | 1.67 |
| GPT-OSS-fp8_e4m3-batchSize8-seqLenQ4 | 0.099 | 0.038 | 2.61 |
| GPT-OSS-fp8_e4m3-batchSize16-seqLenQ4 | 0.188 | 0.07 | 2.69 |
| GPT-OSS-fp8_e4m3-batchSize32-seqLenQ4 | 0.332 | 0.136 | 2.44 |
| GPT-OSS-fp8_e4m3-batchSize40-seqLenQ4 | 0.418 | 0.2 | 2.09 |
| GPT-OSS-fp8_e4m3-batchSize64-seqLenQ4 | 0.647 | 0.265 | 2.44 |
| GPT-OSS-fp8_e4m3-batchSize8-seqLenQ8 | 0.188 | 0.039 | 4.82 |
| GPT-OSS-fp8_e4m3-batchSize16-seqLenQ8 | 0.332 | 0.065 | 5.11 |
| GPT-OSS-fp8_e4m3-batchSize32-seqLenQ8 | 0.647 | 0.126 | 5.13 |
| GPT-OSS-fp8_e4m3-batchSize40-seqLenQ8 | 0.83 | 0.185 | 4.49 |
| GPT-OSS-fp8_e4m3-batchSize64-seqLenQ8 | 1.29 | 0.245 | 5.27 |

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Generation attention now enforces causal masking during token
generation.

* **Performance / Refactor**
* Improved kernel selection and on-demand loading for better performance
and GPU compatibility.
* Added finer-grained tuning parameters for tile/grouping,
tokens-per-CTA and inflation to enable more optimal kernel choices.

* **Chores**
  * Updated FMHA artifact paths and checksums.

* **Tests**
* Expanded parameterized tests to cover larger batch decoding scenarios.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: yzh119 <zihaoy@nvidia.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Version bumped to 0.6.0 with no functional changes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## 🤖 Installing Claude Code GitHub App

This PR adds a GitHub Actions workflow that enables Claude Code
integration in our repository.

### What is Claude Code?

[Claude Code](https://claude.com/claude-code) is an AI coding agent that
can help with:
- Bug fixes and improvements  
- Documentation updates
- Implementing new features
- Code reviews and suggestions
- Writing tests
- And more!

### How it works

Once this PR is merged, we'll be able to interact with Claude by
mentioning @claude in a pull request or issue comment.
Once the workflow is triggered, Claude will analyze the comment and
surrounding context, and execute on the request in a GitHub action.

### Important Notes

- **This workflow won't take effect until this PR is merged**
- **@claude mentions won't work until after the merge is complete**
- The workflow runs automatically whenever Claude is mentioned in PR or
issue comments
- Claude gets access to the entire PR or issue context including files,
diffs, and previous comments

### Security

- Only approved team members can use this feature.
- Our Anthropic API key is securely stored as a GitHub Actions secret
- Only users with write access to the repository can trigger the
workflow
- All Claude runs are stored in the GitHub Actions run history
- Claude's default tools are limited to reading/writing files and
interacting with our repo by creating comments, branches, and commits.
- We can add more allowed tools by adding them to the workflow file
like:

```
allowed_tools: Bash(npm install),Bash(npm run build),Bash(npm run lint),Bash(npm run test)
```

There's more information in the [Claude Code action
repo](https://github.com/anthropics/claude-code-action).

After merging this PR, let's try mentioning @claude in a comment on any
PR to get started!

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added automated AI code review workflows that run on pull requests and
when the bot is mentioned in comments.
* Reviews can post feedback as PR comments and are configurable with
optional prompts.

* **Chores**
* Workflows verify contributor authorization and only run reviews for
authorized users.
* Reviews assess quality, bugs, performance, security, and test
coverage.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
<!-- .github/pull_request_template.md -->

## 📌 Description

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Improved test suite with a refined hardware check: an FP8-related test
now requires a specific GPU compute capability so it only runs on
compatible hardware, reducing false skips and improving reliability
(sketched below).

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
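
A sketch of the kind of gate described, assuming the common FP8 threshold of
SM89 (Ada); the capability the actual test checks may differ:

```python
import pytest
import torch

major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 9):  # FP8 (e4m3/e5m2) needs SM89 or newer
    pytest.skip("FP8 test requires compute capability >= 8.9", allow_module_level=True)
```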