
[pull] main from NVIDIA:main #598

Merged
pull[bot] merged 4 commits into phu0ngng:main from NVIDIA:main on May 8, 2026

pull[bot] commented May 8, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


cyanguwa and others added 4 commits May 7, 2026 16:34
* remove max512 subbackend

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweak for docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert fp8 t3hd changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove redundant test comparison 0 vs 1 subbackend

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove sub-backend 0 from header docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* remove FP8 v0 legacy code

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove FP8 v0 legacy code

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* replace _impl_v1 with _impl

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweaks for docstrings

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweaks for docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove T3HD in selection logic

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review: drop dead 8.9 FP8 guard and stale FP8 docs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…2758)

* added new implementation of fused_moe_aux_loss_forward kernel

Signed-off-by: Alp Dener <adener@nvidia.com>

* Fix race condition, type-punning, and namespace bugs in fused_moe_aux_loss_v2 kernel

- Accumulate into a float buffer instead of atomicAdd-ing directly into
  aux_loss (which could be fp16/bf16), fixing a buffer overflow and wrong
  results for non-float dtypes
- Zero the accumulator on the host before launch to eliminate the race
  between block 0's init and other blocks' atomicAdds
- Move kernel into fused_router namespace so symbols resolve correctly
- Round block size up to a warp multiple for well-defined shuffles
- Allocate Const_buf with 2 elements to hold both C_coeff and the
  float accumulator

Signed-off-by: Alp Dener <adener@nvidia.com>
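
Two of the fixes listed in the commit above (rounding the block size up to a warp multiple, and accumulating in float before casting to the output dtype) can be illustrated on the host side. The following is a minimal Python sketch, not the kernel code from this change: the function names, the warp size of 32, and the use of `struct` to emulate fp16/fp32 rounding are all assumptions made for illustration.

```python
import struct

WARP_SIZE = 32  # assumption: warp width on current NVIDIA GPUs


def round_up_to_warp(num_threads: int, warp_size: int = WARP_SIZE) -> int:
    """Round a requested block size up to the next multiple of the warp size."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    return ((num_threads + warp_size - 1) // warp_size) * warp_size


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_in_fp16(values):
    """Hypothetical 'before' behaviour: every partial sum is rounded to fp16."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total


def accumulate_in_fp32_then_cast(values):
    """Hypothetical 'after' behaviour: accumulate in fp32, cast once at the end."""
    total = 0.0
    for v in values:
        total = to_fp32(total + to_fp32(v))
    return to_fp16(total)


if __name__ == "__main__":
    print(round_up_to_warp(33))  # 64
    values = [0.001] * 4096      # exact sum is 4.096
    print(accumulate_in_fp16(values))
    print(accumulate_in_fp32_then_cast(values))
```

The fp16 running sum stops growing once each addend falls below half a unit in the last place of the partial sum, which is one reason to keep the accumulator in float and cast only the final result. The race-condition fixes in the commit (zeroing the accumulator on the host, the added synchronization) concern inter-block ordering on the GPU and are not modelled here.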

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added shared memory check on number of experts

Signed-off-by: Alp Dener <adener@nvidia.com>

* removed duplicate syncwarp

Signed-off-by: Alp Dener <adener@nvidia.com>

* updated TE/JAX primitive for fused MoE aux loss to comply with the new V2 API in TE/common

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added missing syncthreads after atomicAdds

Signed-off-by: Alp Dener <adener@nvidia.com>

* restored the small 1grid/1block kernel for casting accumulated float result to DataType

Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed inter-block race on accumulation coefficient

Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed the intermediate coefficient buffer so that it is passed on to the backward pass correctly

Signed-off-by: Alp Dener <adener@nvidia.com>

* removed old kernel, removed _v2 name from new kernel

Signed-off-by: Alp Dener <adener@nvidia.com>

* removed unused num_experts from kernel

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Adapt initial implementation and make quantization bitwise exact

Signed-off-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Yigong Qin <qqqyyy1233@outlook.com>

* Add col

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add fp32

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up tests

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up ref

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up gemm wrapper

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Rename and reformat

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Avoid partial amax folding in gemm

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Expand test coverage

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Expand more tests

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Turn on test for grouped linear sanity

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Rename pertoken to per_token

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Expand .cu test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Format after rebase

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Fix test after rebase

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up cpp test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Extend cpp dequantize test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Only pass `per_token_activation` to forward activation quantizer and clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor fix test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Improve accuracy by unfolding weight per-tensor fp32

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Fold row-wise quantization

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Drop column wise

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up column wise

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Move shared test helpers

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor clean up test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Readability

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Rename

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Further refactor

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up bias

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up cast

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Avoid silently disabling column wise

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* `is_quantizable` returns false

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Error out grouped gemm

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Tighten test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Rename verbose rowwise_amax_is_row_scaled

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Explicitly handle both gemm input and error out

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Nits and lint

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor fix A100 ci

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Update tests/pytorch/utils.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Ziang Li <ziangli@umich.edu>

---------

Signed-off-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Yigong Qin <qqqyyy1233@outlook.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators May 8, 2026
pull[bot] added the ⤵️ pull label May 8, 2026
pull[bot] merged commit c74e5aa into phu0ngng:main May 8, 2026
pull[bot] had a problem deploying to github-pages May 8, 2026 04:33 (Failure)

3 participants