[pull] main from NVIDIA:main #598
Merged
* remove max512 subbackend
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* minor tweak for docstring
* revert fp8 t3hd changes
* remove redundant test comparison 0 vs 1 subbackend
* remove sub-backend 0 from header docstring

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* remove FP8 v0 legacy code
* remove FP8 v0 legacy code
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* replace _impl_v1 with _impl
* minor tweaks for docstrings
* minor tweaks for docstring
* remove T3HD in selection logic
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* address review: drop dead 8.9 FP8 guard and stale FP8 docs

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…2758)

* added new implementation of fused_moe_aux_loss_forward kernel
* Fix race condition, type-punning, and namespace bugs in fused_moe_aux_loss_v2 kernel
  - Accumulate into a float buffer instead of atomicAdd-ing directly into aux_loss (which could be fp16/bf16), fixing a buffer overflow and wrong results for non-float dtypes
  - Zero the accumulator on the host before launch to eliminate the race between block 0's init and other blocks' atomicAdds
  - Move kernel into fused_router namespace so symbols resolve correctly
  - Round block size up to a warp multiple for well-defined shuffles
  - Allocate Const_buf with 2 elements to hold both C_coeff and the float accumulator
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* added shared memory check on number of experts
* removed duplicate syncwarp
* updated TE/JAX primitive for fused MoE aux loss to comply with the new V2 API in TE/common
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* added missing syncthreads after atomicAdds
* restored the small 1grid/1block kernel for casting accumulated float result to DataType
* fixed inter-block race on accumulation coefficient
* fixed the intermediate coefficient buffer getting passed onto the backward pass correctly
* removed old kernel, removed _v2 name from new kernel
* removed unused num_experts from kernel
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
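Two of the fixes listed in the fused MoE aux loss commit (rounding the block size up to a warp multiple, and accumulating in a float buffer before casting to the output dtype) can be illustrated with a small host-side sketch. This is not the kernel code from the PR: the function names are hypothetical, a warp size of 32 is assumed, and a Python float (double precision) stands in for the fp32 accumulator. It shows only the numeric benefit of a wider accumulator and does not model the type-punning or race conditions the commit describes.

```python
import struct

WARP_SIZE = 32  # assumption: threads per warp on current NVIDIA GPUs


def round_up_to_warp_multiple(num_threads, warp_size=WARP_SIZE):
    """Smallest multiple of warp_size that is >= num_threads (hypothetical helper)."""
    return ((num_threads + warp_size - 1) // warp_size) * warp_size


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_in_fp16(values):
    """Running sum that is rounded to fp16 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return total


def accumulate_wide_then_cast(values):
    """Running sum kept in a wider type, cast to fp16 once at the end."""
    total = 0.0
    for v in values:
        total += v
    return to_fp16(total)


ones = [1.0] * 4096
print(round_up_to_warp_multiple(100))   # 128
print(accumulate_in_fp16(ones))         # 2048.0 (2048 + 1 rounds back to 2048 in fp16)
print(accumulate_wide_then_cast(ones))  # 4096.0
```

The fp16 running sum stalls at 2048 because the spacing between representable half-precision values there is 2, so adding 1 rounds back down. Keeping the sum in a wider buffer and casting once, as the separate cast step in the commit list suggests, avoids that loss.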
* Adapt initial implementation and make quantization bitwise exact
* Add col
* Add fp32
* Clean up tests
* Clean up ref
* Clean up gemm wrapper
* Clean up test
* Clean up
* Rename and reformat
* Avoid partial amax folding in gemm
* Expand test coverage
* Expand more tests
* Turn on test for grouped linear sanity
* Rename pertoken to per_token
* Expand .cu test
* Format after rebase
* Fix test after rebase
* Clean up cpp test
* Extend cpp dequantize test
* Only pass `per_token_activation` to forward activation quantizer and clean up
* Minor fix test
* Improve accuracy by unfolding weight per-tensor fp32
* Fold row-wise quantization
* Drop column wise
* Clean up
* Clean up
* Clean up column wise
* Move shared test helpers
* Minor clean up test
* Readability
* Rename
* Further refactor
* Clean up bias
* Clean up cast
* Avoid silently disable column wise
* Clean up
* `is_quantizable` returns false
* Error out grouped gemm
* Tighten test
* Rename verbose rowwise_amax_is_row_scaled
* Clean up
* Explicitly handle both gemm input and error out
* Minor
* Nits and lint
* Minor fix A100 ci
* Update tests/pytorch/utils.py

---------

Signed-off-by: Ziang Li <ziangli@umich.edu>
Co-authored-by: Yigong Qin <qqqyyy1233@outlook.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )