[pull] main from NVIDIA:main by pull[bot] · Pull Request #615 · phu0ngng/TransformerEngine

pull · 2026-05-23T04:32:04Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)


* Initial implementation Signed-off-by: Ziang Li <ziangli@umich.edu>
* Make 4over6 compile time for dequant Signed-off-by: Ziang Li <ziangli@umich.edu>
* Expand 1d fwd+bwd test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add gemm test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add more tests and fix offload Signed-off-by: Ziang Li <ziangli@umich.edu>
* Fix offload Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up arg Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add more test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add more tests Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor cuh kernel impl Signed-off-by: Ziang Li <ziangli@umich.edu>
* Further extract Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add recipe_id Signed-off-by: Ziang Li <ziangli@umich.edu>
* Fix failing unit tests Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor ref Signed-off-by: Ziang Li <ziangli@umich.edu>
* Update comments and docs Signed-off-by: Ziang Li <ziangli@umich.edu>
* Drop unnecessary test_sanity workaround
  The following tests passed:
  `NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py`
  `NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 NVTE_TEST_NVINSPECT_ENABLED=1 NVTE_TEST_NVINSPECT_CONFIG_FILE=tests/pytorch/debug/test_configs/dummy_feature.yaml NVTE_TEST_NVINSPECT_FEATURE_DIRS=transformer_engine/debug/features PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py`
  Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor `QuantizerRole` Signed-off-by: Ziang Li <ziangli@umich.edu>
* Allow separate recipe 4over6 config Signed-off-by: Ziang Li <ziangli@umich.edu>
* Support 2d Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor 2d Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up anti pattern Signed-off-by: Ziang Li <ziangli@umich.edu>
* Enforce 4over6 consistency Signed-off-by: Ziang Li <ziangli@umich.edu>
* Update comments Signed-off-by: Ziang Li <ziangli@umich.edu>
* Update docs Signed-off-by: Ziang Li <ziangli@umich.edu>
* Fix test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Drop test_fusible_ops Signed-off-by: Ziang Li <ziangli@umich.edu>
* Revert "Drop test_fusible_ops"
  This reverts commit 69f9ccc.
  Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor test_fusible_ops Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor ref and extend cpp test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Clean up cpp test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Minor comment Signed-off-by: Ziang Li <ziangli@umich.edu>
* Drop doc Signed-off-by: Ziang Li <ziangli@umich.edu>
* Explicit handle conditional smem buffer Signed-off-by: Ziang Li <ziangli@umich.edu>
* Further clean up Signed-off-by: Ziang Li <ziangli@umich.edu>
* More templates Signed-off-by: Ziang Li <ziangli@umich.edu>
* Simplify cpp Signed-off-by: Ziang Li <ziangli@umich.edu>
* Drop write back lifting Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add MAE and dedicated fast math env var Signed-off-by: Ziang Li <ziangli@umich.edu>
* Harden cpp test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add warning and err fast math coverage Signed-off-by: Ziang Li <ziangli@umich.edu>
* Fold test case and clean up cpp test Signed-off-by: Ziang Li <ziangli@umich.edu>
* Initial 448 vs 256 implementation Signed-off-by: Ziang Li <ziangli@umich.edu>
* Use e4m3 max instead of boolean, more template Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add benchmark script and minor optimization Signed-off-by: Ziang Li <ziangli@umich.edu>
* Use standalone kernels Signed-off-by: Ziang Li <ziangli@umich.edu>
* Use cp async Signed-off-by: Ziang Li <ziangli@umich.edu>
* Add benchmark script Signed-off-by: Ziang Li <ziangli@umich.edu>
* Minor fix after rebase Signed-off-by: Ziang Li <ziangli@umich.edu>
* Naming consistency Signed-off-by: Ziang Li <ziangli@umich.edu>
* Remove 4over6 benchmark Signed-off-by: Ziang Li <ziangli@umich.edu>
* Refactor modes Signed-off-by: Ziang Li <ziangli@umich.edu>
* Relax tol for `test_layernorm_mlp` for `nvfp4_4over6` Signed-off-by: Ziang Li <ziangli@umich.edu>
* Minor fix recipe naming Signed-off-by: Ziang Li <ziangli@umich.edu>
* Remove gradient 4over6 quantization and partially allow SR/RHT Signed-off-by: Ziang Li <ziangli@umich.edu>
* Allow RHT in pytorch ref Signed-off-by: Ziang Li <ziangli@umich.edu>
* Update transformer_engine/pytorch/csrc/quantizer.cpp Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Minor fix TODO lint Signed-off-by: Ziang Li <ziangli@umich.edu>
* Use standard nvfp4 for grad ref in test_fusible_ops.py since 4over6 is not applied to gradient quantizers Signed-off-by: Ziang Li <ziangli@umich.edu>
* Minor fix test-fusible_ops 4over6 helper Signed-off-by: Ziang Li <ziangli@umich.edu>
* Default to 256 for 4over6 Signed-off-by: Ziang Li <ziangli@umich.edu>
* Reset RNG state for each TE ops test
  Adding tests affected RNG in unrelated tests.
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Remove loosened NVFP4 tols in layernorm MLP test.
  Make sure tensors are representable in quantized format.
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…P2P) with FA3 + THD (varlen) (#2596)

* [PyTorch] Add pad_between_seqs support for FlashAttention 3 with CP
  Add support for padding between sequences (pad_between_seqs) in the FlashAttention 3 backend when used with context parallelism (CP). Key changes:
  - backends.py: Pass fa_pad_between_seqs through to FA3 forward/backward
  - context_parallel.py: Handle pad_between_seqs in A2A and P2P CP paths, zero FA3 padding garbage in CP forward, fix a2a backward alignment
  - dot_product_attention.py: Auto-detect pad_between_seqs from cu_seqlens
  - utils.py: Gate FA3 deterministic backward for hdim>=256, fix flash_attn_supported override for cross-attention and large head_dim, disable UnfusedDotProductAttention for pad_between_seqs, add SM100+ FA3 skip
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [PyTorch] Add pad_between_seqs tests for CP and non-CP FlashAttention
  Add test parametrization for pad_between_seqs in flash attention tests. Update run_attention_with_cp.py to support the new parameter and fix batch boundary alignment in the non-CP FA3 path. Run tests in parallel when multiple GPUs are available.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] Add CP deterministic tests to L3 and support TE_PATH in FA test
  Add deterministic CP test runs to L3 FA versions test. Support TE_PATH positional arg and fix GPU threshold for parallel test execution.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [PyTorch] Fix FA3 deterministic gate to match upstream backward constraint
  The previous check disabled FA3 for deterministic mode whenever head_dim_qk > 128, which was overly conservative: FA3 forward supports deterministic execution at any head dim. The actual constraint from flash_api.cpp is that the backward pass does not support deterministic mode when max(head_size, head_size_v) >= 256.
  Narrow the gate to only disable FA3 during training (backward) and raise the threshold to >= 256, checking both head_dim_qk and head_dim_v to handle MLA configs with asymmetric head dimensions.
  Ref: https://github.com/Dao-AILab/flash-attention/blob/ac6f2eb5/hopper/flash_api.cpp#L1370
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [PyTorch] Disable FlashAttention 4 for pad_between_seqs with THD
  The pad_between_seqs gate in get_attention_backend only disabled FlashAttention 2, letting FA4 leak through to the test-time fused-vs-flash comparison. On B200 runners that install flash-attn-4, this caused test_dpa_qkv_layout_thd to compare FusedAttention against an FA4 output whose padded positions contain garbage, producing 48 numerics failures in L3_pytorch_FA_versions_test--B200_1GPU.
  The log message already claimed FA4 would be disabled; this change makes the code match the message: set use_flash_attention_4 = False alongside use_flash_attention_2 when pad_between_seqs is True. FA3 continues to support pad_between_seqs via seqused_q/seqused_k.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] Fix cutlass-dsl utils shadow in FA versions test
  FA4 install brings in nvidia-cutlass-dsl, whose `import cutlass` adds cutlass/base_dsl/ to sys.path. That directory contains a utils/ package that shadows tests/pytorch/utils.py, breaking collection of test_attention_with_cp.py with:
  ImportError: cannot import name 'ModelConfig' from 'utils'
  Prepend $TE_PATH/tests/pytorch to PYTHONPATH so the local utils.py is always resolved first, regardless of what FA4 dependencies install.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* skip tests which OOM in deterministic+backward+hopper+large_configs as it's a known cudnn issue
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* make cp det and nondet tests run in parallel whenever possible
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] L3: gate CP tests per-arch to avoid CI timeout
  PR 2596 added deterministic CP runs to the L3 FA-versions matrix, multiplying CP wall time across every FA version and causing CI timeouts (pipeline 50243000). Run CP tests once per arch instead, picking the FA version each arch's CP code path actually supports:
  - sm90 (H100): FA3 3.0.0b1 - context_parallel.py is FA3-only on Hopper (use_flash_attn_3 threaded throughout, FA4 not wired in; pad_between_seqs gated on use_flash_attn_3 at lines 1038, 1366)
  - sm>90 (B200): latest FA4 - FA3 is not built/installed for sm>90
  Non-CP test_attention.py still runs for every FA version in the array.
  Also drop FA 2.7.3 from the sm90 list (no longer maintained as a target) and bump the FA4 pin from 4.0.0b8 to 4.0.0b11. b8 has an SM90 backward kernel bug fixed by upstream PR #2513 in b11 (get_smem_store_C() got multiple values for argument 'transpose').
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] L3: skip pre-installed FA3 build, per-FA junit XMLs
  Three follow-ups on top of 13ba004 (L3 per-arch CP gating):
  1. Skip the inline FA3 source build when flash_attn_interface is already importable. This makes the script a no-op on FA3 install when the base image has FA3 baked in (companion to TE !573 on te_ci, which auto-sets INSTALL_FA3=${RUN_L3_TESTS} so FA3 is preinstalled for L3 pipelines). Saves ~20 min of L3 H100 wall time once both land. Falls back to the existing inline build when FA3 is not pre-installed.
  2. Suffix junit XMLs with the FA version (pytest_test_attention_fa2_8_3.xml etc.) so per-iteration results are preserved instead of overwritten. Pipeline 50348672 had no per-FA timing visibility because pytest.xml was clobbered by each loop iteration.
  3. Include FA version in test_fail messages so CI dashboards show which FA iteration caused a failure (was "test_attention.py", now "test_attention.py (FA 2.8.3)").
  Also fold the CP_FA_VERSION assignment into the same if-block as FA_versions (was a separate if-block immediately after) since the two are arch-keyed in lockstep.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* b200 shouldn't run FA3 even if present
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* L3: drop stale RUN_L3_TESTS=1 note; use flash_attn_3 for FA3 check
  Address two pending review comments:
  1. The "auto-set when RUN_L3_TESTS=1" annotation on the base-image FA3 preinstall is no longer accurate; drop it so readers don't grep for a coupling that doesn't exist.
  2. `flash_attn_interface` reads like a generic FA API even though the top-level shim is only created by the FA3 install. Switching to `import flash_attn_3` makes the FA3-specific intent unambiguous and matches the FA3 package layout produced by the source build.
  Local validation on H100 (sm90) with FA3 active, TE worktree resolving to the editable install (verified via three-layer import check from /tmp): test_attention_with_cp.py parallel det+nondet: 45 passed / 0 failed nondet (3:52), 33 passed / 0 failed det (2:55). 33 pad-True nondet passes + 21 pad-True det passes confirm the FA3+THD+CP path is exercised; 5 det OOM cases skip cleanly via the existing inline guard.
  Same test scope is exercised by L1_pytorch_distributed_unittest (parallel det+nondet) and the FA3 iteration of L3_pytorch_FA_versions_test; the changes here are L3-only documentation/detection tweaks and do not alter the Python test code, but the L1+L3 CP execution was re-run on the cleaned PR head end-to-end as proof.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Address review nits: bHSS-gated OOM skip; drop Dockerfile.base specifics
  1. Det FusedAttention backward THD/sm90 OOM skip: gate on the actual memory pressure (b*H*S*S) instead of num_heads >= 20. The cuDNN workspace is proportional to bHSS, so a future config with H >= 20 but small b or S would be needlessly skipped under the old guard, while a config with H < 20 but large b*S that hit the same OOM wouldn't be caught. Threshold 1e9 empirically matches the existing 5-case skip set on the test_essential fused subset (cp_2_0, cp_2_2, cp_3_1, cp_4_2, cp_4_3, with bHSS in 1.07B–4.29B) and lets cp_1_0/cp_2_1/cp_2_4/cp_3_2/cp_3_4 (bHSS ~0.40B) keep running.
  2. L3 FA3 install comment: drop the "Dockerfile.base INSTALL_FA3=1" reference. The detection check is the contract; mentioning a specific image variable couples this script to an out-of-tree provisioning detail that may evolve independently.
  Local validation on H100 (sm90) with FA3 active and TE worktree resolving to editable (verified via /tmp-cwd three-layer import check after reinstall; the /usr/local TE shadow had reappeared between sessions): test_attention_with_cp.py parallel det+nondet: 45 passed / 0 failed nondet (4:09), 33 passed / 0 failed det (3:14). 33 pad-True nondet passes + 21 pad-True det passes; 5 det OOM cases skip via the new bHSS gate, the same cases as the old num_heads-only gate.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

* Name the OOM-skip threshold and explain the 128*bHSS workspace observation
  Address review nits on the deterministic THD-backward OOM guard:
  1. Replace the magic number 1_000_000_000 with the named constant SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30, so the value is searchable and labeled.
  2. Replace the prefatory comment with a short note tying the number to cuDNN's actual workspace request (~128 * bHSS bytes, measured on cuDNN 9.21.0 sm90; see local sweep). At bHSS = 1<<30 the request is 128 GiB, which doesn't fit on H100's 80 GB.
  3. Flag the b>=3 caveat for future readers: cuDNN rounds the batch up internally so workspace grows super-linearly past b=2 (b=4 asks for 4x the b=2 workspace, not 2x). The current fused-essential matrix is all b=2, so the threshold stays correct for what the test exercises; the note is there so the next person doesn't have to rediscover it.
  Skip set is unchanged: cp_2_0, cp_2_2, cp_3_1, cp_4_2, cp_4_3.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Reword OOM-skip comment as observations, not cuDNN-internal claims
  We measured the workspace request from outside cuDNN, so the comment should say "observed" rather than asserting what cuDNN does. Reframes the ~128 * bHSS bytes formula and the super-linear b>=3 behavior as empirical observations from our sweep. No code change.
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
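
The bHSS-based OOM skip described in the commit message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the constant name and value (`SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30`) and the observed ~128 * bHSS bytes workspace figure come from the commit message, while the helper names, the `>=` comparison, and the example shapes are hypothetical and are not taken from the TransformerEngine test code.

```python
# Constant name and value are quoted from the commit message.
SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30

# Observed (not guaranteed) workspace request per unit of bHSS, per the
# commit message: ~128 bytes, measured on cuDNN 9.21.0 / sm90 for b <= 2.
OBSERVED_BYTES_PER_BHSS = 128


def bhss(batch: int, num_heads: int, seqlen: int) -> int:
    """Memory-pressure proxy b * H * S * S (hypothetical helper)."""
    return batch * num_heads * seqlen * seqlen


def estimated_workspace_gib(batch: int, num_heads: int, seqlen: int) -> float:
    """Rough workspace estimate in GiB; only meaningful for b <= 2,
    since the commit message reports super-linear growth past b = 2."""
    return OBSERVED_BYTES_PER_BHSS * bhss(batch, num_heads, seqlen) / 2**30


def should_skip_det_thd_bwd(batch: int, num_heads: int, seqlen: int) -> bool:
    """Assumed gate: skip once bHSS reaches the named threshold.
    Whether the real test uses >= or > is not stated in the source."""
    return bhss(batch, num_heads, seqlen) >= SM90_DET_FUSED_THD_BWD_MAX_BHSS


# Hypothetical shapes chosen to land on the two bHSS regimes mentioned above.
print(should_skip_det_thd_bwd(2, 32, 4096), estimated_workspace_gib(2, 32, 4096))
print(should_skip_det_thd_bwd(2, 12, 4096), estimated_workspace_gib(2, 12, 4096))
```

With b=2, H=32, S=4096 the proxy is exactly 1 << 30 (about 1.07B), which corresponds to an estimated 128 GiB request and is therefore skipped; with H=12 the proxy is about 0.40B and the case keeps running.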

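The narrowed FA3 deterministic gate from the second commit message above (disable FA3 only when a deterministic backward pass is needed and max(head_dim_qk, head_dim_v) >= 256) can be expressed as a small predicate. This is a minimal sketch, not the code in utils.py: the function name and its parameters are hypothetical, and only the condition itself is taken from the commit message.

```python
def fa3_allowed_for_determinism(
    deterministic: bool,
    is_training: bool,
    head_dim_qk: int,
    head_dim_v: int,
) -> bool:
    """Hypothetical predicate mirroring the gate described in the commit message.

    FA3 forward is described as deterministic at any head dim, so the gate
    only applies when a backward pass is needed (training) and the larger of
    the two head dims reaches 256.
    """
    needs_det_backward = deterministic and is_training
    if needs_det_backward and max(head_dim_qk, head_dim_v) >= 256:
        return False
    return True


# Asymmetric (MLA-style) head dims: the larger one decides.
print(fa3_allowed_for_determinism(True, True, 192, 256))   # gate trips
print(fa3_allowed_for_determinism(True, True, 192, 128))   # allowed under the new gate
print(fa3_allowed_for_determinism(True, False, 256, 256))  # inference only, allowed
```

Checking both head dimensions matters because, as the commit message notes, MLA configs can have head_dim_qk and head_dim_v that differ; the old head_dim_qk > 128 check would have rejected the second case above.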
zianglih and others added 2 commits May 22, 2026 15:58

pull Bot locked and limited conversation to collaborators May 23, 2026

pull Bot added the ⤵️ pull label May 23, 2026

pull Bot merged commit 80ea313 into phu0ngng:main May 23, 2026

github-actions Bot added the community-contribution label May 23, 2026

pull Bot had a problem deploying to github-pages May 23, 2026 04:33 Failure

pull[bot] merged 2 commits into phu0ngng:main from NVIDIA:main

2 participants
