
[pull] main from NVIDIA:main#615

Merged: pull[bot] merged 2 commits into phu0ngng:main from NVIDIA:main on May 23, 2026

@pull pull Bot commented May 23, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

zianglih and others added 2 commits May 22, 2026 15:58
* Initial implementation

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Make 4over6 compile time for dequant

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Expand 1d fwd+bwd test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add gemm test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add more tests and fix offload

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Fix offload

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up arg

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add more test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add more tests

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor cuh kernel impl

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Further extract

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add recipe_id

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Fix failing unit tests

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor ref

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Update comments and docs

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Drop unnecessary test_sanity workaround

The following tests passed:
`NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py`
`NVTE_GROUPED_LINEAR_SINGLE_PARAM=1 NVTE_TEST_NVINSPECT_ENABLED=1 NVTE_TEST_NVINSPECT_CONFIG_FILE=tests/pytorch/debug/test_configs/dummy_feature.yaml NVTE_TEST_NVINSPECT_FEATURE_DIRS=transformer_engine/debug/features PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 python3 -m pytest --tb=auto tests/pytorch/test_sanity.py`

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor `QuantizerRole`

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Allow separate recipe 4over6 config

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Support 2d

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor 2d

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up anti pattern

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Enforce 4over6 consistency

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Update comments

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Update docs

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Fix test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Drop test_fusible_ops

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Revert "Drop test_fusible_ops"

This reverts commit 69f9ccc.

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor test_fusible_ops

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor ref and extend cpp test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Clean up cpp test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor comment

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Drop doc

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Explicit handle conditional smem buffer

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Further clean up

Signed-off-by: Ziang Li <ziangli@umich.edu>

* More templates

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Simplify cpp

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Drop write back lifting

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add MAE and dedicated fast math env var

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Harden cpp test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add warning and err fast math coverage

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Fold test case and clean up cpp test

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Initial 448 vs 256 implementation

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Use e4m3 max instead of boolean, more template

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add benchmark script and minor optimization

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Use standalone kernels

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Use cp async

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Add benchmark script

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor fix after rebase

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Naming consistency

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Remove 4over6 benchmark

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Refactor modes

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Relax tol for `test_layernorm_mlp` for `nvfp4_4over6`

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor fix recipe naming

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Remove gradient 4over6 quantization and partially allow SR/RHT

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Allow RHT in pytorch ref

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Update transformer_engine/pytorch/csrc/quantizer.cpp

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Minor fix TODO lint

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Use standard nvfp4 for grad ref in test_fusible_ops.py since 4over6 is not applied to gradient quantizers

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Minor fix to test_fusible_ops 4over6 helper

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Default to 256 for 4over6

Signed-off-by: Ziang Li <ziangli@umich.edu>

* Reset RNG state for each TE ops test

Adding tests affected RNG in unrelated tests.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Remove loosened NVFP4 tols in layernorm MLP test.

Make sure tensors are representable in quantized format.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…P2P) with FA3 + THD (varlen) (#2596)

* [PyTorch] Add pad_between_seqs support for FlashAttention 3 with CP

Add support for padding between sequences (pad_between_seqs) in the
FlashAttention 3 backend when used with context parallelism (CP).

Key changes:
- backends.py: Pass fa_pad_between_seqs through to FA3 forward/backward
- context_parallel.py: Handle pad_between_seqs in A2A and P2P CP paths,
  zero FA3 padding garbage in CP forward, fix a2a backward alignment
- dot_product_attention.py: Auto-detect pad_between_seqs from cu_seqlens
- utils.py: Gate FA3 deterministic backward for hdim>=256, fix
  flash_attn_supported override for cross-attention and large head_dim,
  disable UnfusedDotProductAttention for pad_between_seqs, add SM100+
  FA3 skip

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [PyTorch] Add pad_between_seqs tests for CP and non-CP FlashAttention

Add test parametrization for pad_between_seqs in flash attention tests.
Update run_attention_with_cp.py to support the new parameter and fix
batch boundary alignment in the non-CP FA3 path. Run tests in parallel
when multiple GPUs are available.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] Add CP deterministic tests to L3 and support TE_PATH in FA test

Add deterministic CP test runs to L3 FA versions test. Support TE_PATH
positional arg and fix GPU threshold for parallel test execution.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [PyTorch] Fix FA3 deterministic gate to match upstream backward constraint

The previous check disabled FA3 for deterministic mode whenever
head_dim_qk > 128, which was overly conservative — FA3 forward supports
deterministic execution at any head dim. The actual constraint from
flash_api.cpp is that the backward pass does not support deterministic
mode when max(head_size, head_size_v) >= 256.

Narrow the gate to only disable FA3 during training (backward) and
raise the threshold to >= 256, checking both head_dim_qk and head_dim_v
to handle MLA configs with asymmetric head dimensions.

Ref: https://github.com/Dao-AILab/flash-attention/blob/ac6f2eb5/hopper/flash_api.cpp#L1370
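
The narrowed gate described in this commit can be sketched as a small predicate. This is a minimal illustration, not the code in utils.py: the function name and its parameter names are hypothetical, and only the rule itself (deterministic mode plus training plus max head dim >= 256 disables FA3) comes from the text above.

```python
def fa3_allowed_for_determinism(
    head_dim_qk: int,
    head_dim_v: int,
    is_training: bool,
    deterministic: bool,
) -> bool:
    """Hypothetical helper mirroring the gate described in the commit message.

    FA3 forward is treated as deterministic at any head dim, so the gate
    only applies when a backward pass will run (is_training=True).
    """
    if not (deterministic and is_training):
        return True
    # Check both dims so asymmetric (MLA-style) configs are covered.
    return max(head_dim_qk, head_dim_v) < 256


# Asymmetric config below the limit: still allowed in deterministic training.
print(fa3_allowed_for_determinism(192, 128, is_training=True, deterministic=True))
# Either dim reaching 256 disables FA3 for deterministic training.
print(fa3_allowed_for_determinism(128, 256, is_training=True, deterministic=True))
```

Checking the maximum of the two head dims, rather than head_dim_qk alone, is what keeps the sketch consistent with the `max(head_size, head_size_v) >= 256` constraint quoted above.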

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [PyTorch] Disable FlashAttention 4 for pad_between_seqs with THD

The pad_between_seqs gate in get_attention_backend only disabled
FlashAttention 2, letting FA4 leak through to the test-time
fused-vs-flash comparison. On B200 runners that install flash-attn-4,
this caused test_dpa_qkv_layout_thd to compare FusedAttention against
an FA4 output whose padded positions contain garbage, producing 48
numerics failures in L3_pytorch_FA_versions_test--B200_1GPU.

The log message already claimed FA4 would be disabled — this change
makes the code match the message: set use_flash_attention_4 = False
alongside use_flash_attention_2 when pad_between_seqs is True. FA3
continues to support pad_between_seqs via seqused_q/seqused_k.
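
A minimal sketch of the gate after this fix, assuming a simple flags container. `BackendFlags` and `apply_pad_between_seqs_gate` are hypothetical names used only for illustration; `use_flash_attention_2` and `use_flash_attention_4` are the names mentioned above, and `use_flash_attention_3` is assumed by analogy.

```python
from dataclasses import dataclass


@dataclass
class BackendFlags:
    """Hypothetical stand-in for the flags tracked by get_attention_backend."""

    use_flash_attention_2: bool = True
    use_flash_attention_3: bool = True  # name assumed by analogy
    use_flash_attention_4: bool = True


def apply_pad_between_seqs_gate(flags: BackendFlags, pad_between_seqs: bool) -> BackendFlags:
    """Disable FA2 and FA4 together when sequences are padded in between."""
    if pad_between_seqs:
        flags.use_flash_attention_2 = False
        flags.use_flash_attention_4 = False
        # FA3 is left untouched: per the commit message it handles padding
        # through seqused_q / seqused_k.
    return flags


gated = apply_pad_between_seqs_gate(BackendFlags(), pad_between_seqs=True)
print(gated)
```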

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] Fix cutlass-dsl utils shadow in FA versions test

FA4 install brings in nvidia-cutlass-dsl, whose `import cutlass`
adds cutlass/base_dsl/ to sys.path. That directory contains a utils/
package that shadows tests/pytorch/utils.py, breaking collection of
test_attention_with_cp.py with:
  ImportError: cannot import name 'ModelConfig' from 'utils'

Prepend $TE_PATH/tests/pytorch to PYTHONPATH so the local utils.py
is always resolved first, regardless of what FA4 dependencies install.
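
The shadowing and the effect of prepending can be reproduced in isolation. The sketch below is an illustration in Python rather than the shell change itself: it builds two throwaway directories (the names `tests_pytorch` and `base_dsl` are stand-ins), each providing something importable as `utils`, and shows that whichever directory comes first on the search path wins.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def resolve_utils_marker(search_path):
    """Import `utils` with the given directories prepended; return its MARKER."""
    saved_path = list(sys.path)
    saved_module = sys.modules.pop("utils", None)
    try:
        sys.path[:0] = search_path  # prepend, preserving the given order
        importlib.invalidate_caches()
        return importlib.import_module("utils").MARKER
    finally:
        # Restore the interpreter state so the demo has no side effects.
        sys.path[:] = saved_path
        sys.modules.pop("utils", None)
        if saved_module is not None:
            sys.modules["utils"] = saved_module


with tempfile.TemporaryDirectory() as tmp:
    local_dir = Path(tmp) / "tests_pytorch"
    third_party_dir = Path(tmp) / "base_dsl"
    local_dir.mkdir()
    (local_dir / "utils.py").write_text("MARKER = 'local'\n")
    (third_party_dir / "utils").mkdir(parents=True)
    (third_party_dir / "utils" / "__init__.py").write_text("MARKER = 'third_party'\n")

    # Third-party directory first: the utils/ package shadows the local module.
    shadowed = resolve_utils_marker([str(third_party_dir), str(local_dir)])
    # Local directory first: the local utils.py is resolved.
    fixed = resolve_utils_marker([str(local_dir), str(third_party_dir)])

print(shadowed, fixed)
```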

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Skip tests that OOM in deterministic+backward+hopper+large_configs, as it's a known cuDNN issue

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cp det and nondet tests run in parallel whenever possible

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] L3: gate CP tests per-arch to avoid CI timeout

PR 2596 added deterministic CP runs to the L3 FA-versions matrix, multiplying
CP wall time across every FA version and causing CI timeouts (pipeline
50243000). Run CP tests once per arch instead, picking the FA version each
arch's CP code path actually supports:

  - sm90 (H100):  FA3 3.0.0b1   - context_parallel.py is FA3-only on Hopper
                                  (use_flash_attn_3 threaded throughout, FA4
                                  not wired in; pad_between_seqs gated on
                                  use_flash_attn_3 at lines 1038, 1366)
  - sm>90 (B200): latest FA4    - FA3 is not built/installed for sm>90

Non-CP test_attention.py still runs for every FA version in the array.

Also drop FA 2.7.3 from the sm90 list (no longer maintained as a target)
and bump the FA4 pin from 4.0.0b8 to 4.0.0b11. b8 has an SM90 backward
kernel bug fixed by upstream PR #2513 in b11
(get_smem_store_C() got multiple values for argument 'transpose').
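
The per-arch choice can be summarized as a lookup. The real logic lives in the L3 shell script; the function below is a hypothetical Python rendering. It assumes that "latest FA4" means the 4.0.0b11 pin mentioned above, and it returns `None` for architectures below sm90 because the commit message does not say what happens there.

```python
from typing import Optional


def cp_fa_version_for_arch(sm_arch: int) -> Optional[str]:
    """Hypothetical helper: pick the single FA version used for CP tests."""
    if sm_arch == 90:
        # Hopper: the CP code path is described as FA3-only.
        return "3.0.0b1"
    if sm_arch > 90:
        # Assumption: "latest FA4" corresponds to the bumped pin.
        return "4.0.0b11"
    # Not covered by the commit message; treated here as "no CP run".
    return None


for arch in (90, 100):
    print(arch, cp_fa_version_for_arch(arch))
```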

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [QA] L3: skip pre-installed FA3 build, per-FA junit XMLs

Three follow-ups on top of 13ba004 (L3 per-arch CP gating):

1. Skip the inline FA3 source build when flash_attn_interface is already
   importable. This makes the script a no-op on FA3 install when the base
   image has FA3 baked in (companion to TE !573 on te_ci, which auto-sets
   INSTALL_FA3=${RUN_L3_TESTS} so FA3 is preinstalled for L3 pipelines).
   Saves ~20 min of L3 H100 wall time once both land. Falls back to the
   existing inline build when FA3 is not pre-installed.

2. Suffix junit XMLs with the FA version (pytest_test_attention_fa2_8_3.xml
   etc.) so per-iteration results are preserved instead of overwritten.
   Pipeline 50348672 had no per-FA timing visibility because pytest.xml
   was clobbered by each loop iteration.

3. Include FA version in test_fail messages so CI dashboards show which
   FA iteration caused a failure (was "test_attention.py", now
   "test_attention.py (FA 2.8.3)").

Also fold the CP_FA_VERSION assignment into the same if-block as
FA_versions (was a separate if-block immediately after) since the two
are arch-keyed in lockstep.
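
Items 2 and 3 amount to two small string conventions. A sketch in Python, with hypothetical helper names (the script itself is shell); the expected strings are the examples quoted above.

```python
def junit_xml_name(test_stem: str, fa_version: str) -> str:
    """Build a per-FA-version junit file name, e.g. for FA 2.8.3."""
    return f"pytest_{test_stem}_fa{fa_version.replace('.', '_')}.xml"


def failure_label(test_file: str, fa_version: str) -> str:
    """Build the failure message label that names the FA iteration."""
    return f"{test_file} (FA {fa_version})"


print(junit_xml_name("test_attention", "2.8.3"))
print(failure_label("test_attention.py", "2.8.3"))
```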

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* B200 shouldn't run FA3 even if present

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* L3: drop stale RUN_L3_TESTS=1 note; use flash_attn_3 for FA3 check

Address two pending review comments:
1. The "auto-set when RUN_L3_TESTS=1" annotation on the base-image FA3
   preinstall is no longer accurate; drop it so readers don't grep for a
   coupling that doesn't exist.
2. `flash_attn_interface` reads like a generic FA API even though the
   top-level shim is only created by the FA3 install. Switching to
   `import flash_attn_3` makes the FA3-specific intent unambiguous and
   matches the FA3 package layout produced by the source build.
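
The importability check in item 2 can be expressed without actually importing the package. This is a sketch of the idea, not the script's code: `module_available` is a hypothetical helper, and using the standard-library `importlib.util.find_spec` is one possible way to do the detection.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or partially initialised modules.
        return False


# Fall back to the inline source build only when FA3 is not already present.
need_fa3_build = not module_available("flash_attn_3")
print("build FA3 from source:", need_fa3_build)
```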

Local validation on H100 (sm90) with FA3 active, TE worktree resolving
to the editable install (verified via three-layer import check from
/tmp): test_attention_with_cp.py parallel det+nondet — 45 passed / 0
failed nondet (3:52), 33 passed / 0 failed det (2:55). 33 pad-True
nondet passes + 21 pad-True det passes confirm the FA3+THD+CP path is
exercised; 5 det OOM cases skip cleanly via the existing inline guard.

Same test scope is exercised by L1_pytorch_distributed_unittest
(parallel det+nondet) and the FA3 iteration of L3_pytorch_FA_versions_test;
the changes here are L3-only documentation/detection tweaks and do not
alter the Python test code, but the L1+L3 CP execution was re-run on
the cleaned PR head end-to-end as proof.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Address review nits: bHSS-gated OOM skip; drop Dockerfile.base specifics

1. Det FusedAttention backward THD/sm90 OOM skip: gate on the actual
   memory pressure (b*H*S*S) instead of num_heads >= 20. The cuDNN
   workspace is proportional to bHSS, so a future config with H >= 20
   but small b or S would be needlessly skipped under the old guard,
   while a config with H < 20 but large b*S that hit the same OOM
   wouldn't be caught. Threshold 1e9 empirically matches the existing
   5-case skip set on the test_essential fused subset (cp_2_0, cp_2_2,
   cp_3_1, cp_4_2, cp_4_3 — bHSS in 1.07B–4.29B) and lets cp_1_0/
   cp_2_1/cp_2_4/cp_3_2/cp_3_4 (bHSS ~0.40B) keep running.

2. L3 FA3 install comment: drop the "Dockerfile.base INSTALL_FA3=1"
   reference. The detection check is the contract; mentioning a
   specific image variable couples this script to an out-of-tree
   provisioning detail that may evolve independently.
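
The difference between the two guards in item 1 can be sketched as follows. The helper names and the example shapes are illustrative assumptions, not the actual test configs; only the 1e9 threshold and the old `num_heads >= 20` rule come from the text above, and the strict `>` comparison is a guess.

```python
BHSS_THRESHOLD = 1_000_000_000  # threshold quoted in this commit


def bhss(batch: int, num_heads: int, seqlen_q: int, seqlen_kv: int) -> int:
    """Memory-pressure proxy: b * H * S_q * S_kv."""
    return batch * num_heads * seqlen_q * seqlen_kv


def skip_new_gate(batch: int, num_heads: int, seqlen_q: int, seqlen_kv: int) -> bool:
    """New guard: skip when the bHSS product exceeds the threshold."""
    return bhss(batch, num_heads, seqlen_q, seqlen_kv) > BHSS_THRESHOLD


def skip_old_gate(num_heads: int) -> bool:
    """Old guard: skip purely on head count."""
    return num_heads >= 20


# Illustrative shape with many heads but small batch and sequence length:
# the old guard skips it, the new one lets it run.
print(skip_old_gate(24), skip_new_gate(1, 24, 2048, 2048))
# Illustrative shape with few heads but large b * S * S:
# the old guard misses it, the new one skips it.
print(skip_old_gate(16), skip_new_gate(4, 16, 4096, 4096))
```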

Local validation on H100 (sm90) with FA3 active and TE worktree
resolving to editable (verified via /tmp-cwd three-layer import check
after reinstall — the /usr/local TE shadow had reappeared between
sessions): test_attention_with_cp.py parallel det+nondet — 45 passed /
0 failed nondet (4:09), 33 passed / 0 failed det (3:14). 33 pad-True
nondet passes + 21 pad-True det passes; 5 det OOM cases skip via the
new bHSS gate — same cases as the old num_heads-only gate.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Name the OOM-skip threshold and explain the 128*bHSS workspace observation

Address review nits on the deterministic THD-backward OOM guard:
1. Replace the magic number 1_000_000_000 with the named constant
   SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30, so the value is searchable
   and labeled.
2. Replace the prefatory comment with a short note tying the number to
   cuDNN's actual workspace request (~128 * bHSS bytes, measured on
   cuDNN 9.21.0 sm90 — see local sweep). At bHSS = 1<<30 the request is
   128 GiB, which doesn't fit on H100's 80 GB.
3. Flag the b>=3 caveat for future readers: cuDNN rounds the batch up
   internally so workspace grows super-linearly past b=2 (b=4 asks for
   4x the b=2 workspace, not 2x). The current fused-essential matrix is
   all b=2, so the threshold stays correct for what the test exercises;
   the note is there so the next person doesn't have to rediscover it.
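
The arithmetic behind item 2 can be checked directly. In this sketch only `SM90_DET_FUSED_THD_BWD_MAX_BHSS` is a name from the commit; the other names are hypothetical, the 128 bytes per bHSS unit is the observed figure quoted above, and H100 capacity is taken as 80 * 10**9 bytes. The b>=3 super-linear behavior is deliberately not modeled.

```python
SM90_DET_FUSED_THD_BWD_MAX_BHSS = 1 << 30
OBSERVED_BYTES_PER_BHSS = 128  # empirical figure from the commit message
H100_MEMORY_BYTES = 80 * 10**9  # assumption: "80 GB" read as decimal GB


def estimated_workspace_bytes(bhss: int) -> int:
    """Estimate the workspace request from the observed ~128 * bHSS rule."""
    return OBSERVED_BYTES_PER_BHSS * bhss


at_threshold = estimated_workspace_bytes(SM90_DET_FUSED_THD_BWD_MAX_BHSS)
print(at_threshold // 2**30, "GiB requested at the threshold")
print("fits on H100:", at_threshold <= H100_MEMORY_BYTES)
```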

Skip set is unchanged — cp_2_0, cp_2_1, cp_3_1, cp_4_2, cp_4_3.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* Reword OOM-skip comment as observations, not cuDNN-internal claims

We measured the workspace request from outside cuDNN, so the comment should
say "observed" rather than asserting what cuDNN does. Reframes the ~128 *
bHSS bytes formula and the super-linear b>=3 behavior as empirical
observations from our sweep.

No code change.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@pull pull Bot locked and limited conversation to collaborators May 23, 2026
@pull pull Bot added the ⤵️ pull label May 23, 2026
@pull pull Bot merged commit 80ea313 into phu0ngng:main May 23, 2026
@pull pull Bot had a problem deploying to github-pages May 23, 2026 04:33 Failure