
[BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation#716

Draft
TimYangst wants to merge 4 commits into main from tingyang/op/gpu_default

Conversation


@TimYangst TimYangst commented May 1, 2026

What does this PR do?

Flip OpsImplementationConfig per-op defaults from "eager" to GPU-optimal
(Liger / Triton / fused_triton), and tighten the validator + per-model
dispatch so misconfigurations raise loudly instead of silently downgrading.

Default changes (GPU):

| Field | Old | New |
| --- | --- | --- |
| attn_implementation | flash_attention_2 | unchanged |
| moe_implementation | eager | fused_triton |
| cross_entropy_loss_implementation | chunk_loss | liger_kernel |
| rms_norm_implementation | eager | liger_kernel |
| swiglu_mlp_implementation | eager | liger_kernel |
| rotary_pos_emb_implementation | eager | liger_kernel |
| load_balancing_loss_implementation | eager | triton |

NPU users must now pin every per-op field explicitly — to an
NPU-supported value (npu / chunk_loss / fused_npu / triton) or to
eager when the op has no NPU backend. The validator raises with a
per-op allow-list message at config-parse time.
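As a rough illustration, the config-parse-time check can be sketched as below. The table contents mirror the PR's allow-list, but the structure (`_NPU_ALLOWED` as a dict of sets) and the function name `validate_npu_config` are assumptions, not the actual veomni implementation:

```python
# Per-field allow-list of non-eager values that work on NPU; "eager" is
# implicitly allowed for every field. Illustrative sketch only.
_NPU_ALLOWED = {
    "moe_implementation": {"fused_npu"},
    "cross_entropy_loss_implementation": {"chunk_loss"},
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
    "swiglu_mlp_implementation": set(),                # no NPU backend
    "load_balancing_loss_implementation": {"triton"},  # via triton-ascend
}


def validate_npu_config(config):
    """Raise at config-parse time when a per-op value has no NPU backend."""
    for field, value in config.items():
        allowed = {"eager"} | _NPU_ALLOWED.get(field, set())
        if value not in allowed:
            raise ValueError(
                f"{field}={value!r} is not supported on Ascend NPU. "
                f"Set to one of {sorted(allowed)}."
            )
```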

Builds on #678 (registration-based kernel framework) and #708
(chunk_loss reshape fix).

API and Usage Example

NPU: pin every per-op field. Minimal NPU-friendly config:

model:
  ops_implementation:
    attn_implementation: flash_attention_2
    moe_implementation: fused_npu
    cross_entropy_loss_implementation: chunk_loss
    rms_norm_implementation: npu
    rotary_pos_emb_implementation: npu
    swiglu_mlp_implementation: eager           # no NPU backend
    load_balancing_loss_implementation: eager  # triton-ascend not exposed as `triton`

GPU: existing YAMLs that don't override the per-op fields now pick up
the GPU-optimal kernels automatically. The gpu extra ships
liger-kernel + triton, so no setup change is needed.

Models with structurally-incompatible kernels (Wan rope_apply,
Qwen2-VL multimodal RoPE NPU) opt out via extra_backends[name] = None.
The validator raises "explicitly disabled for {model}" and points the
user at the allowed values. Wan's three YAMLs ship pinned to eager.

Design & Code Changes

Defaults & validation: veomni/arguments/arguments_types.py,
veomni/ops/kernels/{rms_norm,rotary,swiglu,load_balancing_loss}/__init__.py

  • Flip per-op dataclass defaults; sync each OpSpec.default to match.
  • Rewrite _validate_implementations:
    1. Check NPU compatibility against an explicit _NPU_ALLOWED allow-list.
    2. Gate load_balancing_loss=triton on is_package_available("triton").
    3. Hard-assert at validation time that every registered op
      (list_ops()) has an entry in _NPU_ALLOWED — catches future op
      additions that miss the map.
  • Per-package availability for liger_kernel / torch_npu / triton is
    checked at the resolution sites (_check_requires in
    apply_per_model_patches / apply_global_ops), not duplicated in the
    validator.

Trade-off — hardcoded allow-list: _NPU_ALLOWED is hand-curated rather
than derived from BackendSpec.requires. The name "triton" is reused
with different hardware semantics (load-balancing loss runs on NPU via
triton-ascend, but DeepSeek-V3's batch-invariant RMSNorm /
deterministic RoPE Triton kernels are mainline-CUDA-only), so a
requires-based inference would conflate them. Third-party NPU backends
still extend cleanly via per-model extra_backends in device_patch.py;
the allow-list only gates the hardware-class names exposed on the public
dataclass.

Strict per-model dispatch: veomni/ops/config/registry.py

  • apply_per_model_patches raises on every missing backend except
    eager. extra_backends[name] = None is repurposed as an
    explicit-raise opt-out for cases where the global registry default
    would silently bind a wrong-signature kernel; the available-list error
    excludes the disabled entry.
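A minimal sketch of the resolve-or-raise semantics described above. The registry dicts, string stand-ins for kernel functions, and `resolve_backend` are illustrative assumptions, not veomni's real API:

```python
# Global registry: op name -> backend name -> implementation.
GLOBAL_REGISTRY = {
    "rotary_pos_emb": {
        "liger_kernel": "<liger rope fn>",
        "npu": "<npu rope fn>",
        "triton": "<triton rope fn>",
    },
}


def resolve_backend(model_name, op, value, extra_backends):
    """Return a backend impl or raise; only 'eager' resolves to None silently."""
    if value == "eager":
        return None  # eager keeps the unpatched HF module
    overrides = extra_backends.get(op, {})
    merged = {**GLOBAL_REGISTRY.get(op, {}), **overrides}
    backend = merged.get(value)
    if backend is not None:
        return backend
    # Available list excludes the disabled (None) entry.
    available = sorted(["eager"] + [k for k, v in merged.items() if v is not None])
    if value in merged:  # extra_backends[op][value] = None: explicit opt-out
        raise ValueError(
            f"{op}={value!r} is explicitly disabled for {model_name}. "
            f"Available: {available}"
        )
    raise ValueError(
        f"{op}={value!r} is not a registered backend. Available: {available}"
    )
```

With Wan-shaped `extra_backends = {"rotary_pos_emb": {"liger_kernel": None, "triton": "<wan triton rope>"}}`, the default `liger_kernel` raises "explicitly disabled" with `available=['eager', 'npu', 'triton']`, while `triton` resolves to Wan's own kernel through the registry path.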

Safe-fallback path: veomni/models/auto.py

  • Replace the silent apply_ops_config(OpsImplementationConfig())
    fallback (when callers omit ops_implementation) with
    OpsImplementationConfig.all_eager() — every per-op field pinned to
    "eager". attn_implementation keeps the dataclass default
    (flash_attention_2) when flash-attn is importable (the common case
    on production GPU hosts), falling back to "eager" when it isn't, so
    CPU / minimal-deps environments don't crash on import-time FA loading.
    Standalone scripts (inference, weight materialization, dummy-forward
    tests) keep working on every accelerator.
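A hedged sketch of the `all_eager()` fallback: the field set mirrors the PR's defaults table, but the classmethod body and the `find_spec("flash_attn")` probe are assumptions about how the importability check might look:

```python
import importlib.util
from dataclasses import dataclass, fields


@dataclass
class OpsImplementationConfig:
    attn_implementation: str = "flash_attention_2"
    moe_implementation: str = "fused_triton"
    cross_entropy_loss_implementation: str = "liger_kernel"
    rms_norm_implementation: str = "liger_kernel"
    swiglu_mlp_implementation: str = "liger_kernel"
    rotary_pos_emb_implementation: str = "liger_kernel"
    load_balancing_loss_implementation: str = "triton"

    @classmethod
    def all_eager(cls, **overrides):
        """Hardware-agnostic config: every per-op field pinned to 'eager'."""
        kwargs = {f.name: "eager" for f in fields(cls)
                  if f.name != "attn_implementation"}
        # Keep flash_attention_2 only when flash-attn can actually import.
        if importlib.util.find_spec("flash_attn") is None:
            kwargs["attn_implementation"] = "eager"
        kwargs.update(overrides)
        return cls(**kwargs)
```

`**overrides` lets a caller flip an individual field (e.g. `all_eager(moe_implementation="fused_npu")`) without enumerating the rest.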

Per-model device patches

  • qwen3_vl/device_patch.py — drop historical liger_kernel: None
    opt-outs (verified standard-shape RMSNorm + RoPE; Liger drops in
    cleanly).
  • wan/device_patch.py — keep rotary_pos_emb: liger_kernel: None as
    the explicit-raise marker; move Wan's Triton RoPE from the
    _custom_wan callback into
    extra_backends["rotary_pos_emb"]["triton"]
    so it survives the new
    strict-raise gate (a callback can't run after the gate rejects unknown
    backends).
  • qwen2_vl/device_patch.py — keep rotary_pos_emb: npu: None as the
    explicit-raise marker for the multimodal RoPE NPU mismatch.
  • configs/dit/wan*.yaml (3 files) — pin
    rotary_pos_emb_implementation: eager since Wan's rope_apply can't
    host the new liger_kernel default.

Test infrastructure: tests/tools/training_utils.py and downstream

  • New resolve_ops_overrides(model_name) helper emits hardware-aware
    --model.ops_implementation.X=Y flags; npu_skip_marker(model_name)
    skip factory for models with no NPU+eager path.
    _NPU_PER_MODEL_OVERRIDES covers DeepSeek-V3 (no NPU RMSNorm/RoPE)
    and Qwen2-VL / Qwen2.5-VL / Qwen2.5-Omni (multimodal RoPE).
  • model_name threaded through build_torchrun_cmd /
    run_training_config and into tests/e2e/utils.py,
    tests/distributed/test_fsdp_equivalence.py, tests/checkpoints/utils.py,
    tests/data/{test_datasets,test_multisource_dataset,test_dynamic_batching_dataset}.py.
  • tests/models/test_padded_packed_loss.py, tests/ops/test_moe_hw_gate.py
    — pin all-eager kwargs so tests don't couple to the new GPU defaults.
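The override emitter can be sketched as follows. The helper name matches the PR, but the tables and the explicit `on_npu` parameter (standing in for a runtime hardware probe) are assumptions:

```python
# Base NPU overrides per op; "eager" where no NPU backend exists.
_NPU_BASE = {
    "moe_implementation": "fused_npu",
    "cross_entropy_loss_implementation": "chunk_loss",
    "rms_norm_implementation": "npu",
    "rotary_pos_emb_implementation": "npu",
    "swiglu_mlp_implementation": "eager",            # no NPU backend
    "load_balancing_loss_implementation": "eager",
}
# Models whose NPU kernels are structurally incompatible.
_NPU_PER_MODEL_OVERRIDES = {
    "deepseek_v3": {"rms_norm_implementation": "eager",
                    "rotary_pos_emb_implementation": "eager"},
    "qwen2_vl": {"rotary_pos_emb_implementation": "eager"},  # multimodal RoPE
}


def resolve_ops_overrides(model_name, on_npu):
    """Emit --model.ops_implementation.X=Y flags for a torchrun command."""
    if not on_npu:
        return []  # GPU dataclass defaults already match what CI needs
    ops = {**_NPU_BASE, **_NPU_PER_MODEL_OVERRIDES.get(model_name, {})}
    return [f"--model.ops_implementation.{k}={v}" for k, v in sorted(ops.items())]
```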

Docs: docs/usage/arguments.md, docs/design/kernel_selection.md

  • Refresh defaults, NPU validation contract, lifecycle "Ownership"
    paragraph (now describes the all_eager() fallback), per-section YAML
    examples; replace VEOMNI_USE_LIGER_KERNEL references in the
    transformers-v5 comparison table with the corresponding
    OpsImplementationConfig field name.

Test

  • make quality clean.
  • pytest tests/ops/test_moe_hw_gate.py — 15/15 pass.
  • GPU host: constructed OpsImplementationConfig() with new defaults —
    validates clean; logs show liger_kernel / fused_triton / triton
    bound for every op.
  • Mocked is_torch_npu_available()=True: every GPU-only choice
    (Liger rms_norm/rotary/swiglu, Liger CE, Triton RMSNorm via
    DeepSeek-V3, fused_triton MoE) raises with the listed alternatives.
    All-NPU config (fused_npu + chunk_loss + npu + eager +
    triton) passes.
  • Mocked GPU runtime: fused_npu / npu rms_norm raise with
    platform-mismatch errors.
  • Per-model strict-raise: simulated apply_per_model_patches for
    Wan-shape extra_backends confirms rotary_pos_emb=liger_kernel
    (default) raises with "explicitly disabled for Wan" +
    available=['eager','npu','triton']; explicit rotary_pos_emb=eager
    succeeds.
  • Verified Qwen3-VL Qwen3VLTextRMSNorm + apply_rotary_pos_emb are
    HF-standard on transformers v4.57.3 and v5 — the liger_kernel lift
    is safe.
  • Pre-existing tests/ops/test_seqcls_loss.py failure (Triton driver
    missing on this CPU dev box) reproduces on main — unrelated.

Pending CI:

  • gpu_e2e_test.yml, gpu_unit_tests.yml — should pass with the gpu
    extra (ships liger-kernel + flash-attn).
  • npu_e2e_test.yml, npu_unit_tests.yml: resolve_ops_overrides
    threaded through every torchrun in tests/{e2e,distributed,checkpoints,data}
    emits NPU-supported overrides per model.

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • Added tests to CI workflow (or explained why not feasible) —
    resolve_ops_overrides covers every model NPU CI exercises;
    test_moe_hw_gate.py pins all-eager. No new standalone test file: the
    change is config-default + validator behavior, exercised end-to-end by
    every tests/e2e/test_e2e_parallel.py parametrization.

[BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation

Flip OpsImplementationConfig per-op defaults from "eager" to GPU-optimal
backends (liger_kernel for cross_entropy / rms_norm / swiglu_mlp /
rotary_pos_emb; fused_triton for moe; triton for load_balancing_loss).
Attention default unchanged.

The validator now raises explicitly on hardware mismatches:
- On NPU, GPU-only backends (liger_kernel, fused_triton, fused_quack,
  triton-for-rms_norm/rotary) raise at OpsImplementationConfig.__post_init__
  with a model-agnostic "set to one of [npu / chunk_loss / fused_npu /
  eager]" message. Allow-list lives in _NPU_COMPATIBLE_BACKENDS_PER_OP and
  is consistency-checked against the registry at validation time.
- On GPU, fused_npu / npu raise with "requires Ascend NPU" messages.
- load_balancing_loss=triton without the triton package raises with a
  clear "install triton or set to eager" message.

Per-model dispatch (apply_per_model_patches) is also strict: the only
no-backend outcome that does not raise is "eager". extra_backends[name]=None
is repurposed as an *explicit-raise opt-out* — used by Wan (rope_apply has
non-standard signature) and Qwen2-VL (multimodal RoPE) to prevent the
global registry default from silently binding a wrong-signature kernel.
The error message tells the user exactly what to pin in YAML.

Lifted Qwen3-VL's historical liger_kernel opt-outs (verified
Qwen3VLTextRMSNorm + apply_rotary_pos_emb are HF-standard on transformers
v4.57 and v5; Liger drops in cleanly).

build_foundation_model's silent-fallback path (when callers omit
ops_implementation) now installs an all-eager safe config via
_build_safe_fallback_ops_config so standalone scripts (inference, weight
materialization, dummy-forward tests) keep working everywhere without
requiring liger / triton.

CI test infra: tests/tools/training_utils.py gains resolve_ops_overrides()
and npu_skip_marker() that emit hardware-aware --model.ops_implementation.X=Y
flags per-model (NPU-supported backend or eager fallback for ops without
NPU kernel; per-model overrides for DeepSeek-V3 RMSNorm/RoPE and the
Qwen-VL family multimodal RoPE). model_name threaded through
build_torchrun_cmd / run_training_config / prepare_exec_cmd.

Wan YAMLs (wan_sft, wan_lora, wan2.1_I2V_1.3B_lora) pin
rotary_pos_emb_implementation: eager explicitly since Wan's rope_apply
cannot host the new liger_kernel default.

Breaking changes:
- Existing YAMLs that relied on per-op defaults being "eager" must either
  install liger-kernel (gpu extra ships it) or set the field to "eager"
  explicitly. NPU users must explicitly choose npu / chunk_loss / eager
  for every per-op field.
- VEOMNI_USE_LIGER_KERNEL env var was already removed in #678; the
  comparison-table references in docs/design/kernel_selection.md are
  updated to point at the OpsImplementationConfig fields.

Docs: docs/usage/arguments.md and docs/design/kernel_selection.md
updated with the new defaults, NPU validation contract, and
silent-fallback path replacement.
@github-actions github-actions Bot added the ascend everything about Ascend support label May 1, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the default operator implementations to be GPU-optimal (switching from 'eager' to 'liger_kernel' or 'triton') and introduces a strict validation layer to ensure selected backends are compatible with the host hardware, particularly for Ascend NPU. It also implements a conservative all-eager fallback for standalone scripts to maintain portability and updates the test suite with hardware-aware override logic. Feedback was provided regarding the hardcoded NPU allow-list, which currently restricts third-party backend registration without framework modifications, contradicting stated design goals.

Comment thread veomni/arguments/arguments_types.py Outdated
@TimYangst TimYangst changed the title [BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU va… [BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation May 1, 2026
TimYangst added 2 commits May 1, 2026 20:08
Cut tests/tools/training_utils.py from 138 lines back down to ~60 in the
ops-overrides region:

- Drop _GPU_OPS_DEFAULTS and the GPU branch in resolve_ops_overrides.
  The new dataclass defaults (flash_attention_2 + fused_triton + Liger /
  Triton per-op) already match what we were emitting, so resolve_ops_overrides
  on GPU is now a no-op returning [].
- Drop _NPU_SKIP_MODELS + npu_skip_marker. No model uses the skip path
  today; reintroduce when an actually-incompatible model lands.
- Compress the comment blocks. The remaining text is just enough for a
  reader to understand the per-model NPU eager pinning without the
  paragraph-long context.

No behavior change on either accelerator. NPU CI still gets the full
hardware-aware override set per model (verified with a mocked NPU runtime).

[ops, model] refactor: simplify OpsImplementationConfig validator

The validator was carrying ~150 lines of duplicated work — package
availability, per-op registry walks, three special-case sub-blocks for
CE / MoE / lb_loss / liger / npu / triton. Most of that already runs at
kernel-resolution time (apply_per_model_patches, install_loss_mapping,
apply_veomni_fused_moe_patch, KERNEL_REGISTRY.resolve), so the validator
fired alongside the resolution-site error rather than instead of it.

Cut to a single hardware-mismatch table:

    _NPU_ALLOWED:  per-field allow-list of non-eager values that work on NPU
    _NPU_REQUIRED: per-field set of values that require NPU and fail on GPU

Validator becomes a ~25-line loop. Eager is implicit (always allowed).
Anything not in the allow-list raises with the field name and the
allowed alternatives. The one pragmatic exception is the triton package
check for load_balancing_loss=triton — kept because the alternative
(letting ImportError surface from apply_global_ops) is noisy and
unhelpful.
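The two-table loop might look roughly like this; table contents and the `on_npu` flag are illustrative assumptions:

```python
# Non-eager values that work on NPU, per field.
_NPU_ALLOWED = {
    "moe_implementation": {"fused_npu"},
    "cross_entropy_loss_implementation": {"chunk_loss"},
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
    "load_balancing_loss_implementation": {"triton"},
}
# Values that require NPU and must fail on GPU.
_NPU_REQUIRED = {
    "moe_implementation": {"fused_npu"},
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
}


def validate(config, on_npu):
    for field, value in config.items():
        if value == "eager":
            continue  # eager is implicitly always allowed
        if on_npu and value not in _NPU_ALLOWED.get(field, set()):
            allowed = sorted({"eager"} | _NPU_ALLOWED.get(field, set()))
            raise ValueError(
                f"{field}={value!r} is not supported on Ascend NPU. "
                f"Set to one of {allowed}."
            )
        if not on_npu and value in _NPU_REQUIRED.get(field, set()):
            raise ValueError(f"{field}={value!r} requires Ascend NPU.")
```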

Dropped:
- Per-op walk over _OPS_REGISTRY for liger / torch_npu package checks.
  These already raise with clear messages from _check_requires inside
  apply_per_model_patches and apply_global_ops; no need for the validator
  to duplicate them.
- Hard registry-consistency assertion. A future op registered without
  updating the allow-list will surface at kernel-bind time with the
  KERNEL_REGISTRY's "Available: [...]" error — acceptable for the
  hypothetical case.
- Verbose docstrings and per-field help texts. Each field now describes
  its values in 2-3 lines instead of 8-12.

Net: 309 lines → 73 lines for the OpsImplementationConfig section.
Behavior unchanged on every scenario tested (default GPU, fused_npu on
GPU, CE=npu on GPU, default on NPU, liger rms_norm on NPU, triton
rms_norm on NPU, valid all-NPU config — all match prior error
classifications).

[ops] refactor: extract apply_per_model_patches helpers

The apply_per_model_patches body had grown a 50-line "no-backend" branch
with two overlapping comment blocks explaining the eager / explicit-opt-out
/ unknown-value cases. The control flow was hard to read because the
backend-resolution, error-classification, and patching logic were all
inlined together.

Extract three focused helpers:

- _resolve_backend(op, value, op_overrides)   -> BackendSpec | None
- _raise_no_backend(model_name, op, value, op_overrides)
- _patch_target(hf_module, target_attr, backend)

The main loop now reads top-to-bottom: get op → resolve backend → handle
no-backend (eager/raise) → check requires → patch → log. The error
classification (explicit-disabled vs unknown-name) lives in
_raise_no_backend with a single ternary, so the rationale doesn't need
a 20-line comment to follow.

Compress the extra_backends docstring from 13 lines to 5; the example
models stay (Wan rope_apply, Qwen2-VL multimodal RoPE) so future readers
know which historical case the opt-out mechanism was designed for.

Behavior unchanged on all three branches:
- explicit opt-out (Wan rotary=liger_kernel)  → "explicitly disabled" raise
- unknown value (rotary=banana)               → "not a registered backend" raise
- eager                                        → continue silently

[ops, ci] fix: NPU test_models_patch failure + OpsImplementationConfig.all_eager helper

Reported NPU CI failure:

  tests/models/test_models_patch.py::test_models_patch_fwd_bwd[llama3.1]
  ValueError: rms_norm_implementation='liger_kernel' is not supported on
  Ascend NPU. Set to one of ['eager', 'npu']; ...

Root cause: the test constructs ``ModelArguments(config_path=...)`` directly
without passing ``ops_implementation``. The default ``OpsImplementationConfig()``
now carries the GPU-optimal liger_kernel / fused_triton / triton defaults,
so the validator fires at NPU config-parse time before the test's
mode-specific ``apply_ops_config`` (driven by ``set_environ_param``) ever
runs. The default ops_implementation on this ModelArguments is never
consumed at training time.

Fix:
- Add ``OpsImplementationConfig.all_eager(**overrides)`` classmethod —
  one-liner for tests / standalone scripts that need a hardware-agnostic
  config (every per-op = "eager"). ``**overrides`` lets a caller flip
  individual fields without enumerating the rest.
- Pin ``test_models_patch.py:389`` to ``OpsImplementationConfig.all_eager()``.
- Refactor the existing eager pins in ``test_padded_packed_loss.py`` and
  ``test_moe_hw_gate.py`` to use the classmethod (less inline noise).
- Replace the private ``_build_safe_fallback_ops_config()`` in
  ``veomni/models/auto.py`` with the public classmethod — the silent
  fallback path in ``build_foundation_model`` is the same scenario the
  helper was designed for.

Audited the rest of the NPU CI matrix in ``.github/workflows/{npu_e2e_test,
npu_unit_tests}.yml``: every other test that touches ops either goes
through ``build_torchrun_cmd`` + ``resolve_ops_overrides`` (e2e),
``build_foundation_model`` without an explicit ops config (now safe via
``all_eager`` fallback), or already uses ``apply_ops_config`` with a
mode-specific config. No further changes needed.

Ting: fix ci

Ting: fix npu ci
@TimYangst TimYangst force-pushed the tingyang/op/gpu_default branch from 8fa131b to 0b6c901 on May 1, 2026 20:10
@TimYangst

@gemini-code-assist please help review the pr


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request transitions the default kernel implementations to GPU-optimal backends like Liger and Triton, while providing an "all_eager" configuration for hardware-agnostic fallbacks. It also refactors NPU compatibility validation and updates model-specific patches for non-standard architectures. Feedback indicates that the "all_eager" method should explicitly include "attn_implementation" to avoid Flash Attention dependencies in safe environments, and that a missing validation check for NPU compatibility tables should be implemented as originally intended.

Comment thread veomni/arguments/arguments_types.py
Comment thread veomni/arguments/arguments_types.py
…NPU allow-list coverage

Three review-feedback fixes for #716:

- Codex P2: Wan rotary_pos_emb=triton was rejected by the new strict-raise
  gate in apply_per_model_patches before _custom_wan could run. Move Wan's
  Triton RoPE into extra_backends["rotary_pos_emb"]["triton"] so it
  resolves through the registry path. _check_requires gains a "triton"
  branch with an actionable "install or set to eager" message.

- Gemini #2: OpsImplementationConfig.all_eager() left attn_implementation
  at the dataclass default (flash_attention_2), which crashes on hosts
  without flash-attn. Add a runtime check — flash-attn importable: keep
  flash_attention_2 (the common GPU case); not importable: fall back to
  eager.

- Gemini #3: PR description claimed a hard-assert that every registered
  op has an entry in _NPU_ALLOWED, but the code didn't have it.
  _validate_implementations now asserts {op.config_field for op in
  list_ops()} ⊆ _NPU_ALLOWED.keys() so future op additions can't
  silently bypass NPU validation.
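The coverage assertion can be sketched as below; the `Op` record and `assert_npu_coverage` name are illustrative stand-ins for the registry's `list_ops()` entries:

```python
from collections import namedtuple

# Each registered op exposes the config field it is keyed by.
Op = namedtuple("Op", ["config_field"])

_NPU_ALLOWED = {
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
}


def assert_npu_coverage(registered_ops):
    """Every registered op must have an entry in the NPU allow-list table."""
    missing = {op.config_field for op in registered_ops} - _NPU_ALLOWED.keys()
    assert not missing, f"ops missing from _NPU_ALLOWED: {sorted(missing)}"
```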
