
[BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation#716

Draft
TimYangst wants to merge 4 commits into main from tingyang/op/gpu_default

Conversation


@TimYangst TimYangst commented May 1, 2026

What does this PR do?

Flip OpsImplementationConfig per-op defaults from "eager" to GPU-optimal
(Liger / Triton / fused_triton), and tighten the validator + per-model
dispatch so misconfigurations raise loudly instead of silently downgrading.

Default changes (GPU):

| Field | Old | New |
| --- | --- | --- |
| attn_implementation | flash_attention_2 | unchanged |
| moe_implementation | eager | fused_triton |
| cross_entropy_loss_implementation | chunk_loss | liger_kernel |
| rms_norm_implementation | eager | liger_kernel |
| swiglu_mlp_implementation | eager | liger_kernel |
| rotary_pos_emb_implementation | eager | liger_kernel |
| load_balancing_loss_implementation | eager | triton |

NPU users must now pin every per-op field explicitly — to an
NPU-supported value (npu / chunk_loss / fused_npu / triton) or to
eager when the op has no NPU backend. The validator raises with a
per-op allow-list message at config-parse time.
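As a rough illustration, the config-parse-time check can be sketched as below. The table contents mirror the PR's allow-list, but the structure (`_NPU_ALLOWED` as a dict of sets) and the function name `validate_npu_config` are assumptions, not the actual veomni implementation:

```python
# Per-field allow-list of non-eager values that work on NPU; "eager" is
# implicitly allowed for every field. Illustrative sketch only.
_NPU_ALLOWED = {
    "moe_implementation": {"fused_npu"},
    "cross_entropy_loss_implementation": {"chunk_loss"},
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
    "swiglu_mlp_implementation": set(),                # no NPU backend
    "load_balancing_loss_implementation": {"triton"},  # via triton-ascend
}


def validate_npu_config(config):
    """Raise at config-parse time when a per-op value has no NPU backend."""
    for field, value in config.items():
        allowed = {"eager"} | _NPU_ALLOWED.get(field, set())
        if value not in allowed:
            raise ValueError(
                f"{field}={value!r} is not supported on Ascend NPU. "
                f"Set to one of {sorted(allowed)}."
            )
```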

Builds on #678 (registration-based kernel framework) and #708
(chunk_loss reshape fix).

API and Usage Example

NPU: pin every per-op field. Minimal NPU-friendly config:

model:
  ops_implementation:
    attn_implementation: flash_attention_2
    moe_implementation: fused_npu
    cross_entropy_loss_implementation: chunk_loss
    rms_norm_implementation: npu
    rotary_pos_emb_implementation: npu
    swiglu_mlp_implementation: eager           # no NPU backend
    load_balancing_loss_implementation: eager  # triton-ascend not exposed as `triton`

GPU: existing YAMLs that don't override the per-op fields now pick up
the GPU-optimal kernels automatically. The gpu extra ships
liger-kernel + triton, so no setup change is needed.

Models with structurally-incompatible kernels (Wan rope_apply,
Qwen2-VL multimodal RoPE NPU) opt out via extra_backends[name] = None.
The validator raises "explicitly disabled for {model}" and points the
user at the allowed values. Wan's three YAMLs ship pinned to eager.

Design & Code Changes

Defaults & validation: veomni/arguments/arguments_types.py,
veomni/ops/kernels/{rms_norm,rotary,swiglu,load_balancing_loss}/__init__.py

  • Flip per-op dataclass defaults; sync each OpSpec.default to match.
  • Rewrite _validate_implementations:
    1. Check NPU compatibility against an explicit _NPU_ALLOWED allow-list.
    2. Gate load_balancing_loss=triton on is_package_available("triton").
    3. Hard-assert at validation time that every registered op
      (list_ops()) has an entry in _NPU_ALLOWED — catches future op
      additions that miss the map.
  • Per-package availability for liger_kernel / torch_npu / triton is
    checked at the resolution sites (_check_requires in
    apply_per_model_patches / apply_global_ops), not duplicated in the
    validator.

Trade-off — hardcoded allow-list: _NPU_ALLOWED is hand-curated rather
than derived from BackendSpec.requires. The name "triton" is reused
with different hardware semantics (load-balancing loss runs on NPU via
triton-ascend, but DeepSeek-V3's batch-invariant RMSNorm /
deterministic RoPE Triton kernels are mainline-CUDA-only), so a
requires-based inference would conflate them. Third-party NPU backends
still extend cleanly via per-model extra_backends in device_patch.py;
the allow-list only gates the hardware-class names exposed on the public
dataclass.

Strict per-model dispatch: veomni/ops/config/registry.py

  • apply_per_model_patches raises on every missing backend except
    eager. extra_backends[name] = None is repurposed as an
    explicit-raise opt-out for cases where the global registry default
    would silently bind a wrong-signature kernel; the available-list error
    excludes the disabled entry.
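A minimal sketch of the resolve-or-raise semantics described above. The registry dicts, string stand-ins for kernel functions, and `resolve_backend` are illustrative assumptions, not veomni's real API:

```python
# Global registry: op name -> backend name -> implementation.
GLOBAL_REGISTRY = {
    "rotary_pos_emb": {
        "liger_kernel": "<liger rope fn>",
        "npu": "<npu rope fn>",
        "triton": "<triton rope fn>",
    },
}


def resolve_backend(model_name, op, value, extra_backends):
    """Return a backend impl or raise; only 'eager' resolves to None silently."""
    if value == "eager":
        return None  # eager keeps the unpatched HF module
    overrides = extra_backends.get(op, {})
    merged = {**GLOBAL_REGISTRY.get(op, {}), **overrides}
    backend = merged.get(value)
    if backend is not None:
        return backend
    # Available list excludes the disabled (None) entry.
    available = sorted(["eager"] + [k for k, v in merged.items() if v is not None])
    if value in merged:  # extra_backends[op][value] = None: explicit opt-out
        raise ValueError(
            f"{op}={value!r} is explicitly disabled for {model_name}. "
            f"Available: {available}"
        )
    raise ValueError(
        f"{op}={value!r} is not a registered backend. Available: {available}"
    )
```

With Wan-shaped `extra_backends = {"rotary_pos_emb": {"liger_kernel": None, "triton": "<wan triton rope>"}}`, the default `liger_kernel` raises "explicitly disabled" with `available=['eager', 'npu', 'triton']`, while `triton` resolves to Wan's own kernel through the registry path.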

Safe-fallback path: veomni/models/auto.py

  • Replace the silent apply_ops_config(OpsImplementationConfig())
    fallback (when callers omit ops_implementation) with
    OpsImplementationConfig.all_eager() — every per-op field pinned to
    "eager". attn_implementation keeps the dataclass default
    (flash_attention_2) when flash-attn is importable (the common case
    on production GPU hosts), falling back to "eager" when it isn't, so
    CPU / minimal-deps environments don't crash on import-time FA loading.
    Standalone scripts (inference, weight materialization, dummy-forward
    tests) keep working on every accelerator.
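A hedged sketch of the `all_eager()` fallback: the field set mirrors the PR's defaults table, but the classmethod body and the `find_spec("flash_attn")` probe are assumptions about how the importability check might look:

```python
import importlib.util
from dataclasses import dataclass, fields


@dataclass
class OpsImplementationConfig:
    attn_implementation: str = "flash_attention_2"
    moe_implementation: str = "fused_triton"
    cross_entropy_loss_implementation: str = "liger_kernel"
    rms_norm_implementation: str = "liger_kernel"
    swiglu_mlp_implementation: str = "liger_kernel"
    rotary_pos_emb_implementation: str = "liger_kernel"
    load_balancing_loss_implementation: str = "triton"

    @classmethod
    def all_eager(cls, **overrides):
        """Hardware-agnostic config: every per-op field pinned to 'eager'."""
        kwargs = {f.name: "eager" for f in fields(cls)
                  if f.name != "attn_implementation"}
        # Keep flash_attention_2 only when flash-attn can actually import.
        if importlib.util.find_spec("flash_attn") is None:
            kwargs["attn_implementation"] = "eager"
        kwargs.update(overrides)
        return cls(**kwargs)
```

`**overrides` lets a caller flip an individual field (e.g. `all_eager(moe_implementation="fused_npu")`) without enumerating the rest.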

Per-model device patches

  • qwen3_vl/device_patch.py — drop historical liger_kernel: None
    opt-outs (verified standard-shape RMSNorm + RoPE; Liger drops in
    cleanly).
  • wan/device_patch.py — keep rotary_pos_emb: liger_kernel: None as
    the explicit-raise marker; move Wan's Triton RoPE from the
    _custom_wan callback into
    extra_backends["rotary_pos_emb"]["triton"]
    so it survives the new
    strict-raise gate (a callback can't run after the gate rejects unknown
    backends).
  • qwen2_vl/device_patch.py — keep rotary_pos_emb: npu: None as the
    explicit-raise marker for the multimodal RoPE NPU mismatch.
  • configs/dit/wan*.yaml (3 files) — pin
    rotary_pos_emb_implementation: eager since Wan's rope_apply can't
    host the new liger_kernel default.

Test infrastructure: tests/tools/training_utils.py and downstream

  • New resolve_ops_overrides(model_name) helper emits hardware-aware
    --model.ops_implementation.X=Y flags; npu_skip_marker(model_name)
    skip factory for models with no NPU+eager path.
    _NPU_PER_MODEL_OVERRIDES covers DeepSeek-V3 (no NPU RMSNorm/RoPE)
    and Qwen2-VL / Qwen2.5-VL / Qwen2.5-Omni (multimodal RoPE).
  • model_name threaded through build_torchrun_cmd /
    run_training_config and into tests/e2e/utils.py,
    tests/distributed/test_fsdp_equivalence.py, tests/checkpoints/utils.py,
    tests/data/{test_datasets,test_multisource_dataset,test_dynamic_batching_dataset}.py.
  • tests/models/test_padded_packed_loss.py, tests/ops/test_moe_hw_gate.py
    — pin all-eager kwargs so tests don't couple to the new GPU defaults.
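The override emitter can be sketched as follows. The helper name matches the PR, but the tables and the explicit `on_npu` parameter (standing in for a runtime hardware probe) are assumptions:

```python
# Base NPU overrides per op; "eager" where no NPU backend exists.
_NPU_BASE = {
    "moe_implementation": "fused_npu",
    "cross_entropy_loss_implementation": "chunk_loss",
    "rms_norm_implementation": "npu",
    "rotary_pos_emb_implementation": "npu",
    "swiglu_mlp_implementation": "eager",            # no NPU backend
    "load_balancing_loss_implementation": "eager",
}
# Models whose NPU kernels are structurally incompatible.
_NPU_PER_MODEL_OVERRIDES = {
    "deepseek_v3": {"rms_norm_implementation": "eager",
                    "rotary_pos_emb_implementation": "eager"},
    "qwen2_vl": {"rotary_pos_emb_implementation": "eager"},  # multimodal RoPE
}


def resolve_ops_overrides(model_name, on_npu):
    """Emit --model.ops_implementation.X=Y flags for a torchrun command."""
    if not on_npu:
        return []  # GPU dataclass defaults already match what CI needs
    ops = {**_NPU_BASE, **_NPU_PER_MODEL_OVERRIDES.get(model_name, {})}
    return [f"--model.ops_implementation.{k}={v}" for k, v in sorted(ops.items())]
```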

Docs: docs/usage/arguments.md, docs/design/kernel_selection.md

  • Refresh defaults, NPU validation contract, lifecycle "Ownership"
    paragraph (now describes the all_eager() fallback), per-section YAML
    examples; replace VEOMNI_USE_LIGER_KERNEL references in the
    transformers-v5 comparison table with the corresponding
    OpsImplementationConfig field name.

Test

  • make quality clean.
  • pytest tests/ops/test_moe_hw_gate.py — 15/15 pass.
  • GPU host: constructed OpsImplementationConfig() with new defaults —
    validates clean; logs show liger_kernel / fused_triton / triton
    bound for every op.
  • Mocked is_torch_npu_available()=True: every GPU-only choice
    (Liger rms_norm/rotary/swiglu, Liger CE, Triton RMSNorm via
    DeepSeek-V3, fused_triton MoE) raises with the listed alternatives.
    All-NPU config (fused_npu + chunk_loss + npu + eager +
    triton) passes.
  • Mocked GPU runtime: fused_npu / npu rms_norm raise with
    platform-mismatch errors.
  • Per-model strict-raise: simulated apply_per_model_patches for
    Wan-shape extra_backends confirms rotary_pos_emb=liger_kernel
    (default) raises with "explicitly disabled for Wan" +
    available=['eager','npu','triton']; explicit rotary_pos_emb=eager
    succeeds.
  • Verified Qwen3-VL Qwen3VLTextRMSNorm + apply_rotary_pos_emb are
    HF-standard on transformers v4.57.3 and v5 — the liger_kernel lift
    is safe.
  • Pre-existing tests/ops/test_seqcls_loss.py failure (Triton driver
    missing on this CPU dev box) reproduces on main — unrelated.

Pending CI:

  • gpu_e2e_test.yml, gpu_unit_tests.yml — should pass with the gpu
    extra (ships liger-kernel + flash-attn).
  • npu_e2e_test.yml, npu_unit_tests.yml: resolve_ops_overrides
    threaded through every torchrun in tests/{e2e,distributed,checkpoints,data}
    emits NPU-supported overrides per model.

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • Added tests to CI workflow (or explained why not feasible) —
    resolve_ops_overrides covers every model NPU CI exercises;
    test_moe_hw_gate.py pins all-eager. No new standalone test file: the
    change is config-default + validator behavior, exercised end-to-end by
    every tests/e2e/test_e2e_parallel.py parametrization.

[BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation

Flip OpsImplementationConfig per-op defaults from "eager" to GPU-optimal
backends (liger_kernel for cross_entropy / rms_norm / swiglu_mlp /
rotary_pos_emb; fused_triton for moe; triton for load_balancing_loss).
Attention default unchanged.

The validator now raises explicitly on hardware mismatches:
- On NPU, GPU-only backends (liger_kernel, fused_triton, fused_quack,
  triton-for-rms_norm/rotary) raise at OpsImplementationConfig.__post_init__
  with a model-agnostic "set to one of [npu / chunk_loss / fused_npu /
  eager]" message. Allow-list lives in _NPU_COMPATIBLE_BACKENDS_PER_OP and
  is consistency-checked against the registry at validation time.
- On GPU, fused_npu / npu raise with "requires Ascend NPU" messages.
- load_balancing_loss=triton without the triton package raises with a
  clear "install triton or set to eager" message.

Per-model dispatch (apply_per_model_patches) is also strict: the only
no-backend outcome that does not raise is "eager". extra_backends[name]=None
is repurposed as an *explicit-raise opt-out* — used by Wan (rope_apply has
non-standard signature) and Qwen2-VL (multimodal RoPE) to prevent the
global registry default from silently binding a wrong-signature kernel.
The error message tells the user exactly what to pin in YAML.

Lifted Qwen3-VL's historical liger_kernel opt-outs (verified
Qwen3VLTextRMSNorm + apply_rotary_pos_emb are HF-standard on transformers
v4.57 and v5; Liger drops in cleanly).

build_foundation_model's silent-fallback path (when callers omit
ops_implementation) now installs an all-eager safe config via
_build_safe_fallback_ops_config so standalone scripts (inference, weight
materialization, dummy-forward tests) keep working everywhere without
requiring liger / triton.

CI test infra: tests/tools/training_utils.py gains resolve_ops_overrides()
and npu_skip_marker() that emit hardware-aware --model.ops_implementation.X=Y
flags per-model (NPU-supported backend or eager fallback for ops without
NPU kernel; per-model overrides for DeepSeek-V3 RMSNorm/RoPE and the
Qwen-VL family multimodal RoPE). model_name threaded through
build_torchrun_cmd / run_training_config / prepare_exec_cmd.

Wan YAMLs (wan_sft, wan_lora, wan2.1_I2V_1.3B_lora) pin
rotary_pos_emb_implementation: eager explicitly since Wan's rope_apply
cannot host the new liger_kernel default.

Breaking changes:
- Existing YAMLs that relied on per-op defaults being "eager" must either
  install liger-kernel (gpu extra ships it) or set the field to "eager"
  explicitly. NPU users must explicitly choose npu / chunk_loss / eager
  for every per-op field.
- VEOMNI_USE_LIGER_KERNEL env var was already removed in #678; the
  comparison-table references in docs/design/kernel_selection.md are
  updated to point at the OpsImplementationConfig fields.

Docs: docs/usage/arguments.md and docs/design/kernel_selection.md
updated with the new defaults, NPU validation contract, and
silent-fallback path replacement.
@github-actions github-actions Bot added the ascend everything about Ascend support label May 1, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the default operator implementations to be GPU-optimal (switching from 'eager' to 'liger_kernel' or 'triton') and introduces a strict validation layer to ensure selected backends are compatible with the host hardware, particularly for Ascend NPU. It also implements a conservative all-eager fallback for standalone scripts to maintain portability and updates the test suite with hardware-aware override logic. Feedback was provided regarding the hardcoded NPU allow-list, which currently restricts third-party backend registration without framework modifications, contradicting stated design goals.

Comment thread veomni/arguments/arguments_types.py Outdated
@TimYangst TimYangst changed the title [BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU va… [BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation May 1, 2026
TimYangst added 2 commits May 1, 2026 20:08
Cut tests/tools/training_utils.py from 138 lines back down to ~60 in the
ops-overrides region:

- Drop _GPU_OPS_DEFAULTS and the GPU branch in resolve_ops_overrides.
  The new dataclass defaults (flash_attention_2 + fused_triton + Liger /
  Triton per-op) already match what we were emitting, so resolve_ops_overrides
  on GPU is now a no-op returning [].
- Drop _NPU_SKIP_MODELS + npu_skip_marker. No model uses the skip path
  today; reintroduce when an actually-incompatible model lands.
- Compress the comment blocks. The remaining text is just enough for a
  reader to understand the per-model NPU eager pinning without the
  paragraph-long context.

No behavior change on either accelerator. NPU CI still gets the full
hardware-aware override set per model (verified with a mocked NPU runtime).

[ops, model] refactor: simplify OpsImplementationConfig validator

The validator was carrying ~150 lines of duplicated work — package
availability, per-op registry walks, three special-case sub-blocks for
CE / MoE / lb_loss / liger / npu / triton. Most of that already runs at
kernel-resolution time (apply_per_model_patches, install_loss_mapping,
apply_veomni_fused_moe_patch, KERNEL_REGISTRY.resolve), so the validator
fired alongside the resolution-site error rather than instead of it.

Cut to a single hardware-mismatch table:

    _NPU_ALLOWED:  per-field allow-list of non-eager values that work on NPU
    _NPU_REQUIRED: per-field set of values that require NPU and fail on GPU

Validator becomes a ~25-line loop. Eager is implicit (always allowed).
Anything not in the allow-list raises with the field name and the
allowed alternatives. The one pragmatic exception is the triton package
check for load_balancing_loss=triton — kept because the alternative
(letting ImportError surface from apply_global_ops) is noisy and
unhelpful.
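The two-table loop might look roughly like this; table contents and the `on_npu` flag are illustrative assumptions:

```python
# Non-eager values that work on NPU, per field.
_NPU_ALLOWED = {
    "moe_implementation": {"fused_npu"},
    "cross_entropy_loss_implementation": {"chunk_loss"},
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
    "load_balancing_loss_implementation": {"triton"},
}
# Values that require NPU and must fail on GPU.
_NPU_REQUIRED = {
    "moe_implementation": {"fused_npu"},
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
}


def validate(config, on_npu):
    for field, value in config.items():
        if value == "eager":
            continue  # eager is implicitly always allowed
        if on_npu and value not in _NPU_ALLOWED.get(field, set()):
            allowed = sorted({"eager"} | _NPU_ALLOWED.get(field, set()))
            raise ValueError(
                f"{field}={value!r} is not supported on Ascend NPU. "
                f"Set to one of {allowed}."
            )
        if not on_npu and value in _NPU_REQUIRED.get(field, set()):
            raise ValueError(f"{field}={value!r} requires Ascend NPU.")
```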

Dropped:
- Per-op walk over _OPS_REGISTRY for liger / torch_npu package checks.
  These already raise with clear messages from _check_requires inside
  apply_per_model_patches and apply_global_ops; no need for the validator
  to duplicate them.
- Hard registry-consistency assertion. A future op registered without
  updating the allow-list will surface at kernel-bind time with the
  KERNEL_REGISTRY's "Available: [...]" error — acceptable for the
  hypothetical case.
- Verbose docstrings and per-field help texts. Each field now describes
  its values in 2-3 lines instead of 8-12.

Net: 309 lines → 73 lines for the OpsImplementationConfig section.
Behavior unchanged on every scenario tested (default GPU, fused_npu on
GPU, CE=npu on GPU, default on NPU, liger rms_norm on NPU, triton
rms_norm on NPU, valid all-NPU config — all match prior error
classifications).

[ops] refactor: extract apply_per_model_patches helpers

The apply_per_model_patches body had grown a 50-line "no-backend" branch
with two overlapping comment blocks explaining the eager / explicit-opt-out
/ unknown-value cases. The control flow was hard to read because the
backend-resolution, error-classification, and patching logic were all
inlined together.

Extract three focused helpers:

- _resolve_backend(op, value, op_overrides)   -> BackendSpec | None
- _raise_no_backend(model_name, op, value, op_overrides)
- _patch_target(hf_module, target_attr, backend)

The main loop now reads top-to-bottom: get op → resolve backend → handle
no-backend (eager/raise) → check requires → patch → log. The error
classification (explicit-disabled vs unknown-name) lives in
_raise_no_backend with a single ternary, so the rationale doesn't need
a 20-line comment to follow.

Compress the extra_backends docstring from 13 lines to 5; the example
models stay (Wan rope_apply, Qwen2-VL multimodal RoPE) so future readers
know which historical case the opt-out mechanism was designed for.

Behavior unchanged on all three branches:
- explicit opt-out (Wan rotary=liger_kernel)  → "explicitly disabled" raise
- unknown value (rotary=banana)               → "not a registered backend" raise
- eager                                        → continue silently

[ops, ci] fix: NPU test_models_patch failure + OpsImplementationConfig.all_eager helper

Reported NPU CI failure:

  tests/models/test_models_patch.py::test_models_patch_fwd_bwd[llama3.1]
  ValueError: rms_norm_implementation='liger_kernel' is not supported on
  Ascend NPU. Set to one of ['eager', 'npu']; ...

Root cause: the test constructs ``ModelArguments(config_path=...)`` directly
without passing ``ops_implementation``. The default ``OpsImplementationConfig()``
now carries the GPU-optimal liger_kernel / fused_triton / triton defaults,
so the validator fires at NPU config-parse time before the test's
mode-specific ``apply_ops_config`` (driven by ``set_environ_param``) ever
runs. The default ops_implementation on this ModelArguments is never
consumed at training time.

Fix:
- Add ``OpsImplementationConfig.all_eager(**overrides)`` classmethod —
  one-liner for tests / standalone scripts that need a hardware-agnostic
  config (every per-op = "eager"). ``**overrides`` lets a caller flip
  individual fields without enumerating the rest.
- Pin ``test_models_patch.py:389`` to ``OpsImplementationConfig.all_eager()``.
- Refactor the existing eager pins in ``test_padded_packed_loss.py`` and
  ``test_moe_hw_gate.py`` to use the classmethod (less inline noise).
- Replace the private ``_build_safe_fallback_ops_config()`` in
  ``veomni/models/auto.py`` with the public classmethod — the silent
  fallback path in ``build_foundation_model`` is the same scenario the
  helper was designed for.

Audited the rest of the NPU CI matrix in ``.github/workflows/{npu_e2e_test,
npu_unit_tests}.yml``: every other test that touches ops either goes
through ``build_torchrun_cmd`` + ``resolve_ops_overrides`` (e2e),
``build_foundation_model`` without an explicit ops config (now safe via
``all_eager`` fallback), or already uses ``apply_ops_config`` with a
mode-specific config. No further changes needed.

Ting: fix ci

Ting: fix npu ci
@TimYangst TimYangst force-pushed the tingyang/op/gpu_default branch from 8fa131b to 0b6c901 on May 1, 2026 20:10
@TimYangst

@gemini-code-assist please help review the pr


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request transitions the default kernel implementations to GPU-optimal backends like Liger and Triton, while providing an "all_eager" configuration for hardware-agnostic fallbacks. It also refactors NPU compatibility validation and updates model-specific patches for non-standard architectures. Feedback indicates that the "all_eager" method should explicitly include "attn_implementation" to avoid Flash Attention dependencies in safe environments, and that a missing validation check for NPU compatibility tables should be implemented as originally intended.

Comment thread veomni/arguments/arguments_types.py
Comment thread veomni/arguments/arguments_types.py
…NPU allow-list coverage

Three review-feedback fixes for #716:

- Codex P2: Wan rotary_pos_emb=triton was rejected by the new strict-raise
  gate in apply_per_model_patches before _custom_wan could run. Move Wan's
  Triton RoPE into extra_backends["rotary_pos_emb"]["triton"] so it
  resolves through the registry path. _check_requires gains a "triton"
  branch with an actionable "install or set to eager" message.

- Gemini #2: OpsImplementationConfig.all_eager() left attn_implementation
  at the dataclass default (flash_attention_2), which crashes on hosts
  without flash-attn. Add a runtime check — flash-attn importable: keep
  flash_attention_2 (the common GPU case); not importable: fall back to
  eager.

- Gemini #3: PR description claimed a hard-assert that every registered
  op has an entry in _NPU_ALLOWED, but the code didn't have it.
  _validate_implementations now asserts {op.config_field for op in
  list_ops()} ⊆ _NPU_ALLOWED.keys() so future op additions can't
  silently bypass NPU validation.
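The coverage assertion can be sketched as below; the `Op` record and `assert_npu_coverage` name are illustrative stand-ins for the registry's `list_ops()` entries:

```python
from collections import namedtuple

# Each registered op exposes the config field it is keyed by.
Op = namedtuple("Op", ["config_field"])

_NPU_ALLOWED = {
    "rms_norm_implementation": {"npu"},
    "rotary_pos_emb_implementation": {"npu"},
}


def assert_npu_coverage(registered_ops):
    """Every registered op must have an entry in the NPU allow-list table."""
    missing = {op.config_field for op in registered_ops} - _NPU_ALLOWED.keys()
    assert not missing, f"ops missing from _NPU_ALLOWED: {sorted(missing)}"
```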
