[BREAKING][ops, model] feat: GPU-optimal ops defaults + strict NPU validation (#716)
…lidation

Flip OpsImplementationConfig per-op defaults from "eager" to GPU-optimal backends (liger_kernel for cross_entropy / rms_norm / swiglu_mlp / rotary_pos_emb; fused_triton for moe; triton for load_balancing_loss). The attention default is unchanged.

The validator now raises explicitly on hardware mismatches:
- On NPU, GPU-only backends (liger_kernel, fused_triton, fused_quack, and triton for rms_norm / rotary) raise at OpsImplementationConfig.__post_init__ with a model-agnostic "set to one of [npu / chunk_loss / fused_npu / eager]" message. The allow-list lives in _NPU_COMPATIBLE_BACKENDS_PER_OP and is consistency-checked against the registry at validation time.
- On GPU, fused_npu / npu raise with "requires Ascend NPU" messages.
- load_balancing_loss=triton without the triton package raises with a clear "install triton or set to eager" message.

Per-model dispatch (apply_per_model_patches) is also strict: the only no-backend outcome that does not raise is "eager". extra_backends[name]=None is repurposed as an *explicit-raise opt-out* — used by Wan (rope_apply has a non-standard signature) and Qwen2-VL (multimodal RoPE) to prevent the global registry default from silently binding a wrong-signature kernel. The error message tells the user exactly what to pin in YAML.

Lifted Qwen3-VL's historical liger_kernel opt-outs (verified that Qwen3VLTextRMSNorm and apply_rotary_pos_emb are HF-standard on transformers v4.57 and v5; Liger drops in cleanly).

build_foundation_model's silent-fallback path (when callers omit ops_implementation) now installs an all-eager safe config via _build_safe_fallback_ops_config, so standalone scripts (inference, weight materialization, dummy-forward tests) keep working everywhere without requiring liger / triton.
CI test infra: tests/tools/training_utils.py gains resolve_ops_overrides() and npu_skip_marker(), which emit hardware-aware --model.ops_implementation.X=Y flags per model (an NPU-supported backend, or an eager fallback for ops without an NPU kernel; per-model overrides for DeepSeek-V3 RMSNorm/RoPE and the Qwen-VL family's multimodal RoPE). model_name is threaded through build_torchrun_cmd / run_training_config / prepare_exec_cmd. Wan YAMLs (wan_sft, wan_lora, wan2.1_I2V_1.3B_lora) pin rotary_pos_emb_implementation: eager explicitly, since Wan's rope_apply cannot host the new liger_kernel default.

Breaking changes:
- Existing YAMLs that relied on per-op defaults being "eager" must either install liger-kernel (the gpu extra ships it) or set the field to "eager" explicitly. NPU users must explicitly choose npu / chunk_loss / eager for every per-op field.
- The VEOMNI_USE_LIGER_KERNEL env var was already removed in #678; the comparison-table references in docs/design/kernel_selection.md are updated to point at the OpsImplementationConfig fields.

Docs: docs/usage/arguments.md and docs/design/kernel_selection.md updated with the new defaults, the NPU validation contract, and the silent-fallback replacement.
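For YAMLs affected by the breaking default change, the fix is an explicit per-op pin. A hedged sketch of what such a pin might look like, modeled on the Wan YAMLs described above (the surrounding key structure is an assumption; only the rotary_pos_emb_implementation: eager pin is taken from this PR):

```yaml
# Illustrative config fragment, not a verbatim file from the repo.
model:
  ops_implementation:
    # Wan's rope_apply has a non-standard signature, so the new
    # liger_kernel default cannot bind here; pin eager explicitly.
    rotary_pos_emb_implementation: eager
```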
Code Review
This pull request updates the default operator implementations to be GPU-optimal (switching from 'eager' to 'liger_kernel' or 'triton') and introduces a strict validation layer to ensure selected backends are compatible with the host hardware, particularly for Ascend NPU. It also implements a conservative all-eager fallback for standalone scripts to maintain portability and updates the test suite with hardware-aware override logic. Feedback was provided regarding the hardcoded NPU allow-list, which currently restricts third-party backend registration without framework modifications, contradicting stated design goals.
Cut tests/tools/training_utils.py from 138 lines back down to ~60 in the
ops-overrides region:
- Drop _GPU_OPS_DEFAULTS and the GPU branch in resolve_ops_overrides.
The new dataclass defaults (flash_attention_2 + fused_triton + Liger /
Triton per-op) already match what we were emitting, so resolve_ops_overrides
on GPU is now a no-op returning [].
- Drop _NPU_SKIP_MODELS + npu_skip_marker. No model uses the skip path
today; reintroduce when an actually-incompatible model lands.
- Compress the comment blocks. The remaining text is just enough for a
reader to understand the per-model NPU eager pinning without the
paragraph-long context.
No behavior change on either accelerator. NPU CI still gets the full
hardware-aware override set per model (verified with a mocked NPU runtime).
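The slimmed-down helper described above can be sketched as follows. This is a hypothetical reconstruction, not the repo's code: the table contents, the on_npu parameter, and the per-model override dict are assumptions; only the flag format and the "GPU returns []" behavior come from the commit message.

```python
# Hypothetical sketch of resolve_ops_overrides after the refactor: on GPU the
# dataclass defaults already match what CI was emitting, so nothing is needed;
# on NPU every op is pinned to an NPU-supported backend or eager.
_NPU_OPS_OVERRIDES = {
    "cross_entropy": "npu",
    "rms_norm": "npu",
    "moe": "fused_npu",
    "rotary_pos_emb": "eager",  # illustrative: op without an NPU kernel
}

# Per-model pins, e.g. the Qwen-VL family's multimodal RoPE stays eager.
_PER_MODEL_NPU_OVERRIDES = {
    "qwen2_vl": {"rotary_pos_emb": "eager"},
}

def resolve_ops_overrides(model_name: str, on_npu: bool) -> list[str]:
    if not on_npu:
        return []  # GPU-optimal defaults are already in the dataclass
    ops = {**_NPU_OPS_OVERRIDES, **_PER_MODEL_NPU_OVERRIDES.get(model_name, {})}
    return [
        f"--model.ops_implementation.{op}_implementation={backend}"
        for op, backend in sorted(ops.items())
    ]
```

The GPU branch collapsing to an empty list is what makes the 138-to-~60-line cut possible: the test harness no longer mirrors the defaults it trusts.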
[ops, model] refactor: simplify OpsImplementationConfig validator
The validator was carrying ~150 lines of duplicated work — package
availability checks, per-op registry walks, and three special-case
sub-blocks (CE / MoE / lb_loss) spanning the liger / npu / triton
backends. Most of that already runs at
kernel-resolution time (apply_per_model_patches, install_loss_mapping,
apply_veomni_fused_moe_patch, KERNEL_REGISTRY.resolve), so the validator
fired alongside the resolution-site error rather than instead of it.
Cut to a single hardware-mismatch table:
_NPU_ALLOWED: per-field allow-list of non-eager values that work on NPU
_NPU_REQUIRED: per-field set of values that require NPU and fail on GPU
Validator becomes a ~25-line loop. Eager is implicit (always allowed).
Anything not in the allow-list raises with the field name and the
allowed alternatives. The one pragmatic exception is the triton package
check for load_balancing_loss=triton — kept because the alternative
(letting ImportError surface from apply_global_ops) is noisy and
unhelpful.
Dropped:
- Per-op walk over _OPS_REGISTRY for liger / torch_npu package checks.
These already raise with clear messages from _check_requires inside
apply_per_model_patches and apply_global_ops; no need for the validator
to duplicate them.
- Hard registry-consistency assertion. A future op registered without
updating the allow-list will surface at kernel-bind time with the
KERNEL_REGISTRY's "Available: [...]" error — acceptable for the
hypothetical case.
- Verbose docstrings and per-field help texts. Each field now describes
its values in 2-3 lines instead of 8-12.
Net: 309 lines → 73 lines for the OpsImplementationConfig section.
Behavior unchanged on every scenario tested (default GPU, fused_npu on
GPU, CE=npu on GPU, default on NPU, liger rms_norm on NPU, triton
rms_norm on NPU, valid all-NPU config — all match prior error
classifications).
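The two-table loop described above might look roughly like this. A hypothetical sketch: the table contents are illustrative, and the standalone validate function stands in for the real dataclass validator; only the eager-is-implicit rule and the two error shapes are taken from the commit message.

```python
# _NPU_ALLOWED: per-field allow-list of non-eager values that work on NPU.
# _NPU_REQUIRED: per-field values that require NPU and must fail on GPU.
_NPU_ALLOWED = {
    "rms_norm_implementation": {"npu"},
    "cross_entropy_implementation": {"npu", "chunk_loss"},
    "moe_implementation": {"fused_npu"},
}
_NPU_REQUIRED = {
    "cross_entropy_implementation": {"npu", "chunk_loss"},
    "moe_implementation": {"fused_npu"},
}

def validate(config: dict, on_npu: bool) -> None:
    for op_field, value in config.items():
        if value == "eager":
            continue  # eager is implicit: always allowed on every accelerator
        if on_npu and value not in _NPU_ALLOWED.get(op_field, set()):
            allowed = sorted(_NPU_ALLOWED.get(op_field, set()) | {"eager"})
            raise ValueError(
                f"{op_field}='{value}' is not supported on Ascend NPU. "
                f"Set to one of {allowed}."
            )
        if not on_npu and value in _NPU_REQUIRED.get(op_field, set()):
            raise ValueError(f"{op_field}='{value}' requires Ascend NPU.")
```

Anything the tables don't know about simply passes through here and fails later at kernel-bind time, which is the trade-off the "Dropped" list below accepts.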
[ops] refactor: extract apply_per_model_patches helpers
The apply_per_model_patches body had grown a 50-line "no-backend" branch
with two overlapping comment blocks explaining the eager / explicit-opt-out
/ unknown-value cases. The control flow was hard to read because the
backend-resolution, error-classification, and patching logic were all
inlined together.
Extract three focused helpers:
- _resolve_backend(op, value, op_overrides) -> BackendSpec | None
- _raise_no_backend(model_name, op, value, op_overrides)
- _patch_target(hf_module, target_attr, backend)
The main loop now reads top-to-bottom: get op → resolve backend → handle
no-backend (eager/raise) → check requires → patch → log. The error
classification (explicit-disabled vs unknown-name) lives in
_raise_no_backend with a single ternary, so the rationale doesn't need
a 20-line comment to follow.
Compress the extra_backends docstring from 13 lines to 5; the example
models stay (Wan rope_apply, Qwen2-VL multimodal RoPE) so future readers
know which historical case the opt-out mechanism was designed for.
Behavior unchanged on all three branches:
- explicit opt-out (Wan rotary=liger_kernel) → "explicitly disabled" raise
- unknown value (rotary=banana) → "not a registered backend" raise
- eager → continue silently
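The three-branch control flow above can be sketched as follows. This is a hypothetical reconstruction under stated assumptions: BackendSpec, the registry dict, and the helper signatures are simplified stand-ins; only the helper names, the eager/opt-out/unknown classification, and the error wording come from the commit message.

```python
from typing import Optional

class BackendSpec:
    """Minimal stand-in for a resolved kernel backend."""
    def __init__(self, name: str):
        self.name = name

# Stand-in for the global kernel registry: (op, value) -> backend.
_REGISTRY = {("rotary_pos_emb", "liger_kernel"): BackendSpec("liger_kernel")}

def _is_explicitly_disabled(op: str, value: str, op_overrides: dict) -> bool:
    # extra_backends[op][value] = None is the explicit-raise opt-out.
    return op in op_overrides and op_overrides[op].get(value, "missing") is None

def _resolve_backend(op: str, value: str, op_overrides: dict) -> Optional[BackendSpec]:
    if _is_explicitly_disabled(op, value, op_overrides):
        return None
    return _REGISTRY.get((op, value))

def _raise_no_backend(model_name: str, op: str, value: str, op_overrides: dict):
    reason = ("explicitly disabled" if _is_explicitly_disabled(op, value, op_overrides)
              else "not a registered backend")
    raise ValueError(
        f"{model_name}: {op}='{value}' is {reason}; pin this field to 'eager' in YAML."
    )

def apply_per_model_patches(model_name: str, config: dict, op_overrides: dict):
    # get op -> resolve backend -> handle no-backend (eager/raise) -> patch
    for op, value in config.items():
        if value == "eager":
            continue  # the only no-backend outcome that does not raise
        backend = _resolve_backend(op, value, op_overrides)
        if backend is None:
            _raise_no_backend(model_name, op, value, op_overrides)
        # real code would run _check_requires(backend) and _patch_target(...) here
```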
[ops, ci] fix: NPU test_models_patch failure + OpsImplementationConfig.all_eager helper
Reported NPU CI failure:
tests/models/test_models_patch.py::test_models_patch_fwd_bwd[llama3.1]
ValueError: rms_norm_implementation='liger_kernel' is not supported on
Ascend NPU. Set to one of ['eager', 'npu']; ...
Root cause: the test constructs ``ModelArguments(config_path=...)`` directly
without passing ``ops_implementation``. The default ``OpsImplementationConfig()``
now carries the GPU-optimal liger_kernel / fused_triton / triton defaults,
so the validator fires at NPU config-parse time before the test's
mode-specific ``apply_ops_config`` (driven by ``set_environ_param``) ever
runs. The default ops_implementation on this ModelArguments is never
consumed at training time.
Fix:
- Add ``OpsImplementationConfig.all_eager(**overrides)`` classmethod —
one-liner for tests / standalone scripts that need a hardware-agnostic
config (every per-op = "eager"). ``**overrides`` lets a caller flip
individual fields without enumerating the rest.
- Pin ``test_models_patch.py:389`` to ``OpsImplementationConfig.all_eager()``.
- Refactor the existing eager pins in ``test_padded_packed_loss.py`` and
``test_moe_hw_gate.py`` to use the classmethod (less inline noise).
- Replace the private ``_build_safe_fallback_ops_config()`` in
``veomni/models/auto.py`` with the public classmethod — the silent
fallback path in ``build_foundation_model`` is the same scenario the
helper was designed for.
Audited the rest of the NPU CI matrix in ``.github/workflows/{npu_e2e_test,
npu_unit_tests}.yml``: every other test that touches ops either goes
through ``build_torchrun_cmd`` + ``resolve_ops_overrides`` (e2e),
``build_foundation_model`` without an explicit ops config (now safe via
``all_eager`` fallback), or already uses ``apply_ops_config`` with a
mode-specific config. No further changes needed.
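The all_eager classmethod described above is small enough to sketch in full. A hedged, trimmed-down version: the field set is illustrative (the real dataclass has one field per op), and the note in the second review round about special-casing attn_implementation is not reflected here.

```python
from dataclasses import dataclass, fields

@dataclass
class OpsImplementationConfig:
    # Illustrative subset of the real per-op fields and their new defaults.
    attn_implementation: str = "flash_attention_2"
    rms_norm_implementation: str = "liger_kernel"
    cross_entropy_implementation: str = "liger_kernel"
    moe_implementation: str = "fused_triton"

    @classmethod
    def all_eager(cls, **overrides) -> "OpsImplementationConfig":
        # Hardware-agnostic config: every per-op field becomes "eager";
        # **overrides lets a caller flip individual fields without
        # enumerating the rest.
        values = {f.name: "eager" for f in fields(cls)}
        values.update(overrides)
        return cls(**values)
```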
Ting: fix ci
Ting: fix npu ci
@gemini-code-assist please help review the pr
Code Review
This pull request transitions the default kernel implementations to GPU-optimal backends like Liger and Triton, while providing an "all_eager" configuration for hardware-agnostic fallbacks. It also refactors NPU compatibility validation and updates model-specific patches for non-standard architectures. Feedback indicates that the "all_eager" method should explicitly include "attn_implementation" to avoid Flash Attention dependencies in safe environments, and that a missing validation check for NPU compatibility tables should be implemented as originally intended.
…NPU allow-list coverage

Three review-feedback fixes for #716:
- Codex P2: Wan rotary_pos_emb=triton was rejected by the new strict-raise gate in apply_per_model_patches before _custom_wan could run. Move Wan's Triton RoPE into extra_backends["rotary_pos_emb"]["triton"] so it resolves through the registry path. _check_requires gains a "triton" branch with an actionable "install or set to eager" message.
- Gemini #2: OpsImplementationConfig.all_eager() left attn_implementation at the dataclass default (flash_attention_2), which crashes on hosts without flash-attn. Add a runtime check: if flash-attn is importable, keep flash_attention_2 (the common GPU case); otherwise fall back to eager.
- Gemini #3: the PR description claimed a hard assert that every registered op has an entry in _NPU_ALLOWED, but the code didn't have it. _validate_implementations now asserts {op.config_field for op in list_ops()} ⊆ _NPU_ALLOWED.keys(), so future op additions can't silently bypass NPU validation.
What does this PR do?
API and Usage Example
Design & Code Changes
Test
Checklist Before Submitting
resolve_ops_overrides covers every model NPU CI exercises; test_moe_hw_gate.py pins all-eager. No new standalone test file: the change is config-default + validator behavior, exercised end-to-end by every tests/e2e/test_e2e_parallel.py parametrization.