
[ops] refactor: replace kernel env vars with args #651

Merged: FoolPlayer merged 10 commits into main from refactor/ops-implementation-config on Apr 20, 2026

Conversation

FoolPlayer (Collaborator) commented Apr 14, 2026

What does this PR do?

Replace the all-or-nothing VEOMNI_USE_LIGER_KERNEL / USE_GROUP_GEMM env vars with per-op fields on OpsImplementationConfig, and reorganize veomni/ops/ around a unified kernel registry so adding a new kernel/backend is a one-file change.

This started as the env-var → args refactor and grew into a full veomni/ops/ reorganization once the patch surface was visible across every model.

Motivation

  • Env-var control was binary and global — you couldn't, e.g., turn on Liger RMSNorm but keep eager cross-entropy.
  • Each model's gpu_patch.py / npu_patch.py had duplicated `if liger_kernel: setattr(...)` blocks that were drifting out of sync.
  • New ops/backends (DeepSeek V3 deterministic RoPE, Wan Triton RMSNorm, NPU chunked loss) had no clean home.

Key changes

1. Per-op config fields (veomni/arguments/arguments_types.py)

Five new str fields on OpsImplementationConfig, all defaulting to "eager" (no implicit "auto" — users opt in):

| Field | Backends |
| --- | --- |
| cross_entropy_loss_implementation | eager, liger_kernel, npu (chunked loss) |
| rms_norm_implementation | eager, liger_kernel, npu, triton* |
| swiglu_mlp_implementation | eager, liger_kernel |
| rotary_pos_emb_implementation | eager, liger_kernel, npu, triton* |
| load_balancing_loss_implementation | eager, triton |

* triton is registered per-model via extra_backends (DeepSeek V3, Wan).

__post_init__ validates backend availability (liger_kernel / torch_npu packages) up-front instead of failing with a cryptic error at first batch.
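
A minimal sketch of the field shape and the fail-fast check (field names are from this PR; the validation body and availability checks are illustrative, not the actual implementation):

    from dataclasses import dataclass, fields
    from importlib.util import find_spec

    @dataclass
    class OpsImplementationConfig:
        cross_entropy_loss_implementation: str = "eager"
        rms_norm_implementation: str = "eager"
        swiglu_mlp_implementation: str = "eager"
        rotary_pos_emb_implementation: str = "eager"
        load_balancing_loss_implementation: str = "eager"

        def __post_init__(self):
            # Fail fast at config time instead of at the first batch.
            requested = {getattr(self, f.name) for f in fields(self)}
            if "liger_kernel" in requested and find_spec("liger_kernel") is None:
                raise ValueError("liger_kernel requested but liger-kernel is not installed")
            if "npu" in requested and find_spec("torch_npu") is None:
                raise ValueError("npu requested but torch_npu is not installed")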

2. Removed env vars

  • VEOMNI_USE_LIGER_KERNEL → split into 4 per-op fields
  • USE_GROUP_GEMM → moe_implementation
  • VEOMNI_ENABLE_CHUNK_LOSS → cross_entropy_loss_implementation="npu"
  • MODELING_BACKEND is kept (still controls the import-time attention patch).

3. New unified registry (veomni/ops/config/)

  • registry.py: OpSpec / BackendSpec / OpScope + register_op / apply_global_ops / apply_per_model_patches.
  • singleton.py: bridges the resolved config from BaseTrainer to each model's device_patch.py.

Four dispatch scopes drive every kernel binding (a simplified registry sketch follows the list):

  • import-time — attention + HF LOSS_MAPPING install, in apply_ops_patch().
  • GLOBAL — module-level function pointer (cross-entropy, load-balancing loss).
  • PER_MODEL — setattr on the HF modeling module, in each model's device_patch.py.
  • build-time — fused MoE binding, in build_foundation_model().
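
As a rough illustration of the registry pattern (the real OpSpec / BackendSpec / register_op in veomni/ops/config/registry.py may differ; the shapes here are assumed):

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Callable, Dict

    class OpScope(Enum):
        GLOBAL = "global"        # swap a module-level function pointer
        PER_MODEL = "per_model"  # setattr on the HF modeling module

    @dataclass
    class BackendSpec:
        name: str                        # e.g. "liger_kernel"
        factory: Callable[[], Callable]  # returns the kernel to bind

    @dataclass
    class OpSpec:
        name: str                        # e.g. "rms_norm"
        scope: OpScope
        backends: Dict[str, BackendSpec] = field(default_factory=dict)

    _REGISTRY: Dict[str, OpSpec] = {}

    def register_op(spec: OpSpec) -> None:
        _REGISTRY[spec.name] = spec

    def resolve(op: str, backend: str) -> Callable:
        # A typo'd backend name surfaces as a KeyError here,
        # not as a silent fallback deep in training.
        return _REGISTRY[op].backends[backend].factory()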

4. veomni/ops/ reorg (5 phases)

veomni/ops/
├── config/                 dispatch infra (no kernels)
├── kernels/                one subpackage per op
│   ├── attention/  cross_entropy/  load_balancing_loss/
│   ├── rms_norm/   rotary/         swiglu/   moe/
├── platform/npu/           HCCL pre-mul-sum patch
└── batch_invariant_ops/    deterministic-mode toggle

Old paths (flash_attn/, fused_cross_entropy/, fused_moe/, npu_patch/, dit/rope_wan/, dcp_consolidation.py, …) are gone — no shims.

5. Per-model patches unified

All 9 device_patch.py files (llama, qwen2/3, qwen3_moe, seed_oss, qwen2_vl, qwen3_vl, deepseek_v3, wan) now share one pattern:

apply_per_model_patches(
    hf_module=hf_qwen3,
    model_name="Qwen3",
    targets={
        "rms_norm": "Qwen3RMSNorm",
        "rotary_pos_emb": "apply_rotary_pos_emb",
        "swiglu_mlp": "Qwen3MLP",
    },
)

Per-model overrides (DeepSeek V3 deterministic Triton RoPE, Wan Triton RMSNorm, Qwen-VL vision RoPE) go in extra_backends / custom_patches.
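
For instance, a hypothetical Wan-style override could register the Triton RMSNorm next to the standard targets (the exact extra_backends signature is assumed for illustration, not quoted from the PR):

    apply_per_model_patches(
        hf_module=hf_wan,
        model_name="Wan",
        targets={"rms_norm": "WanRMSNorm"},
        # Assumed shape: op -> {backend name -> kernel} for model-local backends.
        extra_backends={"rms_norm": {"triton": wan_triton_rms_norm}},
    )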

6. Trainer wiring

BaseTrainer, VLMTrainer, DitTrainer call apply_ops_config(model_args.ops_implementation) before building the model. apply_ops_patch() (import-time) also installs VeOmni's LOSS_MAPPING so direct build_foundation_model() calls (unit tests, scripts) get the right loss function.
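
In sketch form, the call order is (function names from this PR; surrounding trainer code elided):

    apply_ops_config(model_args.ops_implementation)  # resolve + validate per-op backends
    apply_ops_patch()                                # import-time: attention + LOSS_MAPPING
    model = build_foundation_model(...)              # build-time: fused MoE binding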

7. Tests + CI fixes

  • Tests now build OpsImplementationConfig instead of toggling env vars.
  • NPU CI: gate liger_kernel and triton modes on package availability (NPU image ships triton-ascend, not mainline triton).
  • Updated tests/special_sanity/check_device_api_usage.py whitelist for new MoE kernel paths.
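
A sketch of the availability gating, assuming a pytest-style parametrization (test and variable names are illustrative):

    from importlib.util import find_spec

    import pytest

    LIGER = find_spec("liger_kernel") is not None

    @pytest.mark.parametrize("impl", ["eager"] + (["liger_kernel"] if LIGER else []))
    def test_cross_entropy_backends(impl):
        # The liger_kernel mode is simply absent when the package is missing,
        # so the NPU image (triton-ascend, no liger-kernel) stays green.
        cfg = OpsImplementationConfig(cross_entropy_loss_implementation=impl)
        assert cfg.cross_entropy_loss_implementation == impl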

8. Docs

  • New veomni/ops/README.md: layout, dispatch model, all ops/backends/per-model coverage, recipes for adding a new backend / new op.
  • Updated docs/design/kernel_selection.md, docs/usage/support_new_models/*.md, .agents/knowledge/{architecture,constraints}.md.

API and Usage Example

YAML (replaces all env-var setting):

model:
  ops_implementation:
    attn_implementation: flash_attention_2
    moe_implementation: fused
    cross_entropy_loss_implementation: liger_kernel
    load_balancing_loss_implementation: triton
    rms_norm_implementation: liger_kernel
    rotary_pos_emb_implementation: liger_kernel
    swiglu_mlp_implementation: eager   # mix-and-match per op

NPU users get chunked cross-entropy via cross_entropy_loss_implementation: npu.
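
Programmatically that is just (assuming the dataclass sketch above):

    cfg = OpsImplementationConfig(cross_entropy_loss_implementation="npu")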

Breaking changes

  • VEOMNI_USE_LIGER_KERNEL, USE_GROUP_GEMM, VEOMNI_ENABLE_CHUNK_LOSS no longer recognized — users must set the corresponding model.ops_implementation.* fields.
  • Default behavior changes from "use Liger if installed" to explicit eager; users who relied on the env var must opt in.

Related: #569.

Test

  • Local: full pytest tests/ops/ and tests/models/test_models_patch.py (excluding qwen3_5 due to GPU memory limits in the dev env).
  • CI: GPU + Ascend NPU jobs on the PR (gate_liger_kernel / gate_triton commits make the Ascend job green).

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • If tasks/ training scripts were moved or renamed: updated docs/ examples and verified python3 scripts/ci/check_doc_task_paths.py passes
  • Added tests to CI workflow (or explained why not feasible)

gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request transitions kernel selection from environment variables to a structured configuration-driven approach using OpsImplementationConfig. It introduces an 'auto' resolution mechanism that selects the most appropriate kernel (Liger, Triton, or Eager) based on hardware and package availability. A global singleton is implemented to allow model-specific patches to access the resolved configuration. Feedback was provided to improve the robustness of the 'auto' resolution for load-balancing loss by explicitly checking for Triton availability to avoid potential crashes in CPU-only or non-Triton environments.

Comment thread: veomni/arguments/arguments_types.py (outdated)
Comment on lines +732 to +733:

    if self.load_balancing_loss_implementation == "auto":
        self.load_balancing_loss_implementation = "eager" if npu else "triton"
gemini-code-assist (Contributor) commented, severity: high

The auto resolution for load_balancing_loss_implementation assumes that if the hardware is not an NPU, it must be a GPU where Triton is available. This will lead to a crash (import error) on CPU-only environments or systems where the triton package is not installed. It is safer to check for Triton availability explicitly, similar to how Liger is checked.

Suggested change:

    -if self.load_balancing_loss_implementation == "auto":
    -    self.load_balancing_loss_implementation = "eager" if npu else "triton"
    +if self.load_balancing_loss_implementation == "auto":
    +    from ..utils.import_utils import is_fused_moe_available
    +    self.load_balancing_loss_implementation = "triton" if (not npu and is_fused_moe_available()) else "eager"

Luosuu (Collaborator) commented Apr 14, 2026

LGTM. please fix CI then we can merge.

Review comment (Collaborator):

I think we are missing the same change for test_padded_packed_loss.py?

Comment thread: veomni/ops/__init__.py (outdated)

    )
    # NOTE: fused MoE patch is applied in build_foundation_model() based on
    # the moe_implementation parameter.
    logger.info_rank0("✅ VeOmni ops config applied.")
Review comment (Collaborator):

should we also call format_kernel_functions here?

Comment thread: docs/usage/arguments.md (outdated)
| --- | --- | --- | --- |
| attn_implementation | `Optional[Literal["eager", "sdpa", "flash_attention_2", "flash_attention_3", "flash_attention_4", "native-sparse"]]` | `"flash_attention_2"` | Attention implementation to use. |
| moe_implementation | `Optional[Literal["eager", "fused", "fused_quack"]]` | `None` | MoE implementation: `eager` (reference loop), `fused` (Triton), `fused_quack` (Quack CUTLASS, SM90+). |
| cross_entropy_loss_implementation | `Literal["auto", "eager", "liger_kernel"]` | `"auto"` | Cross-entropy loss: `liger_kernel` for fused linear CE, `eager` for PyTorch. |
Review comment (Collaborator):

nit 1: what does auto mean?

nit 2: should we make it a Literal? If it's a plain string rather than a Literal, it would be easier for people to register their own kernels, and they would still get a "not found" error when they have a kernel name typo.

nit 3: how do we deal with NPU? Ask users to add a liger_kernel_npu option? (I'd prefer this to be more explicit.) And we could make auto adapt to liger_kernel or liger_kernel_npu depending on the env?

The Ascend CI runner doesn't ship with liger-kernel, so the hard-coded
[True, False] parametrization produced a ValueError in
OpsImplementationConfig.__post_init__ for every 'veomni + use_liger=True'
case. Derive _USE_LIGER_KERNEL from is_liger_kernel_available() so those
modes are skipped when the package is missing; GPU coverage is unchanged.

Made-with: Cursor
…ility

The NPU CI image ships triton-ascend, not mainline triton, so
veomni/ops/kernels/load_balancing_loss/triton.py fails with
ModuleNotFoundError as soon as the registry resolves the triton backend.
Fall back to 'eager' when the mainline triton package is absent, mirroring
the liger-kernel gating.

Made-with: Cursor
FoolPlayer force-pushed the refactor/ops-implementation-config branch from e8966cc to 9b0989c on April 18, 2026 at 06:57
FoolPlayer added a commit that referenced this pull request Apr 18, 2026
Follow-up to #639. PR #639 fixed the NPU SIGABRT in
tests/data/test_multisource_dataset.py by forcing pin_memory=False, but
tests/data/test_datasets.py still passes the default pin_memory=True and
suffers the same flaky crash (seen on PR #651 CI run 24599839122 and on
main run 24379970711):

    terminate called without an active exception
    failed (exitcode: -6) local_rank: N -- Signal 6 (SIGABRT)

Root cause is identical to #639: the DataLoader pin_memory background
thread races with HCCL ProcessGroup teardown inside destroy_distributed,
invalidating torch_npu global state and crashing the pinning thread; the
still-joinable std::thread then triggers std::terminate() at interpreter
shutdown.

Unlike test_multisource_dataset.py (which overrides _build_dataloader),
test_datasets.py inherits the base dataloader from BaseTrainer, so the
simplest fix is to pass --data.dataloader.pin_memory=False via the
torchrun CLI in build_command. DummyDataset means pin_memory has no
performance benefit on GPU either, so this is behaviorally neutral.

Made-with: Cursor
FoolPlayer merged commit 3282afe into main on Apr 20, 2026; 26 of 28 checks passed.
FoolPlayer deleted the refactor/ops-implementation-config branch on April 20, 2026 at 04:20.