[model] feat: [transformers-v5] Introduce new registration based kernel replacement. #569
piyifan123 wants to merge 17 commits into main
Conversation
Code Review
This pull request introduces a significant and well-designed refactoring to a unified, registration-based kernel selection system, moving away from ad-hoc patching. The new KernelRegistry and OpSlot mechanism provides a much cleaner, more extensible, and safer way to manage custom kernels, with a clear migration path for existing models. The design documentation is thorough and the implementation largely follows it. However, I've identified two critical issues: one is a discrepancy between the design and implementation of the configuration for MoE experts, which impacts the user-facing API, and the other is a missing check for duplicate registrations in the kernel registry, which could lead to silent errors. Addressing these will make the new system more robust and consistent.
```python
@property
def moe_experts_implementation(self) -> str:
    """Resolve moe_implementation to a kernel registry name for the ``moe_experts`` OpSlot."""
    raw = self.moe_implementation
    if raw is None:
        return "eager"
    mapped = self._MOE_IMPL_TO_KERNEL.get(raw)
    if mapped is None:
        raise ValueError(f"Unknown moe_implementation='{raw}'. Valid: {list(self._MOE_IMPL_TO_KERNEL.keys())}")
    return mapped
```
The implementation of moe_experts_implementation as a read-only property contradicts the design goal of a unified configuration surface. The design document docs/design/unified_kernel_registry.md specifies moe_experts_implementation as a new, user-configurable field. However, with the current implementation, users cannot set this field in their configuration (e.g., YAML files) and must continue to use the old moe_implementation field with its legacy values (fused, fused_quack). This undermines the goal of having a consistent and direct way to select kernel implementations by their new names.
To align with the design, moe_experts_implementation should be a regular field. Backward compatibility for the old moe_implementation field should be handled, for instance, in the __post_init__ method by mapping the old value to the new field and issuing a deprecation warning.
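A minimal sketch of that suggestion, reusing the `_MOE_IMPL_TO_KERNEL` mapping quoted in the next snippet; the container class name and the warning wording here are assumptions, not the actual code in arguments_types.py:

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArguments:  # hypothetical stand-in for the class in arguments_types.py
    moe_implementation: Optional[str] = None   # legacy field, kept for back-compat
    moe_experts_implementation: str = "eager"  # new user-configurable field

    # Unannotated, so this stays a class attribute rather than a dataclass field.
    _MOE_IMPL_TO_KERNEL = {
        "eager": "eager",
        "fused": "triton_group_gemm",
        "fused_quack": "quack_cutlass",
    }

    def __post_init__(self):
        if self.moe_implementation is not None:
            mapped = self._MOE_IMPL_TO_KERNEL.get(self.moe_implementation)
            if mapped is None:
                raise ValueError(
                    f"Unknown moe_implementation='{self.moe_implementation}'. "
                    f"Valid: {list(self._MOE_IMPL_TO_KERNEL)}"
                )
            warnings.warn(
                "moe_implementation is deprecated; set moe_experts_implementation instead.",
                DeprecationWarning,
            )
            self.moe_experts_implementation = mapped
```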
```python
_MOE_IMPL_TO_KERNEL = {
    "eager": "eager",
    "fused": "triton_group_gemm",
    "fused_quack": "quack_cutlass",
}
```
I think just quack is OK
```
@@ -0,0 +1,223 @@
# Copyright 2025 Bytedance Ltd. and/or its affiliates
```
Can we split the default and registry file into the ops dirs? I think it's easier to manage.
| "'eager' for standard PyTorch, 'liger' for LigerKernel fused RMSNorm." | ||
| }, | ||
| ) | ||
| swiglu_mlp_implementation: str = field( |
How do we handle liger ops that are valid on GPU but not on NPU?
I think we can register them separately, as follows:
```python
KERNEL_REGISTRY.register(
    KernelSpec(
        name="liger",
        op_name="rms_norm",
        variant="qwen3_5",
        factory=_liger_rms_norm_qwen3_5_factory,
        hardware=HardwareRequirement(device_type="cuda"),
        description="LigerKernel fused RMSNorm for Qwen3.5 (1+weight, zeros init, gemma casting)",
    )
)
KERNEL_REGISTRY.register(
    KernelSpec(
        name="liger_npu",
        op_name="rms_norm",
        variant="qwen3_5",
        factory=_liger_rms_norm_qwen3_5_npu_factory,
        hardware=HardwareRequirement(device_type="npu"),
        description="LigerKernel fused RMSNorm for Qwen3.5 (1+weight, zeros init, gemma casting) for NPU",
    )
)
```
This also removes a footgun: if users select `rms_norm_impl: liger` in an NPU environment, they will get an error forcing them to select `rms_norm_impl: liger_npu` (and vice versa). WDYT?
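For illustration, a rough sketch of how such hardware gating could fail fast at bind time; the `check` method name, the probe logic, and the error wording are all assumptions (real NPU detection would likely go through something like torch_npu):

```python
from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class HardwareRequirement:
    device_type: str  # e.g. "cuda" or "npu"

    def check(self, kernel_name: str) -> None:
        # Hypothetical accelerator probe; real detection may differ.
        available = "cuda" if torch.cuda.is_available() else "cpu"
        if available != self.device_type:
            raise RuntimeError(
                f"Kernel '{kernel_name}' requires device_type='{self.device_type}', "
                f"but this environment provides '{available}'. "
                f"Select a kernel registered for your hardware instead."
            )
```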
```python
    return get_model_config(config_path, trust_remote_code=trust_remote_code, **config_kwargs)


def _apply_legacy_moe_patch(config, moe_implementation):
```
Why do we still keep the legacy patch?
```
# Conflicts:
#   veomni/ops/__init__.py
```
Resolve conflicts in qwen3_5_moe patch gen config by combining OpSlot-based dispatch (branch) with new imports, model init, and dummy vars from main (#602). Regenerated generated files from merged config.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Excuse me, I've seen this — could you help me find out what is going on?
```python
@dataclass(frozen=True)
class KernelSpec:
    """Describes a single kernel implementation."""
```
We need more explanation on each field.
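For illustration, a docstring pass over the fields visible in this PR might look like the following; the field semantics are inferred from the review thread, so treat the wording as a starting point rather than the actual documentation:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class KernelSpec:
    """Describes a single kernel implementation.

    Attributes:
        name: User-facing implementation name selected via config
            (e.g. ``rms_norm_implementation: liger``).
        op_name: The OpSlot this kernel can fill (e.g. ``rms_norm``).
        variant: Model-specific flavor of the op (e.g. ``qwen3_5``), so one
            op can carry per-architecture kernels.
        factory: Zero-argument callable returning the kernel function; heavy
            imports are deferred until the kernel is actually chosen.
        hardware: Optional hardware gate (device type) checked at bind time.
        description: Human-readable summary for error messages and docs.
    """

    name: str
    op_name: str
    variant: str
    factory: Callable[[], Callable]
    hardware: Optional["HardwareRequirement"] = None
    description: str = ""
```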
**Issues and Suggestions**

**Bug: MoE OpSlot binding path order issue**

In auto.py:192-200, `_bind_veomni_ops` is called after the model is constructed. But the legacy path `_apply_legacy_moe_patch` sets `config._moe_implementation` before model init (it patches the class-level forward). The new OpSlot path binds after init, which is correct for OpSlot's guard pattern, but:

```python
if (
    ops_implementation is not None
    and modeling_module is not None
    and _bind_veomni_ops(modeling_module, ops_implementation)
):
    ...
elif moe_implementation is not None:
    _apply_legacy_moe_patch(config, moe_implementation)
```

If `ops_implementation` is provided but the module has zero OpSlots (e.g., a model not yet migrated), `_bind_veomni_ops` returns False and the legacy path runs — good. But if a module has some OpSlots but not MoE (unlikely currently, but architecturally possible), the legacy MoE path would be skipped.

**Bug: moe_experts_implementation property masking moe_implementation**

In arguments_types.py, `moe_experts_implementation` is a `@property` that delegates to `moe_implementation`. However, `_resolve_impl_name` does `getattr(ops_config, f"{op_name}_implementation", "eager")` — for `op_name="moe_experts"` this will correctly resolve to the property. But the property returns …

**Potential issue: Module-global OpSlot shared across model instances**

As noted in the design doc's "Open Questions §4", OpSlot objects are module-level globals. Two instances of the same model class with different `ops_implementation` configs would share the same slots. The second call to `_bind_veomni_ops` would overwrite the first's bindings. This is a real risk in …

**ForCausalLMLoss changes have subtle semantics**

In fused_cross_entropy/__init__.py, the removal of `else: loss_func = _cross_entropy` on line 86 means that when hidden_states is provided but cross_entropy_fn is None, loss_func defaults to `_cross_entropy` (set at module level, possibly eager if `apply_veomni_loss_patch` was never called). Previously, …

**Ruff F821 suppression is broad**

Adding F821 (undefined name) to all `*_patch_gen_config.py` files suppresses a useful lint for real bugs. Consider whether `# noqa: F821` on specific lines in the generated code would be more targeted.

**Test coverage**

test_models_patch.py adds `_verify_opslot_state`, which validates that `has_kernel` matches `use_liger_kernel`. This is good, but it only tests the boolean state, not that the bound kernel actually produces correct outputs. Consider adding a numerical test (e.g., a small random input through RMSNorm with …)

**Nit: `_bound` field on OpSlot is redundant**

OpSlot tracks both `_kernel` and `_bound`. The `_bound` flag distinguishes "never called bind()" from "called bind('eager')" — both have `_kernel = None`. This is only used in repr. Acceptable, but slightly over-engineered for a repr; a simpler approach would be a sentinel `_UNBOUND = object()` (see the sketch after this comment).

**Summary**

This is a well-architected PR with thorough design documentation. The core OpSlot + KernelRegistry pattern is clean, extensible, and minimal in its diff from upstream HF. The main concerns are the two bugs above (the MoE binding path order and the read-only moe_experts_implementation property).
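A minimal sketch of that sentinel variant; the method and attribute names mirror the review's description (`bind`, `_kernel`, `has_kernel`) and are otherwise assumptions:

```python
from typing import Callable, Optional

_UNBOUND = object()  # sentinel: distinguishes "bind() never called" from bind(None)


class OpSlot:
    def __init__(self, op_name: str):
        self.op_name = op_name
        self._kernel = _UNBOUND

    def bind(self, kernel: Optional[Callable]) -> None:
        # bind(None) means "explicitly eager"; the slot still counts as bound.
        self._kernel = kernel

    @property
    def has_kernel(self) -> bool:
        return self._kernel is not _UNBOUND and self._kernel is not None

    def __repr__(self) -> str:
        if self._kernel is _UNBOUND:
            state = "unbound"
        elif self._kernel is None:
            state = "eager"
        else:
            state = getattr(self._kernel, "__name__", repr(self._kernel))
        return f"OpSlot({self.op_name!r}, {state})"
```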
```markdown
---

## Open Questions
```
Great PR! The shift from monkey-patching to the OpSlot-based registry and dependency injection is a massive improvement for readability, maintainability, and upstream compatibility.
Just a few thoughts, bringing up one potential edge case and adding my two cents to your Open Questions:

- **Potential torch.compile graph breaks.** Since OpSlot instances are module-level globals and `.has_kernel` is a dynamic property evaluated during forward, I'm slightly concerned this might introduce graph breaks in `torch.compile` (PyTorch 2.x Dynamo). We might want to verify its compilation behavior or see if we need `torch.compiler.assume_constant_result` for those checks (see the sketch below).
- **Regarding Q1: preset silent fallback.** I highly recommend warning or explicitly failing rather than silently skipping. In large-scale training, if a user thinks they enabled liger to save memory but it silently falls back to eager due to a variant mismatch, the resulting OOM will be incredibly frustrating to debug.
- **Regarding Q4: multi-model processes sharing OpSlots.** You are right that module-level slots mean two instances of the same model cannot have different kernel configs in the same process. While this might be a rare edge case now, it could block future workflows like teacher-student distillation or reference-model setups (where one uses fused and the other uses eager). Not a blocker for this PR, but definitely tech debt to keep an eye on.
Overall, the architecture looks really solid!
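On the first point, a quick way to probe the compile behavior might look like the sketch below; the slot and guard names are assumptions from this thread, and `torch.compiler.assume_constant_result` requires PyTorch 2.x:

```python
import torch


class OpSlot:  # minimal stand-in for the PR's OpSlot
    def __init__(self):
        self._kernel = None

    @property
    def has_kernel(self) -> bool:
        return self._kernel is not None


RMS_NORM_SLOT = OpSlot()


# Telling Dynamo the guard's result is constant lets it specialize once
# instead of re-evaluating the Python-level property on every forward.
@torch.compiler.assume_constant_result
def _rms_norm_kernel_bound() -> bool:
    return RMS_NORM_SLOT.has_kernel


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    if _rms_norm_kernel_bound():
        return RMS_NORM_SLOT._kernel(x, weight, eps)
    variance = x.pow(2).mean(-1, keepdim=True)  # eager fallback
    return weight * (x * torch.rsqrt(variance + eps))


compiled = torch.compile(rms_norm)  # inspect with TORCH_LOGS=graph_breaks
```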
Force-pushed from 245f68b to 7a6ba14.
… PR #569) Squash of commits from piyifan/kernel-patch-exp bringing in the kernel selection and patching framework that this branch builds on. Original piyifan commits preserved on origin/piyifan/kernel-patch-exp.
Modeling code
Taking qwen3 as an example, the key part of the generated modeling code is:
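The code block here didn't survive extraction; a reconstruction of the guard pattern described elsewhere in this thread might look roughly like the following — the slot name, the `OpSlot` import path, and the `kernel` accessor are assumptions:

```python
import torch
from torch import nn

from veomni.ops import OpSlot  # import path is an assumption

# Module-level slot in the generated qwen3 modeling file (name is an assumption).
RMS_NORM_SLOT = OpSlot(op_name="rms_norm")


class Qwen3RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The intended one-line diff from upstream HF: defer to the bound kernel.
        if RMS_NORM_SLOT.has_kernel:
            return RMS_NORM_SLOT.kernel(hidden_states, self.weight, self.variance_epsilon)
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return (self.weight * hidden_states).to(input_dtype)
```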
Kernel registry
Created a kernel registry that allows people to register new kernels and select them by name, as shown below.
Check kernel_defaults.py for details.
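For instance, selection could flow from config to registry roughly like this sketch; the `_resolve_impl_name` getattr shape is quoted from the review comments above, while `KERNEL_REGISTRY.get` and `bind_op` are assumed names, not copied from kernel_defaults.py:

```python
def _resolve_impl_name(ops_config, op_name: str) -> str:
    # e.g. op_name="rms_norm" reads ops_config.rms_norm_implementation.
    return getattr(ops_config, f"{op_name}_implementation", "eager")


def bind_op(slot, ops_config, variant: str) -> None:
    impl = _resolve_impl_name(ops_config, slot.op_name)
    if impl == "eager":
        slot.bind(None)  # keep the upstream eager path
        return
    spec = KERNEL_REGISTRY.get(name=impl, op_name=slot.op_name, variant=variant)
    if spec.hardware is not None:
        spec.hardware.check(spec.name)  # fail fast on GPU/NPU mismatch
    slot.bind(spec.factory())
```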