feat(xpu): VeOmni engine support for Intel XPU #20
Draft
kahlun wants to merge 3 commits into xpu/e2e-clean
- Add 'xpu' to EngineRegistry.register device list for VeOmniEngineWithLMHead
- Add GRPO VeOmni XPU e2e test script (tests/special_xpu/run_grpo_veomni_xpu.sh)

Depends on upstream VeOmni XPU patches: ByteDance-Seed/VeOmni#695
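A device-keyed registry of this kind can be sketched as follows. This is a minimal stand-in, not veRL's actual EngineRegistry: the `register` and `get_engine_cls` helpers, the `model_type` and `backend` values, and the non-XPU entries in the device list are all assumptions made for illustration.

```python
# Hypothetical stand-in for an engine registry keyed by device.
# Not veRL's real EngineRegistry; names and keys are illustrative only.
_ENGINES = {}


def register(model_type, backend, device):
    """Return a class decorator that registers the class once per device."""
    devices = [device] if isinstance(device, str) else list(device)

    def decorator(cls):
        for dev in devices:
            _ENGINES[(model_type, backend, dev)] = cls
        return cls

    return decorator


def get_engine_cls(model_type, backend, device):
    """Look up the engine class, failing loudly for unregistered devices."""
    key = (model_type, backend, device)
    if key not in _ENGINES:
        raise NotImplementedError(f"no engine registered for {key}")
    return _ENGINES[key]


# Adding "xpu" to the device list is the only change needed for lookup to succeed.
# The "cuda" and "npu" entries are assumed, not taken from the PR.
@register(model_type="language_model", backend="veomni", device=["cuda", "npu", "xpu"])
class VeOmniEngineWithLMHead:
    pass
```

With this shape, a lookup for a device that is missing from the list fails at engine selection time rather than later during training.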
…dules

oneCCL (xccl) doesn't support ReduceOp.AVG in reduce_scatter. The FSDP engine already calls set_force_sum_reduction_for_comms(True) on the root module, but VeOmni wraps each layer with fully_shard independently, so the flag must be set on ALL FSDPModule submodules.
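The traversal described in this commit can be sketched without PyTorch as below. `Module` and `FSDPModule` here are small stand-ins for `torch.nn.Module` and an FSDP-wrapped module, and `force_sum_reduction_everywhere` is a hypothetical helper name; only `set_force_sum_reduction_for_comms` is taken from the commit message.

```python
class Module:
    """Stand-in for torch.nn.Module exposing a recursive modules() iterator."""

    def __init__(self, *children):
        self.children = list(children)

    def modules(self):
        yield self
        for child in self.children:
            yield from child.modules()


class FSDPModule(Module):
    """Stand-in for a module wrapped by fully_shard."""

    def __init__(self, *children):
        super().__init__(*children)
        self.force_sum = False

    def set_force_sum_reduction_for_comms(self, enable):
        self.force_sum = enable


def force_sum_reduction_everywhere(root):
    """Set the flag on every FSDPModule in the tree, not only the root."""
    count = 0
    for module in root.modules():
        if isinstance(module, FSDPModule):
            module.set_force_sum_reduction_for_comms(True)
            count += 1
    return count


# Each layer is wrapped independently, then the root is wrapped as well.
layers = [FSDPModule(Module()) for _ in range(3)]
root = FSDPModule(*layers)
num_flagged = force_sum_reduction_everywhere(root)
```

Setting the flag on `root` alone would leave the three per-layer wrappers untouched, which is the case the commit addresses.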
… on XPU

- Changed the hasattr soft check to an explicit guard that raises RuntimeError
- Prevents silent gradient corruption from an incorrect ReduceOp.AVG on oneCCL
- Users now see a clear error mentioning the PyTorch 2.5+ requirement instead of getting wrong gradients
- Pattern matches the hard-fail approach in fsdp_utils.py

This is critical for correctness: without the ReduceOp fix, training fails silently with wrong gradients, which is worse than a hard crash.
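The difference between the soft check and the hard guard can be illustrated roughly as follows. The helper name `require_force_sum_reduction`, the two stand-in classes, and the exact error wording are assumptions; the PR's actual code is not shown on this page.

```python
class WrappedWithSetter:
    """Stand-in for an FSDP-wrapped module that exposes the setter."""

    def __init__(self):
        self.force_sum = False

    def set_force_sum_reduction_for_comms(self, enable):
        self.force_sum = enable


class WrappedWithoutSetter:
    """Stand-in for a wrapped module from a build that lacks the setter."""


def require_force_sum_reduction(module):
    """Enable sum reduction or raise; never skip silently."""
    setter = getattr(module, "set_force_sum_reduction_for_comms", None)
    if setter is None:
        # A soft `if hasattr(...)` check would fall through here and training
        # would continue with the unsupported ReduceOp.AVG path.
        raise RuntimeError(
            "set_force_sum_reduction_for_comms is unavailable on "
            f"{type(module).__name__}; a newer PyTorch is required on XPU"
        )
    setter(True)


ok_module = WrappedWithSetter()
require_force_sum_reduction(ok_module)
```

Calling the helper on `WrappedWithoutSetter()` raises immediately, so a misconfigured environment stops at setup instead of producing incorrect gradients.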
Summary
Register the VeOmni training engine for the XPU device in veRL.
Changes
- Add xpu to the EngineRegistry.register device list for VeOmniEngineWithLMHead
Dependencies
Test