
feat(xpu): VeOmni engine support for Intel XPU #20

Draft

kahlun wants to merge 3 commits into xpu/e2e-clean from xpu/pr-g-veomni-xpu

Conversation


kahlun (Owner) commented Apr 28, 2026

Summary

Register the VeOmni training engine for the XPU device in veRL.

Changes

  • Add xpu to the EngineRegistry.register device list for VeOmniEngineWithLMHead (see the registry sketch below)
  • Add a GRPO VeOmni XPU e2e test script
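
A minimal, self-contained sketch of the registry pattern this change touches. The EngineRegistry shown here is illustrative (the devices keyword, the get lookup, and the empty class body are assumptions, not veRL's actual API); the substance of the PR is exactly the device-list addition in the decorator:

```python
from typing import Callable, Dict, List, Type

class EngineRegistry:
    """Illustrative device-keyed registry; not veRL's real implementation."""
    _engines: Dict[str, Type] = {}

    @classmethod
    def register(cls, devices: List[str]) -> Callable[[Type], Type]:
        def decorator(engine_cls: Type) -> Type:
            for device in devices:
                cls._engines[device] = engine_cls
            return engine_cls
        return decorator

    @classmethod
    def get(cls, device: str) -> Type:
        if device not in cls._engines:
            raise KeyError(f"no engine registered for device {device!r}")
        return cls._engines[device]

# The gist of the change: "xpu" joins the device list for the engine.
@EngineRegistry.register(devices=["cuda", "xpu"])
class VeOmniEngineWithLMHead:
    pass

assert EngineRegistry.get("xpu") is VeOmniEngineWithLMHead
```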

Dependencies

  • Upstream VeOmni XPU patches: ByteDance-Seed/VeOmni#695

Test

bash tests/special_xpu/run_grpo_veomni_xpu.sh

kahlun force-pushed the xpu/pr-g-veomni-xpu branch from d3f5166 to 12c2e11 on April 30, 2026 04:07
kahlun force-pushed the xpu/pr-g-veomni-xpu branch from 12c2e11 to 2b67bc1 on April 30, 2026 08:15
kahlun added 3 commits April 30, 2026 01:17
- Add 'xpu' to EngineRegistry.register device list for VeOmniEngineWithLMHead
- Add GRPO VeOmni XPU e2e test script (tests/special_xpu/run_grpo_veomni_xpu.sh)

Depends on upstream VeOmni XPU patches: ByteDance-Seed/VeOmni#695
…dules

oneCCL (xccl) doesn't support ReduceOp.AVG in reduce_scatter.
The FSDP engine already calls set_force_sum_reduction_for_comms(True)
on the root module, but VeOmni wraps each layer with fully_shard
independently, so the flag must be set on ALL FSDPModule submodules.
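
A minimal sketch of that fix, assuming model is the module tree already sharded by VeOmni's per-layer fully_shard and that FSDPModule.set_force_sum_reduction_for_comms is available (PyTorch 2.5+, per the next commit); the helper name is illustrative, not VeOmni's or veRL's:

```python
from torch.distributed.fsdp import FSDPModule

def force_sum_reduction_on_all_fsdp_modules(model) -> None:
    # VeOmni shards each layer independently, so every layer is its own
    # FSDPModule; flipping the flag only on the root leaves the per-layer
    # reduce_scatter calls on ReduceOp.AVG, which oneCCL (xccl) rejects.
    for module in model.modules():  # modules() includes the root itself
        if isinstance(module, FSDPModule):
            module.set_force_sum_reduction_for_comms(True)
```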
… on XPU

- Changed the hasattr soft check to an explicit guard that raises RuntimeError
- Prevents silent gradient corruption from the unsupported ReduceOp.AVG on oneCCL
- Users now see a clear error citing the PyTorch 2.5+ requirement instead of silently wrong gradients
- Pattern matches the hard-fail approach in fsdp_utils.py

This is critical for correctness: missing the ReduceOp fix silently degrades into wrong gradients, which is worse than a hard crash.
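
A sketch of the hardened guard described above, contrasting the old soft check with the new hard fail (the function name and error wording are illustrative):

```python
from torch.distributed.fsdp import FSDPModule

def enable_force_sum_reduction(module: FSDPModule) -> None:
    # Old soft check: `if hasattr(...)` silently skipped the fix, letting
    # reduce_scatter fall back to ReduceOp.AVG and corrupt gradients.
    # New behavior: refuse to train rather than train on wrong gradients.
    if not hasattr(module, "set_force_sum_reduction_for_comms"):
        raise RuntimeError(
            "FSDPModule.set_force_sum_reduction_for_comms is unavailable; "
            "PyTorch 2.5+ is required for correct gradients on XPU/oneCCL"
        )
    module.set_force_sum_reduction_for_comms(True)
```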
kahlun force-pushed the xpu/pr-g-veomni-xpu branch from 2b67bc1 to 7b47958 on April 30, 2026 08:19
