[model] feat: Qwen3.5 is compatible with NPU #600

Open
wang-hua-2019 wants to merge 1 commit into ByteDance-Seed:main from wang-hua-2019:main

Conversation

@wang-hua-2019
Contributor

What does this PR do?

Adapts the Qwen3.5 model to Ascend NPU cards.

Checklist Before Starting

  • Search for related PRs/issues and link here: ...
  • PR title follows [{modules}] {type}: {description} format
    • {modules}: misc, ci, config, docs, data, dist, omni, logging, model, optim, ckpt, release, task, perf, ops, parallel, trainer
    • {type}: feat, fix, refactor, chore, test
    • Breaking changes: prepend [BREAKING] — e.g. [BREAKING][parallel, model] feat: dynamic batching

Test

Validation results (training curves, eval metrics) for changes not covered by CI.

API and Usage Example

Show API changes and usage examples if applicable.

Design & Code Changes

Following the GPU patch approach, this PR adds NPU patch code. Main changes: (1) wire the fused RMSNorm operator into mojo; (2) fix the error raised when fused_moe is true.
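
For readers unfamiliar with the operator being fused, here is a minimal pure-Python sketch of the textbook RMSNorm formula. It is a reference for the math only: the function name `rms_norm` and the `eps` default are assumptions for illustration, and the model's actual variant (and the mojo kernel's signature) may differ.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over a single hidden vector given as a list of floats.

    Hypothetical helper for illustration; not the fused NPU kernel.
    """
    # Mean of squares over the hidden dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    # Scale every element by the reciprocal root-mean-square.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Apply the learned per-channel gain.
    return [v * inv_rms * w for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

A fused operator computes the same result in one kernel launch instead of separate square, mean, rsqrt, and multiply steps, which is why a reference like this is handy for numerical comparison.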

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • If tasks/ training scripts were moved or renamed: updated docs/ examples and verified python3 scripts/ci/check_doc_task_paths.py passes (also enforced by the Check doc task paths CI workflow)
  • Added tests to CI workflow (or explained why not feasible)

github-actions bot added the "ascend" (everything about Ascend support) label on Mar 23, 2026
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces NPU compatibility for the Qwen3.5 and Qwen3.5 MoE models. Key changes include conditional imports for NPU-specific patched models, integration of mojo_opset for optimized RMSNorm and causal convolution, and extensive modifications to support Ulysses Sequence Parallelism (SP) and FSDP-safe multimodal processing. The Qwen3_5GatedDeltaNet and Qwen3_5MoeGatedDeltaNet forward passes have been updated to handle variable-length sequences and SP-aware weight sharding. Additionally, the MoE expert dispatch now supports a fused implementation, and several vision model methods have been optimized for performance and distributed training compatibility. The explicit NotImplementedErrors indicate areas where NPU support is still under development for specific execution paths, and the ValueError for multimodal inputs in the MoE version clarifies current limitations.

Comment on lines +504 to +507
+ # Modification: use out-of-place add instead of `expert_output += shared_expert_output`
+ # to avoid "Output of MergedFc1TritonFusedMoeExpertFunctionBackward is a view and is
+ # being modified inplace" RuntimeError from PyTorch autograd.
+ expert_output = expert_output + shared_expert_output

critical

Changing expert_output += shared_expert_output to expert_output = expert_output + shared_expert_output is a critical fix to avoid in-place modification errors with PyTorch's autograd system. This prevents potential runtime crashes related to view operations on custom autograd function outputs.
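
The difference between the two forms can be illustrated without PyTorch. The sketch below is a plain-Python analogy only: `Tensorish` is a hypothetical stand-in class, and it does not reproduce autograd's view tracking. It shows that `+=` mutates the object every alias sees, while `a = a + b` builds a new object and leaves the original untouched.

```python
class Tensorish:
    """Hypothetical stand-in for a tensor; not a PyTorch type."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Out-of-place: allocate a new object for the result.
        return Tensorish(a + b for a, b in zip(self.values, other.values))

    def __iadd__(self, other):
        # In-place: overwrite this object's own storage.
        for i, b in enumerate(other.values):
            self.values[i] += b
        return self


shared = Tensorish([10.0, 20.0])

# In-place add: `saved` and `out` are the same object, so both change.
saved = Tensorish([1.0, 2.0])
out = saved
out += shared
in_place_saved = list(saved.values)

# Out-of-place add: `out` is rebound to a new object, `saved` is untouched.
saved = Tensorish([1.0, 2.0])
out = saved
out = out + shared
```

In the PR's case the aliased object is a view returned by a custom autograd function, which is why the out-of-place form is the safer choice there.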

Comment on lines +265 to +266
+ # Modification: keep this disabled until FLA causal_conv1d_update decode path is validated.
+ raise NotImplementedError("use_precomputed_states=True is not supported yet for causal_conv1d_update now.")

high

This NotImplementedError indicates that the use_precomputed_states=True path for causal_conv1d_update is not yet supported for NPU. This could lead to runtime failures if this specific decoding path is triggered in an NPU environment. Consider prioritizing the implementation of this path or providing a clear warning in the documentation about this limitation.

Comment on lines +298 to +299
- mixed_qkv = mixed_qkv.transpose(1, 2)
+ raise NotImplementedError("This path is not supported yet because it can't process varlen now.")

high

This NotImplementedError suggests that the fallback path (when self.causal_conv1d_fn is None) does not support variable-length sequences on NPU. This means that if the mojo_causal_conv1d is not available or if this specific path is taken, varlen processing will fail. It's crucial to ensure that mojo_causal_conv1d is always available or to implement a robust fallback for varlen processing.
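
To make concrete what a varlen-aware fallback would have to guarantee, here is a small pure-Python reference for a single-channel causal convolution over packed sequences. It is a sketch under assumptions: the function name `causal_conv1d_varlen` and the list-based inputs are hypothetical, it omits any activation, and it is not the mojo or FLA kernel.

```python
def causal_conv1d_varlen(x, weight, cu_seqlens):
    """Single-channel causal conv over sequences packed into one flat list.

    x:          packed tokens of all sequences, concatenated
    weight:     K filter taps; weight[-1] multiplies the current token
    cu_seqlens: cumulative boundaries, e.g. [0, 3, 5] for lengths 3 and 2
    """
    k = len(weight)
    out = [0.0] * len(x)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for t in range(start, end):
            acc = 0.0
            for j in range(k):
                src = t - (k - 1) + j
                # Positions before `start` act as zero padding, so no
                # token from the previous sequence leaks into this one.
                if src >= start:
                    acc += weight[j] * x[src]
            out[t] = acc
    return out


x = [1.0, 2.0, 3.0, 10.0, 20.0]
y = causal_conv1d_varlen(x, [1.0, 1.0], [0, 3, 5])
# Treating the same tokens as one sequence lets position 3 see position 2.
y_single = causal_conv1d_varlen(x, [1.0, 1.0], [0, 5])
```

The difference at index 3 between `y` and `y_single` is exactly the cross-sequence leakage that a plain transpose-and-convolve fallback would introduce on packed inputs.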

Comment on lines +460 to +464
+ cu_seq_lens_q = kwargs.get("cu_seq_lens_q", None)
+ assert cu_seq_lens_q is not None, (
+ "cu_seq_lens_q must be provided to support varlen Flash Linear Attention, varlen Conv1D,"
+ "and to remove the full Flash Attention CPU-NPU sync."
+ )

high

The assertion cu_seq_lens_q is not None makes cu_seq_lens_q a mandatory argument for Qwen3_5DecoderLayer.forward when using varlen Flash Linear Attention or Conv1D. While this enforces correct usage, it's important to ensure that all upstream callers consistently provide this argument to prevent runtime crashes. If there are scenarios where cu_seq_lens_q might legitimately be None, a more graceful handling (e.g., falling back to a non-varlen path if possible) might be considered.
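
One way a caller could satisfy this requirement, and one way the layer could report a missing value more robustly than a bare `assert` (which is stripped under `python -O`), is sketched below. Both helper names are hypothetical and plain lists stand in for the tensors the real code would use.

```python
from itertools import accumulate


def build_cu_seq_lens(seq_lens):
    """Turn per-sequence lengths into cumulative boundaries starting at 0."""
    return [0, *accumulate(seq_lens)]


def require_cu_seq_lens(kwargs):
    """Fetch cu_seq_lens_q from kwargs, raising a descriptive error if unusable."""
    cu = kwargs.get("cu_seq_lens_q")
    if cu is None:
        raise ValueError(
            "cu_seq_lens_q must be provided for the varlen linear attention "
            "and varlen Conv1D paths."
        )
    # Boundaries must start at 0 and never decrease.
    if cu[0] != 0 or any(b < a for a, b in zip(cu, cu[1:])):
        raise ValueError(f"cu_seq_lens_q is not a valid cumulative sequence: {cu}")
    return cu


cu = build_cu_seq_lens([3, 2, 4])
checked = require_cu_seq_lens({"cu_seq_lens_q": cu})

try:
    require_cu_seq_lens({})
    missing_raised = False
except ValueError:
    missing_raised = True
```

Raising `ValueError` keeps the check active in optimized runs and gives upstream callers a specific exception type to handle.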

Comment on lines +250 to +251
+ # Modification: keep this disabled until FLA causal_conv1d_update decode path is validated.
+ raise NotImplementedError("use_precomputed_states=True is not supported yet for causal_conv1d_update now.")

high

Similar to the non-MoE version, this NotImplementedError indicates that use_precomputed_states=True is not supported for causal_conv1d_update in the MoE model on NPU. This is a critical limitation for certain decoding scenarios and should be addressed or clearly documented.

Comment on lines +279 to +280
+ )[0]
else:

high

This NotImplementedError in the MoE model's Qwen3_5MoeGatedDeltaNet.forward indicates that the fallback path for causal_conv1d_fn does not support variable-length sequences. This is a high-severity issue as it can lead to crashes if the optimized NPU mojo_causal_conv1d is not used or available, and varlen inputs are provided.

Comment on lines +565 to +569
+ cu_seq_lens_q = kwargs.get("cu_seq_lens_q", None)
+ assert cu_seq_lens_q is not None, (
+ "cu_seq_lens_q must be provided to support varlen Flash Linear Attention, varlen Conv1D,"
+ "and to remove the full Flash Attention CPU-GPU sync."
+ )

high

Similar to the non-MoE version, the assertion cu_seq_lens_q is not None in Qwen3_5MoeDecoderLayer.forward is a strong requirement. Ensure that cu_seq_lens_q is always provided by callers when varlen linear attention or Conv1D is expected, or consider a more robust error handling/fallback mechanism.

Comment on lines +683 to +687
+ if pixel_values is not None or pixel_values_videos is not None:
+ raise ValueError(
+ "Qwen3_5MoeForConditionalGeneration currently supports text-only inputs in VeOmni; "
+ "`pixel_values` and `pixel_values_videos` are not supported yet."
+ )

high

The ValueError explicitly states that Qwen3_5MoeForConditionalGeneration currently supports text-only inputs in VeOmni and does not support pixel_values or pixel_values_videos. This is a clear and important limitation. While it prevents incorrect usage, it highlights an area for future development if multimodal MoE is desired.

Labels

ascend everything about Ascend support
