[qwen3_5] evolve qwen3_vl to qwen3_5 by shuhuayu · Pull Request #3371 · pytorch/torchtitan

shuhuayu · 2026-05-15T20:05:01Z

Qwen3.5 supersedes Qwen3-VL with a hybrid attention architecture: 75% GatedDeltaNet (linear attention) + 25% full attention with output gating and partial RoPE.
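
To make the 75%/25% split concrete, the sketch below builds a per-layer schedule with one full-attention layer in every group of four. The interleaving period, the placement of the full-attention layer at the end of each group, and the helper name `hybrid_layer_types` are assumptions for illustration; the description above only states the ratio.

```python
def hybrid_layer_types(num_layers: int, full_attn_interval: int = 4) -> list[str]:
    """Return a layer-type schedule with one full-attention layer per interval.

    Assumption: full attention sits on the last layer of each group of
    `full_attn_interval` layers; the real model may interleave differently.
    """
    if num_layers <= 0 or full_attn_interval <= 0:
        raise ValueError("num_layers and full_attn_interval must be positive")
    return [
        "full_attn" if (i + 1) % full_attn_interval == 0 else "linear_attn"
        for i in range(num_layers)
    ]


layers = hybrid_layer_types(8)
full_fraction = layers.count("full_attn") / len(layers)
print(layers, full_fraction)
```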

Model changes:

Hybrid decoder with GatedDeltaNet and Qwen35Attention
Head-sharded TP on GatedDeltaNet projections (ColwiseParallel/RowwiseParallel)
OffsetRMSNorm, RMSNormGated, MoE with shared expert
Removed DeepStack

Parallelisms: fsdp, tp+sp, ep, pp, verified identical logits via numerical tests (scripts/checkpoint_conversion/numerical_tests_qwen3_5_shard.py).

Numerical parity: kl ~3e-7 against hf models (4b, multimodal) and 100% top-1/top-5 match (scripts/checkpoint_conversion/numerical_tests_qwen3_5.py).
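
For readers unfamiliar with these parity metrics, here is a minimal sketch of how a per-token KL divergence and a top-k match could be computed from two rows of logits. It uses plain Python rather than the referenced test script, and the definition of "top-k match" (same k tokens in the same order) is an assumption; the script may define it differently.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits: list[float], test_logits: list[float]) -> float:
    """KL(ref || test) for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def top_k_indices(logits: list[float], k: int) -> list[int]:
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def top_k_match(ref_logits: list[float], test_logits: list[float], k: int) -> bool:
    """True when both rows rank the same k tokens in the same order."""
    return top_k_indices(ref_logits, k) == top_k_indices(test_logits, k)


ref = [2.0, 1.0, 0.5, -1.0]
test = [2.0 + 1e-4, 1.0, 0.5 - 1e-4, -1.0]  # tiny perturbation of ref
kl = kl_divergence(ref, test)
print(kl, top_k_match(ref, test, 1), top_k_match(ref, test, 3))
```

Averaging `kl` over all token positions would give a single summary number of the kind quoted above.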

Many thanks to @gali-leilei for initiating the effort to enable the Qwen3.5 decoder in torchtitan in #2545; some components are reused in this PR.

tianyu-l · 2026-05-15T21:59:51Z

    head_dims: int,
    seq_len: int,
+    *,
+    num_full_attn: int | None = None,


can compute this from model_config right?

correct, removed.

tianyu-l · 2026-05-15T22:05:00Z

+
+End-to-end KL divergence against HuggingFace Transformers (4B, multimodal inputs): **~3e-7** average, with **100% top-1 and top-5 match**.
+
+Parallelism correctness: bitwise identical logits across no-parallel, FSDP, FSDP+EP, FSDP+EP+TP, and FSDP+EP+TP+CP configs.


hmm, how could this be true? Different parallelisms have different reductions

you are right. the script only checks that the logits are numerically near-identical, not bitwise identical.
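
The reviewer's point about reductions can be illustrated without any distributed setup: floating-point addition is not associative, so grouping the same values differently (as different shardings do) can change the last bits of the result. The sketch below is a toy stand-in for a sharded reduction, not the PR's test code; the helper names are hypothetical.

```python
import math


def add_left_to_right(values: list[float]) -> float:
    """Plain sequential float addition (no compensated summation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def sharded_sum(shards: list[list[float]]) -> float:
    """Sum each shard locally, then add the per-shard partial sums."""
    return add_left_to_right([add_left_to_right(s) for s in shards])


no_parallel = sharded_sum([[0.1, 0.2, 0.3]])   # (0.1 + 0.2) + 0.3
two_way = sharded_sum([[0.1], [0.2, 0.3]])     # 0.1 + (0.2 + 0.3)

print(no_parallel == two_way)                  # bitwise comparison
print(math.isclose(no_parallel, two_way, rel_tol=1e-12))
```

A parity test across parallelism configs therefore normally compares with a tolerance rather than for exact equality.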

tianyu-l · 2026-05-15T22:11:09Z

+            mesh, plc = x.device_mesh, x.placements
+            w = self.weight
+            if isinstance(w, DTensor):
+                w = w.to_local()


With spmd_types, hopefully we don't need to do this manual conversion.

For now, let's do to_local in the module, similar to GroupedExperts, and use LocalMapConfig to convert inputs, instead of patching forward.

sounds great, refactored to the style used in groupedexperts.

tianyu-l · 2026-05-15T22:11:46Z

+    F.interpolate's decomposition uses _unsafe_index which doesn't support
+    DTensor. Since pos_embed is Replicate, to_local is a no-op for data.
+
+    TODO: Remove once F.interpolate on FSDP2-managed DTensors is fixed upstream.


If this can be fixed soon, let's wait.

tianyu-l · 2026-05-15T22:14:07Z

+        )
+        edp_mesh = parallel_dims.get_optional_mesh(edp_mesh_names)
+
+    apply_fsdp(


do we not need things like _apply_fsdp_to_vision_encoder any more?

This was previously handled in fully_shard(model, **fsdp_config), but as you said we should separate it. Now fsdp is applied to the vision encoder separately, treating the vit as a single unit.

tianyu-l · 2026-05-15T22:40:25Z

+    class Config(Module.Config):
+        layer_type: str  # "full_attn" or "linear_attn"
+        attention: Qwen35Attention.Config | None = None
+        deltanet: GatedDeltaNet.Config | None = None


Suggested change

deltanet: GatedDeltaNet.Config | None = None

delta_net: GatedDeltaNet.Config | None = None

tianyu-l · 2026-05-15T22:45:15Z

+        if self.moe_enabled:
+            moe_out = self.moe(h)
+            if self.shared_expert_enabled:
+                shared_out = torch.sigmoid(self.shared_gate(h)) * self.shared_ffn(h)


instead of doing this, can we extend https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/config_utils.py#L153 and use existing shared_expert inside MoE module?

extended in common/config_utils.py. Currently only qwen3_5 uses this sigmoid gate, but it is a simple extension that can be reused later.
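
As a reference for what the sigmoid gate does numerically, here is a small plain-Python sketch. It assumes the gate is a bias-free linear map from the hidden vector to a single logit, so one scalar in (0, 1) scales the whole shared-expert output; `gated_shared_expert` and the stand-in `double` network are hypothetical names, not torchtitan APIs.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_shared_expert(
    x: Sequence[float],
    gate_weight: Sequence[float],
    shared_ffn: Callable[[Sequence[float]], list[float]],
) -> list[float]:
    """Scale the shared-expert output by a per-token scalar sigmoid gate."""
    gate_logit = sum(w * xi for w, xi in zip(gate_weight, x))
    scale = sigmoid(gate_logit)
    return [scale * y for y in shared_ffn(x)]


def double(x: Sequence[float]) -> list[float]:
    """Stand-in for the shared feed-forward network."""
    return [2.0 * xi for xi in x]


token = [1.0, -2.0, 0.5]
# A zero gate weight gives sigmoid(0) = 0.5, i.e. the output is halved.
half_open = gated_shared_expert(token, [0.0, 0.0, 0.0], double)
print(half_open)
```

In the full block this gated output would then be added to the routed-expert output.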

tianyu-l · 2026-05-15T22:47:54Z

+
+    @dataclass(kw_only=True, slots=True)
+    class Config(Module.Config):
+        layer_type: str  # "full_attn" or "linear_attn"


don't really need this? The config can be built so that this block has either attention or deltanet. Refer to how feed_forward vs. moe is selected.

good point, this is redundant since attention and delta_net already indicate this.

tianyu-l · 2026-05-16T00:19:44Z

+            for block in model.layers.values()  # pyrefly: ignore [not-callable]
+            if block.layer_type == "full_attn"  # pyrefly: ignore [missing-attribute]
+        ]
+        if full_attn_inner_modules:


I didn't see an "else" here -- how are you handling sharded activation on linear attention layers?

We used Replicate() for that, but as discussed in a previous thread, the current cp is inefficient and defeats the purpose of supporting it. cp is removed for now.

tianyu-l · 2026-05-16T00:20:01Z

+    # runs inside the local_map boundary on local tensors.
+    # Applies to full attention layers only — GatedDeltaNet is recurrent
+    # and allgathers the full sequence via cp=Replicate() in sharding.
+    if parallel_dims.cp_enabled:


Since CP is non-trivial, let's just raise NotImplementedError
https://www.internalfb.com/metamate/M4978C

shuhuayu

Left a TODO on conv1d waiting for dtensor support in pytorch/pytorch#186129

shuhuayu · 2026-06-02T06:41:10Z

+        self.kernel = GatedDeltaKernel.Config(backend=config.fla_backend).build()
+
+        self.norm = RMSNormGated.Config(
+            dim=config.value_head_dim,
+            eps=config.norm_eps,
+            param_init=config.norm_init,
+        ).build()
+        self.out_proj = Linear.Config(
+            in_features=value_dim,
+            out_features=config.dim,
+            bias=False,
+            param_init=config.out_proj_init,
+        ).build()


makes sense, submodule configs are moved to module.config.

shuhuayu · 2026-06-02T07:04:51Z


 LayerNorm = Module.from_nn_module(nn.LayerNorm)
 GELU = Module.from_nn_module(nn.GELU)

 _compiled_create_block_mask = torch.compile(create_block_mask)


-def get_vision_block_mask_mod(num_patch: torch.Tensor, max_num_patch: int):


yes, this was a bug.

shuhuayu · 2026-06-02T21:18:42Z

+        wq: Linear.Config,
+        wk: Linear.Config,
+        wv: Linear.Config,
+        proj: Linear.Config,


refactored.

shuhuayu · 2026-06-02T21:21:53Z

        self.norm1 = LayerNorm(dim, eps=layer_norm_eps)
        self.norm2 = LayerNorm(dim, eps=layer_norm_eps)
-        self.attn = VisionAttention(dim, n_heads, qkv=attn_qkv, proj=attn_proj)
+        self.attn = VisionAttention(


refactored.

tianyu-l · 2026-06-04T00:22:39Z

        router: TokenChoiceTopKRouter.Config
        load_balance_coeff: float | None = 1e-3
        shared_experts: FeedForward.Config | None = None
+        shared_expert_gate: Module.Config | None = None


Suggested change

shared_expert_gate: Module.Config | None = None

shared_experts_gate: Module.Config | None = None

more accurate. the hf keys remain unchanged as shared_expert_gate.

tianyu-l · 2026-06-04T00:24:10Z

            enable_ep=enable_ep, enable_sp=enable_sp
        )

+    if getattr(moe_cfg, "shared_expert_gate", None) is not None:


why do we need getattr? It seems always existing (could be None)

indeed. this one and a pre-existing getattr are removed.

tianyu-l · 2026-06-04T00:27:21Z

+            if self.shared_expert_gate is not None:
+                shared_out_BLD = (
+                    torch.sigmoid(self.shared_expert_gate(x_BLD)) * shared_out_BLD
+                )


What's the behavior under TP?
We used to assume on TP mesh shared_out_BLD is Partial, now there will be more collectives??
If TP is not supposed to be used (DP, EP only) as it's not efficient, then in sharding annotation, don't annotate / support TP.

you are right, when tp is on and shared experts are used, DTensor does not know we have already gathered from Shard(1) for the experts computation itself, so it will gather twice and thus waste one collective. I redesigned the shared_experts module, which now inherits from FeedForward.

actually when tp is on, there are two duplicated all-gathers for w1 and w3, which seem unnecessary to me. i rewrote it so there is one all-gather for all three: w1, w3, and the optional gate.

tianyu-l · 2026-06-04T00:29:16Z

+def set_deltanet_conv1d_sharding(deltanet_module) -> None:
+    """Set sharding on GatedDeltaNet sub-modules built inline.
+
+    Conv1d modules don't have Config fields, so their sharding must be


Could you do similar things as in https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/nn_modules.py

tianyu-l · 2026-06-04T00:30:47Z

+class _Conv1d(nn.Conv1d, Module):
+    pass


put in https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/common/nn_modules.py

tianyu-l · 2026-06-04T00:32:05Z

+except ImportError:
+    _HAS_FLA = False


I think it doesn't make sense to run this model with FLA. Let's put this in model specific requirements.txt and in CI.

maybe you are saying it doesn't make sense to run it without FLA? added the dependency in .ci/docker/requirements-vlm.txt.

since it's in the requirements, can we remove such check, or put the raise here -- if one wants to run qwen3_5, they need to install fla, regardless of if they intend to use native impl or not
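
One way to express "raise here" is a small helper that imports a required dependency and fails immediately with an actionable message, instead of recording an `_HAS_FLA` flag. This is only a sketch: `require_module` is a hypothetical helper, and the package import name `fla` in the usage comment is an assumption.

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import `name` or fail immediately with an actionable message."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"'{name}' is required for this model; {install_hint}"
        ) from exc


# Usage at module import time (assumed import name `fla`):
# fla = require_module("fla", "install the flash-linear-attention package")
math_mod = require_module("math", "part of the standard library")
print(math_mod.sqrt(4.0))
```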

tianyu-l · 2026-06-04T00:32:25Z

+    return x * torch.rsqrt((x * x).sum(dim=dim, keepdim=True) + eps)
+
+
+def _torch_naive_gated_delta(


maybe native, not naive
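
For context on what such a native (non-FLA) reference computes, below is a plain-Python, single-head sketch of the gated delta rule, assuming the standard recurrence S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. It is not the PR's function; tensor layout, normalization, and gating parameterization there may differ, and keys are assumed to be already L2-normalized.

```python
def native_gated_delta(q, k, v, alpha, beta):
    """Single-head gated delta rule, evaluated one token at a time.

    q, k: lists of T key-dim vectors; v: list of T value-dim vectors;
    alpha, beta: lists of T scalars (decay gate and write strength).
    Returns the list of T output vectors.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_k for _ in range(d_v)]  # memory mapping keys to values
    outputs = []
    for q_t, k_t, v_t, a_t, b_t in zip(q, k, v, alpha, beta):
        # 1) decay the memory
        state = [[a_t * s for s in row] for row in state]
        # 2) read what the memory currently predicts for this key
        pred = [sum(s * kj for s, kj in zip(row, k_t)) for row in state]
        # 3) write the scaled prediction error back along the key direction
        for i in range(d_v):
            err = b_t * (v_t[i] - pred[i])
            for j in range(d_k):
                state[i][j] += err * k_t[j]
        # 4) query the updated memory
        outputs.append([sum(s * qj for s, qj in zip(row, q_t)) for row in state])
    return outputs


keys = [[1.0, 0.0], [1.0, 0.0]]
vals = [[3.0], [5.0]]
# With alpha = beta = 1, writing the same unit key twice overwrites the value.
out = native_gated_delta(keys, keys, vals, alpha=[1.0, 1.0], beta=[1.0, 1.0])
print(out)
```

The overwrite behaviour in the usage example is what distinguishes the delta rule from purely additive linear attention, where the second write would accumulate on top of the first.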

tianyu-l · 2026-06-04T00:34:32Z

+            if isinstance(w, DTensor):
+                w = w.to_local()
+            local_groups = w.size(0)
+            # pyrefly: ignore [no-matching-overload]
+            out = F.conv1d(
+                x.to_local(),
+                w,
+                None,
+                conv.stride,
+                conv.padding,
+                conv.dilation,
+                local_groups,
+            )
+            x = DTensor.from_local(out, mesh, plc, run_check=False)


use local_map, not to_local / from_local
specify gradient placement
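
Why running the convolution on local shards is valid at all can be seen in a toy example. Assuming the conv is depthwise (one kernel per channel, which is what `local_groups = w.size(0)` suggests) and causal, each channel is independent, so convolving each rank's channel slice with its matching kernel slice and concatenating reproduces the unsharded result. The sketch uses plain Python lists rather than DTensor or `local_map`.

```python
def depthwise_causal_conv1d(x, weight):
    """Depthwise causal 1-D convolution (cross-correlation, as in PyTorch).

    x: list of C channels, each a list of T samples.
    weight: list of C kernels, each a list of K taps (groups == C).
    """
    out = []
    for channel, kernel in zip(x, weight):
        k = len(kernel)
        padded = [0.0] * (k - 1) + list(channel)  # left padding keeps causality
        out.append(
            [
                sum(kernel[j] * padded[t + j] for j in range(k))
                for t in range(len(channel))
            ]
        )
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]
w = [[1.0, 1.0], [0.5, 0.5], [1.0, -1.0], [2.0, 0.0]]

full = depthwise_causal_conv1d(x, w)
# Two "ranks", each holding half of the channels and the matching kernels.
sharded = depthwise_causal_conv1d(x[:2], w[:2]) + depthwise_causal_conv1d(x[2:], w[2:])
print(full == sharded)
```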

tianyu-l · 2026-06-04T00:36:38Z

+                    l.attention
+                    for l in self.model_config.layers
+                    if getattr(l, "attention", None) is not None


can be simplified... to just use getattr

tianyu-l · 2026-06-04T00:37:11Z

+                        l.attention
+                        for l in self.layers
+                        if getattr(l, "attention", None) is not None


given how frequently this is used, we probably should create a property in Decoder config to compute this.

tianyu-l

not sure how popular the shared experts gate would be, so would like to stay conservative

tianyu-l · 2026-06-05T18:13:50Z

        )


+class SharedExperts(FeedForward):


Given the gate thing is very much qwen3_5 specific, I would put this in qwen3_5 folder for now, and all other models still use FeedForward.

tianyu-l · 2026-06-05T18:14:12Z

        router: TokenChoiceTopKRouter.Config
        load_balance_coeff: float | None = 1e-3
-        shared_experts: FeedForward.Config | None = None
+        shared_experts: SharedExperts.Config | None = None


and since it inherits FeedForward, we can keep it unchanged.

tianyu-l · 2026-06-05T18:15:06Z

                    non_blocking_capacity_factor=non_blocking_capacity_factor,
                ),
-                shared_experts=make_ffn_config(
+                shared_experts=make_shared_experts_config(


only do this to qwen3_5 shared experts

tianyu-l · 2026-06-05T18:17:13Z

+_REPLICATE_PARAM = dense_param_placement(tp=Replicate())
+_REPLICATE_STATE = ShardingConfig(
+    state_shardings={"weight": _REPLICATE_PARAM, "bias": _REPLICATE_PARAM}
+)
+_REPLICATE_ACT = dense_activation_placement(tp=Replicate())


not sure if we should share reference among all usages

tianyu-l · 2026-06-05T18:25:03Z

+        out = super().forward(x)
+        if self.gate is not None:
+            # TODO: make the gate activation configurable (e.g. softmax, silu)
+            out = torch.sigmoid(self.gate(x)) * out


self.gate is Replicate
x is sharded
self.gate(x) is sharded -> replicate
out is partial -> final outcome is Partial

sounds correct.
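
The placement reasoning above relies on a simple identity: scaling every rank's partial sum by the same replicated gate and then reducing equals reducing first and scaling afterwards, so the gated output can stay Partial. A toy check with plain lists standing in for per-rank tensors (`all_reduce_sum` is a hypothetical helper, not a DTensor API):

```python
def all_reduce_sum(partials: list[list[float]]) -> list[float]:
    """Element-wise sum across ranks, i.e. what resolves a Partial placement."""
    return [sum(vals) for vals in zip(*partials)]


# Each rank holds a partial sum of the shared-expert output (Partial)
# and the same gate value (Replicate).
partial_out = [[1.0, 2.0], [3.0, -1.0]]  # rank 0, rank 1
gate = 0.25

# Gate applied locally before the reduction ...
gated_then_reduced = all_reduce_sum([[gate * y for y in p] for p in partial_out])
# ... equals gate applied after the reduction.
reduced_then_gated = [gate * y for y in all_reduce_sum(partial_out)]
print(gated_then_reduced, reduced_then_gated)
```

The identity would not hold if the gate differed across ranks, which is why the gate output has to be replicated before the multiplication.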

tianyu-l · 2026-06-05T18:25:52Z

+            in_src_shardings={"x": dense_activation_placement(tp=shared_input_layout)},
+            in_dst_shardings={"x": dense_activation_placement(tp=Replicate())},


This is worth fixing even if we split up FeedForward and SharedExperts

shuhuayu · 2026-06-09T09:31:24Z

Thanks for all the comments/suggestions! Some updates: 1) refactored mrope for per-layer rope, moved its position building into the dataloader, and added mrope_positions to the trainer's input_dict in the extra_kwargs part for all pp stages. 2) refactored to spmd types. 3) refactored sharedexperts so it now lives only in qwen3_5. 4) reran the numerical tests, which still pass.

tianyu-l · 2026-06-09T23:46:17Z

        extra_kwargs: dict[str, Any] = {}

        positions = extra_inputs.pop("positions", None)
+        mrope_positions = extra_inputs.pop("mrope_positions", None)


No, this is model detail, shouldn't be exposed in trainer.

tianyu-l · 2026-06-09T23:46:51Z

    head_dims: int,
    seq_len: int,
+    *,
+    num_full_attn: int | None = None,


tianyu-l · 2026-06-09T23:55:01Z

+            raise ValueError("Decoder config does not define RoPE max_seq_len.")
+
+        @property
+        def first_attn_config(self) -> BaseAttention.Config | None:


Suggested change

def first_attn_config(self) -> BaseAttention.Config | None:

def first_attention(self) -> BaseAttention.Config | None:

tianyu-l · 2026-06-09T23:55:34Z

+                    raise ValueError(
+                        "No layer with attention config found for TP validation."
+                    )


why raise, no Attention means all-good?

tianyu-l · 2026-06-09T23:56:41Z

+        assert (
+            attn_config is not None
+        ), "get_attention_masks requires an attention layer"


similar, no attention sounds fine? E.g. some single pipeline stage only has DeltaNet module

tianyu-l · 2026-06-10T00:15:49Z

+    logger.info("Applied fully_shard to the Qwen3.5 model")
+
+    if training.enable_cpu_offload:
+        logger.info("Applied CPU Offloading to the Qwen3.5 model")


tianyu-l · 2026-06-10T00:19:50Z

+        )
+        # Vision encoder lives on the first stage alongside tok_embeddings
+        if hasattr(model, "vision_encoder") and model.vision_encoder is not None:
+            fqn_per_part[0].insert(0, "vision_encoder")


not sure how heavy the vision_encoder is, maybe worth investigating whether we should adjust parallelism.pipeline_parallel_first_stage_less_layers later

added comments to reflect this forward-looking point.

tianyu-l · 2026-06-10T00:27:10Z

+            config,
+            **kwargs,
+        ) -> None:
+            Decoder.Config.update_from_config(self, config=config, **kwargs)


move this to bottom after #3595, whichever lands first @wwwjn

tianyu-l · 2026-06-10T07:41:57Z

            global_valid_tokens,
            params,
-            extra_inputs,
+            {},


maybe kill this field

tianyu-l · 2026-06-10T07:48:43Z

        # maskless backend (e.g. the SDPA config used by the graph_trainer
        # tests) still receives positions for RoPE but no masks — it relies on
        # is_causal instead.
-        if isinstance(self.model_config, Decoder.Config) and positions is not None:


the `positions is not None` check is lost

tianyu-l · 2026-06-10T07:48:59Z

don't need this line

tianyu-l · 2026-06-10T07:54:39Z

+                    (e.g. positions, attention_masks), forwarded to every
+                    pipeline-parallel stage.
        """
        inputs = input_dict["input"]


inputs and labels are really not special and IMO not worth special handling, except for how labels is involved in loss computation. Can delay the general refactor to later.

tianyu-l · 2026-06-10T07:54:58Z

shuhuayu · 2026-06-10T17:26:58Z

thanks for the careful and sharp reviews! let's merge it to avoid more refactors and iterate later for bugs/features.

shuhuayu requested review from fegin, tianyu-l, wconstab and wwwjn as code owners May 15, 2026 20:05

pytorch-bot Bot added the ciflow/8gpu label May 15, 2026

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 15, 2026

shuhuayu requested review from acisseJZhong and felipemello1 May 15, 2026 20:21

shuhuayu force-pushed the modeldev branch 2 times, most recently from ee4a27a to af12fc7 Compare May 15, 2026 21:25

pytorch-bot Bot added the ciflow/rl label May 15, 2026

shuhuayu force-pushed the modeldev branch 2 times, most recently from e8fb20e to 6c60af1 Compare May 15, 2026 21:34

tianyu-l reviewed May 16, 2026

felipemello1 changed the title ~~[qwen3_5] evovle qwen3_vl to qwen3_5~~ [qwen3_5] evolve qwen3_vl to qwen3_5 May 16, 2026

tianyu-l mentioned this pull request May 18, 2026

[Qwen3.5 MoE] Add hybrid decoder model (GatedDeltaNet + full attention + MoE) #2545

Open

2 tasks

shuhuayu force-pushed the modeldev branch from 6c60af1 to 6fd77bf Compare June 3, 2026 23:46

shuhuayu commented Jun 3, 2026

shuhuayu force-pushed the modeldev branch 2 times, most recently from a0f6aed to d16d9e8 Compare June 4, 2026 00:19

tianyu-l reviewed Jun 4, 2026

shuhuayu force-pushed the modeldev branch 3 times, most recently from 2adc33e to 2c48e0d Compare June 5, 2026 05:19

tianyu-l reviewed Jun 5, 2026

shuhuayu force-pushed the modeldev branch 2 times, most recently from 6ca2b22 to 9b68f71 Compare June 9, 2026 09:23

shuhuayu force-pushed the modeldev branch 2 times, most recently from 8cc31e7 to 72c923e Compare June 9, 2026 17:03

tianyu-l reviewed Jun 10, 2026

shuhuayu added 6 commits June 9, 2026 20:58

evovle qwen3_vl to qwen3_5

ab174be

rebase refactor

03912f2

rewrite sharedexperts and use local map

541c1b6

refactor optimizer config

8c92617

refactors of shared experts, mrope

c08c4d6

refactor spmd types

355fed8

shuhuayu force-pushed the modeldev branch from 72c923e to 59bb8da Compare June 10, 2026 07:13

shuhuayu requested review from IvanKobzarev, SherlockNoMad, aditvenk, sanketpurandare and xmfan as code owners June 10, 2026 07:13

shuhuayu force-pushed the modeldev branch from 59bb8da to a59e7d6 Compare June 10, 2026 07:47

tianyu-l approved these changes Jun 10, 2026

shuhuayu force-pushed the modeldev branch from a59e7d6 to 1644365 Compare June 10, 2026 08:52

merge extra_inputs into extra_kwargs in trainer

b917c7f

shuhuayu force-pushed the modeldev branch from 1644365 to b917c7f Compare June 10, 2026 09:05

shuhuayu merged commit fd712e8 into pytorch:main Jun 10, 2026
25 of 26 checks passed


