[Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy testcase by adam-xiaoyao · Pull Request #1078 · PaddlePaddle/PaddleFleet

adam-xiaoyao · 2026-05-30T15:28:34Z

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

Align the clamped SwiGLU computation order and data types with Megatron, and remove redundant test cases.

Does this change affect precision?

Yes. It only changes the precision of clamped SwiGLU.

risemeup1111

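This review round found issues that need to be fixed first. They mainly concern the fallback path of fused_swiglu_scale_backward, whose return shape is inconsistent with CUDA and with the existing behavior. In addition, the condition that selects the MoE clamp branch needs to be unified with the semantics of the new wrapper. Specific suggestions are in the inline comments.

Most routine CI checks have passed; some build jobs are still running, and the approval check is not yet satisfied.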

Powered by Nyanpasu with gpt-5.5 xhigh; please check the suggestions carefully.

risemeup1111 · 2026-05-30T15:46:54Z

+            d_scale = paddle.sum(
+                swiglu_val.cast(x.dtype) * out_grad.cast(scale_dtype),
+                axis=-1,
+                keepdim=True,
+            ).cast(scale_dtype)


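Priority: P1

With keepdim fixed to True here, the CPU/XPU fallback returns [rows, 1] when scale is one-dimensional, instead of the original [rows]; the CUDA extension's FusedGradInferShape also returns DScale according to scale_shape. As a result, the same call fused_swiglu_scale_backward(x, scale, ..., clamp_value=...) yields a different d_scale shape depending on whether the fallback is taken, which breaks existing 1-D scale callers and tests. Please decide whether to keep the last dimension based on the rank of scale, and do the same in the non-clamp branch.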

Suggested change

-            d_scale = paddle.sum(
-                swiglu_val.cast(x.dtype) * out_grad.cast(scale_dtype),
-                axis=-1,
-                keepdim=True,
-            ).cast(scale_dtype)
+            d_scale = paddle.sum(
+                swiglu_val.cast(x.dtype) * out_grad.cast(scale_dtype),
+                axis=-1,
+                keepdim=scale.ndim > 1,
+            ).cast(scale_dtype)
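
The shape rule requested in this comment can be illustrated without Paddle. The following is a minimal pure-Python sketch, not the PaddleFleet fallback itself: `reduce_last_axis` and `d_scale_fallback` are hypothetical helpers, nested lists stand in for tensors, and `scale_ndim` stands in for `scale.ndim`.

```python
def reduce_last_axis(rows, keepdim):
    """Sum each row, mimicking the keepdim flag of a sum over axis=-1."""
    sums = [sum(row) for row in rows]
    # keepdim=True keeps a trailing dimension of size 1: [rows, 1].
    return [[s] for s in sums] if keepdim else sums


def d_scale_fallback(swiglu_val, out_grad, scale_ndim):
    """Hypothetical stand-in for the fallback d_scale reduction.

    The last dimension is kept only when scale has more than one
    dimension, so the result follows the rank of scale.
    """
    products = [
        [s * g for s, g in zip(s_row, g_row)]
        for s_row, g_row in zip(swiglu_val, out_grad)
    ]
    return reduce_last_axis(products, keepdim=scale_ndim > 1)


swiglu_val = [[1.0, 2.0], [3.0, 4.0]]
out_grad = [[1.0, 1.0], [0.5, 0.5]]

d_scale_1d = d_scale_fallback(swiglu_val, out_grad, scale_ndim=1)
d_scale_2d = d_scale_fallback(swiglu_val, out_grad, scale_ndim=2)
```

With `scale_ndim=1` the result holds one value per row, and with `scale_ndim=2` each row keeps a trailing dimension of size 1, mirroring `keepdim=scale.ndim > 1`.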

risemeup1111 · 2026-05-30T15:47:00Z

+            d_scale = paddle.sum(
+                out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32),
+                axis=-1,
+                keepdim=True,
+            ).cast(scale.dtype)


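Priority: P1

The non-clamp fallback was also changed to always use keepdim=True. This turns d_scale for a 1-D scale from [rows] into [rows, 1], which is inconsistent with the existing CPU fallback behavior, the 1-D scale tests in the repository, and the CUDA DScale infer shape. Please keep it aligned with the rank of scale so that the fallback path alone does not change the public return shape.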

Suggested change

-            d_scale = paddle.sum(
-                out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32),
-                axis=-1,
-                keepdim=True,
-            ).cast(scale.dtype)
+            d_scale = paddle.sum(
+                out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32),
+                axis=-1,
+                keepdim=scale.ndim > 1,
+            ).cast(scale.dtype)

risemeup1111 · 2026-05-30T15:47:06Z


        if self.clamp_value is not None:
-            o2 = fused_swiglu_scale_clamp_forward(
+            o2 = fused_swiglu_scale_forward(


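Priority: P2

The new fused_swiglu_scale_forward enables clamping only when clamp_value > 0, but the clamp branch here is still selected with is not None. When activation_func_clamp_value is 0.0 or negative, forward goes through the wrapper and actually takes the non-clamp path, while the corresponding backward/FP8 branches still call the clamp kernel based on is not None, which can easily make forward and backward semantics inconsistent. Please unify the MoE-related branches to use the same positive-value check as the wrapper.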

Suggested change

-            o2 = fused_swiglu_scale_forward(
+        if self.clamp_value is not None and self.clamp_value > 0:
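
The difference between the two branch conditions discussed in this comment can be made concrete with a small sketch. `clamp_enabled` and `select_paths` are hypothetical helpers introduced only for illustration; they are not part of PaddleFleet, and the positive-value rule is taken from the reviewer's description of the wrapper.

```python
from typing import Optional


def clamp_enabled(clamp_value: Optional[float]) -> bool:
    """Hypothetical helper: clamp only for a strictly positive value."""
    return clamp_value is not None and clamp_value > 0


def select_paths(clamp_value: Optional[float]) -> dict:
    """Evaluate both predicates discussed in the review comment."""
    return {
        "is_not_none": clamp_value is not None,
        "positive": clamp_enabled(clamp_value),
    }


if __name__ == "__main__":
    # 0.0 and negative values are where the two predicates disagree.
    for value in (None, 0.0, -1.0, 7.0):
        print(value, select_paths(value))
```

Routing every forward, backward, and FP8 branch through one shared predicate of this kind would keep the paths from disagreeing for `0.0` or negative values.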

Paddle-CI-Bot · 2026-05-30T16:39:13Z

PaddleFleet Log Analysis

Run #26691975498 · Attempt 1

Log analysis report

PR #1078 · [Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy testcase
branch: fix_clamp → develop | Precision change: yes

Pipeline	Issue	Suggested fix	Log snippet
Unit test (single card)	`fused_swiglu_scale_backward` output shape mismatch	The `ds` returned by `fused_swiglu_scale_backward` is now 1-D `[rows]`, while the tests expect 2-D `[rows, 1]`; apply unsqueeze(-1) in `fused_swiglu_scale.py` or the CUDA kernel to keep the output shape consistent	Error code
Unit test (multi-card)	PP+MoE gradient values misaligned	The gradient MD5 of `_layers.shared_layers.embed.embedding.embed_tokens.weight` does not match the baseline, reproduced in both PP-MoE tests; the PR modifies `fused_swiglu_scale.py` / `mlp.py` / `fp8_utils.py`, which affects the gradient path; either update the `test_gpt_pp_with_moe*.py` baseline or fix the gradient computation	Error code
Integration test (H20, multi-card)	Precision change not approved (exit 6)	The PR description explicitly states that precision changes, but the required reviewer approval is missing; ask one of `XieYunshen`/`From00`/`risemeup1`/`tianlef` (Group1), `lugimzzz`/`zjjlivein`/`tianlef` (Group2), or `tianlef`/`swgu98` (Group3) to approve. All integration test losses pass; only the approval is missing	Error code

Failed test cases:

# Unit test (single card)
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestFusedSwiGLUScaleBackwardClampCPU::test_basic_shapes_and_order
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestFusedSwiGLUScaleBackwardClampCPU::test_d_scale_numeric
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestDScaleAlignment::test_no_clamp_fp32
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestDScaleAlignment::test_with_clamp_fp32_and_bf16_scale
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestClampEdgeCases::test_zero_rows
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestClampLargeTensor::test_fused_swiglu_scale_backward_large_both

# Unit test (multi-card)
/paddle/tests/multi_card_tests/pipeline_parallel/test_gpt_pp_with_moe_with_mtp.py::TestPP::test_pp
/paddle/tests/multi_card_tests/pipeline_parallel/test_gpt_pp_with_moe.py::TestPP::test_pp

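Root cause analysis:

The PR modifies fused_swiglu_scale.py and the CUDA kernel (fuse_swiglu_scale.cu) to align the clamp order and data types with Megatron. As a result, the shape of ds returned by fused_swiglu_scale_backward changed from [rows, 1] to 1-D [rows], and the expected shape asserted by the newly added test_ai_clampswiglu_align.py no longer matches the actual kernel output. The same change also affects the gradient values of the embed token weight in the PP+MoE scenario (the baseline MD5 is invalidated). All integration test losses pass; the integration pipeline fails because the PR self-reports a precision change that has not yet been approved by a precision reviewer (exit code 6).

Suggested fixes:

Single-card shape error: check the backward interface in src/paddlefleet/fusions/fused_swiglu_scale.py and confirm whether ds should go through unsqueeze(-1) to keep [rows, 1], or change every self.assertEqual(ds.shape, [rows, 1]) in test_ai_clampswiglu_align.py to [rows]. Which side to change depends on the interface contract; the recommendation is to align with Megatron first and then treat the test side as authoritative.
Multi-card gradient baseline invalidated: the PR changes the numerical path of the SwiGLU backward, so the gradient MD5 of embed_tokens.weight has changed. After confirming that the new values are correct, rerun test_gpt_pp_with_moe.py / test_gpt_pp_with_moe_with_mtp.py and update the corresponding MD5 values in the baseline dictionary.
Precision approval: ask tianlef (a member of Group1, Group2, and Group3) or another reviewer from the relevant group to submit an Approved review on PR #1078. The CI precision-approval check will then pass; no code change is needed.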
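
To see why any precision change invalidates an MD5-based gradient baseline, consider the sketch below. It is an illustration under stated assumptions, not the code of the PP-MoE tests: `grad_md5` and `next_float32` are hypothetical helpers, and the gradient is assumed to be hashed from its raw float32 bytes.

```python
import hashlib
import struct


def grad_md5(values):
    """Hypothetical helper: MD5 of the float32 bytes of a gradient."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return hashlib.md5(payload).hexdigest()


def next_float32(x):
    """Return the next representable float32 above a positive finite x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


baseline = [0.125, -0.5, 0.75]
# Perturb one element by a single float32 unit in the last place.
perturbed = [0.125, -0.5, next_float32(0.75)]

same = grad_md5(baseline) == grad_md5(list(baseline))
changed = grad_md5(baseline) != grad_md5(perturbed)
```

A digest comparison of this kind has no tolerance: a one-bit difference in a single element changes the hash, so a baseline of this form has to be regenerated once the new values are confirmed correct.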

🔄 Updated automatically after each re-run

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-31 02:00:20

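📋 Review summary

PR overview: aligns the computation order and data types of clamped SwiGLU with Megatron, merges the redundant clamp/non-clamp code paths, and trims the test cases.
Scope of changes: fusions/, transformer/mlp.py, transformer/moe/fp8_utils.py, paddlefleet_ops/_extensions/fuse_swiglu_scale.cu
Impact tags: Fusions MoE OP

Issues

No blocking issues found. The change logic is correct, and the precision-alignment goal is clear.

Status of previous findings

Finding	Issue	Status
F1	The accumulation precision of the `weights_grad` CPU fallback differs from the CUDA kernel	✅ Fixed

F1 fix notes: on the CPU side, clamped_weighted_swiglu_back now computes weights_grad as clamped_swiglu(y, clamp_value) * g.cast(w_dtype), matching the CUDA kernel's formula sum(swiglu_val.cast(dtype) * d_out.cast(scale_dtype)). This removes the earlier precision difference between the all-float32 CPU path and the CUDA native-type multiplication.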
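
The kind of discrepancy described in F1 can be reproduced in miniature. The sketch below is purely illustrative and does not reproduce the actual kernel or fallback: `to_bf16`, `sum_wide_products`, and `sum_native_products` are hypothetical helpers, bfloat16 is assumed as the native type, and Python floats stand in for the wider float32 path.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_wide_products(swiglu_row, grad_row):
    """Keep every product in the wider precision before summing."""
    return sum(s * g for s, g in zip(swiglu_row, grad_row))


def sum_native_products(swiglu_row, grad_row):
    """Round every product to bfloat16 before summing."""
    return sum(to_bf16(s * g) for s, g in zip(swiglu_row, grad_row))


row = [1.0078125] * 4  # 1 + 2**-7, exactly representable in bfloat16
wide = sum_wide_products(row, row)
native = sum_native_products(row, row)
```

Here `wide` and `native` differ even though the inputs are identical, which is why both code paths need to multiply in the same type before their results can be compared exactly.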

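📝 PR convention check

✓ The title format is compliant (the [Bug fixes] tag matches the diff content), and the description structure is complete (PR Category / PR Types / Description / precision-change statement).

Overall assessment

By unifying the clamp/non-clamp code paths (removing ClampedWeightedSwiGLUFunction and merging clamped_weighted_bias_swiglu_impl into weighted_bias_swiglu_impl), this PR significantly reduces maintenance complexity. The d_scale computation in both the CUDA kernel and the Python fallback is aligned to the formula sum(swiglu_val.cast(dtype) * d_out.cast(scale_dtype)), bit-exact with the Megatron reference implementation. Code quality is good; the PR can be merged.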

codecov-commenter · 2026-05-30T18:47:05Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d007cc7). Learn more about missing BASE report.

Additional details and impacted files

@@             Coverage Diff             @@
##             develop     #1078   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         4           
  Lines              ?        99           
  Branches           ?        12           
===========================================
  Hits               ?        99           
  Misses             ?         0           
  Partials           ?         0

Flag	Coverage Δ
coverage_combine	`100.00% <100.00%> (?)`

Files with missing lines	Coverage Δ
src/paddlefleet/fusions/fused_bias_swiglu.py	`100.00% <100.00%> (ø)`
src/paddlefleet/fusions/fused_swiglu_scale.py	`100.00% <100.00%> (ø)`
src/paddlefleet/transformer/mlp.py	`100.00% <100.00%> (ø)`
src/paddlefleet/transformer/moe/fp8_utils.py	`100.00% <100.00%> (ø)`

adam-xiaoyao force-pushed the fix_clamp branch from c335d13 to 1c26d3a on May 30, 2026 15:33

risemeup1111 suggested changes May 30, 2026

adam-xiaoyao force-pushed the fix_clamp branch from 1c26d3a to f04880f on May 30, 2026 17:07

fix clamp swiglu

ed199ec

adam-xiaoyao force-pushed the fix_clamp branch from f04880f to ed199ec on May 30, 2026 17:42

PaddlePaddle-bot reviewed May 30, 2026

adam-xiaoyao wants to merge 1 commit into PaddlePaddle:develop from adam-xiaoyao:fix_clamp

5 participants
