[Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy testcase #1078

Open
adam-xiaoyao wants to merge 1 commit into PaddlePaddle:develop from adam-xiaoyao:fix_clamp

adam-xiaoyao (Contributor) commented May 30, 2026

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

Align the clamped SwiGLU computation order and data types with Megatron, and remove redundant test cases.

Does this change affect precision?

Yes. It only changes the precision of clamped SwiGLU.

PaddlePaddle-bot

This comment was marked as outdated.

risemeup1111 left a comment

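This review round found issues that need to be fixed first. The main one is that the fallback path of fused_swiglu_scale_backward returns a shape that is inconsistent with the CUDA path and with existing behavior. In addition, the condition that selects the MoE clamp branch needs to match the semantics of the new wrapper. Specific suggestions are in the inline comments.

Most routine CI checks have passed. Some build jobs are still running, and the approval check has not yet been satisfied.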

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment on lines +120 to +124
d_scale = paddle.sum(
    swiglu_val.cast(x.dtype) * out_grad.cast(scale_dtype),
    axis=-1,
    keepdim=True,
).cast(scale_dtype)
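Priority: P1

With keepdim hard-coded to True here, the CPU/XPU fallback returns [rows, 1] instead of the original [rows] when scale is 1-D, whereas the CUDA extension's FusedGradInferShape returns DScale with scale_shape. As a result, the same call fused_swiglu_scale_backward(x, scale, ..., clamp_value=...) yields a different d_scale shape depending on whether it takes the fallback path, which breaks existing 1-D scale callers and tests. Please decide whether to keep the last dimension based on the rank of scale, and apply the same handling to the non-clamp branch.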

Suggested change
- d_scale = paddle.sum(
-     swiglu_val.cast(x.dtype) * out_grad.cast(scale_dtype),
-     axis=-1,
-     keepdim=True,
- ).cast(scale_dtype)
+ d_scale = paddle.sum(
+     swiglu_val.cast(x.dtype) * out_grad.cast(scale_dtype),
+     axis=-1,
+     keepdim=scale.ndim > 1,
+ ).cast(scale_dtype)
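
The shape rule requested in this comment can be illustrated with a minimal sketch. It uses NumPy as a stand-in for Paddle (NumPy's `keepdims` plays the role of `paddle.sum`'s `keepdim`), and the helper name `d_scale_fallback` is hypothetical, not a function from the PR.

```python
import numpy as np


def d_scale_fallback(swiglu_val, out_grad, scale):
    """Hypothetical helper: reduce over the last axis and keep that axis
    only when `scale` itself has more than one dimension."""
    prod = swiglu_val.astype(np.float32) * out_grad.astype(np.float32)
    return prod.sum(axis=-1, keepdims=scale.ndim > 1).astype(scale.dtype)


rows, hidden = 4, 8
swiglu_val = np.ones((rows, hidden), dtype=np.float32)
out_grad = np.ones((rows, hidden), dtype=np.float32)

# A 1-D scale yields a 1-D gradient; a [rows, 1] scale keeps the trailing axis.
d_scale_1d = d_scale_fallback(swiglu_val, out_grad, np.ones((rows,), dtype=np.float32))
d_scale_2d = d_scale_fallback(swiglu_val, out_grad, np.ones((rows, 1), dtype=np.float32))
print(d_scale_1d.shape, d_scale_2d.shape)
```

With this rule the gradient always has the same shape as `scale`, which is the property the comment attributes to the CUDA infer-shape.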

Comment on lines +158 to +162
d_scale = paddle.sum(
    out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32),
    axis=-1,
    keepdim=True,
).cast(scale.dtype)
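Priority: P1

The non-clamp fallback was also changed to always use keepdim=True. For a 1-D scale this turns d_scale from [rows] into [rows, 1], which is inconsistent with the existing CPU fallback behavior, the 1-D scale tests in the repository, and the CUDA DScale infer-shape. Please keep the result aligned with the rank of scale so that the fallback path alone does not change the public return shape.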

Suggested change
- d_scale = paddle.sum(
-     out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32),
-     axis=-1,
-     keepdim=True,
- ).cast(scale.dtype)
+ d_scale = paddle.sum(
+     out_grad.cast(paddle.float32) * swiglu_val.cast(paddle.float32),
+     axis=-1,
+     keepdim=scale.ndim > 1,
+ ).cast(scale.dtype)


  if self.clamp_value is not None:
-     o2 = fused_swiglu_scale_clamp_forward(
+     o2 = fused_swiglu_scale_forward(
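Priority: P2

The new fused_swiglu_scale_forward enables clamping only when clamp_value > 0, but this code still selects the clamp branch with is not None. When activation_func_clamp_value is 0.0 or negative, the forward pass therefore goes through the wrapper and actually takes the non-clamp path, while the corresponding backward/FP8 branches still call the clamp kernel based on is not None. This can easily make forward and backward semantics inconsistent. Please unify the MoE-related branches to use the same positive-value check as the wrapper.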

Suggested change
- o2 = fused_swiglu_scale_forward(
+ if self.clamp_value is not None and self.clamp_value > 0:
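
One way to keep forward, backward, and FP8 branches consistent is to route every decision through a single predicate. The sketch below is only an illustration of that idea: `clamp_enabled` and `select_path` are hypothetical names, and the real wrapper in the PR may be structured differently.

```python
from typing import Optional


def clamp_enabled(clamp_value: Optional[float]) -> bool:
    """Hypothetical shared check: clamping is active only for a positive value."""
    return clamp_value is not None and clamp_value > 0


def select_path(clamp_value: Optional[float]) -> str:
    """Every call site (forward, backward, FP8) would ask the same question."""
    return "clamp" if clamp_enabled(clamp_value) else "non-clamp"


# None, zero, and negative values all fall back to the non-clamp path.
for value in (None, 0.0, -1.0, 7.0):
    print(value, "->", select_path(value))
```

Because the predicate lives in one place, a value such as 0.0 cannot take the non-clamp path in the forward pass and the clamp path in the backward pass.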

PaddlePaddle-bot

This comment was marked as outdated.


Paddle-CI-Bot commented May 30, 2026

PaddleFleet Log Analysis

Run #26691975498 · Attempt 1

Log analysis report

PR #1078 · [Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy testcase
branch: fix_clamp → develop | Affects precision:

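| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Unit test (single card) | fused_swiglu_scale_backward output shape mismatch | The ds returned by fused_swiglu_scale_backward is now 1-D [rows], but the tests expect 2-D [rows, 1]; apply unsqueeze(-1) in fused_swiglu_scale.py or in the CUDA kernel to keep the output shape consistent | Error code |
| Unit test (multi-card) | PP+MoE gradient values misaligned | The gradient MD5 of _layers.shared_layers.embed.embedding.embed_tokens.weight does not match the baseline, and both PP-MoE tests reproduce this; the PR modified fused_swiglu_scale.py / mlp.py / fp8_utils.py, which affects the gradient path; either update the test_gpt_pp_with_moe*.py baseline or fix the gradient computation | Error code |
| Integration test (H20, multi-card) | Precision change not approved (exit 6) | The PR description explicitly states that it affects precision, but the required reviewer approval is missing; ask one of XieYunshen/From00/risemeup1/tianlef (Group1), lugimzzz/zjjlivein/tianlef (Group2), or tianlef/swgu98 (Group3) to approve; all integration-test losses passed and only the approval is missing | Error code |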

Failed test cases:

# Unit test (single card)
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestFusedSwiGLUScaleBackwardClampCPU::test_basic_shapes_and_order
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestFusedSwiGLUScaleBackwardClampCPU::test_d_scale_numeric
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestDScaleAlignment::test_no_clamp_fp32
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestDScaleAlignment::test_with_clamp_fp32_and_bf16_scale
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestClampEdgeCases::test_zero_rows
tests/single_card_tests/ai_edited_test/fusions/test_ai_clampswiglu_align.py::TestClampLargeTensor::test_fused_swiglu_scale_backward_large_both

# Unit test (multi-card)
/paddle/tests/multi_card_tests/pipeline_parallel/test_gpt_pp_with_moe_with_mtp.py::TestPP::test_pp
/paddle/tests/multi_card_tests/pipeline_parallel/test_gpt_pp_with_moe.py::TestPP::test_pp

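Root cause analysis:

The PR modified fused_swiglu_scale.py and the CUDA kernel (fuse_swiglu_scale.cu) to align the clamp order and data types with Megatron. This changed the shape of ds returned by fused_swiglu_scale_backward from [rows, 1] to 1-D [rows], so the expected shape asserted by the new test test_ai_clampswiglu_align.py no longer matches the actual kernel output. The same change also affected the gradient values of the embed token weight in the PP+MoE scenario (the baseline MD5 is no longer valid). All integration-test losses passed; the integration pipeline failed because the PR self-reports a precision change but has not yet been approved by a precision reviewer (exit code 6).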


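Suggested fixes:

  1. Single-card shape error: check the backward interface in src/paddlefleet/fusions/fused_swiglu_scale.py and confirm whether ds should be passed through unsqueeze(-1) to keep [rows, 1], or change every self.assertEqual(ds.shape, [rows, 1]) in test_ai_clampswiglu_align.py to expect [rows]. Which side to change depends on the interface contract; the recommendation is to align with Megatron first and then treat the test side as authoritative.

  2. Multi-card gradient baseline invalidated: the PR changes the numerical path of the SwiGLU backward pass, so the gradient MD5 of embed_tokens.weight has changed. After confirming that the new values are correct, rerun test_gpt_pp_with_moe.py / test_gpt_pp_with_moe_with_mtp.py and update the corresponding MD5 values in the baseline dictionary.

  3. Precision approval: on PR [Bug fixes] Align clamped SwiGLU computation order and data types with Megatron, and reduce redundancy testcase #1078, ask tianlef (a member of Group1, Group2, and Group3) or another reviewer from the relevant group to submit an Approved review. The CI precision-approval check will then pass, with no code changes needed.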
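
To see why even a small, intentional numerical change invalidates an MD5 baseline, consider the sketch below. It is an assumption about how such a fingerprint could be computed (hashing the raw bytes of a gradient array); the report does not show how the PP+MoE tests actually derive their MD5 values, and `grad_md5` and the `baseline` dictionary are hypothetical.

```python
import hashlib

import numpy as np


def grad_md5(grad: np.ndarray) -> str:
    """Hypothetical fingerprint: MD5 over the raw bytes of a gradient array."""
    return hashlib.md5(np.ascontiguousarray(grad).tobytes()).hexdigest()


# Hypothetical baseline entry recorded from a previous run.
baseline = {"embed_tokens.weight": grad_md5(np.array([0.5, -0.25], dtype=np.float32))}

# Bit-identical gradients reproduce the digest.
same = grad_md5(np.array([0.5, -0.25], dtype=np.float32))

# A change far below any loss tolerance still produces a different digest.
changed = grad_md5(np.array([0.5, -0.25 + 2.0**-20], dtype=np.float32))

print(same == baseline["embed_tokens.weight"], changed == baseline["embed_tokens.weight"])
```

A byte-level digest has no tolerance, so a baseline of this kind has to be regenerated whenever the numerical path changes on purpose, once the new values have been checked.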


🔄 Updated automatically after each re-run

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-31 02:00:20

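📋 Review summary

PR overview: aligns the computation order and data types of clamped SwiGLU with Megatron, merges the redundant clamp/non-clamp code paths, and trims the test cases.
Scope of changes: fusions/, transformer/mlp.py, transformer/moe/fp8_utils.py, paddlefleet_ops/_extensions/fuse_swiglu_scale.cu
Impact tags: Fusions, MoE, OP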

Issues

No blocking issues found. The logic of the code changes is correct, and the precision-alignment goal is clear.

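Status of previous findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | The accumulation precision of the weights_grad CPU fallback differs from the CUDA kernel | ✅ Fixed |

F1 fix notes: on the CPU side, clamped_weighted_swiglu_back now computes weights_grad as clamped_swiglu(y, clamp_value) * g.cast(w_dtype), matching the CUDA kernel's formula sum(swiglu_val.cast(dtype) * d_out.cast(scale_dtype)). This removes the precision difference between the previous all-float32 CPU computation and the CUDA native-type multiplication.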
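
The kind of discrepancy described in F1 can be reproduced in miniature. The sketch below uses NumPy, with float16 standing in for the low-precision native type because NumPy has no bfloat16; both function names are hypothetical, and the values are chosen so that every intermediate result is exactly representable.

```python
import numpy as np


def d_scale_all_float32(swiglu_val, out_grad):
    """Upcast both operands to float32 before multiplying and summing."""
    return np.sum(swiglu_val.astype(np.float32) * out_grad.astype(np.float32), axis=-1)


def d_scale_native(swiglu_val, out_grad, dtype=np.float16):
    """Round both operands to the low-precision type first, then multiply and sum."""
    return np.sum(swiglu_val.astype(dtype) * out_grad.astype(dtype), axis=-1)


# 1 + 2**-12 is exact in float32 but rounds to 1.0 in float16.
swiglu_val = np.full((1, 4), 1.0 + 2.0**-12, dtype=np.float32)
out_grad = np.full((1, 4), 1024.0, dtype=np.float32)

full = d_scale_all_float32(swiglu_val, out_grad)   # 4 * 1024.25
native = d_scale_native(swiglu_val, out_grad)      # 4 * 1024.0
print(float(full[0]), float(native[0]))
```

The two paths disagree by a full unit here, which is why a fallback and a kernel only match bit for bit if they cast at the same points in the formula.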

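📝 PR convention check

✓ The title format is compliant (the [Bug fixes] tag matches the diff content), and the description is structurally complete (PR Category / PR Types / Description / precision-change statement).

Overall assessment

By unifying the clamp/non-clamp code paths (removing ClampedWeightedSwiGLUFunction and merging clamped_weighted_bias_swiglu_impl into weighted_bias_swiglu_impl), this PR significantly reduces maintenance complexity. The d_scale computation in both the CUDA kernel and the Python fallback is aligned to the formula sum(swiglu_val.cast(dtype) * d_out.cast(scale_dtype)), bit-exact with the Megatron reference implementation. Code quality is good and the PR is ready to merge.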

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@d007cc7). Learn more about missing BASE report.

Additional details and impacted files

@@             Coverage Diff             @@
##             develop     #1078   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         4           
  Lines              ?        99           
  Branches           ?        12           
===========================================
  Hits               ?        99           
  Misses             ?         0           
  Partials           ?         0           
| Flag | Coverage Δ |
| --- | --- |
| coverage_combine | 100.00% <100.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/paddlefleet/fusions/fused_bias_swiglu.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/fusions/fused_swiglu_scale.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/transformer/mlp.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/transformer/moe/fp8_utils.py | 100.00% <100.00%> (ø) |