[New features][Bug fixes] Add TileLang CSA compressed indexer for DSv4 and fix ColumnLinear stop_gradient backward#1052

Open
YJMSTR wants to merge 9 commits into PaddlePaddle:develop from YJMSTR:dsv4-tilelang-indexer-nondet

Conversation


@YJMSTR YJMSTR commented May 27, 2026

PR Category

Operator Mechanism

PR Types

New features, Bug fixes

Description

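  1. Bug fix (tensor_parallel/layers.py): fixes a contract violation in the ColumnLinearWithGradAccum PyLayer backward. When the forward input has input.stop_gradient=True, backward must return None in the corresponding position (required by the Paddle PyLayer contract). The forward now caches an input_stop_gradient flag, and backward uses it to skip the grad_input computation and the input-side communication (all_reduce / reduce_scatter).

  2. New feature (tilelang_ops/indexer/): adds fused TileLang operators for the DeepSeek V4 CSA compressed indexer:

    • csa_indexer_topk_fwd: streaming top-k forward that outputs indices and softmax probabilities
    • csa_indexer_bwd: fused backward that computes the IndexQ / Weights / IndexKComp gradients
    • csa_attn_target_reducesum: the attention target distribution for the indexer KL loss
    • A new TileLangCSAIndexerLoss PyLayer fuses fwd/bwd/loss into a single operator
  3. Integration (transformer/csa_attention.py): three TransformerConfig fields, dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward, control whether the TileLang path is enabled. The path is off by default and does not affect existing behavior.

  4. Simplification: removes attn_mask_startend_row_indices multi-document boundary support (currently unused in the DSv4 scenario).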
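
The backward contract in item 1 can be illustrated with a framework-free sketch. This is not the PR's code: `ToyColumnLinear` and `fake_all_reduce` are hypothetical names, scalars stand in for tensors, and a counter stands in for the tensor-parallel collective. It only shows the pattern of caching the flag in forward and returning None while skipping communication in backward.

```python
# Framework-free sketch of the pattern in item 1 (hypothetical names throughout).
class ToyColumnLinear:
    """Scalar y = x * w, with a counter standing in for TP communication."""

    def __init__(self):
        self.comm_calls = 0  # number of simulated input-side collectives

    def fake_all_reduce(self, value):
        # Stand-in for all_reduce / reduce_scatter; a real one sums across ranks.
        self.comm_calls += 1
        return value

    def forward(self, x, w, x_stop_gradient):
        # Cache the flag at forward time so backward can consult it later.
        self.saved = (x, w, x_stop_gradient)
        return x * w

    def backward(self, grad_out):
        x, w, x_stop_gradient = self.saved
        grad_w = grad_out * x
        if x_stop_gradient:
            # Detached input: return None in its slot and skip communication.
            return None, grad_w
        grad_x = self.fake_all_reduce(grad_out * w)
        return grad_x, grad_w
```

With a detached input the toy layer returns `(None, grad_w)` and never touches the simulated collective; with a differentiable input it computes `grad_x` and performs exactly one collective.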

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from 3ef98d2 to 8a5f69f on May 28, 2026 09:17
PaddlePaddle-bot

This comment was marked as outdated.

@YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from 6b15251 to 4584c34 on May 28, 2026 10:58
@Enigmatisms changed the title from "Dsv4 tilelang indexer nondet" to "[DeepSeekV4] Support fused indexer TileLang kernel" on May 28, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from 4584c34 to c23cd33 on May 28, 2026 11:10
PaddlePaddle-bot

This comment was marked as outdated.

@YJMSTR changed the title from "[DeepSeekV4] Support fused indexer TileLang kernel" to "[New features][Bug fixes] Add TileLang CSA compressed indexer for DSv4 and fix ColumnLinear stop_gradient backward" on May 28, 2026
@YJMSTR marked this pull request as ready for review on May 28, 2026 12:26

@risemeup1111 risemeup1111 left a comment


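First-round review complete. I found two issues that should be fixed first; details are in the line-level comments. CI currently passes, but one issue changes the forward semantics of sparse attention on the TileLang-compatible backend, and the other causes the packed/multi-document boundary mask to be silently ignored. I recommend fixing both before merging.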

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment thread src/paddlefleet/transformer/csa_attention.py Outdated
@@ -225,9 +225,6 @@ def forward(
attention_mask,

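Priority: P1

This change removes the forwarding of attn_mask_startend_row_indices, but DSv4HybridSelfAttention.forward() still silently accepts the parameter through **kwargs, and the upstream TransformerLayer keeps passing this boundary mask to self-attention in packed-sequence / multi-document scenarios. As a result, the caller believes document boundaries are still enforced, while CSA actually generates window/compressed indices over the whole sequence, so tokens may attend across documents. If this PR confirms that DSv4 no longer supports packed/multi-document boundaries, fail fast explicitly here; otherwise, restore the boundary mask in the CompressedSparseAttention index computation.

Suggested shape of the fix: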

if kwargs.get("attn_mask_startend_row_indices") is not None:
    raise NotImplementedError(
        "DSv4 Hybrid CSA no longer supports attn_mask_startend_row_indices; "
        "please disable packed/multi-document boundary masks or restore CSA boundary handling."
    )
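
To make the failure mode concrete, here is a minimal sketch contrasting a forward that swallows the argument with one that applies the check above. Both function names and the placeholder return value are hypothetical; only the `kwargs.get(...)` check mirrors the suggestion.

```python
# Hypothetical stand-ins for an attention forward; only the kwargs handling matters.
def forward_silent(hidden_states, **kwargs):
    # Unknown keyword arguments are swallowed, so the boundary mask is never used.
    return hidden_states


def forward_fail_fast(hidden_states, **kwargs):
    # Reject the unsupported argument instead of ignoring it.
    if kwargs.get("attn_mask_startend_row_indices") is not None:
        raise NotImplementedError(
            "attn_mask_startend_row_indices is not supported on this path"
        )
    return hidden_states
```

Calling `forward_silent(x, attn_mask_startend_row_indices=mask)` succeeds without using `mask`, which is the silent behavior the review flags; `forward_fail_fast` raises on the same call and still accepts calls that omit the argument or pass None.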

Comment thread src/paddlefleet/transformer/csa_attention.py Outdated
Comment thread src/paddlefleet/tilelang_ops/indexer/csa_indexer.py Outdated
Comment thread src/paddlefleet/tilelang_ops/indexer/csa_indexer.py Outdated
Comment thread src/paddlefleet/tilelang_ops/indexer/csa_indexer.py Outdated
Comment thread src/paddlefleet/tilelang_ops/indexer/csa_indexer_bwd.py Outdated
YJMSTR added 3 commits May 29, 2026 00:07
Add TileLang CSA indexer forward/backward and attention-target kernels, wire them into CSA loss, and preserve detached-input behavior in tensor-parallel linear backward.
Remove the test-local sys.path override and paddlefleet module cleanup so the tests use the installed package environment, including editable installs in CI.
Rename DSv4-scoped TileLang switches to CSA-scoped controls, make attention_paddle_compat drive both CSA indexer and sparse
  attention by default, and rename the sparse attention export to csa_sparse_attn.

Replace public TileLang CSA indexer asserts with explicit TypeError/ValueError checks and cover the new config
  and validation behavior in tests.
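
The motivation for replacing asserts is that Python removes `assert` statements when run with `-O`, so they cannot be relied on for input validation in a public API. A minimal sketch of the explicit style follows; `check_topk_args` and its parameter rules are assumptions for illustration, not the checks in csa_indexer.py.

```python
# Hypothetical validator illustrating explicit checks instead of assert.
def check_topk_args(index_topk, n_compressed):
    for name, value in (("index_topk", index_topk), ("n_compressed", n_compressed)):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
```

Wrong types raise TypeError and out-of-range values raise ValueError, and both keep working under `python -O`.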
@YJMSTR YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from dde2ed2 to 9c89548 Compare May 28, 2026 17:21

@risemeup1111 risemeup1111 left a comment


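I have re-reviewed the current head. The two earlier P1 line-level issues are still covered by their original threads in the current diff, so I am not opening new threads for them; CI currently passes. One additional non-line-level issue in the PR description needs to be brought up to date:

  • Priority: P3 (non-line-level): the PR description does not match the current config fields. Item 3 of the description still lists dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward, but the current code uses csa_tilelang_backend / csa_tilelang_enable_indexer / csa_tilelang_enable_sparse_attn, and dsv4_tilelang_enable_backward no longer exists. This could mislead users into configuring the TileLang path with invalid fields. Suggested wording for that item: three TransformerConfig fields, csa_tilelang_backend / csa_tilelang_enable_indexer / csa_tilelang_enable_sparse_attn, control whether the TileLang path is enabled. The path is off by default and does not affect existing behavior.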
Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.
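
One way to surface this kind of rename to users is to flag stale keys explicitly. The sketch below is an illustration only: the field names come from the review comment above, while `find_stale_fields` and the two sets are hypothetical and not part of TransformerConfig.

```python
# Field names taken from the review comment; the helper itself is hypothetical.
CURRENT_FIELDS = {
    "csa_tilelang_backend",
    "csa_tilelang_enable_indexer",
    "csa_tilelang_enable_sparse_attn",
}
STALE_FIELDS = {
    "dsv4_tilelang_backend",
    "dsv4_tilelang_enable_csa_indexer",
    "dsv4_tilelang_enable_backward",
}


def find_stale_fields(user_config):
    """Return the sorted stale field names present in a user config dict."""
    return sorted(key for key in user_config if key in STALE_FIELDS)
```

A config that still sets `dsv4_tilelang_backend` would be reported, while one using only the `csa_tilelang_*` names yields an empty list.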

YJMSTR added 4 commits May 29, 2026 01:41
 Keep dsa_indexer_use_sparse_loss=False scoped to the indexer loss by using the
 full compressed range only for TileLang KL loss computation. Trim the returned
 indices back to min(index_topk, n_compressed) before feeding main CSA sparse
 attention so Paddle and TileLang paths keep the same attention range.

 Add resolver coverage to guard the phase2 loss topk and attention topk split.
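
The split described in this commit message can be sketched on plain lists. This is an assumption-laden illustration, not the PR's resolver: `trim_for_attention` is a hypothetical name, and it assumes each query's indices are ordered best-first so that truncation keeps the top entries.

```python
# Hypothetical helper: the loss keeps the full range, attention gets a trimmed copy.
def trim_for_attention(loss_indices, index_topk, n_compressed):
    """Return a per-query copy of loss_indices cut to min(index_topk, n_compressed)."""
    attn_k = min(index_topk, n_compressed)
    return [row[:attn_k] for row in loss_indices]
```

The loss side keeps using `loss_indices` unchanged, while the main sparse attention receives the trimmed copy, so the attention range does not depend on the loss top-k.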
Attach TileLang CSA indexer loss gradients directly through the
attention output autoscaler. This avoids building TileLang loss
backward state during no-grad recompute forwards while keeping the
indexer loss top-k independent from the main sparse attention top-k.
PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

@Paddle-CI-Bot

Paddle-CI-Bot commented May 29, 2026

PaddleFleet Log Analysis

Run #26690554189 · Attempt 1

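Log analysis report

| Pipeline | Issue tag | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Integration test (A100) / Qwen vl moe | Qwen3VL MoE A100 timeout | Investigate the hang in the step 10 eval/save phase of qwen3vl_sft_moe_a100.yaml; shorten the CI timeout or optimize the eval path | Error code |

Failed test case:

Integration test (A100) / Qwen vl moe
  - qwen3vl_sft.sh moe a100 (config: qwen3vl_sft_moe_a100.yaml)
  - Training steps 1-9 of 10 completed normally; step 10 hung for about 8.5 minutes and was then killed by the 10-minute GitHub Actions timeout
  - ##[error]The action 'Qwen vl moe' has timed out after 10 minutes.

The following jobs all passed (they reported exit -6 / exit 250, but both the log check and the loss check passed):
GLM4.5 pt / sft / lora / dpo / dpo_lora, Qwen3 multi-card pt/sft/lora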

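Root cause analysis:

PR #1052 (dsv4-tilelang-indexer-nondet) adds the fused TileLang operators for the DSv4 CSA compressed indexer in transformer/csa_attention.py and tilelang_ops/indexer/, and adds three fields to transformer_config.py: dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward. It also fixes the ColumnLinearWithGradAccum backward contract issue in tensor_parallel/layers.py.

In the Qwen3VL MoE SFT test, configured with max_steps=10 and eval_steps=10, steps 1-9 completed normally (about 3-9 s per step), whereas step 10, which triggers eval and checkpoint save together, hung for about 8.5 minutes and was finally terminated by the 10-minute timeout. Given the repeated nvidia-smi timeout (3 s) warnings, the inference is that the tensor_parallel communication in the stop_gradient branch of the ColumnLinearWithGradAccum backward and the NCCL allreduce in the eval phase hit a communication deadlock or a CUDA stream synchronization hang on A100. The behavior did not reproduce on H20 (Integration test H20 multi-card passed).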


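Suggested fixes:

  1. Reproduce locally: on an A100 machine, run bash PaddleFormers/tests/integration_test/qwen3vl_sft.sh moe a100 with NCCL_DEBUG=WARN and check whether step 10 hangs in an NCCL allreduce / barrier.

  2. Investigate the ColumnLinearWithGradAccum backward: on the input.stop_gradient=True path, confirm whether the TP communication group still needs an explicit barrier after all_reduce / reduce_scatter is skipped, and check whether the eval phase races with communication streams left over from backward.

  3. Verify the eval/save path: decouple eval_steps from max_steps (for example, set eval_steps to null to skip eval) and test each in isolation to determine whether the hang is in eval, in save, or in an allreduce between the two.

  4. A100-specific path: compare the H20 and A100 configs (qwen3vl_sft_moe.yaml vs qwen3vl_sft_moe_a100.yaml). If the A100 config enables moe_token_dispatcher_type: deepep or a larger sharding_parallel_size, confirm whether the deepep dispatch at step 10 can hang in eval mode.

  5. Short-term workaround: in the CI yaml, raise the timeout of the Qwen vl moe step from 10 min to 20 min so that CI does not kill the job prematurely, and keep investigating the root cause.


🔄 Updated automatically after each re-run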


@PaddlePaddle-bot PaddlePaddle-bot left a comment


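🤖 Paddle-CI-Agent | pr_review | 2026-05-31 01:24:29

📋 Review summary

PR overview: fixes the PyLayer contract issue in ColumnLinearWithGradAccum for detached inputs, and adds fused TileLang operators for the CSA compressed indexer
Scope of changes: tensor_parallel/layers.py, tilelang_ops/indexer/, transformer/csa_attention.py, transformer/transformer_config.py
Impact tags: TP, OP, Config

Issues

| Level | File | Summary |
| --- | --- | --- |
| ❓ Question | PR Description | The Config field names in the PR description do not match the actual code (3 of 3 names differ) |

📝 Description-code consistency: item 3 of the PR description states that the TileLang path is controlled by the three fields dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward, but the field names in the actual code are csa_tilelang_backend / csa_tilelang_enable_indexer / csa_tilelang_enable_sparse_attn, so none of them match. Please update that paragraph of the PR description to the actual field names to avoid confusion when the config is used later.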

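Status of earlier findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | tilelang_ops/__init__.py is missing a trailing newline and the copyright header | ✅ Fixed |
| F2 | csa_indexer.py is missing the Apache License copyright header | ✅ Fixed |
| F3 | csa_indexer_bwd.py uses assert for runtime validation | ⚠️ Still present |
| F4 | sys.path.insert hack in the test files | ⚠️ Still present |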

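📝 PR convention check

✓ The title format [New features][Bug fixes] ... follows the convention; the three required fields PR Category / PR Types / Description are all filled in.

Overall assessment

The bug fix logic is correct and has a corresponding regression test. The new TileLang CSA indexer module is clearly structured and well validated, and its config fields have sensible defaults and checks. Please also correct the field names in the PR description so that the documentation matches the code.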

@codecov-commenter

Codecov Report

❌ Patch coverage is 82.09719% with 140 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@085444e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...addlefleet/tilelang_ops/indexer/csa_indexer_bwd.py | 77.41% | 14 Missing and 21 partials ⚠️ |
| src/paddlefleet/transformer/csa_attention.py | 80.48% | 18 Missing and 14 partials ⚠️ |
| ...addlefleet/tilelang_ops/indexer/csa_indexer_fwd.py | 84.21% | 11 Missing and 16 partials ⚠️ |
| ...addlefleet/tilelang_ops/indexer/csa_attn_target.py | 84.24% | 11 Missing and 12 partials ⚠️ |
| ...rc/paddlefleet/tilelang_ops/indexer/csa_indexer.py | 84.61% | 9 Missing and 9 partials ⚠️ |
| src/paddlefleet/tensor_parallel/layers.py | 78.57% | 1 Missing and 2 partials ⚠️ |
| src/paddlefleet/tilelang_ops/__init__.py | 81.81% | 1 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (82.09%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #1052   +/-   ##
==========================================
  Coverage           ?   82.05%           
==========================================
  Files              ?       11           
  Lines              ?      808           
  Branches           ?      167           
==========================================
  Hits               ?      663           
  Misses             ?       67           
  Partials           ?       78           
| Flag | Coverage Δ |
| --- | --- |
| coverage_combine | 82.05% <82.09%> (?) |
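
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the patch line total is not stated in the report, so the value derived for it is an inference from the quoted percentage, not a reported figure.

```python
# (missing, partials) per file, copied from the "Files with missing lines" table.
uncovered_by_file = {
    "...addlefleet/tilelang_ops/indexer/csa_indexer_bwd.py": (14, 21),
    "src/paddlefleet/transformer/csa_attention.py": (18, 14),
    "...addlefleet/tilelang_ops/indexer/csa_indexer_fwd.py": (11, 16),
    "...addlefleet/tilelang_ops/indexer/csa_attn_target.py": (11, 12),
    "...rc/paddlefleet/tilelang_ops/indexer/csa_indexer.py": (9, 9),
    "src/paddlefleet/tensor_parallel/layers.py": (1, 2),
    "src/paddlefleet/tilelang_ops/__init__.py": (1, 1),
}
uncovered_lines = sum(m + p for m, p in uncovered_by_file.values())

# Project-level numbers from the Coverage Diff block.
hits, misses, partials = 663, 67, 78
total_lines = hits + misses + partials
project_coverage = round(100 * hits / total_lines, 2)

# Inference only: patch size implied by "82.09719% with 140 lines missing".
implied_patch_lines = round(uncovered_lines / (1 - 82.09719 / 100))
```

The per-file counts sum to the 140 uncovered lines in the headline, hits plus misses plus partials give the 808 lines in the diff block, and 663 / 808 rounds to the reported 82.05%.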

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...paddlefleet/tilelang_ops/compressed_sparse_attn.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/tilelang_ops/indexer/__init__.py | 100.00% <100.00%> (ø) |
| src/paddlefleet/transformer/transformer_config.py | 84.61% <ø> (ø) |
| src/paddlefleet/tilelang_ops/__init__.py | 81.81% <81.81%> (ø) |
| src/paddlefleet/tensor_parallel/layers.py | 78.57% <78.57%> (ø) |
| ...rc/paddlefleet/tilelang_ops/indexer/csa_indexer.py | 84.61% <84.61%> (ø) |
| ...addlefleet/tilelang_ops/indexer/csa_attn_target.py | 84.24% <84.24%> (ø) |
| ...addlefleet/tilelang_ops/indexer/csa_indexer_fwd.py | 84.21% <84.21%> (ø) |
| src/paddlefleet/transformer/csa_attention.py | 79.76% <80.48%> (ø) |
| ...addlefleet/tilelang_ops/indexer/csa_indexer_bwd.py | 77.41% <77.41%> (ø) |

6 participants