[New features][Bug fixes] Add TileLang CSA compressed indexer for DSv4 and fix ColumnLinear stop_gradient backward by YJMSTR · Pull Request #1052 · PaddlePaddle/PaddleFleet

YJMSTR · 2026-05-27T10:26:27Z

PR Category

Operator Mechanism

PR Types

New features, Bug fixes

Description

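Bug fix (tensor_parallel/layers.py): Fixes a contract violation in the backward of the ColumnLinearWithGradAccum PyLayer. When the forward input has input.stop_gradient=True, backward must return None in the corresponding position (required by the Paddle PyLayer contract). The fix caches an input_stop_gradient flag in forward and, in backward, skips the grad_input computation and the input-side communication (all_reduce / reduce_scatter).
New feature (tilelang_ops/indexer/): Adds fused TileLang operators for the DeepSeek V4 CSA compressed indexer:
- csa_indexer_topk_fwd: streaming top-k forward that outputs indices and softmax probabilities
- csa_indexer_bwd: fused backward that computes the gradients for IndexQ / Weights / IndexKComp
- csa_attn_target_reducesum: the attention target distribution for the indexer KL loss
- Adds the TileLangCSAIndexerLoss PyLayer, which fuses fwd/bwd/loss into a single operator
Integration (transformer/csa_attention.py): The TileLang path is enabled through three TransformerConfig fields: dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward. It is off by default and does not affect existing behavior.
Simplification: Removes multi-document boundary support via attn_mask_startend_row_indices (currently unused in the DSv4 scenario).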
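
The backward contract from the bug-fix item can be illustrated without Paddle. The following is a minimal pure-Python sketch, not the PR's code: `FakeTensor`, `ToyColumnLinear`, and the `comm_calls` counter are hypothetical stand-ins, and the scalar math only mimics the shape of the forward/backward contract (cache the flag in forward, return None and skip the input-side collective in backward).

```python
from dataclasses import dataclass, field


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a framework tensor: a value plus a stop_gradient flag."""
    value: float
    stop_gradient: bool = False


@dataclass
class ToyColumnLinear:
    """Toy y = w * x layer that mimics the forward/backward contract only."""
    weight: float
    comm_calls: int = 0  # counts simulated input-side collectives
    _saved: dict = field(default_factory=dict)

    def forward(self, x: FakeTensor) -> float:
        # Cache the flag in forward so backward knows whether grad_input is wanted.
        self._saved = {"x": x.value, "input_stop_gradient": x.stop_gradient}
        return self.weight * x.value

    def _all_reduce(self, grad: float) -> float:
        # Placeholder for all_reduce / reduce_scatter on the input gradient.
        self.comm_calls += 1
        return grad

    def backward(self, grad_out: float):
        grad_weight = grad_out * self._saved["x"]
        if self._saved["input_stop_gradient"]:
            # Detached input: return None in its slot and skip the collective.
            return None, grad_weight
        grad_input = self._all_reduce(grad_out * self.weight)
        return grad_input, grad_weight
```

With a detached input the toy layer returns `(None, grad_weight)` and performs no simulated communication; with a regular input it returns both gradients and performs one.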

risemeup1111

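Completed the first review round and found two issues that should be fixed first; details are in the line-level comments. CI currently passes, but these two issues would, respectively, change the forward semantics of sparse attention on the TileLang compat backend and cause packed/multi-document boundary masks to be silently ignored. I recommend fixing them before merging.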

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

risemeup1111 · 2026-05-28T13:17:14Z

@@ -225,9 +225,6 @@ def forward(
            attention_mask,


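Priority: P1

This removes the pass-through of attn_mask_startend_row_indices, but DSv4HybridSelfAttention.forward() still silently accepts the parameter through **kwargs, and the upstream TransformerLayer keeps passing this boundary mask to self-attention in packed-sequence / multi-document scenarios. As a result, callers assume document boundaries are still enforced, while CSA actually builds window/compressed indices over the whole sequence and may attend across documents. If this PR confirms that DSv4 packed/multi-document boundaries are no longer supported, fail fast here explicitly; otherwise, restore the boundary mask in the index computation of CompressedSparseAttention.

Suggested fix:

    if kwargs.get("attn_mask_startend_row_indices") is not None:
        raise NotImplementedError(
            "DSv4 Hybrid CSA no longer supports attn_mask_startend_row_indices; "
            "please disable packed/multi-document boundary masks or restore CSA boundary handling."
        )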

Add TileLang CSA indexer forward/backward and attention-target kernels, wire them into CSA loss, and preserve detached-input behavior in tensor-parallel linear backward.

Remove the test-local sys.path override and paddlefleet module cleanup so the tests use the installed package environment, including editable installs in CI.

Rename DSv4-scoped TileLang switches to CSA-scoped controls, make attention_paddle_compat drive both CSA indexer and sparse attention by default, and rename the sparse attention export to csa_sparse_attn. Replace public TileLang CSA indexer asserts with explicit TypeError/ValueError checks and cover the new config and validation behavior in tests.
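
The switch from asserts to explicit exceptions matters because `assert` statements are removed when Python runs with `-O`, so they cannot be relied on for input validation in a public API. A minimal sketch of the pattern follows; `validate_topk_args` and its two parameters are hypothetical and only illustrate the TypeError/ValueError split, not the actual checks in the PR.

```python
def validate_topk_args(index_topk, n_compressed):
    """Hypothetical argument check; the real indexer's parameters may differ."""
    for name, value in (("index_topk", index_topk), ("n_compressed", n_compressed)):
        # Wrong type -> TypeError. bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
        # Right type but out of range -> ValueError.
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    # The effective top-k can never exceed the number of compressed entries.
    return min(index_topk, n_compressed)
```

Unlike an assert, these checks still run under `python -O` and give the caller an exception type it can handle.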

risemeup1111

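Re-reviewed the current head. The two earlier P1 line-level issues are still covered by their original threads in the current diff, so no new threads are opened this time; CI currently passes. One non-line-level issue with the PR description also needs to be updated:

Priority: P3 Non-line-level: the PR description is inconsistent with the current config fields. Point 3 of the description still lists dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward, but the code now uses csa_tilelang_backend / csa_tilelang_enable_indexer / csa_tilelang_enable_sparse_attn, and dsv4_tilelang_enable_backward no longer exists. This would mislead users into configuring the TileLang path with invalid fields. Suggested wording for that paragraph: The TileLang path is enabled through three TransformerConfig fields: csa_tilelang_backend / csa_tilelang_enable_indexer / csa_tilelang_enable_sparse_attn. It is off by default and does not affect existing behavior.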
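
One way to catch the stale names this comment warns about is a small check over a user config. The sketch below is hypothetical and not part of PaddleFleet: the field names come from the comment above, but the old-to-new pairing is inferred from the names alone, and `find_stale_fields` is an illustrative helper.

```python
# Old names from the PR description mapped to their apparent replacements.
# None marks a field the review says was dropped without a replacement.
# The pairing is an assumption based on name similarity.
RENAMED_FIELDS = {
    "dsv4_tilelang_backend": "csa_tilelang_backend",
    "dsv4_tilelang_enable_csa_indexer": "csa_tilelang_enable_indexer",
    "dsv4_tilelang_enable_backward": None,
}


def find_stale_fields(config: dict) -> list:
    """Return one message per outdated field name found in a config dict."""
    messages = []
    for old, new in RENAMED_FIELDS.items():
        if old in config:
            target = new if new is not None else "(removed)"
            messages.append(f"{old} -> {target}")
    return messages
```

For example, `find_stale_fields({"dsv4_tilelang_enable_backward": True})` reports that the field has been removed rather than silently ignoring it.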

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

Keep dsa_indexer_use_sparse_loss=False scoped to the indexer loss by using the full compressed range only for TileLang KL loss computation. Trim the returned indices back to min(index_topk, n_compressed) before feeding main CSA sparse attention so Paddle and TileLang paths keep the same attention range. Add resolver coverage to guard the phase2 loss topk and attention topk split.
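
The split between the loss top-k and the attention top-k can be sketched on plain Python lists. This is a simplification under stated assumptions: the real code works on tensors inside TileLang kernels, `split_topk` is a hypothetical helper, and the behavior when the sparse loss is enabled (the loss reusing the attention range) is assumed rather than taken from the PR.

```python
import heapq


def split_topk(scores, index_topk, use_sparse_loss):
    """Return (loss_indices, attn_indices) for one query row of indexer scores."""
    n_compressed = len(scores)
    attn_k = min(index_topk, n_compressed)
    # With the sparse loss disabled, the loss sees the full compressed range.
    loss_k = attn_k if use_sparse_loss else n_compressed
    # Indices sorted by descending score; ties are broken by the lower index.
    ranked = heapq.nsmallest(
        loss_k, range(n_compressed), key=lambda i: (-scores[i], i)
    )
    # The attention path always receives the trimmed prefix.
    return ranked, ranked[:attn_k]
```

Because the attention indices are always the first `min(index_topk, n_compressed)` entries of the ranking, the attention range does not depend on how wide the loss range is.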

Attach TileLang CSA indexer loss gradients directly through the attention output autoscaler. This avoids building TileLang loss backward state during no-grad recompute forwards while keeping the indexer loss top-k independent from the main sparse attention top-k.

Paddle-CI-Bot · 2026-05-29T15:52:29Z

PaddleFleet Log Analysis

Run #26690554189 · Attempt 1

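Log Analysis Report

Pipeline	Issue label	Suggested fix	Log excerpt
Integration test (A100) / `Qwen vl moe`	Qwen3VL MoE A100 timeout	Investigate the hang in the step 10 eval/save phase of `qwen3vl_sft_moe_a100.yaml`; shorten the CI timeout or optimize the eval path	Error output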

Failed test cases:

Integration test (A100) / Qwen vl moe
  - qwen3vl_sft.sh moe a100 (config: qwen3vl_sft_moe_a100.yaml)
  - Training completed 9 of 10 steps normally; step 10 hung for about 8.5 minutes and was then killed by the 10-minute GitHub Actions timeout
  - ##[error]The action 'Qwen vl moe' has timed out after 10 minutes.

The following jobs all passed (they reported exit -6 / exit 250, but both the log check and the loss check passed):
GLM4.5 pt / sft / lora / dpo / dpo_lora, Qwen3 multi-card pt/sft/lora

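Root cause analysis:

PR #1052 (dsv4-tilelang-indexer-nondet) adds fused TileLang operators for the DSv4 CSA compressed indexer in transformer/csa_attention.py and tilelang_ops/indexer/, and adds three fields to transformer_config.py: dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward. It also fixes the ColumnLinearWithGradAccum backward contract issue in tensor_parallel/layers.py.

With max_steps=10 and eval_steps=10, the Qwen3VL MoE SFT test completed steps 1-9 normally (~3-9 s per step), while step 10 (which triggers both eval and checkpoint save) hung for about 8.5 minutes and was finally terminated by the 10-minute timeout. Given the repeated nvidia-smi timeout (3 s) warnings, the inference is that the tensor_parallel communication in the stop_gradient branch of ColumnLinearWithGradAccum backward and the NCCL allreduce in the eval phase hit a communication deadlock or a CUDA stream synchronization hang on A100. The behavior did not reproduce on H20 (Integration test H20 multi-card passed).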

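Suggested fixes:

Local reproduction: on an A100 machine, run bash PaddleFormers/tests/integration_test/qwen3vl_sft.sh moe a100 with NCCL_DEBUG=WARN set, and check whether step 10 is stuck in an NCCL allreduce / barrier.
Investigate ColumnLinearWithGradAccum backward: confirm whether, on the input.stop_gradient=True path, the TP communication group still needs an explicit barrier after all_reduce / reduce_scatter is skipped; check whether the eval phase races with communication streams left over from backward.
Verify the eval/save path: decouple eval_steps from max_steps (for example, set eval_steps to null to skip eval) and test separately for the hang, to locate it in eval, in save, or in the allreduce between them.
A100-specific path: compare the H20 and A100 configs (qwen3vl_sft_moe.yaml vs qwen3vl_sft_moe_a100.yaml); if A100 enables moe_token_dispatcher_type: deepep or a larger sharding_parallel_size, confirm whether the deepep dispatch at step 10 can hang in eval mode.
Short-term workaround: raise the timeout of the Qwen vl moe step in the CI yaml from 10 min to 20 min so that CI stops killing the job prematurely, while the root cause investigation continues.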

_{🔄 Updated automatically after each re-run}

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-31 01:24:29

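📋 Review Summary

PR overview: Fixes the PyLayer contract issue in ColumnLinearWithGradAccum for detached inputs, and adds fused TileLang operators for the CSA compressed indexer
Change scope: tensor_parallel/layers.py, tilelang_ops/indexer/, transformer/csa_attention.py, transformer/transformer_config.py
Impact tags: TP OP Config

Issues

Severity	File	Summary
❓ Question	PR Description	Config field names in the PR description do not match the actual code (3/3 names differ)

📝 Description-code consistency: Point 3 of the PR description states that the TileLang path is controlled by three fields, dsv4_tilelang_backend / dsv4_tilelang_enable_csa_indexer / dsv4_tilelang_enable_backward, but the field names in the actual code are csa_tilelang_backend / csa_tilelang_enable_indexer / csa_tilelang_enable_sparse_attn; none of them match. Update the corresponding paragraph of the PR description to the actual field names to avoid confusion when the config is used later.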

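Status of historical findings

Finding	Issue	Status
F1	`tilelang_ops/__init__.py` is missing a trailing newline and the copyright header	✅ Fixed
F2	`csa_indexer.py` is missing the Apache License copyright header	✅ Fixed
F3	`csa_indexer_bwd.py` uses assert for runtime validation	⚠️ Still present
F4	`sys.path.insert` hack in the test files	⚠️ Still present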

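📝 PR convention check

✓ The title format [New features][Bug fixes] ... follows the convention; the three required fields (PR Category / PR Types / Description) are all filled in.

Overall assessment

The bug fix logic is correct and has a matching regression test; the new TileLang CSA indexer module is clearly structured and well validated, and its config fields have sensible defaults and checks. The field names in the PR description should be corrected so that the documentation matches the code.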

codecov-commenter · 2026-05-30T17:40:25Z

Codecov Report

❌ Patch coverage is 82.09719% with 140 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@085444e). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...addlefleet/tilelang_ops/indexer/csa_indexer_bwd.py	77.41%	14 Missing and 21 partials ⚠️
src/paddlefleet/transformer/csa_attention.py	80.48%	18 Missing and 14 partials ⚠️
...addlefleet/tilelang_ops/indexer/csa_indexer_fwd.py	84.21%	11 Missing and 16 partials ⚠️
...addlefleet/tilelang_ops/indexer/csa_attn_target.py	84.24%	11 Missing and 12 partials ⚠️
...rc/paddlefleet/tilelang_ops/indexer/csa_indexer.py	84.61%	9 Missing and 9 partials ⚠️
src/paddlefleet/tensor_parallel/layers.py	78.57%	1 Missing and 2 partials ⚠️
src/paddlefleet/tilelang_ops/__init__.py	81.81%	1 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (82.09%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #1052   +/-   ##
==========================================
  Coverage           ?   82.05%           
==========================================
  Files              ?       11           
  Lines              ?      808           
  Branches           ?      167           
==========================================
  Hits               ?      663           
  Misses             ?       67           
  Partials           ?       78
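
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below only re-adds numbers copied from the tables in this comment; the shortened file names are used as dictionary keys for readability.

```python
# (missing, partials) per file, copied from the "Files with missing lines" table.
per_file = {
    "csa_indexer_bwd.py": (14, 21),
    "csa_attention.py": (18, 14),
    "csa_indexer_fwd.py": (11, 16),
    "csa_attn_target.py": (11, 12),
    "csa_indexer.py": (9, 9),
    "layers.py": (1, 2),
    "tilelang_ops/__init__.py": (1, 1),
}
# Lines in the patch that are missing or only partially covered.
uncovered = sum(missing + partial for missing, partial in per_file.values())

# Totals from the coverage diff table.
hits, misses, partials, lines = 663, 67, 78, 808
project_coverage = round(100 * hits / lines, 2)

print(uncovered, hits + misses + partials, project_coverage)
```

The per-file counts add up to the 140 uncovered patch lines quoted at the top of the report, and 663 hits out of 808 lines gives the 82.05% shown in the diff table.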

Flag	Coverage Δ
coverage_combine	`82.05% <82.09%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...paddlefleet/tilelang_ops/compressed_sparse_attn.py	`100.00% <100.00%> (ø)`
src/paddlefleet/tilelang_ops/indexer/__init__.py	`100.00% <100.00%> (ø)`
src/paddlefleet/transformer/transformer_config.py	`84.61% <ø> (ø)`
src/paddlefleet/tilelang_ops/__init__.py	`81.81% <81.81%> (ø)`
src/paddlefleet/tensor_parallel/layers.py	`78.57% <78.57%> (ø)`
...rc/paddlefleet/tilelang_ops/indexer/csa_indexer.py	`84.61% <84.61%> (ø)`
...addlefleet/tilelang_ops/indexer/csa_attn_target.py	`84.24% <84.24%> (ø)`
...addlefleet/tilelang_ops/indexer/csa_indexer_fwd.py	`84.21% <84.21%> (ø)`
src/paddlefleet/transformer/csa_attention.py	`79.76% <80.48%> (ø)`
...addlefleet/tilelang_ops/indexer/csa_indexer_bwd.py	`77.41% <77.41%> (ø)`

YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from 3ef98d2 to 8a5f69f Compare May 28, 2026 09:17

YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from 6b15251 to 4584c34 Compare May 28, 2026 10:58

Enigmatisms changed the title ~~Dsv4 tilelang indexer nondet~~ [DeepSeekV4] Support fused indexer TileLang kernel May 28, 2026

YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from 4584c34 to c23cd33 Compare May 28, 2026 11:10

YJMSTR changed the title ~~[DeepSeekV4] Support fused indexer TileLang kernel~~ [New features][Bug fixes] Add TileLang CSA compressed indexer for DSv4 and fix ColumnLinear stop_gradient backward May 28, 2026

YJMSTR marked this pull request as ready for review May 28, 2026 12:26

risemeup1111 suggested changes May 28, 2026

Enigmatisms reviewed May 28, 2026

YJMSTR added 3 commits May 29, 2026 00:07

add TileLang CSA indexer kernels

3fd303a

Add TileLang CSA indexer forward/backward and attention-target kernels, wire them into CSA loss, and preserve detached-input behavior in tensor-parallel linear backward.

test: remove local path injection from TileLang CSA indexer tests

4a5ea08

Remove the test-local sys.path override and paddlefleet module cleanup so the tests use the installed package environment, including editable installs in CI.

YJMSTR force-pushed the dsv4-tilelang-indexer-nondet branch from dde2ed2 to 9c89548 Compare May 28, 2026 17:21

risemeup1111 reviewed May 28, 2026

YJMSTR added 4 commits May 29, 2026 01:41

use fast kl softmax grad

9a1446b

cleanup code

d29855d

cleanup csa topk idx code

d024901

[Chore][Test] Add more tests for better coverage

8457456

PaddlePaddle-bot reviewed May 30, 2026

6 participants
