Add Textract OP#460
Conversation
| index_row: pto.i32, index_col: pto.i32, | ||
| dst: pto.Tile): | ||
| m, k = dst.valid_shape | ||
| pto.mte_l1_l0a(src.as_ptr(), dst.as_ptr(), m, k) |
There was a problem hiding this comment.
这里和下面的mat2right缺少transpose支持
template <typename DstTileData, typename SrcTileData>
PTO_INTERNAL void TExtractToLeft(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol)
{
static_assert((SrcTileData::SFractal == SLayout::ColMajor && SrcTileData::isRowMajor) ||
(SrcTileData::SFractal == SLayout::RowMajor && !SrcTileData::isRowMajor) ||
(SrcTileData::Rows == 1 && SrcTileData::isRowMajor),
"TExtract: SrcTile Invalid Fractal");
static_assert(DstTileData::SFractal == SLayout::RowMajor && !DstTileData::isRowMajor,
"TExtract: DstTile Invalid Fractal");
constexpr bool isFp4Type = std::is_same<typename SrcTileData::DType, float4_e2m1x2_t>::value ||
std::is_same<typename SrcTileData::DType, float4_e1m2x2_t>::value;
if constexpr (SrcTileData::Rows == 1 && SrcTileData::isRowMajor) {
TExtractToAVector<DstTileData, SrcTileData, isFp4Type>(dst.data(), src.data(), indexRow, indexCol,
dst.GetValidCol());
} else if constexpr (DstTileData::SFractal == SrcTileData::SFractal) {
if constexpr (DstTileData::Compact == CompactMode::Normal) {
TExtractToACompact<DstTileData, SrcTileData, isFp4Type>(dst.data(), src.data(), indexRow, indexCol,
dst.GetValidRow(), dst.GetValidCol());
} else {
TExtractToA<DstTileData, SrcTileData, false, isFp4Type>(dst.data(), src.data(), indexRow, indexCol);
}
} else {
if constexpr (DstTileData::Compact == CompactMode::Normal || sizeof(typename SrcTileData::DType) == 1) {
TExtractToATransCompact<DstTileData, SrcTileData, isFp4Type>(dst.data(), src.data(), indexRow, indexCol,
dst.GetValidRow(), dst.GetValidCol());
} else {
TExtractToA<DstTileData, SrcTileData, true, isFp4Type>(dst.data(), src.data(), indexRow, indexCol);
}
}
}| src_stride = src.shape[0] | ||
| dst_stride = dst.shape[0] | ||
| pto.mte_l0c_l1(src.as_ptr(), dst.as_ptr(), m, n, src_stride, dst_stride, | ||
| pre_quant=(fp.as_ptr(), "qf322b8_pre_vec")) |
There was a problem hiding this comment.
index_row/index_col参数没有被使用,逻辑和C++实现没对上
There was a problem hiding this comment.
问题分析
您指出的问题完全正确。我们的 Python 模板接收了 index_row, index_col 参数但没传给底层指令。经排查,PTO-ISA 实际上是支持 start-position 的,只是我们用错了层级。
完整的调用链路
TileLang DSL 模板 (textract_template.py)
→ pto.mte_l1_l0a(src, dst, m, k) ← 无 start-position 参数
→ VPTOExpandWrapperOps lowering
→ load_cbuf_to_ca(src, dst, mStart, kStart, mStep, kStep, ...) ← 有 m_start/k_start 参数!
关键发现:
mte_l1_l0a/l0b(DSL 层 API)没有 start-position 参数(VPTOOps.td:2669-2687 只有 source, destination, m, k, transpose)- 但
load_cbuf_to_ca(硬件层指令)有m_start, k_start参数(VPTOOps.td:1142-1164) VPTOExpandWrapperOps.cpp:772的deriveLoadCbufToCaControl函数硬编码mStart=0, kStart=0:return LoadCbufToCbControl{zero, zero, mStep, kStep, stride, stride}; // mStart=0 kStart=0
- C++ Cube API (
TExtractToACompact) 传indexRow, indexCol是通过硬件层 API(类似load_cbuf_to_ca),而不是通过mte_l1_l0a这层
为什么没对上
C++ 实现直接调用硬件层 API load_cbuf_to_ca(mStart=indexRow, kStart=indexCol),而我们用的是 DSL 层 mte_l1_l0a,它是一个 wrapper op,expand 后才变成 load_cbuf_to_ca。这个 wrapper op 目前没有传递 start-position 的通道,导致 expand 阶段的 deriveLoadCbufToCaControl 硬编码为 0。
结论:不是 PTO-ISA 不支持,是我们用了不完整的 wrapper op。硬件层 load_cbuf_to_ca/cb 已经有 m_start/k_start,需要让 mte_l1_l0a/l0b wrapper 把 start-position 透传下去。
Fix-pipe 路径(Acc → Mat):mte_l0c_l1
同样的情况:
mte_l0c_l1(wrapper)没有 start-positioncopy_matrix_cc_to_cbuf(硬件层)也没有显式的 start-position 参数(它通过xm/xtpacked 配置控制,整个 tile 从 L0C 提取是隐式的)- C++ 的 FP 路径实际上也不使用 indexRow/indexCol 做偏移
需要的修改
方案 A(推荐):为 mte_l1_l0a/l0b 增加 start_position keyword
让 mte_l1_l0a/l0b wrapper op 增加 start_row/start_col keyword 参数(类似 transpose keyword),然后在 VPTOExpandWrapperOps.cpp 的 deriveLoadCbufToCaControl 中使用这些值替代硬编码的 zero。
改动点:
- VPTOOps.td —
MteL1L0aOp/MteL1L0bOp增加Optional<I64>:$start_row和Optional<I64>:$start_col参数(默认值 0) - semantic.py —
_analyze_mte_stage_load增加start_row/start_colkeyword 支持 - frontend_ast.py —
"mte_l1_l0a": frozenset({"transpose", "start_row", "start_col"}) - VPTOExpandWrapperOps.cpp —
deriveLoadCbufToCaControl接收 start_row/start_col 参数,传给LoadCbufToCaOp的m_start/k_start - textract_template.py — 模板体中传
start_row=index_row, start_col=index_col,移除 constraint 中!=0的检查
方案 B(绕路):直接在模板中使用 load_cbuf_to_ca
不用 mte_l1_l0a wrapper,直接在 textract 模板中用 pto.load_cbuf_to_ca(src, dst, index_row, index_col, mStep, kStep, srcStride, dstStride)。
问题:需要手动计算 mStep/kStep/stride(目前 deriveLoadCbufToCaControl 做的事),而且 load_cbuf_to_ca 不是 TileLang DSL v1 的 public API(不在 semantic.py 的 surface 列表中)。
推荐方案 A,因为它对齐了 transpose 的设计模式(wrapper op 透传硬件配置),且改动集中可控。
| // | ||
| // RUN: ptoas --pto-arch=a5 --pto-backend=vpto --enable-tile-op-expand --emit-vpto %s -o - 2>/dev/null | FileCheck %s | ||
|
|
||
| // CHECK-LABEL: func.func @TEXTRACT_FP_F32_BF16 |
There was a problem hiding this comment.
这些针对不同数据类型的lit用例可以考虑合并一下
d62030c to
e55804e
Compare
- textract_template.py: 5 constraints + 12 templates (Mat2Left/Right, FP, Vec2Vec ND) - 17 positive lit tests: mat2left/right (f16/bf16/f32/cm_rm), fp (f32→f16/bf16/si8/ui8, si32→*), vec2vec_nd (f32/bf16/f16/offset) - 20 verify lit tests: invalid dst/src/fp loc, dtype mismatch/pair, layout, offset, vec_left, vec2mat loc pair, vec2vec layout/dtype/unsupported - 1 nonzero offset A5 verify: textract_verify_a5_mat_lr_nonzero_offset
- textract/: Mat2Left/Right ST (pto.textract + tmatmul identity readback) - textract_fp/: FP Acc→Mat ST (textract_fp f32→f16 + mte_l0c_gm acc readback) - textract_v2v/: Vec2Vec ND ST (mte_gm_ub + textract + mte_ub_gm) - CMakeLists.txt: register textract/textract_fp/textract_v2v in ALL_TESTCASES
…rain FP index_row/index_col=0 - Split mat2left/mat2right into same-fractal (transpose=False) and cross-fractal (transpose=True) templates, matching C++ TExtractToA / TExtractToATransCompact dispatch logic - Add index_row=0 / index_col=0 constraints to FP constraint functions (currently unused in lowering since mte_l0c_l1 has no start-position parameter; PTO-ISA support pending) - Update module docstring to explain transpose dispatch and FP index limitation
- Combine 9 individual FP dtype pair tests into expand_tile_op_tilelang_textract_fp_all_dtypes.pto using CHECK-DAG to verify all function names - Fix SI8/UI8 dst tile cols=32 (innerCols requirement), src/fp cols=32 for matching dimensions - Remove 8 redundant individual FP dtype pair lit files
- Replace low-level DMA operations with high-level tload/tstore - Use partition_tensor_view for proper GM-UB data transfer - Fix castptr lowering failure in VPTOPtrNormalize pass - Align with other vec ST tests (e.g., tcolmax.pto)
- Add addr = %c0_i64 for src_vec alloc_tile - Add addr = %c1024_i64 for dst_vec alloc_tile (avoid overlap) - Required when --pto-level=level3 is specified in CMakeLists.txt
- Fix typo in textract/main.cpp, textract_fp/main.cpp, textract_v2v/main.cpp - aclrtFinalize() does not exist in CANN API, correct name is aclFinalize() - Aligns with other ST tests (tcolmax, tadd, etc.)
…nspose - Change right_tile from cross-fractal (row_major, col_major) to same-fractal (col_major, row_major) - Remove transpose=true from mte_l1_l0b call - This ensures kernel computes A × B (not A × B^T) to match golden reference - Fixes simulation test failure: max diff=5.258651971817017
- Restore right_tile to cross-fractal layout (row_major, col_major) as required by A5 tmatmul - Restore transpose=true in mte_l1_l0b for cross-fractal layout - Fix golden computation in gen_data.py: use matmul(A, B.T) instead of matmul(A, B) - This matches the actual kernel behavior where cross-fractal layout with transpose computes A × B^T
…ruption Root cause: a_mat (addr=0, 512 bytes) and b_mat (addr=256, 512 bytes) were overlapping in L1 memory, causing data corruption for the second half of the output matrix (rows 8-15). Fix: - Change b_mat address from 256 to 512 (no overlap with a_mat) - Change fb_mat address from 768 to 1024 (no overlap with b_mat) - Fix golden computation: use matmul(A, B) instead of matmul(A, B.T) The kernel computes A×B (not A×B^T) despite transpose=true on mte_l1_l0b, because transpose only affects the L1→L0B data layout, not the matmul semantics. Verified locally with CANN simulator: all 16 rows now match golden exactly.
…tart_col - Update textract_template.py to pass index_row/index_col parameters to mte_l1_l0a/l0b as start_row/start_col keywords (PR mouliangyu#469) - Remove index_row/index_col != 0 constraint checks from all 4 cube templates (same/cross fractal for left/right) - Update DSL semantic.py to accept i32 type in _require_i64_like_expr (in addition to i64 and index) - Update docstring to reflect offset extraction support This resolves review comment mouliangyu#2 (r3328107569) which identified that index_row/index_col parameters were not being used in the templates.
- Add 'except' to line 4 of license header in all textract-related files - Fixes CI license header check failure - Affected files: textract_template.py and all ST test files
- Remove indexRow=0/indexCol=0 check in PTO.cpp:5169-5177 - Non-zero offsets now supported via start_row/start_col (PR mouliangyu#469) - Update test/lit/pto/textract_verify_a5_mat_lr_nonzero_offset.pto to positive test - Update test/lit/vpto/cube/textract_verify_invalid_offset.pto to positive test - All 35 cube tests pass
- Update docstring to explain Vec→Vec implementation choice - vlds/vsts supports arbitrary offset (index_row/index_col) - copy_ubuf_to_ubuf is available but unsuitable for sub-window extraction - Remove misleading 'not yet available' note from blocked paths
No description provided.