Skip to content

Add Textract OP#460

Open
bingmeiyou wants to merge 17 commits into
mouliangyu:feature-vpto-backendfrom
bingmeiyou:textract_task
Open

Add Textract OP#460
bingmeiyou wants to merge 17 commits into
mouliangyu:feature-vpto-backendfrom
bingmeiyou:textract_task

Conversation

@bingmeiyou

Copy link
Copy Markdown
Collaborator

No description provided.

Comment thread lib/TileOps/textract_template.py Outdated
index_row: pto.i32, index_col: pto.i32,
dst: pto.Tile):
m, k = dst.valid_shape
pto.mte_l1_l0a(src.as_ptr(), dst.as_ptr(), m, k)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里和下面的mat2right缺少transpose支持

template <typename DstTileData, typename SrcTileData>
PTO_INTERNAL void TExtractToLeft(DstTileData &dst, SrcTileData &src, uint16_t indexRow, uint16_t indexCol)
{
    static_assert((SrcTileData::SFractal == SLayout::ColMajor && SrcTileData::isRowMajor) ||
                      (SrcTileData::SFractal == SLayout::RowMajor && !SrcTileData::isRowMajor) ||
                      (SrcTileData::Rows == 1 && SrcTileData::isRowMajor),
                  "TExtract: SrcTile Invalid Fractal");
    static_assert(DstTileData::SFractal == SLayout::RowMajor && !DstTileData::isRowMajor,
                  "TExtract: DstTile Invalid Fractal");
    constexpr bool isFp4Type = std::is_same<typename SrcTileData::DType, float4_e2m1x2_t>::value ||
                               std::is_same<typename SrcTileData::DType, float4_e1m2x2_t>::value;
    if constexpr (SrcTileData::Rows == 1 && SrcTileData::isRowMajor) {
        TExtractToAVector<DstTileData, SrcTileData, isFp4Type>(dst.data(), src.data(), indexRow, indexCol,
                                                               dst.GetValidCol());
    } else if constexpr (DstTileData::SFractal == SrcTileData::SFractal) {
        if constexpr (DstTileData::Compact == CompactMode::Normal) {
            TExtractToACompact<DstTileData, SrcTileData, isFp4Type>(dst.data(), src.data(), indexRow, indexCol,
                                                                    dst.GetValidRow(), dst.GetValidCol());
        } else {
            TExtractToA<DstTileData, SrcTileData, false, isFp4Type>(dst.data(), src.data(), indexRow, indexCol);
        }
    } else {
        if constexpr (DstTileData::Compact == CompactMode::Normal || sizeof(typename SrcTileData::DType) == 1) {
            TExtractToATransCompact<DstTileData, SrcTileData, isFp4Type>(dst.data(), src.data(), indexRow, indexCol,
                                                                         dst.GetValidRow(), dst.GetValidCol());
        } else {
            TExtractToA<DstTileData, SrcTileData, true, isFp4Type>(dst.data(), src.data(), indexRow, indexCol);
        }
    }
}

src_stride = src.shape[0]
dst_stride = dst.shape[0]
pto.mte_l0c_l1(src.as_ptr(), dst.as_ptr(), m, n, src_stride, dst_stride,
pre_quant=(fp.as_ptr(), "qf322b8_pre_vec"))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_row/index_col参数没有被使用,逻辑和C++实现没对上

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

问题分析

您指出的问题完全正确。我们的 Python 模板接收了 index_row, index_col 参数但没传给底层指令。经排查,PTO-ISA 实际上是支持 start-position 的,只是我们用错了层级

完整的调用链路

TileLang DSL 模板 (textract_template.py)
  → pto.mte_l1_l0a(src, dst, m, k)     ← 无 start-position 参数
    → VPTOExpandWrapperOps lowering
      → load_cbuf_to_ca(src, dst, mStart, kStart, mStep, kStep, ...)  ← 有 m_start/k_start 参数!

关键发现

  1. mte_l1_l0a/l0b(DSL 层 API)没有 start-position 参数(VPTOOps.td:2669-2687 只有 source, destination, m, k, transpose)
  2. load_cbuf_to_ca(硬件层指令)m_start, k_start 参数(VPTOOps.td:1142-1164)
  3. VPTOExpandWrapperOps.cpp:772deriveLoadCbufToCaControl 函数硬编码 mStart=0, kStart=0
    return LoadCbufToCbControl{zero, zero, mStep, kStep, stride, stride};
    //                      mStart=0  kStart=0
  4. C++ Cube API (TExtractToACompact) 传 indexRow, indexCol 是通过硬件层 API(类似 load_cbuf_to_ca),而不是通过 mte_l1_l0a 这层

为什么没对上

C++ 实现直接调用硬件层 API load_cbuf_to_ca(mStart=indexRow, kStart=indexCol),而我们用的是 DSL 层 mte_l1_l0a,它是一个 wrapper op,expand 后才变成 load_cbuf_to_ca。这个 wrapper op 目前没有传递 start-position 的通道,导致 expand 阶段的 deriveLoadCbufToCaControl 硬编码为 0。

结论:不是 PTO-ISA 不支持,是我们用了不完整的 wrapper op。硬件层 load_cbuf_to_ca/cb 已经有 m_start/k_start,需要让 mte_l1_l0a/l0b wrapper 把 start-position 透传下去。

Fix-pipe 路径(Acc → Mat):mte_l0c_l1

同样的情况:

  • mte_l0c_l1(wrapper)没有 start-position
  • copy_matrix_cc_to_cbuf(硬件层)也没有显式的 start-position 参数(它通过 xm/xt packed 配置控制,整个 tile 从 L0C 提取是隐式的)
  • C++ 的 FP 路径实际上也不使用 indexRow/indexCol 做偏移

需要的修改

方案 A(推荐):为 mte_l1_l0a/l0b 增加 start_position keyword

mte_l1_l0a/l0b wrapper op 增加 start_row/start_col keyword 参数(类似 transpose keyword),然后在 VPTOExpandWrapperOps.cppderiveLoadCbufToCaControl 中使用这些值替代硬编码的 zero

改动点

  1. VPTOOps.tdMteL1L0aOp/MteL1L0bOp 增加 Optional<I64>:$start_rowOptional<I64>:$start_col 参数(默认值 0)
  2. semantic.py_analyze_mte_stage_load 增加 start_row/start_col keyword 支持
  3. frontend_ast.py"mte_l1_l0a": frozenset({"transpose", "start_row", "start_col"})
  4. VPTOExpandWrapperOps.cppderiveLoadCbufToCaControl 接收 start_row/start_col 参数,传给 LoadCbufToCaOpm_start/k_start
  5. textract_template.py — 模板体中传 start_row=index_row, start_col=index_col,移除 constraint 中 !=0 的检查

方案 B(绕路):直接在模板中使用 load_cbuf_to_ca

不用 mte_l1_l0a wrapper,直接在 textract 模板中用 pto.load_cbuf_to_ca(src, dst, index_row, index_col, mStep, kStep, srcStride, dstStride)

问题:需要手动计算 mStep/kStep/stride(目前 deriveLoadCbufToCaControl 做的事),而且 load_cbuf_to_ca 不是 TileLang DSL v1 的 public API(不在 semantic.py 的 surface 列表中)。

推荐方案 A,因为它对齐了 transpose 的设计模式(wrapper op 透传硬件配置),且改动集中可控。

//
// RUN: ptoas --pto-arch=a5 --pto-backend=vpto --enable-tile-op-expand --emit-vpto %s -o - 2>/dev/null | FileCheck %s

// CHECK-LABEL: func.func @TEXTRACT_FP_F32_BF16

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些针对不同数据类型的lit用例可以考虑合并一下

@bingmeiyou bingmeiyou force-pushed the textract_task branch 5 times, most recently from d62030c to e55804e Compare June 3, 2026 03:52
User added 14 commits June 4, 2026 10:35
- textract_template.py: 5 constraints + 12 templates (Mat2Left/Right, FP, Vec2Vec ND)
- 17 positive lit tests: mat2left/right (f16/bf16/f32/cm_rm), fp (f32→f16/bf16/si8/ui8, si32→*), vec2vec_nd (f32/bf16/f16/offset)
- 20 verify lit tests: invalid dst/src/fp loc, dtype mismatch/pair, layout, offset, vec_left, vec2mat loc pair, vec2vec layout/dtype/unsupported
- 1 nonzero offset A5 verify: textract_verify_a5_mat_lr_nonzero_offset
- textract/: Mat2Left/Right ST (pto.textract + tmatmul identity readback)
- textract_fp/: FP Acc→Mat ST (textract_fp f32→f16 + mte_l0c_gm acc readback)
- textract_v2v/: Vec2Vec ND ST (mte_gm_ub + textract + mte_ub_gm)
- CMakeLists.txt: register textract/textract_fp/textract_v2v in ALL_TESTCASES
…rain FP index_row/index_col=0

- Split mat2left/mat2right into same-fractal (transpose=False) and
  cross-fractal (transpose=True) templates, matching C++ TExtractToA
  / TExtractToATransCompact dispatch logic
- Add index_row=0 / index_col=0 constraints to FP constraint functions
  (currently unused in lowering since mte_l0c_l1 has no start-position
  parameter; PTO-ISA support pending)
- Update module docstring to explain transpose dispatch and FP index
  limitation
- Combine 9 individual FP dtype pair tests into
  expand_tile_op_tilelang_textract_fp_all_dtypes.pto
  using CHECK-DAG to verify all function names
- Fix SI8/UI8 dst tile cols=32 (innerCols requirement),
  src/fp cols=32 for matching dimensions
- Remove 8 redundant individual FP dtype pair lit files
- Replace low-level DMA operations with high-level tload/tstore
- Use partition_tensor_view for proper GM-UB data transfer
- Fix castptr lowering failure in VPTOPtrNormalize pass
- Align with other vec ST tests (e.g., tcolmax.pto)
- Add addr = %c0_i64 for src_vec alloc_tile
- Add addr = %c1024_i64 for dst_vec alloc_tile (avoid overlap)
- Required when --pto-level=level3 is specified in CMakeLists.txt
- Fix typo in textract/main.cpp, textract_fp/main.cpp, textract_v2v/main.cpp
- aclrtFinalize() does not exist in CANN API, correct name is aclFinalize()
- Aligns with other ST tests (tcolmax, tadd, etc.)
…nspose

- Change right_tile from cross-fractal (row_major, col_major) to same-fractal (col_major, row_major)
- Remove transpose=true from mte_l1_l0b call
- This ensures kernel computes A × B (not A × B^T) to match golden reference
- Fixes simulation test failure: max diff=5.258651971817017
- Restore right_tile to cross-fractal layout (row_major, col_major) as required by A5 tmatmul
- Restore transpose=true in mte_l1_l0b for cross-fractal layout
- Fix golden computation in gen_data.py: use matmul(A, B.T) instead of matmul(A, B)
- This matches the actual kernel behavior where cross-fractal layout with transpose computes A × B^T
…ruption

Root cause: a_mat (addr=0, 512 bytes) and b_mat (addr=256, 512 bytes) were
overlapping in L1 memory, causing data corruption for the second half of the
output matrix (rows 8-15).

Fix:
- Change b_mat address from 256 to 512 (no overlap with a_mat)
- Change fb_mat address from 768 to 1024 (no overlap with b_mat)
- Fix golden computation: use matmul(A, B) instead of matmul(A, B.T)

The kernel computes A×B (not A×B^T) despite transpose=true on mte_l1_l0b,
because transpose only affects the L1→L0B data layout, not the matmul semantics.

Verified locally with CANN simulator: all 16 rows now match golden exactly.
…tart_col

- Update textract_template.py to pass index_row/index_col parameters to
  mte_l1_l0a/l0b as start_row/start_col keywords (PR mouliangyu#469)
- Remove index_row/index_col != 0 constraint checks from all 4 cube templates
  (same/cross fractal for left/right)
- Update DSL semantic.py to accept i32 type in _require_i64_like_expr
  (in addition to i64 and index)
- Update docstring to reflect offset extraction support

This resolves review comment mouliangyu#2 (r3328107569) which identified that
index_row/index_col parameters were not being used in the templates.
User added 3 commits June 4, 2026 10:46
- Add 'except' to line 4 of license header in all textract-related files
- Fixes CI license header check failure
- Affected files: textract_template.py and all ST test files
- Remove indexRow=0/indexCol=0 check in PTO.cpp:5169-5177
- Non-zero offsets now supported via start_row/start_col (PR mouliangyu#469)
- Update test/lit/pto/textract_verify_a5_mat_lr_nonzero_offset.pto to positive test
- Update test/lit/vpto/cube/textract_verify_invalid_offset.pto to positive test
- All 35 cube tests pass
- Update docstring to explain Vec→Vec implementation choice
- vlds/vsts supports arbitrary offset (index_row/index_col)
- copy_ubuf_to_ubuf is available but unsuitable for sub-window extraction
- Remove misleading 'not yet available' note from blocked paths
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants