Rms Norm Impl by and0d0 · Pull Request #868 · hw-native-sys/PTOAS

and0d0 · 2026-06-25T09:10:48Z

未完成：

reduce all文档
板测有不通过场景 thread=64/32

需要人工再确认一下：

已完成：

kernel对齐cpp与.pto参考
Load store extend
pto.vec
alloc_buffer
文档整改
with pto.simt(threads, 1, 1)
load compute store之间的wait & set
lanuch测试修改了runtime/codegen.py 显式传入了ub所需要的空间大小(dyn_shared_bytes), 我认为我不该改，从测试侧绕过 or 弥补(测试侧绕过)
cleancode

…l-reduce Implement the pto.simt_allreduce_sum frontend interface as designed in mission/483/483_docs.md. Pure Python MLIR IR emission with three dispatch strategies: warp_reduce (<=32 threads, pow2), cross_warp_reduce (>32, pow2), ub_reduce (fallback). Supports f32 and f16. - ptodsl/ptodsl/_allreduce.py: new — 674 lines - ptodsl/ptodsl/pto.py: export simt_allreduce_sum (+3 lines) - ptodsl/tests/test_allreduce.py: new — 533 lines, all passing Co-Authored-By: Claude <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces explicit scratch buffer allocation via pto.alloc_buffer, contiguous vector memory operations for scalar.load and scalar.store, and SIMT cross-workitem all-reduce helpers (simt_allreduce_sum). The feedback highlights three robustness improvements: safely retrieving the type attribute in scratch validation and vector binary operations to prevent unhandled AttributeErrors when raw Python values are passed, and explicitly handling local/private pointer types in address space mapping to avoid incorrect compilation outputs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-25T09:12:52Z

+    try:
+        ptr_type = _pto.PtrType(raw_scratch.type)
+    except Exception:
+        raise TypeError(
+            f"all_reduce {context}: scratch must be a !pto.ptr buffer, "
+            f"got {raw_scratch.type}"
+        ) from None


If raw_scratch does not have a type attribute (for example, if a non-surface Python value is passed), accessing raw_scratch.type inside the except block will raise an AttributeError. This masks the original exception and prevents a clean TypeError from being raised.

Using getattr or checking for the attribute beforehand avoids this secondary exception.

Suggested change

try:

ptr_type = _pto.PtrType(raw_scratch.type)

except Exception:

raise TypeError(

f"all_reduce {context}: scratch must be a !pto.ptr buffer, "

f"got {raw_scratch.type}"

) from None

scratch_type = getattr(raw_scratch, "type", None)

if scratch_type is None:

raise TypeError(

f"all_reduce {context}: scratch must be a !pto.ptr buffer, "

f"got {type(raw_scratch).__name__}"

)

try:

ptr_type = _pto.PtrType(scratch_type)

except Exception:

raise TypeError(

f"all_reduce {context}: scratch must be a !pto.ptr buffer, "

f"got {scratch_type}"

) from None

gemini-code-assist · 2026-06-25T09:12:52Z

+    text = str(ptr_type)
+    if ", ub>" in text or ", vec>" in text:
+        return 6
+    if ", gm>" in text or text.endswith(">"):
+        return 1


The fallback check text.endswith(">") is too generic and will match local/private pointer types (e.g., !pto.ptr<f32, local>), incorrectly returning 1 (Global Memory address space) instead of 0 (Local/Private address space). This address space mismatch with llvm.alloca (which allocates in address space 0) can lead to LLVM compilation failures or invalid memory accesses.

Adding explicit checks for ", local>" and ", private>" before the fallback ensures they are correctly mapped to address space 0.

Suggested change

text = str(ptr_type)

if ", ub>" in text or ", vec>" in text:

return 6

if ", gm>" in text or text.endswith(">"):

return 1

text = str(ptr_type)

if ", ub>" in text or ", vec>" in text:

return 6

if ", local>" in text or ", private>" in text:

return 0

if ", gm>" in text or text.endswith(">"):

return 1

gemini-code-assist · 2026-06-25T09:12:52Z

+def _emit_vec_binary_op(op_name: str, lhs, rhs):
+    lhs_raw = unwrap_surface_value(lhs)
+    rhs_raw = unwrap_surface_value(rhs)
+    if not (VectorType.isinstance(lhs_raw.type) and VectorType.isinstance(rhs_raw.type)):
+        raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands")


If either lhs or rhs is a raw Python scalar (e.g., 2.0 or 3), unwrap_surface_value returns the raw Python value, which does not have a type attribute. Accessing .type directly on it will raise an AttributeError, masking the intended TypeError with a cryptic traceback.

Using getattr to safely retrieve the types prevents this issue.

Suggested change

def _emit_vec_binary_op(op_name: str, lhs, rhs):

lhs_raw = unwrap_surface_value(lhs)

rhs_raw = unwrap_surface_value(rhs)

if not (VectorType.isinstance(lhs_raw.type) and VectorType.isinstance(rhs_raw.type)):

raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands")

def _emit_vec_binary_op(op_name: str, lhs, rhs):

lhs_raw = unwrap_surface_value(lhs)

rhs_raw = unwrap_surface_value(rhs)

lhs_type = getattr(lhs_raw, "type", None)

rhs_type = getattr(rhs_raw, "type", None)

if lhs_type is None or rhs_type is None or not (VectorType.isinstance(lhs_type) and VectorType.isinstance(rhs_type)):

raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands")

# Conflicts: # ptodsl/docs/user_guide/06-scalar-and-pointer-ops.md # ptodsl/ptodsl/_ops.py # ptodsl/ptodsl/_runtime_scalar_ops.py # ptodsl/ptodsl/_surface_values.py # ptodsl/ptodsl/_tracing/session.py # ptodsl/ptodsl/pto.py # ptodsl/tests/test_jit_compile.py

and0d0 · 2026-06-25T10:50:47Z

/review

and0d0 · 2026-06-25T11:01:27Z

/review

and0d0 · 2026-06-25T11:08:45Z

+
+| Scope | Storage | Returned value | Typical use | Layout notes |
+|-------|---------|----------------|-------------|--------------|
+| `"ub"` | Function-level Unified Buffer scratch | Typed `!pto.ptr<T, ub>` | MTE source/destination buffers, cross-SIMT scratch such as reductions | Contributes to `dyn_shared_memory_buf`; the frontend may insert alignment padding between allocations |


考虑直接删除ub部分

不删了，改回手写读起来有点困难

HecreReed · 2026-06-26T03:19:41Z

/run a3

reedhecre · 2026-06-26T03:20:11Z

已接收 /run a3，A3 板测器会处理这条请求。

进度页：http://154.9.227.233/ptoas-board-dashboard/#board-a3
当前状态：板测器空闲，这条请求会在本轮轮询启动。

页面会自动刷新，可以直接看当前阶段、排队情况和最近结果。

reedhecre · 2026-06-26T05:33:30Z

A3 板测失败详情：PR #868

rowexpanddiv

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=18.47865390777588 at idx=963 (golden=10.328862190246582, out=-8.149791717529297, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:14:55] ERROR: testcase failed (exit 2): rowexpanddiv

xor

stage=run info=exit=2

[ERROR] Mismatch: golden_v2.bin vs v2.bin, max diff=255.0 at idx=312 (golden=0, out=-255, dtype=int16)
[ERROR] compare failed
[2026-06-26 13:18:37] ERROR: testcase failed (exit 2): xor

rowexpandmul

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=14.702873229980469 at idx=206 (golden=8.388465881347656, out=-6.3144073486328125, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:22:43] ERROR: testcase failed (exit 2): rowexpandmul

quant_asym

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:53:58: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
    TASSIGN_IMPL(src_s32, reinterpret_cast<uintptr_t>(tmp.data()));
                                                      ~~~^
                                                         ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:117:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_ASYM, Tile<TileType::Vec, uint8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v29, v23, v25, v31);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:98:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src0.data(), src1.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:42:5: note: in instantiation of function template specialization 'pto::TROWEXPANDMUL_IMPL<pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TROWEXPANDMUL_IMPL(src, src, scale, tmp);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:117:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_ASYM, Tile<TileType::Vec, uint8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v29, v23, v25, v31);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:97:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc0, TileDataSrc1, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:103:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src1.data(), src0.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:102:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc1, TileDataSrc0, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:137:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:98:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src0.data(), src1.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:45:9: note: in instantiation of function template specialization 'pto::TROWEXPANDADD_IMPL<pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
        TROWEXPANDADD_IMPL(src, src, *offset, tmp);
        ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:117:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_ASYM, Tile<TileType::Vec, uint8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v29, v23, v25, v31);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:137:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:97:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandAddOp<T>, TileDataDst, TileDataSrc0, TileDataSrc1, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandAddOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:137:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:103:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src1.data(), src0.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:102:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandAddOp<T>, TileDataDst, TileDataSrc1, TileDataSrc0, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandAddOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
9 errors generated.
gmake[2]: *** [CMakeFiles/quant_asym_kernel.dir/build.make:76: CMakeFiles/quant_asym_kernel.dir/quant_asym_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/quant_asym_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-06-26 13:22:58] ERROR: testcase failed (exit 2): quant_asym

quant

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:53:58: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
    TASSIGN_IMPL(src_s32, reinterpret_cast<uintptr_t>(tmp.data()));
                                                      ~~~^
                                                         ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:108:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_SYM, Tile<TileType::Vec, int8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v22, v18, v20);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:98:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src0.data(), src1.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:42:5: note: in instantiation of function template specialization 'pto::TROWEXPANDMUL_IMPL<pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TROWEXPANDMUL_IMPL(src, src, scale, tmp);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:108:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_SYM, Tile<TileType::Vec, int8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v22, v18, v20);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:97:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc0, TileDataSrc1, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:103:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src1.data(), src0.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:102:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc1, TileDataSrc0, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
5 errors generated.
gmake[2]: *** [CMakeFiles/quant_kernel.dir/build.make:76: CMakeFiles/quant_kernel.dir/quant_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/quant_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-06-26 13:23:03] ERROR: testcase failed (exit 2): quant

scatter

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=4.940644979476929 at idx=27 (golden=2.725048542022705, out=-2.2155964374542236, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:23:41] ERROR: testcase failed (exit 2): scatter

rowexpandsub

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=5.565198540687561 at idx=202 (golden=1.555970311164856, out=-4.009228229522705, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:25:43] ERROR: testcase failed (exit 2): rowexpandsub

rope_kv_cache

stage=run info=exit=2

[ERROR] Mismatch (bf16 golden_v1.bin vs v1.bin): max ulp diff=30463 at idx=26720 (golden_bits=48012, out_bits=15218, golden=-0.0042724609375, out=0.003692626953125)
[ERROR] compare failed
[2026-06-26 13:25:49] ERROR: testcase failed (exit 2): rope_kv_cache

sels

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=66.99866366386414 at idx=92 (golden=-2.9986636638641357, out=64.0, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:27:08] ERROR: testcase failed (exit 2): sels

tprefetch_async_binding

stage=run info=exit=1

[SDMA] aclrtSynchronizeStream (aicpu) failed
[WARN] SdmaWorkspaceManager::Init failed - TPREFETCH_ASYNC will fall back to no-op prefetch
[ERROR] aclrtSynchronizeStream(stream) failed: 507018 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/TPrefetchAsync/tprefetch_async_binding/main.cpp:132)
[ERROR] RecentErrMsg: E39999: Inner Error!
E39999[PID: 3973399] 2026-06-26-13:27:51.295.779 (E39999):  The error from device(chipId:0, dieId:1), serial number is 330, an exception occurred during AICPU execution, stream_id:45, task_id:0, errcode:0, msg:aicpu execute failed.[FUNC:ProcessStarsAicpuErrorInfo][FILE:device_error_proc.cc][LINE:1644]
        TraceBack (most recent call last):
       Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       Aicpu kernel execute failed, device_id=1, stream_id=45, task_id=0, soName=libcpu_kernels.so, funcName=RunCpuKernel, kernelName=ShmemSdmaStarsQuery, errorCode=0x2a.[FUNC:PrintAicpuErrorInfo][FILE:davinci_kernel_task.cc][LINE:1435]
       rtStreamSynchronize execution failed, reason=aicpu exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507018[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
       Failed to submit kernel task, retCode=0x715002a.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:1223]
       kernel launch submit failed.[FUNC:LaunchKernel][FILE:context.cc][LINE:1349]
       rtKernelLaunch execution failed, reason=aicpu exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
[2026-06-26 13:30:22] ERROR: testcase failed (exit 1): tprefetch_async_binding

partmin

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=nan at idx=112 (golden=-0.0, out=nan, dtype=float16)
[ERROR] compare failed
[2026-06-26 13:33:13] ERROR: testcase failed (exit 2): partmin

Zhendong404 · 2026-06-26T06:54:00Z

+    core_id = pto.get_block_idx()
+    frag_elems: pto.const_expr = rounds * lanes
+
+    w_ub = pto.alloc_buffer((hidden_size,), pto.f32, scope="ub")


pto.alloc_buffer的地址谁在分配？能否使用alloc_tile?
我以为pto.alloc_buffer是在simt kernel内部使用的，如果新引入一个alloc_buffer op是否会对已有pass产生冲击？

ub部分是tracing里分配的，会在trace过程中生成已有ir(参考golden中的内容对应的ir)，后续都有已有的pass去处理，我认为实际冲击比较小？不直接使用已有ir，是因为alloc buffer ub包了一层自动维护实际的buffer ptr的部分，不用手动声明整个buffer块有多大，每次从哪里放到哪里。alloc_tile.as_ptr()会走tile的op多绕一圈，且需要显式传递addr，但最终形态能符合。我不确定kernel层面哪样比较好？给我一点意见吧

and0d0 · 2026-06-25T13:51:20Z

+
+| Scope | Storage | Returned value | Typical use | Layout notes |
+|-------|---------|----------------|-------------|--------------|
+| `"ub"` | Function-level Unified Buffer scratch | Typed `!pto.ptr<T, ub>` | MTE source/destination buffers, cross-SIMT scratch such as reductions | Contributes to `dyn_shared_memory_buf`; the frontend may insert alignment padding between allocations |


不删了，改回手写读起来有点困难

and0d0 · 2026-06-25T13:55:46Z

+        w_vec = scalar.load(w_ub, ub_offset, contiguous=lanes)
+        scalar.store(w_vec, w_frag, frag_offset)
+
+


每token的其中一个lane做的事

and0d0 · 2026-06-26T10:22:13Z

+    core_id = pto.get_block_idx()
+    frag_elems: pto.const_expr = rounds * lanes
+
+    w_ub = pto.alloc_buffer((hidden_size,), pto.f32, scope="ub")


ub部分是tracing里分配的，会在trace过程中生成已有ir(参考golden中的内容对应的ir)，后续都有已有的pass去处理，我认为实际冲击比较小？不直接使用已有ir，是因为alloc buffer ub包了一层自动维护实际的buffer ptr的部分，不用手动声明整个buffer块有多大，每次从哪里放到哪里。alloc_tile.as_ptr()会走tile的op多绕一圈，且需要显式传递addr，但最终形态能符合。我不确定kernel层面哪样比较好？给我一点意见吧

reedhecre · 2026-06-29T04:35:11Z

Manual Codex Review

该评论由 /review 手动触发。

PR: Rms Norm Impl #868 Rms Norm Impl
Author: and0d0
Base/Head: main / rmsNorm_merge_test
Head SHA: 1835bc14c685
Trigger: /review
Generated At: 2026-06-29T04:35:11Z
Requested By: and0d0
Trigger Comment: Rms Norm Impl #868 (comment)
Status: running

Summary

收到 /review，正在执行 Codex review。

Findings

Review in progress.

reedhecre · 2026-06-29T05:11:04Z

Manual Codex Review

该评论由 /review 手动触发。

PR: Rms Norm Impl #868 Rms Norm Impl
Author: and0d0
Base/Head: main / rmsNorm_merge_test
Head SHA: 1835bc14c685
Trigger: /review
Generated At: 2026-06-29T05:11:03Z
Requested By: and0d0
Trigger Comment: Rms Norm Impl #868 (comment)
Status: completed

Summary

PR #868 存在 3 个问题：simt_allreduce_sum 的 thread_offset 语义在多条路径下会算错，新引入的 LLVM 依赖会让部分环境导入 PTODSL 直接失败，alloc_buffer(scope='ub') 的 scratch 规划也会把顺序执行的子模块错误地累计到同一份 UB 配额里。

Findings

P1 `simt_allreduce_sum(..., thread_offset=...)` 在 shuffle/redux 路径下会归约错误的 lanes ptodsl/ptodsl/_allreduce.py:425

thread_offset 目前只是把 tid_x 重新编号成逻辑 tx，但 helper 之后仍直接使用物理 warp 语义的 pto.redux_add / pto.shuffle_bfly。例如 threads=128, thread_offset=4 时，这段代码把 lane 4..35 当成逻辑 warp0（tx = tid_x - offset，再按 >> 5 / & 31 分组），但硬件归约/shuffle 仍按物理 0..31、32..63 分 warp，结果 stage1/stage4 会把错误的 lane 混进来。单 warp 的 butterfly 路径也一样：_emit_butterfly() 完全没有使用 offset。当前实现也没有限制 thread_offset 必须 warp 对齐或保证子组不跨物理 warp，所以这是可触发的错误结果。

P2 新功能把 `mlir.dialects.llvm` 变成了 PTODSL 的导入时硬依赖 ptodsl/ptodsl/_builtin_vector.py:16

这里直接 from mlir.dialects import llvm，而 pto.py 会在导入时立刻加载这个模块；scalar.py 也做了同样的硬依赖。与此同时，_bootstrap.py 仍把 LLVM dialect 当作“可能不存在”的可选依赖处理。结果是：即使调用方完全不使用 pto.vec、alloc_buffer(scope='local') 或 contiguous vector load/store，只要环境里没有 mlir.dialects.llvm（这是此前允许的安装形态），from ptodsl import pto / from ptodsl import scalar 就会直接 ImportError。这是明确的兼容性回归。

P2 `alloc_buffer(scope='ub')` 的 scratch 规划按整个 trace session 累加，不能按函数/生命周期复用 ptodsl/ptodsl/_tracing/session.py:226

allocate_ub_scratch() 用的是 session 级别的 _ub_scratch_next_byte 单调累加所有 scope='ub' 分配，并在 trace 结束时一次性把总大小写回 entry 函数；这个计数器不会随 enter_function() 切换而重置，也没有任何 lifetime reuse。这样一来，只要一个 entry 调用了多个各自有私有 UB scratch 的子模块/辅助函数，这些 scratch 就会被永久叠加到同一块地址空间里。顺序执行、本可复用的 scratch 也会被算成总和，容易把模块化后的 kernel 平白推过 UB 容量上限，导致编译/资源检查失败。

reedhecre · 2026-06-29T05:15:29Z

Codex Review

该评论由 review 机器人自动更新。

PR: Rms Norm Impl #868 Rms Norm Impl
Author: and0d0
Base/Head: main / rmsNorm_merge_test
Head SHA: 709d9362d80b
Trigger: PR 有新提交
Generated At: 2026-07-01T01:30:49Z
Previous Head SHA: 59ad97ed5c3f
Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings，因为 review 过程提前失败。

Log Tail

作者：and0d0
base branch：origin/main
head branch：HEAD（当前已 checkout 到 PR head）

要求：
1. 只审查这个 PR 相对 origin/main 的改动，必要时可以看上下文文件。
2. 重点找真实的 correctness / regression / contract mismatch / CI / runtime / compatibility 问题。
3. 不要提纯风格建议，不要提低价值猜测。
4. 严格按优先级输出：
   - P1：高概率会导致错误结果、编译/运行失败、严重回归、发布阻断
   - P2：重要缺陷、行为回归、遗漏校验/测试、较大兼容性问题
   - P3：次要但明确可改的问题
5. 如果没有问题，summary 直接写：未检查到 PR #868 存在问题，并返回 findings=[]。
6. 如果有问题，summary 简洁概括，findings 里每条都要给出：
   - severity
   - title
   - body（说明为什么是问题，尽量具体）
   - file（尽量给相对路径）
   - line（能确定就填整数，否则 null）

建议先查看：
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

最终输出必须严格匹配 JSON schema。

mcp startup: no servers
exec
/bin/bash -lc 'git status --short' in /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/repo succeeded in 0ms:

exec
/bin/bash -lc 'git diff --stat origin/main...HEAD' in /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/repo succeeded in 0ms:
 ptodsl/README.md                                   |  20 +
 ptodsl/docs/user_guide/01-introduction.md          |   4 +-
 .../user_guide/03-kernel-entry-and-subkernels.md   |  25 +-
 .../docs/user_guide/04-type-system-and-buffer.md   |  30 +-
 .../docs/user_guide/06-scalar-and-pointer-ops.md   |  65 +-
 ptodsl/docs/user_guide/08-compute-operations.md    |  40 ++
 ptodsl/docs/user_guide/13-simt-micro-ops.md        |   4 +-
 .../examples/rms_norm/rmsnorm_alloc_buffer_simt.py | 236 +++++++
 .../rmsnorm_alloc_buffer_simt_launch_common.py     | 120 ++++
 .../rmsnorm_alloc_buffer_simt_manual_launch.py     | 292 +++++++++
 ptodsl/ptodsl/_allreduce.py                        | 355 +++++++++++
 ptodsl/ptodsl/_bootstrap.py                        |   6 +
 ptodsl/ptodsl/_builtin_vector.py                   |  49 ++
 ptodsl/ptodsl/_jit.py                              |  16 +
 ptodsl/ptodsl/_ops.py                              |  89 ++-
 ptodsl/ptodsl/_scalar_adaptation.py                |   6 +-
 ptodsl/ptodsl/_subkernels.py                       |  21 +-
 ptodsl/ptodsl/_surface_values.py                   |  75 ++-
 ptodsl/ptodsl/_tracing/module_builder.py           |   1 +
 ptodsl/ptodsl/_tracing/session.py                  |  32 +-
exec
/bin/bash -lc 'git diff --unified=80 origin/main...HEAD -- ptodsl/ptodsl/_allreduce.py ptodsl/ptodsl/_ops.py ptodsl/ptodsl/scalar.py ptodsl/ptodsl/_types.py ptodsl/ptodsl/_surface_values.py ptodsl/ptodsl/_jit.py ptodsl/ptodsl/_builtin_vector.py ptodsl/ptodsl/_subkernels.py ptodsl/ptodsl/_scalar_adaptation.py ptodsl/ptodsl/pto.py' in /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/repo succeeded in 0ms:
diff --git a/ptodsl/ptodsl/_allreduce.py b/ptodsl/ptodsl/_allreduce.py
new file mode 100644
index 00000000..3c6637bd
--- /dev/null
+++ b/ptodsl/ptodsl/_allreduce.py
@@ -0,0 +1,355 @@
+# Copyright (c) 2026 Huawei Technologies Co., Ltd.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+"""
+SIMT cross-workitem all-reduce.
+
+All-reduce ops are emitted **inline** at the current insertion point.
+Three reducer variants: ``simt_allreduce_sum``, ``simt_allreduce_max``, ``simt_allreduce_min``.
+
+Dispatch tree (compile-time, since *threads* / *scale* are Python ints)::
ERROR: Quota exceeded. Check your plan and billing details.
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/codex_last_message.json
tokens used
19,550
===== END STAGE codex-review rc=1 @ 2026-07-01 09:30:49 =====

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

Path 3 now emits the cross-warp allreduce body directly in the caller instead of returning through a SIMT helper call. Paths 1, 2, and 4 still use helper calls and should be converted to inline emission in a follow-up. Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

and0d0 · 2026-06-30T07:07:29Z

+    name = _helper_name(dtype, threads, scale, thread_offset)
+    args = dict(dtype=dtype, threads=threads, scale=scale,
+                thread_offset=thread_offset)
+


以下其他path没有验证，在simt使用中可能会造成无法正确return值的问题
解决方法：inline函数
目前只inline了path 3，其余待修改验证。

等待学坤版本合并

学坤版本已合并

and0d0 · 2026-06-30T09:12:08Z

    return f"ptodsl_launch_{ir_function_name}"

-def generate_launch_cpp(*, ir_function_name: str, kernel_signature) -> str:
+def generate_launch_cpp(*, ir_function_name: str, kernel_signature, dyn_shared_bytes: int = 0) -> str:


需要人肉review一下具体实现

不该改动原codegen，在testcase中绕过

and0d0 · 2026-06-30T10:51:32Z

+
+    # ── Path 3: cross_warp_reduce ────────────────────────────────────────
+    if scale <= 32 and _is_pow2(threads) and _is_pow2(scale):
+        return _emit_cross_warp_reduce_inline(


已改为inline版本

and0d0 · 2026-06-30T15:18:50Z

+        y_offset = pingpong * hidden_size + lane_base
+        frag_offset = r * lanes
+
+        x_vec = scalar.load(x_frag, frag_offset, contiguous=lanes)


这里，从x_frag读(llvm.load !llvm.ptr -> vector)，在thread=64/32时会出问题。

去掉此处下方y计算从ub直接加载，thread32全部与t=64，r=32，l=2 依旧失败

所以从local x_frag alloca 做 vector load不是唯一问题？

and0d0 · 2026-07-01T01:53:35Z

/review

reedhecre · 2026-07-01T02:25:07Z

Manual Codex Review

该评论由 /review 手动触发。

PR: Rms Norm Impl #868 Rms Norm Impl
Author: and0d0
Base/Head: main / rmsNorm_merge_test
Head SHA: 709d9362d80b
Trigger: /review
Generated At: 2026-07-01T02:25:06Z
Requested By: and0d0
Trigger Comment: Rms Norm Impl #868 (comment)
Status: running

Summary

收到 /review，正在执行 Codex review。

Findings

Review in progress.

reedhecre · 2026-07-01T02:48:39Z

Manual Codex Review

该评论由 /review 手动触发。

PR: Rms Norm Impl #868 Rms Norm Impl
Author: and0d0
Base/Head: main / rmsNorm_merge_test
Head SHA: 709d9362d80b
Trigger: /review
Generated At: 2026-07-01T02:48:39Z
Requested By: and0d0
Trigger Comment: Rms Norm Impl #868 (comment)
Status: completed

Summary

发现 3 个高价值问题：dyn_shared_memory_buf 没有接入默认 launcher、simt_allreduce 的非零 thread_offset 路径会算错，以及对 mlir.dialects.llvm 的顶层硬依赖会让部分环境直接 import 失败。

Findings

P1 `dyn_shared_memory_buf` 没有接入默认 launch 路径 ptodsl/ptodsl/_runtime/codegen.py:128

这个 PR 把 dyn_shared_memory_buf 暴露成 @pto.jit(...) 的公开参数，并且新的 RMSNorm kernel 明确依赖 82496B 的动态 UB；但默认 runtime 仍然生成 kernel<<<grid, nullptr, stream>>>。也就是说 compiled[grid, stream](...) 会始终以 0B 动态共享内存启动这些 kernel。rmsnorm_alloc_buffer_simt_manual_launch.py 已经在用手写 launcher 绕过这个问题，说明标准 launch 路径当前是跑不通的；用户按正常 PTODSL API 启动时会直接落到错误的 UB scratch 布局，结果要么运行失败，要么静默算错。

P1 `simt_allreduce(..., thread_offset!=0)` 会把错误的 lane 混进归约 ptodsl/ptodsl/_allreduce.py:108

这里直接用 get_tid_x() - thread_offset 重新编号，然后仍然调用 warp-local 的 redux_* / shuffle_bfly。函数本身既没有屏蔽 tid_x < thread_offset 的线程，也没有要求 thread_offset 做 warp 对齐，所以像新增测试里使用的 threads=16, thread_offset=4 这种调用，0..3 号 lane 仍会参与 reduction，而后续 lane 也会被错误并入相邻 group；threads>32 路径同样会把跨物理 warp 的 logical group 当成一个 warp 来归约，结果会直接算错。现在的新增测试只检查 IR 里出现了 subi，没有任何功能验证，所以这个错误很容易直接带进发布。

P2 新的 `pto.vec`/contiguous access 让 PTODSL 对 llvm Python dialect 变成硬依赖 ptodsl/ptodsl/_builtin_vector.py:16

_bootstrap.py 明明已经把 mlir.dialects.llvm 视为可选依赖，但这里仍在模块顶层无条件 from mlir.dialects import llvm。再加上 pto.py 现在会 eager import _builtin_vector，于是只要环境里没有 LLVM python bindings，from ptodsl import pto 就会直接 ModuleNotFoundError，即便用户根本没有使用 pto.vec 或 contiguous load/store。scalar.py 也引入了同样的顶层硬依赖，所以这是整个 PTODSL import 路径的兼容性回退。

andodo and others added 8 commits June 25, 2026 14:42

Add PTODSL alloc_buffer surface

9fd6076

Add PTODSL contiguous scalar vector access

b8456d6

Merge branch 'alloc_buf_dsl' into rmsNorm_merge_test

7955639

Merge branch 'ldst_contiguous_ext_dsl' into rmsNorm_merge_test

d54f745

fix(ptodsl): align allreduce scratch interface

af96964

test(ptodsl): cover alloc_buffer allreduce scratch

875db5b

example(ptodsl): add RMSNorm alloc_buffer SIMT kernel

1f08361

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

andodo added 2 commits June 25, 2026 18:14

test(ptodsl): cover RMSNorm example compile

edae0af

and0d0 marked this pull request as draft June 25, 2026 10:54

docs(ptodsl): avoid fixed alloc_buffer alignment contract

1b39788

docs(ptodsl): summarize alloc_buffer scopes in table

1e7cc23

and0d0 commented Jun 25, 2026

View reviewed changes

andodo added 10 commits June 25, 2026 19:53

test(ptodsl): compact rmsnorm simt loops

b5a1d78

test(ptodsl): align rmsnorm simt body

2a254f5

test(ptodsl): match rmsnorm rstd store

4e7e957

test(ptodsl): rename rmsnorm simt helper

3df78fe

test(ptodsl): validate rmsnorm simt partition

ce139da

docs(ptodsl): clarify alloc buffer parameters

e3f9746

docs(ptodsl): restrict alloc buffer scope

ab7239d

docs(ptodsl): split scalar contiguous access docs

0c4e634

docs(ptodsl): preserve scalar access wording

e7fa2e9

docs(ptodsl): move builtin vector docs

e69811c

feat(ptodsl): support inline simt launch dimensions

0e46cb4

Clarify inline SIMT launch context docs

08185ba

Zhendong404 suggested changes Jun 26, 2026

View reviewed changes

KurrinQu reviewed Jun 26, 2026

View reviewed changes

Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated

Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated

andodo added 3 commits June 26, 2026 16:35

refactor(ptodsl): use python loops in rmsnorm simt body

993ab62

refactor(ptodsl): remove alloc buffer persistent flag

f10301a

Align RMSNorm SIMT loops with golden MLIR

25d0bd5

and0d0 commented Jun 27, 2026

View reviewed changes

test(ptodsl): add rmsnorm launch validation script

1835bc1

KurrinQu mentioned this pull request Jun 29, 2026

[Feature] TileLang PTO后端对 PTODSL未支持能力的需求 mouliangyu/PTOAS#489

Open

KurrinQu reviewed Jun 30, 2026

View reviewed changes

Comment thread ptodsl/docs/user_guide/04-type-system-and-buffer.md Outdated

default and others added 5 commits June 30, 2026 15:06

fix(ptodsl): load rmsnorm pingpong tile via UB pointer

d1719fe

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

fix(ptodsl): pass dynamic shared memory to runtime launch

3cb1af6

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

fix(ptodsl): use pto sqrt in rmsnorm simt example

3f8afd4

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

refactor(ptodsl): make alloc_buffer local-only

28fa9ef

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

and0d0 force-pushed the rmsNorm_merge_test branch from 5f0b843 to 28fa9ef Compare June 30, 2026 09:07

default and others added 3 commits June 30, 2026 20:39

test(ptodsl): add manual dyn-UB RMSNorm launch

102e008

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

feat(ptodsl): inline SIMT allreduce implementation

59ac739

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

test(ptodsl): keep only manual RMSNorm launch entry

433ff88

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

and0d0 force-pushed the rmsNorm_merge_test branch from cb693a0 to 433ff88 Compare June 30, 2026 14:07

andodo added 2 commits June 30, 2026 22:24

fix(ptodsl): remove automatic dyn shared launch bytes

59ad97e

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

test(ptodsl): relax RMSNorm MLIR shape checks

709d936

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>

and0d0 commented Jul 1, 2026

View reviewed changes

		w_vec = scalar.load(w_ub, ub_offset, contiguous=lanes)
		scalar.store(w_vec, w_frag, frag_offset)

Uh oh!

Conversation

and0d0 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

and0d0 commented Jun 25, 2026

Uh oh!

and0d0 commented Jun 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HecreReed commented Jun 26, 2026

Uh oh!

reedhecre commented Jun 26, 2026

Uh oh!

reedhecre commented Jun 26, 2026

A3 板测失败详情：PR #868

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

reedhecre commented Jun 29, 2026

Manual Codex Review

Summary

Findings

Uh oh!

reedhecre commented Jun 29, 2026

Manual Codex Review

Summary

Findings

Uh oh!

reedhecre commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Review

Summary

Findings

Log Tail

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

and0d0 commented Jun 25, 2026 •

edited

Loading

reedhecre commented Jun 29, 2026 •

edited

Loading