Skip to content

Rms Norm Impl#868

Draft
and0d0 wants to merge 40 commits into
hw-native-sys:mainfrom
and0d0:rmsNorm_merge_test
Draft

Rms Norm Impl#868
and0d0 wants to merge 40 commits into
hw-native-sys:mainfrom
and0d0:rmsNorm_merge_test

Conversation

@and0d0

@and0d0 and0d0 commented Jun 25, 2026

Copy link
Copy Markdown

未完成:

  • reduce all文档
  • 板测有不通过场景 thread=64/32

需要人工再确认一下:

已完成:

  • kernel对齐cpp与.pto参考
  • Load store extend
  • pto.vec
  • alloc_buffer
  • 文档整改
  • with pto.simt(threads, 1, 1)
  • load compute store之间的wait & set
  • lanuch测试修改了runtime/codegen.py 显式传入了ub所需要的空间大小(dyn_shared_bytes), 我认为我不该改,从测试侧绕过 or 弥补(测试侧绕过)
  • cleancode

andodo and others added 8 commits June 25, 2026 14:42
…l-reduce

Implement the pto.simt_allreduce_sum frontend interface as designed in
mission/483/483_docs.md.  Pure Python MLIR IR emission with three
dispatch strategies: warp_reduce (<=32 threads, pow2), cross_warp_reduce
(>32, pow2), ub_reduce (fallback).  Supports f32 and f16.

- ptodsl/ptodsl/_allreduce.py: new — 674 lines
- ptodsl/ptodsl/pto.py: export simt_allreduce_sum (+3 lines)
- ptodsl/tests/test_allreduce.py: new — 533 lines, all passing

Co-Authored-By: Claude <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces explicit scratch buffer allocation via pto.alloc_buffer, contiguous vector memory operations for scalar.load and scalar.store, and SIMT cross-workitem all-reduce helpers (simt_allreduce_sum). The feedback highlights three robustness improvements: safely retrieving the type attribute in scratch validation and vector binary operations to prevent unhandled AttributeErrors when raw Python values are passed, and explicitly handling local/private pointer types in address space mapping to avoid incorrect compilation outputs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread ptodsl/ptodsl/_allreduce.py Outdated
Comment on lines +92 to +98
try:
ptr_type = _pto.PtrType(raw_scratch.type)
except Exception:
raise TypeError(
f"all_reduce {context}: scratch must be a !pto.ptr buffer, "
f"got {raw_scratch.type}"
) from None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If raw_scratch does not have a type attribute (for example, if a non-surface Python value is passed), accessing raw_scratch.type inside the except block will raise an AttributeError. This masks the original exception and prevents a clean TypeError from being raised.

Using getattr or checking for the attribute beforehand avoids this secondary exception.

Suggested change
try:
ptr_type = _pto.PtrType(raw_scratch.type)
except Exception:
raise TypeError(
f"all_reduce {context}: scratch must be a !pto.ptr buffer, "
f"got {raw_scratch.type}"
) from None
scratch_type = getattr(raw_scratch, "type", None)
if scratch_type is None:
raise TypeError(
f"all_reduce {context}: scratch must be a !pto.ptr buffer, "
f"got {type(raw_scratch).__name__}"
)
try:
ptr_type = _pto.PtrType(scratch_type)
except Exception:
raise TypeError(
f"all_reduce {context}: scratch must be a !pto.ptr buffer, "
f"got {scratch_type}"
) from None

Comment thread ptodsl/ptodsl/scalar.py
Comment on lines +287 to +291
text = str(ptr_type)
if ", ub>" in text or ", vec>" in text:
return 6
if ", gm>" in text or text.endswith(">"):
return 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The fallback check text.endswith(">") is too generic and will match local/private pointer types (e.g., !pto.ptr<f32, local>), incorrectly returning 1 (Global Memory address space) instead of 0 (Local/Private address space). This address space mismatch with llvm.alloca (which allocates in address space 0) can lead to LLVM compilation failures or invalid memory accesses.

Adding explicit checks for ", local>" and ", private>" before the fallback ensures they are correctly mapped to address space 0.

Suggested change
text = str(ptr_type)
if ", ub>" in text or ", vec>" in text:
return 6
if ", gm>" in text or text.endswith(">"):
return 1
text = str(ptr_type)
if ", ub>" in text or ", vec>" in text:
return 6
if ", local>" in text or ", private>" in text:
return 0
if ", gm>" in text or text.endswith(">"):
return 1

Comment on lines +288 to +292
def _emit_vec_binary_op(op_name: str, lhs, rhs):
lhs_raw = unwrap_surface_value(lhs)
rhs_raw = unwrap_surface_value(rhs)
if not (VectorType.isinstance(lhs_raw.type) and VectorType.isinstance(rhs_raw.type)):
raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If either lhs or rhs is a raw Python scalar (e.g., 2.0 or 3), unwrap_surface_value returns the raw Python value, which does not have a type attribute. Accessing .type directly on it will raise an AttributeError, masking the intended TypeError with a cryptic traceback.

Using getattr to safely retrieve the types prevents this issue.

Suggested change
def _emit_vec_binary_op(op_name: str, lhs, rhs):
lhs_raw = unwrap_surface_value(lhs)
rhs_raw = unwrap_surface_value(rhs)
if not (VectorType.isinstance(lhs_raw.type) and VectorType.isinstance(rhs_raw.type)):
raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands")
def _emit_vec_binary_op(op_name: str, lhs, rhs):
lhs_raw = unwrap_surface_value(lhs)
rhs_raw = unwrap_surface_value(rhs)
lhs_type = getattr(lhs_raw, "type", None)
rhs_type = getattr(rhs_raw, "type", None)
if lhs_type is None or rhs_type is None or not (VectorType.isinstance(lhs_type) and VectorType.isinstance(rhs_type)):
raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands")

andodo added 2 commits June 25, 2026 18:14
# Conflicts:
#	ptodsl/docs/user_guide/06-scalar-and-pointer-ops.md
#	ptodsl/ptodsl/_ops.py
#	ptodsl/ptodsl/_runtime_scalar_ops.py
#	ptodsl/ptodsl/_surface_values.py
#	ptodsl/ptodsl/_tracing/session.py
#	ptodsl/ptodsl/pto.py
#	ptodsl/tests/test_jit_compile.py
@and0d0

and0d0 commented Jun 25, 2026

Copy link
Copy Markdown
Author

/review

@and0d0 and0d0 marked this pull request as draft June 25, 2026 10:54
@and0d0

and0d0 commented Jun 25, 2026

Copy link
Copy Markdown
Author

/review


| Scope | Storage | Returned value | Typical use | Layout notes |
|-------|---------|----------------|-------------|--------------|
| `"ub"` | Function-level Unified Buffer scratch | Typed `!pto.ptr<T, ub>` | MTE source/destination buffers, cross-SIMT scratch such as reductions | Contributes to `dyn_shared_memory_buf`; the frontend may insert alignment padding between allocations |

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

考虑直接删除ub部分

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不删了,改回手写读起来有点困难

@HecreReed

Copy link
Copy Markdown
Collaborator

/run a3

@reedhecre

Copy link
Copy Markdown

已接收 /run a3,A3 板测器会处理这条请求。

页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。

@reedhecre

Copy link
Copy Markdown

A3 板测失败详情:PR #868

rowexpanddiv

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=18.47865390777588 at idx=963 (golden=10.328862190246582, out=-8.149791717529297, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:14:55] ERROR: testcase failed (exit 2): rowexpanddiv
xor

stage=run info=exit=2

[ERROR] Mismatch: golden_v2.bin vs v2.bin, max diff=255.0 at idx=312 (golden=0, out=-255, dtype=int16)
[ERROR] compare failed
[2026-06-26 13:18:37] ERROR: testcase failed (exit 2): xor
rowexpandmul

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=14.702873229980469 at idx=206 (golden=8.388465881347656, out=-6.3144073486328125, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:22:43] ERROR: testcase failed (exit 2): rowexpandmul
quant_asym

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:53:58: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
    TASSIGN_IMPL(src_s32, reinterpret_cast<uintptr_t>(tmp.data()));
                                                      ~~~^
                                                         ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:117:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_ASYM, Tile<TileType::Vec, uint8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v29, v23, v25, v31);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:98:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src0.data(), src1.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:42:5: note: in instantiation of function template specialization 'pto::TROWEXPANDMUL_IMPL<pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TROWEXPANDMUL_IMPL(src, src, scale, tmp);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:117:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_ASYM, Tile<TileType::Vec, uint8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v29, v23, v25, v31);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:97:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc0, TileDataSrc1, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:103:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src1.data(), src0.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:102:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc1, TileDataSrc0, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:137:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:98:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src0.data(), src1.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:45:9: note: in instantiation of function template specialization 'pto::TROWEXPANDADD_IMPL<pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
        TROWEXPANDADD_IMPL(src, src, *offset, tmp);
        ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:117:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_ASYM, pto::Tile<pto::TileType::Vec, unsigned char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_ASYM, Tile<TileType::Vec, uint8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v29, v23, v25, v31);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:137:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:97:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandAddOp<T>, TileDataDst, TileDataSrc0, TileDataSrc1, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandAddOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant_asym/quant_asym_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:137:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:103:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src1.data(), src0.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandAdd.hpp:102:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandAddOp<T>, TileDataDst, TileDataSrc1, TileDataSrc0, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandAddOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
9 errors generated.
gmake[2]: *** [CMakeFiles/quant_asym_kernel.dir/build.make:76: CMakeFiles/quant_asym_kernel.dir/quant_asym_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/quant_asym_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-06-26 13:22:58] ERROR: testcase failed (exit 2): quant_asym
quant

stage=run info=exit=2

/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:53:58: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
    TASSIGN_IMPL(src_s32, reinterpret_cast<uintptr_t>(tmp.data()));
                                                      ~~~^
                                                         ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:108:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_SYM, Tile<TileType::Vec, int8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v22, v18, v20);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:98:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src0.data(), src1.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TQuant.hpp:42:5: note: in instantiation of function template specialization 'pto::TROWEXPANDMUL_IMPL<pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TROWEXPANDMUL_IMPL(src, src, scale, tmp);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:2290:5: note: in instantiation of function template specialization 'pto::TQUANT_IMPL<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *>' requested here
    TQUANT_IMPL<quant_type, TileDataOut, TileDataSrc, TileDataPara>(dst, src, scale, offset);
    ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:108:3: note: in instantiation of function template specialization 'pto::TQUANT<pto::QuantType::INT8_SYM, pto::Tile<pto::TileType::Vec, signed char, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>>' requested here
  TQUANT<pto::QuantType::INT8_SYM, Tile<TileType::Vec, int8_t, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 32, BLayout::RowMajor, 32, 32, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>, Tile<TileType::Vec, float, 32, 1, BLayout::ColMajor, 32, 1, SLayout::NoneBox, 512, PadValue::Null, CompactMode::Null>>(v22, v18, v20);
  ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:97:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc0, TileDataSrc1, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/Quant/quant/quant_kernel.cpp:32:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/pto-inst.hpp:30:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr.hpp:18:
In file included from /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/common/pto_instr_impl.hpp:141:
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:103:54: error: member reference type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' is a pointer; did you mean to use '->'?
            dst.data(), src1.data(), src0.data(), tmp.data(), validRow, validCol);
                                                  ~~~^
                                                     ->
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandMul.hpp:102:9: error: no matching function for call to 'TRowExpandBin'
        TRowExpandBin<RowExpandMulOp<T>, TileDataDst, TileDataSrc1, TileDataSrc0, TileDataTmp>(
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:211:26: note: candidate template ignored: substitution failure [with Op = pto::RowExpandMulOp<float>, TileDataDst = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc0 = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataSrc1 = pto::Tile<pto::TileType::Vec, float, 32, 32, pto::BLayout::RowMajor, 32, 32, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null>, TileDataTmp = pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *]: type 'pto::Tile<pto::TileType::Vec, float, 32, 1, pto::BLayout::ColMajor, 32, 1, pto::SLayout::NoneBox, 512, pto::PadValue::Null, pto::CompactMode::Null> *' cannot be used prior to '::' because it has no members
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/payload/pto-isa/include/pto/npu/a2a3/TRowExpandBinOp.hpp:188:26: note: candidate function template not viable: requires 5 arguments, but 6 were provided
__tf__ PTO_INTERNAL void TRowExpandBin(typename TileDataDst::TileDType __out__ dst,
                         ^
5 errors generated.
gmake[2]: *** [CMakeFiles/quant_kernel.dir/build.make:76: CMakeFiles/quant_kernel.dir/quant_kernel.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:85: CMakeFiles/quant_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
[2026-06-26 13:23:03] ERROR: testcase failed (exit 2): quant
scatter

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=4.940644979476929 at idx=27 (golden=2.725048542022705, out=-2.2155964374542236, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:23:41] ERROR: testcase failed (exit 2): scatter
rowexpandsub

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=5.565198540687561 at idx=202 (golden=1.555970311164856, out=-4.009228229522705, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:25:43] ERROR: testcase failed (exit 2): rowexpandsub
rope_kv_cache

stage=run info=exit=2

[ERROR] Mismatch (bf16 golden_v1.bin vs v1.bin): max ulp diff=30463 at idx=26720 (golden_bits=48012, out_bits=15218, golden=-0.0042724609375, out=0.003692626953125)
[ERROR] compare failed
[2026-06-26 13:25:49] ERROR: testcase failed (exit 2): rope_kv_cache
sels

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=66.99866366386414 at idx=92 (golden=-2.9986636638641357, out=64.0, dtype=float32)
[ERROR] compare failed
[2026-06-26 13:27:08] ERROR: testcase failed (exit 2): sels
tprefetch_async_binding

stage=run info=exit=1

[SDMA] aclrtSynchronizeStream (aicpu) failed
[WARN] SdmaWorkspaceManager::Init failed - TPREFETCH_ASYNC will fall back to no-op prefetch
[ERROR] aclrtSynchronizeStream(stream) failed: 507018 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260626_125905_manual_pr868/npu_validation/TPrefetchAsync/tprefetch_async_binding/main.cpp:132)
[ERROR] RecentErrMsg: E39999: Inner Error!
E39999[PID: 3973399] 2026-06-26-13:27:51.295.779 (E39999):  The error from device(chipId:0, dieId:1), serial number is 330, an exception occurred during AICPU execution, stream_id:45, task_id:0, errcode:0, msg:aicpu execute failed.[FUNC:ProcessStarsAicpuErrorInfo][FILE:device_error_proc.cc][LINE:1644]
        TraceBack (most recent call last):
       Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       Aicpu kernel execute failed, device_id=1, stream_id=45, task_id=0, soName=libcpu_kernels.so, funcName=RunCpuKernel, kernelName=ShmemSdmaStarsQuery, errorCode=0x2a.[FUNC:PrintAicpuErrorInfo][FILE:davinci_kernel_task.cc][LINE:1435]
       rtStreamSynchronize execution failed, reason=aicpu exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507018[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
       Failed to submit kernel task, retCode=0x715002a.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:1223]
       kernel launch submit failed.[FUNC:LaunchKernel][FILE:context.cc][LINE:1349]
       rtKernelLaunch execution failed, reason=aicpu exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
[2026-06-26 13:30:22] ERROR: testcase failed (exit 1): tprefetch_async_binding
partmin

stage=run info=exit=2

[ERROR] Mismatch: golden_v3.bin vs v3.bin, max diff=nan at idx=112 (golden=-0.0, out=nan, dtype=float16)
[ERROR] compare failed
[2026-06-26 13:33:13] ERROR: testcase failed (exit 2): partmin

Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated
Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated
Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated
core_id = pto.get_block_idx()
frag_elems: pto.const_expr = rounds * lanes

w_ub = pto.alloc_buffer((hidden_size,), pto.f32, scope="ub")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pto.alloc_buffer的地址谁在分配?能否使用alloc_tile?
我以为pto.alloc_buffer是在simt kernel内部使用的,如果新引入一个alloc_buffer op是否会对已有pass产生冲击?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ub部分是tracing里分配的,会在trace过程中生成已有ir(参考golden中的内容对应的ir),后续都有已有的pass去处理,我认为实际冲击比较小?不直接使用已有ir,是因为alloc buffer ub包了一层自动维护实际的buffer ptr的部分,不用手动声明整个buffer块有多大,每次从哪里放到哪里。alloc_tile.as_ptr()会走tile的op多绕一圈,且需要显式传递addr,但最终形态能符合。我不确定kernel层面哪样比较好?给我一点意见吧

Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated
Comment thread ptodsl/examples/rmsnorm_alloc_buffer_simt.py Outdated

| Scope | Storage | Returned value | Typical use | Layout notes |
|-------|---------|----------------|-------------|--------------|
| `"ub"` | Function-level Unified Buffer scratch | Typed `!pto.ptr<T, ub>` | MTE source/destination buffers, cross-SIMT scratch such as reductions | Contributes to `dyn_shared_memory_buf`; the frontend may insert alignment padding between allocations |

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不删了,改回手写读起来有点困难

w_vec = scalar.load(w_ub, ub_offset, contiguous=lanes)
scalar.store(w_vec, w_frag, frag_offset)


Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

每token的其中一个lane做的事

core_id = pto.get_block_idx()
frag_elems: pto.const_expr = rounds * lanes

w_ub = pto.alloc_buffer((hidden_size,), pto.f32, scope="ub")

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ub部分是tracing里分配的,会在trace过程中生成已有ir(参考golden中的内容对应的ir),后续都有已有的pass去处理,我认为实际冲击比较小?不直接使用已有ir,是因为alloc buffer ub包了一层自动维护实际的buffer ptr的部分,不用手动声明整个buffer块有多大,每次从哪里放到哪里。alloc_tile.as_ptr()会走tile的op多绕一圈,且需要显式传递addr,但最终形态能符合。我不确定kernel层面哪样比较好?给我一点意见吧

@reedhecre

Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

  • PR: Rms Norm Impl #868 Rms Norm Impl
  • Author: and0d0
  • Base/Head: main / rmsNorm_merge_test
  • Head SHA: 1835bc14c685
  • Trigger: /review
  • Generated At: 2026-06-29T04:35:11Z
  • Requested By: and0d0
  • Trigger Comment: Rms Norm Impl #868 (comment)
  • Status: running

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre

Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

  • PR: Rms Norm Impl #868 Rms Norm Impl
  • Author: and0d0
  • Base/Head: main / rmsNorm_merge_test
  • Head SHA: 1835bc14c685
  • Trigger: /review
  • Generated At: 2026-06-29T05:11:03Z
  • Requested By: and0d0
  • Trigger Comment: Rms Norm Impl #868 (comment)
  • Status: completed

Summary

PR #868 存在 3 个问题:simt_allreduce_sumthread_offset 语义在多条路径下会算错,新引入的 LLVM 依赖会让部分环境导入 PTODSL 直接失败,alloc_buffer(scope='ub') 的 scratch 规划也会把顺序执行的子模块错误地累计到同一份 UB 配额里。

Findings

  1. P1 `simt_allreduce_sum(..., thread_offset=...)` 在 shuffle/redux 路径下会归约错误的 lanes ptodsl/ptodsl/_allreduce.py:425

thread_offset 目前只是把 tid_x 重新编号成逻辑 tx,但 helper 之后仍直接使用物理 warp 语义的 pto.redux_add / pto.shuffle_bfly。例如 threads=128, thread_offset=4 时,这段代码把 lane 4..35 当成逻辑 warp0(tx = tid_x - offset,再按 >> 5 / & 31 分组),但硬件归约/shuffle 仍按物理 0..3132..63 分 warp,结果 stage1/stage4 会把错误的 lane 混进来。单 warp 的 butterfly 路径也一样:_emit_butterfly() 完全没有使用 offset。当前实现也没有限制 thread_offset 必须 warp 对齐或保证子组不跨物理 warp,所以这是可触发的错误结果。

  1. P2 新功能把 `mlir.dialects.llvm` 变成了 PTODSL 的导入时硬依赖 ptodsl/ptodsl/_builtin_vector.py:16

这里直接 from mlir.dialects import llvm,而 pto.py 会在导入时立刻加载这个模块;scalar.py 也做了同样的硬依赖。与此同时,_bootstrap.py 仍把 LLVM dialect 当作“可能不存在”的可选依赖处理。结果是:即使调用方完全不使用 pto.vecalloc_buffer(scope='local') 或 contiguous vector load/store,只要环境里没有 mlir.dialects.llvm(这是此前允许的安装形态),from ptodsl import pto / from ptodsl import scalar 就会直接 ImportError。这是明确的兼容性回归。

  1. P2 `alloc_buffer(scope='ub')` 的 scratch 规划按整个 trace session 累加,不能按函数/生命周期复用 ptodsl/ptodsl/_tracing/session.py:226

allocate_ub_scratch() 用的是 session 级别的 _ub_scratch_next_byte 单调累加所有 scope='ub' 分配,并在 trace 结束时一次性把总大小写回 entry 函数;这个计数器不会随 enter_function() 切换而重置,也没有任何 lifetime reuse。这样一来,只要一个 entry 调用了多个各自有私有 UB scratch 的子模块/辅助函数,这些 scratch 就会被永久叠加到同一块地址空间里。顺序执行、本可复用的 scratch 也会被算成总和,容易把模块化后的 kernel 平白推过 UB 容量上限,导致编译/资源检查失败。

@reedhecre

reedhecre commented Jun 29, 2026

Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

  • PR: Rms Norm Impl #868 Rms Norm Impl
  • Author: and0d0
  • Base/Head: main / rmsNorm_merge_test
  • Head SHA: 709d9362d80b
  • Trigger: PR 有新提交
  • Generated At: 2026-07-01T01:30:49Z
  • Previous Head SHA: 59ad97ed5c3f
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

未生成结构化 findings,因为 review 过程提前失败。

Log Tail

作者:and0d0
base branch:origin/main
head branch:HEAD(当前已 checkout 到 PR head)

要求:
1. 只审查这个 PR 相对 origin/main 的改动,必要时可以看上下文文件。
2. 重点找真实的 correctness / regression / contract mismatch / CI / runtime / compatibility 问题。
3. 不要提纯风格建议,不要提低价值猜测。
4. 严格按优先级输出:
   - P1:高概率会导致错误结果、编译/运行失败、严重回归、发布阻断
   - P2:重要缺陷、行为回归、遗漏校验/测试、较大兼容性问题
   - P3:次要但明确可改的问题
5. 如果没有问题,summary 直接写:未检查到 PR #868 存在问题,并返回 findings=[]。
6. 如果有问题,summary 简洁概括,findings 里每条都要给出:
   - severity
   - title
   - body(说明为什么是问题,尽量具体)
   - file(尽量给相对路径)
   - line(能确定就填整数,否则 null)

建议先查看:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

最终输出必须严格匹配 JSON schema。

mcp startup: no servers
exec
/bin/bash -lc 'git status --short' in /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/repo succeeded in 0ms:

exec
/bin/bash -lc 'git diff --stat origin/main...HEAD' in /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/repo succeeded in 0ms:
 ptodsl/README.md                                   |  20 +
 ptodsl/docs/user_guide/01-introduction.md          |   4 +-
 .../user_guide/03-kernel-entry-and-subkernels.md   |  25 +-
 .../docs/user_guide/04-type-system-and-buffer.md   |  30 +-
 .../docs/user_guide/06-scalar-and-pointer-ops.md   |  65 +-
 ptodsl/docs/user_guide/08-compute-operations.md    |  40 ++
 ptodsl/docs/user_guide/13-simt-micro-ops.md        |   4 +-
 .../examples/rms_norm/rmsnorm_alloc_buffer_simt.py | 236 +++++++
 .../rmsnorm_alloc_buffer_simt_launch_common.py     | 120 ++++
 .../rmsnorm_alloc_buffer_simt_manual_launch.py     | 292 +++++++++
 ptodsl/ptodsl/_allreduce.py                        | 355 +++++++++++
 ptodsl/ptodsl/_bootstrap.py                        |   6 +
 ptodsl/ptodsl/_builtin_vector.py                   |  49 ++
 ptodsl/ptodsl/_jit.py                              |  16 +
 ptodsl/ptodsl/_ops.py                              |  89 ++-
 ptodsl/ptodsl/_scalar_adaptation.py                |   6 +-
 ptodsl/ptodsl/_subkernels.py                       |  21 +-
 ptodsl/ptodsl/_surface_values.py                   |  75 ++-
 ptodsl/ptodsl/_tracing/module_builder.py           |   1 +
 ptodsl/ptodsl/_tracing/session.py                  |  32 +-
exec
/bin/bash -lc 'git diff --unified=80 origin/main...HEAD -- ptodsl/ptodsl/_allreduce.py ptodsl/ptodsl/_ops.py ptodsl/ptodsl/scalar.py ptodsl/ptodsl/_types.py ptodsl/ptodsl/_surface_values.py ptodsl/ptodsl/_jit.py ptodsl/ptodsl/_builtin_vector.py ptodsl/ptodsl/_subkernels.py ptodsl/ptodsl/_scalar_adaptation.py ptodsl/ptodsl/pto.py' in /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/repo succeeded in 0ms:
diff --git a/ptodsl/ptodsl/_allreduce.py b/ptodsl/ptodsl/_allreduce.py
new file mode 100644
index 00000000..3c6637bd
--- /dev/null
+++ b/ptodsl/ptodsl/_allreduce.py
@@ -0,0 +1,355 @@
+# Copyright (c) 2026 Huawei Technologies Co., Ltd.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+"""
+SIMT cross-workitem all-reduce.
+
+All-reduce ops are emitted **inline** at the current insertion point.
+Three reducer variants: ``simt_allreduce_sum``, ``simt_allreduce_max``, ``simt_allreduce_min``.
+
+Dispatch tree (compile-time, since *threads* / *scale* are Python ints)::
ERROR: Quota exceeded. Check your plan and billing details.
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260701_092527_pr868/codex_last_message.json
tokens used
19,550
===== END STAGE codex-review rc=1 @ 2026-07-01 09:30:49 =====

Comment thread ptodsl/docs/user_guide/04-type-system-and-buffer.md Outdated
default and others added 5 commits June 30, 2026 15:06
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Path 3 now emits the cross-warp allreduce body directly in the caller instead of returning through a SIMT helper call. Paths 1, 2, and 4 still use helper calls and should be converted to inline emission in a follow-up.

Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
@and0d0 and0d0 force-pushed the rmsNorm_merge_test branch from 5f0b843 to 28fa9ef Compare June 30, 2026 09:07
default and others added 3 commits June 30, 2026 20:39
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
@and0d0 and0d0 force-pushed the rmsNorm_merge_test branch from cb693a0 to 433ff88 Compare June 30, 2026 14:07
andodo added 2 commits June 30, 2026 22:24
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
name = _helper_name(dtype, threads, scale, thread_offset)
args = dict(dtype=dtype, threads=threads, scale=scale,
thread_offset=thread_offset)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

以下其他path没有验证,在simt使用中可能会造成无法正确return值的问题
解决方法:inline函数
目前只inline了path 3,其余待修改验证。

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

等待学坤版本合并

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

学坤版本已合并

Comment thread ptodsl/docs/user_guide/04-type-system-and-buffer.md Outdated
Comment thread ptodsl/ptodsl/_runtime/codegen.py Outdated
return f"ptodsl_launch_{ir_function_name}"

def generate_launch_cpp(*, ir_function_name: str, kernel_signature) -> str:
def generate_launch_cpp(*, ir_function_name: str, kernel_signature, dyn_shared_bytes: int = 0) -> str:

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

需要人肉review一下具体实现

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不该改动原codegen,在testcase中绕过

Comment thread ptodsl/ptodsl/_allreduce.py Outdated

# ── Path 3: cross_warp_reduce ────────────────────────────────────────
if scale <= 32 and _is_pow2(threads) and _is_pow2(scale):
return _emit_cross_warp_reduce_inline(

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已改为inline版本

y_offset = pingpong * hidden_size + lane_base
frag_offset = r * lanes

x_vec = scalar.load(x_frag, frag_offset, contiguous=lanes)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里,从x_frag读(llvm.load !llvm.ptr -> vector),在thread=64/32时会出问题。

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

去掉此处下方y计算从ub直接加载,thread32全部与t=64,r=32,l=2 依旧失败

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

所以从local x_frag alloca 做 vector load不是唯一问题?

@and0d0

and0d0 commented Jul 1, 2026

Copy link
Copy Markdown
Author

/review

@reedhecre

Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

  • PR: Rms Norm Impl #868 Rms Norm Impl
  • Author: and0d0
  • Base/Head: main / rmsNorm_merge_test
  • Head SHA: 709d9362d80b
  • Trigger: /review
  • Generated At: 2026-07-01T02:25:06Z
  • Requested By: and0d0
  • Trigger Comment: Rms Norm Impl #868 (comment)
  • Status: running

Summary

收到 /review,正在执行 Codex review。

Findings

Review in progress.

@reedhecre

Copy link
Copy Markdown

Manual Codex Review

该评论由 /review 手动触发。

  • PR: Rms Norm Impl #868 Rms Norm Impl
  • Author: and0d0
  • Base/Head: main / rmsNorm_merge_test
  • Head SHA: 709d9362d80b
  • Trigger: /review
  • Generated At: 2026-07-01T02:48:39Z
  • Requested By: and0d0
  • Trigger Comment: Rms Norm Impl #868 (comment)
  • Status: completed

Summary

发现 3 个高价值问题:dyn_shared_memory_buf 没有接入默认 launcher、simt_allreduce 的非零 thread_offset 路径会算错,以及对 mlir.dialects.llvm 的顶层硬依赖会让部分环境直接 import 失败。

Findings

  1. P1 `dyn_shared_memory_buf` 没有接入默认 launch 路径 ptodsl/ptodsl/_runtime/codegen.py:128

这个 PR 把 dyn_shared_memory_buf 暴露成 @pto.jit(...) 的公开参数,并且新的 RMSNorm kernel 明确依赖 82496B 的动态 UB;但默认 runtime 仍然生成 kernel<<<grid, nullptr, stream>>>。也就是说 compiled[grid, stream](...) 会始终以 0B 动态共享内存启动这些 kernel。rmsnorm_alloc_buffer_simt_manual_launch.py 已经在用手写 launcher 绕过这个问题,说明标准 launch 路径当前是跑不通的;用户按正常 PTODSL API 启动时会直接落到错误的 UB scratch 布局,结果要么运行失败,要么静默算错。

  1. P1 `simt_allreduce(..., thread_offset!=0)` 会把错误的 lane 混进归约 ptodsl/ptodsl/_allreduce.py:108

这里直接用 get_tid_x() - thread_offset 重新编号,然后仍然调用 warp-local 的 redux_* / shuffle_bfly。函数本身既没有屏蔽 tid_x < thread_offset 的线程,也没有要求 thread_offset 做 warp 对齐,所以像新增测试里使用的 threads=16, thread_offset=4 这种调用,0..3 号 lane 仍会参与 reduction,而后续 lane 也会被错误并入相邻 group;threads>32 路径同样会把跨物理 warp 的 logical group 当成一个 warp 来归约,结果会直接算错。现在的新增测试只检查 IR 里出现了 subi,没有任何功能验证,所以这个错误很容易直接带进发布。

  1. P2 新的 `pto.vec`/contiguous access 让 PTODSL 对 llvm Python dialect 变成硬依赖 ptodsl/ptodsl/_builtin_vector.py:16

_bootstrap.py 明明已经把 mlir.dialects.llvm 视为可选依赖,但这里仍在模块顶层无条件 from mlir.dialects import llvm。再加上 pto.py 现在会 eager import _builtin_vector,于是只要环境里没有 LLVM python bindings,from ptodsl import pto 就会直接 ModuleNotFoundError,即便用户根本没有使用 pto.vec 或 contiguous load/store。scalar.py 也引入了同样的顶层硬依赖,所以这是整个 PTODSL import 路径的兼容性回退。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants