Rms Norm Impl#868
Conversation
…l-reduce Implement the pto.simt_allreduce_sum frontend interface as designed in mission/483/483_docs.md. Pure Python MLIR IR emission with three dispatch strategies: warp_reduce (<=32 threads, pow2), cross_warp_reduce (>32, pow2), ub_reduce (fallback). Supports f32 and f16. - ptodsl/ptodsl/_allreduce.py: new — 674 lines - ptodsl/ptodsl/pto.py: export simt_allreduce_sum (+3 lines) - ptodsl/tests/test_allreduce.py: new — 533 lines, all passing Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces explicit scratch buffer allocation via pto.alloc_buffer, contiguous vector memory operations for scalar.load and scalar.store, and SIMT cross-workitem all-reduce helpers (simt_allreduce_sum). The feedback highlights three robustness improvements: safely retrieving the type attribute in scratch validation and vector binary operations to prevent unhandled AttributeErrors when raw Python values are passed, and explicitly handling local/private pointer types in address space mapping to avoid incorrect compilation outputs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| ptr_type = _pto.PtrType(raw_scratch.type) | ||
| except Exception: | ||
| raise TypeError( | ||
| f"all_reduce {context}: scratch must be a !pto.ptr buffer, " | ||
| f"got {raw_scratch.type}" | ||
| ) from None |
There was a problem hiding this comment.
If raw_scratch does not have a type attribute (for example, if a non-surface Python value is passed), accessing raw_scratch.type inside the except block will raise an AttributeError. This masks the original exception and prevents a clean TypeError from being raised.
Using getattr or checking for the attribute beforehand avoids this secondary exception.
| try: | |
| ptr_type = _pto.PtrType(raw_scratch.type) | |
| except Exception: | |
| raise TypeError( | |
| f"all_reduce {context}: scratch must be a !pto.ptr buffer, " | |
| f"got {raw_scratch.type}" | |
| ) from None | |
| scratch_type = getattr(raw_scratch, "type", None) | |
| if scratch_type is None: | |
| raise TypeError( | |
| f"all_reduce {context}: scratch must be a !pto.ptr buffer, " | |
| f"got {type(raw_scratch).__name__}" | |
| ) | |
| try: | |
| ptr_type = _pto.PtrType(scratch_type) | |
| except Exception: | |
| raise TypeError( | |
| f"all_reduce {context}: scratch must be a !pto.ptr buffer, " | |
| f"got {scratch_type}" | |
| ) from None |
| text = str(ptr_type) | ||
| if ", ub>" in text or ", vec>" in text: | ||
| return 6 | ||
| if ", gm>" in text or text.endswith(">"): | ||
| return 1 |
There was a problem hiding this comment.
The fallback check text.endswith(">") is too generic and will match local/private pointer types (e.g., !pto.ptr<f32, local>), incorrectly returning 1 (Global Memory address space) instead of 0 (Local/Private address space). This address space mismatch with llvm.alloca (which allocates in address space 0) can lead to LLVM compilation failures or invalid memory accesses.
Adding explicit checks for ", local>" and ", private>" before the fallback ensures they are correctly mapped to address space 0.
| text = str(ptr_type) | |
| if ", ub>" in text or ", vec>" in text: | |
| return 6 | |
| if ", gm>" in text or text.endswith(">"): | |
| return 1 | |
| text = str(ptr_type) | |
| if ", ub>" in text or ", vec>" in text: | |
| return 6 | |
| if ", local>" in text or ", private>" in text: | |
| return 0 | |
| if ", gm>" in text or text.endswith(">"): | |
| return 1 |
| def _emit_vec_binary_op(op_name: str, lhs, rhs): | ||
| lhs_raw = unwrap_surface_value(lhs) | ||
| rhs_raw = unwrap_surface_value(rhs) | ||
| if not (VectorType.isinstance(lhs_raw.type) and VectorType.isinstance(rhs_raw.type)): | ||
| raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands") |
There was a problem hiding this comment.
If either lhs or rhs is a raw Python scalar (e.g., 2.0 or 3), unwrap_surface_value returns the raw Python value, which does not have a type attribute. Accessing .type directly on it will raise an AttributeError, masking the intended TypeError with a cryptic traceback.
Using getattr to safely retrieve the types prevents this issue.
| def _emit_vec_binary_op(op_name: str, lhs, rhs): | |
| lhs_raw = unwrap_surface_value(lhs) | |
| rhs_raw = unwrap_surface_value(rhs) | |
| if not (VectorType.isinstance(lhs_raw.type) and VectorType.isinstance(rhs_raw.type)): | |
| raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands") | |
| def _emit_vec_binary_op(op_name: str, lhs, rhs): | |
| lhs_raw = unwrap_surface_value(lhs) | |
| rhs_raw = unwrap_surface_value(rhs) | |
| lhs_type = getattr(lhs_raw, "type", None) | |
| rhs_type = getattr(rhs_raw, "type", None) | |
| if lhs_type is None or rhs_type is None or not (VectorType.isinstance(lhs_type) and VectorType.isinstance(rhs_type)): | |
| raise TypeError("PTODSL VecValue arithmetic expects compatible vector operands") |
# Conflicts: # ptodsl/docs/user_guide/06-scalar-and-pointer-ops.md # ptodsl/ptodsl/_ops.py # ptodsl/ptodsl/_runtime_scalar_ops.py # ptodsl/ptodsl/_surface_values.py # ptodsl/ptodsl/_tracing/session.py # ptodsl/ptodsl/pto.py # ptodsl/tests/test_jit_compile.py
|
/review |
|
/review |
|
|
||
| | Scope | Storage | Returned value | Typical use | Layout notes | | ||
| |-------|---------|----------------|-------------|--------------| | ||
| | `"ub"` | Function-level Unified Buffer scratch | Typed `!pto.ptr<T, ub>` | MTE source/destination buffers, cross-SIMT scratch such as reductions | Contributes to `dyn_shared_memory_buf`; the frontend may insert alignment padding between allocations | |
|
/run a3 |
|
已接收
页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。 |
A3 板测失败详情:PR #868rowexpanddiv
xor
rowexpandmul
quant_asym
quant
scatter
rowexpandsub
rope_kv_cache
sels
tprefetch_async_binding
partmin
|
| core_id = pto.get_block_idx() | ||
| frag_elems: pto.const_expr = rounds * lanes | ||
|
|
||
| w_ub = pto.alloc_buffer((hidden_size,), pto.f32, scope="ub") |
There was a problem hiding this comment.
pto.alloc_buffer的地址谁在分配?能否使用alloc_tile?
我以为pto.alloc_buffer是在simt kernel内部使用的,如果新引入一个alloc_buffer op是否会对已有pass产生冲击?
There was a problem hiding this comment.
ub部分是tracing里分配的,会在trace过程中生成已有ir(参考golden中的内容对应的ir),后续都有已有的pass去处理,我认为实际冲击比较小?不直接使用已有ir,是因为alloc buffer ub包了一层自动维护实际的buffer ptr的部分,不用手动声明整个buffer块有多大,每次从哪里放到哪里。alloc_tile.as_ptr()会走tile的op多绕一圈,且需要显式传递addr,但最终形态能符合。我不确定kernel层面哪样比较好?给我一点意见吧
|
|
||
| | Scope | Storage | Returned value | Typical use | Layout notes | | ||
| |-------|---------|----------------|-------------|--------------| | ||
| | `"ub"` | Function-level Unified Buffer scratch | Typed `!pto.ptr<T, ub>` | MTE source/destination buffers, cross-SIMT scratch such as reductions | Contributes to `dyn_shared_memory_buf`; the frontend may insert alignment padding between allocations | |
| w_vec = scalar.load(w_ub, ub_offset, contiguous=lanes) | ||
| scalar.store(w_vec, w_frag, frag_offset) | ||
|
|
||
|
|
| core_id = pto.get_block_idx() | ||
| frag_elems: pto.const_expr = rounds * lanes | ||
|
|
||
| w_ub = pto.alloc_buffer((hidden_size,), pto.f32, scope="ub") |
There was a problem hiding this comment.
ub部分是tracing里分配的,会在trace过程中生成已有ir(参考golden中的内容对应的ir),后续都有已有的pass去处理,我认为实际冲击比较小?不直接使用已有ir,是因为alloc buffer ub包了一层自动维护实际的buffer ptr的部分,不用手动声明整个buffer块有多大,每次从哪里放到哪里。alloc_tile.as_ptr()会走tile的op多绕一圈,且需要显式传递addr,但最终形态能符合。我不确定kernel层面哪样比较好?给我一点意见吧
Manual Codex Review该评论由
Summary收到 FindingsReview in progress. |
Manual Codex Review该评论由
SummaryPR #868 存在 3 个问题: Findings
这里直接
|
Codex Review该评论由 review 机器人自动更新。
SummaryReview failed at stage Findings未生成结构化 findings,因为 review 过程提前失败。 Log Tail |
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Path 3 now emits the cross-warp allreduce body directly in the caller instead of returning through a SIMT helper call. Paths 1, 2, and 4 still use helper calls and should be converted to inline emission in a follow-up. Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
5f0b843 to
28fa9ef
Compare
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
cb693a0 to
433ff88
Compare
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
Signed-off-by: andodo <andodo@andododeMacBook-Air.local>
| name = _helper_name(dtype, threads, scale, thread_offset) | ||
| args = dict(dtype=dtype, threads=threads, scale=scale, | ||
| thread_offset=thread_offset) | ||
|
|
There was a problem hiding this comment.
以下其他path没有验证,在simt使用中可能会造成无法正确return值的问题
解决方法:inline函数
目前只inline了path 3,其余待修改验证。
| return f"ptodsl_launch_{ir_function_name}" | ||
|
|
||
| def generate_launch_cpp(*, ir_function_name: str, kernel_signature) -> str: | ||
| def generate_launch_cpp(*, ir_function_name: str, kernel_signature, dyn_shared_bytes: int = 0) -> str: |
|
|
||
| # ── Path 3: cross_warp_reduce ──────────────────────────────────────── | ||
| if scale <= 32 and _is_pow2(threads) and _is_pow2(scale): | ||
| return _emit_cross_warp_reduce_inline( |
| y_offset = pingpong * hidden_size + lane_base | ||
| frag_offset = r * lanes | ||
|
|
||
| x_vec = scalar.load(x_frag, frag_offset, contiguous=lanes) |
There was a problem hiding this comment.
这里,从x_frag读(llvm.load !llvm.ptr -> vector),在thread=64/32时会出问题。
There was a problem hiding this comment.
去掉此处下方y计算从ub直接加载,thread32全部与t=64,r=32,l=2 依旧失败
There was a problem hiding this comment.
所以从local x_frag alloca 做 vector load不是唯一问题?
|
/review |
Manual Codex Review该评论由
Summary收到 FindingsReview in progress. |
Manual Codex Review该评论由
Summary发现 3 个高价值问题: Findings
这个 PR 把
这里直接用
|
未完成:
需要人工再确认一下:
已完成: