Category
Technical Debt (cleanup, refactor)
Component
Tests
Description
kernel_simt_scatter.cpp (the SIMT element-scatter ST kernel) currently
forks the scatter call on __CPU_SIM:
#ifdef __CPU_SIM
MSCATTER(outGlobal, srcTile, idxTile); // non-templated
#else
MSCATTER<Coalesce::Elem, ScatterAtomicOp::None, ScatterOOB::Skip>(...); // templated
#endif
The fork existed because pto-isa previously gated the templated MSCATTER
overloads behind PTO_NPU_ARCH_A5 only, so they were not visible to the CPU
simulator. The sim path therefore fell back to the non-templated form (whose
CPU-sim default happens to be Coalesce::Elem, matching our element-scatter
golden), while the onboard path selected Coalesce::Elem explicitly. See
pto-isa#164.
The pinned pto-isa (bumped to 016396b5 in #1156, the pto-isa#166 mechanism)
now opens the templated overloads to __CPU_SIM as well as PTO_NPU_ARCH_A5:
// build/pto-isa/include/pto/common/pto_instr.hpp:2049
#if defined(PTO_NPU_ARCH_A5) || defined(__CPU_SIM)
template <Coalesce Mode, ScatterAtomicOp Atomic, ScatterOOB Oob, ...>
PTO_INST RecordEvent MSCATTER(...) { ... MSCATTER_IMPL<Mode, Atomic, Oob>(...); }
#endif
So the same explicit templated call now compiles and runs identically on both
backends, and the #ifdef __CPU_SIM fork can be removed.
Important caveat — the non-templated form is still NOT portable. Only the
templated overload was unified. The non-templated MSCATTER(dst, src, idx)
still dispatches to each backend's own default MSCATTER_IMPL:
- CPU sim default → Coalesce::Elem (pto/cpu/MScatter.hpp:139)
- a5 onboard default → Coalesce::Row (pto/npu/a5/MScatter.hpp:456)
That is, the original pto-isa#164 divergence persists for the non-templated
surface. The only portable call is therefore the explicit
MSCATTER<Coalesce::Elem, ScatterAtomicOp::None, ScatterOOB::Skip>.
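The caveat can be illustrated with a small standalone mock. This is only a sketch: the namespaces mock_cpu_sim and mock_a5, the ScatterMode function, and the use of default template arguments are hypothetical stand-ins for the per-backend MSCATTER_IMPL defaults, not pto-isa code.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; this is not pto-isa code.
enum class Coalesce { Elem, Row };

// Each mock backend bakes in its own default mode, mimicking the
// per-backend defaults listed above.
namespace mock_cpu_sim {
template <Coalesce Mode = Coalesce::Elem>
constexpr Coalesce ScatterMode() { return Mode; }
}  // namespace mock_cpu_sim

namespace mock_a5 {
template <Coalesce Mode = Coalesce::Row>
constexpr Coalesce ScatterMode() { return Mode; }
}  // namespace mock_a5

// Defaulted calls diverge across the two mock backends;
// explicit calls resolve to the same mode on both.
inline void CheckMockPortability() {
    assert(mock_cpu_sim::ScatterMode() != mock_a5::ScatterMode());
    assert(mock_cpu_sim::ScatterMode<Coalesce::Elem>() ==
           mock_a5::ScatterMode<Coalesce::Elem>());
}
```

Calling CheckMockPortability() passes both assertions: the defaulted form silently picks a different mode per backend, while the explicit form pins the mode at the call site.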
Location
tests/st/a5/tensormap_and_ringbuffer/simt_basic/kernels/aiv/kernel_simt_scatter.cpp:85-89
Proposed Fix
Drop the #ifdef __CPU_SIM branch and call the explicit templated form
unconditionally on both sim and onboard:
MSCATTER<Coalesce::Elem, ScatterAtomicOp::None, ScatterOOB::Skip>(outGlobal, srcTile, idxTile);
Verified on --platform a5sim (1 passed). The onboard call site is
unchanged by this cleanup (it already used this exact instruction), so a5
behavior is unaffected; an a5 onboard rerun is still warranted to close the
loop (not available on the current a2a3 dev box).
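As a sanity reference for the onboard rerun, the intended behavior of the explicit instruction can be modeled on the host. This is a sketch under stated assumptions, not the pto-isa implementation: ElementScatterModel is a hypothetical helper, and it assumes Coalesce::Elem means one destination index per source element, ScatterAtomicOp::None means plain stores, and ScatterOOB::Skip means out-of-range indices are dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of an element scatter; not pto-isa code.
// Assumed semantics: out[idx[i]] = src[i] with plain stores, and any
// index outside [0, out.size()) is skipped.
inline std::vector<float> ElementScatterModel(std::vector<float> out,
                                              const std::vector<float>& src,
                                              const std::vector<int32_t>& idx) {
    // One index per source element.
    assert(src.size() == idx.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const int32_t j = idx[i];
        if (j < 0 || static_cast<std::size_t>(j) >= out.size()) {
            continue;  // out-of-bounds index: skip the store
        }
        out[static_cast<std::size_t>(j)] = src[i];
    }
    return out;
}
```

For example, scattering src = {1, 2, 3} with idx = {2, 7, 0} into a zeroed output of length 4 leaves the element for index 7 unwritten under these assumptions.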
Priority
Low (no impact today, good to fix eventually)