Backup/main latest by ly-ict · Pull Request #4 · PTO-ISA/SuperNPUBench

ly-ict · 2026-06-24T07:55:11Z

No description provided.

The AI workload runner invokes SuperNPUBench with the in-repo Linx LLVM toolchain. The old PLAT=linx defaults passed removed compiler flags and pulled hosted iostream/cmath into a bare target, so this updates the default Linx flags and starts the tileop data helper down a freestanding path. Constraint: compiler/llvm/build-linxisa-clang/bin/clang++ rejects -mlxbc and -enable-all-vector-as-tilereg. Rejected: Depend on a prebuilt musl sysroot | the superproject runner must be able to report the missing benchmark port before libc is fully staged. Confidence: medium Scope-risk: moderate Directive: Finish the JCore scalar-type and layout-header freestanding port before expecting SuperNPUBench tileop ELFs to build. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile smoke --case supernpu-tileop_api-TAdd --stop-after compiler-contract --run-id ai-supernpu-compile-smoke2 --continue-on-fail --compile-timeout 900 Not-tested: SuperNPUBench TAdd ELF production; current first failure is benchmark-owned source/toolchain surface mismatch.

The Linx AI bring-up runner needs a Tier-0 SuperNPUBench case that can produce a freestanding Linx ELF through the in-repo compiler and enter QEMU/model triage. This keeps hosted behavior intact while adding Linx-only shims for the minimal tileop_api TAdd path and a direct-boot _start finisher handoff. Constraint: Linx clang cannot compile the full hosted/JCore launch surface yet Constraint: The AI flow consumes a Linx ELF as the canonical handoff artifact Rejected: Port every tileop type and hosted diagnostic path now | broader than the Tier-0 smoke boundary Confidence: medium Scope-risk: moderate Directive: Keep Linx compatibility branches under __linx until the compiler supports the full hosted SuperNPUBench surface Tested: run_ai_workload_flow.py smoke supernpu-tileop_api-TAdd source/compiler/QEMU path Not-tested: Full SuperNPUBench tileop_api matrix on Linx

MatMul now has a Linx-friendly scalar tile fallback and a small direct-boot int64 smoke path so the AI workload runner can compile it to a Linx ELF and hand the same artifact to QEMU and the C++ model. The host SuperNPUBench path remains on the existing allocation/printf-based test flow. Constraint: Tier-0 AI bring-up needs a minimal SuperNPUBench matrix case without libc dependencies. Rejected: Port the full host MatMul matrix into the direct-boot path | it would add allocator and printf dependencies before the model can run the small kernel. Confidence: medium Scope-risk: narrow Directive: Keep the Linx direct-boot case small until MatMul reaches final green in LinxCoreModel. Tested: Superproject ai-matmul-generated-model-smoke run compiled MatMul, QEMU wrote 0x5555, and emitted a model-owned timeout packet. Tested: Superproject ai-tadd-generated-model-smoke run remained final-green. Not-tested: Full SuperNPUBench matrix and final MatMul execution in LinxCoreModel.

The AI workload flow had only TADD and MatMul as SuperNPUBench cases that could cross LLVM, QEMU, and LinxCoreModel. TSUB is the next narrow arithmetic case with the same unboxed integer-tile shape, so add a guarded Linx scalar implementation and a bounded direct-boot source path without changing ARM or CPU-sim behavior.\n\nConstraint: Current Linx SuperNPUBench smoke runtime supports small unboxed direct-boot cases, not host-libc or boxed/dynamic tile layouts.\nRejected: Port the full SuperNPUBench runtime in this change | too broad for a first Tier-1 promotion and would obscure per-op evidence.\nConfidence: high\nScope-risk: narrow\nDirective: Keep new Linx tileop promotions small and prove each through QEMU and model before widening source/runtime coverage.\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --run-id ai-pr-supernpu-tsub-exact-01 --case '=supernpu-tileop_api-TSub' --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'no-update promoted SuperNPUBench TSUB scalar direct-boot smoke'\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile smoke --run-id ai-smoke-regression-after-tsub-01 --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'no-update TSub promotion regression smoke'\nNot-tested: Full SuperNPUBench tileop_api matrix; remaining TSUBS and other tile APIs still need separate Linx implementations.

The AI workload flow now has scalar arithmetic coverage through TADD and TSUB; TAND is the next narrow logical tileop that shares the same unboxed int64 direct-boot pattern. Add a guarded Linx scalar implementation, expose it through the Linx tileop include table, and bound the test source to static direct-boot data under __linx.\n\nConstraint: Current Linx SuperNPUBench direct-boot lane supports small unboxed scalar tile cases, not the original host-libc output path.\nRejected: Port the broader SuperNPUBench runtime or boxed layouts here | the goal is one provable Tier-1 promotion per bounded change.\nConfidence: high\nScope-risk: narrow\nDirective: Keep future logical/arithmetic tileop promotions behind exact per-case QEMU and gfsim evidence.\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --run-id ai-pr-supernpu-tand-01 --case '=supernpu-tileop_api-TAnd' --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'no-update promoted SuperNPUBench TAND scalar direct-boot smoke'\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --run-id ai-pr-supernpu-tsub-tand-01 --case '=supernpu-tileop_api-TSub' --case '=supernpu-tileop_api-TAnd' --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'updated linx-superproject TAnd direct-boot promotion evidence'\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile smoke --run-id ai-smoke-regression-after-tand-01 --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'no-update TAnd promotion regression smoke'\nNot-tested: Full SuperNPUBench tileop_api matrix; remaining tileops still require separate Linx scalar/runtime support.

TOr is the companion logical tileop to the already promoted TAnd path. Add a guarded Linx scalar implementation, expose it in the Linx tileop include table, and give the test a bounded int64 direct-boot branch so the AI flow can produce a Linx ELF without the host-libc output path.\n\nConstraint: SuperNPUBench direct-boot promotion remains one exact tileop at a time with QEMU and gfsim evidence.\nRejected: Reuse the host malloc/printf path under Linx | direct-boot ELFs intentionally avoid host libc and soft-float dependencies.\nConfidence: high\nScope-risk: narrow\nDirective: Keep logical tileop promotions aligned with the TAnd/TOr scalar direct-boot pattern until broader runtime support exists.\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --run-id ai-pr-supernpu-tor-01 --case '=supernpu-tileop_api-TOr' --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'no-update promoted SuperNPUBench TOR scalar direct-boot smoke'\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --run-id ai-pr-supernpu-logic-01 --case '=supernpu-tileop_api-TAnd' --case '=supernpu-tileop_api-TOr' --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'updated linx-superproject TOr direct-boot promotion evidence'\nTested: python3 tools/bringup/run_ai_workload_flow.py --profile smoke --run-id ai-smoke-regression-after-tor-01 --qemu-timeout 60 --model-timeout 240 --model-build-timeout 3600 --continue-on-fail --skill-evolve-note 'no-update TOr promotion regression smoke'\nNot-tested: Full SuperNPUBench tileop_api matrix; remaining tileops still need separate Linx support.

TAdds was blocked at the source contract for Linx because the tile API had no scalar __linx implementation and the benchmark still used the host-oriented source path. Add the same bounded direct-boot shape used by the existing promoted tileops so the superproject AI flow can compile it into a Linx ELF, run it in QEMU, and then promote it through gfsim. Constraint: Linx direct-boot smoke must avoid host libc allocation/output paths and must expose _start pass/fail finisher writes. Rejected: Promote the full host-sized float/half/int matrix in this step | existing Tier-1 promotion pattern uses a bounded int64 RowMajor case first to keep QEMU/model triage narrow. Confidence: high Scope-risk: narrow Directive: Keep future SuperNPUBench Linx promotions exact-case and QEMU-green before running gfsim. Tested: AI flow ai-pr-supernpu-tadds-01 passed source, compile, QEMU, model-build-smoke, gfsim, differential triage, and fix-packet stages. Tested: AI flow ai-pr-supernpu-adds-01 passed TAdd and TAdds through QEMU and gfsim. Tested: AI flow ai-smoke-regression-after-tadds-01 passed 4/4 existing smoke cases. Not-tested: Full SuperNPUBench TAdds host-sized dtype matrix and full nightly AI workload matrix.

TSubs was blocked at the Linx source contract because the tile API implementation was not visible under __linx and the benchmark still followed the host allocation/output path. Add the same bounded scalar direct-boot shape as the promoted arithmetic tileops so the superproject flow can compile a Linx ELF, pass it in QEMU, and promote it through gfsim. Constraint: Linx direct-boot smoke must avoid host libc allocation/output paths and expose _start pass/fail finisher writes. Rejected: Promote the full host-sized dtype matrix in this step | Tier-1 promotion is intentionally narrowed to one int64 RowMajor direct-boot case for QEMU/model triage. Confidence: high Scope-risk: narrow Directive: Keep future scalar SuperNPUBench promotions exact-case and QEMU-green before model execution. Tested: AI flow ai-pr-supernpu-tsubs-01 passed source, compile, QEMU, model-build-smoke, gfsim, differential triage, and fix-packet stages. Tested: AI flow ai-pr-supernpu-subtracts-01 passed TSub and TSubs through QEMU and gfsim. Tested: AI flow ai-smoke-regression-after-tsubs-01 passed 4/4 existing smoke cases. Not-tested: Full SuperNPUBench TSubs host-sized dtype matrix and full nightly AI workload matrix.

TMul was blocked at the Linx source contract because the tile API implementation was not visible under __linx and the benchmark still used the host allocation/output path. Add a bounded scalar direct-boot implementation so the superproject flow can compile a Linx ELF, pass it in QEMU, and promote it through gfsim. Constraint: Linx direct-boot smoke must avoid host libc allocation/output paths and expose _start pass/fail finisher writes. Rejected: Promote the full host-sized float/half/int matrix in this step | Tier-1 promotion stays narrowed to one int64 RowMajor direct-boot case for QEMU/model triage. Confidence: high Scope-risk: narrow Directive: Keep future arithmetic SuperNPUBench promotions exact-case and QEMU-green before model execution. Tested: AI flow ai-pr-supernpu-tmul-01 passed source, compile, QEMU, model-build-smoke, gfsim, differential triage, and fix-packet stages. Tested: AI flow ai-pr-supernpu-arith-tmul-01 passed TAdd, TSub, TSubs, and TMul through QEMU and gfsim. Tested: AI flow ai-smoke-regression-after-tmul-01 passed 4/4 existing smoke cases. Not-tested: Full SuperNPUBench TMul host-sized dtype matrix and full nightly AI workload matrix.

TMuls was blocked at the Linx source contract because the tile API implementation was not visible under __linx and the benchmark still used the host allocation/output path. Add a bounded scalar direct-boot implementation so the superproject flow can compile a Linx ELF, pass it in QEMU, and promote it through gfsim. Constraint: Linx direct-boot smoke must avoid host libc allocation/output paths and expose _start pass/fail finisher writes. Rejected: Promote the full host-sized float/half/int matrix in this step | Tier-1 promotion stays narrowed to one int64 RowMajor direct-boot case for QEMU/model triage. Confidence: high Scope-risk: narrow Directive: Keep future arithmetic SuperNPUBench promotions exact-case and QEMU-green before model execution. Tested: AI flow ai-pr-supernpu-tmuls-01 passed source, compile, QEMU, model-build-smoke, gfsim, differential triage, and fix-packet stages. Tested: AI flow ai-pr-supernpu-multiply-01 passed TMul and TMuls through QEMU and gfsim. Tested: AI flow ai-smoke-regression-after-tmuls-01 passed 4/4 existing smoke cases. Not-tested: Full SuperNPUBench TMuls host-sized dtype matrix and full nightly AI workload matrix.

TMax was blocked at the Linx source contract because the tile API implementation was not visible under __linx and the benchmark still used the host allocation/output path. Add a bounded scalar direct-boot implementation so the superproject flow can compile a Linx ELF, pass it in QEMU, and promote it through gfsim. Constraint: Linx direct-boot smoke must avoid host libc allocation/output paths and expose _start pass/fail finisher writes. Rejected: Promote the full host-sized float/half/int matrix in this step | Tier-1 promotion stays narrowed to one int64 RowMajor direct-boot case for QEMU/model triage. Confidence: high Scope-risk: narrow Directive: Keep future comparison SuperNPUBench promotions exact-case and QEMU-green before model execution. Tested: AI flow ai-pr-supernpu-tmax-01 passed source, compile, QEMU, model-build-smoke, gfsim, differential triage, and fix-packet stages. Tested: AI flow ai-pr-supernpu-max-arith-01 passed TMax, TMul, and TMuls through QEMU and gfsim. Tested: AI flow ai-smoke-regression-after-tmax-01 passed 4/4 existing smoke cases. Not-tested: Full SuperNPUBench TMax host-sized dtype matrix and full nightly AI workload matrix.

TMaxs was blocked at the Linx source contract because the tile API did not expose a __linx scalar implementation and the host-sized test still relied on libc allocation/output. Add the bounded int64 RowMajor direct-boot path so the superproject AI bring-up runner can promote this exact case through compile, QEMU, and gfsim without widening the host test contract. Constraint: Linx direct-boot promotion must avoid host libc allocation/output and expose _start finisher semantics.\nRejected: Promote the full dtype/layout TMaxs matrix | Tier-1 promotion only has evidence for the bounded int64 RowMajor smoke.\nConfidence: high\nScope-risk: narrow\nDirective: Do not expand this case beyond the bounded Linx path without proving each added dtype/layout through QEMU and gfsim.\nTested: ai-pr-supernpu-tmaxs-01 source->compiler->QEMU->gfsim 1/1 green\nTested: ai-pr-supernpu-max-family-01 source->compiler->QEMU->gfsim 2/2 green\nTested: ai-smoke-regression-after-tmaxs-01 source->compiler->QEMU->gfsim 4/4 green\nNot-tested: Full host-sized SuperNPUBench TMaxs matrix and full nightly/full AI workload matrix

TAbs was blocked at the Linx source contract because the tile API implementation was not exposed for __linx and the cataloged test still used host-sized allocation/output. Add a bounded int64 RowMajor direct-boot path and scalar Linx TABS implementation so the AI bring-up loop can promote the exact manifest case through compile, QEMU, and gfsim. Constraint: Linx direct-boot promotion must avoid host libc allocation/output and expose _start finisher semantics.\nRejected: Promote the original float/half host-sized TAbs matrix | current Tier-1 evidence covers the bounded int64 direct-boot smoke only.\nConfidence: high\nScope-risk: narrow\nDirective: Do not expand TAbs dtype/layout coverage without proving each added path through QEMU and gfsim.\nTested: ai-pr-supernpu-tabs-01 source->compiler->QEMU->gfsim 1/1 green\nTested: ai-pr-supernpu-unary-family-01 source->compiler->QEMU->gfsim 1/1 green\nTested: ai-smoke-regression-after-tabs-01 source->compiler->QEMU->gfsim 4/4 green\nNot-tested: Original float/half host-sized TAbs matrix and full nightly/full AI workload matrix

TCopyIn and TCopyOut already have Linx implementations, but their cataloged sources instantiated dynamic boxed/Nz shapes that the Linx smoke copy contract rejects. Add bounded int64 RowMajor direct-boot paths so both manifest cases can promote through compile, QEMU, and gfsim without widening the existing host-sized coverage. Constraint: Linx direct-boot promotion must avoid host libc allocation/output and expose _start finisher semantics.\nRejected: Enable boxed or dynamic copy paths | current Linx TCOPYIN/TCOPYOUT smoke implementations intentionally support only unboxed tiles.\nConfidence: high\nScope-risk: narrow\nDirective: Do not remove the boxed-layout static asserts without proving the broader copy contract in compiler, QEMU, and gfsim.\nTested: ai-pr-supernpu-tcopyin-01 source->compiler->QEMU->gfsim 1/1 green\nTested: ai-pr-supernpu-tcopyout-01 source->compiler->QEMU->gfsim 1/1 green\nTested: ai-pr-supernpu-copy-family-01 source->compiler->QEMU->gfsim 2/2 green\nTested: ai-smoke-regression-after-tcopyinout-01 source->compiler->QEMU->gfsim 4/4 green\nNot-tested: Host-sized dynamic/boxed/Nz TCopyIn/TCopyOut matrix and full nightly/full AI workload matrix

TCopy was blocked in the Linx lane by missing TCOPY_Impl exposure and by the non-Linx vector/Nz path depending on unsupported direct-boot tile runtime contracts. This adds a bounded unboxed int64 RowMajor Linx path and routes the source through the same direct-boot finisher used by the promoted copy-family smokes. Constraint: SuperNPUBench PLAT=linx cases must link as direct-boot Linx ELFs with _start first and no host libc dependency. Rejected: Promote boxed, dynamic, or Nz TCOPY coverage in the same change | those paths still need separate runtime/model maturity evidence. Confidence: high Scope-risk: narrow Directive: Keep future SuperNPUBench tileop promotions bounded until the exact case passes QEMU and gfsim -f <elf>. Tested: ai-pr-supernpu-tcopy-01 1/1 final model green Tested: ai-pr-supernpu-copy-family-02 3/3 final model green Tested: ai-smoke-regression-after-tcopy-01 4/4 final model green Tested: git -C workloads/SuperNPUBench diff --check Not-tested: host-sized dynamic/boxed/Nz TCOPY and full AI workload matrix

TReshape was blocked in the Linx lane because the implementation header was not exposed to the __linx tile API set and the test source still exercised the host multi-type path. This adds an explicit Linx equal-element reshape copy and a bounded int64 direct-boot smoke that preserves the existing non-Linx implementation. Constraint: Row-major unboxed Linx tiles must satisfy the existing 32-byte tile alignment rule, so the direct smoke uses a 4x8 -> 8x4 int64 reshape. Rejected: Add TMin/TMins first | those sources are not active compile.all cases in the current PR-tier manifest. Rejected: Promote boxed or dynamic reshape coverage in this change | only the bounded manifest-backed direct-boot case has QEMU and gfsim evidence. Confidence: high Scope-risk: narrow Directive: Keep TReshape smoke shapes aligned to the tile byte contract unless the shared tile layout rule changes. Tested: ai-pr-supernpu-treshape-02 1/1 final model green Tested: ai-pr-supernpu-data-movement-01 4/4 final model green Tested: ai-smoke-regression-after-treshape-01 4/4 final model green Tested: git -C workloads/SuperNPUBench diff --check Not-tested: boxed, dynamic, non-manifest TMin/TMins, and full AI workload matrix

TTrans was blocked in the Linx lane because TTRANS_Impl was not exposed through the __linx tile API implementation set and the test still exercised the host multi-type path. This adds a scalar unboxed Linx transpose and a bounded int64 direct-boot source branch while preserving the non-Linx vector implementation. Constraint: The current test source uses matching input/output tile shape parameters, so the direct-boot smoke uses a square 4x4 int64 transpose. Rejected: Use a non-square TTrans smoke now | the test source does not yet pass distinct input/output tile dimensions. Rejected: Promote boxed or dynamic transpose coverage in this change | only the bounded manifest-backed direct-boot case has QEMU and gfsim evidence. Confidence: high Scope-risk: narrow Directive: Keep TTrans smoke square until the source test accepts distinct input/output tile shapes. Tested: ai-pr-supernpu-ttrans-01 1/1 final model green Tested: ai-pr-supernpu-data-movement-02 5/5 final model green Tested: ai-smoke-regression-after-ttrans-01 4/4 final model green Tested: git -C workloads/SuperNPUBench diff --check Not-tested: non-square, boxed, dynamic, and full AI workload matrix

TPad was manifest-backed but stopped at the Linx compiler boundary because the benchmark API layer did not expose a Linx implementation. This adds a scalar unboxed TPAD path for __linx, keeps the host vector-kernel path intact, and bounds the direct-boot smoke to a small int64 row-major case that avoids malloc, printf, assert, and other host runtime dependencies. Constraint: Linx direct-boot SuperNPUBench ELFs are linked nostdlib and must not require host libc headers or runtime symbols Rejected: Reuse the existing vector-kernel TPAD implementation for Linx | it depends on host/vector-launch contracts outside the current direct-boot model path Confidence: high Scope-risk: narrow Directive: Keep future Linx tileop smoke paths static, unboxed, and direct-boot until the runtime ABI intentionally supports broader host-style dependencies Tested: ai-pr-supernpu-tpad-02 final model green 1/1; ai-pr-supernpu-data-movement-03 final model green 6/6; ai-smoke-regression-after-tpad-01 final model green 4/4 Not-tested: Full SuperNPUBench TPad datatype/layout matrix

TCI was manifest-backed but stopped at the Linx compiler boundary because the Linx tile API include set did not expose a TCI implementation. Add a scalar unboxed Linx TCI path and a bounded direct-boot int32 smoke that covers row-major and col-major output without malloc, printf, or host runtime symbols. Constraint: Unboxed Linx tiles require 32-byte row/column alignment; the direct smoke uses 8x8 int32 so row-major Cols*bits and col-major Rows*bits are aligned Rejected: Keep the original 64x32 host test under __linx | it depends on host allocation/output and exercises more surface than needed for this stage boundary Confidence: high Scope-risk: narrow Directive: Keep TCI direct-boot smoke aligned to the unboxed tile byte contract before widening dtype or dynamic-shape coverage Tested: ai-pr-supernpu-tci-02 final model green 1/1; ai-pr-supernpu-init-data-01 final model green 7/7; ai-smoke-regression-after-tci-01 final model green 4/4 Not-tested: Full TCI dtype/layout matrix beyond bounded int32 row-major and col-major smoke

TExpandScalar was manifest-backed but stopped at the Linx compiler boundary because the Linx tile API include set did not expose a scalar expand implementation. Add a scalar unboxed Linx TEXPANDSCALAR path and a bounded direct-boot int64 smoke that covers row-major and col-major output without malloc, printf, or host runtime symbols. Constraint: Unboxed Linx tile shapes must satisfy the byte-alignment rule for both row-major and col-major layouts Rejected: Keep the original host test under __linx | it instantiates float, half, int8, dynamic shape, allocation, output, and free paths outside the current direct-boot target Confidence: high Scope-risk: narrow Directive: Keep TExpandScalar direct-boot smoke at 4x8 int64 unless the runtime broadens host and dynamic-shape support Tested: ai-pr-supernpu-texpandscalar-01 final model green 1/1; ai-pr-supernpu-scalar-data-01 final model green 8/8; ai-smoke-regression-after-texpandscalar-01 final model green 4/4 Not-tested: Full TExpandScalar dtype and dynamic-shape matrix

TExpandRow and TExpandCol were manifest-backed but stopped at the Linx compiler boundary because the Linx tile API include set did not expose their implementations. Add scalar unboxed Linx expand-row and expand-col paths plus bounded direct-boot int64 smokes that cover row-major and col-major output without malloc, printf, or host runtime symbols. Constraint: Unboxed Linx tile shapes must satisfy the byte-alignment rule for both row-major and col-major layouts Rejected: Keep the original host tests under __linx | they instantiate broad dtype/allocation/output paths outside the current direct-boot target Confidence: high Scope-risk: narrow Directive: Keep TExpandRow and TExpandCol direct-boot smokes at 4x8 int64 until broader runtime and dtype coverage is intentionally promoted Tested: ai-pr-supernpu-texpandrow-01 final model green 1/1; ai-pr-supernpu-texpandcol-01 final model green 1/1; ai-pr-supernpu-expand-data-01 final model green 10/10; ai-smoke-regression-after-texpand-row-col-01 final model green 4/4 Not-tested: Full TExpandRow/TExpandCol dtype and boxed-layout matrix

The AI workload flow identified TRowSum as benchmark-owned because the Linx tile API path did not expose TROWSUM_Impl. Add the Linx jcore include, a bounded scalar row-reduction implementation, and a direct-boot smoke branch that covers row-major and col-major int64 tiles without host libc dependencies. Constraint: SuperNPUBench PLAT=linx cases must link as direct-boot Linx ELFs with _start first at 0x10000 Rejected: Keep the host malloc/printf test path for Linx | direct-boot model promotion requires static bounded sources and no host libc dependency Confidence: high Scope-risk: narrow Directive: Keep the TRowSum Linx smoke at 4x8 int64 with output ValidCol == 1 unless a wider QEMU and gfsim proof updates the skill contract Tested: ai-pr-supernpu-trowsum-01 1/1 final model green Tested: ai-pr-supernpu-reduction-data-01 10/10 final model green Tested: ai-smoke-regression-after-trowsum-01 4/4 final model green

The AI workload flow classified TRowMax as benchmark-owned because the Linx tile API path did not expose TROWMAX_Impl and the host test source was not direct-boot adapted. Add the Linx jcore include, a bounded scalar max-reduction implementation, a static row-major and col-major int64 smoke, and a source-local freestanding memcpy helper for the compiler-generated copy path. Constraint: SuperNPUBench PLAT=linx cases must remain direct-boot Linx ELFs linked with -nostdlib and _start first at 0x10000 Rejected: Link a libc memcpy provider | the AI bring-up contract requires freestanding direct-boot workload ELFs Confidence: high Scope-risk: narrow Directive: Keep the TRowMax Linx smoke at 4x8 int64 with output ValidCol == 1 unless a wider QEMU and gfsim proof updates the skill contract Tested: ai-pr-supernpu-trowmax-02 1/1 final model green Tested: ai-pr-supernpu-reduction-data-02 11/11 final model green Tested: ai-smoke-regression-after-trowmax-01 4/4 final model green

The AI workload flow classified TRowSumExpand and TRowMaxExpand as benchmark-owned because the Linx tile API path did not expose their jcore implementations and the host test sources were not direct-boot adapted. Add bounded Linx scalar expand-reduction implementations, static row-major and col-major int64 smokes, and source-local freestanding memcpy helpers for the direct-boot copy path. Constraint: SuperNPUBench PLAT=linx cases must remain direct-boot Linx ELFs linked with -nostdlib and _start first at 0x10000 Rejected: Link host libc for memcpy | the AI workload contract requires freestanding direct-boot handoff artifacts Confidence: high Scope-risk: narrow Directive: Keep row expand smokes at 4x8 int64 and full output tile shape until wider QEMU plus gfsim evidence updates the skill contract Tested: ai-pr-supernpu-rowexpand-01 2/2 final model green Tested: ai-pr-supernpu-row-reduction-family-01 13/13 final model green Tested: ai-smoke-regression-after-rowexpand-01 4/4 final model green

TCmp now has a Linx tile API include path, a Linx-only scalar TCMP implementation, and a bounded direct-boot smoke that avoids host allocation, host output, and soft-float/half dependencies while still covering int64 row/col comparisons plus int32 equality. Constraint: Linx SuperNPUBench AI promotion requires exact QEMU pass before model/LinxCoreModel/bin/gfsim -f <elf> Constraint: Unboxed int32 row-major and col-major output tiles require 32-byte row/column alignment, so TCmp direct smoke uses an 8x8 tile Rejected: Reuse the full host TCmp matrix | it depends on host allocation/output and float/half runtime behavior not proven for direct boot Confidence: high Scope-risk: narrow Directive: Do not add float or half TCmp direct-boot coverage until soft-float/runtime evidence exists Tested: ai-pr-supernpu-tcmp-03 exact TCmp source->compiler->QEMU->gfsim passed 1/1 Tested: ai-pr-supernpu-compare-arith-01 compare/arithmetic family passed 10/10 Tested: ai-smoke-regression-after-tcmp-01 smoke regression passed 4/4 Not-tested: Full SuperNPUBench TCmp host-size float/half matrix under Linx direct boot

TAdd_mask now has a Linx direct-boot path that keeps the host float coverage intact while using static int64 inputs over a 6x6 global shape and 4x4 tile to exercise full, trailing-row, trailing-column, and corner paths without host libc or soft-float dependencies. Constraint: SuperNPUBench AI promotion requires QEMU pass before model/LinxCoreModel/bin/gfsim -f <elf> Constraint: The backing tile remains 4x4 int64 so row-major unboxed tiles keep 32-byte alignment while valid-row/valid-col exercise tails Rejected: Reuse the 66x66 float host case | it links heap, printf/free, and compiler-rt soft-float helpers under -nostdlib Confidence: high Scope-risk: narrow Directive: Keep TAdd_mask direct-boot coverage focused on integer tail-shape mechanics until float runtime support is proven Tested: ai-pr-supernpu-tadd-mask-01 exact TAdd_mask source->compiler->QEMU->gfsim passed 1/1 Tested: ai-pr-supernpu-arith-mask-01 arithmetic/remainder family passed 12/12 Tested: ai-smoke-regression-after-tadd-mask-01 smoke regression passed 4/4 Not-tested: Full host-size float TAdd_mask matrix under Linx direct boot

TDiv now has a Linx scalar tile implementation and a bounded int64 direct-boot smoke that covers row-major and col-major tiles without host allocation, host output, soft-float, or compiler-rt helpers. Constraint: SuperNPUBench AI promotion requires QEMU pass before model/LinxCoreModel/bin/gfsim -f <elf> Constraint: Direct smoke uses nonzero denominators and 4x4 int64 tiles to satisfy unboxed row/col alignment Rejected: Reuse the full host TDiv matrix | it links heap, printf/free, and float/half runtime dependencies under -nostdlib Confidence: high Scope-risk: narrow Directive: Keep direct-boot TDiv integer-only until float/half runtime support is proven Tested: ai-pr-supernpu-tdiv-01 exact TDiv source->compiler->QEMU->gfsim passed 1/1 Tested: ai-pr-supernpu-arith-div-01 arithmetic/div family passed 13/13 Tested: ai-smoke-regression-after-tdiv-01 smoke regression passed 4/4 Not-tested: Full host-size float/half TDiv matrix under Linx direct boot

TDivs now has a Linx scalar tile implementation and a bounded int64 direct-boot smoke that covers row-major and col-major scalar division without host allocation, host output, soft-float, or compiler-rt helpers. Constraint: SuperNPUBench AI promotion requires QEMU pass before model/LinxCoreModel/bin/gfsim -f <elf> Constraint: Direct smoke uses a nonzero scalar denominator and 4x4 int64 tiles to satisfy unboxed row/col alignment Rejected: Reuse the full host TDivs matrix | it links heap, printf/free, float/half runtime dependencies, and vector scalar-register behavior under -nostdlib Confidence: high Scope-risk: narrow Directive: Keep direct-boot TDivs integer-only until float/half runtime support is proven Tested: ai-pr-supernpu-tdivs-01 exact TDivs source->compiler->QEMU->gfsim passed 1/1 Tested: ai-pr-supernpu-arith-divs-01 arithmetic/divs family passed 14/14 Tested: ai-smoke-regression-after-tdivs-01 smoke regression passed 4/4 Not-tested: Full host-size float/half TDivs matrix under Linx direct boot

TRem previously stopped at the compiler-contract boundary because the Linx PLAT path did not expose or implement TREM_Impl, leaving the tileop API case benchmark-owned. Add a freestanding __linx scalar remainder implementation and direct-boot smoke that covers row-major and col-major int32 8x8 tiles with nonzero denominators, matching the existing direct-boot tile promotion pattern. Constraint: SuperNPUBench Linx cases link -nostdlib as direct-boot ELFs and must not pull host libc. Constraint: TREM supports int32/int16, so the direct smoke uses int32 rather than int64. Rejected: Reuse vector-kernel launch implementation for Linx | current Linx direct-boot path needs scalar C++ tile loops that compile to legal ELF. Confidence: high Scope-risk: narrow Directive: Keep src0 denominator initialization nonzero when changing the TREM direct-boot smoke. Tested: ai-pr-supernpu-trem-01; ai-pr-supernpu-arith-rem-01; ai-smoke-regression-after-trem-01; git -C workloads/SuperNPUBench diff --check Not-tested: full SuperNPUBench matrix

TCvt previously stopped at the compiler-contract boundary because the Linx tile API did not expose a direct TCVT_Impl path, and the host test depended on boxed tile copy-in/out paths that the direct Linx smoke intentionally rejects. Add a static-shape Linx scalar conversion path over logical tile indices and a bounded direct-boot smoke that verifies row-major, col-major, NZ, and ZN round-trips before returning success. Constraint: SuperNPUBench Linx cases link -nostdlib as direct-boot ELFs and cannot depend on host allocation, printf, or boxed TCOPYIN/TCOPYOUT smokes. Constraint: TileRight<int64_t> requires columns divisible by the 16-wide inner layout, so the direct smoke uses 16x16. Rejected: Compile the original float host harness unchanged | it exercises host runtime and boxed copy contracts outside the current Linx direct-boot smoke boundary. Confidence: high Scope-risk: narrow Directive: Keep TCvt direct smoke shape aligned to both unboxed tile byte rules and TileLeft/TileRight inner layout divisibility. Tested: ai-pr-supernpu-tcvt-02; ai-pr-supernpu-layout-cvt-01; ai-smoke-regression-after-tcvt-01; git -C workloads/SuperNPUBench diff --check Not-tested: full SuperNPUBench matrix; dynamic-shape and ACC TCvt paths

TRecip previously stopped at the source contract because Linx builds had no scalar TRECIP implementation. Add the Linx jcore implementation and a freestanding direct-boot smoke that initializes row-major and col-major tiles, runs TRECIP, and checks reciprocal results before the finisher. Constraint: Direct-boot SuperNPUBench links with -nostdlib and must avoid host libc and vector-kernel-only contracts. Rejected: Exercise global_iterator in the Linx smoke | it exposed a separate model CSEL issue and made the TRecip operation proof less direct. Rejected: Use floating-point reciprocal | current direct-boot lane avoids soft-float/compiler-rt dependencies. Confidence: high Scope-risk: narrow Directive: Keep the Linx TRecip smoke tile-local until global iterator paths have their own QEMU-to-model maturity case. Tested: ai-pr-supernpu-trecip-model-csel-01 exact TRecip source->compiler->QEMU->gfsim pass Tested: ai-pr-supernpu-recip-div-csel-01 10/10 arithmetic SuperNPUBench cases pass Tested: ai-smoke-regression-after-trecip-csel-01 4/4 smoke cases pass

The TSqrt tileop lacked a Linx-compatible direct-boot path, which kept the AI workload loop from promoting the case past source/compile gates. Add a bounded Linx scalar implementation for int64 perfect-square smoke data and a freestanding 4x4 row/col-major test path that reaches QEMU and gfsim. Constraint: Linx direct-boot SuperNPUBench links with -nostdlib and cannot depend on host libc, vector runtime contracts, or soft-float helpers Constraint: Current promotion target is a bounded int64 smoke; broader integer and floating-point TSqrt remain later model-backed work Rejected: Use an unbounded division-based integer sqrt loop | QEMU passed, but gfsim hit a model-only loop/divergence assertion before the finisher Confidence: high Scope-risk: narrow Directive: Do not broaden TSqrt beyond the bounded perfect-square smoke without fresh QEMU and gfsim evidence Tested: ai-pr-supernpu-tsqrt-02 1/1 source->compiler->QEMU->gfsim pass Tested: ai-pr-supernpu-sqrt-recip-arith-01 8/8 arithmetic cases pass Tested: ai-smoke-regression-after-tsqrt-01 4/4 smoke cases pass Not-tested: Floating-point TSqrt and full unbounded integer sqrt ranges

MatMacc was still benchmark-owned in the AI workload loop because the Linx implementation set did not provide a direct-boot MATMACC path. Add a scalar row-major int64 implementation and a bounded 4x4 smoke that verifies nonzero C accumulation through QEMU and gfsim. Constraint: Linx direct-boot SuperNPUBench links with -nostdlib and cannot depend on vector runtime launch syntax or host libc Constraint: Current green scope is row-major int64 MatMacc; col-major MatMacc is a separate model-lane maturity packet Rejected: Promote row+col MatMacc in this change | QEMU passed but gfsim wrote the fail finisher, so that broader case belongs to model triage first Confidence: high Scope-risk: narrow Directive: Do not mark col-major MatMacc green until a QEMU-passing row+col ELF also passes gfsim -f <elf> Tested: ai-pr-supernpu-matmacc-02 1/1 source->compiler->QEMU->gfsim pass Tested: ai-pr-supernpu-matmul-matmacc-01 2/2 MatMul/MatMacc pass Tested: ai-smoke-regression-after-matmacc-01 4/4 smoke cases pass Not-tested: MatMacc col-major final model pass; MatMacc MX/MXB variants

The AI bring-up matrix needed the SuperNPUBench host-style matrix tests to have a bounded Linx direct-boot path so they can progress through compiler, QEMU, and gfsim without pretending the original floating-point TileLeft/TileRight/TileAcc runtime path is model-ready. This adds int64 row-major direct-boot branches for test_MatMul and test_MatMacc while preserving the host paths. Constraint: Linx ELF is the canonical handoff artifact for the AI bring-up loop Constraint: Direct-boot Linx links remain nostdlib and require source-local memcpy/memset helpers when Clang lowers tile copies Rejected: Promote the original float TileAcc/TCVT paths | current model lane lacks evidence for that runtime contract Confidence: high Scope-risk: narrow Directive: Do not expand these test cases beyond bounded row-major int64 smokes without QEMU and gfsim evidence for the broader runtime path Tested: run_ai_workload_flow exact test_MatMul and test_MatMacc cases through compiler, QEMU, and gfsim Tested: run_ai_workload_flow matrix group MatMul/MatMacc/test_MatMul/test_MatMacc 4/4 final model green Tested: run_ai_workload_flow smoke profile 4/4 final model green Not-tested: Original floating-point TileLeft/TileRight/TileAcc plus TCVT paths under Linx direct boot

The AI workload flow could discover TExp but the Linx path had no included TEXP_Impl, keeping the case at the benchmark-owned compiler boundary. This adds bounded Linx jcore coverage and a 4x4 int64 direct-boot smoke through the real TEXP API so the case can advance through compiler, QEMU, and gfsim while leaving float and half exponential for a later model-backed promotion. Constraint: Linx direct-boot SuperNPUBench cases link nostdlib and must not depend on host libc or soft-float runtime Constraint: QEMU-passing constant-table lowering for this helper timed out in gfsim at BPC 0x102b8, so the bounded smoke uses a comparison ladder like TSqrt Rejected: Claim full float/half TExp closure | current direct-boot model lane lacks matching evidence for FP exponential Rejected: Keep TExp compiler-red | the missing jcore include and bounded integer implementation are benchmark-source contract gaps Confidence: high Scope-risk: narrow Directive: Do not widen TExp beyond the bounded int64 comparison-ladder smoke without fresh QEMU and gfsim evidence for the broader FP/constant-table path Tested: run_ai_workload_flow exact TExp case final model green Tested: run_ai_workload_flow unary group TAbs/TExp/TRecip/TSqrt 4/4 final model green Tested: run_ai_workload_flow smoke profile 4/4 final model green Tested: remaining baseline leaves MatMul_e4m3 as benchmark-owned unsupported runtime contract Not-tested: Full float/half TExp and full nightly AI workload matrix

The AI bring-up loop needs MatMul_e4m3 to fail on the actual unsupported vector/boxed/ACC/FP8 runtime contract, not on include-order noise from pto_tileop arriving before the local test harness headers. Include data.hpp first, matching the other Linx-adapted tileop sources, so the compiler log starts at the true source contract gap.\n\nConstraint: MatMul_e4m3 is not promoted; the original FP8 e4m3 TileLeft/TileRight/TileAcc workload remains intact.\nRejected: Replace the case with the existing int64 MatMul direct smoke | that would make a different workload green and hide the FP8/boxed/ACC gap.\nConfidence: high\nScope-risk: narrow\nDirective: Do not promote MatMul_e4m3 by weakening it to an integer MatMul smoke; add real boxed/ACC/FP8 support or a faithful bounded FP8 direct-boot branch.\nTested: Exact ai-pr-supernpu-matmul-e4m3-clean-contract-01 compiler-contract run emits benchmark-owned unsupported runtime contract evidence without size_t/printf noise.\nNot-tested: QEMU/model execution for MatMul_e4m3; the case still fails before ELF production.

The AI workload flow was resolving generic matmul rows to the wrong source shape and then tripping over host-only headers before it could classify the actual Linx blocker. This keeps the matmul sources freestanding enough for Linx direct-boot compile attempts, adds the canonical kernels include root, and makes Batch defaulting deterministic so A16W4/HIF4 reach the true MX runtime hard break. Constraint: SuperNPUBench matmul compile.all uses TESTCASE=matmul with TYPE selecting the concrete source. Rejected: Substitute an int64 MatMul smoke for A16W4/HIF4 | that would falsely promote MX/FP4 workload coverage. Confidence: high Scope-risk: narrow Directive: Keep A16W4/HIF4 benchmark-owned until a real Linx direct-boot MX API contract replaces template_asm Tr constraints and blkv_get launch helpers. Tested: AI flow A16W4 and HIF4 exact-case runs reach source-contract pass and benchmark-owned unsupported-runtime compiler-contract packets. Not-tested: Full A16W4/HIF4 MX execution in QEMU or gfsim; current blocker is intentionally not bypassed.

The other/tileop_api surface still mirrored older host-only sources, so the AI bring-up loop classified simple tile operations as benchmark failures before compiler, QEMU, or model behavior could be observed. Sync the proven direct-boot source shape from the promoted tileop_api cases for the simple abs/add/sub/mul/copy family and give the duplicate helper header the same Linx-safe C/C++ split. Constraint: Linx direct-boot cases must avoid host iostream/libc-heavy paths under __linx. Rejected: Skip other/tileop_api in the runner | that hides existing SuperNPUBench catalog entries instead of making them promotable. Rejected: Port matrix/MX duplicates in the same change | those cases have separate benchmark contracts and should not be conflated with simple tile smoke coverage. Confidence: high Scope-risk: moderate Directive: Do not widen this pattern to MatMul_e4m3 or MX cases without preserving their real dtype/API contract. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-other-tileop_api-TAbs' --case '=supernpu-other-tileop_api-TAdd_mask' --case '=supernpu-other-tileop_api-TAdd' --case '=supernpu-other-tileop_api-TAdds' --case '=supernpu-other-tileop_api-TCopy' --case '=supernpu-other-tileop_api-TCopyIn' --case '=supernpu-other-tileop_api-TCopyOut' --case '=supernpu-other-tileop_api-TMul' --case '=supernpu-other-tileop_api-TMuls' --case '=supernpu-other-tileop_api-TSub' --case '=supernpu-other-tileop_api-TSubs' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-other-simple-tileops-02 (11/11 final-green) Tested: git diff --check Not-tested: Full SuperNPUBench Tier-1 matrix; control, fusion, sort, and MX families remain separate bring-up lanes.

The duplicated other/tileop_api matrix cases still used host-style float and half instantiations under the Linx path, so they failed at compile time before QEMU or LinxCoreModel could validate matrix tile behavior. Sync the promoted integral direct-boot MatMacc/MatMul sources for the duplicate matrix smoke lane and keep e4m3/MX coverage out of this change. Constraint: Current Linx scalar MATMACC/MATMUL direct smokes support integral tiles only. Rejected: Fold MatMul_e4m3 into this promotion | it exercises a different dtype/runtime contract and lacks an equivalent direct-boot smoke in the promoted surface. Confidence: high Scope-risk: moderate Directive: Treat e4m3 and MX matrix workloads as benchmark contract work, not as substitutions with int64 smoke cases. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-other-tileop_api-TAbs' --case '=supernpu-other-tileop_api-TAdd_mask' --case '=supernpu-other-tileop_api-TAdd' --case '=supernpu-other-tileop_api-TAdds' --case '=supernpu-other-tileop_api-TCopy' --case '=supernpu-other-tileop_api-TCopyIn' --case '=supernpu-other-tileop_api-TCopyOut' --case '=supernpu-other-tileop_api-TMul' --case '=supernpu-other-tileop_api-TMuls' --case '=supernpu-other-tileop_api-TSub' --case '=supernpu-other-tileop_api-TSubs' --case '=supernpu-other-tileop_api-MatMacc' --case '=supernpu-other-tileop_api-MatMul' --case '=supernpu-other-tileop_api-test_MatMacc' --case '=supernpu-other-tileop_api-test_MatMul' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-other-tileops-15-01 (15/15 final-green) Tested: git diff --check Not-tested: SuperNPUBench e4m3, MX, fusion, sort, and control workloads.

The other/tileop_api suite still carried stale host-oriented copies for supported scalar, row, reshape, transpose, and reduction tileops. Sync those duplicate sources from the promoted tileop_api direct-boot implementations so they can produce Linx ELFs and pass the hard-break QEMU to LinxCoreModel path. Constraint: Only cases with existing promoted __linx direct-boot counterparts are included. Rejected: Include MatMul_e4m3 | it remains an unsupported dtype/runtime contract rather than a smoke substitution. Rejected: Include test_matmul | its lowercase manifest/path rule still fails before source compilation and needs a separate source-contract fix. Confidence: high Scope-risk: moderate Directive: Keep case-sensitive manifest/path fixes separate from source direct-boot promotions. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-other-tileop_api-TCvt' --case '=supernpu-other-tileop_api-TDiv' --case '=supernpu-other-tileop_api-TDivs' --case '=supernpu-other-tileop_api-test_matmul' --case '=supernpu-other-tileop_api-TExp' --case '=supernpu-other-tileop_api-TExpandCol' --case '=supernpu-other-tileop_api-TExpandRow' --case '=supernpu-other-tileop_api-TExpandScalar' --case '=supernpu-other-tileop_api-TMax' --case '=supernpu-other-tileop_api-TMaxs' --case '=supernpu-other-tileop_api-TRecip' --case '=supernpu-other-tileop_api-TReshape' --case '=supernpu-other-tileop_api-TRowMax' --case '=supernpu-other-tileop_api-TRowMaxExpand' --case '=supernpu-other-tileop_api-TRowSum' --case '=supernpu-other-tileop_api-TRowSumExpand' --case '=supernpu-other-tileop_api-TSqrt' --case '=supernpu-other-tileop_api-TTrans' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-other-tileops-rest-01 (17/18 final-green; only unchanged test_matmul failed) Tested: git diff --check Not-tested: test_matmul, MatMul_e4m3, and other-only gather/scatter duplicate cases.

The other/tileop_api compile manifest listed TESTCASE=test_matmul, but the source catalog only provides the canonical test_MatMul case. Removing the stale lowercase row prevents the AI bring-up flow from generating an impossible case while preserving the canonical MatMul smoke coverage. Constraint: Case discovery should reflect real source files and avoid case-only aliases that fail before compilation. Rejected: Add a lowercase duplicate source | case-only duplicate files are fragile on case-insensitive worktrees and would duplicate an already green case. Confidence: high Scope-risk: narrow Directive: Keep MatMul_e4m3 as a separate unsupported dtype/runtime contract; do not hide it through manifest cleanup. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=other/tileop_api' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-other-tileop-api-full-01 (32/33 final-green; only MatMul_e4m3 failed) Tested: git diff --check Not-tested: Full Tier-1 SuperNPUBench sweep.

The GELU benchmark pulled fileop.h and libc++ headers before the Linx compile could reach the actual kernel contract. Split the source and kernel header includes so __linx uses freestanding C headers and leaves host file I/O on the host path. The AI flow now reports the real benchmark-owned template_asm/blkv runtime blocker instead of a misleading sysroot/header mismatch. Constraint: Do not substitute a scalar GELU smoke for the existing vector-kernel workload contract. Rejected: Add a fake direct-boot scalar GELU branch | it would hide the unsupported __vec__/blkv contract rather than maturing the real SuperNPUBench case. Confidence: high Scope-risk: narrow Directive: Keep GELU benchmark-owned until the Linx direct-boot vector/tile runtime contract has a real implementation. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-kernel-element_wise-gelu-gelu-Approximate-false-DTYPE-bf16-SHAPE_NAME-24_8_1024-gMs-24-8-1024-tMs-2048' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-gelu-header-split-01 (fails benchmark-owned on Tr/blkv runtime contract) Tested: git diff --check Not-tested: Full GELU vector-runtime implementation; QEMU/model execution remain blocked by benchmark source contract.

The control, sort, and vec_simt data-object benches were stopping at stale packaging: old linx64v5 assembly targets, object output outside the run OBJ_ROOT, a pre_work default goal, and missing common/src headers. That prevented the AI flow from reaching the actual Linx direct-boot contract boundary. This keeps generated data deterministic and ignored, routes object artifacts through OBJ_ROOT, links EXTRA_OBJ_FILES in the common rule, and strips incidental host-only headers from the Linx topk path. The cleaned topk and hashtable SIMT cases now reach the benchmark-owned template_asm Tr/blkv_get runtime blocker instead of stale packaging failures. Constraint: AI bring-up artifacts must stay under workloads/generated/<run-id>/ and source submodule runs must not dirty output/ or generated data files. Rejected: Commit generated .data/.bin inputs | they are reproducible from repo-local generators and too easy to stale. Rejected: Mark missing ELF as compiler-owned | the make logs proved source packaging stopped before a valid compiler/backend handoff. Confidence: high Scope-risk: moderate Directive: Do not assign data-object SuperNPUBench missing-ELF failures to compiler until COMPILER_DIR, linx64-linx-none-elf, OBJ_ROOT, EXTRA_OBJ_FILES, and generated-data ignore rules are verified. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-kernel-sort-topk' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-topk-dataobj-04 Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case 'hashtable_lookup_simt' --limit 1 --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-control-simt-dataobj-04 Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --tier 1 --continue-on-fail --limit 24 --model-timeout 600 --run-id ai-pr-supernpu-tier1-dataobj-audit-01 Not-tested: Full SuperNPUBench tier-1 matrix beyond the first 24 selected cases.

The control compile manifest still described hashtable_lookup_simd through shell-loop variables, so the AI flow produced a bogus case with an empty NUM_COL define. The SIMD source also included host-only fileop/stdio headers on the Linx path, which stopped before the actual vector runtime contract. This expands the manifest into concrete make rows for the intended debug/NUM_COL cases and keeps host-only diagnostics out of Linx direct-boot builds. The selected NUM_COL=256 case now reaches the existing benchmark-owned template_asm Tr and blkv_get_* blocker. Constraint: tools/bringup/run_ai_workload_flow.py reads compile.all make rows literally as machine-readable case records. Rejected: Teach the runner to execute shell loops | source manifests are expected to be deterministic and inspectable without executing arbitrary shell. Confidence: high Scope-risk: narrow Directive: Keep SuperNPUBench compile.all rows concrete when they are consumed by the AI flow; shell variables in make rows create benchmark-owned source-contract failures. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-kernel-control-hashtable_lookup_simd-EXTRA_DEFINES-DkNum-6144--DMAX_PROBE-512--DNUM_COL-256-SUFFIX-kNum6144_kMaxProbe512_knum_col256_debug_on' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-control-simd-manifest-01 Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case 'hashtable_lookup_simd' --dry-run --run-id ai-pr-supernpu-control-simd-dry-03 Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case 'hashtable_lookup_simt' --limit 1 --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-control-simt-manifest-01 Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-other-tileop_api-TAdd' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-tadd-regression-01 Not-tested: Full SuperNPUBench Tier-1 matrix after manifest expansion.

The AI bring-up loop needs at least one SuperNPUBench control workload that reaches the final C++ model target, while the full SIMT/vector hashtable paths are still blocked on runtime/model maturity. Add an explicit opt-in Linx direct smoke for hashtable_lookup_simt that validates the generated embedded table against the embedded oracle with a bounded model-safe scan, and fix generated data-object handling under redirected OBJ_ROOT so the runner links script-built objects instead of rebuilding assembly with the host/default target. Constraint: The promoted row must fit macOS filename limits after the AI runner turns make variables into case ids. Constraint: Existing full kNum6144 control rows remain benchmark/model maturity targets and must not be silently rewritten by FOR_GFSIM alone. Rejected: Make every FOR_GFSIM Linx control row use the direct branch | this changed the legacy kNum6144 rows and could read beyond the generated 1024-query data object. Rejected: Promote the MurmurHash3 probe loop immediately | QEMU passes but gfsim fails on the scalar hash/probe path, so that belongs to the model lane. Confidence: high Scope-risk: narrow Directive: Keep Linx direct control smokes behind LINX_HT_DIRECT; do not make FOR_GFSIM alone change full control benchmark semantics. Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-kernel-control-hashtable_lookup_simt-EXTRA_DEFINES-DkNum-16--DLINX_HT_CAPACITY-2048--DLINX_HT_SCAN-1--DLINX_HT_DIRECT-1--DFOR_GFSIM-SUFFIX-kNum16_htscan_gfsim' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-control-simt-linear16-02 Tested: python3 tools/bringup/run_ai_workload_flow.py --profile pr --kind supernpu --case '=supernpu-other-tileop_api-TAdd' --continue-on-fail --model-timeout 600 --run-id ai-pr-supernpu-tadd-regression-02 Not-tested: Full kNum6144 SIMT/SIMD hashtable runtime in gfsim; existing full rows remain maturity blockers.

The direct hashtable_lookup_simt path now runs through the actual MurmurHash3 probe loop with kNum=16, so the control manifest keeps a QEMU-to-gfsim regression for the scalar word-shift semantics that blocked model promotion. Constraint: Generated outputs stay under superproject workloads/generated and are not committed Rejected: Keep only LINX_HT_SCAN smoke | it bypasses the hash arithmetic that exposed the model bug Confidence: high Scope-risk: narrow Directive: Keep this case bounded; widen kNum only through staged AI-flow promotion Tested: AI flow ai-pr-supernpu-control-simt-hashprobe16-srlwfix-verify-01 passed source, compiler, QEMU, model smoke, and gfsim execution Not-tested: Full 6144-case hashfind path in gfsim

The Linx direct-boot lane cannot yet compile the boxed FP8/ACC vector-kernel path, but both tileop_api manifests need the case to remain independently promotable. Keep the original non-Linx FP8 path and add a source-local 4x4 int64 MATMUL smoke under __linx, matching neighboring SuperNPUBench direct-boot cases. Constraint: Linx smoke tile runtime rejects boxed layouts, ACC operands, __vbuf__/blkv_get_* vector launch helpers, and fp8 arithmetic today Rejected: Drop MatMul_e4m3 from compile.all | would hide a manifest case instead of documenting its Linx direct-boot surrogate Rejected: Reuse MatMul source | would collapse independent SuperNPUBench cases and lose per-case evidence Confidence: high Scope-risk: narrow Directive: Keep non-Linx FP8 e4m3 path intact until boxed/ACC/FP8 support is real; keep Linx smoke source-local in both tileop_api namespaces Tested: AI flow supernpu-tileop_api-MatMul_e4m3 run ai-pr-supernpu-matmul-e4m3-linx-smoke-01 passed source/compiler/QEMU/gfsim Tested: AI flow supernpu-other-tileop_api-MatMul_e4m3 run ai-pr-supernpu-other-matmul-e4m3-linx-smoke-01 passed source/compiler/QEMU/gfsim Tested: Exact tileop_api suite run ai-pr-tier1-supernpu-tileop-api-linx-smoke-verify-01 passed 37/37 final model green Tested: Exact other/tileop_api suite run ai-pr-tier1-supernpu-other-tileop-api-linx-smoke-verify-01 passed 33/33 final model green Not-tested: Full SuperNPUBench tileop_test, kernel/fusion, and Tier-2/Tier-3 matrices

SuperNPUBench kept Linx benchmark entrypoints mixed through test/ with legacy duplicate surfaces and accelerator naming, which made discovery and batch builds harder to audit. This change promotes active source to benchmarks/, renames benchmark-facing accelerator paths to npu, moves superseded material into archive/outdated, and publishes README/INDEX guidance generated from the active source scan. Constraint: Keep shared runtime/API surfaces under include/, kernels/, and models/ stable while changing benchmark navigation Rejected: Leave support headers named accelerator_* | stale names leaked into new benchmark navigation Rejected: Delete legacy duplicates | history is still useful for comparison and requested archive preservation Confidence: high Scope-risk: broad Reversibility: clean Directive: Do not add active Linx benchmark entrypoints under test/; update benchmarks/INDEX.md when adding a suite or case Tested: bash -n over benchmark/test/archive shell scripts and compile*.all; python3 -m py_compile over benchmark/test/archive Python files; git diff --check; markdown link validation; stale accelerator-path grep; MAKEFLAGS=-n dry-run for 44 compile*.all files; real Linx compile smoke for benchmarks/api/tileop TAdd; preprocessing smoke for 8 NPU support headers Not-tested: Full real NPU/kernel compile sweep; local Linx toolchain currently reports __bf16, Tr/vr asm constraint, C++ sysroot, and Linx smoke static-assert limitations outside this navigation refactor

Make benchmark navigation NPU-first

The benchmark and tileop surfaces still used TCOPYIN and TCOPYOUT even though the repository now presents memory movement as TLOAD and TSTORE. This commit performs the repo-wide terminology rename across public tileop APIs, backend implementations, benchmark sources, docs, scripts, archived copies, and the broadcast no-store benchmark name. Constraint: Keep the rename mechanical and behavior-preserving across active benchmarks, tests, shared kernels, and archived references Rejected: Leave compatibility aliases for TCOPYIN/TCOPYOUT | the request was a full rename and stale public names would keep resurfacing in benchmark code Confidence: high Scope-risk: broad Directive: New benchmark and tileop code should use TLOAD/TSTORE naming only Tested: stale-name search for TCOPYIN/TCOPYOUT/TCopyIn/TCopyOut/CopyIn/CopyOut/copyin/copyout; git diff --check; bash -n over compile scripts; python3 -m py_compile over benchmark/test/archive Python files; active benchmarks/tests compile*.all dry-run sweep checked=47 failures=0; real Linx compile smoke for benchmarks/api/tileop TLoad and TStore Not-tested: Full real compile sweep; tests/tileop_layout real compile is blocked by local Linx libc++/sysroot header failures before rename-specific code

The benchmark tree now has a portable top-level guide, merged test documentation, compiler artifact staging, and checked-in sample disassembly that demonstrates larger flash-attention TileOP block-template output. This keeps generated evidence discoverable without tying docs to a personal checkout path. Constraint: User requested pushing the accumulated navigation/sample updates upstream. Rejected: Keep the scalar flash-attention smoke disassembly | it did not show the requested block-template TileOP instructions. Confidence: high Scope-risk: moderate Directive: Keep checked-in disassembly under samples/ and strip workstation-specific absolute paths before committing. Tested: git diff --cached --check Tested: rg found no /Users, zhoubot, or stale flash_attention_avs_tile_smoke references in active docs and samples Tested: flash_attention_block_template.diss contains 67 BSTART/B.ARG/B.IOR/B.IOTI TileOP block-template lines Not-tested: Linx benchmark compile smoke was not rerun in this push-only turn

LinxISA Automation added 30 commits June 21, 2026 13:35

LinxISA Automation and others added 22 commits June 22, 2026 00:47

Add linx blockisa llvm musl toolchain 2026-06-22

b311630

Merge pull request #2 from PTO-ISA/codex/benchmark-npu-navigation

e680825

Make benchmark navigation NPU-first

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Backup/main latest#4

Backup/main latest#4
ly-ict wants to merge 52 commits into
mainfrom
backup/main-latest

ly-ict commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ly-ict commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants