diff --git a/docs/architecture-review-and-rust-analysis.md b/docs/architecture-review-and-rust-analysis.md
new file mode 100644
index 000000000..2405d5658
--- /dev/null
+++ b/docs/architecture-review-and-rust-analysis.md
@@ -0,0 +1,252 @@
+# SIMPLER / PTO Runtime — Architecture Review & Rust-Suitability Analysis
+
+A review of the `simpler` project (HiSilicon PTO Runtime): the 7-layer level
+model, the chip-level three-program model, architectural diagrams, and a
+component-by-component analysis of where Rust would (and would not) help.
+
+> Scope note: this is an external review for discussion. The project today is
+> ~131 k LOC C++ + ~40 k LOC Python, **zero Rust**. Nothing here proposes a
+> rewrite; it maps where Rust's guarantees would pay off if components were
+> (re)written, and labels each with the **single dominant reason**.
+
+![SIMPLER architecture: L0–L6 level model, engine + three-program model, and Rust-suitability map](diagrams/architecture.svg)
+
+> The three panels above are rendered from [`diagrams/make_diagrams.py`](diagrams/make_diagrams.py)
+> (`python3 docs/diagrams/make_diagrams.py` regenerates `diagrams/architecture.svg`). The ASCII
+> versions below are the same diagrams inline.
+
+---
+
+## 1. What SIMPLER is
+
+A **task-graph runtime** that builds and executes DAGs of compute tasks on
+Ascend NPU clusters, coordinating **AICPU** (on-device control processor) and
+**AICore** (AIV vector / AIC cube compute) execution. Three independently
+compiled programs — Host `.so`, AICPU `.so`, AICore `.o` — cooperate through
+narrow C APIs, with a Python orchestration layer on top.
+
+Two orthogonal axes structure the codebase:
+
+- **Level**: L0 (core) → L6 (cluster) — a 7-layer hierarchy mirroring physical
+  topology.
+- **Program**: Host / AICPU / AICore — the three-program model at the L2 chip
+  boundary.
+
+---
+
+## 2. The 7-layer level model (L0–L6)
+
+```text
+  LEVEL  NAME               UNIT                           RUNTIME COMPONENT              WORLD
+  ─────  ─────────────────  ─────────────────────────────  ─────────────────────────────  ─────────────
+  L6 ▒   CLOS2 / Cluster    full cluster (N6 super-nodes)  Worker(level=5) ×N             ┐
+  L5 ▒   CLOS1 / SuperNode  super-node (N5 pods)           Worker(level=4) ×N             │ HOST / CLUSTER
+  L4 ▒   POD / Pod          pod (4 hosts)                  Worker(level=3) ×N + Sub ×M    │ (Orchestrator +
+  L3 ▒   HOST / Node        one host (16 chips + M subs)   ChipWorker ×N + SubWorker ×M   │  Scheduler + Worker,
+  ──────────────────────────────────────────────────────────────────────────────────────  │  IPC / RoCE / HCCS)
+  L2 █   CHIP / Processor   one NPU chip (shared GM)       Host.so + AICPU.so + AICore.o  ┘ ← THE BOUNDARY
+  L1 ░   DIE / L2Cache      chip die                       hardware-managed               ┐ ON-DEVICE
+  L0 ░   CORE / AIV,AIC     individual compute core        hardware-managed               ┘ (shared GM + atomics)
+```
+
+**L2 is the boundary** between two worlds:
+
+- **L0–L2 (on-device)**: AICPU scheduler + AICore workers + device Global
+  Memory. Coordination is by shared GM, atomics, barriers, and the
+  AICPU↔AICore **handshake protocol**. Hard hardware constraints apply
+  (e.g. AICore *cannot* write `DATA_MAIN_BASE`; MMIO reads are strictly serial
+  at ~95 ns each).
+- **L3–L6 (host/cluster)**: every level runs the **same** scheduling engine —
+  one `Worker` C++ class handles L3 upward; `level` is just a diagnostic label.
+  Composition is **recursive**: a parent Worker schedules child Workers through
+  the identical mailbox protocol L3 uses for chip children. Local composition
+  is via fork + shared memory; cross-host (L4–L6) via RoCE / HCCS / UB / sockets.
+
+Maturity, per the docs: L3 is implemented; L4 is implemented locally, with remote
+simulated; L5/L6 reuse the L4 code path (untested), with remote only proposed.
+
+---
+
+## 3. The three engine components (L3+) and the L2 three-program model
+
+Every level L3+ composes three cooperating components, each on its own thread:
+
+```text
+  ORCHESTRATOR (Orch thread)          SCHEDULER (Scheduler thread)      WORKER (Worker threads)
+  ──────────────────────────          ────────────────────────────      ───────────────────────
+  DAG builder. Runs on the            DAG executor. Drains 3 queues:    Execution layer.
+  user's thread. Owns:                 · wiring (wire fanout edges)     WorkerManager holds
+   · Ring (slot pool)                  · ready (fanin satisfied →       WorkerThread pools.
+   · TensorMap(dep inference)              pick idle worker)            Each encodes (callable,
+   · Scope (tensor lifetime)           · completion (release fanout)    config, args) into a shm
+                                      Never inspects task data —        mailbox → signals the
+  submit_next_level(c, args, cfg)     only moves slot ids + reads       forked child → spin-polls
+   → alloc, dep-infer, push           TaskSlotState metadata.           TASK_DONE.
+     wiring_queue ─────────────────►                ─────────────────►
+                                                    ◄──── completion (slot, outcome)
+```
+
+At **L2**, the "Worker" leaf is a `ChipWorker` that drives the three chip-level
+programs:
+
+```text
+  ┌────────────────────────────── Python application / SceneTestCase ──────────────────────────────┐
+  │ nanobind (task_interface)       ChipWorker(dlopen host.so)       RuntimeBuilder/KernelCompiler │
+  └───────────────┬─────────────────────┬────────────────────────────────────┬─────────────────────┘
+                  │                     │                                    │ (compile)
+                  ▼                     ▼                                    ▼
+  ┌──────── Host Runtime (C++ .so) ────────┐           ┌──── Binary data (AICPU.so + AICore.o) ────┐
+  │ DeviceRunner · MemoryAllocator · C API │ loads ──► │ dlopen'd / launched at runtime            │
+  └───────────────────────┬────────────────┘           └───────────────────┬───────────────────────┘
+                          │                                                │
+                          ▼                 Ascend device                  ▼
+  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
+  │ AICPU: task scheduler loop      ◄── handshake buffers (aicpu_ready /                           │
+  │ AICore (AIV/AIC): kernels           aicore_done / task ptr) ──► compute                        │
+  └────────────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+**Two platform backends** (`onboard/` real hardware, `sim/` thread-based host
+simulation) and **two runtimes** (`host_build_graph` = graph built on host CPU,
+for dev/debug; `tensormap_and_ringbuffer` = graph built on AICPU/device, for
+production) sit under `src/{arch}/{platform,runtime}/`.
+
+**Python/C++ division** (from the docs): *Python decides **when**
+(fork ordering, `SharedMemory` lifecycle, callable registration); C++ decides
+**how fast** (threading, atomics, zero-copy dispatch).*
+
+---
+
+## 4. Where Rust fits — component-by-component
+
+Reading the layers top-to-bottom, here is each component, its current
+language, and the **one dominant reason** Rust would or would not help. The
+label in [BRACKETS] is the headline reason.
+
+```text
+  ═════════════════════════════════════════════════════════════════════════════════════════════════════════
+  LAYER / COMPONENT                               TODAY    RUST?       DOMINANT REASON (label)
+  ═════════════════════════════════════════════════════════════════════════════════════════════════════════
+  L3–L6 orchestration / user DAG fn               Python   ✗ keep Py   [ERGONOMICS] dynamic user API,
+  (python/simpler/{worker,orchestrator})                               fork timing, torch interop
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  Scheduler engine (queues, dispatch              C++      ✓✓ STRONG   [CONCURRENCY-SAFETY] lock-free-ish
+  loop, TaskSlotState)                                                 queues across Orch/Sched/Worker
+  src/common/hierarchical/scheduler                                    threads — data races are the bug
+                                                                       class Rust's Send/Sync removes
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  Orchestrator (Ring, TensorMap, Scope,           C++      ✓✓ STRONG   [LIFETIME-SAFETY] Scope = tensor
+  slot state machine)                                                  lifetimes + slot reuse; the exact
+  src/common/hierarchical/{ring,                                       use-after-free / aliasing class
+  tensormap,scope,orchestrator}                                        ownership/borrow encodes
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  WorkerManager / WorkerThread + shm              C++      ✓ MODERATE  [CONCURRENCY-SAFETY] thread pool +
+  mailbox dispatch                                                     mailbox state machine; but raw shm
+  src/common/hierarchical/worker_manager                               + fork interop needs heavy `unsafe`
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  Remote L3 transport: endpoint + wire            C++      ✓✓ STRONG   [PARSING-SAFETY] versioned frame
+  codec (RoCE/HCCS/UB/sockets)                                         codec over the network = untrusted
+  src/common/hierarchical/remote_{endpoint,wire}                       bytes; Rust parsers reject malformed
+                                                                       input without memory-unsafety
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  Host Runtime: DeviceRunner,                     C++      ~ WEAK      [FFI-COST] thin wrapper over CANN
+  MemoryAllocator, C API                                               C SDK (rtSetDevice, dlsym); Rust
+  src/{arch}/platform/*/host                                           adds FFI noise for little safety win
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  AICPU scheduler kernel (device .so)             C++      ~ WEAK*     [TOOLCHAIN] must compile with CANN's
+  src/{arch}/platform/*/aicpu                                          AICPU toolchain; no Rust target.
+                                                                       *Logic is race-heavy → Rust would
+                                                                       help IF a target existed
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  AICore compute kernel (device .o)               C++/PTO  ✗ NO        [NO-BACKEND] PTO ISA via CCEC; no
+  src/{arch}/platform/*/aicore                                         Rust/LLVM backend for AICore. This
+                                                                       is the kernel-safety story of the
+                                                                       *other* project (ascend-rs), not a
+                                                                       runtime concern
+  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  Python↔C++ bindings (nanobind)                  C++      ~ WEAK      [INTEROP] nanobind is mature; a Rust
+  python/bindings/task_interface.cpp                                   PyO3 equivalent only pays off if the
+                                                                       bound engine is already Rust
+  ═════════════════════════════════════════════════════════════════════════════════════════════════════════
+```
+
+### The labels, expanded
+
+- **[CONCURRENCY-SAFETY] — Scheduler & WorkerManager (strongest case).**
+  The Scheduler runs a dedicated thread draining three queues shared with the
+  Orch thread and N WorkerThreads, coordinated by mutex+CV and atomics over
+  `TaskSlotState`. Data races and torn reads of slot state are precisely the
+  bug class that Rust's `Send`/`Sync` + borrow checker turn into *compile
+  errors*; missed wakeups stay logic bugs that need tests. **Main reason to use
+  Rust: fearless concurrency on the hot scheduling path.**
+
+- **[LIFETIME-SAFETY] — Orchestrator (Ring / TensorMap / Scope).**
+  `Scope` manages intermediate-tensor lifetimes; `Ring` reuses fixed slots with
+  back-pressure; `TensorMap` maps a producer slot to consumers. A slot freed
+  while a downstream consumer still references it is a use-after-free — the
+  *same* hazard class the companion `ascend-rs` work shows Rust ownership
+  rejects at compile time. **Main reason: ownership/lifetimes make
+  slot-reuse-after-free unrepresentable.**
+
+- **[PARSING-SAFETY] — Remote L3 wire codec.**
+  `remote_wire.cpp` is a versioned frame codec for cross-host task frames over
+  RoCE/HCCS/UB/sockets — i.e. it decodes **bytes off the network**. Hand-rolled
+  C++ binary parsers are a perennial CVE source (overreads, length confusion).
+  **Main reason: safe decoding of untrusted/versioned input.**
+
+- **[FFI-COST] — Host Runtime / C API.** `DeviceRunner` is a thin handle-based
+  wrapper over CANN C calls (`rtSetDevice`, stream sync, `dlsym`). Rewriting in
+  Rust means wrapping all of CANN in `extern "C"` + `unsafe` — the safety upside
+  is small and the FFI tax is real. **Main reason *not* to: it's mostly FFI
+  glue, where most calls sit in `unsafe` blocks the compiler cannot check.**
+
+- **[TOOLCHAIN] — AICPU kernel.** Logically this is *also* a race-heavy
+  scheduler (it would benefit from Rust), but it must be built by CANN's AICPU
+  compiler; there is no Rust target for the AICPU. **Main reason *not* to:
+  no toolchain, regardless of merit.**
+
+- **[NO-BACKEND] — AICore kernel.** Compiled to PTO ISA via CCEC; no
+  Rust/LLVM AICore backend exists. (This is exactly the boundary the *separate*
+  `ascend-rs` project addresses with a shape-typed Rust model + an IR-level
+  oracle — out of scope for this runtime.) **Main reason *not* to: no code
+  generation path to the device.**
+
+- **[ERGONOMICS] — Python orchestration layer.** The user writes orch
+  functions in Python; the layer also owns `fork()` timing, `SharedMemory`
+  alloc/unlink, and torch zero-copy interop. This is "decide *when*" glue where
+  Python's dynamism and ecosystem win. **Main reason to keep Python: user-facing
+  API + lifecycle orchestration, not throughput.**
+
+---
+
+## 5. Summary picture — Rust suitability over the architecture
+
+```text
+  RUST SUITABILITY  (██ strong · ▓ moderate · ░ weak/no)
+  ┌─────────────────────────────────────────────────────────────────────────────────┐
+  │ L3–L6 Python orchestration ............ ░ keep Python [ERGONOMICS]              │
+  │ ┌──────────────────────── host/cluster engine (C++) ──────────────────────────┐ │
+  │ │ Scheduler (queues, dispatch) ........ ██ STRONG     [CONCURRENCY-SAFETY]    │ │
+  │ │ Orchestrator (Ring/TensorMap/Scope) . ██ STRONG     [LIFETIME-SAFETY]       │ │
+  │ │ Remote wire codec / endpoint ........ ██ STRONG     [PARSING-SAFETY]        │ │
+  │ │ WorkerManager / mailbox dispatch .... ▓ MODERATE    [CONCURRENCY-SAFETY]    │ │
+  │ │ Host Runtime / DeviceRunner / C API . ░ weak        [FFI-COST]              │ │
+  │ │ nanobind bindings ................... ░ weak        [INTEROP]               │ │
+  │ └─────────────────────────────────────────────────────────────────────────────┘ │
+  │ ════════════════════════ L2 device boundary ═══════════════════════════════════ │
+  │ AICPU scheduler kernel ................ ░ blocked     [TOOLCHAIN]               │
+  │ AICore compute kernel (PTO ISA) ....... ░ no          [NO-BACKEND]              │
+  └─────────────────────────────────────────────────────────────────────────────────┘
+```
+
+**Bottom line.** The high-value Rust targets are the **host-side coordination
+core** — Scheduler, Orchestrator, and the remote wire codec — where the bug
+classes are exactly the ones Rust targets: data races and slot-lifetime
+use-after-free become compile errors, and malformed wire input hits checked
+bounds instead of corrupting memory. The device side (AICPU/AICore) is blocked
+by toolchain/backend availability, not by merit, and the Python layer is best
+left as the ergonomic "when" layer. A pragmatic first step would be a single
+Rust crate replacing `src/common/hierarchical/` (scheduler + orchestrator +
+ring/tensormap/scope + remote_wire), exposed to the existing Python via PyO3 —
+leaving the CANN FFI host runtime and the device kernels in C++.
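The parsing-safety argument above can be made concrete with a minimal Rust sketch. It assumes a hypothetical frame layout (4-byte magic `SMPL`, little-endian `u16` version, little-endian `u32` payload length); none of these fields, constants, or names come from `remote_wire.cpp`, whose actual format is not shown in this review. What it illustrates is that slice access through `get` is bounds-checked at run time, so a truncated buffer or a length field that claims too much yields an `Err` rather than an out-of-bounds read.

```rust
// Hypothetical frame layout, assumed for illustration only:
//   bytes 0..4   magic b"SMPL"
//   bytes 4..6   version, u16 little-endian
//   bytes 6..10  payload length, u32 little-endian
//   bytes 10..   payload
const MAGIC: [u8; 4] = *b"SMPL";
const HEADER_LEN: usize = 10;
const SUPPORTED_VERSION: u16 = 1;
const MAX_PAYLOAD: usize = 1 << 20; // arbitrary cap chosen for this sketch

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Truncated { needed: usize, got: usize },
    BadMagic,
    UnsupportedVersion(u16),
    PayloadTooLarge(usize),
}

/// A decoded frame that borrows its payload from the input buffer,
/// so the payload cannot outlive the bytes it points into.
#[derive(Debug, PartialEq)]
pub struct Frame<'a> {
    pub version: u16,
    pub payload: &'a [u8],
}

pub fn decode_frame<'a>(buf: &'a [u8]) -> Result<Frame<'a>, DecodeError> {
    // `get` returns None instead of reading past the end of the buffer.
    let header = buf.get(..HEADER_LEN).ok_or(DecodeError::Truncated {
        needed: HEADER_LEN,
        got: buf.len(),
    })?;
    if header[..4] != MAGIC[..] {
        return Err(DecodeError::BadMagic);
    }
    let version = u16::from_le_bytes([header[4], header[5]]);
    if version != SUPPORTED_VERSION {
        return Err(DecodeError::UnsupportedVersion(version));
    }
    let len = u32::from_le_bytes([header[6], header[7], header[8], header[9]]) as usize;
    if len > MAX_PAYLOAD {
        return Err(DecodeError::PayloadTooLarge(len));
    }
    // No overflow here: len is at most MAX_PAYLOAD.
    let end = HEADER_LEN + len;
    // A length field larger than the bytes actually received is rejected.
    // Bytes after `end`, if any, are ignored by this sketch.
    let payload = buf.get(HEADER_LEN..end).ok_or(DecodeError::Truncated {
        needed: end,
        got: buf.len(),
    })?;
    Ok(Frame { version, payload })
}
```

A receive path could match on the result and, for example, drop the connection on `BadMagic` while waiting for more bytes on `Truncated`. The checks themselves run at run time; the compiler's contribution is that skipping them requires an explicit `unsafe` block.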
+```
diff --git a/docs/diagrams/architecture.svg b/docs/diagrams/architecture.svg
new file mode 100644
index 000000000..be67b7327
--- /dev/null
+++ b/docs/diagrams/architecture.svg
@@ -0,0 +1,11 @@
+[Generated SVG, markup not reproduced. Panels: "1 · Level model — the 7-layer hierarchy (L0–L6)"; "2 · Engine components (L3+) and the L2 three-program model"; "3 · Rust-suitability map — dominant reason per component".]
\ No newline at end of file
diff --git a/docs/diagrams/make_diagrams.py b/docs/diagrams/make_diagrams.py
new file mode 100644
index 000000000..e66561fef
--- /dev/null
+++ b/docs/diagrams/make_diagrams.py
@@ -0,0 +1,177 @@
+#!/usr/bin/env python3
+"""Render the SIMPLER architecture diagrams as a single self-contained SVG:
+  1. L0-L6 level model (the 7-layer hierarchy)
+  2. Three engine components + L2 three-program model
+  3. Rust-suitability map over the architecture
+Run:  python3 make_diagrams.py   ->  architecture.svg
+Pure stdlib; no external deps. Edit here and re-run to regenerate.
+"""
+import html, os
+
+W = 1180
+
+# ---- palette ----
+BG = "#0e1116"; PANEL = "#161b22"; INK = "#e6edf3"; MUTE = "#8b949e"
+LINE = "#30363d"
+ON_DEV = "#7c4a2d"  # on-device (L0-L2) accent
+HOST = "#1f4e6b"  # host/cluster accent
+STRONG = "#2ea043"  # rust strong
+MOD = "#9e6a1f"  # rust moderate
+WEAK = "#6e3b3b"  # rust weak/no
+ORCH = "#b7791f"; SCHED = "#2f6f4f"; WORK = "#3a5f8a"
+
+
+def esc(s): return html.escape(str(s))
+
+
+def box(x, y, w, h, fill, text, sub="", rx=8, tcol=INK, scol=MUTE, ts=15, anchor="middle", stroke=LINE):
+    cx = x + w / 2 if anchor == "middle" else x + 12
+    out = [f'']
+    ty = y + (h / 2 + ts / 2 - 3 if not sub else h / 2 - 4)
+    out.append(f'{esc(text)}')
+    if sub:
+        out.append(f'{esc(sub)}')
+    return "".join(out)
+
+
+def label(x, y, text, col=INK, ts=13, anchor="start", weight="400", mono=False):
+    fam = "DejaVu Sans Mono, monospace" if mono else "DejaVu Sans, sans-serif"
+    return f'{esc(text)}'
+
+
+def arrow(x1, y1, x2, y2, col=MUTE, w=1.6, dash=""):
+    d = f' stroke-dasharray="{dash}"' if dash else ""
+    return (f'')
+
+
+# ============================ Panel 1: L0-L6 level model ============================
+def panel_level():
+    h = 340
+    o = [f'1 · Level model — the 7-layer hierarchy (L0–L6)']
+    rows = [
+        ("L6", "CLOS2 / Cluster", "full cluster (N6 super-nodes)", "Worker(level=5) ×N", HOST),
+        ("L5", "CLOS1 / SuperNode", "super-node (N5 pods)", "Worker(level=4) ×N", HOST),
+        ("L4", "POD / Pod", "pod (4 hosts)", "Worker(level=3) ×N + Sub ×M", HOST),
+        ("L3", "HOST / Node", "one host (16 chips + M subs)", "ChipWorker ×N + SubWorker ×M", HOST),
+        ("L2", "CHIP / Processor", "one NPU chip (shared GM)", "Host.so + AICPU.so + AICore.o", ON_DEV),
+        ("L1", "DIE / L2Cache", "chip die", "hardware-managed", ON_DEV),
+        ("L0", "CORE / AIV, AIC", "individual compute core", "hardware-managed", ON_DEV),
+    ]
+    y0, rh = 56, 36
+    for i, (lv, name, unit, comp, accent) in enumerate(rows):
+        y = y0 + i * rh
+        o.append(f'')
+        o.append(label(50, y + 21, lv, INK, 14, "middle", "700"))
+        o.append(label(86, y + 21, name, INK, 14, weight="600"))
+        o.append(label(300, y + 21, unit, MUTE, 12.5))
+        o.append(label(600, y + 21, comp, INK, 12.5, mono=True))
+    # boundary line between L2 and L3 (after row idx 3)
+    by = y0 + 4 * rh - 3
+    o.append(f'')
+    o.append(label(W-195, by - 6, "◄ L2 BOUNDARY", STRONG, 12, "start", "700"))
+    # world brackets
+    o.append(label(W-195, y0 + 24, "HOST / CLUSTER", HOST, 12, "start", "700"))
+    o.append(label(W-195, y0 + 40, "Orch+Sched+Worker", MUTE, 11, mono=True))
+    o.append(label(W-195, y0 + 54, "IPC · RoCE · HCCS", MUTE, 11, mono=True))
+    o.append(label(W-195, by + 40, "ON-DEVICE", ON_DEV, 12, "start", "700"))
+    o.append(label(W-195, by + 56, "shared GM + atomics", MUTE, 11, mono=True))
+    return "".join(o), h
+
+
+# ============== Panel 2: three engine components + L2 three-program model ==============
+def panel_engine():
+    h = 470
+    o = [f'2 · Engine components (L3+) and the L2 three-program model']
+    # three engine boxes
+    o.append(box(30, 56, 360, 96, PANEL, "ORCHESTRATOR", "Orch thread · DAG builder", stroke=ORCH))
+    o.append(label(46, 118, "Ring · TensorMap · Scope", MUTE, 12, mono=True))
+    o.append(label(46, 136, "submit_next_level(c, args, cfg)", MUTE, 12, mono=True))
+    o.append(box(410, 56, 360, 96, PANEL, "SCHEDULER", "Scheduler thread · DAG executor", stroke=SCHED))
+    o.append(label(426, 118, "wiring → ready → completion queues", MUTE, 12, mono=True))
+    o.append(label(426, 136, "moves slot ids; never reads task data", MUTE, 12, mono=True))
+    o.append(box(790, 56, 360, 96, PANEL, "WORKER", "Worker threads · execution", stroke=WORK))
+    o.append(label(806, 118, "WorkerManager + WorkerThread pool", MUTE, 12, mono=True))
+    o.append(label(806, 136, "shm mailbox → forked child → poll", MUTE, 12, mono=True))
+    o.append(arrow(390, 104, 410, 104, ORCH, 2))
+    o.append(arrow(770, 104, 790, 104, SCHED, 2))
+    o.append(label(580, 100, "wiring", MUTE, 10, "middle", mono=True))
+    o.append(label(960, 100, "dispatch", MUTE, 10, "middle", mono=True))
+    o.append(arrow(790, 132, 770, 132, WORK, 1.4, "4 3"))
+    o.append(label(960, 146, "◄ completion (slot, outcome)", MUTE, 10, "middle", mono=True))
+
+    # L2 three-program model below
+    o.append(label(30, 196, "At L2 the Worker leaf = ChipWorker driving three chip-level programs:", MUTE, 13))
+    o.append(box(30, 214, 1120, 40, PANEL, "Python application / SceneTestCase — nanobind · ChipWorker (dlopen host.so) · RuntimeBuilder / KernelCompiler", "", ts=13))
+    o.append(arrow(590, 254, 590, 280, MUTE, 2))
+    o.append(box(30, 286, 540, 64, PANEL, "Host Runtime (C++ .so)", "DeviceRunner · MemoryAllocator · C API", stroke=WEAK))
+    o.append(box(610, 286, 540, 64, PANEL, "Binary data", "AICPU .so + AICore .o (dlopen'd / launched)", stroke=LINE))
+    o.append(arrow(300, 350, 300, 378, MUTE, 2))
+    o.append(arrow(880, 350, 880, 378, MUTE, 2))
+    o.append(box(30, 384, 1120, 70, "#10222b", "Ascend device", "", stroke=ON_DEV))
+    o.append(label(60, 412, "AICPU: task scheduler loop", INK, 13))
+    o.append(label(60, 432, "AICore (AIV/AIC): compute kernels", INK, 13))
+    o.append(label(720, 412, "◄── handshake buffers ──►", MUTE, 12, "middle", mono=True))
+    o.append(label(720, 432, "aicpu_ready · aicore_done · task ptr", MUTE, 11, "middle", mono=True))
+    return "".join(o), h
+
+
+# ==================== Panel 3: Rust-suitability map ====================
+def panel_rust():
+    rows = [
+        ("L3–L6 Python orchestration", "Python", WEAK, "░ keep Python", "ERGONOMICS"),
+        ("Scheduler (queues, dispatch loop)", "C++", STRONG, "██ STRONG", "CONCURRENCY-SAFETY"),
+        ("Orchestrator (Ring · TensorMap · Scope)", "C++", STRONG, "██ STRONG", "LIFETIME-SAFETY"),
+        ("Remote L3 wire codec / endpoint", "C++", STRONG, "██ STRONG", "PARSING-SAFETY"),
+        ("WorkerManager / mailbox dispatch", "C++", MOD, "▓ MODERATE", "CONCURRENCY-SAFETY"),
+        ("Host Runtime / DeviceRunner / C API", "C++", WEAK, "░ weak", "FFI-COST"),
+        ("nanobind bindings", "C++", WEAK, "░ weak", "INTEROP"),
+        ("AICPU scheduler kernel (device)", "C++", WEAK, "░ blocked", "TOOLCHAIN"),
+        ("AICore compute kernel (PTO ISA)", "C++/PTO", WEAK, "░ no", "NO-BACKEND"),
+    ]
+    rh = 34
+    h = 70 + len(rows) * rh + 30
+    o = [f'3 · Rust-suitability map — dominant reason per component']
+    o.append(label(30, 56, "COMPONENT", MUTE, 12, "start", "700"))
+    o.append(label(560, 56, "TODAY", MUTE, 12, "start", "700"))
+    o.append(label(660, 56, "RUST?", MUTE, 12, "start", "700"))
+    o.append(label(830, 56, "DOMINANT REASON", MUTE, 12, "start", "700"))
+    y0 = 66
+    for i, (comp, today, col, verdict, reason) in enumerate(rows):
+        y = y0 + i * rh
+        if i == 1:  # divider before the host engine block
+            o.append(f'')
+        if i == 7:  # device boundary
+            o.append(f'')
+            o.append(label(W-34, y+13, "L2 device boundary", ON_DEV, 11, "end", "700"))
+        o.append(f'')
+        o.append(label(50, y + 18, comp, INK, 13))
+        o.append(label(560, y + 18, today, MUTE, 12, mono=True))
+        o.append(label(660, y + 18, verdict, col if col != WEAK else MUTE, 12, mono=True, weight="600"))
+        o.append(label(830, y + 18, "[" + reason + "]", INK if col == STRONG else MUTE, 12, mono=True))
+    o.append(label(30, y0 + len(rows)*rh + 18,
+                   "Strong targets = the host coordination core: races, slot-lifetime UAF, untrusted wire bytes — the bug classes Rust is designed to prevent.",
+                   MUTE, 12))
+    return "".join(o), h
+
+
+panels = [panel_level(), panel_engine(), panel_rust()]
+gap = 28
+total_h = sum(h for _, h in panels) + gap * (len(panels) + 1)
+
+parts = [
+    f'',
+    f'',
+    f'',
+    f'',
+]
+cy = gap
+for frag, hh in panels:
+    parts.append(f'')
+    parts.append(f'{frag}')
+    cy += hh + gap
+parts.append("")
+
+out = os.path.join(os.path.dirname(os.path.abspath(__file__)), "architecture.svg")
+with open(out, "w", encoding="utf-8") as f:
+    f.write("\n".join(parts))
+print("wrote", out, f"({total_h}px tall)")