What to build
Move GPU LLM slots from Lemonade-forked baremetal binaries to podman containers built by the nightly toolbox fork, selected via profiles (image + bench-tuned flags), dispatched through hal0's existing remote-upstream proxy. Then rewire the dashboard to a hybrid (container + lemond) model.
Design: hal0-container-runtime-design-2026-06-08.md. Bench basis: hal0-container-bench-2026-06-08.md.
Decisions locked: container runtime (bench parity: 52.8 tok/s in a container vs 53.6 tok/s baremetal, about 1.5% lower); profiles are flag bundles on shared images; podman as the container engine; each slot owns its container (1:1); phase 1 covers GPU LLM slots only (lemond keeps embed/rerank/stt/tts; NPU/FLM untouched).
Per-slot optimal (bench): agent 35B MoE ace-saber = moe-rocmfp4 (MTP off) ~52.8 tok/s; chat 27B dense qwopus = dense-mtp-rocmfp4 (MTP on) ~24.4 tok/s.
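A minimal sketch of how the "profile = image + flag bundle" and "slot owns container (1:1)" decisions could be modelled. The profile names and the slot-to-profile mapping come from the bench results above. The image reference, the flag strings, the container naming scheme, and every function name here are hypothetical placeholders, not the actual hal0 implementation. Only the `podman run --rm --name` invocation shape is standard podman usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A named bundle of bench-tuned flags on top of a shared image."""
    name: str
    image: str               # container image reference (placeholder value below)
    flags: tuple[str, ...]   # server flags (placeholder values below)


# ASSUMPTION: one shared image for all profiles; the real reference is not
# given in this issue, so this string is a placeholder.
SHARED_IMAGE = "localhost/hal0-toolbox:nightly"

# Profile names are from the bench notes; the flag strings are placeholders,
# not real flags of any server binary.
PROFILES = {
    "moe-rocmfp4": Profile("moe-rocmfp4", SHARED_IMAGE, ("--placeholder-mtp-off",)),
    "dense-mtp-rocmfp4": Profile("dense-mtp-rocmfp4", SHARED_IMAGE, ("--placeholder-mtp-on",)),
}

# Per-slot optimal profile, as stated in the bench notes.
SLOT_PROFILE = {
    "agent": "moe-rocmfp4",        # 35B MoE, MTP off
    "chat": "dense-mtp-rocmfp4",   # 27B dense, MTP on
}


def container_name(slot: str) -> str:
    """One container per slot (1:1), so the name is derived from the slot alone."""
    return f"hal0-slot-{slot}"  # hypothetical naming scheme


def podman_argv(slot: str) -> list[str]:
    """Build the podman command line for a slot from its selected profile."""
    profile = PROFILES[SLOT_PROFILE[slot]]
    return [
        "podman", "run", "--rm",
        "--name", container_name(slot),
        profile.image,
        *profile.flags,
    ]
```

Under these assumptions, `podman_argv("agent")` and `podman_argv("chat")` differ only in container name and flags while sharing one image, which is the point of keeping profiles as flag bundles rather than per-slot images.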
Child slices
Tracked as separate issues (see Blocked-by chains). This epic is the umbrella; do not close it until all children are merged and the live cutover is done.
🤖 Generated with Claude Code