EPIC: containerized GPU backend (podman + profiles) + dashboard rewire #652

@thinmintdev

What to build

Move GPU LLM slots from Lemonade-forked baremetal binaries to podman containers built by the nightly toolbox fork, selected via profiles (image + bench-tuned flags), dispatched through hal0's existing remote-upstream proxy. Then rewire the dashboard to a hybrid (container + lemond) model.

Design: hal0-container-runtime-design-2026-06-08.md. Bench basis: hal0-container-bench-2026-06-08.md.

Decisions locked: container runtime (bench parity: 52.8 tok/s containerized vs 53.6 tok/s baremetal); profiles = flag-bundles on shared images; podman; each slot OWNS its container (1:1); phase-1 = GPU LLM slots only (lemond keeps embed/rerank/stt/tts; NPU/FLM untouched).
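The phase-1 split can be sketched as a small routing table. This is only an illustration under stated assumptions: the role strings `agent` and `chat` are taken from the bench line in this issue, the lemond roles from the decision list, and the function name `backend_for` is hypothetical rather than anything in hal0.

```python
# Hypothetical sketch of the phase-1 backend split; not hal0's actual code.

# GPU LLM slot roles that move to podman containers (role names assumed
# from the bench notes in this issue).
GPU_LLM_ROLES = frozenset({"agent", "chat"})

# Roles the issue says lemond keeps in phase 1.
LEMOND_ROLES = frozenset({"embed", "rerank", "stt", "tts"})


def backend_for(role: str) -> str:
    """Return which backend serves a slot role in the hybrid model."""
    if role in GPU_LLM_ROLES:
        return "container"
    if role in LEMOND_ROLES:
        return "lemond"
    # NPU/FLM and anything else is out of scope for phase 1, so this
    # sketch refuses to guess rather than silently routing it.
    raise ValueError(f"role {role!r} is not covered by the phase-1 split")
```

Raising on unknown roles keeps the sketch honest about scope: NPU/FLM slots are untouched by this epic, so the table makes no claim about them.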

Per-slot optimal (bench): agent 35B MoE ace-saber = moe-rocmfp4 (MTP off) ~52.8 tok/s; chat 27B dense qwopus = dense-mtp-rocmfp4 (MTP on) ~24.4 tok/s.
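A rough sketch of how a profile (image + flag bundle) and the 1:1 slot-to-container rule might translate into a `podman run` invocation. Everything beyond the profile names and slot roles is an assumption made for illustration: the image tag, the container naming scheme, the container port 8080, the `/dev/kfd` and `/dev/dri` device passthrough (assumed because the profiles are ROCm-based), and especially the `--placeholder-mtp=...` flags, which stand in for the real bench-tuned flags and are not real server options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    image: str              # shared image; the tag used below is hypothetical
    flags: tuple[str, ...]  # bench-tuned server flags; placeholders here


# Hypothetical image reference; the issue does not name the image.
SHARED_IMAGE = "localhost/hal0-toolbox:nightly"

PROFILES = {
    "moe-rocmfp4": Profile(
        "moe-rocmfp4", SHARED_IMAGE, ("--placeholder-mtp=off",)
    ),
    "dense-mtp-rocmfp4": Profile(
        "dense-mtp-rocmfp4", SHARED_IMAGE, ("--placeholder-mtp=on",)
    ),
}

# Per-slot optimal profile, as stated in the bench line above.
SLOT_PROFILE = {"agent": "moe-rocmfp4", "chat": "dense-mtp-rocmfp4"}


def podman_argv(slot: str, host_port: int) -> list[str]:
    """Build the argv for the one container owned by this slot."""
    profile = PROFILES[SLOT_PROFILE[slot]]
    return [
        "podman", "run", "--detach",
        # One container per slot: the name is derived from the slot alone.
        "--name", f"hal0-slot-{slot}",
        # Assumed ROCm device passthrough.
        "--device", "/dev/kfd",
        "--device", "/dev/dri",
        # Publish on loopback so the proxy can reach it as an upstream.
        "--publish", f"127.0.0.1:{host_port}:8080",
        profile.image,
        *profile.flags,
    ]
```

For example, `podman_argv("agent", 9001)` yields a command whose container name is `hal0-slot-agent` and whose trailing arguments are the `moe-rocmfp4` flag bundle. Keeping flags in the profile rather than in the image is what lets both slots share one image.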

Child slices

Tracked as separate issues (see Blocked-by chains). This epic is the umbrella; do not close it until all children are merged and the live cutover is done.

🤖 Generated with Claude Code

Labels: enhancement (New feature or request); slots (Slot roles / model assignment / perf tuning)
