
fix(swarm): make the multi-persona campaign actually work (stdout was corrupting ACP) #16

Merged
New1Direction merged 1 commit into main from feat/swarm-campaign-live on Jun 15, 2026


@New1Direction (Owner)

What

The campaign's "theatrical swarm" was a stdout-pollution protocol bug, not fake work. Every korg worker subprocess was doing real work (git worktree, applied patch, real measured diff, signed results) and succeeding — but the leader recorded each one as crashed (exit_code=-1).

Root cause: the leader parses each worker's stdout as newline-delimited ACP/JSON. Log output was going to stdout too, so the worker's first reply was preceded by a tracing line (2026-…) → the leader's first read_acp_envelope failed to parse → EOF → false crash, before reading any result. This hit every persona in both deterministic and ollama campaigns. SP1–SP4 fixed the in-process paths; this was the subprocess path, invisible until a real campaign's worker stdout was read.
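For illustration, the failure mode can be reproduced with a std-only sketch. Everything here is hypothetical: `read_first_envelope` is a stand-in for the leader's read (not the real `read_acp_envelope`), the "looks like a JSON object" check stands in for a real JSON parse, and the log line is made up.

```rust
/// Stand-in for a real JSON parse (assumption: the real leader uses a proper
/// JSON parser). Accepts a line only if it looks like a JSON object.
fn looks_like_envelope(line: &str) -> bool {
    let t = line.trim();
    t.starts_with('{') && t.ends_with('}')
}

/// Read the first newline-delimited frame from a worker's captured stdout.
/// Anything other than an envelope on the first line is a protocol error,
/// which is how a single log line turns a successful worker into a "crash".
fn read_first_envelope(stdout: &str) -> Result<&str, String> {
    match stdout.lines().next() {
        None => Err("EOF before any envelope".to_string()),
        Some(line) if looks_like_envelope(line) => Ok(line),
        Some(line) => Err(format!("not an ACP envelope: {line:?}")),
    }
}
```

With a clean stream the first line is returned as the envelope; with a log line in front of it, the read fails even though a valid report follows on the next line.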

The fix (stdout discipline)

  • korg-core/telemetry.rs: init_tracing had no .with_writer → defaulted to stdout (while ironically checking stderr for ANSI). Logs now go to stderr.
  • korg-runtime/harness.rs: a stray [TelemetryEmitter] println! → eprintln!; the legacy run() path's println!s → eprintln! (same latent hazard).
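A minimal sketch of the discipline, assuming a hypothetical `emit_result` helper (not a function in this repo): protocol frames and diagnostics go to separate writers, so nothing but envelopes can reach the protocol stream. For the tracing side, the tracing-subscriber fmt builder accepts `.with_writer(std::io::stderr)`, which is the kind of call `init_tracing` was missing.

```rust
use std::io::{self, Write};

/// Hypothetical worker step: diagnostics go to `log`, protocol frames go to
/// `proto`. Keeping the two writers separate is the whole point.
fn emit_result<P: Write, L: Write>(proto: &mut P, log: &mut L, status: &str) -> io::Result<()> {
    // Human-readable output: never on the protocol stream.
    writeln!(log, "worker: sending termination report")?;
    // One newline-delimited JSON object per frame (simplified envelope).
    writeln!(proto, "{{\"status\":\"{status}\"}}")?;
    proto.flush()
}
```

In a real worker the call would look like `emit_result(&mut io::stdout().lock(), &mut io::stderr().lock(), "success")`; passing in-memory buffers instead makes the separation easy to assert in a test.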

The worker already sent the authoritative TerminationReport (real success/doom_loop status + terminal_tx_id) — it was just never readable. (An earlier extra report I added was a misdiagnosis; review caught the double-send and it was removed. The e2e still passes, proving the stdout fix alone is the cure.)

Campaign provider selection + live reliability

  • src/main.rs: global --provider/--model/--base-url, exported as env at startup so every worker subprocess builds the selected provider via KorgConfig::load() (no config threading; covers TUI/web). run-once unified onto these flags (no behavior change: no flag → deterministic).
  • korg-runtime/session.rs: worker spawn forwards the LLM env to the child.
  • korg-runtime/personas.rs: implementer personas (Benjamin/Lucas) request response_format: json_object so live models reliably emit a parseable mutations block; prose personas unchanged; deterministic stub ignores it.
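The flag-to-environment hand-off can be sketched as below. This is an illustration under assumptions: the variable names (`KORG_PROVIDER`, `KORG_MODEL`, `KORG_BASE_URL`) and both helpers are hypothetical, since the PR does not state which variables `KorgConfig::load()` reads. The sketch also sets the variables directly on the child command rather than exporting them in the parent first.

```rust
use std::process::Command;

// Hypothetical variable names, chosen only for this example.
const PROVIDER_VAR: &str = "KORG_PROVIDER";
const MODEL_VAR: &str = "KORG_MODEL";
const BASE_URL_VAR: &str = "KORG_BASE_URL";

/// Turn optional CLI flags into environment pairs.
/// No flag means no variable, so the child keeps its deterministic default.
fn llm_env(
    provider: Option<&str>,
    model: Option<&str>,
    base_url: Option<&str>,
) -> Vec<(&'static str, String)> {
    vec![(PROVIDER_VAR, provider), (MODEL_VAR, model), (BASE_URL_VAR, base_url)]
        .into_iter()
        .filter_map(|(key, value)| value.map(|v| (key, v.to_string())))
        .collect()
}

/// Build (but do not spawn) a worker command that carries the LLM env.
fn worker_command(program: &str, env: &[(&'static str, String)]) -> Command {
    let mut cmd = Command::new(program);
    cmd.envs(env.iter().map(|(k, v)| (*k, v.as_str())));
    cmd
}
```

Passing configuration through the environment means the spawn site only has to forward a few variables, which matches the "no config threading" note above.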

Proven

  • Deterministic campaign: all 4 workers terminate success, Benjamin attests a real measured mutations=1, DAG data-flow real (Benjamin's payload carries Captain+Harper output; Lucas's carries Benjamin's mutation).
  • Live ollama campaign: all personas complete on a real model (~5 min real latency); an implementer produced a real mutations=1.

Test plan

  • Gated e2e tests/campaign_e2e.rs (#[ignore]): runs the real korg campaign, asserts workers complete, none crash, Benjamin attests mutations=1 (passes in 71s locally)
  • cargo test -p korg-runtime: 143 pass; cargo test -p korg-core passes
  • fmt + clippy(correctness) clean
  • Independent Rust review (caught + fixed the double-TerminationReport)
  • CI green on the branch (timeout-guarded; the campaign e2e is #[ignore], not run in CI)

Also

  • Renames the non-standard Tests/ → tests/ so integration tests are discoverable cross-platform.

Honesty unchanged: the mutations count is the real git diff measurement; this only fixes whether a worker is recorded as done and which provider it uses.

Known pre-existing follow-ups (not introduced here)

  • doom_loop exit status maps to crashed=true (session.rs:467 / workers.rs:870) — a controlled doom-loop signal shouldn't be a crash.
  • The post-work read_acp_envelope in run_as_stdio_worker has no timeout (works in practice — the leader drops stdin → EOF).
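One possible shape for the first follow-up, as a sketch only: nothing below is implemented in this PR, and `WorkerOutcome` and `classify` are hypothetical names. It assumes the report's status arrives as the strings `success` and `doom_loop`, and treats a missing or unreadable report as the only real crash.

```rust
/// Hypothetical three-way outcome, so a controlled doom-loop stop is not
/// lumped in with a genuine crash.
#[derive(Debug, PartialEq)]
enum WorkerOutcome {
    Success,
    DoomLoop,
    Crashed,
}

/// `report_status` is the status from a readable TerminationReport,
/// or `None` when no report could be read from the worker.
fn classify(report_status: Option<&str>) -> WorkerOutcome {
    match report_status {
        Some("success") => WorkerOutcome::Success,
        Some("doom_loop") => WorkerOutcome::DoomLoop,
        // No report, or an unknown status: this is the real crash case.
        _ => WorkerOutcome::Crashed,
    }
}
```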

…s corrupting ACP

The campaign's "theatrical swarm" was a protocol bug, not fake work. Every
worker subprocess was doing real work (git worktree, applied patch, measured
diff, signed results) and SUCCEEDING — but the leader recorded each one as
`crashed` (exit_code=-1). The cause: stdout pollution corrupting the ACP
envelope channel the leader parses as JSON. The worker's first reply was
preceded by a log line ("2026-…") so the leader's first read failed to parse,
hit EOF, and stamped a false crash before reading any result. This affected
every persona in both deterministic and ollama campaigns.

The worker already sent a correct TerminationReport (with the real
success/doom_loop status) — it was simply never readable. The real fixes are
to stdout discipline:
- korg-core/telemetry.rs: init_tracing defaulted to STDOUT (while checking
  stderr for ANSI). Logs now go to STDERR; stdout stays a clean ACP channel.
- korg-runtime/harness.rs: stray `[TelemetryEmitter]` println! -> eprintln!;
  the legacy run() path's println!s -> eprintln! (same latent hazard).

Campaign provider selection + live reliability:
- src/main.rs: global --provider/--model/--base-url, exported as env at startup
  so every worker subprocess builds the selected provider via KorgConfig::load()
  (no config threading; covers TUI/web). run-once unified onto these flags
  (behavior unchanged: no flag -> deterministic).
- korg-runtime/session.rs: worker spawn forwards the LLM env to the child.
- korg-runtime/personas.rs: implementer personas (Benjamin/Lucas) request
  response_format=json_object so live models reliably emit a parseable mutations
  block; prose personas unchanged; the deterministic stub ignores it.

Proven: deterministic campaign — all workers terminate success, Benjamin
attests a real measured mutations=1, DAG data-flow real. Live ollama campaign —
all personas complete on a real model; an implementer produced a real mutation.
Gated e2e tests/campaign_e2e.rs guards it. Also renames the non-standard Tests/
-> tests/ so the integration test is discoverable cross-platform.

Honesty unchanged: the mutations count is the real git-diff measurement; this
only fixes whether a worker is RECORDED as done and which provider it uses.
@New1Direction New1Direction merged commit b85820b into main Jun 15, 2026
6 checks passed
@New1Direction New1Direction deleted the feat/swarm-campaign-live branch June 15, 2026 09:22