Refresh README: e2e PR health, incident scanner, new init flow, retired agents#345

Merged
kai-linux merged 2 commits into main from docs/readme-refresh-2026-04-23
Apr 24, 2026

Conversation

@kai-linux
Owner

Summary

Sync the README with system state after the 2026-04-23 work (PRs #333#344).

Changes

  • Agent pool — active pool is now Claude + Codex (Gemini and DeepSeek retired after quality review, per earlier operator decision)
  • "The Loop" diagram — shows the four improvement loops (log analyzer, groomer, planner, incident scanner) feeding the backlog, and names pr_monitor's e2e-health terminal-close step
  • "Recursive Self-Improvement" — lists the two new acute loops: incident scanner every 6h, pr_monitor e2e health every 5 min. Notes that merged agent PRs now auto-close their linked issue via Closes #N
  • "Option C: Bootstrap From Scratch" — now walks through all 8 interactive init steps, including the new tuning-cadence prompt (Init: preserve existing config, prompt for cadence, write supporting docs #342) and the additional charter documents (NORTH_STAR / VISION / STRATEGY / PLANNING_PRINCIPLES). Calls out that re-running init preserves an existing config.yaml
  • "Optional: set up cron" — expanded to mirror the actual installed crontab layout, including run_incident_scanner.sh
  • "How It Works" table — adds incident_scanner.py, work_verifier.py, product_inspector.py, daily_digest.py; updates pr_monitor.py to mention e2e health
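The `Closes #N` auto-close convention mentioned above can be illustrated with a minimal sketch. It assumes a simplified keyword pattern (GitHub's closing keywords followed by `#N`); the function name `linked_issues` is hypothetical and not taken from the agent-os codebase.

```python
import re

# Closing keywords GitHub recognises: close/closes/closed, fix/fixes/fixed,
# resolve/resolves/resolved. The pattern below is a simplification.
_CLOSING = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"
_PATTERN = re.compile(rf"\b{_CLOSING}\s+#(\d+)", re.IGNORECASE)


def linked_issues(pr_body: str) -> list[int]:
    """Return issue numbers referenced with a closing keyword, in order, without duplicates."""
    seen: list[int] = []
    for match in _PATTERN.finditer(pr_body):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen
```

For example, `linked_issues("Closes #12\nfixes #7")` yields `[12, 7]`, while a plain mention such as `Refs #5` is ignored.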

…sues

Closes the manual-diagnosis → manual-fix loop that took hours of operator
attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt
template prose as the blocker_code value and the escalation kept firing.

New `orchestrator/incident_scanner.py` runs on a fast cadence
(suggested every 6h) and:

1. Ingests three signal streams from the last N hours (default 24):
   - `runtime/incidents/incidents.jsonl` (sev-classified alerts)
   - `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
   - `runtime/audit/audit.jsonl` filtered to anomaly event types
     (`pr_e2e_terminal_close`, `work_verifier_override`,
     `stuck_pr_merge`).

2. Aggregates signals into stable signatures with example contexts.

3. Runs deterministic rule matchers first so known-bad patterns
   don't need an LLM. Two rules land with this PR:
   - `template_echo`: detects `.agent_result.md` template placeholder
     text ("One line. Required when STATUS...", "- bullet", ...)
     flowing into error_patterns.
   - `repeated_terminal_close`: detects the same blocker signature
     hitting pr_monitor's e2e terminal close ≥3 times, meaning
     re-spawn isn't solving the class and a code fix is needed.

4. Falls back to a single LLM call (claude-sonnet-4-6) for remaining
   recurring signatures the rules didn't classify. The LLM's output
   is constrained to structured issue proposals.

5. Dedupes against its own recent-action log AND against existing
   open agent-os issues with the same title, then creates GitHub
   issues tagged `ready` / `prio:high` / `bot-generated` /
   `autonomous-fix`. The dispatcher/groomer pipeline picks them up
   like any other work — closing the loop.
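Steps 2 and 3 above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `orchestrator/incident_scanner.py`: the signal fields (`event_type`, `blocker`), the normalisation scheme, and all function names are hypothetical. Only the rule names, the placeholder strings, and the threshold of 3 come from the description.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Signature:
    key: str                                          # normalised blocker text
    by_event: Counter = field(default_factory=Counter)  # occurrences per event type
    examples: list = field(default_factory=list)      # a few raw example contexts


def normalise(text: str) -> str:
    """Lowercase, mask digits, collapse whitespace so recurring messages share a key."""
    text = re.sub(r"\d+", "<n>", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def aggregate(signals: list, max_examples: int = 3) -> dict:
    """Group raw signals (dicts with assumed keys) into stable signatures."""
    out: dict = {}
    for s in signals:
        sig = out.setdefault(normalise(s["blocker"]), Signature(normalise(s["blocker"])))
        sig.by_event[s["event_type"]] += 1
        if len(sig.examples) < max_examples:
            sig.examples.append(s["blocker"])
    return out


# Placeholder fragments quoted in the rule description, lowercased.
TEMPLATE_MARKERS = ("one line. required when status", "- bullet")


def template_echo(sig: Signature) -> bool:
    return any(marker in sig.key for marker in TEMPLATE_MARKERS)


def repeated_terminal_close(sig: Signature, threshold: int = 3) -> bool:
    return sig.by_event["pr_e2e_terminal_close"] >= threshold


def classify(sig: Signature) -> Optional[str]:
    """Return the first matching rule name; None means the signature would go to the LLM fallback."""
    if template_echo(sig):
        return "template_echo"
    if repeated_terminal_close(sig):
        return "repeated_terminal_close"
    return None
```

Running the deterministic rules first means a signature reaches the LLM fallback only when `classify` returns `None`.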

Scanner never edits code, merges PRs, or changes branches. It only
files issues; everything downstream is the existing pipeline's job.
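The dedupe-then-file behaviour of step 5, together with the issues-only constraint, might look like the sketch below. `Proposal` and `plan_issues` are hypothetical names, and comparing titles case-insensitively is an assumption (the description only says "same title"). The function returns payloads and performs no side effects, so creating the issues is left to the caller.

```python
from dataclasses import dataclass

# Labels named in the commit message.
LABELS = ["ready", "prio:high", "bot-generated", "autonomous-fix"]


@dataclass(frozen=True)
class Proposal:
    title: str
    body: str


def _key(title: str) -> str:
    # Assumption: titles are compared after trimming and lowercasing.
    return title.strip().lower()


def plan_issues(proposals, recent_titles, open_titles):
    """Return issue payloads for proposals not in the recent-action log or among open issues."""
    blocked = {_key(t) for t in recent_titles} | {_key(t) for t in open_titles}
    payloads = []
    for p in proposals:
        key = _key(p.title)
        if key in blocked:
            continue
        blocked.add(key)  # also dedupe within the current batch
        payloads.append({"title": p.title, "body": p.body, "labels": list(LABELS)})
    return payloads
```

Keeping the planning step pure makes the "only files issues" guarantee easy to test: nothing in this path touches code, branches, or PRs.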

An anchor test (`test_template_echo_incident_would_have_been_auto_detected`)
replays the actual escalation that required manual intervention on
2026-04-23 and asserts the scanner would have filed the fix issue
autonomously.

Suggested crontab entry (every 6h at :15; cron does not support backslash
line continuation, so keep the entry on a single line):
    15 */6 * * * /path/to/agent-os/bin/run_incident_scanner.sh >> /path/to/agent-os/runtime/logs/incident_scanner.log 2>&1
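
To make the `15 */6` schedule concrete, here is a small sketch that expands the minute and hour fields into daily fire times. It handles only `*`, `*/step`, and single numbers, which is a deliberate simplification of cron syntax; the helper names are hypothetical.

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand a cron field limited to '*', '*/step', or a single number."""
    if field == "*":
        return list(range(lo, hi + 1))
    if field.startswith("*/"):
        return list(range(lo, hi + 1, int(field[2:])))
    return [int(field)]


def daily_fire_times(minute: str, hour: str) -> list[str]:
    """List HH:MM times at which the minute/hour fields match within one day."""
    return [
        f"{h:02d}:{m:02d}"
        for h in expand_field(hour, 0, 23)
        for m in expand_field(minute, 0, 59)
    ]


print(daily_fire_times("15", "*/6"))  # ['00:15', '06:15', '12:15', '18:15']
```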
…ed agents

Sync the README with the current system state after the 2026-04-23 work:

- Agent pool updated — Gemini and DeepSeek retired after quality review;
  active pool is now Claude + Codex.
- "The Loop" diagram shows four improvement loops feeding back to the
  backlog (log analyzer, groomer, planner, incident scanner) and names
  pr_monitor's e2e health step.
- "Recursive Self-Improvement" lists the two new acute loops: incident
  scanner every 6h, pr_monitor e2e health every 5 min. Notes that merged
  agent PRs now auto-close their linked issue via `Closes #N`.
- "Option C: Bootstrap From Scratch" now walks through all 8 init steps,
  including the new tuning-cadence prompt and the additional charter
  documents (NORTH_STAR / VISION / STRATEGY / PLANNING_PRINCIPLES).
  Calls out that re-running init preserves an existing config.yaml.
- "Optional: set up cron" expanded to mirror the actual installed
  crontab (dispatcher/queue/pr_monitor/telegram control every minute;
  groomer/planner hourly; incident_scanner every 6h; weekly
  scorer+log_analyzer; daily digest+product_inspector).
- "How It Works" table adds incident_scanner.py, work_verifier.py,
  product_inspector.py, daily_digest.py and updates pr_monitor.py to
  mention e2e health.
kai-linux merged commit 664dbfe into main on Apr 24, 2026
2 checks passed