Refresh README: e2e PR health, incident scanner, new init flow, retired agents#345
Merged
…sues
Closes the manual-diagnosis → manual-fix loop that took hours of operator
attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt
template prose as the blocker_code value and the escalation kept firing.
New `orchestrator/incident_scanner.py` runs on a fast cadence
(suggested every 6h) and:
1. Ingests three signal streams from the last N hours (default 24):
   - `runtime/incidents/incidents.jsonl` (sev-classified alerts)
   - `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
   - `runtime/audit/audit.jsonl` filtered to anomaly event types
     (`pr_e2e_terminal_close`, `work_verifier_override`,
     `stuck_pr_merge`).
2. Aggregates signals into stable signatures with example contexts.
3. Runs deterministic rule matchers first so known-bad patterns
   don't need an LLM. Two rules land with this PR:
   - `template_echo`: detects `.agent_result.md` template placeholder
     text ("One line. Required when STATUS...", "- bullet", ...)
     flowing into error_patterns.
   - `repeated_terminal_close`: detects the same blocker signature
     hitting pr_monitor's e2e terminal close ≥3 times, meaning
     re-spawn isn't solving the class and a code fix is needed.
4. Falls back to a single LLM call (claude-sonnet-4-6) for remaining
   recurring signatures the rules didn't classify. The LLM's output
   is constrained to structured issue proposals.
5. Dedupes against its own recent-action log AND against existing
   open agent-os issues with the same title, then creates GitHub
   issues tagged `ready` / `prio:high` / `bot-generated` /
   `autonomous-fix`. The dispatcher/groomer pipeline picks them up
   like any other work, closing the loop.
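As a rough illustration of step 3, the two deterministic rules could be sketched as below. This is not the code in `orchestrator/incident_scanner.py`: the `Signal` dataclass, the function names, and the plain substring matching are assumptions made for the example. Only the two placeholder strings, the `pr_e2e_terminal_close` event type, and the threshold of 3 come from the description above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List

# Placeholder fragments quoted in the commit message; the real rule may
# check more of them (the message elides the rest with "...").
TEMPLATE_PLACEHOLDERS = ("One line. Required when STATUS", "- bullet")


@dataclass(frozen=True)
class Signal:
    """Hypothetical normalized record for one ingested signal."""
    event_type: str   # e.g. "pr_e2e_terminal_close"
    signature: str    # stable blocker signature from the aggregation step
    text: str         # raw context the signal was extracted from


def match_template_echo(signals: Iterable[Signal]) -> List[Signal]:
    """Return signals whose text contains a template placeholder fragment."""
    return [
        s for s in signals
        if any(fragment in s.text for fragment in TEMPLATE_PLACEHOLDERS)
    ]


def match_repeated_terminal_close(
    signals: Iterable[Signal], threshold: int = 3
) -> List[str]:
    """Return signatures seen in at least `threshold` e2e terminal closes."""
    counts = Counter(
        s.signature for s in signals
        if s.event_type == "pr_e2e_terminal_close"
    )
    return sorted(sig for sig, n in counts.items() if n >= threshold)
```

Both matchers are pure functions over already-ingested signals, so they can run before any LLM call and be tested by replaying recorded escalations.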
Scanner never edits code, merges PRs, or changes branches. It only
files issues; everything downstream is the existing pipeline's job.
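The dedupe-then-file behaviour in step 5 could look roughly like the sketch below. It assumes an `IssueProposal` dataclass, a `select_new_proposals` helper, and title matching after whitespace normalization; all three are hypothetical. Only the four labels and the "same title" dedupe rule come from the description above. The function only decides what to file and has no side effects, in keeping with the scanner never touching code or branches.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Labels named in the commit message.
ISSUE_LABELS: Tuple[str, ...] = (
    "ready", "prio:high", "bot-generated", "autonomous-fix",
)


@dataclass(frozen=True)
class IssueProposal:
    """Hypothetical structured proposal from a rule or the LLM fallback."""
    title: str
    body: str
    labels: Tuple[str, ...] = ISSUE_LABELS


def _title_key(title: str) -> str:
    # Assumption: titles are compared after collapsing whitespace.
    return " ".join(title.split())


def select_new_proposals(
    proposals: Iterable[IssueProposal],
    recent_action_titles: Iterable[str],
    open_issue_titles: Iterable[str],
) -> List[IssueProposal]:
    """Drop proposals already filed recently or already open as issues."""
    seen = {_title_key(t) for t in recent_action_titles}
    seen |= {_title_key(t) for t in open_issue_titles}
    fresh: List[IssueProposal] = []
    for proposal in proposals:
        key = _title_key(proposal.title)
        if key in seen:
            continue
        seen.add(key)  # also dedupes within the current batch
        fresh.append(proposal)
    return fresh
```

Keeping selection separate from the GitHub call means the dedupe logic can be unit-tested without network access.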
Anchor test (`test_template_echo_incident_would_have_been_auto_detected`)
replays the actual escalation that required manual intervention today
and asserts the scanner would have filed the fix issue autonomously.
Suggested crontab entry (every 6h at :15; keep it on a single line, since
crontab does not support backslash line continuation):
    15 */6 * * * /path/to/agent-os/bin/run_incident_scanner.sh >> /path/to/agent-os/runtime/logs/incident_scanner.log 2>&1
…ed agents
Sync the README with the current system state after the 2026-04-23 work:
- Agent pool updated: Gemini and DeepSeek retired after quality review;
  active pool is now Claude + Codex.
- "The Loop" diagram shows four improvement loops feeding back to the
  backlog (log analyzer, groomer, planner, incident scanner) and names
  pr_monitor's e2e health step.
- "Recursive Self-Improvement" lists the two new acute loops: incident
  scanner every 6h, pr_monitor e2e health every 5 min. Notes that merged
  agent PRs now auto-close their linked issue via `Closes #N`.
- "Option C: Bootstrap From Scratch" now walks through all 8 init steps,
  including the new tuning-cadence prompt and the additional charter
  documents (NORTH_STAR / VISION / STRATEGY / PLANNING_PRINCIPLES). Calls
  out that re-running init preserves an existing config.yaml.
- "Optional: set up cron" expanded to mirror the actual installed crontab
  (dispatcher/queue/pr_monitor/telegram control every minute;
  groomer/planner hourly; incident_scanner every 6h; weekly
  scorer+log_analyzer; daily digest+product_inspector).
- "How It Works" table adds incident_scanner.py, work_verifier.py,
  product_inspector.py, daily_digest.py and updates pr_monitor.py to
  mention e2e health.
Summary
Sync the README with system state after the 2026-04-23 work (PRs #333–#344).
Changes
- `Closes #N`
- `config.yaml`
- `run_incident_scanner.sh`
- `incident_scanner.py`, `work_verifier.py`, `product_inspector.py`, `daily_digest.py`; updates `pr_monitor.py` to mention e2e health