…sues
Closes the manual-diagnosis → manual-fix loop that took hours of operator
attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt
template prose as the blocker_code value and the escalation kept firing.
New `orchestrator/incident_scanner.py` runs on a fast cadence
(suggested every 6h) and:
1. Ingests three signal streams from the last N hours (default 24):
   - `runtime/incidents/incidents.jsonl` (sev-classified alerts)
   - `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
   - `runtime/audit/audit.jsonl` filtered to anomaly event types
     (`pr_e2e_terminal_close`, `work_verifier_override`,
     `stuck_pr_merge`).
2. Aggregates signals into stable signatures with example contexts.
3. Runs deterministic rule matchers first so known-bad patterns
   don't need an LLM. Two rules land with this PR:
   - `template_echo`: detects `.agent_result.md` template placeholder
     text ("One line. Required when STATUS...", "- bullet", ...)
     flowing into error_patterns.
   - `repeated_terminal_close`: detects the same blocker signature
     hitting pr_monitor's e2e terminal close ≥3 times, meaning
     re-spawn isn't solving the class and a code fix is needed.
4. Falls back to a single LLM call (claude-sonnet-4-6) for remaining
   recurring signatures the rules didn't classify. The LLM's output
   is constrained to structured issue proposals.
5. Dedupes against its own recent-action log AND against existing
   open agent-os issues with the same title, then creates GitHub
   issues tagged `ready` / `prio:high` / `bot-generated` /
   `autonomous-fix`. The dispatcher/groomer pipeline picks them up
   like any other work, closing the loop.
Scanner never edits code, merges PRs, or changes branches. It only
files issues; everything downstream is the existing pipeline's job.
Anchor test (`test_template_echo_incident_would_have_been_auto_detected`)
replays the actual escalation that required manual intervention today
and asserts the scanner would have filed the fix issue autonomously.
Suggested crontab entry (every 6h at :15):
    15 */6 * * * /path/to/agent-os/bin/run_incident_scanner.sh >> /path/to/agent-os/runtime/logs/incident_scanner.log 2>&1
(Keep the entry on a single line; crontab does not support backslash line continuation.)
Summary
Closes the manual-diagnosis → manual-fix loop that took hours of operator attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt template prose as the blocker_code value and the escalation kept firing.
New `orchestrator/incident_scanner.py` runs on a fast cadence (every 6h) and turns recurring runtime signals into self-fix GitHub issues that the dispatcher/groomer pipeline picks up like any other work — closing the loop between "something went wrong at runtime" and "agent fixes the class of bug."
What it does
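1. Ingests 3 signal streams for the last N hours (default 24):
   - `runtime/incidents/incidents.jsonl` (sev-classified alerts)
   - `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
   - `runtime/audit/audit.jsonl` filtered to anomaly event types (`pr_e2e_terminal_close`, `work_verifier_override`, `stuck_pr_merge`)
2. Aggregates signals into stable signatures with example contexts.
3. Deterministic rule matchers (no LLM) catch known-bad patterns. Two rules ship with this PR:
   - `template_echo`: detects `.agent_result.md` template placeholder text flowing into error_patterns.
   - `repeated_terminal_close`: detects the same blocker signature hitting pr_monitor's e2e terminal close ≥3 times, meaning re-spawn isn't solving the class and a code fix is needed.
4. LLM fallback (one `claude-sonnet-4-6` call per scan) for recurring signatures the rules didn't classify. Output is constrained to structured issue proposals.
5. Dedups against the recent scanner-action log AND existing open agent-os issues with the same title, then creates issues tagged `ready` / `prio:high` / `bot-generated` / `autonomous-fix`.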
The scanner never edits code, merges PRs, or changes branches. It only files issues; the existing pipeline handles the rest.
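The dedup-then-file step could look roughly like the sketch below. It is a sketch under assumptions, not the PR's implementation: `plan_issues` and the proposal dict shape (`title`, `body`) are hypothetical, and case-insensitive title matching is a choice made for the example (the description only says "same title"). The four labels are the ones listed above. The function only plans issues; actually creating them is left to whatever GitHub client the scanner uses.

```python
LABELS = ["ready", "prio:high", "bot-generated", "autonomous-fix"]


def plan_issues(proposals, recent_titles, open_issue_titles):
    """Drop proposals whose title was already filed or is already open.

    proposals: iterable of dicts with a 'title' and optional 'body'
    (hypothetical shape). Returns the issues that would be created.
    """
    # Assumption: titles are compared case-insensitively.
    seen = {t.casefold() for t in recent_titles}
    seen |= {t.casefold() for t in open_issue_titles}

    planned = []
    for proposal in proposals:
        key = proposal["title"].casefold()
        if key in seen:
            continue
        seen.add(key)  # also dedupes repeats within one scan
        planned.append({
            "title": proposal["title"],
            "body": proposal.get("body", ""),
            "labels": list(LABELS),
        })
    return planned
```

Keeping this step pure (titles in, planned issues out) makes it easy to exercise in a dry run without touching GitHub.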
Anchor test
`test_template_echo_incident_would_have_been_auto_detected` replays the actual escalation that required manual intervention today and asserts the scanner would have filed the fix issue autonomously — same diagnosis as the manual work, no operator in the loop.
Live dry-run validation
Ran against the real `runtime/` signals (48h window) on my workstation. Output:
```
Incident scanner: window=48h, target=kai-linux/agent-os
Aggregated 28 signal(s) into 27 signature(s).
Would file: [auto-fix] Agent echoed .agent_result.md template as blocker/summary
```
Exactly the issue I'd have filed by hand.
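For readers unfamiliar with the window parameter, the sketch below shows one way a JSONL stream could be filtered to the last N hours. It is illustrative only: the `recent_records` helper, the `timestamp` field name, and the ISO-8601 timestamp format are assumptions, since the actual record schema of `incidents.jsonl` and `audit.jsonl` is not shown here.

```python
import json
from datetime import datetime, timedelta, timezone


def recent_records(lines, now, window_hours=24, ts_key="timestamp"):
    """Return parsed JSONL records newer than now - window_hours.

    Malformed lines and records without a usable timestamp are skipped
    rather than aborting the scan (a design assumption for this sketch).
    """
    cutoff = now - timedelta(hours=window_hours)
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
            ts = datetime.fromisoformat(record[ts_key])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue
        if ts.tzinfo is None:
            # Assumption: naive timestamps are UTC.
            ts = ts.replace(tzinfo=timezone.utc)
        if ts >= cutoff:
            records.append(record)
    return records
```

With `window_hours=48`, as in the dry run above, a record stamped three days before `now` would be excluded.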
Install
Add to crontab after merge:
```
15 */6 * * * /home/kai/agent-os/bin/run_incident_scanner.sh >> /home/kai/agent-os/runtime/logs/incident_scanner.log 2>&1
```
Test plan
Opt-out
Set `AGENT_OS_INCIDENT_SCANNER_DISABLE_LLM=1` to run deterministic-only (no LLM calls).
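A guard for this opt-out might be as small as the sketch below. The environment variable name comes from the line above; the `llm_enabled` and `classify_remaining` helpers and the injected `llm_call` callable are hypothetical names used only for illustration, and the sketch assumes that only the exact value `1` disables the LLM.

```python
import os


def llm_enabled(env=None):
    """True unless AGENT_OS_INCIDENT_SCANNER_DISABLE_LLM is set to "1"."""
    env = os.environ if env is None else env
    return env.get("AGENT_OS_INCIDENT_SCANNER_DISABLE_LLM") != "1"


def classify_remaining(unclassified, llm_call, env=None):
    """Send leftover signatures to the LLM once, or skip when opted out.

    llm_call is a hypothetical callable taking the list of unclassified
    signatures and returning structured issue proposals.
    """
    if not unclassified or not llm_enabled(env):
        return []
    return llm_call(unclassified)
```

Passing `env` explicitly keeps the guard testable without mutating the process environment.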