Incident scanner: autonomous loop from runtime signals to self-fix issues by kai-linux · Pull Request #344 · kai-linux/agent-os

kai-linux · 2026-04-24T07:20:11Z

Summary

Closes the manual-diagnosis → manual-fix loop that took hours of operator attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt template prose as the blocker_code value and the escalation kept firing.

New `orchestrator/incident_scanner.py` runs on a fast cadence (every 6h) and turns recurring runtime signals into self-fix GitHub issues that the dispatcher/groomer pipeline picks up like any other work — closing the loop between "something went wrong at runtime" and "agent fixes the class of bug."

What it does

Ingests 3 signal streams for the last N hours (default 24):
- `runtime/incidents/incidents.jsonl` (sev-classified alerts)
- `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
- `runtime/audit/audit.jsonl` filtered to anomaly event types (`pr_e2e_terminal_close`, `work_verifier_override`, `stuck_pr_merge`)
Aggregates signals into stable signatures with example contexts.
Deterministic rule matchers (no LLM) catch known-bad patterns. Two rules ship with this PR:
- `template_echo`: detects `.agent_result.md` prose placeholders ("One line. Required when STATUS...", "- bullet", ...) flowing into error_patterns.
- `repeated_terminal_close`: detects the same blocker signature hitting pr_monitor's e2e terminal close ≥3 times.
LLM fallback (one `claude-sonnet-4-6` call per scan) for recurring signatures the rules didn't classify. Output is constrained to structured issue proposals.
Dedup against recent scanner-action log AND existing open agent-os issues, then creates issues tagged `ready` / `prio:high` / `bot-generated` / `autonomous-fix`.

The scanner never edits code, merges PRs, or changes branches. It only files issues; the existing pipeline handles the rest.

Anchor test

`test_template_echo_incident_would_have_been_auto_detected` replays the actual escalation that required manual intervention today and asserts the scanner would have filed the fix issue autonomously — same diagnosis as the manual work, no operator in the loop.

Live dry-run validation

Ran against the real `runtime/` signals (48h window) on my workstation. Output:
```
Incident scanner: window=48h, target=kai-linux/agent-os
Aggregated 28 signal(s) into 27 signature(s).
Would file: [auto-fix] Agent echoed .agent_result.md template as blocker/summary
```
Exactly the issue I'd have filed by hand.

Install

Add to crontab after merge:
```
15 */6 * * * /home/kai/agent-os/bin/run_incident_scanner.sh >> /home/kai/agent-os/runtime/logs/incident_scanner.log 2>&1
```

Test plan

`pytest tests/test_incident_scanner.py` — 10 passed
`pytest tests/` minus slow queue suite — 594 passed
Dry-run against live runtime/ — produced expected output

Opt-out

Set `AGENT_OS_INCIDENT_SCANNER_DISABLE_LLM=1` to run deterministic-only (no LLM calls).

…sues Closes the manual-diagnosis → manual-fix loop that took hours of operator attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt template prose as the blocker_code value and the escalation kept firing. New `orchestrator/incident_scanner.py` runs on a fast cadence (suggested every 6h) and: 1. Ingests three signal streams from the last N hours (default 24): - `runtime/incidents/incidents.jsonl` (sev-classified alerts) - `runtime/mailbox/escalated/*.md` (blocked-task escalation notes) - `runtime/audit/audit.jsonl` filtered to anomaly event types (`pr_e2e_terminal_close`, `work_verifier_override`, `stuck_pr_merge`). 2. Aggregates signals into stable signatures with example contexts. 3. Runs deterministic rule matchers first so known-bad patterns don't need an LLM. Two rules land with this PR: - `template_echo`: detects `.agent_result.md` template placeholder text (\"One line. Required when STATUS...\", \"- bullet\", ...) flowing into error_patterns. - `repeated_terminal_close`: detects the same blocker signature hitting pr_monitor's e2e terminal close ≥3 times, meaning re-spawn isn't solving the class and a code fix is needed. 4. Falls back to a single LLM call (claude-sonnet-4-6) for remaining recurring signatures the rules didn't classify. The LLM's output is constrained to structured issue proposals. 5. Dedupes against its own recent-action log AND against existing open agent-os issues with the same title, then creates GitHub issues tagged `ready` / `prio:high` / `bot-generated` / `autonomous-fix`. The dispatcher/groomer pipeline picks them up like any other work — closing the loop. Scanner never edits code, merges PRs, or changes branches. It only files issues; everything downstream is the existing pipeline's job. Anchor test (`test_template_echo_incident_would_have_been_auto_detected`) replays the actual escalation that required manual intervention today and asserts the scanner would have filed the fix issue autonomously. Suggested crontab entry (every 6h at :15): 15 */6 * * * /path/to/agent-os/bin/run_incident_scanner.sh \ >> /path/to/agent-os/runtime/logs/incident_scanner.log 2>&1

kai-linux merged commit 7cf6bee into main Apr 24, 2026
2 checks passed

kai-linux mentioned this pull request Apr 24, 2026

Refresh README: e2e PR health, incident scanner, new init flow, retired agents #345

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incident scanner: autonomous loop from runtime signals to self-fix issues#344

Incident scanner: autonomous loop from runtime signals to self-fix issues#344
kai-linux merged 1 commit intomainfrom
feat/incident-scanner-autonomous-fix

kai-linux commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kai-linux commented Apr 24, 2026

Summary

What it does

Anchor test

Live dry-run validation

Install

Test plan

Opt-out

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant