Skip to content

Incident scanner: autonomous loop from runtime signals to self-fix issues#344

Merged
kai-linux merged 1 commit intomainfrom
feat/incident-scanner-autonomous-fix
Apr 24, 2026
Merged

Incident scanner: autonomous loop from runtime signals to self-fix issues#344
kai-linux merged 1 commit intomainfrom
feat/incident-scanner-autonomous-fix

Conversation

@kai-linux
Copy link
Copy Markdown
Owner

Summary

Closes the manual-diagnosis → manual-fix loop that took hours of operator attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt template prose as the blocker_code value and the escalation kept firing.

New `orchestrator/incident_scanner.py` runs on a fast cadence (every 6h) and turns recurring runtime signals into self-fix GitHub issues that the dispatcher/groomer pipeline picks up like any other work — closing the loop between "something went wrong at runtime" and "agent fixes the class of bug."

What it does

  1. Ingests 3 signal streams for the last N hours (default 24):

    • `runtime/incidents/incidents.jsonl` (sev-classified alerts)
    • `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
    • `runtime/audit/audit.jsonl` filtered to anomaly event types (`pr_e2e_terminal_close`, `work_verifier_override`, `stuck_pr_merge`)
  2. Aggregates signals into stable signatures with example contexts.

  3. Deterministic rule matchers (no LLM) catch known-bad patterns. Two rules ship with this PR:

    • `template_echo`: detects `.agent_result.md` prose placeholders ("One line. Required when STATUS...", "- bullet", ...) flowing into error_patterns.
    • `repeated_terminal_close`: detects the same blocker signature hitting pr_monitor's e2e terminal close ≥3 times.
  4. LLM fallback (one `claude-sonnet-4-6` call per scan) for recurring signatures the rules didn't classify. Output is constrained to structured issue proposals.

  5. Dedup against recent scanner-action log AND existing open agent-os issues, then creates issues tagged `ready` / `prio:high` / `bot-generated` / `autonomous-fix`.

The scanner never edits code, merges PRs, or changes branches. It only files issues; the existing pipeline handles the rest.

Anchor test

`test_template_echo_incident_would_have_been_auto_detected` replays the actual escalation that required manual intervention today and asserts the scanner would have filed the fix issue autonomously — same diagnosis as the manual work, no operator in the loop.

Live dry-run validation

Ran against the real `runtime/` signals (48h window) on my workstation. Output:
```
Incident scanner: window=48h, target=kai-linux/agent-os
Aggregated 28 signal(s) into 27 signature(s).
Would file: [auto-fix] Agent echoed .agent_result.md template as blocker/summary
```
Exactly the issue I'd have filed by hand.

Install

Add to crontab after merge:
```
15 */6 * * * /home/kai/agent-os/bin/run_incident_scanner.sh >> /home/kai/agent-os/runtime/logs/incident_scanner.log 2>&1
```

Test plan

  • `pytest tests/test_incident_scanner.py` — 10 passed
  • `pytest tests/` minus slow queue suite — 594 passed
  • Dry-run against live runtime/ — produced expected output

Opt-out

Set `AGENT_OS_INCIDENT_SCANNER_DISABLE_LLM=1` to run deterministic-only (no LLM calls).

…sues

Closes the manual-diagnosis → manual-fix loop that took hours of operator
attention on 2026-04-23 when agents echoed the `.agent_result.md` prompt
template prose as the blocker_code value and the escalation kept firing.

New `orchestrator/incident_scanner.py` runs on a fast cadence
(suggested every 6h) and:

1. Ingests three signal streams from the last N hours (default 24):
   - `runtime/incidents/incidents.jsonl` (sev-classified alerts)
   - `runtime/mailbox/escalated/*.md` (blocked-task escalation notes)
   - `runtime/audit/audit.jsonl` filtered to anomaly event types
     (`pr_e2e_terminal_close`, `work_verifier_override`,
     `stuck_pr_merge`).

2. Aggregates signals into stable signatures with example contexts.

3. Runs deterministic rule matchers first so known-bad patterns
   don't need an LLM. Two rules land with this PR:
   - `template_echo`: detects `.agent_result.md` template placeholder
     text (\"One line. Required when STATUS...\", \"- bullet\", ...)
     flowing into error_patterns.
   - `repeated_terminal_close`: detects the same blocker signature
     hitting pr_monitor's e2e terminal close ≥3 times, meaning
     re-spawn isn't solving the class and a code fix is needed.

4. Falls back to a single LLM call (claude-sonnet-4-6) for remaining
   recurring signatures the rules didn't classify. The LLM's output
   is constrained to structured issue proposals.

5. Dedupes against its own recent-action log AND against existing
   open agent-os issues with the same title, then creates GitHub
   issues tagged `ready` / `prio:high` / `bot-generated` /
   `autonomous-fix`. The dispatcher/groomer pipeline picks them up
   like any other work — closing the loop.

Scanner never edits code, merges PRs, or changes branches. It only
files issues; everything downstream is the existing pipeline's job.

Anchor test (`test_template_echo_incident_would_have_been_auto_detected`)
replays the actual escalation that required manual intervention today
and asserts the scanner would have filed the fix issue autonomously.

Suggested crontab entry (every 6h at :15):
    15 */6 * * * /path/to/agent-os/bin/run_incident_scanner.sh \
      >> /path/to/agent-os/runtime/logs/incident_scanner.log 2>&1
@kai-linux kai-linux merged commit 7cf6bee into main Apr 24, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant