Auto-resolve stale incidents before digest delivery#350

Merged: kai-linux merged 1 commit into main from fix/incident-router-skip-resolved-prs on Apr 24, 2026
@kai-linux
Owner

Summary

Today's 09:00 Telegram digest delivered five sev3 incidents. Four were already stale: per the timestamps below, their PRs had merged roughly 14–17 hours before the digest went out:

| Incident | Created | Underlying state |
| --- | --- | --- |
| agent-os#330 stuck_pr_merge | 2026-04-23 15:55 | merged at 16:00 (5 min later) |
| liminalconsultants#53 work_verifier_block | 2026-04-23 18:15 | merged at 18:36 |
| liminalconsultants#54 work_verifier_block | 2026-04-23 18:25 | merged at 18:36 |
| liminalconsultants#55 work_verifier_block | 2026-04-23 18:25 | merged at 18:36 |
| agent-os#349 work_verifier_block | 2026-04-24 08:50 | OPEN (real, handled separately) |

Root cause

`incident_router.flush_pending` only skipped delivery when an incident was already marked with one of `resolved_at`, `deduped_to`, or `notified_at`. It never re-checked the underlying condition before shipping the digest, so any incident raised before the next digest hour fired regardless of whether its cause had since cleared.

Historical volume: 221 stuck_pr_merge + 48 work_verifier_block incidents in the file.
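
To make the gap concrete, here is a minimal sketch of a delivery filter of the kind described above. It is not the actual `flush_pending` code: the dict-based incident shape, the `select_for_digest` name, and the `is_stale` hook are assumptions for illustration. Only the three marker names come from this PR.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Marker names as listed in this PR; treating them as dict keys is an assumption.
HANDLED_MARKERS = ("resolved_at", "deduped_to", "notified_at")


def is_already_handled(incident: dict) -> bool:
    """Old skip rule: true only if one of the markers is already set."""
    return any(incident.get(key) for key in HANDLED_MARKERS)


def select_for_digest(
    incidents: Iterable[dict],
    is_stale: Callable[[dict], bool] = lambda incident: False,
) -> list[dict]:
    """Return the incidents that would be delivered (hypothetical helper).

    With the default is_stale nothing consults live PR state, so an
    incident whose PR merged hours ago is still delivered. Passing a real
    probe adds the missing re-check.
    """
    return [
        incident
        for incident in incidents
        if not is_already_handled(incident) and not is_stale(incident)
    ]
```

With the default hook, an unmarked incident always ships; supplying a probe is the only thing that lets current PR state influence delivery.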

Fix

A new `_is_incident_stale` probe runs right before delivery on PR-typed incidents. It queries `gh pr view` for the PR's current state and marks the incident resolved when:

  • PR is MERGED or CLOSED, or
  • A `stuck_pr_merge` alert's PR is now CLEAN (condition cleared)

Fails open on gh errors so network blips never swallow a real alert.
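
The rules above could be sketched roughly as follows. This is an illustration, not the merged `_is_incident_stale`: the function names, arguments, and timeout are assumptions, and reading CLEAN from the `mergeStateStatus` JSON field of `gh pr view` is a guess at how the real probe gets that value.

```python
import json
import subprocess
from typing import Optional


def fetch_pr_state(repo: str, number: int, timeout: float = 15.0) -> Optional[dict]:
    """Ask gh for the PR's state; return None on any failure (hypothetical helper)."""
    cmd = [
        "gh", "pr", "view", str(number),
        "--repo", repo,
        "--json", "state,mergeStateStatus",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
        data = json.loads(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # gh missing, non-zero exit, timeout, or unparsable output.
        return None
    return data if isinstance(data, dict) else None


def looks_stale(incident_type: str, pr: Optional[dict]) -> bool:
    """Apply the two staleness rules; an unknown PR state is never stale."""
    if pr is None:
        return False  # fail open: keep the alert when gh could not answer
    if pr.get("state") in ("MERGED", "CLOSED"):
        return True
    # Only a stuck-merge alert is cleared by the PR becoming mergeable again.
    return incident_type == "stuck_pr_merge" and pr.get("mergeStateStatus") == "CLEAN"
```

Keeping the `gh` call separate from the decision makes the fail-open path explicit: every error collapses to `None`, and `None` always means "still deliver".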

Backfill

Swept the existing `runtime/incidents/incidents.jsonl`: 16 pending PR-typed incidents, 13 auto-resolved (stale). 3 remain pending (real current blocks).
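
A sweep like this one might look roughly like the sketch below. The record schema and the `sweep` helper are assumptions, not the script that was actually run; the staleness check is passed in so the pass can be exercised without calling `gh`.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from typing import Callable, Iterable

# Marker names as listed in this PR; treating them as JSON keys is an assumption.
HANDLED_MARKERS = ("resolved_at", "deduped_to", "notified_at")


def sweep(
    lines: Iterable[str],
    is_stale: Callable[[dict], bool],
) -> tuple[list[str], int]:
    """Mark pending stale records resolved; return (new lines, resolved count)."""
    now = datetime.now(timezone.utc).isoformat()
    out: list[str] = []
    resolved = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the JSONL file
        record = json.loads(line)
        pending = not any(record.get(key) for key in HANDLED_MARKERS)
        if pending and is_stale(record):
            record["resolved_at"] = now
            resolved += 1
        out.append(json.dumps(record))
    return out, resolved
```

Records that already carry a marker pass through unchanged, so running the sweep twice does not re-resolve anything.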

PR #349 handled separately

Inspected the flagged paths (`bin/run_library_scout.sh`, `example.config.yaml`) — both benign. Documented findings on issue #348, closed the issue + PR.

Test plan

  • `pytest tests/test_incident_router_skip_resolved.py tests/test_incident_router.py` — 13 passed (6 new)
  • Broad suite: 614 passed
  • Live backfill resolved 13 historical stale incidents; next digest flush should show a clean queue

Five sev3 incidents landed in today's 09:00 Telegram digest; four were
about PRs that had merged the previous afternoon:
- pr_monitor stuck_pr_merge on agent-os#330 (created 15:55, merged 16:00)
- work_verifier blocks on liminalconsultants#53/54/55 (created
  18:15-25, merged 18:36)

`incident_router.flush_pending` only skipped delivery when an incident
was already marked `resolved_at / deduped_to / notified_at`. It never
re-checked the underlying PR state before shipping the digest, so any
incident raised before the next digest hour fired regardless of
whether its cause had since cleared. Historical count: 221
stuck_pr_merge + 48 work_verifier_block incidents, many of which were
stale at delivery time.

New `_is_incident_stale` queries GitHub via `gh pr view` for PR-typed
incidents (source in {pr_monitor, work_verifier} or event.type mentions
PR). Marks resolved when:
- PR is MERGED or CLOSED, or
- stuck_pr_merge alert but PR is now CLEAN (condition cleared).

Fails open (returns stale=False) on any gh error so a network blip
never suppresses a real alert.

Also backfilled the existing `runtime/incidents/incidents.jsonl`:
swept 16 pending PR-typed incidents, auto-resolved 13 that were for
already-merged PRs.
@kai-linux kai-linux merged commit 37895c8 into main Apr 24, 2026
2 checks passed