Auto-resolve stale incidents before digest delivery #350
Merged
Five sev3 incidents landed in today's 09:00 Telegram digest; four were
about PRs that had merged the previous afternoon:
- pr_monitor stuck_pr_merge on agent-os#330 (created 15:55, merged 16:00)
- work_verifier blocks on liminalconsultants#53/54/55 (created
18:15-25, merged 18:36)
`incident_router.flush_pending` skipped delivery only when an incident
already had `resolved_at`, `deduped_to`, or `notified_at` set. It never
re-checked the underlying PR state before shipping the digest, so any
incident raised before the next digest hour was delivered regardless of
whether its cause had since cleared. Historical count: 221
stuck_pr_merge + 48 work_verifier_block incidents, many of which were
stale at delivery time.
New `_is_incident_stale` queries GitHub via `gh pr view` for PR-typed
incidents (source in {pr_monitor, work_verifier} or event.type mentions
PR). Marks resolved when:
- PR is MERGED or CLOSED, or
- stuck_pr_merge alert but PR is now CLEAN (condition cleared).
Fails open (returns stale=False) on any gh error so a network blip
never suppresses a real alert.
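The probe described above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: the incident fields `repo`, `pr_number`, and `event_type`, the helper names, and the reading of "mentions PR" as a `pr` token in the event type are all hypothetical. Only the `gh pr view --json state,mergeStateStatus` call and the MERGED / CLOSED / CLEAN values come from the GitHub CLI itself.

```python
import json
import re
import subprocess
from typing import Callable

PR_SOURCES = {"pr_monitor", "work_verifier"}


def is_pr_typed(incident: dict) -> bool:
    """PR-typed: known PR source, or the event type contains a 'pr' token."""
    if incident.get("source") in PR_SOURCES:
        return True
    # Assumption: "mentions PR" means 'pr' appears as its own token,
    # e.g. "stuck_pr_merge" -> ["stuck", "pr", "merge"].
    tokens = re.split(r"[^a-z0-9]+", str(incident.get("event_type", "")).lower())
    return "pr" in tokens


def fetch_pr_state(repo: str, number: int, timeout: float = 10.0) -> dict:
    """Ask the gh CLI for the PR's current state and merge status."""
    result = subprocess.run(
        ["gh", "pr", "view", str(number), "--repo", repo,
         "--json", "state,mergeStateStatus"],
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return json.loads(result.stdout)


def is_incident_stale(
    incident: dict,
    fetch: Callable[[str, int], dict] = fetch_pr_state,
) -> bool:
    """Return True only when the PR state shows the alert no longer applies."""
    if not is_pr_typed(incident):
        return False
    try:
        pr = fetch(incident["repo"], incident["pr_number"])
    except Exception:
        # Fail open: a missing field, gh failure, timeout, or bad JSON
        # must never suppress a real alert.
        return False
    if pr.get("state") in {"MERGED", "CLOSED"}:
        return True
    # A stuck-merge alert is moot once GitHub reports the PR as CLEAN.
    return (
        incident.get("event_type") == "stuck_pr_merge"
        and pr.get("mergeStateStatus") == "CLEAN"
    )
```

Passing `fetch` as a parameter keeps the decision logic testable without network access; a test can supply a stub that returns a fixed state or raises.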
Also backfilled the existing `runtime/incidents/incidents.jsonl`:
swept 16 pending PR-typed incidents, auto-resolved 13 that were for
already-merged PRs.
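A one-off backfill of this kind might look like the sketch below. It is an assumption-laden illustration rather than the script that was actually run: the function names are hypothetical, the staleness predicate is passed in by the caller, and the only field names taken from this PR are `resolved_at`, `deduped_to`, and `notified_at`.

```python
import json
import re
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Optional, Tuple

PR_SOURCES = {"pr_monitor", "work_verifier"}
SKIP_FIELDS = ("resolved_at", "deduped_to", "notified_at")


def is_pending(incident: dict) -> bool:
    """Pending means none of the skip fields has been set yet."""
    return not any(incident.get(field) for field in SKIP_FIELDS)


def is_pr_typed(incident: dict) -> bool:
    """Assumed rule: known PR source, or a 'pr' token in the event type."""
    tokens = re.split(r"[^a-z0-9]+", str(incident.get("event_type", "")).lower())
    return incident.get("source") in PR_SOURCES or "pr" in tokens


def backfill(
    lines: Iterable[str],
    is_stale: Callable[[dict], bool],
    now: Optional[datetime] = None,
) -> Tuple[List[str], int, int]:
    """Return (rewritten JSONL lines, swept count, auto-resolved count)."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    out: List[str] = []
    swept = resolved = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        incident = json.loads(line)
        if is_pending(incident) and is_pr_typed(incident):
            swept += 1
            if is_stale(incident):
                incident["resolved_at"] = stamp
                resolved += 1
        out.append(json.dumps(incident))
    return out, swept, resolved
```

Returning the rewritten lines instead of editing the file in place lets the caller write to a temporary file and rename it over `incidents.jsonl`, so an interrupted sweep cannot leave the file half-written.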
Summary
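Today's 09:00 Telegram digest delivered five sev3 incidents. Four were stale by 13–14 hours:
- pr_monitor stuck_pr_merge on agent-os#330 (created 15:55, merged 16:00)
- work_verifier blocks on liminalconsultants#53/54/55 (created 18:15-25, merged 18:36)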
Root cause
`incident_router.flush_pending` skipped delivery only when an incident already had `resolved_at`, `deduped_to`, or `notified_at` set. It never re-checked the underlying condition before shipping the digest, so any incident raised before the next digest hour was delivered regardless of whether its cause had since cleared.
Historical volume: 221 stuck_pr_merge + 48 work_verifier_block incidents in the file.
Fix
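New `_is_incident_stale` probe runs right before delivery on PR-typed incidents. It queries `gh pr view` for the PR's current state and marks the incident resolved when:
- the PR is MERGED or CLOSED, or
- the alert is stuck_pr_merge but the PR is now CLEAN (condition cleared).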
Fails open on gh errors so network blips never swallow a real alert.
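To show where such a probe could sit in the delivery path, here is a minimal sketch of a pre-delivery gate. The function name `select_for_digest` and its signature are hypothetical; the real `flush_pending` may be structured differently. The staleness predicate is injected and is expected to fail open on its own.

```python
from typing import Callable, Iterable, List, Tuple

SKIP_FIELDS = ("resolved_at", "deduped_to", "notified_at")


def select_for_digest(
    incidents: Iterable[dict],
    is_stale: Callable[[dict], bool],
    now_iso: str,
) -> Tuple[List[dict], List[dict]]:
    """Split pending incidents into (to deliver, auto-resolved as stale)."""
    deliver: List[dict] = []
    auto_resolved: List[dict] = []
    for incident in incidents:
        # Pre-existing rule: already resolved, deduped, or notified.
        if any(incident.get(field) for field in SKIP_FIELDS):
            continue
        # New rule: re-check the underlying condition just before sending.
        if is_stale(incident):
            incident["resolved_at"] = now_iso
            auto_resolved.append(incident)
            continue
        deliver.append(incident)
    return deliver, auto_resolved
```

Because the predicate returns False when it cannot reach GitHub, an incident whose state is unknown still lands in the `deliver` list.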
Backfill
Swept the existing `runtime/incidents/incidents.jsonl`: 16 pending PR-typed incidents, 13 auto-resolved (stale). 3 remain pending (real current blocks).
PR #349 handled separately
Inspected the flagged paths (`bin/run_library_scout.sh`, `example.config.yaml`) — both benign. Documented findings on issue #348, closed the issue + PR.
Test plan