Open
Description
Problem
stage_heartbeat emission is gated on stdout/stderr growth, so quiet-but-active phases produce sparse or missing heartbeat events.
In run 01KJR25QS6RY52D7VS2ZRAAXMK, merge_implementation remained active while heartbeat cadence showed multi-minute gaps tied to low output growth.
Why this matters
- Healthy stages can appear stalled in detached monitoring flows.
- Triage quality drops because “alive but quiet” and “possibly stuck” are not clearly distinguished.
- progress.ndjson / live.json under-represent liveness during quiet command windows.
Evidence
Run artifact:
~/.local/state/kilroy/attractor/runs/01KJR25QS6RY52D7VS2ZRAAXMK/progress.ndjson
Observed heartbeat sequence for merge_implementation includes gaps while stage remains active, e.g.:
- 20:36:54Z (elapsed 240) -> 20:38:54Z (elapsed 360) (missing 300)
- 20:38:54Z (elapsed 360) -> 20:42:54Z (elapsed 600) (4-minute gap)
- later attempt: 20:53:05Z (elapsed 180) -> 20:55:05Z (elapsed 300) (missing 240)
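The gaps above can be computed mechanically from the elapsed values. A minimal sketch, assuming a 60-second heartbeat interval (inferred from the "missing 300" and "missing 240" annotations, not confirmed against the engine config); missingTicks is a hypothetical helper, not an existing function in the repo:

```go
package main

import "fmt"

// missingTicks returns the elapsed values (seconds) at which a heartbeat was
// expected but not observed. The interval is an assumption; the observed
// values are the elapsed fields taken from consecutive stage_heartbeat events.
func missingTicks(observed []int, interval int) []int {
	if interval <= 0 {
		return nil // avoid looping forever on a bad interval
	}
	var missing []int
	for i := 1; i < len(observed); i++ {
		for t := observed[i-1] + interval; t < observed[i]; t += interval {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// First merge_implementation attempt from the evidence above.
	fmt.Println(missingTicks([]int{240, 360, 600}, 60)) // prints [300 420 480 540]
	// Later attempt.
	fmt.Println(missingTicks([]int{180, 300}, 60)) // prints [240]
}
```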
Stage logs show ongoing activity around long-running commands (including npm install) while heartbeat signal remains irregular.
Relevant code:
- internal/attractor/engine/codergen_router.go:
  - heartbeat loop setup around :1149
  - output-growth gate around :1167
  - appendProgress called only when gate passes around :1171
- API path has similar growth-gated heartbeat behavior around :317
Steps to reproduce / observe
- Run a codergen stage with a long command that has quiet periods.
- Monitor progress.ndjson for stage_heartbeat timestamps.
- Compare with stage stdout.log / stderr.log evolution.
- Observe heartbeat gaps caused by lack of output growth, not process death.
Scope boundaries
This issue is about observability/liveness signaling.
This issue is not:
- A retry policy change.
- A command execution semantics change.
Potential directions (non-prescriptive)
- Emit liveness heartbeat on each interval while process/session is active.
- Keep byte counters as diagnostics rather than emission gates.
- Add explicit quiet-state metrics (since_last_output_s, idle_for_s).
- Separate “liveness” and “output progress” event semantics.
Definition of done
- Active stages produce predictable liveness signals even when output is quiet.
- Monitoring can reliably differentiate “alive but quiet” from missing liveness.
- Reproduction no longer shows heartbeat gaps solely due to output gating.
- Tests cover both output-producing and quiet-active periods.