stage heartbeat is output-gated, causing false stall perception during quiet long-running operations #51

@danshapiro

Problem

stage_heartbeat emission is gated on stdout/stderr growth, so quiet-but-active phases produce sparse or missing heartbeat events.

In run 01KJR25QS6RY52D7VS2ZRAAXMK, merge_implementation remained active while heartbeat cadence showed multi-minute gaps tied to low output growth.

Why this matters

  • Healthy stages can appear stalled in detached monitoring flows.
  • Triage quality drops because “alive but quiet” and “possibly stuck” are not clearly distinguished.
  • progress.ndjson/live.json under-represent liveness during quiet command windows.

Evidence

Run artifact:

  • ~/.local/state/kilroy/attractor/runs/01KJR25QS6RY52D7VS2ZRAAXMK/progress.ndjson

Observed heartbeat sequence for merge_implementation includes gaps while stage remains active, e.g.:

  • 20:36:54Z (elapsed 240) -> 20:38:54Z (elapsed 360) (missing 300)
  • 20:38:54Z (elapsed 360) -> 20:42:54Z (elapsed 600) (4-minute gap; missing 420, 480, 540)
  • later attempt: 20:53:05Z (elapsed 180) -> 20:55:05Z (elapsed 300) (missing 240)

Stage logs show ongoing activity around long-running commands (including npm install) while heartbeat signal remains irregular.

Relevant code:

  • internal/attractor/engine/codergen_router.go:
    • heartbeat loop setup around :1149
    • output-growth gate around :1167
    • appendProgress called only when gate passes around :1171
  • API path has similar growth-gated heartbeat behavior around :317

Steps to reproduce / observe

  1. Run a codergen stage with a long command that has quiet periods.
  2. Monitor progress.ndjson for stage_heartbeat timestamps.
  3. Compare with stage stdout.log/stderr.log evolution.
  4. Observe heartbeat gaps caused by lack of output growth, not process death.

Scope boundaries

This issue is about observability/liveness signaling.

This issue is not:

  • A retry policy change.
  • A command execution semantics change.

Potential directions (non-prescriptive)

  • Emit liveness heartbeat on each interval while process/session is active.
  • Keep byte counters as diagnostics rather than emission gates.
  • Add explicit quiet-state metrics (since_last_output_s, idle_for_s).
  • Separate “liveness” and “output progress” event semantics.

Definition of done

  • Active stages produce predictable liveness signals even when output is quiet.
  • Monitoring can reliably differentiate “alive but quiet” from “missing liveness”.
  • Reproduction no longer shows heartbeat gaps solely due to output gating.
  • Tests cover both output-producing and quiet-active periods.
