fix: detect stale runs with expired heartbeats #192

Open
Aryan95614 wants to merge 1 commit into Netflix:master from Aryan95614:fix/stale-run-detection

@Aryan95614

Flows that crash (OOM, kill signal, orchestrator timeout) leave the UI showing "Running" forever because the backend may not send a status update. This has been open since 2021 (#15).

What changed

Added client-side stale detection using last_heartbeat_ts from the metadata service API response. When a run is still marked as running but its last heartbeat is more than 5 minutes old:

  • Status indicator switches from "running" (blue) to "failed" (red)
  • Duration counter freezes at the last known heartbeat instead of counting up forever
  • Both the home page runs list and the individual run detail page are updated
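The frozen duration described in the list above could be sketched roughly as follows. This is an illustration, not the PR's actual getRunDuration(): the name getRunDurationMs, the RunTimes shape, the `now` parameter, and the assumption that all timestamps (including last_heartbeat_ts) are epoch milliseconds are hypothetical and should be checked against the real Run type and metadata service.

```typescript
// Hypothetical, trimmed-down run shape; the real Run interface has more fields.
// Assumption: every timestamp here is in epoch milliseconds.
interface RunTimes {
  status: string;
  ts_epoch: number; // run start time
  finished_at?: number | null;
  last_heartbeat_ts?: number | null;
}

// 5-minute threshold taken from the PR description.
const HEARTBEAT_TIMEOUT_MS = 5 * 60 * 1000;

// Returns the duration to display, in milliseconds.
// `now` is injectable so the function is easy to test.
function getRunDurationMs(run: RunTimes, now: number = Date.now()): number {
  // Finished runs have a fixed duration.
  if (run.finished_at != null) {
    return Math.max(0, run.finished_at - run.ts_epoch);
  }
  const hb = run.last_heartbeat_ts;
  // Stale run: freeze the counter at the last known heartbeat.
  if (run.status === 'running' && hb != null && now - hb > HEARTBEAT_TIMEOUT_MS) {
    return Math.max(0, hb - run.ts_epoch);
  }
  // Healthy running run: keep counting up.
  return Math.max(0, now - run.ts_epoch);
}
```

With this shape, a component can also skip its periodic re-render for stale runs, since the returned value no longer depends on the current time.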

How it works

isRunStale() in src/utils/run.ts checks if status === 'running' AND last_heartbeat_ts is older than 5 minutes. getRunDisplayStatus() returns 'failed' for stale runs so the existing status components render correctly without new status values.

The last_heartbeat_ts field was already sent by the metadata service but not included in the UI's Run type definition.
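A minimal sketch of the two helpers described above, for readers who want the logic at a glance. The function names come from this PR, but the bodies, the reduced Run shape, the optional `now` parameter, and the assumption that last_heartbeat_ts is in epoch milliseconds are illustrative guesses rather than the code in src/utils/run.ts.

```typescript
// Hypothetical subset of the status values and Run fields used by the UI.
type RunStatus = 'running' | 'completed' | 'failed';

interface Run {
  status: RunStatus;
  // Assumption: epoch milliseconds; verify the unit against the metadata service.
  last_heartbeat_ts?: number | null;
}

// 5-minute threshold taken from the PR description.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;

// A run is stale only if it is still marked as running AND
// its last heartbeat is older than the threshold.
function isRunStale(run: Run, now: number = Date.now()): boolean {
  if (run.status !== 'running') return false;
  // No heartbeat recorded: do not guess, to avoid false positives.
  if (run.last_heartbeat_ts == null) return false;
  return now - run.last_heartbeat_ts > STALE_THRESHOLD_MS;
}

// Maps stale runs onto the existing 'failed' status so the
// status components need no new status value.
function getRunDisplayStatus(run: Run, now: number = Date.now()): RunStatus {
  return isRunStale(run, now) ? 'failed' : run.status;
}
```

Returning false when no heartbeat is present matches the "no false positives" cases listed under Tests: a run is only reclassified when there is positive evidence that its heartbeat has expired.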

Files changed

  • src/types.ts -- add last_heartbeat_ts to Run interface
  • src/utils/run.ts -- add isRunStale(), getRunDisplayStatus(), update getRunDuration()
  • src/pages/Home/ResultGroup/ResultGroupCells.tsx -- use display status, stop auto-updating the duration for stale runs
  • src/pages/Run/RunHeader.tsx -- same changes for run detail page
  • src/utils/__tests__/run.test.cypress.ts -- tests for stale detection

Tests

7 new test cases covering: completed, failed, and running runs without a heartbeat (no false positives), a recent heartbeat (not stale), an expired heartbeat (stale), and display status mapping.

Fixes #15.

Runs that crash (OOM, kill signal, orchestrator timeout) leave the UI
showing "Running" forever because the backend may not send a status
update event.

This adds client-side stale detection using last_heartbeat_ts from the
metadata service. When a running run's heartbeat exceeds 5 minutes,
the UI displays it as "failed" and freezes the duration counter at the
last known heartbeat.

Changes:
- Add last_heartbeat_ts to Run interface (already sent by backend)
- Add isRunStale() and getRunDisplayStatus() utilities
- Stop auto-updating duration for stale runs
- Show failed status indicator for stale runs on home page and
  run detail page
- Add tests for staleness detection

Fixes Netflix#15.
Development

Successfully merging this pull request may close these issues.

Crashed Flow appear to run forever.