fix: detect stale runs with expired heartbeats #192
Open
Aryan95614 wants to merge 1 commit into Netflix:master from
Runs that crash (OOM, kill signal, orchestrator timeout) leave the UI showing "Running" forever because the backend may not send a status update event. This adds client-side stale detection using last_heartbeat_ts from the metadata service. When a running run's heartbeat exceeds 5 minutes, the UI displays it as "failed" and freezes the duration counter at the last known heartbeat.

Changes:
- Add last_heartbeat_ts to Run interface (already sent by backend)
- Add isRunStale() and getRunDisplayStatus() utilities
- Stop auto-updating duration for stale runs
- Show failed status indicator for stale runs on home page and run detail page
- Add tests for staleness detection

Fixes Netflix#15.
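For readers who want to picture the duration freeze, here is a minimal sketch, not the code in this PR. The `RunTiming` shape, the `start_ts` and `finished_at` field names, the `getDisplayDuration` helper, and the use of epoch milliseconds for every timestamp are all assumptions made for illustration; only `last_heartbeat_ts` and the 5-minute threshold come from the description above.

```typescript
// Hypothetical, trimmed-down run shape. Field names other than
// last_heartbeat_ts are assumptions, and all timestamps are assumed
// to be epoch milliseconds.
interface RunTiming {
  status: 'running' | 'completed' | 'failed';
  start_ts: number;
  finished_at?: number | null;
  last_heartbeat_ts?: number | null;
}

// 5-minute staleness threshold taken from the PR description.
const STALE_AFTER_MS = 5 * 60 * 1000;

// Returns the duration to display, in milliseconds, or null when unknown.
function getDisplayDuration(run: RunTiming, now: number = Date.now()): number | null {
  // Finished runs have a fixed duration.
  if (run.finished_at != null) return run.finished_at - run.start_ts;
  // Non-running runs without an end time have no meaningful duration here.
  if (run.status !== 'running') return null;

  const hb = run.last_heartbeat_ts;
  // Stale run: freeze the counter at the last known heartbeat.
  if (hb != null && now - hb > STALE_AFTER_MS) return hb - run.start_ts;
  // Live run (or no heartbeat recorded): keep counting from the start time.
  return now - run.start_ts;
}
```

Passing `now` as a parameter keeps the helper deterministic, so a caller can stop its refresh timer once the returned value no longer depends on the clock.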
Flows that crash (OOM, kill signal, orchestrator timeout) leave the UI showing "Running" forever because the backend may not send a status update. This has been open since 2021 (#15).
What changed
Added client-side stale detection using `last_heartbeat_ts` from the metadata service API response. When a running run's last heartbeat is more than 5 minutes old, the UI displays the run as "failed" and freezes the duration counter at the last known heartbeat.

How it works
`isRunStale()` in `src/utils/run.ts` checks whether `status === 'running'` and `last_heartbeat_ts` is older than 5 minutes. `getRunDisplayStatus()` returns `'failed'` for stale runs, so the existing status components render correctly without new status values.

The `last_heartbeat_ts` field was already sent by the metadata service but was not included in the UI's `Run` type definition.

Files changed
- `src/types.ts` -- add `last_heartbeat_ts` to `Run` interface
- `src/utils/run.ts` -- add `isRunStale()`, `getRunDisplayStatus()`, update `getRunDuration()`
- `src/pages/Home/ResultGroup/ResultGroupCells.tsx` -- use display status, stop auto-update for stale runs
- `src/pages/Run/RunHeader.tsx` -- same changes for run detail page
- `src/utils/__tests__/run.test.cypress.ts` -- tests for stale detection

Tests
7 new test cases covering: completed/failed/running without heartbeat (no false positives), recent heartbeat (not stale), expired heartbeat (stale), and display status mapping.
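The check described under "How it works" could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: the `RunLike` type is a hypothetical stand-in for the real `Run` interface, the optional `now` parameter is added here for testability, and `last_heartbeat_ts` is assumed to be in epoch milliseconds (convert first if the service reports seconds).

```typescript
type RunStatus = 'running' | 'completed' | 'failed';

// Hypothetical minimal shape; the real Run interface has more fields.
interface RunLike {
  status: RunStatus;
  // Assumed to be epoch milliseconds; may be absent if no heartbeat was recorded.
  last_heartbeat_ts?: number | null;
}

// 5-minute threshold from the PR description.
const STALE_THRESHOLD_MS = 5 * 60 * 1000;

// A run is stale only if it claims to be running AND its heartbeat has expired.
function isRunStale(run: RunLike, now: number = Date.now()): boolean {
  if (run.status !== 'running') return false;
  // Without a heartbeat there is nothing to judge; avoid false positives.
  if (run.last_heartbeat_ts == null) return false;
  return now - run.last_heartbeat_ts > STALE_THRESHOLD_MS;
}

// Map stale runs onto the existing 'failed' status so that status
// components need no new value.
function getRunDisplayStatus(run: RunLike, now: number = Date.now()): RunStatus {
  return isRunStale(run, now) ? 'failed' : run.status;
}
```

Under these assumptions, a running run whose heartbeat is six minutes old maps to `'failed'`, while a running run with a one-minute-old heartbeat, or with no heartbeat at all, stays `'running'`.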
Fixes #15.