[Feature]: Workflow State Persistence (Checkpoint/Resume) #105
Open
Labels
enhancement, help wanted
Description
Enables the TaskRunner to save its execution state and resume from that state later, allowing for recovery from failures or pausing long-running workflows without re-executing completed tasks.
Proposed Solution
Feature: Workflow State Persistence (Checkpoint/Resume)
🎯 User Story
"As a DevOps engineer, I want to save the state of a running workflow and resume it later from where it left off (e.g., after a server crash, deployment, or manual pause), so that I don't have to re-run expensive or side-effect-heavy tasks."
❓ Why
- Cost Efficiency: Re-running tasks like AI model training, large data ingestion, or paid API calls wastes money and resources.
- Safety & Idempotency: Some tasks are not idempotent (e.g., "Charge Credit Card", "Send Email"). If a workflow crashes after these steps but before completion, re-running from scratch is dangerous.
- Resilience: Long-running workflows (minutes to hours) are vulnerable to transient infrastructure failures. Resuming from the last successful step allows recovery without total data loss.
🛠️ What Changes
- State Exposure: `TaskRunner` and `TaskStateManager` need to expose the current execution state (results of completed tasks).
- Hydration: `TaskRunnerBuilder` and `TaskStateManager` need a way to initialize with a pre-existing state (the snapshot).
- Execution Logic: `WorkflowExecutor` needs to respect the hydrated state, skipping tasks that are already marked as `success` in the snapshot while treating them as satisfied dependencies for downstream tasks.
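
To make the first two bullets concrete, here is a minimal sketch of what state exposure and hydration could look like. `SketchStateManager`, `getSnapshot`, `record`, `isSatisfied`, and the fields of `TaskResult` are hypothetical names chosen for illustration; they are not the library's existing API.

```typescript
// Hypothetical result shape; the issue only fixes the snapshot type
// as Record<string, TaskResult>.
type TaskStatus = "success" | "failure" | "cancelled" | "skipped";

interface TaskResult {
  status: TaskStatus;
  data?: unknown;
}

type StateSnapshot = Record<string, TaskResult>;

// Hypothetical stand-in for TaskStateManager.
class SketchStateManager {
  private results = new Map<string, TaskResult>();

  // Hydration: start from a previously saved snapshot (empty by default).
  constructor(initial: StateSnapshot = {}) {
    for (const name of Object.keys(initial)) {
      this.results.set(name, { ...initial[name] });
    }
  }

  // Called by the executor whenever a task finishes.
  record(name: string, result: TaskResult): void {
    this.results.set(name, result);
  }

  // A task counts as a satisfied dependency only if it succeeded.
  isSatisfied(name: string): boolean {
    return this.results.get(name)?.status === "success";
  }

  // State exposure: return a plain object holding shallow copies of the
  // results, so callers cannot replace the manager's entries by accident.
  getSnapshot(): StateSnapshot {
    const snapshot: StateSnapshot = {};
    this.results.forEach((result, name) => {
      snapshot[name] = { ...result };
    });
    return snapshot;
  }
}
```

Returning copies rather than the internal map keeps the snapshot a point-in-time value, which seems to be what a checkpoint needs. A real implementation might prefer a deep copy if `data` can be mutated after a task completes.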
✅ Acceptance Criteria
- `TaskRunner` (or `TaskStateManager`) must expose a method to get a serializable snapshot of the current state (results).
- `TaskRunnerBuilder` must accept a snapshot to initialize the runner.
- When `execute` is called with a hydrated state:
  - Tasks marked as `success` in the snapshot MUST NOT run again.
  - Tasks marked as `success` in the snapshot MUST be treated as completed dependencies for pending tasks.
  - Tasks marked as `failure`, `cancelled`, or `skipped` in the snapshot SHOULD be re-evaluated (run again).
- Context (`TContext`) changes made by tasks in the previous run must be manually restored by the user (since context can contain non-serializable objects), OR the snapshot must include a mechanism to warn/handle context.
  - Decision: For MVP, the user is responsible for providing the initial `context` to the `TaskRunnerBuilder`. The state snapshot only tracks task status/results. If the context needs to be in a certain state for step N+1, the user must provide that context.
  - Refinement: The snapshot should strictly contain `Record<string, TaskResult>`.
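
The resume semantics in the criteria above can be sketched as a small standalone executor. This is an illustration under stated assumptions, not the actual `WorkflowExecutor`: the `Task` shape, the `executeWithSnapshot` function, and the `TaskResult` fields are hypothetical, and tasks are assumed to arrive already sorted in dependency order.

```typescript
// Hypothetical shapes for illustration only.
type TaskStatus = "success" | "failure" | "cancelled" | "skipped";

interface TaskResult {
  status: TaskStatus;
  data?: unknown;
  error?: string;
}

type Snapshot = Record<string, TaskResult>;

interface Task<TContext> {
  name: string;
  dependencies: string[];
  run: (context: TContext) => Promise<unknown>;
}

// Assumes `tasks` is already in dependency (topological) order.
async function executeWithSnapshot<TContext>(
  tasks: Task<TContext>[],
  context: TContext,
  snapshot: Snapshot = {},
): Promise<Snapshot> {
  // Carry over only successful results; failure, cancelled, and skipped
  // entries are dropped so those tasks are re-evaluated.
  const results: Snapshot = {};
  for (const name of Object.keys(snapshot)) {
    if (snapshot[name].status === "success") {
      results[name] = snapshot[name];
    }
  }

  for (const task of tasks) {
    // Already succeeded in the previous run: MUST NOT run again.
    if (results[task.name]?.status === "success") continue;

    // Hydrated successes count as completed dependencies.
    const depsSatisfied = task.dependencies.every(
      (dep) => results[dep]?.status === "success",
    );
    if (!depsSatisfied) {
      results[task.name] = { status: "skipped" };
      continue;
    }

    try {
      const data = await task.run(context);
      results[task.name] = { status: "success", data };
    } catch (error) {
      results[task.name] = { status: "failure", error: String(error) };
    }
  }
  return results;
}
```

With a snapshot in which `a` succeeded and `b` failed, a chain `a -> b -> c` would run only `b` and `c`, and the returned record would keep the original result for `a`.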
⚠️ Constraints
- The `TContext` object is often non-serializable (contains functions, sockets, etc.). Therefore, this feature only persists the execution graph state (which tasks finished). The user is responsible for re-hydrating the `context` to a state suitable for resumption if necessary.
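
As a rough illustration of this constraint, the sketch below round-trips a snapshot through JSON and rebuilds the context separately. `serializeSnapshot`, `parseSnapshot`, `AppContext`, and `rebuildContext` are hypothetical helpers, and the `TaskResult` fields are assumed; the only shape the issue commits to is `Record<string, TaskResult>`.

```typescript
// Hypothetical shapes for illustration only.
type TaskStatus = "success" | "failure" | "cancelled" | "skipped";

interface TaskResult {
  status: TaskStatus;
  data?: unknown;
  error?: string;
}

type Snapshot = Record<string, TaskResult>;

const VALID_STATUSES: readonly string[] = [
  "success",
  "failure",
  "cancelled",
  "skipped",
];

// Works only if every `data` value is itself JSON-serializable.
function serializeSnapshot(snapshot: Snapshot): string {
  return JSON.stringify(snapshot);
}

// Validate on the way back in: the stored file may be stale or hand-edited.
function parseSnapshot(json: string): Snapshot {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Snapshot must be a JSON object keyed by task name");
  }
  const entries = parsed as Record<string, unknown>;
  const snapshot: Snapshot = {};
  for (const name of Object.keys(entries)) {
    const entry = entries[name];
    const status =
      typeof entry === "object" && entry !== null
        ? (entry as { status?: unknown }).status
        : undefined;
    if (typeof status !== "string" || VALID_STATUSES.indexOf(status) === -1) {
      throw new Error(`Invalid result for task "${name}"`);
    }
    snapshot[name] = entry as TaskResult;
  }
  return snapshot;
}

// Hypothetical context holding a function, which JSON cannot represent.
interface AppContext {
  log: (message: string) => void;
}

// The user rebuilds the context on resume; it is never read from the snapshot.
function rebuildContext(): AppContext {
  return { log: (message) => console.log(message) };
}
```

`JSON.stringify` silently omits function-valued properties, so an `AppContext` written this way would come back as an empty object. That is one reason to keep the context out of the snapshot and have the caller supply it again on resume.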
Alternatives Considered
No response