refactor(workflow): cleanup pass on the wfaas crate by slin1237 · Pull Request #1500 · lightseekorg/smg

slin1237 · 2026-05-15T17:34:40Z

Summary

Three-agent review of crates/workflow/ (reuse / quality / efficiency). Findings collapsed into one cleanup PR — no new features, behaviour-preserving except where flagged below.

One correctness-relevant change

wait_for_completion is now event-driven. It previously polled state_store on a 100ms→2s adaptive cycle, taking the global state lock and cloning the full WorkflowState<D> (including the workflow context) on every iteration. The engine now registers a tokio::sync::Notify per instance in start_workflow and fires it from a new finalize helper. wait_for_completion snapshots the notifier, reads state once (which covers the "already terminal" case race-free), then awaits the notifier under the user's timeout. This removes per-request polling latency (up to a 2s tail) and ~10 deep state clones per worker registration / removal / update.
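The register / check once / wait under timeout sequence can be sketched without any dependencies. This is only an illustration, not the crate's code: it swaps `std::sync::Condvar` in for `tokio::sync::Notify`, folds the status and the notifier into one hypothetical `Slot`, and the `Engine`, `Status`, and `u64` instance ids are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Running,
    Completed,
    Failed,
}

impl Status {
    fn is_terminal(self) -> bool {
        !matches!(self, Status::Running)
    }
}

/// Hypothetical per-instance slot: current status plus the condvar that
/// wakes waiters (a stand-in for the per-instance `Notify`).
struct Slot {
    status: Mutex<Status>,
    done: Condvar,
}

#[derive(Default)]
pub struct Engine {
    slots: Mutex<HashMap<u64, Arc<Slot>>>,
}

impl Engine {
    /// Register the slot before any work starts, so a racing waiter
    /// always finds something to wait on.
    pub fn start_workflow(&self, id: u64) {
        let slot = Arc::new(Slot {
            status: Mutex::new(Status::Running),
            done: Condvar::new(),
        });
        self.slots.lock().unwrap().insert(id, slot);
    }

    /// Single place that records the terminal status and wakes waiters.
    pub fn finalize(&self, id: u64, outcome: Status) {
        let slot = self.slots.lock().unwrap().get(&id).cloned();
        if let Some(slot) = slot {
            *slot.status.lock().unwrap() = outcome;
            slot.done.notify_all();
        }
    }

    /// Returns the terminal status, or `None` on timeout or unknown id.
    pub fn wait_for_completion(&self, id: u64, timeout: Duration) -> Option<Status> {
        // Snapshot the slot; the map lock is released before blocking.
        let slot = self.slots.lock().unwrap().get(&id).cloned()?;
        let guard = slot.status.lock().unwrap();
        // Returns at once if the status is already terminal; otherwise
        // sleeps until `finalize` notifies or the timeout elapses.
        let (guard, _) = slot
            .done
            .wait_timeout_while(guard, timeout, |s| !s.is_terminal())
            .unwrap();
        let status = *guard;
        status.is_terminal().then_some(status)
    }
}
```

Because the status check and the wait happen under the same lock, a `finalize` call that lands between them cannot be missed, which is the property the PR gets from snapshotting the notifier before reading state.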

Two bug fixes

- cleanup_old_workflows was keeping future-timestamped workflows forever. signed_duration_since().to_std().unwrap_or_default() returned Duration::ZERO for a future updated_at (clock skew, manual fiddling), so the age was always 0 and never exceeded the TTL. The age of such a workflow is now treated as Duration::MAX, so it stays eligible for eviction.
- execute_step_with_retry was silently turning backoff exhaustion into infinite retry. backoff.next_backoff().unwrap_or_else(|| Duration::from_secs(1)) masked the policy's "stop" signal. Now folds backoff exhaustion into the retry decision and falls through to the existing on_failure hook + StepResult::Failure return.
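Both fixes come down to how a fallible value is unwrapped. The sketch below is an assumption-laden illustration rather than the crate's code: it uses `std::time::SystemTime` in place of the chrono timestamps, and `Backoff`, `StepResult`, and the function signatures are hypothetical stand-ins.

```rust
use std::time::{Duration, SystemTime};

/// Old behaviour: a future `updated_at` makes `duration_since` fail, and
/// the default (`Duration::ZERO`) never exceeds any TTL.
pub fn workflow_age_buggy(now: SystemTime, updated_at: SystemTime) -> Duration {
    now.duration_since(updated_at).unwrap_or_default()
}

/// Fixed behaviour: an unrepresentable age counts as "infinitely old".
pub fn workflow_age(now: SystemTime, updated_at: SystemTime) -> Duration {
    now.duration_since(updated_at).unwrap_or(Duration::MAX)
}

pub fn is_expired(now: SystemTime, updated_at: SystemTime, ttl: Duration) -> bool {
    workflow_age(now, updated_at) > ttl
}

/// Hypothetical backoff policy: yields delays until exhausted, then `None`.
pub struct Backoff {
    delays: std::vec::IntoIter<Duration>,
}

impl Backoff {
    pub fn new(delays: Vec<Duration>) -> Self {
        Self { delays: delays.into_iter() }
    }

    pub fn next_backoff(&mut self) -> Option<Duration> {
        self.delays.next()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum StepResult {
    Success,
    Failure,
}

/// Runs `step` until it succeeds, attempts run out, or the backoff policy
/// says stop. Returns the result and the number of attempts made.
pub fn execute_step_with_retry(
    mut step: impl FnMut() -> bool,
    mut backoff: Backoff,
    max_attempts: u32,
) -> (StepResult, u32) {
    let mut attempts = 0;
    loop {
        attempts += 1;
        if step() {
            return (StepResult::Success, attempts);
        }
        // Fold "attempts left" and "backoff not exhausted" into a single
        // decision instead of substituting a default delay for `None`.
        let retry_delay = if attempts < max_attempts {
            backoff.next_backoff()
        } else {
            None
        };
        match retry_delay {
            Some(delay) => std::thread::sleep(delay),
            None => return (StepResult::Failure, attempts),
        }
    }
}
```

With a two-delay policy and a step that always fails, this loop stops after three attempts; the pre-fix pattern of `unwrap_or_else(|| Duration::from_secs(1))` would have kept sleeping and retrying until some other limit intervened.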

Two duplications collapsed

- finalize helper absorbs the three open-coded "update state to terminal + publish workflow event" sites (deadlock branch, post-loop branch, cancel_workflow). Also fires the new completion notifier so the trio (state, event, notifier) stays in lockstep.
- spawn_subscriber_tasks helper in EventBus. publish and publish_and_wait differed only in whether they awaited the join handles; both now go through one helper.
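The shape of the `EventBus` deduplication can be sketched as follows. This is a minimal illustration under assumptions: OS threads stand in for the tokio tasks the crate spawns, events are plain `&str`, and the `Subscriber` alias and `subscribe` method are hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical subscriber type: any thread-safe callback over the event.
pub type Subscriber = Arc<dyn Fn(&str) + Send + Sync>;

#[derive(Default)]
pub struct EventBus {
    subscribers: Mutex<Vec<Subscriber>>,
}

impl EventBus {
    pub fn subscribe(&self, subscriber: Subscriber) {
        self.subscribers.lock().unwrap().push(subscriber);
    }

    /// Shared fan-out: one spawned task per subscriber, handles returned
    /// so the caller decides whether to wait.
    fn spawn_subscriber_tasks(&self, event: &str) -> Vec<JoinHandle<()>> {
        // Clone the list so the lock is not held while subscribers run.
        let subscribers = self.subscribers.lock().unwrap().clone();
        subscribers
            .into_iter()
            .map(|subscriber| {
                let event = event.to_owned();
                thread::spawn(move || (*subscriber)(&event))
            })
            .collect()
    }

    /// Fire-and-forget: dropping the handles detaches the tasks.
    pub fn publish(&self, event: &str) {
        drop(self.spawn_subscriber_tasks(event));
    }

    /// Same fan-out, but blocks until every subscriber has finished.
    pub fn publish_and_wait(&self, event: &str) {
        for handle in self.spawn_subscriber_tasks(event) {
            // A join error means the subscriber panicked; ignored here.
            let _ = handle.join();
        }
    }
}
```

The two public methods now differ by a single decision, drop or join, which is the only difference the PR description attributes to the originals.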

Polish

- #[must_use] on every builder method on StepDefinition / WorkflowDefinition, so step.with_timeout(d); (statement) is flagged.
- depends_on / depends_on_any now accept IntoIterator<Item: AsRef<str>> instead of &[&str]. Existing &["a", "b"] callers updated to ["a", "b"] (clippy needless_borrows_for_generic_args). Also adds From<&str> and From<String> impls for StepId.
- Dropped a dead seen HashSet seeding loop in the scheduler: newly_ready_from_wait indices come from a HashMap and are unique by construction; dedup is only needed for the pending_check-derived deps_ready_indices.
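A minimal sketch of the builder changes, assuming simplified `StepId` and `StepDefinition` types (the field layout, `new`, and `dependencies` are hypothetical; only the generic bound, `#[must_use]`, and the `From` impls come from the PR description):

```rust
use std::time::Duration;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct StepId(String);

impl From<&str> for StepId {
    fn from(s: &str) -> Self {
        StepId(s.to_owned())
    }
}

impl From<String> for StepId {
    fn from(s: String) -> Self {
        StepId(s)
    }
}

#[derive(Debug)]
pub struct StepDefinition {
    id: StepId,
    depends_on: Vec<StepId>,
    timeout: Option<Duration>,
}

impl StepDefinition {
    pub fn new(id: impl Into<StepId>) -> Self {
        Self { id: id.into(), depends_on: Vec::new(), timeout: None }
    }

    /// Accepts arrays, slices, `Vec<String>`, or any other iterable of
    /// string-like items.
    #[must_use]
    pub fn depends_on<I, S>(mut self, deps: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: AsRef<str>,
    {
        self.depends_on = deps.into_iter().map(|d| StepId::from(d.as_ref())).collect();
        self
    }

    /// Without `#[must_use]`, writing `step.with_timeout(d);` as a bare
    /// statement would silently discard the updated builder.
    #[must_use]
    pub fn with_timeout(mut self, timeout: Duration) -> Self {
        self.timeout = Some(timeout);
        self
    }

    pub fn id(&self) -> &StepId {
        &self.id
    }

    pub fn dependencies(&self) -> &[StepId] {
        &self.depends_on
    }

    pub fn timeout(&self) -> Option<Duration> {
        self.timeout
    }
}
```

A call such as `StepDefinition::new("deploy").depends_on(["build", "test"])` then works with an array literal, and the same method accepts a `Vec<String>` built at runtime.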

What was deliberately NOT done (Tier 2)

These were flagged but scoped out:

- Shard state_store by WorkflowInstanceId (DashMap): biggest concurrency win for worker bursts but touches every state read site. Separate PR.
- Coalesce per-step state updates (3 lock acquisitions per attempt): needs careful refactoring of execute_step_with_retry. Separate PR.
- Decompose execute_workflow (~420-line function) into drain_completions / find_ready_steps / launch_steps / wait_next_event. Separate PR.
- wait_for_shutdown polling → Notify: minor; once per process.
- Lift the duplicate app_context.as_ref().ok_or_else(...) (~30 sites in consumer steps) and start_workflow + wait_for_completion boilerplate (~7 sites in job_queue.rs); both live outside this crate.

Test plan

- cargo +nightly fmt --all (silent)
- cargo clippy --all-targets --all-features -- -D warnings (clean)
- cargo test: 3388 passed / 0 failed across the workspace.
- No public-API surface change other than depends_on(_any) accepting more types and gaining #[must_use] on builders. Existing callers in model_gateway/src/workflow/ updated for the clippy lint.

Summary by CodeRabbit

New Features
- Workflow step dependencies now accept flexible iterable input types for enhanced usability.
- Added convenient conversion support for creating step identifiers from strings.
Bug Fixes
- Fixed step retry behavior when backoff delay exhaustion occurs.
- Improved workflow cleanup to correctly handle future-dated timestamps due to clock skew.
Improvements
- Added compiler hints (#[must_use]) to builder methods for better development experience.
- Refactored event publishing for improved efficiency.
- Enhanced workflow completion tracking with notifier-based waiting mechanism.

A multi-agent review of `crates/workflow/` surfaced one correctness-relevant change, two bug fixes, two duplications worth collapsing, and a handful of small polish items.

**`wait_for_completion` is now event-driven.** It used to poll the state store on a 100ms→2s adaptive cycle, taking a write/read lock and cloning the full `WorkflowState<D>` (including the workflow context) on every iteration just to learn that the workflow had ended. The engine now registers a `tokio::sync::Notify` per instance in `start_workflow` and fires it from a new `finalize` helper. `wait_for_completion` snapshots the notifier, reads state once (covers the "already terminal" case race-free), then awaits the notifier under the user's timeout. This removes per-request polling latency (was up to 2s tail) and ~10 deep state clones per worker registration / removal / update.

**`finalize` collapses three duplicated terminal sites.** The deadlock branch, the post-loop completion branch, and the explicit `cancel_workflow` path were each open-coding the `state_store.update + event_bus.publish` pair. They now share `finalize(instance_id, WorkflowOutcome)` so the trio (state, event, notifier) stays in lockstep.

**`EventBus::publish` and `publish_and_wait`** differed only in whether they awaited the spawned subscriber tasks. Extracted a single `spawn_subscriber_tasks` helper; `publish` drops the handles, `publish_and_wait` awaits them. ~30 lines saved, no behaviour change.

**Bugs fixed:**

- `cleanup_old_workflows` used `unwrap_or_default()` on the `signed_duration_since().to_std()` result. An `updated_at` in the future (clock skew, manual fiddling) returned `Duration::ZERO` and the workflow was kept forever. Now treats those as `Duration::MAX` so they stay eligible for TTL eviction.
- `execute_step_with_retry` masked an exhausted `backoff::next_backoff()` with `unwrap_or_else(|| Duration::from_secs(1))`, silently turning "stop retrying" into "retry forever at 1s". Now folds backoff exhaustion into the retry decision and falls through to the `on_failure` hook.

**Polish:**

- `#[must_use]` on every builder method on `StepDefinition` and `WorkflowDefinition` so accidental drops like `step.with_timeout(...);` are flagged.
- `depends_on` / `depends_on_any` now accept `IntoIterator<Item: AsRef<str>>` instead of `&[&str]`, removing the call-site `&` and accepting any string-like iterable. `From<&str>` and `From<String>` impls on `StepId` for callers who already have one.
- Dropped a dead `seen` HashSet seeding loop over `newly_ready_from_wait`; those indices come from a HashMap and are unique by construction, so dedup is only required for the `pending_check`-derived `deps_ready_indices`.

Net `+292 / -188` across 11 files; the `wait_for_completion` refactor accounts for most of the new lines (helper + better doc comments). All 3388 workspace tests pass.

Signed-off-by: Simo Lin <25425177+slin1237@users.noreply.github.com>

coderabbitai · 2026-05-15T17:34:58Z

Walkthrough

The workflow engine API is generalized to accept flexible iterables for step dependencies, completion tracking is refactored to use per-instance notifiers, termination logic is unified through a finalize method, retry behavior is corrected to respect exhausted backoff, and all test and consumer call sites are migrated to the new syntax.

Changes

Workflow Engine Enhancement & API Migration

| Layer / File(s) | Summary |
| --- | --- |
| **Builder API Flexibility and Type Conversions**<br>`crates/workflow/src/definition.rs`, `crates/workflow/src/types.rs` | `StepDefinition::depends_on` and `depends_on_any` accept generic `IntoIterator<Item = S>` where `S: AsRef<str>` instead of fixed `&[&str]` slices. `must_use` attributes added to builder methods. `StepId` gains `From<&str>` and `From<String>` trait implementations for seamless conversions. |
| **Per-Workflow Completion Notification System**<br>`crates/workflow/src/engine.rs` | New `completion_notifiers` map tracks per-instance `Notify` slots. `start_workflow` initializes notifier before publishing `WorkflowStarted`. `clone_for_execution` propagates notifiers so spawned tasks can signal completion. |
| **Unified Workflow Termination & Finalization**<br>`crates/workflow/src/engine.rs` | New `WorkflowOutcome` enum and `finalize()` method centralize terminal state persistence, event publishing, and waiter notification. `cancel_workflow` delegates to finalize. Deadlock/failure handling and cancellation paths refactored to use finalize instead of inline updates. |
| **Event-Driven Wait-for-Completion Mechanism**<br>`crates/workflow/src/engine.rs` | Replaces adaptive polling with notifier-based waiting. New `result_from_state` helper maps terminal `WorkflowStatus` to `Result<String, String>`. Subscribes to per-instance notifier, checks terminal cases, waits with timeout, and processes completion events. |
| **Backoff-Respecting Retry Logic & Scheduling**<br>`crates/workflow/src/engine.rs` | `execute_step_with_retry` computes optional `retry_delay` from backoff policy; exhausted backoff (`None`) triggers warning and non-retry path instead of default delay. Step readiness deduplication refined: `newly_ready_from_wait` treated as unique; `deps_ready_indices` deduplicated via `seen` set. |
| **EventBus Subscriber Task Spawning Refactor**<br>`crates/workflow/src/event.rs` | New private `spawn_subscriber_tasks` helper centralizes subscriber fan-out logic. `publish` delegates and drops handles (fire-and-forget). `publish_and_wait` delegates and awaits handles. Timeout/panic behavior preserved with reorganized spawning. |
| **Workflow Cleanup Future-Date Edge Case**<br>`crates/workflow/src/state.rs` | `cleanup_old_workflows` treats future-dated workflows (clock skew) as max-age eligible by using `unwrap_or(Duration::MAX)` instead of zero default. Added explanatory comments. |
| **Workflow Integration Test Suite Migration**<br>`crates/workflow/tests/workflow_test.rs` | Updated DAG dependency, cycle, missing-dependency, conditional, and `depends_on_any` tests to use new array-literal syntax (`depends_on(["..."])`) instead of borrowed slices. Test logic unchanged. |
| **Model Gateway Workflow Definition Updates**<br>`model_gateway/src/workflow/mcp_registration.rs`, `model_gateway/src/workflow/steps/local/mod.rs`, `model_gateway/src/workflow/steps/mod.rs`, `model_gateway/src/workflow/wasm_module_registration.rs`, `model_gateway/src/workflow/wasm_module_removal.rs` | All workflow step dependency declarations migrated from borrowed-slice syntax (`depends_on(&["..."])`) to array-literal syntax (`depends_on(["..."])`). Affects MCP, WASM, worker registration/update/removal workflows. No logic changes. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

lightseekorg/smg#641: Overlaps with this PR's StepDefinition dependency builder/signature changes through cycle-detection refactoring built around the updated depends_on/depends_on_any API.
lightseekorg/smg#1491: Both PRs modify workflow wiring and step dependency declarations (e.g., model_gateway/src/workflow/steps/local/mod.rs), so this PR's API generalization directly impacts the newly added drain_workers step.

Suggested labels

workflow, model-gateway, tests

Suggested reviewers

CatherineSue
key4ng

Poem

🐰 The workflows now notify us with finesse,
No polling loops in the test-driven mess,
Dependencies bend to the array we pass,
Retries respect when backoff's run last,
And wait-for-completion moves gracefully fast!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'refactor(workflow): cleanup pass on the wfaas crate' accurately summarizes the primary changes: refactoring the workflow/wfaas crate with multiple improvements and bug fixes across multiple files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec0607c8fc


chatgpt-codex-connector · 2026-05-15T17:37:47Z

+                // The status is already Cancelled (set by `cancel_workflow`);
+                // we just need to fan the event out and exit. Use the
+                // event-bus directly to avoid a redundant state write.
                self.event_bus


Notify waiters on externally cancelled workflows

When a workflow is marked Cancelled by a shared/custom StateStore (or any path other than this engine's cancel_workflow) after wait_for_completion has already read the non-terminal state, this branch publishes the cancellation event and returns without going through finalize, so the per-instance Notify is never fired or removed. The old polling waiter would observe the cancelled state on its next poll, but the new event-driven waiter will sleep until its timeout even though the workflow has already terminated.


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/workflow/src/engine.rs (1)

521-528: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Duplicate WorkflowCancelled event published.

When cancel_workflow is called, it invokes finalize(..., Cancelled) which already publishes WorkflowCancelled (lines 1137-1139). When execute_workflow subsequently detects cancellation via is_cancelled() and reaches this branch, it publishes the same event again.

The comment states "we just need to fan the event out", but finalize has already done this.

Proposed fix

             if self.state_store.is_cancelled(instance_id).await? {
-                // The status is already Cancelled (set by `cancel_workflow`);
-                // we just need to fan the event out and exit. Use the
-                // event-bus directly to avoid a redundant state write.
-                self.event_bus
-                    .publish(WorkflowEvent::WorkflowCancelled { instance_id })
-                    .await;
+                // The status is already Cancelled and the event was published
+                // by `cancel_workflow` via `finalize`. Just exit cleanly.
                 return Ok(());
             }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/workflow/src/engine.rs` around lines 521 - 528, The duplicate
WorkflowCancelled event is published because cancel_workflow -> finalize(...)
already emits WorkflowEvent::WorkflowCancelled; in execute_workflow when
detecting cancellation via self.state_store.is_cancelled(instance_id).await? you
should not republish the same event. Remove or guard the
event_bus.publish(WorkflowEvent::WorkflowCancelled { instance_id }) call in
execute_workflow (the branch that currently fans out the event) so it only
returns Ok(()) after detecting cancellation, or add a check to avoid publishing
if finalize has already emitted it; reference self.state_store.is_cancelled,
execute_workflow, event_bus.publish, WorkflowEvent::WorkflowCancelled,
cancel_workflow, and finalize to locate and update the code.


📥 Commits

Reviewing files that changed from the base of the PR and between 02ff4b8 and ec0607c.

📒 Files selected for processing (11)

crates/workflow/src/definition.rs
crates/workflow/src/engine.rs
crates/workflow/src/event.rs
crates/workflow/src/state.rs
crates/workflow/src/types.rs
crates/workflow/tests/workflow_test.rs
model_gateway/src/workflow/mcp_registration.rs
model_gateway/src/workflow/steps/local/mod.rs
model_gateway/src/workflow/steps/mod.rs
model_gateway/src/workflow/wasm_module_registration.rs
model_gateway/src/workflow/wasm_module_removal.rs

coderabbitai · 2026-05-15T17:39:15Z

+        let handles = self.spawn_subscriber_tasks(event).await;
+        for handle in handles {
+            let _ = handle.await;
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that JoinHandle errors are currently discarded in publish_and_wait
rg -n --type=rust -C3 'publish_and_wait|for handle in handles|handle\.await' crates/workflow/src/event.rs
# Verify existing timeout logging and absence/presence of join-error logging paths
rg -n --type=rust -C2 'timed out|JoinError|is_panic|subscriber task failed' crates/workflow/src/event.rs

Repository: lightseekorg/smg

Length of output: 749

🏁 Script executed:

#!/bin/bash
# Find spawn_subscriber_tasks implementation to understand task type
rg -n --type=rust -A10 'fn spawn_subscriber_tasks' crates/workflow/src/event.rs
# Look for tests of publish_and_wait
rg -n --type=rust 'test.*publish_and_wait|publish_and_wait.*test' crates/workflow/src/
# Check if warn! macro is used elsewhere in this file
rg -n --type=rust 'warn!' crates/workflow/src/event.rs
# Check imports to see what logging is available
rg -n --type=rust '^use|^extern' crates/workflow/src/event.rs | head -20

Repository: lightseekorg/smg

Length of output: 737

🏁 Script executed:

#!/bin/bash
# Search for tests that might test publish_and_wait behavior
fd -t f '\.rs$' --type f | xargs grep -l 'publish_and_wait' 2>/dev/null
# Look for other handle.await patterns in rust files to see if this is a widespread pattern
rg -n --type=rust 'handle\.await' crates/workflow/src/
# Check if there's a tests directory
find . -type d -name 'tests' -o -type d -name 'test' | head -5

Repository: lightseekorg/smg

Length of output: 276

🏁 Script executed:

#!/bin/bash
# Check for tests in workflow crate
find crates/workflow -name '*.rs' -type f | xargs grep -l 'publish_and_wait\|#\[test\]\|#\[tokio::test\]' 2>/dev/null
# Look for any #[test] or #[tokio::test] in event.rs
rg -n --type=rust '#\[.*test' crates/workflow/src/event.rs
# Check if there's a mod tests section at the bottom of event.rs
tail -50 crates/workflow/src/event.rs

Repository: lightseekorg/smg

Length of output: 1713

🏁 Script executed:

#!/bin/bash
# Check workflow_test.rs for publish_and_wait tests
rg -n --type=rust -B2 -A15 'publish_and_wait' crates/workflow/tests/workflow_test.rs
# Also check if event.rs has any comments about error handling intention
rg -n --type=rust -B2 -A2 'let _ = handle' crates/workflow/src/event.rs

Repository: lightseekorg/smg

Length of output: 226

🏁 Script executed:

#!/bin/bash
# Check what's in workflow_test.rs
wc -l crates/workflow/tests/workflow_test.rs
# See if there are any tests that call publish_and_wait indirectly
rg -n --type=rust 'pub_event|EventPublisher|publish' crates/workflow/tests/workflow_test.rs | head -20
# Check for any error handling patterns in the spawn_subscriber_tasks implementation
rg -n --type=rust -A30 'fn spawn_subscriber_tasks' crates/workflow/src/event.rs | grep -A20 'into_iter'

Repository: lightseekorg/smg

Length of output: 1008

Handle subscriber task join failures instead of discarding them.

At line 131, let _ = handle.await drops any JoinError from subscriber task panics or cancellations. This prevents visibility into subscriber failures during event processing in publish_and_wait.

Suggested fix

     pub async fn publish_and_wait(&self, event: WorkflowEvent) {
         let handles = self.spawn_subscriber_tasks(event).await;
         for handle in handles {
-            let _ = handle.await;
+            if let Err(join_err) = handle.await {
+                warn!(
+                    is_panic = join_err.is_panic(),
+                    error = %join_err,
+                    "Event subscriber task failed"
+                );
+            }
         }
     }


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/workflow/src/event.rs` around lines 129 - 132, The code currently ignores JoinError by doing let _ = handle.await in publish_and_wait after calling spawn_subscriber_tasks(event); update publish_and_wait to handle each handle.await result instead of discarding it: inspect the Result from handle.await, log or propagate JoinError with context (including subscriber/task id or event info) and ensure failures are surfaced (e.g., collect errors and return Err or processLogger/error macro); reference the spawn_subscriber_tasks and the handles iteration where handle.await is used to implement this error handling.

claude · 2026-05-15T17:39:57Z

+                // The status is already Cancelled (set by `cancel_workflow`);
+                // we just need to fan the event out and exit. Use the
+                // event-bus directly to avoid a redundant state write.


🟣 Pre-existing: cancel_workflow → finalize now publishes WorkflowCancelled and fires notify_waiters. Then the main loop hits this is_cancelled branch and publishes WorkflowCancelled a second time. This double-publish was already present before the refactor (old cancel_workflow also published the event directly), but now that finalize centralises state + event + notify, the loop's publish is redundant and the comment should reflect that.

Consider removing the publish call here and just doing return Ok(()), since finalize already handled everything.

gemini-code-assist

Code Review

This pull request refactors the workflow engine to replace adaptive polling with a notification-based completion system using tokio::sync::Notify. It introduces a centralized finalize method to ensure that state updates, event publishing, and waiter notifications remain synchronized across all termination paths. Additionally, the StepDefinition builder API was improved with #[must_use] attributes and more flexible dependency input types. Feedback was provided regarding a redundant event publication in the cancellation logic that should be removed to avoid duplicate events.

gemini-code-assist · 2026-05-15T17:40:09Z

                self.event_bus
                    .publish(WorkflowEvent::WorkflowCancelled { instance_id })
                    .await;


This publish call is redundant. The cancel_workflow method now calls the finalize helper, which already handles updating the state to Cancelled and publishing the WorkflowCancelled event. When the background execution task detects the cancellation via is_cancelled, it should simply exit the loop.

return Ok(());

claude · 2026-05-15T17:40:14Z

+        // Register the completion slot before publishing `WorkflowStarted`
+        // so a `wait_for_completion` call that races the spawned task
+        // always finds something to subscribe to. `finalize` removes it.
+        self.completion_notifiers
+            .write()
+            .insert(instance_id, Arc::new(Notify::new()));


🟡 Nit: If the spawned execute_workflow task returns Err (e.g. a state_store operation fails via ?), finalize is never called and this notifier entry is never removed or fired. Any wait_for_completion caller would then hang until its timeout rather than failing fast.

Consider either (a) adding a drop guard in the spawned task that fires notify_waiters() on all exit paths, or (b) calling finalize with WorkflowOutcome::Failed in the Err branch of the spawned task below (line ~491).
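Option (a) can be sketched as a guard whose `Drop` impl wakes waiters on every exit path, including an early `?` return. This is an illustration under assumptions, not the engine's code: `std::sync::Condvar` stands in for `tokio::sync::Notify`, and `Completion`, `NotifyOnDrop`, and the `fail_early` flag are hypothetical.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical completion slot shared by the task and its waiters.
#[derive(Default)]
pub struct Completion {
    done: Mutex<bool>,
    cv: Condvar,
}

impl Completion {
    /// Returns `true` if the task finished before the timeout.
    pub fn wait(&self, timeout: Duration) -> bool {
        let guard = self.done.lock().unwrap();
        let (guard, _) = self
            .cv
            .wait_timeout_while(guard, timeout, |done| !*done)
            .unwrap();
        *guard
    }
}

/// Marks the slot as done and wakes waiters when dropped, whether the
/// owning function returned `Ok`, returned `Err`, or unwound.
pub struct NotifyOnDrop(pub Arc<Completion>);

impl Drop for NotifyOnDrop {
    fn drop(&mut self) {
        // Tolerate a poisoned lock so the guard still fires during unwinding.
        *self.0.done.lock().unwrap_or_else(|e| e.into_inner()) = true;
        self.0.cv.notify_all();
    }
}

/// Stand-in for the spawned task body; `fail_early` simulates a state
/// store error surfacing before any terminal bookkeeping runs.
pub fn execute_workflow(completion: Arc<Completion>, fail_early: bool) -> Result<(), String> {
    let _guard = NotifyOnDrop(completion);
    if fail_early {
        return Err("state store unavailable".to_owned());
    }
    Ok(())
}
```

A waiter woken this way still has to re-read the stored state to learn why the task ended, so in a real engine the guard would complement the terminal-state write rather than replace it.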

claude

Clean, well-reasoned refactor. The finalize consolidation, event-driven wait_for_completion, backoff-exhaustion fix, and clock-skew cleanup all look correct.

Comments posted: 1 🟣 Pre-existing, 1 🟡 Nit — no blocking issues.

🟣 Pre-existing (engine.rs:522): WorkflowCancelled event is published twice on cancel — once by finalize (via cancel_workflow) and once by the main loop's is_cancelled handler. Now that finalize centralises the trio, the loop's publish is redundant.
🟡 Nit (engine.rs:464): If the spawned execute_workflow task exits via Err (state store failure), finalize is never called and the completion_notifiers entry leaks. A drop guard or finalize call in the error branch would make wait_for_completion fail fast instead of timing out.

slin1237 requested a review from CatherineSue as a code owner May 15, 2026 17:34

github-actions Bot added the tests, workflow, and model-gateway labels May 15, 2026

chatgpt-codex-connector Bot reviewed May 15, 2026

coderabbitai Bot requested changes May 15, 2026

claude Bot reviewed May 15, 2026

gemini-code-assist Bot reviewed May 15, 2026

claude Bot reviewed May 15, 2026

claude Bot approved these changes May 15, 2026
