Fix Ctrl+C cancellation race in baml-cli test#3083
Fix Ctrl+C cancellation race in baml-cli test#3083rossirpaulo wants to merge 1 commit intoBoundaryML:canaryfrom
Conversation
|
@rossirpaulo is attempting to deploy a commit to the Boundary Team on Vercel. A member of the Team first needs to authorize it. |
📝 WalkthroughWalkthroughThe change replaces the Ctrl-C handler's use of Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
| fn emit_cancel_signal(cancel_notify: &tokio::sync::Notify) { | ||
| // Persist one permit in case SIGINT arrives before the executor starts awaiting. | ||
| cancel_notify.notify_one(); | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Confirm the single-waiter assumption for this Notify.
notify_one() only wakes one waiter; if multiple tasks await the cancel token, others could miss cancellation. If multi-waiter is possible, consider broadcasting to current waiters and still persisting a permit for future ones.
Suggested adjustment to preserve broadcast + persistence
fn emit_cancel_signal(cancel_notify: &tokio::sync::Notify) {
// Persist one permit in case SIGINT arrives before the executor starts awaiting.
- cancel_notify.notify_one();
+ cancel_notify.notify_waiters();
+ cancel_notify.notify_one();
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| fn emit_cancel_signal(cancel_notify: &tokio::sync::Notify) { | |
| // Persist one permit in case SIGINT arrives before the executor starts awaiting. | |
| cancel_notify.notify_one(); | |
| } | |
| fn emit_cancel_signal(cancel_notify: &tokio::sync::Notify) { | |
| // Persist one permit in case SIGINT arrives before the executor starts awaiting. | |
| cancel_notify.notify_waiters(); | |
| cancel_notify.notify_one(); | |
| } |
There was a problem hiding this comment.
So with current architecture, multi-waiter miss is not happening.
Your concern is valid for future design though: if multiple tasks start awaiting the same cancel token, Notify becomes awkward for “wake all current waiters + stay cancelled for future waiters.” Naively combining notify_waiters() and notify_one() still has race windows. For that future case, a sticky broadcast cancellation primitive (for example CancellationToken-style semantics) is the safer model.
There was a problem hiding this comment.
@rossirpaulo, thank you for the clarification! You're absolutely right—the single-waiter design in the current architecture makes notify_one() the correct choice here, and my suggestion doesn't apply.
I appreciate you noting the future consideration as well. If the architecture ever evolves to support multiple concurrent waiters on the same cancellation signal, a dedicated primitive like tokio_util::sync::CancellationToken would indeed be the safer path forward, as it provides the "sticky broadcast" semantics you described without the race windows.
| #[cfg(test)] | ||
| mod tests { | ||
| use std::time::Duration; | ||
|
|
||
| #[tokio::test] | ||
| async fn notify_waiters_drops_signal_if_emitted_before_waiting() { | ||
| let notify = tokio::sync::Notify::new(); | ||
| notify.notify_waiters(); | ||
|
|
||
| let result = tokio::time::timeout(Duration::from_millis(20), notify.notified()).await; | ||
| assert!( | ||
| result.is_err(), | ||
| "notify_waiters should not persist signals for future waiters", | ||
| ); | ||
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn notify_one_persists_signal_if_emitted_before_waiting() { | ||
| let notify = tokio::sync::Notify::new(); | ||
| super::emit_cancel_signal(¬ify); | ||
|
|
||
| let result = tokio::time::timeout(Duration::from_millis(20), notify.notified()).await; | ||
| assert!( | ||
| result.is_ok(), | ||
| "notify_one should persist one signal for a future waiter", | ||
| ); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, verify the file and code at the specified location
cat -n engine/cli/src/commands.rs | sed -n '356,382p'Repository: BoundaryML/baml
Length of output: 1122
🏁 Script executed:
# Check for other timeout patterns in the codebase to understand conventions
rg "Duration::from_millis" --type rust -A 2 -B 2Repository: BoundaryML/baml
Length of output: 33354
🏁 Script executed:
# Also check for tokio test patterns
rg "tokio::time::timeout" --type rust -A 2 -B 2Repository: BoundaryML/baml
Length of output: 1115
Increase timeout to reduce CI flakiness in unit tests.
These 20ms timeouts are prone to flakiness under CI load. Increase to 200ms (or higher) to provide a safety margin for OS scheduling overhead without affecting test correctness.
Example: increase timeout
- let result = tokio::time::timeout(Duration::from_millis(20), notify.notified()).await;
+ let result = tokio::time::timeout(Duration::from_millis(200), notify.notified()).await;
Summary
baml-cli testby usingNotify::notify_one()instead ofnotify_waiters()Notifybehavior when the signal arrives before a waiter is registeredIssue
notify_waiters()only wakes tasks already waiting onnotified()and does not persist a permit.In
Commands::Test, the SIGINT handler can run before the test executor reachesnotified().await. When that timing happens, the cancellation signal is dropped. Result: Ctrl+C can be ignored and the test run can continue hanging.Why this fix
notify_one()stores a permit when there is no active waiter, so a laternotified().awaitwill complete immediately. That removes the race while preserving current cancellation flow.Concrete evidence
engine/cli/src/commands.rsusedcancel_clone.notify_waiters()in the Ctrl+C handlernotify.notified().awaitinengine/baml-runtime/src/test_executor/mod.rsValidation
RUSTUP_TOOLCHAIN=stable-aarch64-apple-darwin cargo test -p baml-cli notify_ -- --nocaptureRUSTUP_TOOLCHAIN=stable-aarch64-apple-darwin cargo test -p baml-runtime cancel_notify_returns_cancelled -- --nocaptureSummary by CodeRabbit
Tests
Chores