
Fix Ctrl+C cancellation race in baml-cli test#3083

Open
rossirpaulo wants to merge 1 commit into BoundaryML:canary from rossirpaulo:paulo/fix-ctrlc-notify-race

Conversation

@rossirpaulo
Collaborator

@rossirpaulo rossirpaulo commented Feb 6, 2026

Summary

  • fix a Ctrl+C cancellation race in baml-cli test by using Notify::notify_one() instead of notify_waiters()
  • add focused regression tests that document Notify behavior when the signal arrives before a waiter is registered
  • keep the runtime executor path unchanged

Issue

notify_waiters() only wakes tasks already waiting on notified() and does not persist a permit.

In Commands::Test, the SIGINT handler can run before the test executor reaches notified().await. When that happens, the cancellation signal is dropped: Ctrl+C is ignored and the test run hangs.

Why this fix

notify_one() stores a permit when there is no active waiter, so a later notified().await will complete immediately. That removes the race while preserving current cancellation flow.

Concrete evidence

  • buggy call site before fix: engine/cli/src/commands.rs used cancel_clone.notify_waiters() in the Ctrl+C handler
  • runtime awaits cancellation via notify.notified().await in engine/baml-runtime/src/test_executor/mod.rs
  • race exists when signal is emitted before waiter registration

Validation

  • RUSTUP_TOOLCHAIN=stable-aarch64-apple-darwin cargo test -p baml-cli notify_ -- --nocapture
  • RUSTUP_TOOLCHAIN=stable-aarch64-apple-darwin cargo test -p baml-runtime cancel_notify_returns_cancelled -- --nocapture

Summary by CodeRabbit

  • Tests

    • Added tests validating cancellation signal handling in CLI test execution.
  • Chores

    • Improved internal implementation of signal cancellation handling to ensure proper signal persistence and reliability during operation interruption.

@vercel

vercel bot commented Feb 6, 2026

@rossirpaulo is attempting to deploy a commit to the Boundary Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

The change replaces the Ctrl-C handler's use of notify_waiters() with a new emit_cancel_signal() helper that calls notify_one() instead. This ensures a persistent cancellation signal is available for future waiters. Tests validate the difference between these notify behaviors.

Changes

  • Ctrl-C Handler and Cancellation Signal — engine/cli/src/commands.rs: Replaces notify_waiters() with notify_one() via a new helper function to ensure a persistent cancellation signal for CLI execution; adds tests validating Notify behavior differences.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 40.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Fix Ctrl+C cancellation race in baml-cli test' directly and accurately describes the main change: fixing a race condition in the Ctrl+C cancellation signal handling for the baml-cli test command.


@rossirpaulo rossirpaulo removed the codex label Feb 6, 2026
@rossirpaulo rossirpaulo requested a review from hellovai February 6, 2026 03:51
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Comment on lines +345 to +348
fn emit_cancel_signal(cancel_notify: &tokio::sync::Notify) {
    // Persist one permit in case SIGINT arrives before the executor starts awaiting.
    cancel_notify.notify_one();
}
Contributor

@coderabbitai coderabbitai bot Feb 6, 2026


🧹 Nitpick | 🔵 Trivial

Confirm the single-waiter assumption for this Notify.

notify_one() only wakes one waiter; if multiple tasks await the cancel token, others could miss cancellation. If multi-waiter is possible, consider broadcasting to current waiters and still persisting a permit for future ones.

Suggested adjustment to preserve broadcast + persistence
 fn emit_cancel_signal(cancel_notify: &tokio::sync::Notify) {
     // Persist one permit in case SIGINT arrives before the executor starts awaiting.
-    cancel_notify.notify_one();
+    cancel_notify.notify_waiters();
+    cancel_notify.notify_one();
 }

Collaborator Author


With the current architecture, this Notify has a single waiter, so a multi-waiter miss cannot happen.

Your concern is valid for future design though: if multiple tasks start awaiting the same cancel token, Notify becomes awkward for “wake all current waiters + stay cancelled for future waiters.” Naively combining notify_waiters() and notify_one() still has race windows. For that future case, a sticky broadcast cancellation primitive (for example CancellationToken-style semantics) is the safer model.

Contributor


@rossirpaulo, thank you for the clarification! You're absolutely right—the single-waiter design in the current architecture makes notify_one() the correct choice here, and my suggestion doesn't apply.

I appreciate you noting the future consideration as well. If the architecture ever evolves to support multiple concurrent waiters on the same cancellation signal, a dedicated primitive like tokio_util::sync::CancellationToken would indeed be the safer path forward, as it provides the "sticky broadcast" semantics you described without the race windows.

Comment on lines +356 to +382
#[cfg(test)]
mod tests {
    use std::time::Duration;

    #[tokio::test]
    async fn notify_waiters_drops_signal_if_emitted_before_waiting() {
        let notify = tokio::sync::Notify::new();
        notify.notify_waiters();

        let result = tokio::time::timeout(Duration::from_millis(20), notify.notified()).await;
        assert!(
            result.is_err(),
            "notify_waiters should not persist signals for future waiters",
        );
    }

    #[tokio::test]
    async fn notify_one_persists_signal_if_emitted_before_waiting() {
        let notify = tokio::sync::Notify::new();
        super::emit_cancel_signal(&notify);

        let result = tokio::time::timeout(Duration::from_millis(20), notify.notified()).await;
        assert!(
            result.is_ok(),
            "notify_one should persist one signal for a future waiter",
        );
    }
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, verify the file and code at the specified location
cat -n engine/cli/src/commands.rs | sed -n '356,382p'

🏁 Script executed:

# Check for other timeout patterns in the codebase to understand conventions
rg "Duration::from_millis" --type rust -A 2 -B 2

🏁 Script executed:

# Also check for tokio test patterns
rg "tokio::time::timeout" --type rust -A 2 -B 2



Increase timeout to reduce CI flakiness in unit tests.

These 20ms timeouts are prone to flakiness under CI load. Increase to 200ms (or higher) to provide a safety margin for OS scheduling overhead without affecting test correctness.

Example: increase timeout
-        let result = tokio::time::timeout(Duration::from_millis(20), notify.notified()).await;
+        let result = tokio::time::timeout(Duration::from_millis(200), notify.notified()).await;
