Skip to content

c-bridge: drop worker Arc clone before waking lang to fix finalize_shutdown race#1330

Open
EdmondDantes wants to merge 1 commit into
temporalio:mainfrom
true-async:fix/finalize-shutdown-arc-race
Open

c-bridge: drop worker Arc clone before waking lang to fix finalize_shutdown race#1330
EdmondDantes wants to merge 1 commit into
temporalio:mainfrom
true-async:fix/finalize-shutdown-arc-race

Conversation

@EdmondDantes

Copy link
Copy Markdown

Problem

temporal_core_worker_finalize_shutdown takes sole ownership of the worker via Arc::try_unwrap(worker.worker.take().unwrap()):

let core_worker = match Arc::try_unwrap(worker.worker.take().unwrap()) {
    Ok(core_worker) => core_worker,
    Err(arc) => { /* "Cannot finalize, expected 1 reference, got {N}" */ return; }
};
core_worker.finalize_shutdown().await;

But every poll/complete/validate FFI fn clones that Arc into a spawned Tokio task and drops the clone only at task-scope end — after invoking the completion callback that wakes the foreign (lang) thread:

let core_worker = worker.worker.as_ref().unwrap().clone();
worker.runtime.core.tokio_handle().spawn(async move {
    let (success, fail) = match core_worker.poll_workflow_activation().await { ... };
    unsafe { callback(user_data.into(), success, fail); } // wakes lang...
}); // ...core_worker dropped only HERE, after the wake

So when lang reacts to a poll returning shutdown by calling finalize_shutdown, the poll task's clone may still be alive (strong_count == 2), try_unwrap returns Err, and finalize fails. Because the worker is already take()n out of the Option, it is spent and the call cannot be retried. This is timing-dependent: it surfaces intermittently under load (e.g. CI), not on a quiet machine.

Fix

Drop the worker clone as soon as the await result is in hand and before invoking the callback (notify-after-release ordering), in all seven spawning fns (validate, poll_workflow_activation, poll_activity_task, poll_nexus_task, complete_workflow_activation, complete_activity_task, complete_nexus_task). The woken lang thread is then guaranteed to observe sole ownership.

Pure reordering — no added synchronization, no behavior change on the success path. +28 −0, one file.

Validation

Found and fixed while building a PHP SDK on this C bridge. Reproduced as intermittent finalize_shutdown failures on CI; with this change the worker create/poll/shutdown lifecycle and the full workflow/activity/cancellation integration suite pass deterministically against a live dev server.

…allback

Each poll/complete/validate FFI fn clones `worker.worker` into a spawned
Tokio task and drops that clone only at task-scope end — after invoking the
completion callback that wakes the foreign (lang) thread.
temporal_core_worker_finalize_shutdown then takes sole ownership via
Arc::try_unwrap, which intermittently fails with "Cannot finalize, expected
1 reference, got N" when a just-woken lang thread calls finalize while a
poll/complete task's clone is still being dropped. Because finalize takes the
worker out of the Option before try_unwrap, the worker is consumed and the
call cannot be retried.

Drop the clone as soon as the await result is in hand and before the callback
fires (notify-after-release ordering), in all seven spawning fns. The woken
lang thread is then guaranteed to observe sole ownership, eliminating the
race at the source with no added synchronization. Pure reordering; no
behavior change on the success path.
@EdmondDantes EdmondDantes requested a review from a team as a code owner June 14, 2026 10:26
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Edmond seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

EdmondDantes added a commit to true-async/php-temporal that referenced this pull request Jun 14, 2026
A thin PHP extension that embeds the official Temporal Rust core
(temporalio/sdk-rust, via sdk-core-c-bridge) in-process and exposes it to the
TrueAsync reactor as an async transport. The core runs gRPC on its own Tokio
threads; completions wake the reactor through a cross-thread trigger, so each
call looks synchronous while the coroutine parks underneath. The Tokio-side
callback path is kept Zend/TSRM-free by design.

Surface (everything user-facing comes from the reused true-async/sdk-php fork):
- Core\Connection: connect + async rpcCall(service, method, bytes): bytes
- Core\Worker: poll/complete for activity tasks and workflow activations,
  activity heartbeat recording, and the shutdown lifecycle
- TemporalException / ConnectionException / ServiceException

The vendored core points at true-async/sdk-rust (v0.4.0 + a worker-shutdown
Arc-race fix submitted upstream as temporalio/sdk-rust#1330).

See CHANGELOG.md for the feature list and docs/DESIGN.md for the architecture.

@Sushisource Sushisource left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Thanks for the fix!

My only suggestion would be I think we can drop the repetitive comment in each spot. Maybe we can have just one instance of it somewhere generic?

while building a PHP SDK on this C bridge

Would be curious to hear more about this! We do have https://github.com/temporalio/sdk-php/ which you may know was initially built before Core existed on top of Go

@EdmondDantes

Copy link
Copy Markdown
Author

This makes sense. Thanks for the fix!

My only suggestion would be I think we can drop the repetitive comment in each spot. Maybe we can have just one instance of it somewhere generic?

while building a PHP SDK on this C bridge

Would be curious to hear more about this! We do have https://github.com/temporalio/sdk-php/ which you may know was initially built before Core existed on top of Go

Yes, I'm familiar with it. I was asked to create a clone that would work without Temporal. In principle, I could adapt that project to run on regular PHP-core.I mean the core/runtime without TrueAsync.

@Sushisource

Copy link
Copy Markdown
Member

This makes sense. Thanks for the fix!
My only suggestion would be I think we can drop the repetitive comment in each spot. Maybe we can have just one instance of it somewhere generic?

while building a PHP SDK on this C bridge

Would be curious to hear more about this! We do have https://github.com/temporalio/sdk-php/ which you may know was initially built before Core existed on top of Go

Yes, I'm familiar with it. I was asked to create a clone that would work without Temporal. In principle, I could adapt that project to run on regular PHP-core.I mean the core/runtime without TrueAsync.

Interesting. Having PHP to run on top of our Core SDK would be nice to have, but would constitute a V2 of the SDK essentially. Have you got in touch with the folks who work on the existing PHP SDK? They might be interested in your work.

@wolfy-j and @xepozz FYI

Anyway, happy to merge this if we drop the repetitive comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants