c-bridge: drop worker Arc clone before waking lang to fix finalize_shutdown race#1330
c-bridge: drop worker Arc clone before waking lang to fix finalize_shutdown race#1330EdmondDantes wants to merge 1 commit into
Conversation
…allback Each poll/complete/validate FFI fn clones `worker.worker` into a spawned Tokio task and drops that clone only at task-scope end — after invoking the completion callback that wakes the foreign (lang) thread. temporal_core_worker_finalize_shutdown then takes sole ownership via Arc::try_unwrap, which intermittently fails with "Cannot finalize, expected 1 reference, got N" when a just-woken lang thread calls finalize while a poll/complete task's clone is still being dropped. Because finalize takes the worker out of the Option before try_unwrap, the worker is consumed and the call cannot be retried. Drop the clone as soon as the await result is in hand and before the callback fires (notify-after-release ordering), in all seven spawning fns. The woken lang thread is then guaranteed to observe sole ownership, eliminating the race at the source with no added synchronization. Pure reordering; no behavior change on the success path.
|
Edmond seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
A thin PHP extension that embeds the official Temporal Rust core (temporalio/sdk-rust, via sdk-core-c-bridge) in-process and exposes it to the TrueAsync reactor as an async transport. The core runs gRPC on its own Tokio threads; completions wake the reactor through a cross-thread trigger, so each call looks synchronous while the coroutine parks underneath. The Tokio-side callback path is kept Zend/TSRM-free by design. Surface (everything user-facing comes from the reused true-async/sdk-php fork): - Core\Connection: connect + async rpcCall(service, method, bytes): bytes - Core\Worker: poll/complete for activity tasks and workflow activations, activity heartbeat recording, and the shutdown lifecycle - TemporalException / ConnectionException / ServiceException The vendored core points at true-async/sdk-rust (v0.4.0 + a worker-shutdown Arc-race fix submitted upstream as temporalio/sdk-rust#1330). See CHANGELOG.md for the feature list and docs/DESIGN.md for the architecture.
There was a problem hiding this comment.
This makes sense. Thanks for the fix!
My only suggestion would be I think we can drop the repetitive comment in each spot. Maybe we can have just one instance of it somewhere generic?
while building a PHP SDK on this C bridge
Would be curious to hear more about this! We do have https://github.com/temporalio/sdk-php/ which you may know was initially built before Core existed on top of Go
Yes, I'm familiar with it. I was asked to create a clone that would work without Temporal. In principle, I could adapt that project to run on regular PHP-core.I mean the core/runtime without TrueAsync. |
Interesting. Having PHP to run on top of our Core SDK would be nice to have, but would constitute a V2 of the SDK essentially. Have you got in touch with the folks who work on the existing PHP SDK? They might be interested in your work. Anyway, happy to merge this if we drop the repetitive comments. |
Problem
temporal_core_worker_finalize_shutdowntakes sole ownership of the worker viaArc::try_unwrap(worker.worker.take().unwrap()):But every poll/complete/validate FFI fn clones that
Arcinto a spawned Tokio task and drops the clone only at task-scope end — after invoking the completion callback that wakes the foreign (lang) thread:So when lang reacts to a poll returning shutdown by calling
finalize_shutdown, the poll task's clone may still be alive (strong_count == 2),try_unwrapreturnsErr, and finalize fails. Because the worker is alreadytake()n out of theOption, it is spent and the call cannot be retried. This is timing-dependent: it surfaces intermittently under load (e.g. CI), not on a quiet machine.Fix
Drop the worker clone as soon as the await result is in hand and before invoking the callback (notify-after-release ordering), in all seven spawning fns (
validate,poll_workflow_activation,poll_activity_task,poll_nexus_task,complete_workflow_activation,complete_activity_task,complete_nexus_task). The woken lang thread is then guaranteed to observe sole ownership.Pure reordering — no added synchronization, no behavior change on the success path.
+28 −0, one file.Validation
Found and fixed while building a PHP SDK on this C bridge. Reproduced as intermittent
finalize_shutdownfailures on CI; with this change the worker create/poll/shutdown lifecycle and the full workflow/activity/cancellation integration suite pass deterministically against a live dev server.