In crates/forkd-controller/src/http.rs, try_acquire_branch_slot (lines 117-139) keys the in-flight set on the target snapshot tag of the branch operation, not the source sandbox id:
pub fn try_acquire_branch_slot(self: &Arc<Self>, tag: &str) -> Result<BranchSlot, ...> {
...
if !in_flight.insert(tag.to_string()) { return Err(AlreadyInFlight); }
...
}
The comment on branch_in_flight (lines 53-58) explains why: "two concurrent POST /branch calls targeting the same tag from racing to clobber memory.bin". That's clearly correct — two clients picking the same explicit tag shouldn't be allowed through.
But the default tag is auto-generated as branch-<sandbox-id>-<unix-now-secs> (line 531). So the realistic collision case is two clients branching the same sandbox at the same second — which gets rejected with 409 "branch for tag 'branch-…' is already in progress". From the client's POV that's surprising: they didn't pick the same tag, the daemon did, and the daemon then refused them.
Two design questions:
-
Would per-sandbox-id exclusion be more useful? A given source sandbox can only have one in-flight pause/snapshot/resume cycle anyway (pause is exclusive — line 599 takes the Vm out of live_vms), so keying on sandbox id rather than tag gives the same correctness with a clearer error ("sandbox X is already being branched"), and the auto-tag collision goes away.
-
Should the auto-tag include sub-second precision? Even with tag-scoped locking, generating tags that include unix_now() in seconds means two branches of the same sandbox in the same second collide. A nanosecond suffix or a short random would prevent that without changing the semantics.
I have my own preference (sandbox-id exclusion + sub-second auto-tag) but don't want to send a PR without knowing the maintainers' rationale for the current shape. Happy to write either version.
In
crates/forkd-controller/src/http.rs,try_acquire_branch_slot(lines 117-139) keys the in-flight set on the target snapshot tag of the branch operation, not the source sandbox id:The comment on
branch_in_flight(lines 53-58) explains why: "two concurrentPOST /branchcalls targeting the same tag from racing to clobber memory.bin". That's clearly correct — two clients picking the same explicittagshouldn't be allowed through.But the default tag is auto-generated as
branch-<sandbox-id>-<unix-now-secs>(line 531). So the realistic collision case is two clients branching the same sandbox at the same second — which gets rejected with 409 "branch for tag 'branch-…' is already in progress". From the client's POV that's surprising: they didn't pick the same tag, the daemon did, and the daemon then refused them.Two design questions:
Would per-sandbox-id exclusion be more useful? A given source sandbox can only have one in-flight pause/snapshot/resume cycle anyway (pause is exclusive — line 599 takes the
Vmout oflive_vms), so keying on sandbox id rather than tag gives the same correctness with a clearer error ("sandbox X is already being branched"), and the auto-tag collision goes away.Should the auto-tag include sub-second precision? Even with tag-scoped locking, generating tags that include
unix_now()in seconds means two branches of the same sandbox in the same second collide. A nanosecond suffix or a short random would prevent that without changing the semantics.I have my own preference (sandbox-id exclusion + sub-second auto-tag) but don't want to send a PR without knowing the maintainers' rationale for the current shape. Happy to write either version.