Skip to content

statsig-go: serialize purego FFI dispatches to dodge upstream race#12

Closed
nsaini-figma wants to merge 1 commit into
mainfrom
nsaini/serialize-purego-ffi-dispatch
Closed

statsig-go: serialize purego FFI dispatches to dodge upstream race#12
nsaini-figma wants to merge 1 commit into
mainfrom
nsaini/serialize-purego-ffi-dispatch

Conversation

@nsaini-figma
Copy link
Copy Markdown
Collaborator

Summary

Adds a process-wide sync.Mutex on *StatsigFFI and takes it across every purego-registered call this binding dispatches. Workaround — not a root-cause fix — for a concurrent-FFI-return-value race in github.com/ebitengine/purego (v0.9.x) that has been causing intermittent labmate crashes since the cutover from the upstream binding at v0.15.1 to this fork at statsig-go/v0.19.4-figma1.

Why

The labmate FFI SIGSEGV (memmove on non-canonical pointer + glibc "double free or corruption (out)" + nil-deref at *gateJson) was narrowed end-to-end to a purego concurrency bug. Two goroutines making purego dispatches in flight at the same time can have their return pointers swapped between callers: caller A gets B's *c_char and vice versa. Once the wrong pointer lands in Go, runtime.memmove over unsafe.Slice(ptr, len) faults, and the deferred free_string(ptr) makes two callers race to free the same buffer.

Discrimination matrix (full writeup forthcoming, not committed here):

  • Plain-C pthread reproducer against the same libstatsig_ffi.so: 32 threads, ~16M ops, clean. So the Rust side is exonerated.
  • Upstream Rust SDK bisect 0.15.1 → 0.19.3 + this fork: every version crashes when paired with the current Go binding. So this is not a Rust-side regression.
  • Minimal purego-only repro (no statsig at all, single func(uint64) *byte): 2 goroutines crash within seconds with return pointers literally swapped. So the bug is purego itself.
  • A sync.Mutex around UseRustString makes the production reproducer run ~13M ops over 30s with zero crashes.

What changed

  • statsig-go/statsig_ffi.go: add mu sync.Mutex to StatsigFFI with full motivation in comment. Rewrite UseRustString to hold mu across handler() + free_string.
  • statsig-go/statsig.go: lock around every direct purego dispatch (NewStatsig, NewStatsigWithOptions, Initialize, Shutdown, FlushEvents, LogEvent, CheckGateWithOptions, the 4 ManuallyLog*Exposure methods, and release). The Get*Gate / Config / Experiment / Layer / ParameterStore / ClientInitResponse methods inherit protection via UseRustString.
  • statsig-go/statsig_types.go: lock around Layer.logExposure and the primitive-returning ParameterStore getters (GetBool, GetNumber, GetInt). GetString / GetMap / GetSlice are already UseRustString-protected.
  • statsig-go/statsig_user.go, statsig-go/statsig_options.go: lock around create + finalizer-release.
  • statsig-go/data_store.go, statsig-go/observability_client.go, statsig-go/persistent_storage.go: lock around the _create constructor calls and finalizer _release calls. The Rust→Go callbacks (Get/Set/Increment/etc.) deliberately do NOT take mu — see lock-ordering note on the mu field.

Invariants worth calling out

  • Lock scope is the C-side hop only. JSON marshal/unmarshal stays parallel. Per-call critical section is ~5–10 µs on the hot path.
  • Callbacks must not take mu. Comment on the mu field spells this out — Rust→Go callbacks fired synchronously from inside an FFI call this binding has dispatched would self-deadlock otherwise. None of the existing callback closures touch the mutex, but anyone adding new callback wiring needs to keep this invariant.
  • No .so change, no version bump. Pure Go-side patch. No statsig-go/v0.19.4-figma2 tag yet — wait for review and a labmate canary before cutting.
  • 10 FFI fields without public Go wrappers (statsig_override_*, statsig_remove_*_override, statsig_identify, async statsig_shutdown/statsig_flush_events, statsig_check_gate_performance) are untouched. When wrappers are added, they'll need the same pattern.
  • Upstream issue to be filed against ebitengine/purego with the minimal func(uint64) *byte repro. The mutex here is a tourniquet pending that fix.

Test plan

  • go build ./statsig-go/... passes.
  • Local reproducer (32 goroutines × 30s × 5 runs, ~83k ops/sec, ~11.5M total gate calls): zero crashes on patched binding. Same workload reliably crashes within seconds against unpatched v0.19.4-figma1.
  • go test ./statsig-go/... once CI runs.
  • Reviewer optionally re-runs the discrimination matrix locally — repro at /tmp/statsig-repro/ on devbox.
  • Labmate canary deploy against a v0.19.4-figma2 build of this branch before promoting fleet-wide.

🤖 Generated with Claude Code

Add a process-wide sync.Mutex on *StatsigFFI and take it across every
purego-registered call this binding dispatches. JSON marshal/unmarshal
on the Go side stays outside the critical section.

Workaround for a concurrent-FFI-return-value race in
github.com/ebitengine/purego v0.9.x. Two goroutines making purego calls
in flight simultaneously can have their return pointers swapped — the
*c_char one caller expected lands in the other caller's read, and vice
versa. Downstream this surfaces as the labmate crashes (SIGSEGV in
runtime.memmove on a non-canonical pointer, nil-deref at the *gateJson
read in GetFeatureGateWithOptions, or glibc "double free or corruption
(out)" when two callers free the same buffer).

Repros at every upstream Rust SDK version we tested (0.15.1, 0.16.0,
0.17.0, 0.18.0, 0.19.0-0.19.3, this fork's 0.19.4-figma1) when paired
with the current Go binding. Does NOT reproduce when the same .so is
called from plain C with pthreads under identical workload — purego is
the only path that exhibits it. Minimal purego-only repro: 10 lines of
C, 80 lines of Go, two concurrent goroutines calling a function with
signature `func(uint64) *byte` see their return pointers swapped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant