statsig-go: pin purego to v0.8.0 to dodge concurrent-FFI race#14

Merged
nsaini-figma merged 1 commit into main from nsaini/downgrade-purego-to-v0.8.0 on May 20, 2026

@nsaini-figma
Collaborator

Summary

Downgrades the purego dependency in statsig-go/go.mod from v0.9.0 to v0.8.0 — the last release before upstream PR #282 merged (2024-10-17) and introduced a process-wide sync.Pool of *syscall15Args in func.go's RegisterFunc reflect closure. That pool, under concurrent dispatch from multiple goroutines, lets two callers observe each other's return values.

One-line change in statsig-go/go.mod (plus the corresponding go.sum update). No code changes in the binding itself.

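Supersedes both previously drafted approaches:

  • #12 (binding-side sync.Mutex workaround): caps throughput at ~83k ops/sec per process because FFI calls are serialized.
  • #13 (vendor purego with the pool revert): carries ~19k lines of upstream code in this repo for an 8-line delta.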
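
For context, the binding-side sync.Mutex workaround drafted in #12 amounts to funnelling every FFI dispatch through one lock. The following is a minimal sketch of that idea, not the code from #12: `serialized`, `evalGate`, and `runConcurrently` are hypothetical names, and `evalGate` is a pure-Go stand-in for a purego-registered function.

```go
package main

import (
	"fmt"
	"sync"
)

// ffiMu serializes every call that would cross the FFI boundary.
var ffiMu sync.Mutex

// serialized wraps fn so that at most one goroutine is inside it at a time.
func serialized(fn func(uint64) bool) func(uint64) bool {
	return func(arg uint64) bool {
		ffiMu.Lock()
		defer ffiMu.Unlock()
		return fn(arg)
	}
}

// calls is deliberately a plain int: incrementing it is only safe because
// serialized() guarantees a single caller at a time.
var calls int

// evalGate is a hypothetical stand-in for one gate evaluation over FFI.
func evalGate(id uint64) bool {
	calls++
	return id%2 == 0
}

// runConcurrently calls fn perWorker times from each of `workers` goroutines.
func runConcurrently(fn func(uint64) bool, workers, perWorker int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				fn(uint64(i))
			}
		}()
	}
	wg.Wait()
}

func main() {
	runConcurrently(serialized(evalGate), 32, 1000)
	fmt.Println(calls) // 32000
}
```

The lock makes concurrent dispatch safe, but all 32 workers now take turns, which is why this approach trades away throughput.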

Why

Symptoms in consumers: SIGSEGV in runtime.memmove on non-canonical pointers, glibc "double free or corruption (out)" aborts, nil dereferences when reading returned *byte values, and, most insidiously, silently swapped feature-flag evaluation results that pass type checks downstream. All trace to a concurrent-FFI return-value race in purego v0.9.x.

The minimal trigger is a function with signature func(uint64) *byte called from two or more goroutines simultaneously. Each goroutine can get back the other goroutine's return pointer. The full discrimination matrix is in the upstream issue draft; the relevant data points for this change:

  • The minimal purego-only repro mismatches within seconds at HEAD with workers ≥ 2 against v0.9.0 / v0.9.1.
  • The same repro against v0.8.0 (no thePool references in func.go or syscall_sysv.go) ran for ~153M total dispatches across workers ∈ {2, 4, 8, 32} with zero mismatches.
  • The full statsig-go gate-evaluation workload against v0.8.0 ran for 5 × 30s × 32 workers (~36M gate calls) with zero crashes, zero corruption messages, and ~260k ops/sec sustained, equivalent to the patched-v0.9.0 approach in #13 (purego: vendor v0.9.0 with concurrent-FFI race revert).
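
The shape of the minimal repro can be sketched as follows. This is an illustration under stated assumptions, not the repro from the upstream issue draft: `lookup` is a hypothetical pure-Go stand-in with the trigger signature func(uint64) *byte, so the sketch never mismatches. The real repro would pass a function registered through purego against a C library.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// table backs the stand-in function: lookup(i) must return &table[i].
var table [1024]byte

// lookup is a hypothetical stand-in with the trigger signature
// func(uint64) *byte. It is not a purego-registered function.
func lookup(i uint64) *byte { return &table[i%uint64(len(table))] }

// countMismatches calls fn from `workers` goroutines, `calls` times each, and
// counts calls whose returned pointer does not belong to the argument passed.
func countMismatches(fn func(uint64) *byte, workers, calls int) int64 {
	var wg sync.WaitGroup
	var mismatches int64
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for c := 0; c < calls; c++ {
				// Offset each worker's arguments so a swapped return value
				// usually shows up as a pointer to the wrong slot.
				arg := uint64(w*calls+c) % uint64(len(table))
				if fn(arg) != &table[arg] {
					atomic.AddInt64(&mismatches, 1)
				}
			}
		}(w)
	}
	wg.Wait()
	return atomic.LoadInt64(&mismatches)
}

func main() {
	for _, workers := range []int{2, 4, 8, 32} {
		n := countMismatches(lookup, workers, 100000)
		fmt.Printf("workers=%d mismatches=%d\n", workers, n)
	}
}
```

Checking the returned pointer against the argument, rather than only watching for crashes, is what catches the silent value-swap case.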

What's actually in v0.8.1 → v0.9.1 that we'd be giving up

Looking at the release notes between v0.8.0 (Sept 2024) and v0.9.1 (Nov 2025):

  • PR #282 itself — the racy memory-usage optimization. We don't want this.
  • PRs #328, #361, #408, #413, #431, #391, #403, #436 — struct argument/return support extensions, new architectures (s390x, ppc64le, linux/386, linux/arm32). statsig-go's linux/amd64 consumers use none of this.
  • PR #357 — darwin int/string fix. Not relevant for linux deploys.
  • PRs #319, #318, #343: -race and fakecgo fixes. Test infra, not user-facing.
  • Various small bug fixes, none of which match the statsig usage profile.

Net: for statsig-go's public API surface (purego.Dlopen, purego.RegisterLibFunc, purego.RTLD_NOW, purego.RTLD_GLOBAL), v0.8.0 is functionally equivalent to v0.9.x. The gap exists on paper but is invisible to consumers on linux/amd64.

Invariants worth calling out

  • Binding API unchanged. No public interface or behavior change in statsig-go. The four purego APIs used by statsig_ffi.go (Dlopen, RegisterLibFunc, RTLD_NOW, RTLD_GLOBAL) all exist with identical signatures in v0.8.0.
  • MVS implications for consumers. Go's minimal version selection (MVS) builds with the highest version of a module that any go.mod in the build graph requires. Downstream consumers that list github.com/ebitengine/purego as an indirect dependency at v0.9.0 in their own go.mod therefore need to lower that line to v0.8.0 as well; otherwise MVS still selects v0.9.0 and the bug returns. Either drop the indirect line and re-run go mod tidy after pulling this version of statsig-go, or set the indirect to v0.8.0 explicitly (for example with go get github.com/ebitengine/purego@v0.8.0). A requirement on v0.9.x from any other dependency has the same effect.
  • Upstream tracking. When upstream lands a real fix for the underlying race (open as a draft against ebitengine/purego), bump this dependency forward and drop any compensating consumer-side go.mod lines.
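
To make the MVS point concrete, here is a simplified sketch of how the selected version falls out for a single module path. It ignores replace and exclude directives and handles only plain vMAJOR.MINOR.PATCH strings. `parseVersion`, `newer`, and `selectVersion` are hypothetical helpers written for this illustration, not part of the go tool.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v0.9.0" into [0 9 0]. Pre-release and build suffixes
// are out of scope for this illustration.
func parseVersion(v string) [3]int {
	var out [3]int
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	for i := 0; i < len(parts); i++ {
		out[i], _ = strconv.Atoi(parts[i])
	}
	return out
}

// newer reports whether version a is strictly newer than version b.
func newer(a, b string) bool {
	pa, pb := parseVersion(a), parseVersion(b)
	for i := range pa {
		if pa[i] != pb[i] {
			return pa[i] > pb[i]
		}
	}
	return false
}

// selectVersion mimics the MVS outcome for one module path: the highest
// version named by any requirement in the build graph wins.
func selectVersion(requirements map[string]string) string {
	selected := ""
	for _, v := range requirements {
		if selected == "" || newer(v, selected) {
			selected = v
		}
	}
	return selected
}

func main() {
	stale := map[string]string{
		"statsig-go (after this change)":   "v0.8.0",
		"consumer go.mod (stale indirect)": "v0.9.0",
	}
	fmt.Println(selectVersion(stale)) // v0.9.0: the racy release is still built

	fixed := map[string]string{
		"statsig-go (after this change)": "v0.8.0",
		"consumer go.mod (updated line)": "v0.8.0",
	}
	fmt.Println(selectVersion(fixed)) // v0.8.0
}
```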

Test plan

  • go build ./statsig-go/... clean against v0.8.0.
  • Concurrent-FFI repro at 32 goroutines × 30s × 5 runs against the downgraded binding — ~260k ops/sec sustained, zero crashes, zero corruption messages. Same workload against v0.9.0 reliably crashes within ~5 seconds.
  • Minimal purego-only repro at workers ∈ {2, 4, 8, 32} × 20s — 153M total dispatches, zero return-value mismatches. Same workload against v0.9.0 / v0.9.1 mismatches within seconds.
  • go test ./statsig-go/... once CI runs.
  • Reviewer optionally re-verifies the discrimination matrix locally.
  • Once tagged statsig-go/v0.19.4-figma2, validate consumer integration end-to-end by pointing a consumer's go.mod at the new tag and updating its purego indirect to v0.8.0.
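
For reviewers re-running the load numbers, a time-bounded harness along the following lines produces the workers × duration × runs figures quoted above. It is a sketch only: `checkGate` is a hypothetical stand-in for one gate evaluation through the binding, and the run length is shortened from the 30s used in the test plan.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// checkGate is a hypothetical stand-in for one gate evaluation over FFI.
func checkGate(id uint64) bool { return id%2 == 0 }

// measure runs fn from `workers` goroutines for roughly d and returns the
// total number of completed calls.
func measure(fn func(uint64) bool, workers int, d time.Duration) uint64 {
	var ops uint64
	var wg sync.WaitGroup
	deadline := time.Now().Add(d)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(seed uint64) {
			defer wg.Done()
			var n uint64
			for {
				// Work in batches so the clock is not read on every call.
				for i := 0; i < 256; i++ {
					fn(seed + n)
					n++
				}
				atomic.AddUint64(&ops, 256)
				if !time.Now().Before(deadline) {
					return
				}
			}
		}(uint64(w) << 32)
	}
	wg.Wait()
	return atomic.LoadUint64(&ops)
}

func main() {
	const workers, runs = 32, 5
	const d = 200 * time.Millisecond // the test plan uses 30s per run
	for r := 1; r <= runs; r++ {
		ops := measure(checkGate, workers, d)
		fmt.Printf("run %d: %d calls, %.0f ops/sec\n", r, ops, float64(ops)/d.Seconds())
	}
}
```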

🤖 Generated with Claude Code

Downgrades the purego dependency from v0.9.0 to v0.8.0 (the last
release before upstream PR #282 merged on 2024-10-17). PR #282
introduced a process-wide sync.Pool of *syscall15Args in func.go's
RegisterFunc reflect closure. Under concurrent dispatch from multiple
goroutines, two callers can observe each other's return values —
surfacing as SIGSEGV in runtime.memmove on non-canonical pointers,
glibc "double free or corruption (out)", nil-deref at the deref of
returned *byte values, and silently-swapped feature-flag evaluation
results.

The minimal trigger is a function with signature
  func(uint64) *byte
called from two or more goroutines simultaneously. Each goroutine can
get back the other goroutine's return pointer. The full discrimination
matrix is in the upstream issue draft; the relevant data points for
this change:

- The minimal purego-only repro mismatches within seconds at HEAD with
  workers >= 2 against v0.9.0 / v0.9.1.
- The same repro against v0.8.0 (no `thePool` references in func.go or
  syscall_sysv.go) ran for ~153M total dispatches across workers in
  {2, 4, 8, 32} with zero mismatches.
- The full statsig-go gate-evaluation workload against v0.8.0 ran for
  5 x 30s x 32 workers (~36M gate calls) with zero crashes,
  zero corruption messages, ~260k ops/sec sustained — equivalent to
  the patched-v0.9.0 approach previously drafted in PR #13.

What we give up between v0.8.0 and v0.9.x:
  - PR #282 itself (the racy memory-usage optimization).
  - PR #328, #361, #408, #413, #431, #391, #403, #436 — struct
    argument/return support, new architectures (s390x, ppc64le,
    linux/386, linux/arm32). statsig-go's linux/amd64 consumers use
    none of this.
  - PR #357 — darwin int/string fix. Not relevant for linux deploys.
  - PR #319, #318, #343 — `-race` and `fakecgo` fixes. Test infra,
    not user-facing.
  - Various small bug fixes none of which match the statsig usage
    profile.

Net: v0.8.0 is functionally equivalent to v0.9.x for this binding's
public API surface. The gap exists on paper but is invisible to
consumers.

This change supersedes the previously drafted approaches:
  - #12 (binding-side sync.Mutex workaround) — caps throughput at
    ~83k ops/sec per process due to serialized FFI.
  - #13 (vendor purego with the pool revert) — carries ~19k lines
    of upstream code in this repo for an 8-line delta.

When upstream lands a real fix for the underlying race, bump this
dependency forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nsaini-figma nsaini-figma marked this pull request as ready for review May 19, 2026 19:23
@nsaini-figma nsaini-figma merged commit d654890 into main May 20, 2026
4 checks passed