fix(poster): harden upload pipeline against deadlocks, panics, and throttle races #230

Merged
javi11 merged 1 commit into main from session/silly-brown-09d895
May 26, 2026

javi11 commented May 26, 2026

Summary

Defensive review of internal/poster surfaced four production-reachable crash and resource-leak paths. Each is fixed with a minimal local change; no behavioural changes on the happy path.

| Issue | Site | Symptom |
| --- | --- | --- |
| Unguarded checkQueue send | poster.go:642 | postLoop deadlock + goroutine leak if checkLoop exits early |
| rand.Intn(0) on empty groups | poster.go:993 | panic when GroupPolicy=each_file and cfg.Groups is empty |
| Throttle bucket race | throttle.go | concurrent consumers overdraw past zero, silently bypassing rate |
| ReadAt hang on stalled mount | poster.go:528 | read-ahead goroutine + upload pipeline hang forever on dead NFS/FUSE/USB |

Details

checkQueue send guard: checkQueue has a 100-slot buffer. If checkLoop returns from any of its error paths (verify failure, ctx cancel, pause error, max-retries-exhausted), nothing drains the buffer. Once 100 more posts had filled it, postLoop blocked forever on the bare send. The send is now wrapped in a select { case checkQueue <- post: case <-ctx.Done(): /* cleanup */ }. On the cancel branch, the post is closed and its postsInFlight / post.wg counters are released so the closer goroutine can complete.
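The guarded send can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from poster.go: the `post` struct, `enqueueForCheck`, `demoFullQueue`, the 1-slot buffer, and the use of an `atomic.Int64` for the in-flight counter are all hypothetical stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// post is a hypothetical stand-in for the poster's per-post state.
type post struct {
	id int
	wg sync.WaitGroup
}

// enqueueForCheck hands p to the checker, or gives up when ctx is cancelled.
// On the cancel branch it releases the counters held on behalf of p so that
// anything waiting on them can finish.
func enqueueForCheck(ctx context.Context, checkQueue chan<- *post, p *post, inFlight *atomic.Int64) error {
	select {
	case checkQueue <- p:
		return nil
	case <-ctx.Done():
		p.wg.Done()
		inFlight.Add(-1)
		return ctx.Err()
	}
}

// demoFullQueue fills a 1-slot queue that nobody drains (as if the checker
// had exited), cancels the context, and then attempts one more send.
func demoFullQueue() (int64, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	checkQueue := make(chan *post, 1)
	var inFlight atomic.Int64

	first := &post{id: 1}
	first.wg.Add(1)
	inFlight.Add(1)
	if err := enqueueForCheck(ctx, checkQueue, first, &inFlight); err != nil {
		return inFlight.Load(), err
	}

	cancel() // the run is torn down while the buffer is still full
	second := &post{id: 2}
	second.wg.Add(1)
	inFlight.Add(1)
	err := enqueueForCheck(ctx, checkQueue, second, &inFlight)
	second.wg.Wait() // returns at once: the cancel branch called Done
	return inFlight.Load(), err
}

func main() {
	remaining, err := demoFullQueue()
	fmt.Println(remaining, errors.Is(err, context.Canceled))
}
```

With a bare `checkQueue <- second` in place of the select, the second send in this sketch would block indefinitely, which is the failure mode described above.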

Empty-groups guard — Config validation may prevent empty Groups, but the poster did not enforce it locally. A misconfigured runtime now gets a typed error instead of a panic.
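A minimal sketch of such a local guard is shown below. The names `EmptyGroupsError` and `pickGroup` are hypothetical and the real error type in the poster may look different; the only facts taken from the description are that `rand.Intn(0)` panics and that the fix returns a typed error instead.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// EmptyGroupsError is a hypothetical typed error for a policy that needs
// at least one group when none is configured.
type EmptyGroupsError struct {
	Policy string
}

func (e *EmptyGroupsError) Error() string {
	return fmt.Sprintf("poster: group policy %q requires at least one configured group", e.Policy)
}

// pickGroup selects a random group. The length check comes first because
// rand.Intn panics when its argument is not positive.
func pickGroup(policy string, groups []string) (string, error) {
	if len(groups) == 0 {
		return "", &EmptyGroupsError{Policy: policy}
	}
	return groups[rand.Intn(len(groups))], nil
}

func main() {
	_, err := pickGroup("each_file", nil)
	var empty *EmptyGroupsError
	if errors.As(err, &empty) {
		fmt.Println("rejected:", empty)
	}

	group, err := pickGroup("each_file", []string{"alt.binaries.test"})
	fmt.Println(group, err)
}
```

Callers can then distinguish this misconfiguration from other failures with `errors.As` rather than recovering from a panic.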

Throttle correctness — The previous CAS-based logic had two races: after a successful CAS, two goroutines could both pass newTokens >= bytes and both subtract (overdraw), and the post-sleep path always subtracted without re-checking. Replaced with a mutex-protected critical section. Atomic field types retained so existing tests compile unchanged.
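The core of the mutex-protected approach might look like the sketch below. It is deliberately reduced: the `bucket` type, `tryTake`, `balance`, and `demoContention` are hypothetical names, and the refill and sleep logic of the real throttle is omitted. The point it illustrates is only that the balance check and the subtraction happen inside one critical section.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// bucket is a hypothetical, refill-free token bucket.
type bucket struct {
	mu     sync.Mutex
	tokens int64
}

// tryTake deducts n tokens only if the balance covers them. Because the
// check and the subtraction share one lock, two callers cannot both pass
// the check against the same balance.
func (b *bucket) tryTake(n int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}

// balance reads the current token count under the lock.
func (b *bucket) balance() int64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.tokens
}

// demoContention lets 50 goroutines each request 10 tokens from a bucket
// holding 100, and reports how many requests were granted and what is left.
func demoContention() (int64, int64) {
	b := &bucket{tokens: 100}
	var granted atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if b.tryTake(10) {
				granted.Add(1)
			}
		}()
	}
	wg.Wait()
	return granted.Load(), b.balance()
}

func main() {
	granted, remaining := demoContention()
	fmt.Println(granted, remaining)
}
```

Exactly ten requests succeed and the balance ends at zero regardless of scheduling, so the bucket is never overdrawn. A path that sleeps until tokens are available would need to re-run the same check under the lock after waking.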

Read-ahead stall guard: *os.File.ReadAt cannot be interrupted by context on regular files. A new readAtWithStallGuard helper wraps it with a 60s wall-clock timeout. The goroutine blocked in the read still leaks until the kernel releases the syscall (fundamental Go limitation), but the upload pipeline gets a clean error and operators see a clear log line.
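One way to build a helper of this kind is sketched below. Only the name readAtWithStallGuard and the 60s figure come from the description; the signature, the `errReadStalled` sentinel, the `stalledReader` test double, and the short timeouts used in the demo are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// errReadStalled is a hypothetical sentinel for a read that timed out.
var errReadStalled = errors.New("read stalled: no response within timeout")

type readResult struct {
	n   int
	err error
}

// readAtWithStallGuard runs ReadAt in a goroutine and stops waiting after
// timeout. The goroutine itself cannot be cancelled; it exits whenever the
// underlying read returns. After a timeout the caller must not reuse buf,
// because the late read may still write into it.
func readAtWithStallGuard(r io.ReaderAt, buf []byte, off int64, timeout time.Duration) (int, error) {
	done := make(chan readResult, 1) // buffered so a late read never blocks on send
	go func() {
		n, err := r.ReadAt(buf, off)
		done <- readResult{n: n, err: err}
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case res := <-done:
		return res.n, res.err
	case <-timer.C:
		return 0, errReadStalled
	}
}

// stalledReader simulates a dead mount: ReadAt blocks until release closes.
type stalledReader struct {
	release chan struct{}
}

func (s stalledReader) ReadAt(p []byte, off int64) (int, error) {
	<-s.release
	return 0, io.EOF
}

// demoStall performs one healthy read and one read against a stalled source.
func demoStall() (int, bool) {
	n, _ := readAtWithStallGuard(strings.NewReader("hello"), make([]byte, 5), 0, time.Second)

	release := make(chan struct{})
	defer close(release) // lets the simulated stuck read finish
	_, err := readAtWithStallGuard(stalledReader{release: release}, make([]byte, 5), 0, 50*time.Millisecond)
	return n, errors.Is(err, errReadStalled)
}

func main() {
	n, stalled := demoStall()
	fmt.Println(n, stalled)
}
```

The buffered result channel is the design choice that matters here: it lets the abandoned goroutine deliver its result and exit once the kernel finally returns, instead of blocking on a channel nobody reads.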

Out of scope

Three items from the initial investigation report turned out, on closer review, not to be real bugs:

  • postsInFlight.Add(1) ordering — the retry path Adds before Doneing the original, so the counter cannot transit zero between them.
  • errChan buffer — sized 4 with max 3 outstanding writers, never fills.
  • postArticle unbounded make — dead code; only postArticleWithBody is on the hot path.

These sites were inspected but not modified.

Test plan

  • go test -race -count=3 -timeout=180s ./internal/poster/... — passes
  • go vet ./internal/poster/... — clean
  • go build ./... — clean
  • Manual: post a folder under a throttled rate with MaxConcurrentUploads > 1, confirm observed throughput matches configured rate (regression test for H3)
  • Manual: stop the verify server mid-upload, confirm postLoop exits instead of hanging (regression test for C3)

…rottle races

Investigation of internal/poster surfaced four production-reachable
crash and resource-leak paths. Each is fixed with a minimal local
change; no behavioural changes on the happy path.

- checkQueue send is now ctx-guarded. If checkLoop returns early
  (e.g. on a verify error) the 100-slot buffer no longer wedges
  postLoop forever; the post is cleaned up and the loop exits.
- addPost now errors cleanly when GroupPolicy=each_file but
  cfg.Groups is empty, instead of panicking inside rand.Intn(0).
- Throttle bucket math is mutex-protected. The previous lock-free
  CAS loop let concurrent consumers overdraw past zero, silently
  bypassing the configured rate under contention (more observable
  since the shared worker pool change in #228).
- Read-ahead ReadAt is wrapped in a 60s stall guard so an
  unresponsive mount surfaces a clean error instead of hanging
  the upload pipeline. The OS read goroutine still leaks until
  the kernel returns (Go limitation on regular files), but the
  upper pipeline stays alive.

Verified with `go test -race -count=3 ./internal/poster/...` and
`go vet`.
javi11 merged commit 1bb4acc into main May 26, 2026
3 of 5 checks passed
javi11 deleted the session/silly-brown-09d895 branch May 26, 2026 08:26