fix(poster): harden upload pipeline against deadlocks, panics, and throttle races#230
Merged
Conversation
…rottle races Investigation of internal/poster surfaced four production-reachable crash and resource-leak paths. Each is fixed with a minimal local change; no behavioural changes on the happy path. - checkQueue send is now ctx-guarded. If checkLoop returns early (e.g. on a verify error) the 100-slot buffer no longer wedges postLoop forever; the post is cleaned up and the loop exits. - addPost now errors cleanly when GroupPolicy=each_file but cfg.Groups is empty, instead of panicking inside rand.Intn(0). - Throttle bucket math is mutex-protected. The previous lock-free CAS loop let concurrent consumers overdraw past zero, silently bypassing the configured rate under contention (more observable since the shared worker pool change in #228). - Read-ahead ReadAt is wrapped in a 60s stall guard so an unresponsive mount surfaces a clean error instead of hanging the upload pipeline. The OS read goroutine still leaks until the kernel returns (Go limitation on regular files), but the upper pipeline stays alive. Verified with `go test -race -count=3 ./internal/poster/...` and `go vet`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Defensive review of
internal/postersurfaced four production-reachable crash and resource-leak paths. Each is fixed with a minimal local change; no behavioural changes on the happy path.checkQueuesendrand.Intn(0)on empty groupsGroupPolicy=each_fileandcfg.Groupsis emptyReadAthang on stalled mountDetails
checkQueue send guard —
checkQueuehas a 100-slot buffer. IfcheckLoopreturns from any of its error paths (verify failure, ctx cancel, pause error, max-retries-exhausted), nothing drains the buffer. After 100 more posts,postLoopblocked forever on the bare send. Now wrapped in aselect { case checkQueue <- post: case <-ctx.Done(): /* cleanup */ }. On the cancel branch, the post is closed and itspostsInFlight/post.wgcounters are released so the closer goroutine can complete.Empty-groups guard — Config validation may prevent empty
Groups, but the poster did not enforce it locally. A misconfigured runtime now gets a typed error instead of a panic.Throttle correctness — The previous CAS-based logic had two races: after a successful CAS, two goroutines could both pass
newTokens >= bytesand both subtract (overdraw), and the post-sleep path always subtracted without re-checking. Replaced with a mutex-protected critical section. Atomic field types retained so existing tests compile unchanged.Read-ahead stall guard —
*os.File.ReadAtcannot be interrupted by context on regular files. NewreadAtWithStallGuardhelper wraps it with a 60s wall-clock timeout. The OS-level goroutine still leaks until the kernel releases the syscall (fundamental Go limitation), but the upload pipeline gets a clean error and operators see a clear log line.Out of scope
Three items from the initial investigation report turned out, on closer review, not to be real bugs:
postsInFlight.Add(1)ordering — the retry path Adds before Doneing the original, so the counter cannot transit zero between them.errChanbuffer — sized 4 with max 3 outstanding writers, never fills.postArticleunboundedmake— dead code; onlypostArticleWithBodyis on the hot path.These sites were inspected but not modified.
Test plan
go test -race -count=3 -timeout=180s ./internal/poster/...— passesgo vet ./internal/poster/...— cleango build ./...— cleanMaxConcurrentUploads > 1, confirm observed throughput matches configured rate (regression test for H3)