fix(poster): share one global worker pool to cap upload concurrency #228
Merged
Each Post() call previously built its own conc/pool sized to the sum of provider MaxConnections, and PAR2 ran in parallel with the main upload via errgroup in postie.Post, so in-flight posting concurrency was 2 × MaxConcurrentUploads × numOfConnections, on top of per-call read-ahead buffers. On memory-constrained hosts (e.g. Synology NAS in #184) this exhausted RAM and the kernel SIGKILL'd the process, leaving no panic or postie_crash file behind.

Bound total posting concurrency to numOfConnections by spawning a single shared worker pool in the poster. All concurrent Post() calls submit uploadJob entries to one channel drained by long-lived workers. NNTP connection lifecycle is unchanged: workers only hold a connection while actively posting. Workers init lazily via sync.Once so existing tests that build the struct directly still get a working pool. Close() drains via a shutdown channel.

Also:
- Drop oversized read-ahead body buffers (> 4× default) on return so one outlier article cannot inflate the pool's working set forever.
- Serialize check-then-Send in queue.AddFile{,WithPriority,WithOptions} with a Go-level mutex (goqite.Send uses its own DB handle so a SQL transaction cannot span IsPathInQueue + Send). Closes the TOCTOU window the watcher and processor could race through under heavy load.
- Add a 10s debug-level runtime watchdog in processor.Start logging NumGoroutine + heap stats, so the next silent OOM-kill leaves a trend line in the logs.

Fixes #184
Pulls in javi11/par2go#6 (commit bc8dbc7) which drains the C SIMD compute workers via proc.End() before proc.Close() on every chunk-loop exit path. Without it, cancelling a PAR2 job in postie aborts the host process with `Assertion failed: (isMultipleOfStride(len)), mul_add_multi_packpf, gf16mul.h:377`. This pins a pseudo-version; switch to a tagged par2go release once one is cut.
Replaces the bc8dbc7 pseudo-version with the tagged v0.0.9 release that contains the proc.End() drain fix from javi11/par2go#6. Resolves the SIMD stride assertion (mul_add_multi_packpf, gf16mul.h:377) that abort()'d the host process when a PAR2 job was cancelled mid-compute.
Summary
Fixes the silent crash reported in #184 where Postie dies during bulk uploads on memory-constrained hosts (e.g. Synology NAS) once `MaxConcurrentUploads` is raised above 1 with PAR2 enabled. No panic, no `postie_crash` file → external `SIGKILL` from the OOM killer.

Root cause
Every `Post()` call previously built its own `conc/pool` sized to `sum(provider.MaxConnections)` (see poster.go:338 pre-change). On top of that, `postie.Post` runs PAR2 creation+post and main-file post concurrently via `errgroup`, so the effective concurrent `Post()` count is `2 × MaxConcurrentUploads`. Worst case at `MaxConcurrentUploads=4` with 40 connections: ~320 goroutines (2 × 4 × 40) and ~300 MB of read-ahead body buffers in flight, enough to OOM-kill a 2-4 GB NAS.

Fix
- Shared worker pool in `poster`. Total posting concurrency is now bounded by `numOfConnections` for the entire process, regardless of how many `Post()` calls are in flight. All concurrent `Post()` calls submit `uploadJob` entries to one channel drained by long-lived workers. NNTP connection lifecycle is unchanged: workers only hold a connection from `uploadPool` while actively posting an article. Worker startup is lazy via `sync.Once` so existing struct-literal tests keep working. `Close()` drains via a `shutdown` channel.
- `addMu sync.Mutex` serializes the check-then-`Send` sequence in `AddFile`, `AddFileWithPriority`, and `AddFileWithOptions`. `goqite.Send` uses its own DB handle so a SQL transaction can't span `IsPathInQueue` + `Send`; a Go mutex achieves the same atomicity. Left `SetMaxOpenConns(1)` alone per the explicit design note at queue.go:164.
- A 10s debug-level watchdog logs `runtime.NumGoroutine()` + `HeapAlloc`/`HeapInuse`/`Sys` in `processor.Start`, so the next silent OOM-kill leaves a diagnosable trend line.

Connection lifecycle
The shared pool holds long-lived goroutines, not NNTP connections. Workers acquire a connection from `uploadPool` only while actively posting and release it back via the existing `nntppool` flow. When no `Post()` is in flight, connections drain on the pool's normal idle schedule.

Test plan
- `go test -race ./internal/poster/... ./internal/queue/... ./internal/processor/... ./internal/watcher/... ./pkg/postie/...`: all green
- `TestSharedUploadPool` with two subtests:
  - Concurrent `Post()` calls with `MaxConnections=4`; asserts peak in-flight article posts ≤ 4.
  - `Close()` drains within 2s and subsequent `Post()` returns `ErrPosterClosed`.
- Run with `MaxConcurrentUploads=2`. If it still dies, capture `dmesg | grep -i killed` to confirm OOM vs. in-process crash.

Fixes #184
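The "peak in-flight ≤ 4" assertion in the test plan needs some way to observe the high-water mark of concurrent posts. One common approach, sketched below, is an atomic in-flight counter plus a compare-and-swap loop that raises a recorded peak. This is not the body of `TestSharedUploadPool`: `peakTracker` and `runBounded` are hypothetical names, and the worker loop is a simplified stand-in for the shared pool.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakTracker records the highest number of simultaneously active operations.
type peakTracker struct {
	inFlight atomic.Int64
	peak     atomic.Int64
}

// enter marks one operation as started and raises the peak if needed.
func (t *peakTracker) enter() {
	n := t.inFlight.Add(1)
	for {
		old := t.peak.Load()
		if n <= old || t.peak.CompareAndSwap(old, n) {
			return
		}
	}
}

// leave marks one operation as finished.
func (t *peakTracker) leave() { t.inFlight.Add(-1) }

// runBounded pushes `jobs` simulated posts through `limit` workers and
// returns the peak number of posts that were in flight at the same time.
func runBounded(limit, jobs int) int64 {
	var t peakTracker
	work := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < limit; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range work {
				t.enter()
				time.Sleep(time.Millisecond) // simulated article post
				t.leave()
			}
		}()
	}
	for i := 0; i < jobs; i++ {
		work <- i
	}
	close(work)
	wg.Wait()
	return t.peak.Load()
}

func main() {
	peak := runBounded(4, 64)
	fmt.Println("peak within limit:", peak <= 4) // peak within limit: true
}
```

In a real test the `enter`/`leave` pair would wrap the fake article-post call, and the assertion would compare the recorded peak against `MaxConnections`.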