fix(poster): share one global worker pool to cap upload concurrency by javi11 · Pull Request #228 · javi11/postie

javi11 · 2026-05-25T18:23:03Z

Summary

Fixes the silent crash reported in #184 where Postie dies during bulk uploads on memory-constrained hosts (e.g. Synology NAS) once MaxConcurrentUploads is raised above 1 with PAR2 enabled. No panic, no postie_crash file → external SIGKILL from the OOM killer.

Root cause

Every Post() call previously built its own conc/pool sized to sum(provider.MaxConnections) (see poster.go:338 pre-change). On top of that, postie.Post runs PAR2 creation+post and main-file post concurrently via errgroup, so the effective concurrent Post() count is 2 × MaxConcurrentUploads. Worst case at MaxConcurrentUploads=4 with 40 connections: ~320 goroutines and ~300 MB of read-ahead body buffers in flight — enough to OOM-kill a 2-4 GB NAS.

Fix

Single shared upload worker pool in poster. Total posting concurrency is now bounded by numOfConnections for the entire process, regardless of how many Post() calls are in flight. All concurrent Post() calls submit uploadJob entries to one channel drained by long-lived workers. NNTP connection lifecycle is unchanged — workers only hold a connection from uploadPool while actively posting an article. Worker startup is lazy via sync.Once so existing struct-literal tests keep working. Close() drains via a shutdown channel.
Buffer pool hygiene — drop oversized read-ahead buffers (> 4× default) on return so one outlier article cannot inflate the pool's working set forever.
Queue TOCTOU — addMu sync.Mutex serializes the check-then-Send sequence in AddFile, AddFileWithPriority, and AddFileWithOptions. goqite.Send uses its own DB handle so a SQL transaction can't span IsPathInQueue + Send; a Go mutex achieves the same atomicity. Left SetMaxOpenConns(1) alone per the explicit design note at queue.go:164.
Runtime watchdog — 10s debug-level log of runtime.NumGoroutine() + HeapAlloc/HeapInuse/Sys in processor.Start, so the next silent OOM-kill leaves a diagnosable trend line.

Connection lifecycle

The shared pool holds long-lived goroutines, not NNTP connections. Workers acquire a connection from uploadPool only while actively posting and release it back via the existing nntppool flow. When no Post() is in flight, connections drain on the pool's normal idle schedule.

Test plan

go test -race ./internal/poster/... ./internal/queue/... ./internal/processor/... ./internal/watcher/... ./pkg/postie/... — all green
New TestSharedUploadPool with two subtests:
- concurrent posts share workers and stay bounded — 8 concurrent Post() calls with MaxConnections=4; asserts peak in-flight article posts ≤ 4.
- Close drains workers and rejects new Post calls — verifies Close() drains within 2s and subsequent Post() returns ErrPosterClosed.
On-target validation: ask reporter (@Oleekae) to retest on the NAS with MaxConcurrentUploads=2. If it still dies, capture dmesg | grep -i killed to confirm OOM vs. in-process crash.

Fixes #184

Each Post() call previously built its own conc/pool sized to the sum of provider MaxConnections, and PAR2 ran in parallel with the main upload via errgroup in postie.Post — so in-flight posting concurrency was 2 × MaxConcurrentUploads × numOfConnections, on top of per-call read-ahead buffers. On memory-constrained hosts (e.g. Synology NAS in #184) this exhausted RAM and the kernel SIGKILL'd the process, leaving no panic or postie_crash file behind. Bound total posting concurrency to numOfConnections by spawning a single shared worker pool in the poster. All concurrent Post() calls submit uploadJob entries to one channel drained by long-lived workers. NNTP connection lifecycle is unchanged — workers only hold a connection while actively posting. Workers init lazily via sync.Once so existing tests that build the struct directly still get a working pool. Close() drains via a shutdown channel. Also: - Drop oversized read-ahead body buffers (> 4× default) on return so one outlier article cannot inflate the pool's working set forever. - Serialize check-then-Send in queue.AddFile{,WithPriority,WithOptions} with a Go-level mutex (goqite.Send uses its own DB handle so a SQL transaction cannot span IsPathInQueue + Send). Closes the TOCTOU window the watcher and processor could race through under heavy load. - Add a 10s debug-level runtime watchdog in processor.Start logging NumGoroutine + heap stats, so the next silent OOM-kill leaves a trend line in the logs. Fixes #184

Pulls in javi11/par2go#6 (commit bc8dbc7) which drains the C SIMD compute workers via proc.End() before proc.Close() on every chunk-loop exit path. Without it, cancelling a PAR2 job in postie aborts the host process with `Assertion failed: (isMultipleOfStride(len)), mul_add_multi_packpf, gf16mul.h:377`. Pseudo-version: revert to a tagged par2go release once one is cut.

Replaces the bc8dbc7 pseudo-version with the tagged v0.0.9 release that contains the proc.End() drain fix from javi11/par2go#6. Resolves the SIMD stride assertion (mul_add_multi_packpf, gf16mul.h:377) that abort()'d the host process when a PAR2 job was cancelled mid-compute.

javi11 added 3 commits May 25, 2026 20:22

javi11 merged commit a318266 into main May 25, 2026
3 checks passed

javi11 deleted the session/nifty-elion-e51e23 branch May 25, 2026 19:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(poster): share one global worker pool to cap upload concurrency#228

fix(poster): share one global worker pool to cap upload concurrency#228
javi11 merged 3 commits into
mainfrom
session/nifty-elion-e51e23

javi11 commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

javi11 commented May 25, 2026

Summary

Root cause

Fix

Connection lifecycle

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant