Skip to content

fix(poster): share one global worker pool to cap upload concurrency#228

Merged
javi11 merged 3 commits into
mainfrom
session/nifty-elion-e51e23
May 25, 2026
Merged

fix(poster): share one global worker pool to cap upload concurrency#228
javi11 merged 3 commits into
mainfrom
session/nifty-elion-e51e23

Conversation

@javi11
Copy link
Copy Markdown
Owner

@javi11 javi11 commented May 25, 2026

Summary

Fixes the silent crash reported in #184 where Postie dies during bulk uploads on memory-constrained hosts (e.g. Synology NAS) once MaxConcurrentUploads is raised above 1 with PAR2 enabled. No panic, no postie_crash file → external SIGKILL from the OOM killer.

Root cause

Every Post() call previously built its own conc/pool sized to sum(provider.MaxConnections) (see poster.go:338 pre-change). On top of that, postie.Post runs PAR2 creation+post and main-file post concurrently via errgroup, so the effective concurrent Post() count is 2 × MaxConcurrentUploads. Worst case at MaxConcurrentUploads=4 with 40 connections: ~320 goroutines and ~300 MB of read-ahead body buffers in flight — enough to OOM-kill a 2-4 GB NAS.

Fix

  • Single shared upload worker pool in poster. Total posting concurrency is now bounded by numOfConnections for the entire process, regardless of how many Post() calls are in flight. All concurrent Post() calls submit uploadJob entries to one channel drained by long-lived workers. NNTP connection lifecycle is unchanged — workers only hold a connection from uploadPool while actively posting an article. Worker startup is lazy via sync.Once so existing struct-literal tests keep working. Close() drains via a shutdown channel.
  • Buffer pool hygiene — drop oversized read-ahead buffers (> 4× default) on return so one outlier article cannot inflate the pool's working set forever.
  • Queue TOCTOUaddMu sync.Mutex serializes the check-then-Send sequence in AddFile, AddFileWithPriority, and AddFileWithOptions. goqite.Send uses its own DB handle so a SQL transaction can't span IsPathInQueue + Send; a Go mutex achieves the same atomicity. Left SetMaxOpenConns(1) alone per the explicit design note at queue.go:164.
  • Runtime watchdog — 10s debug-level log of runtime.NumGoroutine() + HeapAlloc/HeapInuse/Sys in processor.Start, so the next silent OOM-kill leaves a diagnosable trend line.

Connection lifecycle

The shared pool holds long-lived goroutines, not NNTP connections. Workers acquire a connection from uploadPool only while actively posting and release it back via the existing nntppool flow. When no Post() is in flight, connections drain on the pool's normal idle schedule.

Test plan

  • go test -race ./internal/poster/... ./internal/queue/... ./internal/processor/... ./internal/watcher/... ./pkg/postie/... — all green
  • New TestSharedUploadPool with two subtests:
    • concurrent posts share workers and stay bounded — 8 concurrent Post() calls with MaxConnections=4; asserts peak in-flight article posts ≤ 4.
    • Close drains workers and rejects new Post calls — verifies Close() drains within 2s and subsequent Post() returns ErrPosterClosed.
  • On-target validation: ask reporter (@Oleekae) to retest on the NAS with MaxConcurrentUploads=2. If it still dies, capture dmesg | grep -i killed to confirm OOM vs. in-process crash.

Fixes #184

javi11 added 3 commits May 25, 2026 20:22
Each Post() call previously built its own conc/pool sized to the sum of
provider MaxConnections, and PAR2 ran in parallel with the main upload via
errgroup in postie.Post — so in-flight posting concurrency was
2 × MaxConcurrentUploads × numOfConnections, on top of per-call read-ahead
buffers. On memory-constrained hosts (e.g. Synology NAS in #184) this
exhausted RAM and the kernel SIGKILL'd the process, leaving no panic or
postie_crash file behind.

Bound total posting concurrency to numOfConnections by spawning a single
shared worker pool in the poster. All concurrent Post() calls submit
uploadJob entries to one channel drained by long-lived workers. NNTP
connection lifecycle is unchanged — workers only hold a connection while
actively posting. Workers init lazily via sync.Once so existing tests that
build the struct directly still get a working pool. Close() drains via a
shutdown channel.

Also:
 - Drop oversized read-ahead body buffers (> 4× default) on return so one
   outlier article cannot inflate the pool's working set forever.
 - Serialize check-then-Send in queue.AddFile{,WithPriority,WithOptions}
   with a Go-level mutex (goqite.Send uses its own DB handle so a SQL
   transaction cannot span IsPathInQueue + Send). Closes the TOCTOU window
   the watcher and processor could race through under heavy load.
 - Add a 10s debug-level runtime watchdog in processor.Start logging
   NumGoroutine + heap stats, so the next silent OOM-kill leaves a trend
   line in the logs.

Fixes #184
Pulls in javi11/par2go#6 (commit bc8dbc7) which drains the C SIMD compute
workers via proc.End() before proc.Close() on every chunk-loop exit
path. Without it, cancelling a PAR2 job in postie aborts the host
process with `Assertion failed: (isMultipleOfStride(len)),
mul_add_multi_packpf, gf16mul.h:377`.

Pseudo-version: revert to a tagged par2go release once one is cut.
Replaces the bc8dbc7 pseudo-version with the tagged v0.0.9 release that
contains the proc.End() drain fix from javi11/par2go#6. Resolves the
SIMD stride assertion (mul_add_multi_packpf, gf16mul.h:377) that
abort()'d the host process when a PAR2 job was cancelled mid-compute.
@javi11 javi11 merged commit a318266 into main May 25, 2026
3 checks passed
@javi11 javi11 deleted the session/nifty-elion-e51e23 branch May 25, 2026 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Propagation check blocks next article processing and reduces upload throughput

1 participant