fix: connection-lifecycle#601
Merged
Merged
Conversation
Owner
|
After merge the first PR this becomed corrupted, can you fix it @iPromKnight ? |
…ness Reworks the streaming and metadata-virtual-file paths to bound retry, goroutine, and connection-slot fan-out under realistic Plex/Jellyfin scrub patterns and provider-recovery storms. Adds a deterministic fakepool test seam so each invariant is pinned by a regression test that asserts MaxInFlight / goroutine / call-count bounds directly. Cross-process budgeting between streams and imports (admission control) is intentionally NOT in this PR — that work belongs in the separate admission-controller PR. This change touches only the connection-lifecycle layer. S1 + S3 (usenet/usenet_reader.go) retry.Attempts(5) -> retry.Attempts(2) so a permanently failing segment produces at most 2 wire calls instead of 5. Replace the 20ms FixedDelay with retry.CombineDelay(BackOffDelay, RandomDelay) + retry.MaxJitter(100ms) so concurrent retriers desync into [50, 150]ms instead of thundering-herd against a recovering provider. ErrArticleNotFound still never retries; DeadlineExceeded still retries immediately on a fresh round-robin connection. S5 + EOF straddle (nzbfilesystem/metadata_remote_file.go) Add randomReadCache: a per-file LRU (size 8) of full-segment bytes for the ephemeral ReadAt path. 200 small reads within an 8-segment hot window now produce 8 wire calls instead of 200. Clamp the slice bound to min(end-off+1, len(p)) so reads that straddle EOF in a partially-filled final segment cannot panic with "slice bounds out of range". S6 (nzbfilesystem/metadata_remote_file.go) Replace closeWg.Go's unbounded goroutine fan-out in closeCurrentReader with a per-file bounded closer pool (closerWorkerCount = 4). A 50-seek scrub burst produces a goroutine delta of ~4 instead of ~50, each holding any in-flight pool slot it had open. When the channel fills, the next Seek runs Close inline as backpressure. S8 (api/speedtest_coordinator.go + provider_speedtest_handler.go) handleTestProviderSpeed previously called nntppool.NewClient inline per HTTP request, opening Provider.MaxConnections fresh slots per invocation. Route through pool.Manager.GetPool when the provider is active; otherwise through a process-wide speedtestCoordinator with singleflight dedupe by providerID and a 5-minute per-provider client cache. A janitor goroutine ticks every clientTTL/5 and evicts expired entries; shutdown is idempotent. S9 (nzbfilesystem/metadata_remote_file.go + usenet/usenet_reader.go) When a streaming HTTP client disconnects mid-read, the request goroutine calls mvf.Close synchronously; a concurrent mvf.Read holds mvf.mu for the full segment-download latency (15s × retries). Add UsenetReader.Interrupt (non-blocking ctx-cancel + cond broadcast + segment closer) and an atomic interruptHandle on MetadataVirtualFile so Close fires the interrupt BEFORE contending for mvf.mu. Close now returns in microseconds regardless of segment latency. Also move streamTracker.Remove + streamID write inside mvf.mu's critical section to satisfy the race detector. Test infrastructure Narrow pool.NntpClient interface so tests can substitute a deterministic fake without standing up real NNTP connections. internal/testsupport/fakepool exposes per-message-ID latency / error injection and an in-flight high-water counter (MaxInFlight) used as the primary observability primitive for the invariants. Companion helpers: testsupport/segments (deterministic payload / message-ID generation) and testsupport/goroutines (snapshot + AssertReturnedToBaseline for closer-accumulation scenarios). Pinned by the new storm tests: - TestStorm_RetryAmplifiesPerMessageCallCount (S1) - TestStorm_RetryUsesFixedDelayInsteadOfExponentialBackoff (S3) - TestStorm_RandomReadAtCreatesEphemeralReaderPerCall (S5) - TestRandomReadCache_EOFReadDoesNotPanic - TestStorm_SeekSpamAccumulatesCloserGoroutines (S6) - TestStorm_SpeedTestBypassesPoolManager (S8) - TestStorm_ClientDisconnectHoldsPoolSlotForUpTo30s (S9) - TestStorm_ConcurrentDisconnectsPinManyGoroutines (S9) go test -race -count=1 clean across the touched packages.
0b9b64f to
c8899e4
Compare
Contributor
Author
yep - rebased on main :) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Connection-lifecycle storms (S1/S3/S5/S6/S8/S9)
Re-opening of #589 (branch was accidentally deleted after merge to dev).
Scope is connection-lifecycle and observability only. Cross-process budgeting between streams and imports is out of scope — that lives in #595.
Fixes
S1 / S3 — retry.Attempts(5) → Attempts(2) with [50, 150]ms jitter, so a permanently failing segment caps at 2 wire calls and concurrent retriers desync against a recovering provider instead of thundering-herd.
S5 — per-file segment LRU on the ephemeral ReadAt path coalesces Plex/Jellyfin scrub bursts (200 small reads in an 8-segment window → 8 wire calls). Plus an EOF-straddle slice-bound fix that was panicking on partially-filled final segments.
S6 — bounded closer-worker pool absorbs seek-spam (50-seek burst → ~4 goroutines instead of ~50, none pinning pool slots).
S8 — speed-test routes through pool.Manager + a singleflight-dedupe coordinator with a per-provider client cache and janitor sweep. Endpoint can no longer be rate-abused into exhausting a provider's connection budget.
S9 — MetadataVirtualFile.Close fires UsenetReader.Interrupt before contending for mvf.mu, so client disconnects mid-read return in microseconds instead of pinning the goroutine + connection slot for the full segment latency. Big one for Plex/Jellyfin scrubbing.
Test infrastructure
Deterministic fakepool seam (per-message latency / error injection, in-flight counters, gate primitive), segments payload generator, goroutines snapshot helper. Every fix is pinned by a regression test asserting MaxInFlight / BodyCalls / goroutine count directly.
Running result
Combined with #595 in my dev branch — solid through 24h+ playback + import workload on the gluetun/k8s setup that originally surfaced the conntrack storms. Go heap steady at ~26MB, no kernel WireGuard instability.
Clean merge against current main.