feat(slice-2.5): streaming NDJSON writer + rate-limit + 429 retry (M1+M2) #2
Merged
…+M2)

Closes bug-hunter M1 + M2 ahead of Slice 3 live 500-row run.

- M1: streaming NDJSON writer in scripts/run-pipeline.ts replaces buffered-in-memory pattern. Per-row createWriteStream append + periodic fsync. Mid-run SIGTERM preserves rows 1..N-1 in the output file. NDJSON readers (collect-certs, compute-recall) already handle empty/malformed lines safely.
- M2: rate-limit/concurrency + 429 retry in src/gateway-client.ts. Adds rateLimitRpm option + --rate-limit-rpm CLI flag. Gates per-row dispatches at 60_000/rpm intervals. Adds 429 to retry-eligible status range alongside existing 5xx + connection-error policy; exponential backoff w/ jitter (base 500ms, jitter 0-200ms, max 2 retries). 4xx-non-429 does not retry.

3 new tests: streaming-writer SIGTERM survival, rate-limit RPM enforcement, 429 retry with 4xx-non-429 no-retry. Total tests: 40 -> 43.

Refs: prd-2026-05-17-paper-1-autonomous-finish.md (Slice 2.5).
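The 60_000/rpm interval gating in M2 can be sketched roughly as below. This is a minimal illustration, not the code in src/gateway-client.ts: `createIntervalGate`, `reserve`, and `gated` are hypothetical names, and the injectable `nowFn` is assumed to default to the monotonic `performance.now()`.

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical helper (not taken from src/gateway-client.ts): spaces dispatches
// at least 60_000 / rateLimitRpm milliseconds apart.
export function createIntervalGate(
  rateLimitRpm: number,
  nowFn: () => number = () => performance.now(), // monotonic clock by default
) {
  if (!(rateLimitRpm > 0)) {
    throw new RangeError("rateLimitRpm must be a positive number");
  }
  const intervalMs = 60_000 / rateLimitRpm;
  let nextSlotMs = -Infinity; // earliest time the next dispatch may start

  return {
    intervalMs,
    // Reserves the next slot and returns how long the caller should wait for it.
    reserve(): number {
      const now = nowFn();
      const startAt = Math.max(now, nextSlotMs);
      nextSlotMs = startAt + intervalMs;
      return startAt - now;
    },
  };
}

// Usage sketch: wait out the reserved delay before each per-row dispatch.
export async function gated<T>(
  gate: { reserve(): number },
  dispatch: () => Promise<T>,
): Promise<T> {
  const waitMs = gate.reserve();
  if (waitMs > 0) {
    await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
  }
  return dispatch();
}
```

Injecting `nowFn` keeps the gate testable with a fake clock, which is one plausible way to assert RPM enforcement without real sleeps.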
…-1 + WARN-1)

- BLOCKER-1: capture writeStream.fd via 'open' event so fsync actually fires. Previous fd-typeof-number guard silently skipped fsync because createWriteStream opens asynchronously; M1 acceptance test was passing by OS-buffering luck, not by durability invariant. Closes the same root cause as MEDIUM-3 by adding a one-line stderr warn the first time fsync fails (operator visibility).
- BLOCKER-2: wrap JSON.parse in try/catch in scripts/collect-certs.ts and scripts/compute-recall.ts with skip-with-warn. Previous throw-on-malformed-JSON blocked the next pipeline step when SIGKILL left a partial tail, contradicting the M1 recovery contract documented at scripts/run-pipeline.ts:382-386.
- HIGH-1: parse Retry-After header on 429 (numeric seconds OR HTTP-date) and honor max(retryAfterMs, computedBackoff). Anthropic's rate-limit responses carry Retry-After: 60 type hints; previous fixed-backoff would hit 429 again and chip into the 5% failure budget for no good reason. +3 unit tests.
- HIGH-2: SIGTERM test now asserts row_index consecutive from target[0..N-1] + file ends with newline (no partial-line tail). Was previously asserting only line count + each line is JSON, which passed by luck.
- MEDIUM-1: rate-limit nowFn defaults to performance.now() not Date.now(). Wall-clock is vulnerable to NTP step adjustments; monotonic clock is the correct primitive for per-row interval gating.
- WARN-1: rename "Anthropic Tier-1" to "Anthropic API rate-limit tier" in 5 code comments. Removes ambiguity with Lucairn's locked Developer/Pro/Enterprise tier scheme; repo flips public on Paper 1 ship-day.

Bonus: head-comment in src/gateway-client.ts documents the gateway's actual 429 surface based on grep of dual-sandbox-architecture/services/gateway/internal/api/proxy.go.

Deferred to Slice 3 inline or post-publish: MEDIUM-2 (wrapped-429 detection if the gateway wraps), LOW-1 (--rate-limit-rpm=1.5 truncation), LOW-2 (sliding-window rate-limit), LOW-3 (test stderr capture). INFOs (PRD wording stale) noted but no action.

Refs: prd-2026-05-17-paper-1-autonomous-finish.md (Slice 2.5).
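For HIGH-1, the Retry-After handling might look something like the sketch below. The function names (`parseRetryAfterMs`, `computeBackoffMs`, `retryDelayMs`) are hypothetical, and the doubling factor per attempt is an assumption; the commit only states the base, the jitter range, and the max(retryAfterMs, computedBackoff) rule.

```typescript
// Hypothetical helpers illustrating HIGH-1; names are not from src/gateway-client.ts.

// Parses a Retry-After value (numeric seconds or HTTP-date) into milliseconds.
// Returns undefined when the header is absent or unparseable.
export function parseRetryAfterMs(
  header: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (header == null) return undefined;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000; // numeric seconds form
  const dateMs = Date.parse(value); // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Exponential backoff with jitter: base 500 ms, plus 0-200 ms of jitter.
// Assumption: the delay doubles per attempt (attempt 0 -> 500 ms, 1 -> 1000 ms).
export function computeBackoffMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  return 500 * Math.pow(2, attempt) + random() * 200;
}

// Honors the server hint when it is longer than the locally computed backoff.
export function retryDelayMs(
  attempt: number,
  retryAfterHeader: string | null | undefined,
  nowMs: number = Date.now(),
  random: () => number = Math.random,
): number {
  const backoffMs = computeBackoffMs(attempt, random);
  const retryAfterMs = parseRetryAfterMs(retryAfterHeader, nowMs);
  return retryAfterMs === undefined ? backoffMs : Math.max(retryAfterMs, backoffMs);
}
```

Taking the maximum means a short or missing hint never shortens the local backoff, while a hint such as Retry-After: 60 prevents retrying into another 429.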
… return

Codex r1 finding [2] at 375f03a: previous writeLine resolved when stream.write() returned true (synchronously), BEFORE the write callback fired. fsync(fd) on the captured fd could then run before the chunk's data was committed to the kernel page cache, defeating the M1 per-row durability invariant in practice.

Correct synchronization: resolve only when the write callback fires. That is when the chunk has been handled by the underlying fs resource and the data is in the kernel page cache. fsync at that point flushes it to physical storage as M1 promises.

Back-pressure is naturally embedded in the new shape: if the writable's internal buffer is full, the next chunk's callback is delayed until the buffer drains. No separate drain-event handler needed.

Gates green: pnpm typecheck PASS, pnpm test 46/46 PASS, SIGTERM smoke produces 1 line + trailing newline + row_index=18 (first sorted ground-truth index).
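The synchronization described above can be illustrated with a small sketch. It is not the writer in scripts/run-pipeline.ts: `openNdjsonWriter`, `writeLine`, `close`, and `demo` are hypothetical names, and the warning text is invented for illustration. It only shows the shape: capture the fd on 'open', resolve inside the write callback, then fsync.

```typescript
import { createWriteStream, fsyncSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch of a per-row durable NDJSON writer.
export function openNdjsonWriter(path: string) {
  const stream = createWriteStream(path, { flags: "a" });
  let fd: number | undefined;
  let failure: Error | undefined;
  let warned = false;

  // createWriteStream opens asynchronously, so the fd is only known after 'open'.
  stream.once("open", (openedFd: number) => { fd = openedFd; });
  stream.on("error", (err: Error) => { failure = err; });

  // Resolves only after the write callback fires, then fsyncs that row.
  function writeLine(row: unknown): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      if (failure) return reject(failure);
      stream.write(JSON.stringify(row) + "\n", (err) => {
        if (err) return reject(err);
        if (fd !== undefined) {
          try {
            fsyncSync(fd);
          } catch (e) {
            if (!warned) {
              warned = true; // single-shot warning for operator visibility
              console.error(`warn: fsync failed, rows may not be durable: ${String(e)}`);
            }
          }
        }
        resolve();
      });
    });
  }

  // Flushes and closes the stream; intended for a finally block.
  function close(): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      stream.end(() => (failure ? reject(failure) : resolve()));
    });
  }

  return { writeLine, close };
}

// Usage sketch: write two rows to a temp file and read the lines back.
export async function demo(): Promise<string[]> {
  const file = join(mkdtempSync(join(tmpdir(), "ndjson-")), "out.ndjson");
  const writer = openNdjsonWriter(file);
  try {
    await writer.writeLine({ row_index: 0 });
    await writer.writeLine({ row_index: 1 });
  } finally {
    await writer.close();
  }
  return readFileSync(file, "utf8").split("\n");
}
```

Because the caller awaits each `writeLine`, at most one chunk is in flight, so a slow disk delays the next row instead of growing an in-memory buffer.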
Summary

Slice 2.5 hardening of the Lucairn Research Program harness. Closes bug-hunter M1 + M2 ahead of Slice 3 live 500-row dispatch.

PRD (Status: Locked): Opus Advisor/specs/prd-2026-05-17-paper-1-autonomous-finish.md, Slice 2.5 scope at the "Slice 2.5: Hardening for live-run readiness (M1+M2)" section.

What's in this PR

- M1: streaming NDJSON writer (scripts/run-pipeline.ts). Replaces buffered-in-memory pattern with createWriteStream + 'open' event fd capture + per-row fsyncSync. Mid-run SIGTERM preserves rows 1..N-1 in the output file. Single-shot stderr warn on fsync failure (operator visibility on tmpfs/network-mount filesystems). Stream cleanup in finally{} on any throw path.
- M2: rate-limit + 429 retry (src/gateway-client.ts + scripts/run-pipeline.ts). Adds rateLimitRpm?: number option + --rate-limit-rpm=N CLI flag. Per-row dispatches gate on monotonic performance.now() clock (not wall-clock, so immune to NTP step adjustments). HTTP 429 added to retry-eligible status range with Retry-After header parsing (numeric seconds AND HTTP-date forms); honors Math.max(retryAfterMs, computedBackoff). 4xx-non-429 does not retry. Exponential backoff with jitter (base 500ms, jitter 0–200ms, max 2 retries).
- Downstream consumer tolerance (scripts/collect-certs.ts + scripts/compute-recall.ts). Try/catch around JSON.parse(line) with skip-with-warn (NOT silent-skip). SIGKILL-induced partial tails no longer block the next pipeline step.
- Gateway 429 surface documented (head-comment in src/gateway-client.ts). Grep'd dual-sandbox-architecture/services/gateway/internal/middleware/ratelimit.go:101-114 + proxy.go:748-755: own rate-limit emits HTTP 429 with Retry-After (handled directly); upstream Anthropic 429s map through circuit-breaker to HTTP 503 with retrySeconds (handled by 5xx classifier).
- Reviewer chain: PASS after fix-up at 375f03a. LOW/INFO + MED-2 deferred.

Acceptance gates

- pnpm install --frozen-lockfile
- pnpm typecheck + pnpm typecheck:test + pnpm build
- pnpm test (46 tests: 40 Slice 2 + 3 Slice 2.5 + 3 Retry-After)
- pnpm run pipeline -- --rows=5 --mock
- pnpm run pipeline -- --rows=10 --mock --rate-limit-rpm=10

Test plan
🤖 Generated with Claude Code