
feat(slice-2.5): streaming NDJSON writer + rate-limit + 429 retry (M1+M2) #2

Merged
Declade merged 3 commits into main from feat/slice-2.5-hardening
May 17, 2026

Declade (Owner) commented May 17, 2026

Summary

Slice 2.5 hardening of the Lucairn Research Program harness. Closes bug-hunter M1 + M2 ahead of Slice 3 live 500-row dispatch.

PRD (Status: Locked): Opus Advisor/specs/prd-2026-05-17-paper-1-autonomous-finish.md — Slice 2.5 scope at the "Slice 2.5: Hardening for live-run readiness (M1+M2)" section.

What's in this PR

  • M1 — streaming NDJSON writer (scripts/run-pipeline.ts). Replaces buffered-in-memory pattern with createWriteStream + 'open' event fd capture + per-row fsyncSync. Mid-run SIGTERM preserves rows 1..N-1 in the output file. Single-shot stderr warn on fsync failure (operator visibility on tmpfs/network-mount filesystems). Stream cleanup in finally{} on any throw path.

  • M2 — rate-limit + 429 retry (src/gateway-client.ts + scripts/run-pipeline.ts). Adds rateLimitRpm?: number option + --rate-limit-rpm=N CLI flag. Per-row dispatches gate on monotonic performance.now() clock (not wall-clock — immune to NTP step adjustments). HTTP 429 added to retry-eligible status range with Retry-After header parsing (numeric seconds AND HTTP-date forms); honors Math.max(retryAfterMs, computedBackoff). 4xx-non-429 does not retry. Exponential backoff with jitter (base 500ms, jitter 0–200ms, max 2 retries).

  • Downstream consumer tolerance (scripts/collect-certs.ts + scripts/compute-recall.ts). Try/catch around JSON.parse(line) with skip-with-warn (NOT silent-skip). SIGKILL-induced partial tails no longer block the next pipeline step.

  • Gateway 429 surface documented (head-comment in src/gateway-client.ts). Grep'd dual-sandbox-architecture/services/gateway/internal/middleware/ratelimit.go:101-114 + proxy.go:748-755: own rate-limit emits HTTP 429 with Retry-After (handled directly); upstream Anthropic 429s map through circuit-breaker to HTTP 503 with retrySeconds (handled by 5xx classifier).
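
The M2 retry policy described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/gateway-client.ts: the helper names (`parseRetryAfterMs`, `shouldRetry`, `computeDelayMs`) are hypothetical, and the doubling-per-attempt backoff formula is an assumption, since the description only gives the base, the jitter range, and the retry cap. The HTTP-date form of Retry-After is an absolute timestamp, so that branch necessarily compares against a wall-clock "now" passed in by the caller.

```typescript
// Constants taken from the PR description; the formula that combines them is assumed.
const BASE_BACKOFF_MS = 500;
const MAX_JITTER_MS = 200;
const MAX_RETRIES = 2;

/** Parse a Retry-After value (numeric seconds or HTTP-date) into milliseconds. */
function parseRetryAfterMs(header: string | null, wallClockNowMs: number): number | undefined {
  if (header === null) return undefined;
  const value = header.trim();
  // Numeric form: a non-negative integer number of seconds.
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // HTTP-date form: an absolute time, converted to a delay relative to "now".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - wallClockNowMs);
}

/** 429 and 5xx are retry-eligible; every other 4xx is not. */
function shouldRetry(status: number, attempt: number): boolean {
  const eligible = status === 429 || (status >= 500 && status <= 599);
  return eligible && attempt < MAX_RETRIES;
}

/** Delay before retry number `attempt` (0-based): never shorter than Retry-After. */
function computeDelayMs(
  attempt: number,
  retryAfterMs: number | undefined,
  random: () => number = Math.random,
): number {
  const computedBackoff = BASE_BACKOFF_MS * 2 ** attempt + random() * MAX_JITTER_MS;
  return Math.max(retryAfterMs ?? 0, computedBackoff);
}
```

Taking the maximum of the two delays means a short server hint never shortens the client's own backoff, while a long hint such as `Retry-After: 60` is respected in full.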

Reviewer chain — PASS after fix-up at 375f03a

| Reviewer | Pre-fix | Resolution |
| --- | --- | --- |
| bug-hunter-reviewer | 2 BLOCKER + 2 HIGH + 3 MED + 3 LOW + 4 INFO | BLOCKER-1 (fsync silently skipped on async-open path), BLOCKER-2 (consumers throw on partial tail), HIGH-1 (Retry-After ignored), HIGH-2 (SIGTERM test asserts wrong invariant), MEDIUM-1 (wall-clock rate-limit), MEDIUM-3 (silent fsync swallow) all closed in 375f03a. LOW/INFO + MED-2 deferred. |
| claim-enforcement-guard | 0 BLOCKER + 1 WARN | WARN-1 closed ("Anthropic Tier-1" renamed → "Anthropic API rate-limit tier" in 5 occurrences). |
| personal-info-leak-detector | 0 BLOCKER + 0 WARN | PASS clean. |

Acceptance gates

| Gate | Result | Notes |
| --- | --- | --- |
| pnpm install --frozen-lockfile | PASS | |
| pnpm typecheck + pnpm typecheck:test + pnpm build | PASS | |
| pnpm test (46 tests: 40 Slice 2 + 3 Slice 2.5 + 3 Retry-After) | PASS | |
| pnpm run pipeline -- --rows=5 --mock | PASS | regression check |
| pnpm run pipeline -- --rows=10 --mock --rate-limit-rpm=10 | PASS | 54s wall-clock matches 9 × 6s math |
| SIGTERM smoke: kill mid-run → output has rows 0..N-1 consecutive + trailing newline | PASS | empirically verified, 2 runs |
| Banned-literal sweep (25 banned terms) | PASS | 0 hits |
| Banned tier-name sweep | PASS | 0 hits |
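
The 54s figure for the --rate-limit-rpm=10 gate above follows from the interval arithmetic: 10 rpm gives 60_000 / 10 = 6000 ms between dispatches, the first row goes out immediately, and the remaining 9 rows each wait one interval. A minimal sketch of such a gate, assuming a hypothetical `makeRateGate` helper with an injectable clock (the real gate lives in the harness and may be shaped differently):

```typescript
/**
 * Returns a function that reports how many milliseconds the caller should
 * wait before the next dispatch. `nowFn` defaults to the monotonic
 * performance.now() clock, so NTP step adjustments cannot distort intervals.
 */
function makeRateGate(
  rateLimitRpm: number,
  nowFn: () => number = () => performance.now(),
): () => number {
  const intervalMs = 60_000 / rateLimitRpm;
  let nextAllowedAt = -Infinity; // the first dispatch is never delayed

  return function waitMsForNextDispatch(): number {
    const now = nowFn();
    const waitMs = Math.max(0, nextAllowedAt - now);
    // Reserve the next slot one interval after this dispatch's slot.
    nextAllowedAt = Math.max(now, nextAllowedAt) + intervalMs;
    return waitMs;
  };
}

/** Simulate `rows` dispatches against a fake clock and return the total wait. */
function simulateTotalWaitMs(rows: number, rateLimitRpm: number): number {
  let fakeClockMs = 0;
  const gate = makeRateGate(rateLimitRpm, () => fakeClockMs);
  let totalMs = 0;
  for (let i = 0; i < rows; i++) {
    const waitMs = gate();
    totalMs += waitMs;
    fakeClockMs += waitMs; // "sleep" by advancing the fake clock
  }
  return totalMs;
}
```

Under these assumptions, `simulateTotalWaitMs(10, 10)` yields 54000 ms, matching the 9 × 6s math in the table.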

Test plan

  • Codex round 1 substantive-PASS [N/N]
  • All acceptance gates re-run at PR merge (post-Codex-PASS)
  • Slice 3 dispatch begins immediately on merge per the locked PRD

🤖 Generated with Claude Code

Declade added 3 commits May 17, 2026 13:08
…+M2)

Closes bug-hunter M1 + M2 ahead of Slice 3 live 500-row run.

- M1: streaming NDJSON writer in scripts/run-pipeline.ts replaces buffered-in-memory
  pattern. Per-row createWriteStream append + periodic fsync. Mid-run SIGTERM preserves
  rows 1..N-1 in the output file. NDJSON readers (collect-certs, compute-recall) already
  handle empty/malformed lines safely.

- M2: rate-limit/concurrency + 429 retry in src/gateway-client.ts. Adds rateLimitRpm
  option + --rate-limit-rpm CLI flag. Gates per-row dispatches at 60_000/rpm intervals.
  Adds 429 to retry-eligible status range alongside existing 5xx + connection-error
  policy; exponential backoff w/ jitter (base 500ms, jitter 0-200ms, max 2 retries).
  4xx-non-429 does not retry.

3 new tests: streaming-writer SIGTERM survival, rate-limit RPM enforcement, 429 retry
with 4xx-non-429 no-retry. Total tests: 40 -> 43.

Refs: prd-2026-05-17-paper-1-autonomous-finish.md (Slice 2.5).
…-1 + WARN-1)

- BLOCKER-1: capture writeStream.fd via 'open' event so fsync actually fires.
  Previous fd-typeof-number guard silently skipped fsync because createWriteStream
  opens asynchronously; M1 acceptance test was passing by OS-buffering luck, not
  by durability invariant. Closes the same root cause as MEDIUM-3 by adding a
  one-line stderr warn the first time fsync fails (operator visibility).

- BLOCKER-2: wrap JSON.parse in try/catch in scripts/collect-certs.ts and
  scripts/compute-recall.ts with skip-with-warn. Previous throw-on-malformed-JSON
  blocked the next pipeline step when SIGKILL left a partial tail, contradicting
  the M1 recovery contract documented at scripts/run-pipeline.ts:382-386.

- HIGH-1: parse Retry-After header on 429 (numeric seconds OR HTTP-date) and
  honor max(retryAfterMs, computedBackoff). Anthropic's rate-limit responses
  carry Retry-After: 60 type hints; previous fixed-backoff would hit 429 again
  and chip into the 5% failure budget for no good reason. +3 unit tests.

- HIGH-2: SIGTERM test now asserts row_index consecutive from target[0..N-1]
  + file ends with newline (no partial-line tail). Was previously asserting
  only line count + each line is JSON, which passed by luck.

- MEDIUM-1: rate-limit nowFn defaults to performance.now() not Date.now().
  Wall-clock is vulnerable to NTP step adjustments; monotonic clock is the
  correct primitive for per-row interval gating.

- WARN-1: rename "Anthropic Tier-1" to "Anthropic API rate-limit tier" in 5
  code comments. Removes ambiguity with Lucairn's locked Developer/Pro/Enterprise
  tier scheme; repo flips public on Paper 1 ship-day.
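
The stronger invariant from HIGH-2 can be expressed as a small check. This is a sketch with a hypothetical `checkSurvivingPrefix` helper, not the test in the repository; it assumes each row carries a numeric `row_index` field and that the expected order is supplied by the caller.

```typescript
/**
 * Returns a list of problems found in an interrupted run's output.
 * An empty list means the file is a clean prefix of the expected rows:
 * every line parses, row_index values follow targetIndices in order,
 * and the file ends with a newline (no partial-line tail).
 */
function checkSurvivingPrefix(fileText: string, targetIndices: number[]): string[] {
  const problems: string[] = [];
  if (fileText.length > 0 && !fileText.endsWith("\n")) {
    problems.push("file does not end with a newline (partial-line tail)");
  }
  const lines = fileText.split("\n").filter((line) => line !== "");
  lines.forEach((line, i) => {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      problems.push(`line ${i + 1} is not valid JSON`);
      return;
    }
    const rowIndex = (parsed as { row_index?: unknown } | null)?.row_index;
    if (rowIndex !== targetIndices[i]) {
      problems.push(`line ${i + 1}: expected row_index ${targetIndices[i]}, got ${String(rowIndex)}`);
    }
  });
  return problems;
}
```

Checking only the line count and JSON validity would accept a file with a skipped or reordered row; comparing against the expected index sequence does not.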

Bonus: head-comment in src/gateway-client.ts documents the gateway's actual 429
surface based on grep of dual-sandbox-architecture/services/gateway/internal/api/proxy.go.

Deferred to Slice 3 inline or post-publish: MEDIUM-2 (wrapped-429 detection if
the gateway wraps), LOW-1 (--rate-limit-rpm=1.5 truncation), LOW-2 (sliding-window
rate-limit), LOW-3 (test stderr capture). INFOs (PRD wording stale) noted but
no action.

Refs: prd-2026-05-17-paper-1-autonomous-finish.md (Slice 2.5).
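
The skip-with-warn behaviour from BLOCKER-2 might look roughly like the sketch below. `parseNdjsonTolerant` is a hypothetical name, and the warning text is illustrative; the actual consumers in scripts/collect-certs.ts and scripts/compute-recall.ts may differ in detail.

```typescript
interface ParsedNdjson<T> {
  rows: T[];
  skipped: number;
}

/**
 * Parse NDJSON text line by line. A malformed line (for example a partial
 * tail left by SIGKILL) is reported through `warn` and skipped instead of
 * aborting the whole step. Blank lines are ignored without a warning.
 */
function parseNdjsonTolerant<T = unknown>(
  text: string,
  warn: (message: string) => void = (message) => console.warn(message),
): ParsedNdjson<T> {
  const rows: T[] = [];
  let skipped = 0;
  text.split("\n").forEach((line, i) => {
    if (line.trim() === "") return;
    try {
      rows.push(JSON.parse(line) as T);
    } catch (err) {
      skipped += 1;
      warn(`skipping malformed NDJSON line ${i + 1}: ${(err as Error).message}`);
    }
  });
  return { rows, skipped };
}
```

Returning the skipped count alongside the rows lets a caller decide whether one truncated tail is acceptable or whether too many bad lines should still fail the step.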
… return

Codex r1 finding [2] at 375f03a: previous writeLine resolved when
stream.write() returned true (synchronously), BEFORE the write callback
fired. fsync(fd) on the captured fd could then run before the chunk's
data was committed to the kernel page cache, defeating the M1 per-row
durability invariant in practice.

Correct synchronization: resolve only when the write callback fires —
that's when the chunk has been handled by the underlying fs resource
and the data is in the kernel page cache. fsync at that point flushes
it to physical storage as M1 promises.

Back-pressure is naturally embedded in the new shape: if the writable's
internal buffer is full, the next chunk's callback is delayed until
the buffer drains. No separate drain-event handler needed.

Gates green: pnpm typecheck PASS, pnpm test 46/46 PASS, SIGTERM smoke
produces 1 line + trailing newline + row_index=18 (first sorted
ground-truth index).
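
The synchronization described in this commit can be sketched as follows. It is an illustration under assumptions, not the code in scripts/run-pipeline.ts: `openNdjsonWriter` and `demo` are hypothetical names, and the warning text is made up. The fd is captured from the stream's 'open' event, each `writeLine` resolves only inside the write callback, and fsync runs at that point.

```typescript
import { createWriteStream, fsyncSync, mkdtempSync, readFileSync } from "node:fs";
import { once } from "node:events";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface NdjsonWriter {
  writeLine(row: unknown): Promise<void>;
  close(): Promise<void>;
}

async function openNdjsonWriter(path: string): Promise<NdjsonWriter> {
  const stream = createWriteStream(path, { flags: "a" });
  // createWriteStream opens asynchronously; wait for 'open' to get a real fd.
  const [fd] = (await once(stream, "open")) as [number];
  let warnedAboutFsync = false;

  function writeLine(row: unknown): Promise<void> {
    return new Promise((resolve, reject) => {
      // Resolve in the write callback, not on write()'s return value.
      stream.write(JSON.stringify(row) + "\n", (err) => {
        if (err) return reject(err);
        try {
          fsyncSync(fd);
        } catch (e) {
          if (!warnedAboutFsync) {
            warnedAboutFsync = true; // warn once, then stay quiet
            console.warn(`fsync failed, durability not guaranteed: ${String(e)}`);
          }
        }
        resolve();
      });
    });
  }

  function close(): Promise<void> {
    return new Promise((resolve) => stream.end(resolve));
  }

  return { writeLine, close };
}

/** Usage example: write two rows to a temp file and read them back. */
async function demo(): Promise<string> {
  const file = join(mkdtempSync(join(tmpdir(), "ndjson-")), "out.ndjson");
  const writer = await openNdjsonWriter(file);
  try {
    await writer.writeLine({ row_index: 0 });
    await writer.writeLine({ row_index: 1 });
  } finally {
    await writer.close(); // stream cleanup on any throw path
  }
  return readFileSync(file, "utf8");
}
```

Because each call is awaited before the next row is written, at most one chunk is in flight, which is why no separate drain handler appears in this shape.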
@Declade Declade merged commit 88fdb0d into main May 17, 2026
0 of 6 checks passed
@Declade Declade deleted the feat/slice-2.5-hardening branch May 17, 2026 11:39