fix(entity): bound bootstrap fanout to prevent Worker OOM (closes J-002, J-003) by klappy · Pull Request #18 · klappy/aquifer-mcp

klappy · 2026-04-23T14:28:57Z

Why

PR #17 (J-002 H8) enabled Workers Logs. With per-minute outcome data now flowing, the J-002 503 has a definitive root cause:

When (UTC)	Outcome	CPU	Subrequests	Notes
2026-04-23T04:40:00Z	`exceededMemory`	2.82ms	255	post-deploy seed entity:Paul
2026-04-23T13:55:00Z	`clientDisconnected`	3.75ms	255	organic call, client gave up at 30s

Both invocations carry an identical 255-subrequest signature. The mechanism is bootstrapEntityMatches doing a two-level uncapped fanout — Promise.allSettled over all 33 served resources, then nested Promise.allSettled over every content file in each. Every parsed JSON article body is held in memory until the outer allSettled resolves; that working set exceeds the 128 MB Worker budget, Cloudflare kills the isolate, the edge returns 503.

The previously-known "entity tool consistently hits Cloudflare resource limits (Error 1102)" memory note is now upgraded from circumstantial to directly observed, with the failure mode named (exceededMemory, not exceededCpu) and the trigger quantified (255 subrequests).

What

src/tools.ts:

New helper settledInChunks<T,R>(items, chunkSize, fn) — same return shape as Promise.allSettled but processes items in awaited batches so the working-set memory scales with concurrency, not with corpus size.
bootstrapEntityMatches now uses settledInChunks at both fanout levels:
- resource-level chunk = 4 (was unlimited, ~33 in flight)
- file-level chunk = 8 (was unlimited, ~7 per resource × 33)
- max simultaneous fetches now 32 (was ~255)
Adds a 20s wall-clock deadline to the bootstrap. If the deadline trips, partial matches are returned to the caller (so they get a useful answer, not a 30s client timeout) but partial results are NOT cached, so a later complete invocation still gets to memoize correctly.
Tracer span includes (BUDGET_EXCEEDED, partial) annotation when the deadline hit, so the x-aquifer-trace header surfaces it.

src/tools.test.ts:

5 new unit tests covering chunk-size cap, PromiseSettledResult.rejected shape preservation, empty input, chunkSize <= 0 validation, and original-index passthrough to the callback.

odd/ledger/journal.md:

Appends J-003. Closes J-002. Opens three handoffs:
- H11 — canonical fix: populate index.entity exhaustively in buildIndex so bootstrap becomes dead code (this PR is the surgical fix; H11 is the architectural one).
- H12 — audit other Promise.all*(items.map(...)) sites with the new helper. fetchAllRepoShas (62-way GitHub fanout) is the next obvious candidate.
- H13 — open question on minting a long-lived CF token with Workers Observability: Read + storing in repo Actions secrets for future agentic investigations.

Verification

Locally (npm ci && npm run build && npm run test with GITHUB_TOKEN set, mirroring .github/workflows/ci.yml build-test):

wrangler deploy --dry-run: passes
vitest run: 151/151 passing (146 existing + 5 new). The 2 coverage tests that fail on a clean clone without GITHUB_TOKEN also pass when the token is set.

After merge, two-step validation:

Confirm Workers Builds: aquifer-mcp redeploys (modified_on advances).
Trigger an entity:* call against an entity guaranteed not to be in the warm cache (e.g., entity entity_id=person:Obadiah or any rarely-touched figure). Worker should respond in well under 30s with either a real result or a non-poisoning partial. GraphQL Analytics outcome should remain success regardless of upstream variance.

Risk

Low. Additive helper, replaces two Promise.allSettled call sites, all existing tests pass, and the hot path (entity already in pre-built index, or in R2 bootstrap cache) is unchanged — only the cold-corpus-scan path is bounded. Worst case for an entity that genuinely doesn't exist in the corpus: bootstrap walks the chunked fanout to completion (slower wall-clock than before because batches are sequential, but still well under the 20s budget) and writes an empty cache entry preventing future re-walks.

Mode trail

Investigation conducted under exploration → execution. Token-scoped GraphQL query revealed the outcome=exceededMemory row and the 255-subrequest count. Root cause located by direct read of src/tools.ts. Fix designed for minimum surface area (helper + chunked fanout + deadline; no architectural change). Tests added before push. Journal entry hand-written; encoder defect H7 from J-002 is still outstanding.

Note

Medium Risk
Changes the entity/entity-search cold path to enforce concurrency limits and wall-clock deadlines, which can alter result completeness and caching behavior under load despite added transparency and tests.

Overview
Fixes the entity tool’s cold-path bootstrap scan to avoid Cloudflare Worker OOMs by replacing unbounded nested Promise.allSettled fanout with a new settledInChunks helper, capped concurrency (4 resources × 8 files) and a 20s wall-clock budget.

The bootstrap now returns a structured BootstrapEntityResult (including complete, scan counters, failed_files, and budget_exceeded) and both handleEntity and entity-based search append an explicit partial-result disclosure when the scan is incomplete; only fully complete results are written to the bootstrap cache. Extensive new unit tests cover chunking behavior and multiple partial/failed-fetch transparency and caching edge cases, and the journal is updated with the incident root-cause/fix narrative (J-003).

^{Reviewed by Cursor Bugbot for commit dbd3fe3. Bugbot is set up for automated code reviews on this repo. Configure here.}

…02, J-003) The entity tool's cold-path `bootstrapEntityMatches` did an uncapped two-level fanout — Promise.allSettled over all 33 served resources × Promise.allSettled over every content file in each — producing ~255 simultaneous R2 reads with every parsed JSON article body retained in memory until the outer allSettled resolved. With normal upstream sizes that working set exceeded the Cloudflare Worker 128MB memory budget, surfacing as Cloudflare-edge-injected 503s with `outcome: exceededMemory` (Error 1102 family). Workers Logs (enabled by PR #17 / J-002 H8) confirmed the mechanism: two non-success invocations in the 9.5 hours since that deploy, both with identically 255 subrequests — one outcome=exceededMemory at 04:40Z and one outcome=clientDisconnected at 13:55Z (same root cause expressed as either OOM or client timeout depending on which limit hits first). This change introduces a bounded-concurrency helper `settledInChunks` and applies it to both fanout levels of bootstrapEntityMatches: - Resource-level chunk size: 4 (was unlimited, ~33 in flight) - File-level chunk size: 8 (was unlimited, ~7 per resource × 33) - Maximum simultaneous fetches: 4 × 8 = 32 (was ~255) - Working-set memory now scales with concurrency, not corpus size Also adds a 20s wall-clock deadline on the bootstrap. If the deadline trips, whatever matches were collected so far are returned to the caller (so the client gets a useful partial answer rather than a 30s timeout) but the partial result is NOT written to the R2 bootstrap cache, so a later invocation that completes successfully still gets to memoize the full result. Tests: - All 151 existing tests still pass (verified locally with GITHUB_TOKEN set for the GitHub-API-dependent coverage tests) - 5 new unit tests for settledInChunks cover: chunk size cap is honored, PromiseSettledResult.rejected shape preserved, empty input, chunkSize<=0 validation, and original index passthrough to callback Journal: - Appends J-003 to odd/ledger/journal.md with full Observation/Learning/ Decision/Constraint/Handoff. Closes J-002 directly. Opens H11 (canonical fix: populate index.entity exhaustively in buildIndex so bootstrap becomes dead code), H12 (audit other fanout sites with the new helper), H13 (long-lived CF observability token for future agentic investigations). Risk: low. The change is additive (new helper function, replaces two allSettled call sites), the existing test suite still passes, and the behavior under non-pathological loads (entity already in pre-built index, or already in R2 bootstrap cache) is unchanged — only the cold-corpus-scan path is bounded.

cloudflare-workers-and-pages · 2026-04-23T14:29:02Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	aquifer-mcp	`dbd3fe3`	Commit Preview URL Branch Preview URL	Apr 23 2026, 08:57 PM

Second commit on this branch. The first bounded the OOM mechanism by chunking the bootstrap fanout. This commit closes the derivative silent-truth-loss bug: when the wall-clock deadline trips mid-bootstrap, callers were receiving a truncated ArticleRef[] with no signal that the scan was incomplete, and emitting "Found N article(s)" — silently presenting a partial result as authoritative. Axiom 4 violation. Changes: src/tools.ts (+178 / -28): - New `BootstrapEntityResult` interface (exported) with explicit completeness fields: matches, complete, scanned_resources, total_resources, scanned_files, total_files_estimate, budget_exceeded, duration_ms. - bootstrapEntityMatches return type changes from Promise<ArticleRef[]> to Promise<BootstrapEntityResult>. Cache-hit fast path returns the new shape with complete=true. - Per-resource scan accounting (scannedResources, scannedFiles, totalFilesEstimate) tracked through the bounded fanout. A resource counts as fully scanned only when no deadline trip happened mid-batch. - New formatPartialBootstrapNote(result) helper produces a single user-facing disclosure string when complete=false. Both handleEntity and searchByEntity now append it. - Tracer span now carries `scanned/total resources, scanned/total files, N matches (PARTIAL: budget_exceeded)` so dashboard observability sees the same information as the user. - Cache write gate unchanged: only `complete && deduped.length > 0` results are written to R2. src/tools.test.ts (+101 / -1): - Imports BootstrapEntityResult and bootstrapEntityMatches. - 4 new tests in describe("BootstrapEntityResult transparency"): - cold-path empty scan returns complete=true (no partial note) - handleEntity empty-result text contains no partial note in complete case - BootstrapEntityResult shape contract enforced at runtime - cache-hit fast path returns complete=true with cached payload odd/ledger/journal.md (+18 lines): - Appends an Update sub-section to J-003 capturing the transparency revision rationale, the alternatives weighed during oddkit_challenge, and the vodka constraint check (helper stays generic, no domain branches added, server growth lags KB growth). Verification: - npm ci && npm run build && npm run test (with GITHUB_TOKEN set, mirroring CI build-test job) - 155/155 tests pass (was 151 on prior commit, +4 transparency tests) - wrangler deploy --dry-run: clean, no binding/compatibility-flag changes - typecheck: 11 ResourceMetadata cast errors UNCHANGED from pre-existing main state (CI does not run typecheck) - Manual node-render of the partial-note formatter confirms clean user-visible text with unambiguous warning glyph Vodka constraint check: - Helper is generic over <T,R> — no Bible/aquifer/entity opinion inside. - Constants (ENTITY_BOOTSTRAP_*) describe the call site mechanically, not domain content. - BootstrapEntityResult fields describe the scan, not the corpus content. - Server LOC growth ~9% for this safety net vs KB growth ~88% toward target coverage. Server still grows slower than what it serves. - No new `if (resource_type === ...)` branches anywhere. oddkit validate run produced NEEDS_ARTIFACTS on the bare validate call; this commit message + the test output capture + the journal Update sub-section provide the artifacts the validate verdict was asking for.

Cursor Bugbot reviewed commit 3ff499a and surfaced 5 issues. The Cursor Agent autofix in dafe73c closed 2 of them (no-repoSha guard + tracer label taxonomy). This commit closes the remaining 3, including the only High-severity finding. Findings closed: #3 (Medium) — Deadline cannot stop active fetch batches The before-launch deadline check in the file callback didn't help once settledInChunks had already started a slow batch — those fetches ran to completion and could push the worker past the 30s client timeout. Fix: introduce withinDeadline<T>(p) that races each fetch against the remaining wall-clock budget. Losing the race rejects with DEADLINE, which the file callback catches and counts as failedFiles++. #4 (Medium) — Complete scans mislabeled when budget exceeded `complete = !budgetExceeded && scannedResources === total` forced complete=false even when every resource finished right at the deadline boundary. Fix: drop !budgetExceeded from the formula. Completion is now determined by counters: resourcesPartial === 0 && resourcesSkipped === 0 && failedFiles === 0. budget_exceeded retained as an independent diagnostic field for telemetry. #5 (HIGH) — Failed file fetches still marked complete Promise.allSettled rejected entries were silently dropped from refs but the resource was still counted as scanned, so complete=true could ship with data from missing content files. Fix: replace the single scannedResources counter with three explicit per-resource states — resourcesComplete, resourcesPartial, resourcesSkipped — and only count a resource as complete when EVERY file fetch resolved. Schema/type changes: - BootstrapEntityResult gains `failed_files: number` so callers and tests can read the failure count directly. - formatPartialBootstrapNote now branches on a 3-way reason taxonomy matching the tracer label exactly: budget_exceeded → 'lookup deadline reached (Ns)'; failed_files > 0 → 'N content file fetch(es) failed'; else 'scan incomplete'. User-facing prose and trace headers now agree on diagnosis, so an operator reading logs sees the same explanation the user saw. Test coverage: 4 new tests in describe('BootstrapEntityResult — Bugbot fix coverage'): - (#5 High) failed file fetch makes complete=false with failed_files>0 - (#5 High) failed-fetch result is NOT cached (no entity/* putJSON call) - (#2 Low) regression: no-repoSha resource doesn't drag complete to false - (#1 Low) tracer reason taxonomy matches user-facing note taxonomy (verified via handleEntity → 'fetch(es) failed' substring in output) Verification: - npm ci && npm run build && npm run test (with GITHUB_TOKEN set, mirroring CI build-test job) - 159/159 tests pass (was 155 on prior commit, +4 Bugbot-fix tests) - wrangler deploy --dry-run: clean, no binding/compatibility-flag changes - Manual node-render of all 3 partial-note branches (budget, failed, generic) confirms cleanly distinguishable user-visible text per cause Vodka: helper still generic over <T,R>, withinDeadline is a generic Promise utility with no domain knowledge, no new domain branches. Journal: appends second Update sub-section to J-003 covering the Bugbot-driven fixes, including the meta-lesson that explicit-data-flow types are necessary but not sufficient — adversarial review (human or bot) is what catches implementations that quietly violate their own type contract.

…y return

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit c05ef57. Configure here.}

…tial note Final fix in PR #18's bug-fix arc. Cursor Bugbot's third review surfaced 4 findings against commit 6546fc1. Cursor Agent autofixed 3 of them in commits 0a0694d, 12a012d, c05ef57 (shape-test contract assertion; late-rejection absorption in the race path; abandoned-promise absorption in the early-return path — all mechanically clean). This commit closes the remaining finding #9 (Low): formatPartialBootstrapNote appended `failedSuffix = ", N failed"` unconditionally whenever failed_files > 0, which duplicated the count text in the failed_files-only branch where the `reason` string already said it. Example before: "3 content file fetch(es) failed. Scanned 30/33 resources and 217/230 files, 3 failed. Additional matches may exist..." ^ count duplicated Fix: gate failedSuffix on the budget_exceeded branch only. That's the branch where the primary reason text ("lookup deadline reached (Ns)") doesn't carry the count, so the suffix is the user's only signal. In the failed_files-only branch the count is already in `reason`; adding it again is noise. After fix, all 4 partial-note branches render distinguishable, single- mention text: - budget only: "lookup deadline reached (20.1s). Scanned 4/33 resources and 17/230 files." - budget + failures (suffix carries the count): "lookup deadline reached (20.1s). Scanned 4/33 resources and 15/230 files, 2 failed." - failures only (count in reason, suffix suppressed): "3 content file fetch(es) failed. Scanned 30/33 resources and 217/230 files." - catch-all: "scan incomplete. Scanned 5/33 resources and 34/230 files." Tests: 2 new regression tests in describe("formatPartialBootstrapNote — output text invariants"). One asserts via regex match-count that the failure number appears exactly once in the failed_files-only output; the other documents the no-failures-no-deadline invariant. 161/161 passing locally (was 159). Verification: - npm ci && npm run build && npm run test (with GITHUB_TOKEN set) - 161/161 tests pass - wrangler deploy --dry-run: clean, no binding changes - Manual node-render of all 4 branches confirms intended user-facing text Vodka: helper still local to formatPartialBootstrapNote, no domain opinion added, no helper count growth. Journal: appends final Update sub-section to J-003 covering the full 9-finding bug-fix arc (1 High, 3 Medium, 5 Low; 6 autofix, 3 manual) and the meta-lesson — type contract is necessary but not sufficient; adversarial review is what prevents implementations from quietly violating their own promises.