fix(entity): bound bootstrap fanout to prevent Worker OOM (closes J-002, J-003)#18
Merged
fix(entity): bound bootstrap fanout to prevent Worker OOM (closes J-002, J-003)#18
Conversation
…02, J-003) The entity tool's cold-path `bootstrapEntityMatches` did an uncapped two-level fanout — Promise.allSettled over all 33 served resources × Promise.allSettled over every content file in each — producing ~255 simultaneous R2 reads with every parsed JSON article body retained in memory until the outer allSettled resolved. With normal upstream sizes that working set exceeded the Cloudflare Worker 128MB memory budget, surfacing as Cloudflare-edge-injected 503s with `outcome: exceededMemory` (Error 1102 family). Workers Logs (enabled by PR #17 / J-002 H8) confirmed the mechanism: two non-success invocations in the 9.5 hours since that deploy, both with identically 255 subrequests — one outcome=exceededMemory at 04:40Z and one outcome=clientDisconnected at 13:55Z (same root cause expressed as either OOM or client timeout depending on which limit hits first). This change introduces a bounded-concurrency helper `settledInChunks` and applies it to both fanout levels of bootstrapEntityMatches: - Resource-level chunk size: 4 (was unlimited, ~33 in flight) - File-level chunk size: 8 (was unlimited, ~7 per resource × 33) - Maximum simultaneous fetches: 4 × 8 = 32 (was ~255) - Working-set memory now scales with concurrency, not corpus size Also adds a 20s wall-clock deadline on the bootstrap. If the deadline trips, whatever matches were collected so far are returned to the caller (so the client gets a useful partial answer rather than a 30s timeout) but the partial result is NOT written to the R2 bootstrap cache, so a later invocation that completes successfully still gets to memoize the full result. Tests: - All 151 existing tests still pass (verified locally with GITHUB_TOKEN set for the GitHub-API-dependent coverage tests) - 5 new unit tests for settledInChunks cover: chunk size cap is honored, PromiseSettledResult.rejected shape preserved, empty input, chunkSize<=0 validation, and original index passthrough to callback Journal: - Appends J-003 to odd/ledger/journal.md with full Observation/Learning/ Decision/Constraint/Handoff. Closes J-002 directly. Opens H11 (canonical fix: populate index.entity exhaustively in buildIndex so bootstrap becomes dead code), H12 (audit other fanout sites with the new helper), H13 (long-lived CF observability token for future agentic investigations). Risk: low. The change is additive (new helper function, replaces two allSettled call sites), the existing test suite still passes, and the behavior under non-pathological loads (entity already in pre-built index, or already in R2 bootstrap cache) is unchanged — only the cold-corpus-scan path is bounded.
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
aquifer-mcp | dbd3fe3 | Commit Preview URL Branch Preview URL |
Apr 23 2026, 08:57 PM |
Second commit on this branch. The first bounded the OOM mechanism by
chunking the bootstrap fanout. This commit closes the derivative
silent-truth-loss bug: when the wall-clock deadline trips mid-bootstrap,
callers were receiving a truncated ArticleRef[] with no signal that the
scan was incomplete, and emitting "Found N article(s)" — silently
presenting a partial result as authoritative. Axiom 4 violation.
Changes:
src/tools.ts (+178 / -28):
- New `BootstrapEntityResult` interface (exported) with explicit
completeness fields: matches, complete, scanned_resources,
total_resources, scanned_files, total_files_estimate, budget_exceeded,
duration_ms.
- bootstrapEntityMatches return type changes from Promise<ArticleRef[]> to
Promise<BootstrapEntityResult>. Cache-hit fast path returns the new
shape with complete=true.
- Per-resource scan accounting (scannedResources, scannedFiles,
totalFilesEstimate) tracked through the bounded fanout. A resource
counts as fully scanned only when no deadline trip happened mid-batch.
- New formatPartialBootstrapNote(result) helper produces a single
user-facing disclosure string when complete=false. Both handleEntity
and searchByEntity now append it.
- Tracer span now carries `scanned/total resources, scanned/total files,
N matches (PARTIAL: budget_exceeded)` so dashboard observability sees
the same information as the user.
- Cache write gate unchanged: only `complete && deduped.length > 0`
results are written to R2.
src/tools.test.ts (+101 / -1):
- Imports BootstrapEntityResult and bootstrapEntityMatches.
- 4 new tests in describe("BootstrapEntityResult transparency"):
- cold-path empty scan returns complete=true (no partial note)
- handleEntity empty-result text contains no partial note in complete case
- BootstrapEntityResult shape contract enforced at runtime
- cache-hit fast path returns complete=true with cached payload
odd/ledger/journal.md (+18 lines):
- Appends an Update sub-section to J-003 capturing the transparency
revision rationale, the alternatives weighed during oddkit_challenge,
and the vodka constraint check (helper stays generic, no domain
branches added, server growth lags KB growth).
Verification:
- npm ci && npm run build && npm run test (with GITHUB_TOKEN set,
mirroring CI build-test job)
- 155/155 tests pass (was 151 on prior commit, +4 transparency tests)
- wrangler deploy --dry-run: clean, no binding/compatibility-flag changes
- typecheck: 11 ResourceMetadata cast errors UNCHANGED from pre-existing
main state (CI does not run typecheck)
- Manual node-render of the partial-note formatter confirms clean
user-visible text with unambiguous warning glyph
Vodka constraint check:
- Helper is generic over <T,R> — no Bible/aquifer/entity opinion inside.
- Constants (ENTITY_BOOTSTRAP_*) describe the call site mechanically,
not domain content.
- BootstrapEntityResult fields describe the scan, not the corpus content.
- Server LOC growth ~9% for this safety net vs KB growth ~88% toward
target coverage. Server still grows slower than what it serves.
- No new `if (resource_type === ...)` branches anywhere.
oddkit validate run produced NEEDS_ARTIFACTS on the bare validate call;
this commit message + the test output capture + the journal Update
sub-section provide the artifacts the validate verdict was asking for.
Cursor Bugbot reviewed commit 3ff499a and surfaced 5 issues. The Cursor Agent autofix in dafe73c closed 2 of them (no-repoSha guard + tracer label taxonomy). This commit closes the remaining 3, including the only High-severity finding. Findings closed: #3 (Medium) — Deadline cannot stop active fetch batches The before-launch deadline check in the file callback didn't help once settledInChunks had already started a slow batch — those fetches ran to completion and could push the worker past the 30s client timeout. Fix: introduce withinDeadline<T>(p) that races each fetch against the remaining wall-clock budget. Losing the race rejects with DEADLINE, which the file callback catches and counts as failedFiles++. #4 (Medium) — Complete scans mislabeled when budget exceeded `complete = !budgetExceeded && scannedResources === total` forced complete=false even when every resource finished right at the deadline boundary. Fix: drop !budgetExceeded from the formula. Completion is now determined by counters: resourcesPartial === 0 && resourcesSkipped === 0 && failedFiles === 0. budget_exceeded retained as an independent diagnostic field for telemetry. #5 (HIGH) — Failed file fetches still marked complete Promise.allSettled rejected entries were silently dropped from refs but the resource was still counted as scanned, so complete=true could ship with data from missing content files. Fix: replace the single scannedResources counter with three explicit per-resource states — resourcesComplete, resourcesPartial, resourcesSkipped — and only count a resource as complete when EVERY file fetch resolved. Schema/type changes: - BootstrapEntityResult gains `failed_files: number` so callers and tests can read the failure count directly. - formatPartialBootstrapNote now branches on a 3-way reason taxonomy matching the tracer label exactly: budget_exceeded → 'lookup deadline reached (Ns)'; failed_files > 0 → 'N content file fetch(es) failed'; else 'scan incomplete'. User-facing prose and trace headers now agree on diagnosis, so an operator reading logs sees the same explanation the user saw. Test coverage: 4 new tests in describe('BootstrapEntityResult — Bugbot fix coverage'): - (#5 High) failed file fetch makes complete=false with failed_files>0 - (#5 High) failed-fetch result is NOT cached (no entity/* putJSON call) - (#2 Low) regression: no-repoSha resource doesn't drag complete to false - (#1 Low) tracer reason taxonomy matches user-facing note taxonomy (verified via handleEntity → 'fetch(es) failed' substring in output) Verification: - npm ci && npm run build && npm run test (with GITHUB_TOKEN set, mirroring CI build-test job) - 159/159 tests pass (was 155 on prior commit, +4 Bugbot-fix tests) - wrangler deploy --dry-run: clean, no binding/compatibility-flag changes - Manual node-render of all 3 partial-note branches (budget, failed, generic) confirms cleanly distinguishable user-visible text per cause Vodka: helper still generic over <T,R>, withinDeadline is a generic Promise utility with no domain knowledge, no new domain branches. Journal: appends second Update sub-section to J-003 covering the Bugbot-driven fixes, including the meta-lesson that explicit-data-flow types are necessary but not sufficient — adversarial review (human or bot) is what catches implementations that quietly violate their own type contract.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c05ef57. Configure here.
…tial note Final fix in PR #18's bug-fix arc. Cursor Bugbot's third review surfaced 4 findings against commit 6546fc1. Cursor Agent autofixed 3 of them in commits 0a0694d, 12a012d, c05ef57 (shape-test contract assertion; late-rejection absorption in the race path; abandoned-promise absorption in the early-return path — all mechanically clean). This commit closes the remaining finding #9 (Low): formatPartialBootstrapNote appended `failedSuffix = ", N failed"` unconditionally whenever failed_files > 0, which duplicated the count text in the failed_files-only branch where the `reason` string already said it. Example before: "3 content file fetch(es) failed. Scanned 30/33 resources and 217/230 files, 3 failed. Additional matches may exist..." ^ count duplicated Fix: gate failedSuffix on the budget_exceeded branch only. That's the branch where the primary reason text ("lookup deadline reached (Ns)") doesn't carry the count, so the suffix is the user's only signal. In the failed_files-only branch the count is already in `reason`; adding it again is noise. After fix, all 4 partial-note branches render distinguishable, single- mention text: - budget only: "lookup deadline reached (20.1s). Scanned 4/33 resources and 17/230 files." - budget + failures (suffix carries the count): "lookup deadline reached (20.1s). Scanned 4/33 resources and 15/230 files, 2 failed." - failures only (count in reason, suffix suppressed): "3 content file fetch(es) failed. Scanned 30/33 resources and 217/230 files." - catch-all: "scan incomplete. Scanned 5/33 resources and 34/230 files." Tests: 2 new regression tests in describe("formatPartialBootstrapNote — output text invariants"). One asserts via regex match-count that the failure number appears exactly once in the failed_files-only output; the other documents the no-failures-no-deadline invariant. 161/161 passing locally (was 159). Verification: - npm ci && npm run build && npm run test (with GITHUB_TOKEN set) - 161/161 tests pass - wrangler deploy --dry-run: clean, no binding changes - Manual node-render of all 4 branches confirms intended user-facing text Vodka: helper still local to formatPartialBootstrapNote, no domain opinion added, no helper count growth. Journal: appends final Update sub-section to J-003 covering the full 9-finding bug-fix arc (1 High, 3 Medium, 5 Low; 6 autofix, 3 manual) and the meta-lesson — type contract is necessary but not sufficient; adversarial review is what prevents implementations from quietly violating their own promises.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Why
PR #17 (J-002 H8) enabled Workers Logs. With per-minute outcome data now flowing, the J-002 503 has a definitive root cause:
exceededMemoryclientDisconnectedBoth invocations carry an identical 255-subrequest signature. The mechanism is
bootstrapEntityMatchesdoing a two-level uncapped fanout —Promise.allSettledover all 33 served resources, then nestedPromise.allSettledover every content file in each. Every parsed JSON article body is held in memory until the outerallSettledresolves; that working set exceeds the 128 MB Worker budget, Cloudflare kills the isolate, the edge returns 503.The previously-known "entity tool consistently hits Cloudflare resource limits (Error 1102)" memory note is now upgraded from circumstantial to directly observed, with the failure mode named (
exceededMemory, notexceededCpu) and the trigger quantified (255 subrequests).What
src/tools.ts:settledInChunks<T,R>(items, chunkSize, fn)— same return shape asPromise.allSettledbut processes items in awaited batches so the working-set memory scales with concurrency, not with corpus size.bootstrapEntityMatchesnow usessettledInChunksat both fanout levels:(BUDGET_EXCEEDED, partial)annotation when the deadline hit, so thex-aquifer-traceheader surfaces it.src/tools.test.ts:PromiseSettledResult.rejectedshape preservation, empty input,chunkSize <= 0validation, and original-index passthrough to the callback.odd/ledger/journal.md:index.entityexhaustively inbuildIndexso bootstrap becomes dead code (this PR is the surgical fix; H11 is the architectural one).Promise.all*(items.map(...))sites with the new helper.fetchAllRepoShas(62-way GitHub fanout) is the next obvious candidate.Workers Observability: Read+ storing in repo Actions secrets for future agentic investigations.Verification
Locally (
npm ci && npm run build && npm run testwithGITHUB_TOKENset, mirroring.github/workflows/ci.yml build-test):wrangler deploy --dry-run: passesvitest run: 151/151 passing (146 existing + 5 new). The 2 coverage tests that fail on a clean clone withoutGITHUB_TOKENalso pass when the token is set.After merge, two-step validation:
Workers Builds: aquifer-mcpredeploys (modified_on advances).entity entity_id=person:Obadiahor any rarely-touched figure). Worker should respond in well under 30s with either a real result or a non-poisoning partial. GraphQL Analytics outcome should remainsuccessregardless of upstream variance.Risk
Low. Additive helper, replaces two
Promise.allSettledcall sites, all existing tests pass, and the hot path (entity already in pre-built index, or in R2 bootstrap cache) is unchanged — only the cold-corpus-scan path is bounded. Worst case for an entity that genuinely doesn't exist in the corpus: bootstrap walks the chunked fanout to completion (slower wall-clock than before because batches are sequential, but still well under the 20s budget) and writes an empty cache entry preventing future re-walks.Mode trail
Investigation conducted under exploration → execution. Token-scoped GraphQL query revealed the
outcome=exceededMemoryrow and the 255-subrequest count. Root cause located by direct read ofsrc/tools.ts. Fix designed for minimum surface area (helper + chunked fanout + deadline; no architectural change). Tests added before push. Journal entry hand-written; encoder defect H7 from J-002 is still outstanding.Note
Medium Risk
Changes the
entity/entity-search cold path to enforce concurrency limits and wall-clock deadlines, which can alter result completeness and caching behavior under load despite added transparency and tests.Overview
Fixes the
entitytool’s cold-path bootstrap scan to avoid Cloudflare Worker OOMs by replacing unbounded nestedPromise.allSettledfanout with a newsettledInChunkshelper, capped concurrency (4 resources × 8 files) and a 20s wall-clock budget.The bootstrap now returns a structured
BootstrapEntityResult(includingcomplete, scan counters,failed_files, andbudget_exceeded) and bothhandleEntityand entity-basedsearchappend an explicit partial-result disclosure when the scan is incomplete; only fully complete results are written to the bootstrap cache. Extensive new unit tests cover chunking behavior and multiple partial/failed-fetch transparency and caching edge cases, and the journal is updated with the incident root-cause/fix narrative (J-003).Reviewed by Cursor Bugbot for commit dbd3fe3. Bugbot is set up for automated code reviews on this repo. Configure here.