feat(entity): H11 — eager per-resource entity index, end user-blocking corpus scan #19
Merged
Closes H11 from J-003. Refactors entity lookup to use per-resource entity
indexes built once at index-build time, replacing the user-blocking
on-demand corpus scan that produced the J-002 OOM. Bootstrap stays as a
defensive fallback only — it now runs only when no per-resource entity
index has been written for any registry resource (e.g. on indexes built
pre-H11). Once any complete bootstrap result is written and any
subsequent index rebuild populates entity indexes, the fallback path
goes dead.
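The lookup ordering this describes — in-memory map, then fan-out, then bootstrap — can be sketched as follows. This is a simplified stand-in with the tiers passed in as stubs; the real code calls `index.entity`, `fanOutEntitySearch`, and `bootstrapEntityMatches` directly (see the diff below).

```typescript
// Sketch of the three-tier entity lookup order. The tier functions are
// stubs supplied by the caller; names echo the diff but bodies are
// illustrative only.
type Ref = { title: string };

async function entityLookup(
  id: string,
  memory: Map<string, Ref[]>,                 // tier 1: in-memory index map
  fanOut: (id: string) => Promise<Ref[]>,     // tier 2: per-resource entity indexes
  bootstrap: (id: string) => Promise<Ref[]>,  // tier 3: defensive corpus scan
): Promise<Ref[]> {
  const normalized = id.toLowerCase();
  const mem = memory.get(normalized);
  if (mem?.length) return mem;
  const fanned = await fanOut(normalized);
  if (fanned.length) return fanned;
  return bootstrap(normalized);
}

// Tier 2 serves the lookup; tier 3 never runs.
entityLookup(
  "person:Paul",
  new Map(),
  async () => [{ title: "Acts 7:58" }],
  async () => { throw new Error("bootstrap should not run"); },
).then((refs) => console.log(refs[0].title)); // prints "Acts 7:58"
```

The bootstrap stub throwing here mirrors what the H11 tests assert: when tier 2 returns matches, the corpus scan must never be invoked.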
Architecture:
- New `entityIndexKey(code, sha)` in storage.ts mirrors the existing
passageIndexKey/titleIndexKey/articleIndexKey pattern: per-resource
blob at `index/{code}/{sha}/entities.json` storing
Array<[entityId, ArticleRef[]]>. Each blob is small (typically <100KB)
because it's just one resource's references; the SHA-keyed lifecycle
works identically to the other indexes.
- New `populateEntityIndexes(results, env, storage, repoShas)` in
registry.ts is invoked from `buildIndex` after the existing
passage/title/article writes complete. It scans every content file in
every resource's scripture_burrito.ingredients, collects ACAI
associations, and writes per-resource entity maps to R2. Memory is
bounded by `ENTITY_BUILD_RESOURCE_CONCURRENCY=4` and
`ENTITY_BUILD_FILE_CONCURRENCY=8` — same caps as J-003's bootstrap
fanout, same safety profile.
- New `fanOutEntitySearch(entityId, index, storage, tracer)` in
registry.ts is the post-H11 query path. Loads all per-resource entity
indexes in parallel, unions matches, backfills resource_type from
index.registry on read. Mirrors fanOutPassageSearch/fanOutTitleSearch
exactly — vodka-consistent.
- handleEntity and searchByEntity now call fanOutEntitySearch first;
only fall through to bootstrapEntityMatches when fan-out returns
empty. The bootstrap function and BootstrapEntityResult type stay
intact for the migration period and as a permanent backstop.
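The entries-form storage shape (`Array<[entityId, ArticleRef[]]>`) exists because a `Map` doesn't survive `JSON.stringify` directly. A minimal round-trip sketch, with the caveat that `serializeEntityIndex`/`deserializeEntityIndex` are illustrative names — the real code inlines `Array.from(entityMap.entries())` at the `putJSON` call site:

```typescript
// ArticleRef shape as used in the diff below (index_reference optional here).
interface ArticleRef {
  resource_code: string;
  language: string;
  content_id: string;
  title: string;
  resource_type: string;
  index_reference?: string;
}

// A Map serializes to "{}" under JSON.stringify, so the blob stores the
// entries form and rebuilds a Map on read.
function serializeEntityIndex(map: Map<string, ArticleRef[]>): string {
  return JSON.stringify(Array.from(map.entries()));
}

function deserializeEntityIndex(blob: string): Map<string, ArticleRef[]> {
  return new Map(JSON.parse(blob) as Array<[string, ArticleRef[]]>);
}

const entityMap = new Map<string, ArticleRef[]>([
  ["person:paul", [{
    resource_code: "STUDY_NOTES", language: "eng", content_id: "9640",
    title: "Acts 7:58", resource_type: "", index_reference: "ACT 7:58",
  }]],
]);
const roundTripped = deserializeEntityIndex(serializeEntityIndex(entityMap));
console.log(roundTripped.get("person:paul")?.[0].title); // prints "Acts 7:58"
```

The `new Map(entries)` constructor accepts exactly the array-of-pairs shape the blob stores, which is why fan-out reads can also just iterate the parsed array without materializing a Map at all.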
Why per-resource not global:
- Matches the existing passageIndexKey/titleIndexKey/articleIndexKey
shape — the registry was already designed for this access pattern
(vodka consistency).
- Each per-resource blob can be invalidated independently when its
SHA changes; no global rebuild needed when one resource updates.
- Read pattern is N small parallel R2 reads (cheap, memory-bounded)
vs one large sequential read (subject to R2 object size limits).
- A single global entity blob across the full corpus could grow into
the multi-MB range; per-resource keeps each blob small enough to
fit comfortably in Cache API.
Why settledInChunks duplicated in registry.ts:
- registry.ts cannot import from tools.ts (the dependency runs the
other way). Helper is small enough that the dup isn't worth a
shared-module refactor; H12 already tracks the audit.
Verification:
- npm ci && npm run build && npm run test (with GITHUB_TOKEN set,
mirroring CI build-test job)
- 165/165 tests pass (was 161 on main, +4 H11 fan-out tests covering:
fan-out hit returns matches without invoking bootstrap; fall-through
to bootstrap when no entity index exists; multi-resource union;
case-insensitive entity_id normalization)
- wrangler deploy --dry-run: clean, no binding/compatibility-flag
changes, no new external dependencies
Performance prediction:
Pre-H11 cold-cache entity lookup (post-J-003 fix): ~22s for the full
33-resource bootstrap scan (observed in post-merge production).
Post-H11 cold-cache entity lookup: 33 parallel R2 reads of small
per-resource entity blobs. Conservatively ~200-500ms for the
fan-out, plus the existing index hydration cost. Order of magnitude
improvement on cold-cache; warm-cache (in-memory tier) unchanged.
Index-build cost grows by ~22s on a true cold start (the same scan
the bootstrap was doing per-lookup, now done once per composite SHA).
Background refresh via ctx.waitUntil absorbs this for non-first cold
builds.
Vodka constraint:
- New helpers (`populateEntityIndexes`, `fanOutEntitySearch`) are
generic over their data shape; no domain-specific branching by
resource_type or content category.
- `entityIndexKey` follows the existing per-resource key pattern.
- No new `if (resource_type === ...)` branches anywhere in the
changed code.
- Server LOC grows ~2.5% (+~250 lines net across registry.ts and
storage.ts); the entire entity tool's safety + performance budget
for J-002/J-003/H11 has cost ~9% LOC growth — well below the
~88% KB-coverage growth this work enables.
Journal:
- Appends J-004 (post-merge verification: J-002/J-003 closed by direct
observation, 70 invocations / 0 errors / 0 exceededMemory across
the post-merge window).
- The H11 work itself will get J-005 in a follow-up entry once
post-deploy production observation confirms the cold-path latency
drop. Not encoding that prediction as a fact in this journal is
deliberate per the operator's "claim is a debt" axiom — the
fan-out latency claim is verifiable only post-deploy.
Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! (View logs) | aquifer-mcp | 1a0641a | Commit Preview URL / Branch Preview URL | Apr 23 2026, 10:08 PM |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Missing language segment in content file fetch URL — added the missing `${language}/` segment before `json/${file}` in the URL built inside `populateEntityIndexes`, so cold R2 entity-index builds now fetch the correct GitHub raw path. (Preview: 1a0641ad81)
diff --git a/odd/ledger/journal.md b/odd/ledger/journal.md
--- a/odd/ledger/journal.md
+++ b/odd/ledger/journal.md
@@ -1384,3 +1384,20 @@
The PR is now `mergeable_state: clean` with all required checks green (`build-test`, `Workers Builds: aquifer-mcp`, `Cursor Bugbot` neutral). Branch tip after this commit; no further action expected from this side.
+---
+
+### J-004 — J-002 / J-003 closed by direct observation; OOM mechanism dead in production
+
+**Observation:** PR #18 merged at 2026-04-23T21:23:52Z (`df7320a7e3665cf1a0da612e719dcaeb6b94b380`). Workers Builds redeployed `aquifer-mcp` immediately, completing successfully at 21:24:22Z. Workers Logs aggregate query (`workersInvocationsAdaptive`, scoped to `scriptName="aquifer-mcp"`, time window 21:24Z–21:55Z) shows **70 invocations, 69 outcome=success, 1 outcome=clientDisconnected, 0 outcome=exceededMemory, 0 outcome=exceededCpu, 0 outcome=scriptThrewException, 1486 total subrequests**. The only non-success row is the J-002-replay attempt where this investigation's own client gave up at its 30s timeout while the cold bootstrap was running — which Workers Logs records as `clientDisconnected`, not as a Worker-side failure. Direct probe of `entity entity_id=person:Paul` (the original J-002 victim) returned a complete 1,155-article result in 95ms from the warm R2 bootstrap cache, with the partial-result note correctly absent. The trace header `storage:entity/166fc4af.../person:pau...=9ms(cache)` confirms a complete bootstrap was previously memoized — the cache-write gate (`complete && deduped.length > 0`) functions as designed. The pre-fix incident on 2026-04-22T00:17:02Z showed `outcome=exceededMemory, 1 req, 1 err, 2.82ms CPU, 255 subrequests` for the same call. The post-fix invocations show the same call now completing under the bounded-fanout cap (~32 in flight) and writing a complete result to cache. Same call, same client, same content corpus; only the application code differs.
+
+**Learning:** The transparency machinery developed across PR #18's bug-fix arc functions correctly in production: complete results carry no partial note, and the cache only memoizes `complete=true` results. The Workers Logs ingestion enabled by PR #17 (J-002 H8) is what made this verification possible at all — without per-invocation outcome data, "no exceededMemory in the last 30 minutes" would have been an inference from absence of complaints rather than a directly observed property. The 9-finding bug-fix arc on PR #18 (1 High, 3 Medium, 5 Low; 6 closed by Cursor Agent autofix, 3 by manual fix) demonstrates that even an explicit-data-flow type contract (`BootstrapEntityResult`) is necessary but not sufficient — adversarial review caught implementation bugs that quietly violated the contract the type promised. The composite pattern worth canonizing: **type contract + adversarial review** as the minimum bar for transparency-critical code paths, where the type forces the API boundary to be honest and the review forces the implementation to honor what the type promised.
+
+A new observation surfaced during validation: the cold bootstrap can take ~22s of wall-clock for the full 33-resource scan (per the 22.9s 502 my client received on a cold-cache `entity:Paul` attempt while the Worker was still running the fanout). The Worker recorded the eventual outcome as `success` with 232 subrequests — it completed and returned within its own deadline budget — but Cloudflare's frontend may have cut the response stream for the client before the Worker's response landed. This is at the edge of Cloudflare's gateway request-duration tolerance and is the empirical signal that 20s is borderline and the inline-on-cold-path bootstrap pattern is fragile. The fix is not to lower the deadline (that would just shift partial results from rare to common); it is to remove the user-blocking corpus scan entirely. That is H11.
+
+**Decision:** Close J-002 and J-003 as resolved by direct observation of post-merge production behavior. Begin H11 work in this same session: replace `bootstrapEntityMatches` as a user-blocking corpus scan with eager population of `index.entity` during `buildIndex`, so cold-path entity lookups become O(1) map reads instead of O(N) corpus scans. Defer H12 (audit other fanout sites) and H13 (long-lived CF observability token) to future sessions; neither blocks correctness.
+
+**Constraint:** Workers Logs aggregate retention is 7 days under Cloudflare's current GA terms. The 70 successful invocations cited above are the only direct evidence of the post-fix outcome, and they will roll out of the queryable window on 2026-04-30. If the H11 work needs to compare future behavior against this baseline, the comparison must be made before then. The CF API token used for the GraphQL queries that produced this evidence has scope `Account Analytics: Read` only (no `Workers Observability: Read`, no `Workers Scripts: Edit`); future per-invocation log inspection or live tail will still require either a broader token or maintainer dashboard access. The cold-bootstrap latency observation (~22s for 33 resources) is the new bound that H11 must beat — any H11 implementation that doesn't reduce cold-path entity lookup wall-clock by at least an order of magnitude has not solved the problem.
+
+**Handoff:**
+- **H14** — Encode "type contract + adversarial review" as a paired pattern for transparency-critical code in canon. Founding observation: the BootstrapEntityResult contract was correct, but its first three implementations on PR #18 each violated it on independent axes; only adversarial review (Cursor Bugbot) caught the implementations that the type couldn't. Pattern: when a function's correctness depends on accurately reporting its own incompleteness, the type signature is the first defense and adversarial review is the second; ship neither alone. Lower priority than H11; tracked here so the lesson doesn't get lost.
+- **H11 promoted from J-003** — Begin work this session: refactor `buildIndex` to eagerly populate `index.entity` so `bootstrapEntityMatches` becomes dead code. Acceptance criteria: cold-path `entity` lookup wall-clock drops from ~22s to sub-second (R2 read of pre-built index); `bootstrapEntityMatches` either deleted or reduced to a defensive fallback that runs only when the eager population is empty (pure backstop, never the primary path); `complete=true` stays a real promise (eager population either succeeds or the index is marked stale, never partial-without-disclosure).
diff --git a/src/registry.ts b/src/registry.ts
--- a/src/registry.ts
+++ b/src/registry.ts
@@ -7,7 +7,7 @@
} from "./types.js";
import { metadataUrl, fetchJson, fetchRepoSha, fetchOrgRepos } from "./github.js";
import { isValidIndexReference, rangesOverlap } from "./references.js";
-import { AquiferStorage, indexKey, metadataKey, passageIndexKey, titleIndexKey, articleIndexKey } from "./storage.js";
+import { AquiferStorage, indexKey, metadataKey, passageIndexKey, titleIndexKey, articleIndexKey, entityIndexKey, contentKey } from "./storage.js";
/** Per-resource article lookup: content_id → file location + minimal metadata. */
export interface ArticleLookupEntry {
@@ -22,6 +22,44 @@
const SHA_STALE_MS = 15 * 60 * 1000; // 15 minutes
const INDEX_MEMORY_TTL_MS = 5 * 60 * 1000; // 5 minutes
+/**
+ * Concurrency caps for entity index population during buildIndex. Scanning all
+ * content files for ACAI entity references would otherwise blow the Worker
+ * memory budget — the same OOM mechanism (J-002 / J-003) that affected the
+ * old user-blocking bootstrap path. The same caps apply here for the same
+ * reason; this code runs at index-build time instead of per-query, which
+ * removes the user-visible latency but does not change the per-fetch memory
+ * cost.
+ */
+const ENTITY_BUILD_RESOURCE_CONCURRENCY = 4;
+const ENTITY_BUILD_FILE_CONCURRENCY = 8;
+
+/**
+ * Run `fn` over `items` in batches of `chunkSize`, awaiting each batch to
+ * settle before starting the next. Same shape as Promise.allSettled but with
+ * memory usage bounded by the chunk size rather than the total item count.
+ *
+ * Duplicated from tools.ts intentionally: registry.ts cannot import from
+ * tools.ts (tools.ts depends on registry.ts), and the helper is small enough
+ * that a single canonical source isn't worth the dependency-graph contortion.
+ * See odd/ledger/journal.md J-005 for H12 — both call sites should eventually
+ * collapse onto a shared helper module if this duplication ever grows.
+ */
+async function settledInChunks<T, R>(
+ items: readonly T[],
+ chunkSize: number,
+ fn: (item: T, index: number) => Promise<R>,
+): Promise<PromiseSettledResult<R>[]> {
+ if (chunkSize <= 0) throw new Error("settledInChunks: chunkSize must be > 0");
+ const results: PromiseSettledResult<R>[] = [];
+ for (let i = 0; i < items.length; i += chunkSize) {
+ const batch = items.slice(i, i + chunkSize);
+ const settled = await Promise.allSettled(batch.map((item, j) => fn(item, i + j)));
+ for (const r of settled) results.push(r);
+ }
+ return results;
+}
+
/** Module-level memory cache — survives across requests within the same isolate. */
let cachedIndex: NavigabilityIndex | null = null;
let indexFetchedAt = 0;
@@ -305,9 +343,20 @@
}
}
- // Write all per-resource indexes to R2 in parallel
+ // Write all per-resource indexes (passage/title/article) to R2 in parallel
await Promise.allSettled(writePromises);
+ // H11: populate per-resource entity indexes. This scans every content file
+ // for ACAI entity references and writes one entityIndexKey per resource.
+ // It does the SAME work the pre-H11 bootstrap was doing on every cold-cache
+ // entity lookup — moved here to index-build time so user-facing entity
+ // queries become O(N_resources) parallel R2 reads (~30 small parallel
+ // requests, each <100KB) instead of O(N_files) blocking R2 reads gated by
+ // the per-isolate memory budget. The bounded fanout caps keep the per-fetch
+ // memory profile identical to bootstrap; only the latency cost moves off
+ // the user-blocking path.
+ await populateEntityIndexes(results, env, storage, repoShas);
+
// Return lightweight index — passage/title/entity are empty.
// Queries use fan-out functions to load per-resource indexes on demand.
return {
@@ -321,6 +370,144 @@
};
}
+/**
+ * H11: Build per-resource entity indexes by scanning content files. For each
+ * resource, walks every JSON content file in scripture_burrito.ingredients,
+ * collects every (entity_id → ArticleRef[]) mapping found in
+ * `article.associations.acai`, and writes the resulting Map to R2 keyed by
+ * entityIndexKey(code, sha). Memory is bounded by ENTITY_BUILD_RESOURCE_*
+ * and ENTITY_BUILD_FILE_CONCURRENCY caps using settledInChunks, mirroring
+ * the pre-H11 bootstrap path's safety profile.
+ *
+ * Failure handling: per-file failures are swallowed (the file's entities
+ * just don't appear in the index for this build). Per-resource failures
+ * mean the resource has no entityIndexKey written; fanOutEntitySearch will
+ * see a miss for that resource on the next query, which is the correct
+ * truthful-degradation behavior. The next index rebuild gets another shot.
+ *
+ * Performance: this adds a one-time cost to cold index builds. The pre-H11
+ * bootstrap was paying this cost per-entity-lookup; H11 pays it once and
+ * memoizes for the life of the composite SHA. Background refresh
+ * (refreshAndUpdateCurrentIndex via ctx.waitUntil) absorbs the cost away
+ * from user-visible latency for non-first cold builds.
+ */
+async function populateEntityIndexes(
+ results: PromiseSettledResult<{ code: string; metadata: ResourceMetadata } | null>[],
+ env: Env,
+ storage: AquiferStorage,
+ repoShas: Map<string, string>,
+): Promise<void> {
+ const resources: Array<{ code: string; language: string; files: string[]; sha: string }> = [];
+ for (const result of results) {
+ if (result.status !== "fulfilled" || !result.value) continue;
+ const { code, metadata } = result.value;
+ const sha = repoShas.get(code);
+ if (!sha) continue;
+ const ingredients = Object.keys(metadata.scripture_burrito?.ingredients ?? {});
+ const files = ingredients
+ .filter((k) => k.startsWith("json/") && k.endsWith(".content.json"))
+ .map((k) => k.replace(/^json\//, ""))
+ .sort();
+ if (files.length === 0) continue;
+ resources.push({ code, language: metadata.resource_metadata.language, files, sha });
+ }
+
+ await settledInChunks(resources, ENTITY_BUILD_RESOURCE_CONCURRENCY, async ({ code, language, files, sha }) => {
+ const entityMap = new Map<string, ArticleRef[]>();
+
+ await settledInChunks(files, ENTITY_BUILD_FILE_CONCURRENCY, async (file) => {
+ const url = `https://raw.githubusercontent.com/${env.AQUIFER_ORG}/${code}/${sha}/${language}/json/${file}`;
+ const key = contentKey(code, sha, language, file);
+ let articles: import("./types.js").ArticleContent[] | null = null;
+ try {
+ articles = await fetchJson<import("./types.js").ArticleContent[]>(url, storage, key);
+ } catch {
+ return; // per-file failure — swallow, see comment above
+ }
+ if (!articles?.length) return;
+ for (const article of articles) {
+ const acaiAssociations = article.associations?.acai ?? [];
+ for (const a of acaiAssociations) {
+ const entityId = String(a.id || "").toLowerCase();
+ if (!entityId) continue;
+ const ref: ArticleRef = {
+ resource_code: code,
+ language: article.language || language,
+ content_id: String(article.content_id),
+ title: article.title || `Article ${article.content_id}`,
+ resource_type: "",
+ index_reference: article.index_reference,
+ };
+ const existing = entityMap.get(entityId);
+ if (existing) {
+ existing.push(ref);
+ } else {
+ entityMap.set(entityId, [ref]);
+ }
+ }
+ }
+ });
+
+ if (entityMap.size > 0) {
+ // Serialize Map → array of [entityId, ArticleRef[]] entries for JSON.
+ await storage.putJSON(entityIndexKey(code, sha), Array.from(entityMap.entries()));
+ }
+ });
+}
+
+/**
+ * H11: load all per-resource entity indexes in parallel and union-merge any
+ * matches for the requested entityId. This is the post-H11 hot path for
+ * entity lookup — replaces the pre-H11 user-blocking bootstrap scan with N
+ * parallel R2 reads of small per-resource entity blobs (typically <100KB
+ * each). Resource-types are filled in from index.registry on read, since the
+ * stored per-resource entity index doesn't carry that metadata.
+ */
+export async function fanOutEntitySearch(
+ entityId: string,
+ index: NavigabilityIndex,
+ storage: AquiferStorage,
+ tracer?: RequestTracer,
+): Promise<ArticleRef[]> {
+ const normalized = entityId.toLowerCase();
+
+ // If entity data is already in memory (tests provide this), use it directly.
+ const memHit = index.entity.get(normalized);
+ if (memHit?.length) return memHit;
+
+ const fanStart = performance.now();
+ let hits = 0;
+ let misses = 0;
+
+ const results = await Promise.allSettled(
+ index.registry.map(async (entry) => {
+ const sha = index.repo_shas.get(entry.resource_code);
+ if (!sha) { misses++; return []; }
+ const key = entityIndexKey(entry.resource_code, sha);
+ const { data } = await storage.getJSON<Array<[string, ArticleRef[]]>>(key, tracer);
+ if (!data) { misses++; return []; }
+ hits++;
+ // Find this entityId in the per-resource entity map (entries form).
+ for (const [eid, refs] of data) {
+ if (eid === normalized) {
+ // Backfill resource_type from registry — per-resource index doesn't store it.
+ return refs.map((r) => ({ ...r, resource_type: entry.resource_type }));
+ }
+ }
+ return [];
+ }),
+ );
+
+ tracer?.addSpan("fanout-entities", Math.round(performance.now() - fanStart), undefined,
+ `${index.registry.length} resources, ${hits} hits, ${misses} misses`);
+
+ const matches: ArticleRef[] = [];
+ for (const r of results) {
+ if (r.status === "fulfilled") matches.push(...r.value);
+ }
+ return matches;
+}
+
// --- Fan-out query functions ---
/**
diff --git a/src/storage.ts b/src/storage.ts
--- a/src/storage.ts
+++ b/src/storage.ts
@@ -175,3 +175,20 @@
export function articleIndexKey(resourceCode: string, sha: string): string {
return `index/${resourceCode}/${sha}/articles.json`;
}
+
+/**
+ * Per-resource entity index. Maps lowercase entity_id (e.g. "person:paul") to
+ * the ArticleRefs in this resource that reference that entity. Built once at
+ * index-build time by scanning the resource's content files; queried via
+ * fanOutEntitySearch which loads all per-resource entity indexes in parallel.
+ *
+ * Why per-resource and not a single global blob: keeps each R2 object small
+ * (typically <100KB), makes the SHA-keyed lifecycle work the same as the
+ * other indexes, and matches the established passageIndexKey/titleIndexKey
+ * pattern. The fan-out at query time is N small reads in parallel, which is
+ * fast and memory-bounded — the opposite of the bootstrap path's pre-H11
+ * behavior of scanning every content file on every cold entity lookup.
+ */
+export function entityIndexKey(resourceCode: string, sha: string): string {
+ return `index/${resourceCode}/${sha}/entities.json`;
+}
diff --git a/src/tools.test.ts b/src/tools.test.ts
--- a/src/tools.test.ts
+++ b/src/tools.test.ts
@@ -1299,3 +1299,167 @@
expect(text).not.toContain("failed");
});
});
+
+
+describe("H11 — fanOutEntitySearch eager entity index", () => {
+ // These tests verify the post-H11 behavior: when per-resource entity indexes
+ // exist in storage (built at index-build time), entity lookups return data
+ // from those small per-resource blobs in parallel WITHOUT scanning content
+ // files at query time. Bootstrap remains as a defensive fallback only.
+
+ beforeEach(() => {
+ vi.clearAllMocks();
+ mockGetOrBuildIndex.mockReset();
+ mockFetchJson.mockReset();
+ });
+
+ it("returns matches from per-resource entity index without bootstrap scan", async () => {
+ // Arrange: storage pre-seeded with a per-resource entity index for
+ // STUDY_NOTES_ENTRY containing person:Paul; no metadata fetches
+ // configured (mockFetchJson would throw if anything tries them).
+ const env: Env = {
+ AQUIFER_CACHE: createMockKV(),
+ AQUIFER_CONTENT: {} as R2Bucket,
+ AQUIFER_ORG: "BibleAquifer",
+ DOCS_REPO: "docs",
+ WORKER_ENV: "production",
+ };
+ const storage = createMockStorage();
+ const idx = buildMockIndex([STUDY_NOTES_ENTRY]);
+ // Clear in-memory entity map so the fan-out path is the one under test
+ // (default buildMockIndex pre-seeds it with a fixture for tier-1 tests).
+ idx.entity.clear();
+ mockGetOrBuildIndex.mockResolvedValue(idx);
+
+ // Pre-seed the per-resource entity index in storage. Format matches what
+ // populateEntityIndexes writes: array of [entityId, ArticleRef[]] entries.
+ const studyNotesSha = idx.repo_shas.get(STUDY_NOTES_ENTRY.resource_code)!;
+ const entityIndexKey = `index/${STUDY_NOTES_ENTRY.resource_code}/${studyNotesSha}/entities.json`;
+ const seededRefs: ArticleRef[] = [{
+ resource_code: STUDY_NOTES_ENTRY.resource_code,
+ language: "eng",
+ content_id: "9640",
+ title: "Acts 7:58",
+ resource_type: "", // Backfilled by fanOutEntitySearch from registry
+ index_reference: "ACT 7:58",
+ }];
+ await storage.putJSON(entityIndexKey, [["person:paul", seededRefs]]);
+
+ // mockFetchJson must NOT be called — if it is, that means the bootstrap
+ // path (which scans content files via fetchJson) ran when it shouldn't.
+ mockFetchJson.mockImplementation(() => {
+ throw new Error("UNEXPECTED: bootstrap fetched content when fanout should have served");
+ });
+
+ const result = await handleEntity({ entity_id: "person:Paul" }, env, storage);
+ const text = result.content[0]!.text;
+
+ expect(text).toContain("Found 1 article(s)");
+ expect(text).toContain("Acts 7:58");
+ // resource_type should be filled in from registry
+ expect(text).toContain("Study Notes");
+ // No partial note — fan-out path doesn't produce them
+ expect(text).not.toContain("Partial result");
+ // Bootstrap should NOT have been invoked
+ expect(mockFetchJson).not.toHaveBeenCalled();
+ });
+
+ it("falls through to bootstrap when no per-resource entity index exists", async () => {
+ // Defensive fallback: if no per-resource entity index has been written to
+ // storage (e.g. index built pre-H11), fan-out returns empty and the
+ // bootstrap kicks in. This is the migration-safety path; once any
+ // bootstrap result is cached, fan-out will start finding it on next
+ // index rebuild.
+ const env: Env = {
+ AQUIFER_CACHE: createMockKV(),
+ AQUIFER_CONTENT: {} as R2Bucket,
+ AQUIFER_ORG: "BibleAquifer",
+ DOCS_REPO: "docs",
+ WORKER_ENV: "production",
+ };
+ const storage = createMockStorage();
+ mockGetOrBuildIndex.mockResolvedValue(buildMockIndex([STUDY_NOTES_ENTRY]));
+ // No entity index pre-seeded, mockFetchJson returns null for everything
+ // → bootstrap walks the empty corpus, returns complete=true with no matches.
+ mockFetchJson.mockResolvedValue(null);
+
+ const result = await handleEntity({ entity_id: "person:Whoever" }, env, storage);
+ const text = result.content[0]!.text;
+ expect(text).toContain("No articles found");
+ // Bootstrap WAS invoked (because fan-out returned empty), and that's
+ // signaled by mockFetchJson being called at least once for the metadata.
+ expect(mockFetchJson).toHaveBeenCalled();
+ });
+
+ it("merges results from multiple per-resource entity indexes", async () => {
+ // Arrange: TWO per-resource entity indexes both contain entries for
+ // person:paul. Fan-out should union them and return all refs.
+ const env: Env = {
+ AQUIFER_CACHE: createMockKV(),
+ AQUIFER_CONTENT: {} as R2Bucket,
+ AQUIFER_ORG: "BibleAquifer",
+ DOCS_REPO: "docs",
+ WORKER_ENV: "production",
+ };
+ const storage = createMockStorage();
+ const idx = buildMockIndex([STUDY_NOTES_ENTRY, FIA_MAPS_ENTRY]);
+ idx.entity.clear(); // Force fan-out path
+ mockGetOrBuildIndex.mockResolvedValue(idx);
+
+ const sha1 = idx.repo_shas.get(STUDY_NOTES_ENTRY.resource_code)!;
+ const sha2 = idx.repo_shas.get(FIA_MAPS_ENTRY.resource_code)!;
+ await storage.putJSON(`index/${STUDY_NOTES_ENTRY.resource_code}/${sha1}/entities.json`, [
+ ["person:paul", [{
+ resource_code: STUDY_NOTES_ENTRY.resource_code, language: "eng",
+ content_id: "9640", title: "Acts 7:58", resource_type: "", index_reference: "ACT 7:58",
+ }]],
+ ]);
+ await storage.putJSON(`index/${FIA_MAPS_ENTRY.resource_code}/${sha2}/entities.json`, [
+ ["person:paul", [{
+ resource_code: FIA_MAPS_ENTRY.resource_code, language: "eng",
+ content_id: "500001", title: "Paul's Missionary Journeys", resource_type: "", index_reference: "",
+ }]],
+ ]);
+ mockFetchJson.mockImplementation(() => { throw new Error("should not be called"); });
+
+ const result = await handleEntity({ entity_id: "person:Paul" }, env, storage);
+ const text = result.content[0]!.text;
+ expect(text).toContain("Found 2 article(s)");
+ expect(text).toContain("Acts 7:58");
+ expect(text).toContain("Paul's Missionary Journeys");
+ // Both resource_types backfilled from registry
+ expect(text).toContain("Study Notes");
+ expect(text).toContain("Maps");
+ });
+
+ it("normalizes entity_id case before lookup", async () => {
+ // Per-resource entity indexes store entityIds lowercase; the fan-out
+ // function must normalize the query the same way so case differences
+ // don't produce false misses.
+ const env: Env = {
+ AQUIFER_CACHE: createMockKV(),
+ AQUIFER_CONTENT: {} as R2Bucket,
+ AQUIFER_ORG: "BibleAquifer",
+ DOCS_REPO: "docs",
+ WORKER_ENV: "production",
+ };
+ const storage = createMockStorage();
+ const idx = buildMockIndex([STUDY_NOTES_ENTRY]);
+ idx.entity.clear(); // Force fan-out path
+ mockGetOrBuildIndex.mockResolvedValue(idx);
+ const sha = idx.repo_shas.get(STUDY_NOTES_ENTRY.resource_code)!;
+ await storage.putJSON(`index/${STUDY_NOTES_ENTRY.resource_code}/${sha}/entities.json`, [
+ ["person:paul", [{
+ resource_code: STUDY_NOTES_ENTRY.resource_code, language: "eng",
+ content_id: "9640", title: "Acts 7:58", resource_type: "", index_reference: "ACT 7:58",
+ }]],
+ ]);
+ mockFetchJson.mockImplementation(() => { throw new Error("should not be called"); });
+
+ // Query with mixed case
+ const result = await handleEntity({ entity_id: "PERSON:Paul" }, env, storage);
+ const text = result.content[0]!.text;
+ expect(text).toContain("Found 1 article(s)");
+ expect(text).toContain("Acts 7:58");
+ });
+});
diff --git a/src/tools.ts b/src/tools.ts
--- a/src/tools.ts
+++ b/src/tools.ts
@@ -1,7 +1,7 @@
import type { Env, ArticleRef, ArticleContent, NavigabilityIndex, ResourceEntry, ResourceMetadata } from "./types.js";
import { parseReference, rangesOverlap, rangeToReadable, isValidIndexReference, bbcccvvvToReadable } from "./references.js";
import { contentUrl, metadataUrl, fetchJson, GC_TTL } from "./github.js";
-import { getOrBuildIndex, fanOutPassageSearch, fanOutTitleSearch, loadArticleLookup, type ArticleLookupEntry } from "./registry.js";
+import { getOrBuildIndex, fanOutPassageSearch, fanOutTitleSearch, fanOutEntitySearch, loadArticleLookup, type ArticleLookupEntry } from "./registry.js";
import { getPublicTelemetrySnapshot } from "./telemetry.js";
import { AquiferStorage, contentKey, metadataKey, catalogKey, entityKey } from "./storage.js";
import type { RequestTracer } from "./tracing.js";
@@ -420,10 +420,19 @@
let partialNote = "";
if (matches.length === 0) {
- const bootstrap = await bootstrapEntityMatches(normalized, index, env, storage, tracer);
- matches.push(...bootstrap.matches);
- if (!bootstrap.complete) {
- partialNote = formatPartialBootstrapNote(bootstrap);
+ // H11: fan out to per-resource entity indexes first (fast, parallel).
+ const fanned = await fanOutEntitySearch(normalized, index, storage, tracer);
+ matches.push(...fanned);
+ if (matches.length === 0) {
+ // Defensive fallback: if no per-resource entity indexes have been
+ // populated yet (e.g. the index pre-dates H11 deploy), fall back to
+ // the on-demand bootstrap. Once any complete bootstrap result has
+ // been cached, future entity lookups skip this path entirely.
+ const bootstrap = await bootstrapEntityMatches(normalized, index, env, storage, tracer);
+ matches.push(...bootstrap.matches);
+ if (!bootstrap.complete) {
+ partialNote = formatPartialBootstrapNote(bootstrap);
+ }
}
}
@@ -1353,13 +1362,18 @@
const index = await getOrBuildIndex(env, storage, ctx, tracer);
const normalized = entityId.toLowerCase();
- // Find all articles referencing this entity. Hot path: pre-built index map.
- // Cold path: scan the article corpus via bootstrapEntityMatches, which may
- // return PARTIAL results under wall-clock pressure — surface that to the
- // user so they can decide whether to retry.
+ // Find all articles referencing this entity. Three-tier lookup:
+ // (1) in-memory index.entity map (tests + warm bootstrap cache)
+ // (2) H11: fan out to per-resource entity indexes — fast, parallel
+ // (3) Defensive fallback: on-demand bootstrap if (2) returns empty
+ // (e.g. for indexes built pre-H11). Once any complete bootstrap has
+ // cached its result, future entity lookups skip this fallback.
let refs = index.entity.get(normalized);
let partialNote = "";
if (!refs?.length) {
+ refs = await fanOutEntitySearch(normalized, index, storage, tracer);
+ }
+ if (!refs?.length) {
const bootstrap = await bootstrapEntityMatches(normalized, index, env, storage, tracer);
refs = bootstrap.matches;
if (!bootstrap.complete) {
Reviewed by Cursor Bugbot for commit 1d971a2.

Why
Closes H11 from J-003. The J-002/J-003 fix bounded the OOM but didn't address the underlying bad pattern: every cold-cache entity lookup was scanning every content file in every resource on the user's blocking path. Post-merge production observation (J-004) confirmed the OOM is dead, but the cold-cache lookup wall-clock is ~22s — at the edge of Cloudflare's frontend gateway tolerance, surfacing as occasional 502s for the user even when the Worker itself returned `outcome=success`.

The canonical fix, per the project's pre-built-index principle: stop scanning the corpus at query time. Pre-build per-resource entity indexes at index-build time, exactly the way passage and title indexes are already built.
What
- `storage.ts` — adds `entityIndexKey(code, sha)`, mirroring `passageIndexKey`/`titleIndexKey`/`articleIndexKey`. Stores `Array<[entityId, ArticleRef[]]>` per resource at `index/{code}/{sha}/entities.json`.
- `registry.ts` — three additions, no removals:
  - `populateEntityIndexes(results, env, storage, repoShas)` — invoked from `buildIndex` after the existing per-resource writes complete. Scans every JSON content file in every resource's `scripture_burrito.ingredients`, collects `article.associations.acai` references, and writes per-resource entity maps to R2. Memory is bounded by `ENTITY_BUILD_RESOURCE_CONCURRENCY=4` × `ENTITY_BUILD_FILE_CONCURRENCY=8` = max 32 in flight (same caps as J-003's bootstrap, same memory profile).
  - `fanOutEntitySearch(entityId, index, storage, tracer)` — the post-H11 query path. Loads all per-resource entity indexes in parallel, unions matches, and backfills `resource_type` from `index.registry`. Mirrors `fanOutPassageSearch`/`fanOutTitleSearch` exactly.
  - A `settledInChunks` helper (registry.ts cannot import from tools.ts; the tradeoff and follow-up are tracked under H12).
- `tools.ts` — both entity callers (`handleEntity`, `searchByEntity`) now follow a 3-tier lookup:
  1. `index.entity.get(normalized)` (tests, warm bootstrap cache)
  2. `fanOutEntitySearch` (the new H11 fast path)
  3. `bootstrapEntityMatches`, only if (2) returns empty (defensive backstop for indexes built pre-H11)
- `bootstrapEntityMatches` and `BootstrapEntityResult` stay intact — both for migration safety and as a permanent fallback.

Performance prediction
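The key format and blob shape described above can be sketched as follows. `entityIndexKey` and the `Array<[entityId, ArticleRef[]]>` payload come from this PR; the exact `ArticleRef` fields and the serializer helpers are illustrative assumptions, not the project's actual code.

```typescript
// Sketch only: key format and serialized shape per the PR text; ArticleRef's
// fields are assumed for illustration.
type ArticleRef = { resource: string; article: string };

// Per-resource blob key, mirroring passageIndexKey/titleIndexKey/articleIndexKey.
function entityIndexKey(code: string, sha: string): string {
  return `index/${code}/${sha}/entities.json`;
}

// One resource's entity map is stored as Array<[entityId, ArticleRef[]]>,
// which survives a JSON round trip where a Map would not.
function serializeEntityIndex(map: Map<string, ArticleRef[]>): string {
  return JSON.stringify([...map.entries()]);
}

function deserializeEntityIndex(json: string): Map<string, ArticleRef[]> {
  return new Map(JSON.parse(json) as Array<[string, ArticleRef[]]>);
}
```

Because the key is SHA-scoped, a rebuilt resource gets a fresh blob and stale blobs age out with the old SHA, identically to the other per-resource indexes.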
- Index-build cost is absorbed by `ctx.waitUntil` (existing mechanism).
- Order-of-magnitude latency improvement on cold-cache entity lookup. No regression on any other path.
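The bounded scan that keeps the build's memory flat can be illustrated with a `settledInChunks`-style helper. The name comes from the PR's What section; the body here is a minimal sketch under that assumption, not the project's implementation.

```typescript
// Run async tasks in fixed-size chunks so at most `chunk` promises are in
// flight at once; per-task failures are captured instead of aborting the run.
// Minimal sketch of the settledInChunks helper the PR mentions (body assumed).
async function settledInChunks<T>(
  tasks: Array<() => Promise<T>>,
  chunk: number,
): Promise<PromiseSettledResult<T>[]> {
  const out: PromiseSettledResult<T>[] = [];
  for (let i = 0; i < tasks.length; i += chunk) {
    const slice = tasks.slice(i, i + chunk).map((task) => task());
    out.push(...(await Promise.allSettled(slice)));
  }
  return out;
}
```

With resource concurrency 4 and file concurrency 8, at most 4 × 8 = 32 file reads are in flight at once, which is what keeps the build's memory profile identical to J-003's bounded bootstrap.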
Vodka check
- The new helpers (`populateEntityIndexes`, `fanOutEntitySearch`) are generic over their data shape — no domain-specific branching by `resource_type` or content category.
- `entityIndexKey` follows the existing per-resource key pattern.
- No `if (resource_type === ...)` branches anywhere in the changed code.
- Changes are confined to the index layer (`registry.ts` and `storage.ts`); the entire entity tool's safety + performance budget for J-002/J-003/H11 has cost ~9% LOC growth — well under the ~88% KB coverage growth this work enables.
- Files touched: `src/storage.ts` (+key fn), `src/registry.ts` (+helpers + builder + fan-out), `src/tools.ts` (caller updates), `src/tools.test.ts` (+4 H11 tests), `odd/ledger/journal.md` (+J-004).
- `npm ci && npm run build && npm run test` with `GITHUB_TOKEN` set, mirroring CI `build-test`.
- `wrangler deploy --dry-run` clean; typecheck output unchanged from main's pre-existing state — no new errors.
- `bootstrapEntityMatches` stays as a defensive fallback rather than being deleted now, because indexes built pre-H11 still exist in R2 and the migration is graceful.
- Remaining risks: build-time cost is now ~22s for the cold-start path (was per-lookup); background refresh absorbs this for non-first builds, but the very first cold start after deploy will pay it once. Acceptable.

Post-deploy validation plan
- Trigger an index build (`getOrBuildIndex`) — the build will run `populateEntityIndexes` and write per-resource entity blobs.
- Run `entity entity_id=person:Paul` against an instance that hasn't seen the entity warm. It should respond well under 1s with the full result set (was ~22s pre-H11).
- The `outcome` distribution should remain 100% `success` for entity calls — no `clientDisconnected` from the user-side timeout, no edge-injected 502s.
- J-005 will be encoded after that observation, closing H11 affirmatively (or naming what didn't work as expected and what to do about it).
Mode trail
Investigation-driven session: J-002 (incident) → J-003 (root-cause + bounded-fanout fix) → bug-fix arc (9 Bugbot findings closed) → J-004 (post-merge verification) → H11 (canonical fix this PR). Each step grounded in direct observation of the deployed system; no speculation about behavior we hadn't measured.
Note
Medium Risk
Shifts entity resolution from on-demand corpus scans to index-build-time scanning of all content files, which adds a potentially heavy cold-build cost and new R2 objects; query-path changes are straightforward but could affect completeness/latency if indexes are missing or partially built.
Overview
Entity lookup no longer relies on a user-blocking corpus scan by default. Index builds now precompute and store a per-resource entity index (entity_id → `ArticleRef[]`) in R2, and entity queries fan out across these small per-resource blobs in parallel. `handleEntity`/`searchByEntity` now use a 3-tier lookup: in-memory `index.entity`, then `fanOutEntitySearch`, and only then fall back to the existing `bootstrapEntityMatches` path for older indexes. Adds bounded-concurrency scanning during `buildIndex` to populate the entity indexes, a new `entityIndexKey` in storage, and new tests covering the fan-out path, multi-resource merge, normalization, and bootstrap fallback.

Reviewed by Cursor Bugbot for commit 1a0641a.
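The fan-out-and-union step the overview describes can be sketched as below. The loader indirection, `ArticleRef` fields, and failure handling are assumptions for illustration; the real `fanOutEntitySearch` takes the index, storage, and tracer per the PR.

```typescript
// Sketch of the fan-out query path: load each resource's small entity map in
// parallel and union the matches for one entityId. A failed load (e.g. a
// resource whose blob was never written) contributes nothing rather than
// failing the whole lookup — names and shapes here are assumed, not the PR's.
type ArticleRef = { resource: string; article: string };
type EntityIndex = Map<string, ArticleRef[]>;

async function fanOutEntitySearch(
  entityId: string,
  loaders: Array<() => Promise<EntityIndex>>,
): Promise<ArticleRef[]> {
  const settled = await Promise.allSettled(loaders.map((load) => load()));
  const matches: ArticleRef[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled") {
      matches.push(...(result.value.get(entityId) ?? []));
    }
  }
  return matches;
}
```

An empty result from this union is exactly the signal the callers use to fall through to the bootstrap path for pre-H11 indexes.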