feat(slice-2): harness + recall + cert collection (mock-only) by Declade · Pull Request #1 · Declade/lucairn-research

Declade · 2026-05-17T09:54:27Z

Summary

Slice 2 of the Lucairn Research Program (Paper 1, healthcare). Adds the harness that calls the Lucairn gateway in proving_ground mode, collects cert URLs, and computes per-HIPAA-Safe-Harbor-category recall against the 500-row Measurement-B subset from Slice 1.

Mock-only this slice. Live gateway.lucairn.eu calls land in Slice 3 per the locked halt gate in the PRD.

Active PRD (Status: Locked): Opus Advisor/specs/prd-2026-05-17-research-program.md — Slice 2 scope at lines 141-145.

What's in this PR

- src/gateway-client.ts: typed wrapper around /api/v1/proxy/messages in proving_ground mode. Env-driven (LUCAIRN_GATEWAY_URL / LUCAIRN_API_KEY / LUCAIRN_UPSTREAM_KEY). The X-Upstream-Key header is plumbed through for Slice 3's BYOK-per-request profiles (gate at dual-sandbox-architecture/services/gateway/internal/api/proxy.go:349-354). Ground-truth annotations with value.trim().length < 3 are dropped for containment-match safety.
- src/redaction-extractor.ts: pure; gateway response → ExtractedRedaction[] tagged with HIPAA category.
- src/recall.ts: pure; ground truth + extracted → RecallSummary with per-HIPAA-category recall/precision/F1.
- src/hipaa-category-mapping.ts: explicit one-way table keyed by the 11 live placeholder prefixes observed in dual-sandbox-architecture/services/sanitizer/presidio_scan.py:31-58. PERSON, EMAIL, PHONE, LOCATION, IBAN, CC, SSN, URL, and DOB are mapped; ID and SECRET are intentionally null-mapped because the sanitizer collapses multiple HIPAA categories into them, a documented limitation that surfaces in the unmapped_extras accounting. Regression-locked against drift via a test that walks the canonical PRESIDIO_TO_PLACEHOLDER map.
- src/mocks/gateway-fixtures.ts: msw fixtures matching the live ground_truth_evaluation shape.
- scripts/run-pipeline.ts: per-row gateway calls (mock or live), NDJSON output. Supports --rows=N, --mock, --upstream-key.
- scripts/collect-certs.ts: NDJSON → CERTIFICATES.csv.
- scripts/compute-recall.ts: aggregates per-HIPAA-category recall against ground truth. Validates SUMMARY.json before writeFile.
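To make the per-category aggregation concrete, here is a minimal sketch of how tp/fn/fp verdicts could be rolled up into recall, precision, and F1. The type and function names (`TaggedRedaction`, `aggregateByCategory`, the `"UNMAPPED"` bucket) are illustrative assumptions, not the actual exports of src/recall.ts, and returning `null` for an undefined ratio is a design choice made for this sketch only.

```typescript
// Hypothetical shapes for illustration; the real RecallSummary may differ.
type Verdict = "tp" | "fn" | "fp";

interface TaggedRedaction {
  hipaaCategory: string | null; // null = no HIPAA category mapping
  verdict: Verdict;
}

interface Counts {
  tp: number;
  fn: number;
  fp: number;
}

interface CategoryMetrics extends Counts {
  recall: number | null; // null when tp + fn === 0
  precision: number | null; // null when tp + fp === 0
  f1: number | null; // null when either input is null or both are 0
}

function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

function metricsFromCounts(c: Counts): CategoryMetrics {
  const recall = ratio(c.tp, c.tp + c.fn);
  const precision = ratio(c.tp, c.tp + c.fp);
  const f1 =
    recall === null || precision === null || recall + precision === 0
      ? null
      : (2 * precision * recall) / (precision + recall);
  return { tp: c.tp, fn: c.fn, fp: c.fp, recall, precision, f1 };
}

function aggregateByCategory(
  items: TaggedRedaction[],
): Record<string, CategoryMetrics> {
  const counts: Record<string, Counts> = {};
  for (const item of items) {
    // Unmapped items are still counted so false positives are not lost.
    const key = item.hipaaCategory ?? "UNMAPPED";
    const c = counts[key] ?? { tp: 0, fn: 0, fp: 0 };
    c[item.verdict] += 1;
    counts[key] = c;
  }
  const out: Record<string, CategoryMetrics> = {};
  for (const key of Object.keys(counts)) {
    const c = counts[key];
    if (c) out[key] = metricsFromCounts(c);
  }
  return out;
}
```

Under these assumptions, a category with 3 TP, 1 FN, and 1 FP yields recall 0.75 and precision 0.75, and a category with only false positives yields precision 0 and an undefined (null) recall.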

Architectural decision — `proving_ground` over inline-redactions

The dispatch brief assumed the harness would consume compliance_trace.fullRedactions inline. Grep of dual-sandbox-architecture/services/gateway/internal/api/proxy.go:451 showed fullRedactions is internal-only — never emitted in the response body. Pivoted to proving_ground mode (proxy.go:357-368 validation; proxy.go:1067-1080 evaluation emit) which returns ground_truth_evaluation with server-side TP/FN/FP verdicts from compareGroundTruth (ground_truth.go:69-138).

This is architecturally stronger for the compliance-buyer audience: the matching code that produces the published recall numbers ships in the live gateway, not in this research repo. The matching logic is therefore not authored alongside the publication, which is the arm's-length property compliance buyers care about.

Matching semantics: case-insensitive bidirectional value-containment with whitespace normalization (NOT span-exact overlap). Disclosed in papers/_template/SUMMARY.schema.json description so auditors reading a SUMMARY.json in isolation get the methodology. Paper 1 body prose will repeat the disclosure in Methods + Limits (Slice 4).
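The matching semantics described above can be paraphrased in TypeScript as follows. This is a sketch of the stated rule only (case-insensitive, bidirectional containment after whitespace normalization); the gateway's actual compareGroundTruth is Go code and may differ in detail, and the function names here are hypothetical.

```typescript
// Collapse runs of whitespace, trim, and lowercase before comparing.
function normalizeValue(value: string): string {
  return value.trim().replace(/\s+/g, " ").toLowerCase();
}

// Bidirectional containment: a match if either normalized value
// contains the other. Empty values never match.
function containmentMatch(truthValue: string, detectedValue: string): boolean {
  const truth = normalizeValue(truthValue);
  const detected = normalizeValue(detectedValue);
  if (truth.length === 0 || detected.length === 0) return false;
  return truth.indexOf(detected) !== -1 || detected.indexOf(truth) !== -1;
}
```

Containment is looser than span-exact overlap: under this sketch a two-character value such as "al" matches inside "Portugal", which illustrates why very short ground-truth values are filtered out before matching.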

Reviewer chain — all PASS at `f924cbd` after fix-up

| Reviewer | Pre-fix-up | Resolution |
| --- | --- | --- |
| bug-hunter-reviewer | 1 BLOCKER + 4 HIGH + 5 MED + 4 LOW + 5 INFO | BLOCKER (HIPAA mapping vocabulary) + all HIGH closed in fix-up `f924cbd`. MED/LOW/INFO carried to Slice 3 PRD. |
| claim-enforcement-guard | 0 BLOCKER + 1 MED | MED closed (README testimonial guardrail recovered). |
| regulator-validator | 8 PASS / 0 FAIL / 3 WARN | 1 WARN closed (SUMMARY.schema matching-semantics disclosure). 2 WARN deferred to Paper 1 body (Slice 4). |
| personal-info-leak-detector | 0 BLOCKER + 1 MED | MED closed (`lcr_live_test_` → `lcr_test_` rename; secret-scanner safety for public-flip). |

Acceptance gates

| Gate | Result |
| --- | --- |
| `pnpm install --frozen-lockfile` | PASS |
| `pnpm typecheck` + `pnpm typecheck:test` + `pnpm build` | PASS |
| `pnpm test` (40 tests: 12 Slice 1 + 28 Slice 2) | PASS |
| `pnpm run pipeline -- --rows=5 --mock` | PASS: 5 NDJSON records, well-formed cert URLs |
| `pnpm run collect-certs` | PASS: 5 CERTIFICATES.csv rows |
| `pnpm run compute-recall --redactions-source=mock` | PASS: oracle recall=1.0; `--miss-rate=0.3` produces recall=0.75 |
| Banned-literal sweep across 25 banned terms | PASS: 0 hits |

Carry-forwards for Slice 3 (deferred-by-decision, filed in fix-up commit body)

M1 NDJSON streaming writer (currently buffered in-memory; OOM-class risk on long runs)
M2 rate-limit/concurrency + 429 retry for live Anthropic upstream
M3 hard-fail on malformed ground-truth/transcription rows
M4 harden in-process JSON-Schema validator (or swap to ajv)
regulator WARN 1: fax/phone disclosure in Paper 1 body
regulator WARN 2: recall match-semantics in Paper 1 body Methods/Limits

Test plan

Codex round 1 substantive-PASS [N/N]
All acceptance gates re-run at PR merge (post-Codex-PASS)
Slice 3 dispatch prerequisites confirmed: real LUCAIRN_API_KEY, real upstream Anthropic key, rate-limit policy decision

🤖 Generated with Claude Code

…call, mocks

Adds the in-process methodology library Slice 2 needs:

- src/gateway-client.ts: typed wrapper around POST /api/v1/proxy/messages with mode=proving_ground. 2-retry exponential backoff with jitter on 5xx + connection errors; no retry on 4xx; per-request timeout (30s default, LUCAIRN_REQUEST_TIMEOUT_MS env-configurable). Env reads are call-time only, so module import is side-effect-free.
- src/redaction-extractor.ts: pure converter from gateway proving-ground matches/missed/extras into a flat ExtractedRedaction[] tagged with HIPAA Safe Harbor category + verdict (tp/fn/fp). Unmapped extras carry hipaa_category=null so the FP count is preserved while taxonomy drift is observable.
- src/hipaa-category-mapping.ts: explicit one-way map from Lucairn sanitizer internal taxonomy ([PERSON_N], [LOCATION_N], …) to the 18 HIPAA Safe Harbor categories (45 CFR § 164.514(b)(2)(i)). Placeholder-parsing mirrors gateway extractEntityTypes (proxy.go:1361-1395).
- src/recall.ts: two consumer paths. aggregateExtracted() consumes gateway-attested verdicts (the harness's live path; arm's-length property preserved because matching runs inside the gateway, not in code Lucairn authored alongside the publication). computeRecallFromSpans() implements the ≥50%-character-overlap span-matching the Slice 2 brief locks for any future raw-span inline surface. Both produce the same RecallSummary shape.
- src/mocks/gateway-fixtures.ts: deterministic mock builders for msw-backed unit tests + --mock smoke scripts. Configurable missRate + spuriousFpCount exercise recall paths against known oracles.
- src/index.ts: barrel exports for the public surface.
- package.json: adds msw ^2.7 devDependency. No new runtime deps.

Cite-back for gateway response shape: proxy.go:35-58 (request schema), proxy.go:361-373 (mode + activity validation), proxy.go:1068-1080 (ground_truth_evaluation emission), ground_truth.go:5-138 (result shape).
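The retry policy described for src/gateway-client.ts (two retries, exponential backoff with jitter, retry on 5xx and connection errors, never on 4xx) might look roughly like the sketch below. The base delay of 250 ms, the "full jitter" formula, and every name in the block are assumptions for illustration; only the retry count and the retryable/non-retryable split come from the commit message.

```typescript
const MAX_RETRIES = 2; // from the commit message
const BASE_DELAY_MS = 250; // assumed value, not taken from the harness

type AttemptResult =
  | { kind: "response"; status: number }
  | { kind: "network-error" }; // connection failure, abort, or timeout

// 5xx and connection-level failures are retry-eligible; 4xx is not.
function isRetryable(result: AttemptResult): boolean {
  if (result.kind === "network-error") return true;
  return result.status >= 500 && result.status <= 599;
}

// Full jitter (an assumed variant): uniform in [0, base * 2^attempt).
// `random` is injectable so the math can be asserted deterministically.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * BASE_DELAY_MS * 2 ** attempt);
}

// Runs one initial attempt plus up to MAX_RETRIES retries.
async function withRetry(
  attemptOnce: () => Promise<AttemptResult>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<AttemptResult> {
  let result = await attemptOnce();
  for (
    let attempt = 0;
    attempt < MAX_RETRIES && isRetryable(result);
    attempt++
  ) {
    await sleep(backoffDelayMs(attempt));
    result = await attemptOnce();
  }
  return result;
}
```

Injecting `random` and `sleep` keeps the backoff math testable without real timers. Note that a 429 is a 4xx and is therefore not retried under this policy.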

…MARY schema

Adds the three CLI scripts the Slice 2 harness needs:

- scripts/run-pipeline.ts: orchestrates per-row gateway calls via POST /api/v1/proxy/messages (mode=proving_ground). --mock mounts an msw fixture server in-process; --live is reserved for Slice 3 and refuses to start without the explicit gate. Writes raw NDJSON to papers/paper-1-healthcare/raw-results/<timestamp>.ndjson (or --output). Supports --rows / --truth / --subset / --gateway / --api-key / --miss-rate / --spurious-fp-count / --activity-id-prefix.
- scripts/collect-certs.ts: walks the NDJSON, extracts cert URL + summary URL + redaction count + overall verdict per row, emits CERTIFICATES.csv via src/csv.ts::emitCsv. Columns: row_index, cert_url, cert_id, summary_url, overall_verdict, redaction_count, latency_ms, timestamp_utc, error_code.
- scripts/compute-recall.ts: reads ground-truth JSONL + raw NDJSON (or re-runs the in-process mock when --redactions-source=mock), aggregates per-HIPAA-category recall / precision / F1 via aggregateExtracted, emits SUMMARY.json, validates against papers/_template/SUMMARY.schema.json in-process. Avoids a runtime dep on ajv via a minimal validator covering the schema subset used.

Also adds:

- papers/_template/SUMMARY.schema.json: Draft 2020-12 JSON Schema for SUMMARY.json. Enforces 18-category coverage in per_category, the four required overall fields, the RowBreakdown shape, and the schema_version / generator const fields. Reused by every paper in the program.
- papers/paper-1-healthcare/raw-results/.gitignore + .gitkeep: directory scaffold; per-run NDJSON is gitignored at the repo level (datasets/.gitignore line 17) but the per-paper sub-tree's own .gitignore locks it locally too.
- package.json: adds pipeline / collect-certs / compute-recall scripts.

End-to-end smoke (all PASS):

- pnpm run pipeline -- --rows=5 --mock --output=/tmp/slice2-smoke.ndjson
- pnpm run collect-certs -- --input=/tmp/slice2-smoke.ndjson --output=/tmp/slice2-CERTIFICATES.csv
- pnpm run compute-recall -- --truth=ground-truth.jsonl --redactions-source=mock --rows=5 --output=/tmp/slice2-SUMMARY.json
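As a rough illustration of the NDJSON → CERTIFICATES.csv step, the sketch below parses one JSON record per line and emits the nine columns listed in the commit message. It assumes, purely for illustration, that each NDJSON record already carries fields named after the CSV columns; the real record shape and the real src/csv.ts::emitCsv helper are not shown in this PR text, so `csvField` and `ndjsonToCsv` are hypothetical names.

```typescript
// Column order taken from the commit message above.
const COLUMNS = [
  "row_index",
  "cert_url",
  "cert_id",
  "summary_url",
  "overall_verdict",
  "redaction_count",
  "latency_ms",
  "timestamp_utc",
  "error_code",
] as const;

type Column = (typeof COLUMNS)[number];
// Assumed record shape: any column may be absent or null (e.g. failed rows).
type CertRecord = Partial<Record<Column, string | number | null>>;

// Quote a field only when it contains a comma, quote, or line break.
function csvField(value: string | number | null | undefined): string {
  if (value === null || value === undefined) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function ndjsonToCsv(ndjson: string): string {
  const records = ndjson
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as CertRecord);
  const body = records.map((record) =>
    COLUMNS.map((column) => csvField(record[column])).join(","),
  );
  return [COLUMNS.join(","), ...body].join("\r\n") + "\r\n";
}
```

Failed rows stay in the output with empty cert fields and a populated error_code, so no row is silently dropped.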

…docs update

Adds the three test files covering the new Slice 2 surface and updates README + RECIPE for the shipped state.

Tests (22 new, 34 total with Slice 1's 12):

- test/gateway-client.spec.ts (8 tests): msw-mocked. Locks the proving-ground request shape (mode, relink_response=false, activity_id pattern, ground_truth.transcription[] HIPAA-tagged annotations, x-api-key header). Verifies: success path returns a typed GatewayRowResult; retries on 5xx and recovers (exact backoff math asserted); does NOT retry on 4xx; fails-with-error after exhausting retry budget; abort/timeout is retry-eligible; extractCertUrls handles the missing-veil-hint case; construction validation refuses empty URL / empty key.
- test/redaction-extractor.spec.ts (9 tests): locks the placeholder parser against malformed inputs; verifies the HIPAA mapping covers the standard Presidio + Lucairn vocabulary (PERSON, LOCATION, DATE, PHONE_NUMBER, EMAIL_ADDRESS, US_SSN, IBAN, URL, IP_ADDRESS, CREDIT_CARD); every entry in LUCAIRN_TO_HIPAA maps to a valid HipaaCategory; extractFromEvaluation flattens matches/missed/extras into ExtractedRedaction[] with verdicts; unknown annotation_type from the gateway is tagged null (no silent widening); unmappedExtraTypes surfaces taxonomy drift.
- test/recall.spec.ts (5 tests): 5 rows, 22 entities, hand-tagged TP/FN/FP. Exact per-category recall/precision/F1 numbers asserted: NAME 5 TP / 1 FN → recall 5/6; EMAIL 2 TP → recall 1; DATE 3 TP / 1 FN / 1 FP → recall 0.75 precision 0.75; PHONE 0 TP / 2 FP → precision 0; GEO 4 TP / 1 FN → recall 0.8. Overall TP=15 FP=3 FN=3 → recall 15/18. Locks the SPAN_OVERLAP_THRESHOLD const at 0.5 with a regression test. computeRecallFromSpans is exercised with a single-row synthetic fixture covering exact-50%-overlap (matches), 100%-overlap (matches), 40%-overlap (FP + FN). Per-row order ascending by row_index asserted. Unmapped-category counts get a "no HIPAA category mapping" note.

Docs:

- README.md: appends a "Slice 2: Harness" section under Reproduce Paper 1 documenting the mock-only workflow, all three CLI commands with --rows=5 examples, the --miss-rate / --spurious-fp-count options, and the explicit "live gateway run lands in Slice 3" framing required by the PRD halt gate. Refines two pre-existing negative-disclaimer lines to avoid the locked banned literals "case study" + "testimonial" while preserving meaning.
- datasets/healthcare/RECIPE.md: flips the Slice-status timeline entry for Slice 2 from "pending" to "shipped (mock-only)", enumerates the Slice 2 source files, and updates the Slice 3 description.
- .gitignore: narrows `papers/*/raw-results/` to its contents and exempts the directory scaffold (`.gitignore` + `.gitkeep`) so the per-paper run-results directory exists in a fresh clone.

End-to-end smoke (all PASS):

- pnpm install --frozen-lockfile → exit 0
- pnpm typecheck → exit 0
- pnpm typecheck:test → exit 0
- pnpm build → exit 0
- pnpm test (34 tests across 6 files) → exit 0
- Banned-literal sweep → 0 hits
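The ≥50%-character-overlap rule that test/recall.spec.ts locks via SPAN_OVERLAP_THRESHOLD can be sketched as below. The commit message does not say which span length the overlap is divided by, so this sketch assumes the ground-truth span length as the denominator and half-open `[start, end)` character offsets; the `Span` type and function names are hypothetical.

```typescript
const SPAN_OVERLAP_THRESHOLD = 0.5; // value locked by the regression test

// Assumed representation: character offsets, end exclusive.
interface Span {
  start: number;
  end: number;
}

// Fraction of the ground-truth span covered by the detected span.
// Assumption: the denominator is the ground-truth length.
function overlapFraction(truth: Span, detected: Span): number {
  const truthLength = truth.end - truth.start;
  if (truthLength <= 0) return 0;
  const overlap =
    Math.min(truth.end, detected.end) - Math.max(truth.start, detected.start);
  return Math.max(0, overlap) / truthLength;
}

// The threshold is inclusive, so exactly 50% overlap counts as a match.
function spansMatch(truth: Span, detected: Span): boolean {
  return overlapFraction(truth, detected) >= SPAN_OVERLAP_THRESHOLD;
}
```

With a 10-character ground-truth span, a detection covering 5 characters matches, one covering all 10 matches, and one covering only 4 does not; in the last case the detection counts as a false positive and the ground-truth span as a false negative.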

- B1 (bug-hunter BLOCKER): rewrite hipaa-category-mapping table to match the live placeholder vocabulary from presidio_scan.py:31-58 (PERSON, EMAIL, PHONE, LOCATION, IBAN, CC, SSN, URL, DOB). ID and SECRET intentionally null-mapped (placeholder collapses multiple HIPAA categories; documented limitation surfaces as unmapped_extras). Update regression test to walk PRESIDIO_TO_PLACEHOLDER values + assert every value is mapped or explicitly null-mapped.
- H1 (bug-hunter HIGH): rewrite mock fixture PLACEHOLDER_FOR_CATEGORY to emit live-production placeholder shapes (no more synthetic [MEDICAL_RECORD_NUMBER_1] etc.). Add [ID_N] regression test in recall.spec.ts to exercise the unmapped-extras accounting path.
- H2 (bug-hunter HIGH): filter ground-truth annotations with value.trim().length < 3 in buildGroundTruth (containment-match safety; defensive against future Faker regression). Emit console.warn with dropped count only (never the dropped values).
- H3 (bug-hunter HIGH): validate SUMMARY.json BEFORE writeFile in compute-recall.ts, not after, so a bogus SUMMARY.json never lands on disk for downstream consumers.
- H4 (bug-hunter HIGH): plumb X-Upstream-Key header through gateway-client + run-pipeline --upstream-key flag for Slice 3 BYOK-per-request flow (proxy.go:349-354 gate). LUCAIRN_UPSTREAM_KEY env var fallback. Empty-string treated as absent. Help text + auth-modes table documented.
- claim-enforce MED: append "No attributed endorsement quotes" to README.md:15 to recover the testimonial guardrail dropped in the Slice 2 banned-literal sweep rephrase.
- personal-info-leak MED: rename lcr_live_test_* / lcr_live_mock_* to lcr_test_* / lcr_mock_* in test fixtures so the repo is safe for secret-scanner pass post-public-flip.
- regulator-validator WARN: add matching-semantics disclosure to papers/_template/SUMMARY.schema.json description so auditors reading SUMMARY.json in isolation cannot misinterpret containment recall as span-exact i2b2-style recall.

Deferred to Slice 3:

- M1 NDJSON streaming writer (lost-data crash protection)
- M2 rate-limit/concurrency + 429 retry policy for live Anthropic upstream
- M3 hard-fail on malformed ground-truth/transcription rows
- M4 hardening of the in-process JSON-Schema validator (or swap to ajv)
- M5 detection_rate empty-row contract test
- regulator WARN 1: fax/phone disclosure in Paper 1 body
- regulator WARN 2: recall match-semantics in Paper 1 body Methods/Limits
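The B1 and H1 fixes turn on the difference between a placeholder that is mapped, one that is known but deliberately null-mapped, and one that is not in the table at all. A minimal sketch of that three-way lookup follows. The eleven prefixes come from the commit message; the HIPAA category labels on the right-hand side (in particular `ACCOUNT`, `SSN`, and `URL`) and all function names are illustrative assumptions, not the contents of src/hipaa-category-mapping.ts.

```typescript
// Illustrative labels only; the real HipaaCategory union has 18 members.
type HipaaCategory =
  | "NAME"
  | "GEO"
  | "DATE"
  | "PHONE"
  | "EMAIL"
  | "SSN"
  | "ACCOUNT"
  | "URL";

// One-way table keyed by placeholder prefix. null = intentionally unmapped.
const PLACEHOLDER_TO_HIPAA: Record<string, HipaaCategory | null> = {
  PERSON: "NAME",
  EMAIL: "EMAIL",
  PHONE: "PHONE",
  LOCATION: "GEO",
  IBAN: "ACCOUNT", // assumed target category
  CC: "ACCOUNT", // assumed target category
  SSN: "SSN",
  URL: "URL",
  DOB: "DATE",
  ID: null, // collapses several HIPAA categories
  SECRET: null, // collapses several HIPAA categories
};

// "[PERSON_12]" -> "PERSON"; anything malformed -> null.
function parsePlaceholderPrefix(placeholder: string): string | null {
  const match = /^\[([A-Z][A-Z_]*)_(\d+)\]$/.exec(placeholder);
  return match ? (match[1] ?? null) : null;
}

type Lookup =
  | { status: "mapped"; category: HipaaCategory }
  | { status: "null-mapped" } // known prefix, deliberately no category
  | { status: "unknown" }; // not in the table: taxonomy drift

function lookupCategory(placeholder: string): Lookup {
  const prefix = parsePlaceholderPrefix(placeholder);
  if (
    prefix === null ||
    !Object.prototype.hasOwnProperty.call(PLACEHOLDER_TO_HIPAA, prefix)
  ) {
    return { status: "unknown" };
  }
  const category = PLACEHOLDER_TO_HIPAA[prefix];
  return category ? { status: "mapped", category } : { status: "null-mapped" };
}
```

Keeping "null-mapped" distinct from "unknown" lets a regression test assert that every live prefix is accounted for, while a synthetic shape such as [MEDICAL_RECORD_NUMBER_1] is reported as drift instead of being silently accepted.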

…set rationale)

- [8] FAIL: --upstream-key help table listed 2 unsupported auth modes ("not supported by this harness") and omitted the --mock path entirely. Rewrote the table to enumerate the 3 actually-supported modes: --mock (no auth), --live + --api-key (non-BYOK), --live + --api-key + --upstream-key (BYOK-per-request, cite proxy.go:349-354 gate).
- [21] FAIL: CERTIFICATES.csv ships 9 columns vs the brief's 7 minimum. The 2 extensions (summary_url, error_code) are intentional: summary_url saves readers a URL-construction step; error_code makes the paper appendix honest about which rows failed instead of silently dropping them. Documented the rationale inline in collect-certs.ts before the headers array. All 7 brief-required columns remain present in declaration order. Treating this as effective-PASS at the orchestrator level: brief spec was a minimum, not an exclusive list.

No code-behavior changes. typecheck/build/test all green at HEAD.

Declade added 5 commits May 17, 2026 11:22

Declade merged commit 233733f into main May 17, 2026

Declade deleted the feat/slice-2-harness branch May 17, 2026 10:17
