feat(slice-2): harness + recall + cert collection (mock-only) #1
Merged
…call, mocks

Adds the in-process methodology library Slice 2 needs:
- src/gateway-client.ts — typed wrapper around POST /api/v1/proxy/messages with mode=proving_ground. 2-retry exponential backoff with jitter on 5xx + connection errors; no retry on 4xx; per-request timeout (30s default, LUCAIRN_REQUEST_TIMEOUT_MS env-configurable). Env reads are call-time only — module import is side-effect-free.
- src/redaction-extractor.ts — pure converter from gateway proving-ground matches/missed/extras into a flat ExtractedRedaction[] tagged with HIPAA Safe Harbor category + verdict (tp/fn/fp). Unmapped extras carry hipaa_category=null so the FP count is preserved while taxonomy drift is observable.
- src/hipaa-category-mapping.ts — explicit one-way map from Lucairn sanitizer internal taxonomy ([PERSON_N], [LOCATION_N], …) to the 18 HIPAA Safe Harbor categories (45 CFR § 164.514(b)(2)(i)). Placeholder-parsing mirrors gateway extractEntityTypes (proxy.go:1361-1395).
- src/recall.ts — two consumer paths: aggregateExtracted() consumes gateway-attested verdicts (the harness's live path; arm's-length property preserved because matching runs inside the gateway, not in code Lucairn authored alongside the publication); computeRecallFromSpans() implements the ≥50%-character-overlap span-matching the Slice 2 brief locks for any future raw-span inline surface. Both produce the same RecallSummary shape.
- src/mocks/gateway-fixtures.ts — deterministic mock builders for msw-backed unit tests + --mock smoke scripts. Configurable missRate + spuriousFpCount exercise recall paths against known oracles.
- src/index.ts — barrel exports for the public surface.
- package.json — adds msw ^2.7 devDependency. No new runtime deps.

Cite-back for gateway response shape: proxy.go:35-58 (request schema), proxy.go:361-373 (mode + activity validation), proxy.go:1068-1080 (ground_truth_evaluation emission), ground_truth.go:5-138 (result shape).
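The retry policy described for src/gateway-client.ts (two retries, exponential backoff with jitter, retry on 5xx and connection errors, no retry on 4xx) could look roughly like the sketch below. It is not the code in this PR: `RetryPolicy`, `isRetryableStatus`, `backoffDelayMs` and `withRetry` are hypothetical names, the "full jitter" formula is an assumption (the commit only says "exponential backoff with jitter"), and the per-request timeout is left to the injected `send` function, which is expected to reject when the timeout expires.

```typescript
// Hypothetical sketch of a retry policy; names and the jitter formula are assumptions.
interface RetryPolicy {
  maxRetries: number; // 2 in the commit message, i.e. up to 3 attempts in total
  baseDelayMs: number; // assumed base delay; the real value is not stated
  random?: () => number; // injectable for deterministic tests
  sleep?: (ms: number) => Promise<void>; // injectable for deterministic tests
}

// Only 5xx responses are retried; 4xx responses are returned to the caller as-is.
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Assumed "full jitter": a uniform delay in [0, baseDelayMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseDelayMs * 2 ** attempt);
}

// `send` performs one request and must reject on connection errors and timeouts.
async function withRetry<T extends { status: number }>(
  send: () => Promise<T>,
  policy: RetryPolicy,
): Promise<T> {
  const sleep =
    policy.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastFailure: unknown = new Error("no attempt made");
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      const response = await send();
      if (!isRetryableStatus(response.status)) return response;
      lastFailure = new Error(`gateway responded ${response.status}`);
    } catch (error) {
      lastFailure = error; // connection error or timeout: eligible for retry
    }
    if (attempt < policy.maxRetries) {
      await sleep(backoffDelayMs(attempt, policy.baseDelayMs, policy.random));
    }
  }
  throw lastFailure; // retry budget exhausted
}
```

Injecting `random` and `sleep` is one way to make the backoff math assertable in a unit test without real timers; whether the actual client does this is not stated in the commit.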
…MARY schema
Adds the three CLI scripts the Slice 2 harness needs:
- scripts/run-pipeline.ts — orchestrates per-row gateway calls via
POST /api/v1/proxy/messages (mode=proving_ground). --mock mounts an msw
fixture server in-process; --live is reserved for Slice 3 and refuses to
start without the explicit gate. Writes raw NDJSON to
papers/paper-1-healthcare/raw-results/<timestamp>.ndjson (or --output).
Supports --rows / --truth / --subset / --gateway / --api-key / --miss-rate
/ --spurious-fp-count / --activity-id-prefix.
- scripts/collect-certs.ts — walks the NDJSON, extracts cert URL + summary
URL + redaction count + overall verdict per row, emits CERTIFICATES.csv
via src/csv.ts::emitCsv. Columns:
row_index, cert_url, cert_id, summary_url, overall_verdict,
redaction_count, latency_ms, timestamp_utc, error_code
- scripts/compute-recall.ts — reads ground-truth JSONL + raw NDJSON (or
re-runs the in-process mock when --redactions-source=mock), aggregates
per-HIPAA-category recall / precision / F1 via aggregateExtracted, emits
SUMMARY.json, validates against papers/_template/SUMMARY.schema.json
in-process. Avoids a runtime dep on ajv via a minimal validator covering
the schema subset used.
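The nine-column CERTIFICATES.csv layout listed above can be illustrated with a small emitter. This is a sketch under assumptions: `CertRow`, `csvField` and `toCsv` are hypothetical names and do not reflect the signature of `src/csv.ts::emitCsv`, quoting follows RFC 4180 conventions, and rendering `null` as an empty field is an assumption.

```typescript
// Column order as listed in the commit message for CERTIFICATES.csv.
const CERT_COLUMNS = [
  "row_index",
  "cert_url",
  "cert_id",
  "summary_url",
  "overall_verdict",
  "redaction_count",
  "latency_ms",
  "timestamp_utc",
  "error_code",
] as const;

type CertColumn = (typeof CERT_COLUMNS)[number];
type CertRow = Record<CertColumn, string | number | null>;

// Quote a field only when it contains a comma, a quote or a line break;
// embedded quotes are doubled. null becomes an empty field (assumption).
function csvField(value: string | number | null): string {
  if (value === null) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Emit the header followed by one line per row, in declaration order.
function toCsv(rows: CertRow[]): string {
  const lines = [CERT_COLUMNS.join(",")];
  for (const row of rows) {
    lines.push(CERT_COLUMNS.map((column) => csvField(row[column])).join(","));
  }
  return lines.join("\n") + "\n";
}
```

Keeping `error_code` as a column, rather than dropping failed rows, means a failed row still produces a line with empty cert fields.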
Also adds:
- papers/_template/SUMMARY.schema.json — Draft 2020-12 JSON Schema for
SUMMARY.json. Enforces 18-category coverage in per_category, the four
required overall fields, the RowBreakdown shape, and the schema_version /
generator const fields. Reused by every paper in the program.
- papers/paper-1-healthcare/raw-results/.gitignore + .gitkeep — directory
scaffold; per-run NDJSON is gitignored at the repo level
(datasets/.gitignore line 17) but the per-paper sub-tree's own .gitignore
locks it locally too.
- package.json — adds pipeline / collect-certs / compute-recall scripts.
End-to-end smoke (all PASS):
pnpm run pipeline -- --rows=5 --mock --output=/tmp/slice2-smoke.ndjson
pnpm run collect-certs -- --input=/tmp/slice2-smoke.ndjson --output=/tmp/slice2-CERTIFICATES.csv
pnpm run compute-recall -- --truth=ground-truth.jsonl --redactions-source=mock --rows=5 --output=/tmp/slice2-SUMMARY.json
…docs update

Adds the three test files covering the new Slice 2 surface and updates README + RECIPE for the shipped state.

Tests (22 new, 34 total with Slice 1's 12):
- test/gateway-client.spec.ts (8 tests) — msw-mocked. Locks the proving-ground request shape (mode, relink_response=false, activity_id pattern, ground_truth.transcription[] HIPAA-tagged annotations, x-api-key header). Verifies: success path returns a typed GatewayRowResult; retries on 5xx and recovers (exact backoff math asserted); does NOT retry on 4xx; fails-with-error after exhausting retry budget; abort/timeout is retry-eligible; extractCertUrls handles the missing-veil-hint case; construction validation refuses empty URL / empty key.
- test/redaction-extractor.spec.ts (9 tests) — locks the placeholder parser against malformed inputs; verifies the HIPAA mapping covers the standard Presidio + Lucairn vocabulary (PERSON, LOCATION, DATE, PHONE_NUMBER, EMAIL_ADDRESS, US_SSN, IBAN, URL, IP_ADDRESS, CREDIT_CARD); every entry in LUCAIRN_TO_HIPAA maps to a valid HipaaCategory; extractFromEvaluation flattens matches/missed/extras into ExtractedRedaction[] with verdicts; unknown annotation_type from the gateway is tagged null (no silent widening); unmappedExtraTypes surfaces taxonomy drift.
- test/recall.spec.ts (5 tests) — 5 rows, 22 entities, hand-tagged TP/FN/FP. Exact per-category recall/precision/F1 numbers asserted: NAME 5 TP / 1 FN → recall 5/6; EMAIL 2 TP → recall 1; DATE 3 TP / 1 FN / 1 FP → recall 0.75 precision 0.75; PHONE 0 TP / 2 FP → precision 0; GEO 4 TP / 1 FN → recall 0.8. Overall TP=15 FP=3 FN=3 → recall 15/18. Locks the SPAN_OVERLAP_THRESHOLD const at 0.5 with a regression test. computeRecallFromSpans is exercised with a single-row synthetic fixture covering exact-50%-overlap (matches), 100%-overlap (matches), 40%-overlap (FP + FN). Per-row order ascending by row_index asserted. Unmapped-category counts get a "no HIPAA category mapping" note.

Docs:
- README.md — appends a Slice 2 — Harness section under Reproduce Paper 1 documenting the mock-only workflow, all three CLI commands with --rows=5 examples, the --miss-rate / --spurious-fp-count options, and the explicit "live gateway run lands in Slice 3" framing required by the PRD halt gate. Refines two pre-existing negative-disclaimer lines to avoid the locked banned literals "case study" + "testimonial" while preserving meaning.
- datasets/healthcare/RECIPE.md — flips the Slice-status timeline entry for Slice 2 from "pending" to "shipped (mock-only)", enumerates the Slice 2 source files, and updates the Slice 3 description.
- .gitignore — narrows `papers/*/raw-results/` to its contents and exempts the directory scaffold (`.gitignore` + `.gitkeep`) so the per-paper run-results directory exists in a fresh clone.

End-to-end smoke (all PASS):
pnpm install --frozen-lockfile → exit 0
pnpm typecheck → exit 0
pnpm typecheck:test → exit 0
pnpm build → exit 0
pnpm test (34 tests across 6 files) → exit 0
Banned-literal sweep → 0 hits
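
The per-category figures asserted in test/recall.spec.ts follow from the usual definitions: recall = TP / (TP + FN), precision = TP / (TP + FP), F1 = 2TP / (2TP + FP + FN). The sketch below recomputes them from the tallies quoted in the commit message. `Counts`, `Metrics`, `ratio` and `metricsFor` are hypothetical names, and returning `null` for a zero denominator is an assumption; the commit does not say how src/recall.ts reports an undefined ratio. The overall counts are taken as quoted rather than summed from the five categories shown.

```typescript
interface Counts {
  tp: number;
  fp: number;
  fn: number;
}

interface Metrics {
  recall: number | null;
  precision: number | null;
  f1: number | null;
}

// Assumption: an undefined ratio (zero denominator) is reported as null.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

function metricsFor(counts: Counts): Metrics {
  return {
    recall: ratio(counts.tp, counts.tp + counts.fn),
    precision: ratio(counts.tp, counts.tp + counts.fp),
    f1: ratio(2 * counts.tp, 2 * counts.tp + counts.fp + counts.fn),
  };
}

// Tallies as quoted for the hand-tagged fixture; unlisted counts are taken as 0.
const tallies: Record<string, Counts> = {
  NAME: { tp: 5, fp: 0, fn: 1 },
  EMAIL: { tp: 2, fp: 0, fn: 0 },
  DATE: { tp: 3, fp: 1, fn: 1 },
  PHONE: { tp: 0, fp: 2, fn: 0 },
  GEO: { tp: 4, fp: 0, fn: 1 },
};

for (const [category, counts] of Object.entries(tallies)) {
  console.log(category, metricsFor(counts));
}

// Overall counts exactly as stated in the commit message.
console.log("overall", metricsFor({ tp: 15, fp: 3, fn: 3 }));
```

PHONE shows why the zero-denominator case matters: with no ground-truth phone entities, recall is undefined while precision is a well-defined 0.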
- B1 (bug-hunter BLOCKER): rewrite hipaa-category-mapping table to match the live placeholder vocabulary from presidio_scan.py:31-58 (PERSON, EMAIL, PHONE, LOCATION, IBAN, CC, SSN, URL, DOB). ID and SECRET intentionally null-mapped (placeholder collapses multiple HIPAA categories; documented limitation surfaces as unmapped_extras). Update regression test to walk PRESIDIO_TO_PLACEHOLDER values + assert every value is mapped or explicitly null-mapped.
- H1 (bug-hunter HIGH): rewrite mock fixture PLACEHOLDER_FOR_CATEGORY to emit live-production placeholder shapes (no more synthetic [MEDICAL_RECORD_NUMBER_1] etc.). Add [ID_N] regression test in recall.spec.ts to exercise the unmapped-extras accounting path.
- H2 (bug-hunter HIGH): filter ground-truth annotations with value.trim().length < 3 in buildGroundTruth (containment-match safety; defensive against future Faker regression). Emit console.warn with dropped count only (never the dropped values).
- H3 (bug-hunter HIGH): validate SUMMARY.json BEFORE writeFile in compute-recall.ts, not after, so a bogus SUMMARY.json never lands on disk for downstream consumers.
- H4 (bug-hunter HIGH): plumb X-Upstream-Key header through gateway-client + run-pipeline --upstream-key flag for Slice 3 BYOK-per-request flow (proxy.go:349-354 gate). LUCAIRN_UPSTREAM_KEY env var fallback. Empty-string treated as absent. Help text + auth-modes table documented.
- claim-enforce MED: append "No attributed endorsement quotes" to README.md:15 to recover the testimonial guardrail dropped in the Slice 2 banned-literal sweep rephrase.
- personal-info-leak MED: rename lcr_live_test_* / lcr_live_mock_* to lcr_test_* / lcr_mock_* in test fixtures so the repo is safe for secret-scanner pass post-public-flip.
- regulator-validator WARN: add matching-semantics disclosure to papers/_template/SUMMARY.schema.json description so auditors reading SUMMARY.json in isolation cannot misinterpret containment recall as span-exact i2b2-style recall.
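
The H2 filter and the matching-semantics disclosure are related: under case-insensitive, bidirectional, whitespace-normalized value containment, a very short ground-truth value is contained in almost any longer string and would match spuriously. The sketch below illustrates both pieces in TypeScript. It is an illustration only: the real matcher is the gateway's Go function compareGroundTruth, and `Annotation`, `normalize`, `containmentMatch` and `dropShortAnnotations` are hypothetical names, as is the rule that an empty value never matches.

```typescript
// Hypothetical annotation shape; the real ground-truth type may carry more fields.
interface Annotation {
  type: string;
  value: string;
}

const MIN_VALUE_LENGTH = 3; // threshold stated in H2

// Trim, collapse whitespace runs to a single space, and lowercase.
function normalize(value: string): string {
  return value.trim().replace(/\s+/g, " ").toLowerCase();
}

// Bidirectional containment: either normalized value may contain the other.
function containmentMatch(truth: string, detected: string): boolean {
  const a = normalize(truth);
  const b = normalize(detected);
  if (a === "" || b === "") return false; // assumption: empty values never match
  return a.includes(b) || b.includes(a);
}

// Drop annotations whose trimmed value is shorter than the threshold.
// The warning reports the dropped count only, never the dropped values.
function dropShortAnnotations(
  annotations: Annotation[],
  warn: (message: string) => void = console.warn,
): Annotation[] {
  const kept = annotations.filter(
    (annotation) => annotation.value.trim().length >= MIN_VALUE_LENGTH,
  );
  const dropped = annotations.length - kept.length;
  if (dropped > 0) {
    warn(
      `dropped ${dropped} ground-truth annotation(s) with fewer than ${MIN_VALUE_LENGTH} characters`,
    );
  }
  return kept;
}
```

For example, `containmentMatch("Al", "Alice Walker")` is true under these semantics, which is the kind of accidental hit the length filter is meant to prevent.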

Deferred to Slice 3:
- M1 NDJSON streaming writer (lost-data crash protection)
- M2 rate-limit/concurrency + 429 retry policy for live Anthropic upstream
- M3 hard-fail on malformed ground-truth/transcription rows
- M4 hardening of the in-process JSON-Schema validator (or swap to ajv)
- M5 detection_rate empty-row contract test
- regulator WARN 1: fax/phone disclosure in Paper 1 body
- regulator WARN 2: recall match-semantics in Paper 1 body Methods/Limits
…set rationale)
- [8] FAIL: --upstream-key help table listed 2 unsupported auth modes
("not supported by this harness") and omitted the --mock path entirely.
Rewrote the table to enumerate the 3 actually-supported modes:
--mock (no auth), --live + --api-key (non-BYOK), --live + --api-key +
--upstream-key (BYOK-per-request, cite proxy.go:349-354 gate).
- [21] FAIL: CERTIFICATES.csv ships 9 columns vs the brief's 7 minimum.
The 2 extensions (summary_url, error_code) are intentional:
summary_url saves readers a URL-construction step; error_code makes
the paper appendix honest about which rows failed instead of silently
dropping them. Documented the rationale inline in collect-certs.ts
before the headers array. All 7 brief-required columns remain present
in declaration order. Treating this as effective-PASS at the
orchestrator level: brief spec was a minimum, not an exclusive list.
No code-behavior changes. typecheck/build/test all green at HEAD.
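
The three supported auth modes in the rewritten help table amount to a small header-selection rule. A hedged sketch follows: `AuthInput`, `nonEmpty` and `buildAuthHeaders` are hypothetical names, the environment is passed in as a plain object instead of being read from the process, and two behaviours are assumptions not stated in the commits: that a CLI flag takes precedence over its environment variable, and that mock mode sends no auth headers at all.

```typescript
interface AuthInput {
  mock: boolean;
  apiKey?: string; // value of --api-key
  upstreamKey?: string; // value of --upstream-key
  env?: Record<string, string | undefined>; // e.g. a copy of the process environment
}

// An empty string is treated as absent, as the fix-up commit describes.
function nonEmpty(value: string | undefined): string | undefined {
  return value !== undefined && value.length > 0 ? value : undefined;
}

function buildAuthHeaders(input: AuthInput): Record<string, string> {
  if (input.mock) return {}; // --mock: no auth (assumed to mean no auth headers)

  const env = input.env ?? {};
  // Assumption: the flag wins over the environment variable.
  const apiKey = nonEmpty(input.apiKey) ?? nonEmpty(env["LUCAIRN_API_KEY"]);
  if (apiKey === undefined) {
    throw new Error("--live requires an API key");
  }

  const headers: Record<string, string> = { "x-api-key": apiKey };

  // BYOK-per-request: forward the upstream key only when one is present.
  const upstreamKey =
    nonEmpty(input.upstreamKey) ?? nonEmpty(env["LUCAIRN_UPSTREAM_KEY"]);
  if (upstreamKey !== undefined) {
    headers["X-Upstream-Key"] = upstreamKey;
  }
  return headers;
}
```

Under this sketch, passing `--upstream-key ""` with no environment fallback behaves exactly like the non-BYOK mode.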
Summary

Slice 2 of the Lucairn Research Program (Paper 1, healthcare). Adds the harness that calls the Lucairn gateway in `proving_ground` mode, collects cert URLs, and computes per-HIPAA-Safe-Harbor-category recall against the 500-row Measurement-B subset from Slice 1.

Mock-only this slice. Live `gateway.lucairn.eu` calls land in Slice 3 per the locked halt gate in the PRD.

Active PRD (Status: Locked): `Opus Advisor/specs/prd-2026-05-17-research-program.md` — Slice 2 scope at lines 141-145.

What's in this PR

- `src/gateway-client.ts` — typed wrapper around `/api/v1/proxy/messages` in `proving_ground` mode. Env-driven (`LUCAIRN_GATEWAY_URL` / `LUCAIRN_API_KEY` / `LUCAIRN_UPSTREAM_KEY`). `X-Upstream-Key` plumbed for Slice 3's BYOK-per-request profiles (gate at `dual-sandbox-architecture/services/gateway/internal/api/proxy.go:349-354`). Ground-truth-annotation `value.length < 3` filter for containment-match safety.
- `src/redaction-extractor.ts` — pure: gateway response → `ExtractedRedaction[]` tagged with HIPAA category.
- `src/recall.ts` — pure: ground truth + extracted → `RecallSummary` with per-HIPAA-category recall/precision/F1.
- `src/hipaa-category-mapping.ts` — explicit one-way table keyed by the 11 live placeholder prefixes observed in `dual-sandbox-architecture/services/sanitizer/presidio_scan.py:31-58` (`PERSON`, `EMAIL`, `PHONE`, `LOCATION`, `IBAN`, `CC`, `SSN`, `URL`, `DOB`; `ID` and `SECRET` intentionally null-mapped because the sanitizer collapses multiple HIPAA categories into them — documented limitation surfaces in `unmapped_extras` accounting). Regression-locked against drift via a test that walks the canonical `PRESIDIO_TO_PLACEHOLDER` map.
- `src/mocks/gateway-fixtures.ts` — msw fixtures matching the live `ground_truth_evaluation` shape.
- `scripts/run-pipeline.ts` — per-row gateway calls (mock or live), NDJSON output. Supports `--rows=N`, `--mock`, `--upstream-key`.
- `scripts/collect-certs.ts` — NDJSON → `CERTIFICATES.csv`.
- `scripts/compute-recall.ts` — aggregates per-HIPAA-category recall against ground truth. Validates `SUMMARY.json` before `writeFile`.

Architectural decision — `proving_ground` over inline-redactions

The dispatch brief assumed the harness would consume `compliance_trace.fullRedactions` inline. Grep of `dual-sandbox-architecture/services/gateway/internal/api/proxy.go:451` showed `fullRedactions` is internal-only — never emitted in the response body. Pivoted to `proving_ground` mode (`proxy.go:357-368` validation; `proxy.go:1067-1080` evaluation emit) which returns `ground_truth_evaluation` with server-side TP/FN/FP verdicts from `compareGroundTruth` (`ground_truth.go:69-138`).

Architecturally stronger for the compliance-buyer audience: the matching code ships in the live gateway, not in this research repo. The publisher (Lucairn) does not author the matching code that produces the published recall numbers — the arm's-length property compliance buyers care about.

Matching semantics: case-insensitive bidirectional value-containment with whitespace normalization (NOT span-exact overlap). Disclosed in `papers/_template/SUMMARY.schema.json` description so auditors reading a SUMMARY.json in isolation get the methodology. Paper 1 body prose will repeat the disclosure in Methods + Limits (Slice 4).

Reviewer chain — all PASS at `f924cbd`

…after fix-up `f924cbd`. MED/LOW/INFO carried to Slice 3 PRD.
…`lcr_live_test_*` → `lcr_test_*` rename — secret-scanner safety for public-flip).

Acceptance gates

- `pnpm install --frozen-lockfile`
- `pnpm typecheck` + `pnpm typecheck:test` + `pnpm build`
- `pnpm test` (40 tests: 12 Slice 1 + 28 Slice 2)
- `pnpm run pipeline -- --rows=5 --mock`
- `pnpm run collect-certs`
- `pnpm run compute-recall --redactions-source=mock`
- `--miss-rate=0.3` produces recall=0.75

Carry-forwards for Slice 3 (deferred-by-decision, filed in fix-up commit body)

Test plan

…`LUCAIRN_API_KEY`, real upstream Anthropic key, rate-limit policy decision

🤖 Generated with Claude Code