feat(slice-2): harness + recall + cert collection (mock-only)#1

Merged: Declade merged 5 commits into main from feat/slice-2-harness on May 17, 2026

@Declade Declade commented May 17, 2026

Summary

Slice 2 of the Lucairn Research Program (Paper 1, healthcare). Adds the harness that calls the Lucairn gateway in proving_ground mode, collects cert URLs, and computes per-HIPAA-Safe-Harbor-category recall against the 500-row Measurement-B subset from Slice 1.

Mock-only this slice. Live gateway.lucairn.eu calls land in Slice 3 per the locked halt gate in the PRD.

Active PRD (Status: Locked): Opus Advisor/specs/prd-2026-05-17-research-program.md — Slice 2 scope at lines 141-145.

What's in this PR

  • src/gateway-client.ts — typed wrapper around /api/v1/proxy/messages in proving_ground mode. Env-driven (LUCAIRN_GATEWAY_URL / LUCAIRN_API_KEY / LUCAIRN_UPSTREAM_KEY). X-Upstream-Key is plumbed for Slice 3's BYOK-per-request profiles (gate at dual-sandbox-architecture/services/gateway/internal/api/proxy.go:349-354). Ground-truth annotations with value.trim().length < 3 are filtered out for containment-match safety.
  • src/redaction-extractor.ts — pure: gateway response → ExtractedRedaction[] tagged with HIPAA category.
  • src/recall.ts — pure: ground truth + extracted → RecallSummary with per-HIPAA-category recall/precision/F1.
  • src/hipaa-category-mapping.ts — explicit one-way table keyed by the 11 live placeholder prefixes observed in dual-sandbox-architecture/services/sanitizer/presidio_scan.py:31-58 (PERSON, EMAIL, PHONE, LOCATION, IBAN, CC, SSN, URL, DOB; ID and SECRET intentionally null-mapped because the sanitizer collapses multiple HIPAA categories into them — documented limitation surfaces in unmapped_extras accounting). Regression-locked against drift via a test that walks the canonical PRESIDIO_TO_PLACEHOLDER map.
  • src/mocks/gateway-fixtures.ts — msw fixtures matching the live ground_truth_evaluation shape.
  • scripts/run-pipeline.ts — per-row gateway calls (mock or live), NDJSON output. Supports --rows=N, --mock, --upstream-key.
  • scripts/collect-certs.ts — NDJSON → CERTIFICATES.csv.
  • scripts/compute-recall.ts — aggregates per-HIPAA-category recall against ground truth. Validates SUMMARY.json before writeFile.
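
The one-way placeholder-to-category table described for src/hipaa-category-mapping.ts can be pictured with a minimal sketch. The 11 prefixes come from the list above; the category labels, the target chosen for each prefix (IBAN and CC in particular), and the function names are assumptions for illustration, not the repo's actual exports.

```typescript
// Hypothetical category labels; the repo's HipaaCategory type may name these differently.
type HipaaCategory =
  | "NAME" | "EMAIL" | "PHONE" | "GEO" | "ACCOUNT_NUMBER" | "SSN" | "URL" | "DATE";

// One-way table keyed by placeholder prefix. The target categories are assumed
// for illustration; null marks prefixes that are deliberately left unmapped.
const PREFIX_TO_HIPAA: Record<string, HipaaCategory | null> = {
  PERSON: "NAME",
  EMAIL: "EMAIL",
  PHONE: "PHONE",
  LOCATION: "GEO",
  IBAN: "ACCOUNT_NUMBER",
  CC: "ACCOUNT_NUMBER",
  SSN: "SSN",
  URL: "URL",
  DOB: "DATE",
  ID: null, // collapses several HIPAA categories, so no single mapping is safe
  SECRET: null,
};

// Extracts "PERSON" from "[PERSON_1]"; returns null for anything malformed.
function parsePlaceholderPrefix(placeholder: string): string | null {
  const match = /^\[([A-Z]+(?:_[A-Z]+)*)_(\d+)\]$/.exec(placeholder);
  return match?.[1] ?? null;
}

// Returns the mapped category, or null for null-mapped and unknown prefixes alike.
function hipaaCategoryFor(placeholder: string): HipaaCategory | null {
  const prefix = parsePlaceholderPrefix(placeholder);
  if (prefix === null) return null;
  return PREFIX_TO_HIPAA[prefix] ?? null;
}
```

In this sketch a null result for [ID_N] or [SECRET_N] is what would feed the unmapped_extras accounting: the detection still counts, but it is not credited to any one HIPAA category.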

Architectural decision — proving_ground over inline-redactions

The dispatch brief assumed the harness would consume compliance_trace.fullRedactions inline. Grep of dual-sandbox-architecture/services/gateway/internal/api/proxy.go:451 showed fullRedactions is internal-only — never emitted in the response body. Pivoted to proving_ground mode (proxy.go:357-368 validation; proxy.go:1067-1080 evaluation emit) which returns ground_truth_evaluation with server-side TP/FN/FP verdicts from compareGroundTruth (ground_truth.go:69-138).

This is architecturally stronger for the compliance-buyer audience: the matching code ships in the live gateway, not in this research repo. The code that produces the published recall numbers is therefore not authored alongside the publication, which is the arm's-length property compliance buyers care about.

Matching semantics: case-insensitive bidirectional value-containment with whitespace normalization (NOT span-exact overlap). Disclosed in papers/_template/SUMMARY.schema.json description so auditors reading a SUMMARY.json in isolation get the methodology. Paper 1 body prose will repeat the disclosure in Methods + Limits (Slice 4).
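
To make the matching semantics concrete, here is a minimal TypeScript sketch of case-insensitive bidirectional value-containment with whitespace normalization. The real matcher is compareGroundTruth inside the gateway; the function names and the exact normalization steps below are assumptions, so treat this as an illustration of the idea rather than a port.

```typescript
// Collapse runs of whitespace, trim the ends, and lowercase (assumed normalization).
function normalizeValue(value: string): string {
  return value.trim().replace(/\s+/g, " ").toLowerCase();
}

// Bidirectional containment: a match if either normalized value contains the other.
// Empty values never match, since the empty string is contained in everything.
function containmentMatch(truthValue: string, detectedValue: string): boolean {
  const truth = normalizeValue(truthValue);
  const detected = normalizeValue(detectedValue);
  if (truth.length === 0 || detected.length === 0) return false;
  return truth.includes(detected) || detected.includes(truth);
}
```

Under these semantics a detection of "Dr. John Smith" matches a ground-truth value of "john  smith" even though the character spans differ, which is why the result should not be read as span-exact recall.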

Reviewer chain — all PASS at f924cbd after fix-up

| Reviewer | Pre-fix-up | Resolution |
| --- | --- | --- |
| bug-hunter-reviewer | 1 BLOCKER + 4 HIGH + 5 MED + 4 LOW + 5 INFO | BLOCKER (HIPAA mapping vocabulary) + all HIGH closed in fix-up f924cbd. MED/LOW/INFO carried to Slice 3 PRD. |
| claim-enforcement-guard | 0 BLOCKER + 1 MED | MED closed (README testimonial guardrail recovered). |
| regulator-validator | 8 PASS / 0 FAIL / 3 WARN | 1 WARN closed (SUMMARY.schema matching-semantics disclosure). 2 WARN deferred to Paper 1 body (Slice 4). |
| personal-info-leak-detector | 0 BLOCKER + 1 MED | MED closed (`lcr_live_test_*` → `lcr_test_*` rename; secret-scanner safety for public-flip). |

Acceptance gates

| Gate | Result |
| --- | --- |
| pnpm install --frozen-lockfile | PASS |
| pnpm typecheck + pnpm typecheck:test + pnpm build | PASS |
| pnpm test (40 tests: 12 Slice 1 + 28 Slice 2) | PASS |
| pnpm run pipeline -- --rows=5 --mock | PASS (5 NDJSON records, well-formed cert URLs) |
| pnpm run collect-certs | PASS (5 CERTIFICATES.csv rows) |
| pnpm run compute-recall --redactions-source=mock | PASS (oracle recall=1.0; --miss-rate=0.3 produces recall=0.75) |
| Banned-literal sweep across 25 banned terms | PASS (0 hits) |

Carry-forwards for Slice 3 (deferred-by-decision, filed in fix-up commit body)

  • M1 NDJSON streaming writer (currently buffered in-memory; OOM-class risk on long runs)
  • M2 rate-limit/concurrency + 429 retry for live Anthropic upstream
  • M3 hard-fail on malformed ground-truth/transcription rows
  • M4 harden in-process JSON-Schema validator (or swap to ajv)
  • regulator WARN 1: fax/phone disclosure in Paper 1 body
  • regulator WARN 2: recall match-semantics in Paper 1 body Methods/Limits

Test plan

  • Codex round 1 substantive-PASS [N/N]
  • All acceptance gates re-run at PR merge (post-Codex-PASS)
  • Slice 3 dispatch prerequisites confirmed: real LUCAIRN_API_KEY, real upstream Anthropic key, rate-limit policy decision

🤖 Generated with Claude Code

Declade added 5 commits May 17, 2026 11:22
…call, mocks

Adds the in-process methodology library Slice 2 needs:

- src/gateway-client.ts — typed wrapper around POST /api/v1/proxy/messages
  with mode=proving_ground. 2-retry exponential backoff with jitter on 5xx +
  connection errors; no retry on 4xx; per-request timeout (30s default,
  LUCAIRN_REQUEST_TIMEOUT_MS env-configurable). Env reads are call-time only
  — module import is side-effect-free.
- src/redaction-extractor.ts — pure converter from gateway proving-ground
  matches/missed/extras into a flat ExtractedRedaction[] tagged with HIPAA
  Safe Harbor category + verdict (tp/fn/fp). Unmapped extras carry
  hipaa_category=null so the FP count is preserved while taxonomy drift is
  observable.
- src/hipaa-category-mapping.ts — explicit one-way map from Lucairn
  sanitizer internal taxonomy ([PERSON_N], [LOCATION_N], …) to the 18 HIPAA
  Safe Harbor categories (45 CFR § 164.514(b)(2)(i)). Placeholder-parsing
  mirrors gateway extractEntityTypes (proxy.go:1361-1395).
- src/recall.ts — two consumer paths: aggregateExtracted() consumes
  gateway-attested verdicts (the harness's live path; arm's-length property
  preserved because matching runs inside the gateway, not in code Lucairn
  authored alongside the publication); computeRecallFromSpans() implements
  the ≥50%-character-overlap span-matching the Slice 2 brief locks for any
  future raw-span inline surface. Both produce the same RecallSummary shape.
- src/mocks/gateway-fixtures.ts — deterministic mock builders for msw-backed
  unit tests + --mock smoke scripts. Configurable missRate + spuriousFpCount
  exercise recall paths against known oracles.
- src/index.ts — barrel exports for the public surface.
- package.json — adds msw ^2.7 devDependency. No new runtime deps.
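
The ≥50%-character-overlap rule that computeRecallFromSpans is described as applying can be sketched as follows. The message does not say which span length the overlap is measured against, so this sketch assumes the ground-truth span is the denominator; the Span shape and function names are likewise hypothetical. Only the 0.5 threshold comes from the source.

```typescript
// Half-open character span [start, end); an assumed shape for illustration.
interface Span {
  start: number;
  end: number;
}

const SPAN_OVERLAP_THRESHOLD = 0.5;

// Fraction of the ground-truth span covered by the detected span (assumed denominator).
function overlapFraction(truth: Span, detected: Span): number {
  const truthLength = truth.end - truth.start;
  if (truthLength <= 0) return 0;
  const overlap = Math.min(truth.end, detected.end) - Math.max(truth.start, detected.start);
  return Math.max(0, overlap) / truthLength;
}

// A detection matches when it covers at least half of the ground-truth span.
function spansMatch(truth: Span, detected: Span): boolean {
  return overlapFraction(truth, detected) >= SPAN_OVERLAP_THRESHOLD;
}
```

With a 10-character ground-truth span, a detection covering exactly 5 characters matches, while one covering 4 does not and would be counted as one FP plus one FN.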

Cite-back for gateway response shape: proxy.go:35-58 (request schema),
proxy.go:361-373 (mode + activity validation), proxy.go:1068-1080
(ground_truth_evaluation emission), ground_truth.go:5-138 (result shape).
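
The retry policy described for src/gateway-client.ts (two retries, exponential backoff with jitter, retry on 5xx and connection errors, never on 4xx) might look roughly like the sketch below. The base delay, the jitter formula, and every name here are assumptions; the commit message does not specify them.

```typescript
const MAX_RETRIES = 2;

interface GatewayAttempt<T> {
  status: number;
  body: T;
}

// Only 5xx responses are retried; 4xx responses are returned to the caller as-is.
function isRetryableStatus(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Exponential backoff plus additive jitter. baseMs and the jitter range are assumed.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 250,
  random: () => number = Math.random,
): number {
  return baseMs * 2 ** attempt + random() * baseMs;
}

// Runs op up to 1 + MAX_RETRIES times. A thrown error (connection failure, timeout)
// is treated as retry-eligible; the caller supplies sleep so tests can stub it.
async function callWithRetry<T>(
  op: () => Promise<GatewayAttempt<T>>,
  sleep: (ms: number) => Promise<void>,
): Promise<GatewayAttempt<T>> {
  let lastError: unknown = new Error("no attempt made");
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      const result = await op();
      if (!isRetryableStatus(result.status)) return result;
      lastError = new Error(`gateway returned ${result.status}`);
    } catch (err) {
      lastError = err;
    }
    if (attempt < MAX_RETRIES) await sleep(backoffDelayMs(attempt));
  }
  throw lastError;
}
```

Injecting the random source and the sleep function is one way to make "exact backoff math" assertable in a unit test without real timers.
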
…MARY schema

Adds the three CLI scripts the Slice 2 harness needs:

- scripts/run-pipeline.ts — orchestrates per-row gateway calls via
  POST /api/v1/proxy/messages (mode=proving_ground). --mock mounts an msw
  fixture server in-process; --live is reserved for Slice 3 and refuses to
  start without the explicit gate. Writes raw NDJSON to
  papers/paper-1-healthcare/raw-results/<timestamp>.ndjson (or --output).
  Supports --rows / --truth / --subset / --gateway / --api-key / --miss-rate
  / --spurious-fp-count / --activity-id-prefix.

- scripts/collect-certs.ts — walks the NDJSON, extracts cert URL + summary
  URL + redaction count + overall verdict per row, emits CERTIFICATES.csv
  via src/csv.ts::emitCsv. Columns:
    row_index, cert_url, cert_id, summary_url, overall_verdict,
    redaction_count, latency_ms, timestamp_utc, error_code

- scripts/compute-recall.ts — reads ground-truth JSONL + raw NDJSON (or
  re-runs the in-process mock when --redactions-source=mock), aggregates
  per-HIPAA-category recall / precision / F1 via aggregateExtracted, emits
  SUMMARY.json, validates against papers/_template/SUMMARY.schema.json
  in-process. Avoids a runtime dep on ajv via a minimal validator covering
  the schema subset used.
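
As an illustration of the NDJSON-to-CSV step, the sketch below emits rows in the nine-column order listed for CERTIFICATES.csv. It is not the repo's src/csv.ts::emitCsv; the quoting rules, the CertRow type, and the helper names are assumptions.

```typescript
// Column order as listed for CERTIFICATES.csv.
const CERT_COLUMNS = [
  "row_index", "cert_url", "cert_id", "summary_url", "overall_verdict",
  "redaction_count", "latency_ms", "timestamp_utc", "error_code",
] as const;

type CertColumn = (typeof CERT_COLUMNS)[number];
type CertRow = Record<CertColumn, string | number | null>;

// Quote a field only when it contains a comma, quote, or newline; null becomes empty.
function csvField(value: string | number | null): string {
  const text = value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Header line followed by one line per row, each terminated by "\n".
function toCsv(rows: CertRow[]): string {
  const header = CERT_COLUMNS.join(",");
  const lines = rows.map((row) => CERT_COLUMNS.map((col) => csvField(row[col])).join(","));
  return [header, ...lines].join("\n") + "\n";
}
```

Keeping error_code as a column, empty for successful rows, lets a failed row stay visible in the output instead of being dropped.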

Also adds:

- papers/_template/SUMMARY.schema.json — Draft 2020-12 JSON Schema for
  SUMMARY.json. Enforces 18-category coverage in per_category, the four
  required overall fields, the RowBreakdown shape, and the schema_version /
  generator const fields. Reused by every paper in the program.
- papers/paper-1-healthcare/raw-results/.gitignore + .gitkeep — directory
  scaffold; per-run NDJSON is gitignored at the repo level
  (datasets/.gitignore line 17) but the per-paper sub-tree's own .gitignore
  locks it locally too.
- package.json — adds pipeline / collect-certs / compute-recall scripts.

End-to-end smoke (all PASS):
  pnpm run pipeline -- --rows=5 --mock --output=/tmp/slice2-smoke.ndjson
  pnpm run collect-certs -- --input=/tmp/slice2-smoke.ndjson --output=/tmp/slice2-CERTIFICATES.csv
  pnpm run compute-recall -- --truth=ground-truth.jsonl --redactions-source=mock --rows=5 --output=/tmp/slice2-SUMMARY.json
…docs update

Adds the three test files covering the new Slice 2 surface and updates
README + RECIPE for the shipped state.

Tests (22 new, 34 total with Slice 1's 12):

- test/gateway-client.spec.ts (8 tests) — msw-mocked. Locks the
  proving-ground request shape (mode, relink_response=false, activity_id
  pattern, ground_truth.transcription[] HIPAA-tagged annotations,
  x-api-key header). Verifies: success path returns a typed
  GatewayRowResult; retries on 5xx and recovers (exact backoff math
  asserted); does NOT retry on 4xx; fails-with-error after exhausting
  retry budget; abort/timeout is retry-eligible; extractCertUrls handles
  the missing-veil-hint case; construction validation refuses empty URL
  / empty key.
- test/redaction-extractor.spec.ts (9 tests) — locks the placeholder
  parser against malformed inputs; verifies the HIPAA mapping covers the
  standard Presidio + Lucairn vocabulary (PERSON, LOCATION, DATE,
  PHONE_NUMBER, EMAIL_ADDRESS, US_SSN, IBAN, URL, IP_ADDRESS,
  CREDIT_CARD); every entry in LUCAIRN_TO_HIPAA maps to a valid
  HipaaCategory; extractFromEvaluation flattens matches/missed/extras
  into ExtractedRedaction[] with verdicts; unknown annotation_type from
  the gateway is tagged null (no silent widening); unmappedExtraTypes
  surfaces taxonomy drift.
- test/recall.spec.ts (5 tests) — 5 rows, 22 entities, hand-tagged
  TP/FN/FP. Exact per-category recall/precision/F1 numbers asserted:
  NAME 5 TP / 1 FN → recall 5/6; EMAIL 2 TP → recall 1; DATE 3 TP / 1 FN
  / 1 FP → recall 0.75 precision 0.75; PHONE 0 TP / 2 FP → precision 0;
  GEO 4 TP / 1 FN → recall 0.8. Overall TP=15 FP=3 FN=3 → recall 15/18.
  Locks the SPAN_OVERLAP_THRESHOLD const at 0.5 with a regression test.
  computeRecallFromSpans is exercised with a single-row synthetic
  fixture covering exact-50%-overlap (matches), 100%-overlap (matches),
  40%-overlap (FP + FN). Per-row order ascending by row_index asserted.
  Unmapped-category counts get a "no HIPAA category mapping" note.
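
The per-category figures asserted in test/recall.spec.ts follow from the standard definitions of recall, precision, and F1. A minimal sketch of that arithmetic is below; the type and function names are hypothetical, and returning null for a zero denominator is an assumption about how undefined ratios are represented.

```typescript
interface Counts {
  tp: number;
  fp: number;
  fn: number;
}

interface Metrics {
  recall: number | null;
  precision: number | null;
  f1: number | null;
}

// null signals an undefined ratio (zero denominator); an assumed convention.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

function computeMetrics({ tp, fp, fn }: Counts): Metrics {
  const recall = ratio(tp, tp + fn); // share of ground-truth entities that were caught
  const precision = ratio(tp, tp + fp); // share of detections that were correct
  const f1 =
    recall === null || precision === null || recall + precision === 0
      ? null
      : (2 * precision * recall) / (precision + recall);
  return { recall, precision, f1 };
}
```

For example, the DATE counts quoted above (3 TP, 1 FN, 1 FP) give recall 3/4 and precision 3/4, and the PHONE counts (0 TP, 2 FP) give precision 0 with recall undefined because there are no ground-truth PHONE entities in that fixture.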

Docs:

- README.md — appends a Slice 2 — Harness section under Reproduce Paper 1
  documenting the mock-only workflow, all three CLI commands with
  --rows=5 examples, the --miss-rate / --spurious-fp-count options, and
  the explicit "live gateway run lands in Slice 3" framing required by
  the PRD halt gate. Refines two pre-existing negative-disclaimer lines
  to avoid the locked banned literals "case study" + "testimonial"
  while preserving meaning.
- datasets/healthcare/RECIPE.md — flips the Slice-status timeline entry
  for Slice 2 from "pending" to "shipped (mock-only)", enumerates the
  Slice 2 source files, and updates the Slice 3 description.
- .gitignore — narrows `papers/*/raw-results/` to its contents and
  exempts the directory scaffold (`.gitignore` + `.gitkeep`) so the
  per-paper run-results directory exists in a fresh clone.

End-to-end smoke (all PASS):
  pnpm install --frozen-lockfile      → exit 0
  pnpm typecheck                       → exit 0
  pnpm typecheck:test                  → exit 0
  pnpm build                           → exit 0
  pnpm test (34 tests across 6 files)  → exit 0
  Banned-literal sweep                 → 0 hits
- B1 (bug-hunter BLOCKER): rewrite hipaa-category-mapping table to
  match the live placeholder vocabulary from presidio_scan.py:31-58
  (PERSON, EMAIL, PHONE, LOCATION, IBAN, CC, SSN, URL, DOB).
  ID and SECRET intentionally null-mapped (placeholder collapses
  multiple HIPAA categories; documented limitation surfaces as
  unmapped_extras). Update regression test to walk PRESIDIO_TO_PLACEHOLDER
  values + assert every value is mapped or explicitly null-mapped.
- H1 (bug-hunter HIGH): rewrite mock fixture PLACEHOLDER_FOR_CATEGORY
  to emit live-production placeholder shapes (no more synthetic
  [MEDICAL_RECORD_NUMBER_1] etc.). Add [ID_N] regression test in
  recall.spec.ts to exercise the unmapped-extras accounting path.
- H2 (bug-hunter HIGH): filter ground-truth annotations with
  value.trim().length < 3 in buildGroundTruth (containment-match
  safety; defensive against future Faker regression). Emit
  console.warn with dropped count only (never the dropped values).
- H3 (bug-hunter HIGH): validate SUMMARY.json BEFORE writeFile in
  compute-recall.ts, not after, so a bogus SUMMARY.json never lands
  on disk for downstream consumers.
- H4 (bug-hunter HIGH): plumb X-Upstream-Key header through
  gateway-client + run-pipeline --upstream-key flag for Slice 3
  BYOK-per-request flow (proxy.go:349-354 gate). LUCAIRN_UPSTREAM_KEY
  env var fallback. Empty-string treated as absent. Help text + auth-
  modes table documented.
- claim-enforce MED: append "No attributed endorsement quotes" to
  README.md:15 to recover the testimonial guardrail dropped in the
  Slice 2 banned-literal sweep rephrase.
- personal-info-leak MED: rename lcr_live_test_* / lcr_live_mock_*
  to lcr_test_* / lcr_mock_* in test fixtures so the repo is safe
  for secret-scanner pass post-public-flip.
- regulator-validator WARN: add matching-semantics disclosure to
  papers/_template/SUMMARY.schema.json description so auditors reading
  SUMMARY.json in isolation cannot misinterpret containment recall as
  span-exact i2b2-style recall.
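
The H2 filter exists because containment matching makes very short ground-truth values match almost any detection. A minimal sketch of that filter follows; the Annotation shape, the function name, and the warning text are assumptions, while the trim-then-length-below-3 rule and the count-only warning come from the item above.

```typescript
// Assumed annotation shape; the repo's ground-truth type may carry more fields.
interface Annotation {
  type: string;
  value: string;
}

const MIN_VALUE_LENGTH = 3;

// Drops annotations whose trimmed value is shorter than 3 characters and warns
// with the dropped count only, so the dropped values never reach the logs.
function filterShortAnnotations(annotations: Annotation[]): Annotation[] {
  const kept = annotations.filter((a) => a.value.trim().length >= MIN_VALUE_LENGTH);
  const droppedCount = annotations.length - kept.length;
  if (droppedCount > 0) {
    console.warn(`dropped ${droppedCount} ground-truth annotation(s) below the minimum length`);
  }
  return kept;
}
```

A two-letter value such as "Al" is contained in "Alice", "Albany", and many other strings, so keeping it would let unrelated detections count as true positives.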

Deferred to Slice 3:
- M1 NDJSON streaming writer (lost-data crash protection)
- M2 rate-limit/concurrency + 429 retry policy for live Anthropic upstream
- M3 hard-fail on malformed ground-truth/transcription rows
- M4 hardening of the in-process JSON-Schema validator (or swap to ajv)
- M5 detection_rate empty-row contract test
- regulator WARN 1: fax/phone disclosure in Paper 1 body
- regulator WARN 2: recall match-semantics in Paper 1 body Methods/Limits
…set rationale)

- [8] FAIL: --upstream-key help table listed 2 unsupported auth modes
  ("not supported by this harness") and omitted the --mock path entirely.
  Rewrote the table to enumerate the 3 actually-supported modes:
  --mock (no auth), --live + --api-key (non-BYOK), --live + --api-key +
  --upstream-key (BYOK-per-request, cite proxy.go:349-354 gate).

- [21] FAIL: CERTIFICATES.csv ships 9 columns vs the brief's 7 minimum.
  The 2 extensions (summary_url, error_code) are intentional:
  summary_url saves readers a URL-construction step; error_code makes
  the paper appendix honest about which rows failed instead of silently
  dropping them. Documented the rationale inline in collect-certs.ts
  before the headers array. All 7 brief-required columns remain present
  in declaration order. Treating this as effective-PASS at the
  orchestrator level: brief spec was a minimum, not an exclusive list.

No code-behavior changes. typecheck/build/test all green at HEAD.
@Declade Declade merged commit 233733f into main May 17, 2026
@Declade Declade deleted the feat/slice-2-harness branch May 17, 2026 10:17