Paper 2: CFPB / GLBA NPI finance dataset + benchmark artifacts + harness generalisation #3
Merged
…ne artifacts

Paper 2 in-flight: Lucairn Research Program's CFPB Consumer Complaint Database (public-domain US-government work) benchmarked against GLBA NPI (16 CFR § 313.3(n) + FTC Safeguards + PCI-DSS). Same two-measurement methodology as Paper 1.

What's in this commit (in-flight; numbers TBD by benchmark runs):
- datasets/finance/RECIPE.md: methodology of record for the CFPB + GLBA enumeration; mirrors datasets/healthcare/RECIPE.md structurally
- src/inject-finance-pii-core.ts: deterministic synthetic-NPI injection (Mulberry32 PRNG, Faker, 17 GLBA categories, 20-25 entities per narrative, same seed = 42 as healthcare for cross-paper sampling parity)
- src/glba-category-mapping.ts: placeholder→GLBA mapping for FP attribution
- src/streaming-csv.ts: streaming CSV reader for the ~8GB CFPB CSV (V8's max string length is ~512MB; the healthcare path stays on in-memory csv.ts since MTSamples is ~50MB)
- scripts/download-cfpb.ts + inject-finance-pii.ts + verify-finance-injection.ts + analyze-finance-ndjson.ts: Paper 2 driver scripts
- papers/paper-2-finance/sanitizer-config/: paper2_* recognizers.py + finance-terms.txt + README (reproducibility artifact; the live sanitizer application path is documented in the README)

Harness generalisation (shared with Paper 1; preserves Paper 1 behaviour):
- gateway-client.ts: ProvingGroundAnnotation.type widened from HipaaCategory to string; new AnnotationInput interface as the generic shape both papers' InjectedEntity types satisfy
- run-pipeline.ts: new --narrative-column flag (default 'transcription' for healthcare; finance overrides to 'Consumer complaint narrative')
- mocks/gateway-fixtures.ts: AnnotationInput swap
- All 46 existing tests still pass (no Paper 1 regression)

Benchmark runs + blog publication land in subsequent commits once both baseline and tuned numbers are confirmed row-by-row against the NDJSONs.

PRD: Opus Advisor/specs/prd-2026-05-22-paper-2-finance.md
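The seeded injection described in this commit can be illustrated with a minimal sketch. It is not the code in src/inject-finance-pii-core.ts: `planRow`, `planDataset`, `categoryLabel`, and the `CATEGORY_n` labels are hypothetical stand-ins (the real 17 GLBA category names are not listed here), and Faker value generation is left out. Only the Mulberry32 generator, the seed of 42, the 17-category count, and the 20-25 entities per narrative come from the commit message.

```typescript
// Mulberry32: a small seeded 32-bit PRNG. Returns a function yielding floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumption: placeholder labels only; the real category names live in the repo.
const CATEGORY_COUNT = 17;
function categoryLabel(index: number): string {
  return `CATEGORY_${index + 1}`;
}

interface PlannedEntity {
  category: string;
}

// Hypothetical helper: decide how many entities a narrative gets and which categories.
function planRow(rng: () => number): PlannedEntity[] {
  const count = 20 + Math.floor(rng() * 6); // 20..25 inclusive
  const plan: PlannedEntity[] = [];
  for (let i = 0; i < count; i++) {
    plan.push({ category: categoryLabel(Math.floor(rng() * CATEGORY_COUNT)) });
  }
  return plan;
}

// Hypothetical helper: one generator per run, so row order is part of the determinism.
function planDataset(seed: number, rows: number): PlannedEntity[][] {
  const rng = mulberry32(seed);
  return Array.from({ length: rows }, () => planRow(rng));
}

const runA = planDataset(42, 3);
const runB = planDataset(42, 3);
// Two runs with the same seed produce the same plan.
console.log(JSON.stringify(runA) === JSON.stringify(runB));
```

Because the generator is threaded through every row, reordering or skipping rows changes every later draw; a design that needs per-row stability would derive a sub-seed per row instead.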
… JSONs + compare script

After running the full baseline + tuned + score-bump-variant benchmarks, this commit lands the final reproducibility state:
- papers/paper-2-finance/SUMMARY-baseline.json (rows=500, recall=72.24%, precision=47.36%, F1=57.21, FP=9026)
- papers/paper-2-finance/SUMMARY-tuned.json (rows=500, recall=72.20%, precision=81.35%, F1=76.50, FP=1861); V1 safelist-only is the canonical "after"
- papers/paper-2-finance/sanitizer-config/recognizers.py: score-bump variants documented (V2 experiment); 10 paper2_* recognizers
- papers/paper-2-finance/sanitizer-config/finance-terms.txt: 108 effective terms (trimmed after broadness audit per Paper 1's "any word in span" lesson)
- datasets/finance/RECIPE.md: PCI-DSS cite refined to "v4.0 Glossary + §3.2.1" per regulator-validator review
- scripts/compare-finance-summaries.py: markdown delta table generator used to produce blog tables

NDJSONs (baseline-500row-*.ndjson + tuned-500row-*.ndjson) stay gitignored per the per-paper raw-results convention.

Companion blog: lucairn.eu/blog/financial-pii-redaction-benchmark (theveil-website#262).
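The figures in this commit can be cross-checked with a short sketch. This is not scripts/compare-finance-summaries.py (which is Python); it is a hypothetical TypeScript equivalent, and the `Summary` field names are assumptions rather than the actual JSON schema. The numbers are copied from the commit message above.

```typescript
// Assumed shape; the real SUMMARY-*.json schema may differ.
interface Summary {
  rows: number;
  recall: number; // percent
  precision: number; // percent
  f1: number;
  fp: number;
}

const baseline: Summary = { rows: 500, recall: 72.24, precision: 47.36, f1: 57.21, fp: 9026 };
const tuned: Summary = { rows: 500, recall: 72.2, precision: 81.35, f1: 76.5, fp: 1861 };

// F1 is the harmonic mean of precision and recall (works on percentages too).
function f1FromPercent(recall: number, precision: number): number {
  return (2 * recall * precision) / (recall + precision);
}

// Format a delta with an explicit sign.
function signed(x: number, digits = 2): string {
  return (x >= 0 ? "+" : "") + x.toFixed(digits);
}

// Build a markdown table of before/after values and their deltas.
function deltaTable(before: Summary, after: Summary): string {
  const fpChange = ((after.fp - before.fp) / before.fp) * 100;
  return [
    "| Metric | Baseline | Tuned | Delta |",
    "| --- | --- | --- | --- |",
    `| Recall (%) | ${before.recall.toFixed(2)} | ${after.recall.toFixed(2)} | ${signed(after.recall - before.recall)} pp |`,
    `| Precision (%) | ${before.precision.toFixed(2)} | ${after.precision.toFixed(2)} | ${signed(after.precision - before.precision)} pp |`,
    `| F1 | ${before.f1.toFixed(2)} | ${after.f1.toFixed(2)} | ${signed(after.f1 - before.f1)} |`,
    `| False positives | ${before.fp} | ${after.fp} | ${signed(fpChange, 1)}% |`,
  ].join("\n");
}

console.log(deltaTable(baseline, tuned));
// Recomputing F1 from the reported recall and precision should land on the reported F1.
console.log(f1FromPercent(baseline.recall, baseline.precision).toFixed(2));
console.log(f1FromPercent(tuned.recall, tuned.precision).toFixed(2));
```

Recomputing F1 from recall and precision is a cheap consistency check on a summary file before its numbers are quoted anywhere else.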
Summary
Paper 2 of the Lucairn Research Program. Mirrors Paper 1's methodology on CFPB Consumer Complaint Database + GLBA NPI enumeration.
Companion blog post (already merged + deployed): lucairn.eu/blog/financial-pii-redaction-benchmark (Declade/theveil-website#262).
What ships
Dataset + methodology:
- `datasets/finance/RECIPE.md`: full provenance + two-measurement methodology + 17-category GLBA NPI enumeration
- `scripts/download-cfpb.ts`: direct download from CFPB (US-gov public domain)
- `scripts/inject-finance-pii.ts`: deterministic synthetic NPI injection (seed=42, 20–25 NPI/row across 17 GLBA categories)
- `scripts/verify-finance-injection.ts`: round-trip verification
- `scripts/analyze-finance-ndjson.ts`: per-category recall/precision/F1 aggregator
- `scripts/compare-finance-summaries.py`: baseline-vs-tuned markdown delta tables

Source code:
- `src/inject-finance-pii-core.ts`: Faker + Mulberry32 PRNG, GLBA categories
- `src/glba-category-mapping.ts`: sanitizer placeholder → GLBA category attribution
- `src/streaming-csv.ts`: chunked CSV reader (CFPB CSV is ~8.8 GB unzipped, exceeds V8's max string length)

Harness generalisation (shared with Paper 1, no regression):
- `src/gateway-client.ts`: `ProvingGroundAnnotation.type` widened from `HipaaCategory` to `string`; new `AnnotationInput` interface that both papers' entity types satisfy structurally
- `scripts/run-pipeline.ts`: new `--narrative-column` CLI flag (default `transcription` for healthcare; finance overrides to `Consumer complaint narrative`)
- `src/mocks/gateway-fixtures.ts`: same widening, all 46 existing Paper 1 tests still pass

Sanitizer config artifact:
- `papers/paper-2-finance/sanitizer-config/recognizers.py`: 10 `paper2_*` `PatternRecognizer` definitions (score-bump variants documented)
- `papers/paper-2-finance/sanitizer-config/finance-terms.txt`: 108-term consumer-finance safelist (multi-character unambiguous only; CFPB redaction artifacts + bank brand names + card networks + credit bureaus + finance-only acronyms)
- `papers/paper-2-finance/sanitizer-config/README.md`: deployment + honesty caveats

Result summaries:
- `papers/paper-2-finance/SUMMARY-baseline.json`: 500/500 rows, recall 72.24%, precision 47.36%, F1 57.21, 9 026 FPs
- `papers/paper-2-finance/SUMMARY-tuned.json`: 500/500 rows, recall 72.20%, precision 81.35%, F1 76.50, 1 861 FPs (−79.4%)

Result narrative (the lesson)
In Paper 1 (healthcare), regex-per-weak-category lifted six weak HIPAA categories from 9–53% to 98–100% recall. In Paper 2 (finance), the same lever does not close the recall gap on digit-shape ambiguous categories — they compete with existing recognisers, sit below the sanitizer's 0.35 confidence threshold, and score-bumping trades recall for FPs roughly 1-to-1. The safelist alone drove the +34 pp precision gain. Different vertical, different lever. Full diagnosis in the blog body.
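A toy model may help show why the two levers behave differently. The sketch below is not the sanitizer's implementation: the `Detection` shape, the entity type names, the exact-match safelist rule, and the three sample terms are all assumptions made for illustration (the real finance-terms.txt entries are not reproduced here). Only the 0.35 threshold comes from the text above.

```typescript
// Assumed detection shape; the real sanitizer output may differ.
interface Detection {
  text: string;
  entityType: string; // hypothetical type names
  score: number; // recognizer confidence in [0, 1]
}

const CONFIDENCE_THRESHOLD = 0.35; // value quoted in the PR narrative

// Illustrative terms in the spirit of the safelist categories, not the real file.
const SAFELIST = new Set(["xxxx", "visa", "equifax"]);

// Assumption: a detection is kept when it clears the threshold and its whole
// span is not a safelisted term (exact match, case-insensitive).
function keepDetection(d: Detection): boolean {
  if (d.score < CONFIDENCE_THRESHOLD) return false;
  return !SAFELIST.has(d.text.trim().toLowerCase());
}

const detections: Detection[] = [
  // A brand name flagged as an entity: a false positive the safelist removes.
  { text: "Equifax", entityType: "ORGANIZATION", score: 0.85 },
  // A confident true detection: untouched by the safelist.
  { text: "4111 1111 1111 1111", entityType: "CREDIT_CARD", score: 0.9 },
  // A digit-shaped match scored under the threshold: dropped either way,
  // so no safelist change can turn it into recall.
  { text: "00123456", entityType: "ACCOUNT_NUMBER", score: 0.3 },
];

const kept = detections.filter(keepDetection);
console.log(kept.map((d) => d.text));
```

In this model the safelist only ever removes detections, which is consistent with precision rising while recall stays flat; recovering the sub-threshold match would require raising its score, and that also raises the score of every look-alike digit string.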
Test plan
- `pnpm typecheck`: exit 0
- `pnpm test`: 46/46 pass (Paper 1 + shared, no regression)

PRD: `Opus Advisor/specs/prd-2026-05-22-paper-2-finance.md`
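The typecheck result rests on the harness generalisation in this PR: widening the annotation type to `string` lets both papers' entity types pass through one code path by structural typing. A minimal sketch of that idea follows. The field names (`start`, `end`, `value`), the category members, and `toAnnotations` are hypothetical; only the names `AnnotationInput` and `HipaaCategory` and the widening to `string` come from the PR.

```typescript
// Hypothetical stand-ins for the per-paper category unions.
type HipaaCategory = "NAME" | "DATE";
type GlbaCategory = "ACCOUNT_NUMBER" | "SSN";

// Assumed generic shape: `type` is a plain string, so any category union fits.
interface AnnotationInput {
  type: string;
  start: number;
  end: number;
}

// Each paper's entity type carries a narrower `type` plus extra fields.
interface HealthcareEntity {
  type: HipaaCategory;
  start: number;
  end: number;
  value: string;
}

interface FinanceEntity {
  type: GlbaCategory;
  start: number;
  end: number;
  value: string;
}

// Hypothetical shared helper: accepts either entity type without a cast,
// because both are structurally assignable to AnnotationInput.
function toAnnotations(entities: readonly AnnotationInput[]): AnnotationInput[] {
  return entities.map(({ type, start, end }) => ({ type, start, end }));
}

const healthcare: HealthcareEntity[] = [{ type: "NAME", start: 0, end: 8, value: "Jane Doe" }];
const finance: FinanceEntity[] = [{ type: "SSN", start: 12, end: 23, value: "000-00-0000" }];

console.log(toAnnotations(healthcare));
console.log(toAnnotations(finance));
```

The trade-off of widening to `string` is that a misspelled category no longer fails at compile time in shared code, so per-paper types keep the narrow unions at the point where entities are created.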