Paper 2: CFPB / GLBA NPI finance dataset + benchmark artifacts + harness generalisation by Declade · Pull Request #3 · Declade/lucairn-research

Declade · 2026-05-23T00:08:32Z

Summary

Paper 2 of the Lucairn Research Program. Mirrors Paper 1's methodology on CFPB Consumer Complaint Database + GLBA NPI enumeration.

Companion blog post (already merged + deployed): lucairn.eu/blog/financial-pii-redaction-benchmark (Declade/theveil-website#262).

What ships

Dataset + methodology:

datasets/finance/RECIPE.md — full provenance + two-measurement methodology + 17-category GLBA NPI enumeration
scripts/download-cfpb.ts — direct download from CFPB (US-gov public domain)
scripts/inject-finance-pii.ts — deterministic synthetic NPI injection (seed=42, 20–25 NPI/row across 17 GLBA categories)
scripts/verify-finance-injection.ts — round-trip verification
scripts/analyze-finance-ndjson.ts — per-category recall/precision/F1 aggregator
scripts/compare-finance-summaries.py — baseline-vs-tuned markdown delta tables
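
The seeded injection described above can be sketched as follows. This is a minimal illustration, not the code in scripts/inject-finance-pii.ts: the Mulberry32 routine is the widely circulated public-domain version, while `randInt` and `planInjection` are hypothetical helper names, and categories are referenced by index only because the 17 GLBA labels are defined in RECIPE.md rather than here.

```typescript
// Mulberry32: a small 32-bit PRNG. The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed | 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Inclusive integer in [min, max], drawn from the supplied generator.
function randInt(rng: () => number, min: number, max: number): number {
  return min + Math.floor(rng() * (max - min + 1));
}

const SEED = 42;           // seed stated in the PR description
const CATEGORY_COUNT = 17; // GLBA categories, referenced by index only
const MIN_PER_ROW = 20;    // lower bound of the 20-25 NPI/row range
const MAX_PER_ROW = 25;    // upper bound of the 20-25 NPI/row range

// Hypothetical helper: for each row, decide how many entities to inject
// and which category index each one gets.
function planInjection(seed: number, rowCount: number): number[][] {
  const rng = mulberry32(seed);
  const plan: number[][] = [];
  for (let row = 0; row < rowCount; row++) {
    const count = randInt(rng, MIN_PER_ROW, MAX_PER_ROW);
    const categories: number[] = [];
    for (let i = 0; i < count; i++) {
      categories.push(randInt(rng, 0, CATEGORY_COUNT - 1));
    }
    plan.push(categories);
  }
  return plan;
}
```

Calling `planInjection(SEED, 500)` twice returns identical plans, which is the property a round-trip verification step can rely on.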

Source code:

src/inject-finance-pii-core.ts — Faker + Mulberry32 PRNG, GLBA categories
src/glba-category-mapping.ts — sanitizer placeholder → GLBA category attribution
src/streaming-csv.ts — chunked CSV reader (CFPB CSV is ~8.8 GB unzipped, exceeds V8's max string length)
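
A chunked reader of the kind described for src/streaming-csv.ts might look like the sketch below. It is an assumption-laden illustration rather than the shipped implementation: `CsvChunkParser` and `RowHandler` are hypothetical names, and the parser is deliberately lenient (it ignores carriage returns outside quoted fields). The point it shows is that parser state must survive chunk boundaries, because a quoted narrative containing commas, newlines, or doubled quotes can be split anywhere.

```typescript
type RowHandler = (row: string[]) => void;

// Hypothetical chunk-fed CSV parser: holds only the current row in memory.
class CsvChunkParser {
  private field = "";
  private row: string[] = [];
  private inQuotes = false;
  // True when the previous character was a quote inside a quoted field;
  // the next character decides whether it was an escape ("") or the closing quote.
  private quotePending = false;
  private readonly onRow: RowHandler;

  constructor(onRow: RowHandler) {
    this.onRow = onRow;
  }

  write(chunk: string): void {
    for (let i = 0; i < chunk.length; i++) {
      const ch = chunk.charAt(i);
      if (this.inQuotes) {
        if (this.quotePending) {
          this.quotePending = false;
          if (ch === '"') {
            this.field += '"'; // doubled quote is a literal quote
            continue;
          }
          this.inQuotes = false; // pending quote closed the field; handle ch below
        } else if (ch === '"') {
          this.quotePending = true;
          continue;
        } else {
          this.field += ch; // commas and newlines are data inside quotes
          continue;
        }
      }
      if (ch === '"') this.inQuotes = true;
      else if (ch === ",") this.endField();
      else if (ch === "\n") this.endRow();
      else if (ch !== "\r") this.field += ch;
    }
  }

  // Flush a final row that was not terminated by a newline.
  end(): void {
    if (this.field !== "" || this.row.length > 0 || this.inQuotes) this.endRow();
    this.inQuotes = false;
    this.quotePending = false;
  }

  private endField(): void {
    this.row.push(this.field);
    this.field = "";
  }

  private endRow(): void {
    this.endField();
    const finished = this.row;
    this.row = [];
    this.onRow(finished);
  }
}
```

In Node, each string chunk produced by `fs.createReadStream(path, { encoding: "utf8" })` could be passed to `write`, followed by one `end()` call once the stream closes, so the full file never has to exist as a single string.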

Harness generalisation (shared with Paper 1, no regression):

src/gateway-client.ts — ProvingGroundAnnotation.type widened from HipaaCategory to string; new AnnotationInput interface that both papers' entity types satisfy structurally
scripts/run-pipeline.ts — new --narrative-column CLI flag (default transcription for healthcare; finance overrides to Consumer complaint narrative)
src/mocks/gateway-fixtures.ts — same widening, all 46 existing Paper 1 tests still pass
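
The widening can be illustrated with a short sketch. Only `ProvingGroundAnnotation.type`, `AnnotationInput`, `HipaaCategory`, and the `--narrative-column` flag with its two column names come from this PR; the `start`/`end`/`value` fields, the category members, `GlbaCategory`, `toAnnotations`, and `narrativeColumn` are hypothetical and exist here only to show why no Paper 1 code has to change.

```typescript
// Hypothetical category unions; the real members live in the repo.
type HipaaCategory = "NAME" | "DATE";
type GlbaCategory = "ACCOUNT_NUMBER" | "SSN";

// Generic shape: `type` is a plain string, so either union is assignable to it.
interface AnnotationInput {
  type: string;
  start: number; // assumed field
  end: number;   // assumed field
}

interface ProvingGroundAnnotation {
  type: string; // widened from HipaaCategory
  start: number;
  end: number;
}

// Each paper keeps its own narrowly typed entity.
interface HealthcareEntity { type: HipaaCategory; start: number; end: number; value: string; }
interface FinanceEntity { type: GlbaCategory; start: number; end: number; value: string; }

// Accepts both entity types structurally, with no explicit `implements`.
function toAnnotations(entities: readonly AnnotationInput[]): ProvingGroundAnnotation[] {
  return entities.map(({ type, start, end }) => ({ type, start, end }));
}

// Assumed flag syntax: `--narrative-column <name>`; defaults to the healthcare column.
function narrativeColumn(argv: readonly string[]): string {
  const index = argv.indexOf("--narrative-column");
  const value = index >= 0 ? argv[index + 1] : undefined;
  return value ?? "transcription";
}

const healthcare: HealthcareEntity[] = [{ type: "NAME", start: 0, end: 8, value: "Jane Doe" }];
const finance: FinanceEntity[] = [{ type: "SSN", start: 4, end: 15, value: "078-05-1120" }];

const healthcareAnnotations = toAnnotations(healthcare);
const financeAnnotations = toAnnotations(finance);
```

Because TypeScript checks compatibility by shape, both entity arrays type-check against `AnnotationInput[]` while each paper still gets compile-time checking on its own category names.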

Sanitizer config artifact:

papers/paper-2-finance/sanitizer-config/recognizers.py — 10 paper2_* PatternRecognizer definitions (score-bump variants documented)
papers/paper-2-finance/sanitizer-config/finance-terms.txt — 108-term consumer-finance safelist (multi-character unambiguous only; CFPB redaction artifacts + bank brand names + card networks + credit bureaus + finance-only acronyms)
papers/paper-2-finance/sanitizer-config/README.md — deployment + honesty caveats

Result summaries:

papers/paper-2-finance/SUMMARY-baseline.json — 500/500 rows, recall 72.24%, precision 47.36%, F1 57.21, 9 026 FPs
papers/paper-2-finance/SUMMARY-tuned.json — 500/500 rows, recall 72.20%, precision 81.35%, F1 76.50, 1 861 FPs (−79.4%)
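
The headline figures can be cross-checked with the standard definitions. The sketch below assumes the summaries report precision and recall as percentages and F1 as their harmonic mean; whether the repo's aggregator micro- or macro-averages across categories is not stated here, so `microAverage` is only one plausible reading, and all function names are hypothetical.

```typescript
interface Counts { tp: number; fp: number; fn: number; }
interface Metrics { recall: number; precision: number; f1: number; }

// Division that returns 0 instead of NaN when a category has no samples.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// F1 is the harmonic mean of precision and recall.
function f1Score(precision: number, recall: number): number {
  return ratio(2 * precision * recall, precision + recall);
}

function metricsFor(c: Counts): Metrics {
  const recall = ratio(c.tp, c.tp + c.fn);
  const precision = ratio(c.tp, c.tp + c.fp);
  return { recall, precision, f1: f1Score(precision, recall) };
}

// Micro-average: pool the counts of every category, then compute once.
function microAverage(perCategory: Record<string, Counts>): Metrics {
  const total: Counts = { tp: 0, fp: 0, fn: 0 };
  for (const key of Object.keys(perCategory)) {
    const c = perCategory[key];
    if (!c) continue;
    total.tp += c.tp;
    total.fp += c.fp;
    total.fn += c.fn;
  }
  return metricsFor(total);
}

// Signed relative change, e.g. -0.25 for a 25% reduction.
function relativeChange(before: number, after: number): number {
  return ratio(after - before, before);
}

// Figures quoted in the two SUMMARY files.
const baselineF1 = f1Score(0.4736, 0.7224) * 100;
const tunedF1 = f1Score(0.8135, 0.722) * 100;
const fpChange = relativeChange(9026, 1861) * 100;
```

Rounded to the precision used above, `baselineF1` is 57.21, `tunedF1` is 76.50, and `fpChange` is -79.4, so the reported F1 and FP delta are consistent with the reported precision, recall, and FP counts.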

Result narrative (the lesson)

In Paper 1 (healthcare), one regex recogniser per weak category lifted six weak HIPAA categories from 9–53% to 98–100% recall. In Paper 2 (finance), the same lever does not close the recall gap on the digit-shape-ambiguous categories: the new recognisers compete with existing ones, their matches score below the sanitizer's 0.35 confidence threshold, and score-bumping trades recall for FPs roughly 1-to-1. The safelist alone drove the +34 pp precision gain (47.36% to 81.35%), with recall essentially unchanged (72.24% to 72.20%). Different vertical, different lever. Full diagnosis in the blog body.
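
The two filters in play can be sketched as below. This is an illustration under stated assumptions, not the sanitizer's code: the `Detection` shape and `keepDetections` are hypothetical, the safelist entries are illustrative stand-ins rather than lines from finance-terms.txt, and the matching rule (suppress a detection when any word in its span is safelisted) is inferred from the "any word in span" wording in the commit notes. Only the 0.35 threshold comes from the PR text.

```typescript
// Hypothetical detection shape: the matched text, its label, and a confidence score.
interface Detection {
  text: string;
  entityType: string;
  score: number;
}

const CONFIDENCE_THRESHOLD = 0.35; // threshold quoted in the PR narrative

// Keep a detection only if it clears the threshold AND no word in its span
// is on the safelist (assumed matching rule, compared case-insensitively).
function keepDetections(
  detections: readonly Detection[],
  safelist: ReadonlySet<string>,
  threshold: number = CONFIDENCE_THRESHOLD,
): Detection[] {
  return detections.filter((d) => {
    if (d.score < threshold) return false;
    const words = d.text.toLowerCase().split(/\s+/);
    return !words.some((word) => safelist.has(word));
  });
}

// Illustrative safelist entries (a redaction artifact and a credit bureau name).
const exampleSafelist = new Set(["xxxx", "equifax"]);

const exampleDetections: Detection[] = [
  { text: "XXXX", entityType: "ACCOUNT_NUMBER", score: 0.6 },         // false positive, safelisted
  { text: "Equifax", entityType: "PERSON", score: 0.85 },             // false positive, safelisted
  { text: "4111 1111 1111 1111", entityType: "CARD", score: 0.9 },    // kept
  { text: "123456789", entityType: "ACCOUNT_NUMBER", score: 0.3 },    // dropped: below threshold
];

const kept = keepDetections(exampleDetections, exampleSafelist);
```

Under this rule a safelist entry removes false positives without touching the scores of true detections, whereas lowering the threshold or bumping a recogniser's score would admit the fourth detection together with any similarly shaped false positives. The same rule also explains why overly broad safelist terms are risky: one common word inside a genuine span would suppress it.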

Test plan

pnpm typecheck exit 0
pnpm test — 46/46 pass (Paper 1 + shared no regression)
Reviewer chain (run on PR #262 in theveil-website + this branch): claim-enforcement-guard PASS, personal-info-leak-detector PASS, bug-hunter caught 2 numerical-drift findings (both fixed), regulator-validator 4 PASS + 2 WARN (one fixed in this PR's RECIPE.md update)
Post-merge edge verify: github.com/Declade/lucairn-research/tree/main/papers/paper-2-finance returns 200
Post-merge edge verify: github.com/Declade/lucairn-research/blob/main/datasets/finance/RECIPE.md returns 200

PRD: Opus Advisor/specs/prd-2026-05-22-paper-2-finance.md

…ne artifacts

Paper 2 in-flight: Lucairn Research Program's CFPB Consumer Complaint Database (public-domain US-government work) benchmarked against GLBA NPI (16 CFR § 313.3(n) + FTC Safeguards + PCI-DSS). Same two-measurement methodology as Paper 1.

What's in this commit (in-flight; numbers TBD by benchmark runs):

- datasets/finance/RECIPE.md: methodology of record for the CFPB + GLBA enumeration; mirrors datasets/healthcare/RECIPE.md structurally
- src/inject-finance-pii-core.ts: deterministic synthetic-NPI injection (Mulberry32 PRNG, Faker, 17 GLBA categories, 20-25 entities per narrative, same seed = 42 as healthcare for cross-paper sampling parity)
- src/glba-category-mapping.ts: placeholder→GLBA mapping for FP attribution
- src/streaming-csv.ts: streaming CSV reader for the ~8GB CFPB CSV (V8's max string length is ~512MB; the healthcare path stays on in-memory csv.ts since MTSamples is ~50MB)
- scripts/download-cfpb.ts + inject-finance-pii.ts + verify-finance-injection.ts + analyze-finance-ndjson.ts: Paper 2 driver scripts
- papers/paper-2-finance/sanitizer-config/: paper2_* recognizers.py + finance-terms.txt + README (reproducibility artifact; the live sanitizer application path is documented in the README)

Harness generalisation (shared with Paper 1; preserves Paper 1 behaviour):

- gateway-client.ts: ProvingGroundAnnotation.type widened from HipaaCategory to string; new AnnotationInput interface as the generic shape both papers' InjectedEntity types satisfy
- run-pipeline.ts: new --narrative-column flag (default 'transcription' for healthcare; finance overrides to 'Consumer complaint narrative')
- mocks/gateway-fixtures.ts: AnnotationInput swap
- All 46 existing tests still pass (no Paper 1 regression)

Benchmark runs + blog publication land in subsequent commits once both baseline and tuned numbers are confirmed row-by-row against the NDJSONs.

PRD: Opus Advisor/specs/prd-2026-05-22-paper-2-finance.md

… JSONs + compare script

After running the full baseline + tuned + score-bump-variant benchmarks, this commit lands the final reproducibility state:

- papers/paper-2-finance/SUMMARY-baseline.json (rows=500, recall=72.24%, precision=47.36%, F1=57.21, FP=9026)
- papers/paper-2-finance/SUMMARY-tuned.json (rows=500, recall=72.20%, precision=81.35%, F1=76.50, FP=1861); V1 safelist-only is canonical "after"
- papers/paper-2-finance/sanitizer-config/recognizers.py: score-bump variants documented (V2 experiment); 10 paper2_* recognizers
- papers/paper-2-finance/sanitizer-config/finance-terms.txt: 108 effective terms (trimmed after broadness audit per Paper 1's "any word in span" lesson)
- datasets/finance/RECIPE.md: PCI-DSS cite refined to "v4.0 Glossary + §3.2.1" per regulator-validator review
- scripts/compare-finance-summaries.py: markdown delta table generator used to produce blog tables

NDJSONs (baseline-500row-*.ndjson + tuned-500row-*.ndjson) stay gitignored per the per-paper raw-results convention.

Companion blog: lucairn.eu/blog/financial-pii-redaction-benchmark (theveil-website#262).

Declade added 2 commits May 23, 2026 00:35

Declade merged commit 273d044 into main May 23, 2026
0 of 6 checks passed

Declade merged 2 commits into main from feat/paper-2-finance
