Paper 2: CFPB / GLBA NPI finance dataset + benchmark artifacts + harness generalisation #3

Merged
Declade merged 2 commits into main from feat/paper-2-finance
May 23, 2026

Declade commented May 23, 2026

Summary

Paper 2 of the Lucairn Research Program. Mirrors Paper 1's methodology on CFPB Consumer Complaint Database + GLBA NPI enumeration.

Companion blog post (already merged + deployed): lucairn.eu/blog/financial-pii-redaction-benchmark (Declade/theveil-website#262).

What ships

Dataset + methodology:

  • datasets/finance/RECIPE.md — full provenance + two-measurement methodology + 17-category GLBA NPI enumeration
  • scripts/download-cfpb.ts — direct download from CFPB (US-gov public domain)
  • scripts/inject-finance-pii.ts — deterministic synthetic NPI injection (seed=42, 20–25 NPI/row across 17 GLBA categories)
  • scripts/verify-finance-injection.ts — round-trip verification
  • scripts/analyze-finance-ndjson.ts — per-category recall/precision/F1 aggregator
  • scripts/compare-finance-summaries.py — baseline-vs-tuned markdown delta tables
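
The deterministic injection step can be pictured with a seeded PRNG. The following is a minimal sketch, not the code in scripts/inject-finance-pii.ts: `mulberry32` is the widely circulated generator of that name, while `randInt`, `planInjections`, and the three category labels are hypothetical stand-ins (the real enumeration has 17 GLBA categories, defined in RECIPE.md).

```typescript
// Mulberry32: a small seeded 32-bit PRNG. The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Inclusive integer in [min, max] drawn from the supplied generator.
function randInt(rng: () => number, min: number, max: number): number {
  return min + Math.floor(rng() * (max - min + 1));
}

// Hypothetical labels for illustration only; not the paper's category list.
const EXAMPLE_CATEGORIES = ["ACCOUNT_NUMBER", "CARD_NUMBER", "SSN"];

// For each row, decide how many entities to inject (20 to 25) and a category for each.
function planInjections(rowCount: number, seed: number): string[][] {
  const rng = mulberry32(seed);
  const plan: string[][] = [];
  for (let row = 0; row < rowCount; row++) {
    const count = randInt(rng, 20, 25);
    const categories: string[] = [];
    for (let i = 0; i < count; i++) {
      categories.push(EXAMPLE_CATEGORIES[randInt(rng, 0, EXAMPLE_CATEGORIES.length - 1)]);
    }
    plan.push(categories);
  }
  return plan;
}
```

Because all randomness flows through one generator seeded once, calling `planInjections(500, 42)` twice produces identical plans, which is the property a round-trip verification step can rely on.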

Source code:

  • src/inject-finance-pii-core.ts — Faker + Mulberry32 PRNG, GLBA categories
  • src/glba-category-mapping.ts — sanitizer placeholder → GLBA category attribution
  • src/streaming-csv.ts — chunked CSV reader (CFPB CSV is ~8.8 GB unzipped, exceeds V8's max string length)
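
To illustrate why a chunked reader avoids the string-length limit, here is a minimal sketch of an incremental CSV parser. It is an assumption-laden illustration, not the contents of src/streaming-csv.ts: the class name `CsvChunkParser` and its `push`/`end` methods are hypothetical. In practice the chunks would come from a file stream such as Node's `fs.createReadStream`; that wiring is omitted so the block stays self-contained.

```typescript
// Incremental CSV parser: feed it chunks of text, get back the rows completed so far.
// Parser state survives across chunks, so a quoted field may straddle a chunk boundary.
class CsvChunkParser {
  private field = "";
  private row: string[] = [];
  private inQuotes = false;
  private quotePending = false; // saw '"' inside quotes; the next char decides its meaning
  private rowHasContent = false;

  push(chunk: string): string[][] {
    const done: string[][] = [];
    for (let i = 0; i < chunk.length; i++) {
      const ch = chunk[i];
      if (this.inQuotes) {
        if (this.quotePending) {
          this.quotePending = false;
          if (ch === '"') {
            this.field += '"'; // a doubled quote is an escaped quote
            continue;
          }
          this.inQuotes = false; // the pending quote closed the field; handle ch below
        } else if (ch === '"') {
          this.quotePending = true;
          continue;
        } else {
          this.field += ch; // commas and newlines inside quotes are data
          continue;
        }
      }
      if (ch === '"') {
        this.inQuotes = true;
        this.rowHasContent = true;
      } else if (ch === ",") {
        this.row.push(this.field);
        this.field = "";
        this.rowHasContent = true;
      } else if (ch === "\n") {
        this.finishRow(done);
      } else if (ch !== "\r") {
        this.field += ch;
      }
    }
    return done;
  }

  // Call once after the last chunk to flush a final row that lacks a trailing newline.
  end(): string[][] {
    const done: string[][] = [];
    this.finishRow(done);
    return done;
  }

  private finishRow(out: string[][]): void {
    if (this.rowHasContent || this.field.length > 0) {
      this.row.push(this.field);
      out.push(this.row);
    }
    this.row = [];
    this.field = "";
    this.rowHasContent = false;
    this.inQuotes = false;
    this.quotePending = false;
  }
}
```

Only the current row is held in memory at any time, so peak usage depends on the chunk size and the longest row rather than on the size of the file.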

Harness generalisation (shared with Paper 1, no regression):

  • src/gateway-client.ts — ProvingGroundAnnotation.type widened from HipaaCategory to string; new AnnotationInput interface that both papers' entity types satisfy structurally
  • scripts/run-pipeline.ts — new --narrative-column CLI flag (default transcription for healthcare; finance overrides to Consumer complaint narrative)
  • src/mocks/gateway-fixtures.ts — same widening, all 46 existing Paper 1 tests still pass
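
The widening works because TypeScript compares types structurally. The sketch below shows the idea under stated assumptions: only the names `HipaaCategory`, `AnnotationInput`, and the `type` field come from this PR; the category members, the other field names, the two entity interfaces, and `toAnnotations` are hypothetical.

```typescript
// Illustrative category unions; the real ones live in the repository.
type HipaaCategory = "NAME" | "DATE" | "PHONE";
type GlbaCategory = "ACCOUNT_NUMBER" | "SSN";

// Generic shape: `type` is a plain string, so any category union fits.
// Field names other than `type` are assumptions made for this sketch.
interface AnnotationInput {
  type: string;
  start: number;
  end: number;
}

interface HealthcareEntity {
  type: HipaaCategory;
  start: number;
  end: number;
  value: string;
}

interface FinanceEntity {
  type: GlbaCategory;
  start: number;
  end: number;
  value: string;
}

// Accepts either paper's entities without casts, because both satisfy
// AnnotationInput structurally (a string-literal union is assignable to string).
function toAnnotations(entities: readonly AnnotationInput[]): AnnotationInput[] {
  return entities.map((e) => ({ type: e.type, start: e.start, end: e.end }));
}

const healthcare: HealthcareEntity[] = [{ type: "NAME", start: 0, end: 8, value: "Jane Doe" }];
const finance: FinanceEntity[] = [{ type: "SSN", start: 4, end: 15, value: "000-00-0000" }];

const healthcareAnnotations = toAnnotations(healthcare);
const financeAnnotations = toAnnotations(finance);
```

Neither entity type has to declare that it implements `AnnotationInput`, which is why the healthcare path can stay unchanged while the finance path reuses the same harness.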

Sanitizer config artifact:

  • papers/paper-2-finance/sanitizer-config/recognizers.py — 10 paper2_* PatternRecognizer definitions (score-bump variants documented)
  • papers/paper-2-finance/sanitizer-config/finance-terms.txt — 108-term consumer-finance safelist (multi-character unambiguous only; CFPB redaction artifacts + bank brand names + card networks + credit bureaus + finance-only acronyms)
  • papers/paper-2-finance/sanitizer-config/README.md — deployment + honesty caveats
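
As a rough illustration of how a term safelist can suppress false positives, the sketch below drops a detection only when its entire span equals a safelisted term. Everything here is an assumption: the `Detection` shape, the function names, the one-term-per-line file format with `#` comments, and the whole-span matching rule are hypothetical, and the sample terms are illustrative rather than quoted from finance-terms.txt. The sanitizer's real matching rule may differ.

```typescript
interface Detection {
  text: string;
  start: number;
  end: number;
}

// Normalise for comparison: trim, collapse internal whitespace, lowercase.
function normalise(term: string): string {
  return term.trim().replace(/\s+/g, " ").toLowerCase();
}

// Build a lookup from safelist file contents.
// Assumed format: one term per line; blank lines and lines starting with '#' are skipped.
function loadSafelist(fileContents: string): Set<string> {
  const terms = new Set<string>();
  for (const line of fileContents.split(/\r?\n/)) {
    const term = normalise(line);
    if (term.length > 0 && term[0] !== "#") {
      terms.add(term);
    }
  }
  return terms;
}

// Keep a detection unless its whole span is a safelisted term.
// Matching the whole span (rather than any word inside it) keeps the safelist narrow.
function applySafelist(detections: Detection[], safelist: Set<string>): Detection[] {
  return detections.filter((d) => !safelist.has(normalise(d.text)));
}
```

A span such as "Visa" would be dropped if "Visa" is on the list, while a card-number span that merely sits next to the word "Visa" would still be reported.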

Result summaries:

  • papers/paper-2-finance/SUMMARY-baseline.json — 500/500 rows, recall 72.24%, precision 47.36%, F1 57.21, 9 026 FPs
  • papers/paper-2-finance/SUMMARY-tuned.json — 500/500 rows, recall 72.20%, precision 81.35%, F1 76.50, 1 861 FPs (−79.4%)
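
The headline figures are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below recomputes F1 and the FP reduction from the rounded precision, recall, and FP values quoted above; the function names are hypothetical and this is not the repository's aggregator.

```typescript
// F1 is the harmonic mean of precision and recall (inputs and output in percent).
function f1(precision: number, recall: number): number {
  return (2 * precision * recall) / (precision + recall);
}

// Relative change in false positives, as a percentage of the baseline count.
function fpChangePercent(baselineFp: number, tunedFp: number): number {
  return ((tunedFp - baselineFp) / baselineFp) * 100;
}

const baselineF1 = f1(47.36, 72.24);
const tunedF1 = f1(81.35, 72.2);
const fpDelta = fpChangePercent(9026, 1861);
const precisionGainPp = 81.35 - 47.36;

console.log(baselineF1.toFixed(2)); // 57.21
console.log(tunedF1.toFixed(2)); // 76.50
console.log(fpDelta.toFixed(1)); // -79.4
console.log(precisionGainPp.toFixed(2)); // 33.99
```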

Result narrative (the lesson)

In Paper 1 (healthcare), adding one regex recogniser per weak category lifted six weak HIPAA categories from 9–53% to 98–100% recall. In Paper 2 (finance), the same lever does not close the recall gap on digit-shape-ambiguous categories: they compete with existing recognisers, sit below the sanitizer's 0.35 confidence threshold, and score-bumping trades recall for FPs roughly 1-to-1. The safelist alone drove the +34 pp precision gain (47.36% to 81.35%) while recall stayed essentially flat (72.24% to 72.20%). Different vertical, different lever. Full diagnosis in the blog body.

Test plan

  • pnpm typecheck exit 0
  • pnpm test — 46/46 pass (Paper 1 + shared no regression)
  • Reviewer chain (run on PR #262 in theveil-website + this branch): claim-enforcement-guard PASS, personal-info-leak-detector PASS, bug-hunter caught 2 numerical-drift findings (both fixed), regulator-validator 4 PASS + 2 WARN (one fixed in this PR's RECIPE.md update)
  • Post-merge edge verify: github.com/Declade/lucairn-research/tree/main/papers/paper-2-finance returns 200
  • Post-merge edge verify: github.com/Declade/lucairn-research/blob/main/datasets/finance/RECIPE.md returns 200

PRD: Opus Advisor/specs/prd-2026-05-22-paper-2-finance.md

Declade added 2 commits May 23, 2026 00:35
…ne artifacts

Paper 2 in-flight — Lucairn Research Program's CFPB Consumer Complaint Database
(public-domain US-government work) benchmarked against GLBA NPI (16 CFR § 313.3(n)
+ FTC Safeguards + PCI-DSS). Same two-measurement methodology as Paper 1.

What's in this commit (in-flight; numbers TBD by benchmark runs):
- datasets/finance/RECIPE.md — methodology of record for the CFPB + GLBA
  enumeration; mirrors datasets/healthcare/RECIPE.md structurally
- src/inject-finance-pii-core.ts — deterministic synthetic-NPI injection
  (Mulberry32 PRNG, Faker, 17 GLBA categories, 20-25 entities per narrative,
  same seed = 42 as healthcare for cross-paper sampling parity)
- src/glba-category-mapping.ts — placeholder→GLBA mapping for FP attribution
- src/streaming-csv.ts — streaming CSV reader for the ~8GB CFPB CSV
  (V8's max string length is ~512MB; the healthcare path stays on in-memory
  csv.ts since MTSamples is ~50MB)
- scripts/download-cfpb.ts + inject-finance-pii.ts + verify-finance-injection.ts
  + analyze-finance-ndjson.ts — Paper 2 driver scripts
- papers/paper-2-finance/sanitizer-config/ — paper2_* recognizers.py +
  finance-terms.txt + README (reproducibility artifact; the live sanitizer
  application path is documented in the README)

Harness generalisation (shared with Paper 1; preserves Paper 1 behaviour):
- gateway-client.ts: ProvingGroundAnnotation.type widened from HipaaCategory
  to string; new AnnotationInput interface as the generic shape both papers'
  InjectedEntity types satisfy
- run-pipeline.ts: new --narrative-column flag (default 'transcription' for
  healthcare; finance overrides to 'Consumer complaint narrative')
- mocks/gateway-fixtures.ts: AnnotationInput swap
- All 46 existing tests still pass (no Paper 1 regression)
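
A minimal sketch of how a flag with a healthcare default and a finance override could be resolved. This is hand-rolled for illustration and is not the parser in run-pipeline.ts: the function name `parseNarrativeColumn` is hypothetical, and accepting both `--narrative-column value` and `--narrative-column=value` is an assumption.

```typescript
const DEFAULT_NARRATIVE_COLUMN = "transcription";

// Returns the column named by --narrative-column, or the healthcare default if absent.
function parseNarrativeColumn(argv: string[]): string {
  const flag = "--narrative-column";
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === flag) {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error(flag + " requires a value");
      }
      return value;
    }
    if (arg.startsWith(flag + "=")) {
      return arg.slice(flag.length + 1);
    }
  }
  return DEFAULT_NARRATIVE_COLUMN;
}
```

With this shape, existing healthcare invocations need no changes, and a finance run passes the CFPB column name explicitly.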

Benchmark runs + blog publication land in subsequent commits once both
baseline and tuned numbers are confirmed row-by-row against the NDJSONs.

PRD: Opus Advisor/specs/prd-2026-05-22-paper-2-finance.md
… JSONs + compare script

After running the full baseline + tuned + score-bump-variant benchmarks, this
commit lands the final reproducibility state:

- papers/paper-2-finance/SUMMARY-baseline.json (rows=500, recall=72.24%, precision=47.36%, F1=57.21, FP=9026)
- papers/paper-2-finance/SUMMARY-tuned.json (rows=500, recall=72.20%, precision=81.35%, F1=76.50, FP=1861) — V1 safelist-only is canonical "after"
- papers/paper-2-finance/sanitizer-config/recognizers.py — score-bump variants documented (V2 experiment); 10 paper2_* recognizers
- papers/paper-2-finance/sanitizer-config/finance-terms.txt — 108 effective terms (trimmed after broadness audit per Paper 1's "any word in span" lesson)
- datasets/finance/RECIPE.md — PCI-DSS cite refined to "v4.0 Glossary + §3.2.1" per regulator-validator review
- scripts/compare-finance-summaries.py — markdown delta table generator used to produce blog tables

NDJSONs (baseline-500row-*.ndjson + tuned-500row-*.ndjson) stay gitignored
per the per-paper raw-results convention.

Companion blog: lucairn.eu/blog/financial-pii-redaction-benchmark (theveil-website#262).
Declade merged commit 273d044 into main on May 23, 2026
0 of 6 checks passed