
Improve PII generator: more realistic fake data and lookup existing mappings first #465

@hanneshapke

Summary

Once entity persistence is restored (see #464), the next gap is the generator itself:

  1. The per-label generators use small inline arrays of fake values, so collisions and obviously-fake outputs are common — we should expand the pools and produce more realistic data (e.g., richer name pools, locale-aware addresses, realistic-looking but invalid credit card / SSN / phone formats).
  2. The masking pipeline regenerates a fresh dummy for every detected entity on every request. We should look up the persisted mapping first so the same original PII always maps to the same dummy, avoiding duplicates and making conversations consistent across turns.

Code path to review

Generator dispatch and per-type generators:

  • src/backend/pii/generator_service.go:54, GenerateReplacement(label, originalText), and the label → generator routing table at :60
  • src/backend/pii/generators/pii_generators.go — all per-type generators (~500 lines). Example: EmailGenerator at :29 uses ~50 first names and ~40 last names; PhoneGenerator at :73. Expand the inline pools and/or load from data files.

Masking pipeline that currently bypasses the existing mapping:

  • src/backend/pii/masking_service.go:89-94: the loop calls s.generator.GenerateReplacement(...) unconditionally for every detected entity. It does not check the existing store before generating, so each request produces a new dummy and overwrites the previous mapping via StoreMapping's upsert.

Mapping lookup that already exists and should be wired in:

  • src/backend/pii/mapper.go:106, PIIMapping.GetDummy(original): cache-first, falls through to SQLite. This is exactly the dedupe check we need before calling the generator.
  • src/backend/pii/database.go:185, StoreMapping upserts; once dedupe is in place, repeated entities should hit the cache/DB and skip the generator entirely.

Suggested next steps

  1. In masking_service.go:89, before calling GenerateReplacement, pass the PIIMapping (or a small lookup interface) into MaskingService and check mapper.GetDummy(originalText) first. Only generate if there's no existing mapping. Then AddMapping the new one so future requests are consistent.
  2. Audit each generator in pii_generators.go for pool size and realism. Decide whether to keep them inline (and just expand) or move to embedded data files (embed.FS) so we can ship larger, locale-aware pools without bloating the source.
  3. Add tests in src/backend/pii/generators/pii_generators_test.go covering: (a) the generator never returns the original input, (b) repeated calls for the same (label, original) return the persisted dummy rather than a new one (this test will fail until step 1 lands).

Depends on
