Organizations cannot validate whether their Data Security Posture Management (DSPM) tools actually work without uploading real, sensitive production data, which creates a chicken-and-egg security risk. Security teams need a way to test classification accuracy, data discovery coverage, and alert tuning without exposing actual PII, PHI, or financial data.
A Python utility that generates high-fidelity synthetic PII: files that look like real customer data to a scanner but contain zero actual sensitive information. The tool produces realistic patterns for SSNs, credit card numbers, email addresses, phone numbers, and medical record numbers across multiple file formats.
Validates tool effectiveness and classification accuracy in a controlled environment. Security teams can benchmark DSPM solutions (BigID, Microsoft Purview, Wiz) before purchase, or tune existing deployments, without compliance risk.
- Script to generate 1,000 rows of synthetic data in CSV, JSON, and Parquet formats.
- Configurable "data profiles" (e.g., Healthcare, Financial Services, Retail).
- A "Ground Truth" manifest file that documents what sensitive data patterns exist in each generated file.
- Python
- Faker library
- Pandas
- generate_test_data.py
- ground_truth_manifest.json
- sample_datasets/ folder with pre-generated test files
- README.md with usage examples
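For reference, a ground-truth manifest entry might look like the following. This schema is a hypothetical sketch, not a fixed format; the field names are illustrative:

```json
{
  "files": [
    {
      "path": "sample_datasets/financial_test.csv",
      "rows": 1000,
      "expected_findings": {
        "ssn": "US_SSN",
        "credit_card": "CREDIT_CARD",
        "email": "EMAIL_ADDRESS"
      }
    }
  ]
}
```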
- Adding "obfuscated" data patterns to test advanced ML-based classification.
- Support for unstructured data (PDFs, DOCX files with embedded PII).
- API endpoint to generate data on-demand for CI/CD testing.