A personal data-removal toolkit for the Indian and global web. SanitizeMe scans where your email, phone, and username show up online; checks if any of those have appeared in known breaches; and helps you actually get the data taken down through legal opt-out letters under DPDP, GDPR, and CCPA.
There are three pieces in this repo:
- `web/`: a Next.js dashboard where the user runs scans and tracks opt-out progress
- `android/`: an Android app that sits alongside the dashboard and adds on-device features (HTTPS interception via a local CA, archive removal helpers)
- `backend/`: a FastAPI service that does the actual scanning, account discovery, breach lookups, and outbound legal-letter generation
Your name, email, phone number, and home address are sold by 200+ data brokers — most of which you've never heard of. Major sources:
- Truecaller, Spokeo, BeenVerified, Whitepages — public people-search sites built off scraped phone directories and voter rolls
- LinkedIn, GitHub, Twitter public profiles — your professional identity is permanently indexable
- Old forum accounts and one-off sign-ups from 5+ years ago, still live on sites whose registration pages haven't been touched since 2015
- Breach corpora (HaveIBeenPwned reports 13B+ breached account records across 800+ documented breaches)
- Wayback Machine and Common Crawl, which keep historical copies of pages you thought you'd deleted
DPDP (India 2023), GDPR (EU 2018), and CCPA (California 2020) all give you a legal right to demand removal — but each broker has different forms, different evidence requirements, and different response windows. Doing it by hand for 200 brokers is a multi-month project.
- Account discovery. Given an email or username, find every site that account exists on. Uses public probing techniques (forgot-password endpoints via `holehe`, public profile lookups, OSINT-style username enumeration across 120+ sites).
- Breach checks. Cross-references the email against breach corpora (HaveIBeenPwned-style, with local breach data fallback).
- Phone discovery. Reverse-lookup APIs (NumVerify, APILayer) to identify phone-number footprint and which carrier/region it leaks to.
- Data-broker opt-out. A built-in catalogue of 200+ data brokers (`backend/data/brokers.json`) with structured opt-out URLs, methods (web form, email, snail mail), legal basis (GDPR, CCPA, DPDP), difficulty rating, and region. The app auto-generates the right opt-out letter for each.
- Legal letter templates. DPDP, GDPR, and CCPA-compliant takedown letters in `backend/api/legal_templates.py`, parameterised on the user's data and the broker's response window.
- Web archive removal. Scripts in `backend/api/archive_removal.py` handle Wayback Machine and Common Crawl removal requests, and emit a properly formatted `robots.txt` that excludes AI scraper user-agents (GPTBot, ClaudeBot, Google-Extended, CCBot, etc.).
- OAuth integration map. `backend/api/oauth_integrations.py` lists which OAuth providers a given account hangs off (Google, Apple, Facebook, etc.) so the user can revoke at the source instead of trying to delete downstream.
- Temp email and bulk actions. Spin up disposable inboxes for new sign-ups; run bulk scans / opt-outs across multiple accounts in one go.
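
The `robots.txt` emission mentioned under web archive removal can be sketched roughly as follows. This is an illustration only, not the contents of `backend/api/archive_removal.py`: the function name `build_robots_txt` and the `AI_SCRAPER_AGENTS` tuple are hypothetical, and only the four user-agent names are taken from the list above.

```python
# Hypothetical sketch; the real logic lives in backend/api/archive_removal.py.
AI_SCRAPER_AGENTS = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot")


def build_robots_txt(agents=AI_SCRAPER_AGENTS, disallow="/"):
    """Return robots.txt text that blocks each listed user-agent from `disallow`."""
    groups = []
    for agent in agents:
        # One group per crawler: a User-agent line followed by its rule.
        groups.append(f"User-agent: {agent}\nDisallow: {disallow}\n")
    # Joining with a newline leaves one blank line between groups.
    return "\n".join(groups)


if __name__ == "__main__":
    print(build_robots_txt())
```

Keeping one group per crawler makes it easy to add or drop a single agent later without touching the rest of the file.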

Architecture:

    +-------------------+           +----------------------+
    | Next.js Dashboard |   HTTPS   | FastAPI Backend      |
    | (web/)            |  <----->  | (backend/)           |
    +-------------------+           +-----------+----------+
                                                |
                                                | shells out to / queries
                                                v
                  +-----------------------------+---------------------+
                  |                             |                     |
             +----+----+              +---------+---------+    +------+------+
             | holehe  |              | NumVerify /       |    | HIBP-style  |
             | (120+   |              | APILayer phone    |    | breach data |
             | sites)  |              | reverse lookup    |    | corpus      |
             +---------+              +-------------------+    +-------------+

    +----------------------+
    | Android app          |
    | (android/)           |
    |                      |
    | Local CA → MITM      |   on-device only
    | HTTPS interception   |   (no traffic leaves device)
    +----------------------+

The Android app is independent — it doesn't call the backend. It surfaces which third-party tracking domains your phone is talking to by installing a SanitizeMe-controlled CA cert and inspecting outbound HTTPS in-process. Useful for catching things the server-side scanners can't see (in-app trackers in your installed apps).

Repository layout:

    .
    +-- web/                     Next.js dashboard (TypeScript + React)
    |   +-- src/app/             App Router pages: scan, report, dashboard
    |   +-- package.json
    +-- android/                 Android app (Kotlin + Gradle)
    |   +-- app/src/main/        Activities, HTTPS interceptor, CA cert export
    +-- backend/                 FastAPI service
    |   +-- main.py
    |   +-- api/                 Route handlers (one file per feature)
    |   +-- scrapers/            Account / phone / public-record discovery
    |   +-- db/                  SQLAlchemy models + DB session
    |   +-- data/brokers.json    Curated catalogue of 200+ data brokers
    |   +-- requirements.txt
    +-- README.md


Backend:

    cd backend
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    cp .env.example .env
    # fill in NUMVERIFY_API_KEY, APILAYER_KEY (optional, for phone discovery)
    uvicorn main:app --reload --port 8000

Endpoints (excerpt):

| Method | Path | Purpose |
|---|---|---|
| POST | `/scan/email` | Account + breach discovery for an email |
| POST | `/scan/phone` | Reverse phone lookup |
| POST | `/scan/username` | Username footprint via OSINT |
| GET | `/brokers` | Full broker catalogue |
| POST | `/optout/{broker_id}` | Generate opt-out letter for a broker |
| POST | `/letters/dpdp` | Generate DPDP takedown letter |
| POST | `/archive/wayback-removal` | Submit a Wayback Machine removal request |
| GET | `/report/{user_id}` | Aggregated report (used by the web dashboard) |
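
For scripting against the backend without the dashboard, a request to `/scan/email` might be built along these lines. The JSON body shape (`{"email": ...}`) and the helper name `build_scan_request` are assumptions, since the request schema is not documented here; check the route handlers in `backend/api/` for the actual fields.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # local backend address used in this README


def build_scan_request(email, base_url=API_URL):
    """Build (but do not send) a POST /scan/email request.

    The body shape {"email": ...} is an assumption, not a documented schema.
    """
    body = json.dumps({"email": email}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scan/email",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it once the backend is running locally:
#   with urllib.request.urlopen(build_scan_request("me@example.com"), timeout=30) as resp:
#       print(json.load(resp))
```

Separating request construction from sending keeps the sketch usable offline and makes the assumed payload easy to adjust once the real schema is known.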

Web dashboard:

    cd web
    npm install
    npm run dev   # http://localhost:3000

The dashboard hits the FastAPI backend at `NEXT_PUBLIC_API_URL` (default `http://localhost:8000`).

Requires Android Studio, SDK 34+, device on Android 8.0+ (API 26).

    cd android
    ./gradlew assembleDebug
    adb install app/build/outputs/apk/debug/app-debug.apk

Optional: enable HTTPS interception by exporting the SanitizeMe CA cert from inside the app and installing it under Settings -> Security -> Install Certificate -> CA Certificate. This lets the app inspect outbound HTTPS traffic and surface third-party tracking domains.

Security note. Installing a third-party CA cert means any app on your phone that doesn't pin certificates will trust SanitizeMe to MITM its HTTPS. SanitizeMe runs the interception locally and never routes traffic off-device, but a stolen device with the CA installed could in theory be used as a man-in-the-middle by someone else. Remove the cert when you're done auditing.
The backend optionally shells out to `holehe` for forgot-password probing across 120+ sites. Install it on your PATH:

    pipx install holehe

If `holehe` is not on PATH, that scanner is silently skipped and the rest of the pipeline still works.

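
The skip-if-missing behaviour can be approximated with a PATH lookup before shelling out. This is a minimal sketch, not the backend's actual scanner code: the function names are hypothetical, and output parsing is deliberately left out.

```python
import shutil
import subprocess


def holehe_available():
    """Return True if a `holehe` executable is found on PATH."""
    return shutil.which("holehe") is not None


def run_holehe(email, timeout=120):
    """Return holehe's raw stdout for `email`, or None if holehe is not installed.

    Returning None instead of raising mirrors the skip-and-continue
    behaviour described above.
    """
    if not holehe_available():
        return None
    result = subprocess.run(
        ["holehe", email],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a non-zero exit should not abort the rest of the pipeline
    )
    return result.stdout
```

A caller can treat `None` as "scanner unavailable" and move on to the remaining scanners.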
Why local-first, no hosted SaaS? A "personal data removal" SaaS would be sitting on the most sensitive dataset imaginable — every user's full breach + broker exposure inventory. That's a single high-value target for an attacker. Running everything locally means no one else holds the data, and the user controls when (and whether) any of it leaves the laptop.
Why a separate Android app instead of folding it into the web dashboard? Mobile is where the tracking surface is — most third-party trackers are in-app SDKs, not web pixels. The web dashboard can't see what apps your phone is talking to; the Android app can, via the local CA + VpnService route. The two surfaces are complementary, not redundant.
Why curate brokers.json by hand instead of scraping a registry? No reliable registry exists. Different jurisdictions, different naming conventions, brokers actively try not to be findable. Hand-curated also lets the catalogue carry per-broker metadata (response window, evidence needed, success rate based on the maintainer's own results) that a scrape can't produce.
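
To make the per-broker metadata concrete, here is one way a catalogue entry could be modelled and loaded. Every field name below is an assumption and the sample entry is invented for illustration; the real keys are whatever `backend/data/brokers.json` uses.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Broker:
    # Hypothetical field names; check backend/data/brokers.json for the real keys.
    id: str
    name: str
    opt_out_url: str
    method: str                # e.g. "web_form", "email", "snail_mail"
    legal_basis: str           # "GDPR", "CCPA", or "DPDP"
    region: str
    difficulty: int
    response_window_days: int


def load_brokers(raw_json):
    """Parse a JSON array of broker objects into Broker records."""
    return [Broker(**entry) for entry in json.loads(raw_json)]


# Invented sample entry, not a real broker.
SAMPLE = """[
  {"id": "example-broker", "name": "Example Broker",
   "opt_out_url": "https://broker.example/opt-out", "method": "web_form",
   "legal_basis": "CCPA", "region": "US", "difficulty": 2,
   "response_window_days": 45}
]"""

brokers = load_brokers(SAMPLE)
print(brokers[0].name, brokers[0].legal_basis)
```

Using a frozen dataclass means a typo in a key fails loudly at load time rather than surfacing later during letter generation.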
Why generate letters instead of auto-submitting opt-outs? Half the broker opt-out flows require a CAPTCHA, a notarised ID, or snail mail. The other half work over an HTML form that breaks every six months. Generating a ready-to-send letter (or paste-able form text) is the highest-leverage piece a maintainer can keep working; the actual submission is one extra click for the user but doesn't break.
Why DPDP + GDPR + CCPA in one tool? A typical user has data leaked across all three jurisdictions — the GDPR letter alone doesn't help with a Truecaller (India) listing, the DPDP alone doesn't move a US-only broker. The legal-basis field in brokers.json picks the right framework per broker.
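
Picking the framework per broker amounts to a lookup from the legal-basis value to a template. The sketch below uses placeholder wording and hypothetical names (`TEMPLATES`, `render_letter`); it is not the text in `backend/api/legal_templates.py` and is not legal advice.

```python
from string import Template

# Placeholder wording only; the real templates are in backend/api/legal_templates.py.
TEMPLATES = {
    "GDPR": Template(
        "To $broker: under the General Data Protection Regulation I request "
        "erasure of my personal data ($identifier). Please respond within $days days."
    ),
    "CCPA": Template(
        "To $broker: under the California Consumer Privacy Act I request "
        "deletion of my personal information ($identifier). Please respond within $days days."
    ),
    "DPDP": Template(
        "To $broker: under the Digital Personal Data Protection Act, 2023 I request "
        "erasure of my personal data ($identifier). Please respond within $days days."
    ),
}


def render_letter(legal_basis, broker, identifier, days):
    """Fill in the template matching the broker's legal basis."""
    try:
        template = TEMPLATES[legal_basis.upper()]
    except KeyError:
        raise ValueError(f"no template for legal basis {legal_basis!r}") from None
    return template.substitute(broker=broker, identifier=identifier, days=days)


print(render_letter("DPDP", "Example Broker", "me@example.com", 30))
```

Raising on an unknown legal basis, rather than falling back to a default framework, avoids sending a letter that cites the wrong law.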
- `holehe` probes leak signal to the target sites. Every "forgot password" probe is logged by the target. A SOC analyst reviewing failed login attempts on, say, Reddit will see the probe. This is the OSINT trade-off: you're using the same techniques an attacker would use to enumerate your own exposure.
- Phone-discovery APIs are rate-limited. NumVerify and APILayer free tiers are stingy; production use needs a paid plan or a long sleep between calls.
- Broker catalogue ages fast. New brokers appear constantly; old ones change their opt-out form URL. Catalogue contributions welcome.
- No automated success tracking. The dashboard records when a letter was generated, not whether the broker complied. Compliance verification requires the user re-running the scan a few weeks later.
- The Android CA-cert MITM does not bypass certificate pinning. Apps that pin (most banks, WhatsApp, Signal) will reject the SanitizeMe CA and refuse to talk. That's intended on their side; it just means SanitizeMe can't audit them.
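
For the rate-limit caveat, the "long sleep between calls" approach can be sketched as a small throttle that enforces a minimum gap between lookups. The class name and the interval value are assumptions; the right interval depends on the plan you are on.

```python
import time


class Throttle:
    """Enforce a minimum gap, in seconds, between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the behaviour can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (lookup_phone is a hypothetical function wrapping the API call):
#   throttle = Throttle(min_interval=2.0)
#   for number in numbers:
#       throttle.wait()
#       lookup_phone(number)
```

The first call never sleeps, so a single lookup pays no penalty; only bursts are slowed down.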
Local-only. The backend runs on uvicorn :8000 and the dashboard on next dev :3000. There is no hosted instance and no plan for one — see the local-first rationale above.
MIT.