dsremo/sanitize-me

SanitizeMe

Python · Next.js · Kotlin · License: MIT

A personal data-removal toolkit for the Indian and global web. SanitizeMe scans where your email, phone, and username show up online; checks if any of those have appeared in known breaches; and helps you actually get the data taken down through legal opt-out letters under DPDP, GDPR, and CCPA.

There are three pieces in this repo:

  • web/ — a Next.js dashboard where the user runs scans and tracks opt-out progress
  • android/ — an Android app that sits alongside the dashboard and adds on-device features (HTTPS interception via a local CA, archive removal helpers)
  • backend/ — a FastAPI service that does the actual scanning, account discovery, breach lookups, and outbound legal-letter generation

The problem

Your name, email, phone number, and home address are sold by 200+ data brokers, most of which you've never heard of. Major sources of exposure:

  • Truecaller, Spokeo, BeenVerified, Whitepages — public people-search sites built off scraped phone directories and voter rolls
  • LinkedIn, GitHub, Twitter public profiles — your professional identity is permanently indexable
  • Old forum accounts and one-off sign-ups from 5+ years ago, still live on sites whose registration pages haven't been touched since 2015
  • Breach corpora (HaveIBeenPwned reports 13B+ breached account records across 800+ documented breaches)
  • Wayback Machine and Common Crawl, which keep historical copies of pages you thought you'd deleted

DPDP (India 2023), GDPR (EU 2018), and CCPA (California 2020) all give you a legal right to demand removal — but each broker has different forms, different evidence requirements, and different response windows. Doing it by hand for 200 brokers is a multi-month project.

What SanitizeMe does

  1. Account discovery. Given an email or username, find the sites where that account exists. Uses public probing techniques: forgot-password endpoints via holehe, public profile lookups, and OSINT-style username enumeration across 120+ sites.

  2. Breach checks. Cross-references the email against breach corpora (HaveIBeenPwned-style, with local breach data fallback).

  3. Phone discovery. Queries phone-lookup APIs (NumVerify, APILayer) to map a phone number's footprint, including the carrier and region it exposes.

  4. Data-broker opt-out. A built-in catalogue of 200+ data brokers (backend/data/brokers.json) with structured opt-out URLs, methods (web form, email, snail mail), legal basis (GDPR, CCPA, DPDP), difficulty rating, and region. The app auto-generates the right opt-out letter for each.

  5. Legal letter templates. DPDP, GDPR, and CCPA-compliant takedown letters in backend/api/legal_templates.py, parameterised on the user's data and the broker's response window.

  6. Web archive removal. Scripts in backend/api/archive_removal.py handle Wayback Machine and Common Crawl removal requests, and emit a properly-formatted robots.txt that excludes AI scraper user-agents (GPTBot, ClaudeBot, Google-Extended, CCBot, etc.).

  7. OAuth integration map. backend/api/oauth_integrations.py lists which OAuth providers a given account hangs off (Google, Apple, Facebook, etc.) so the user can revoke at the source instead of trying to delete downstream.

  8. Temp email and bulk actions. Spin up disposable inboxes for new sign-ups; run bulk scans / opt-outs across multiple accounts in one go.
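
Item 6 mentions emitting a robots.txt that excludes AI scraper user-agents. The following is a minimal sketch of what such a generator could look like, not the code in backend/api/archive_removal.py. The function name build_robots_txt and the constant AI_SCRAPER_AGENTS are hypothetical, and the agent list contains only the four names given above.

```python
# Hypothetical helper; the real logic lives in backend/api/archive_removal.py
# and may differ. The agent list holds only the four names given above.
AI_SCRAPER_AGENTS = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot")


def build_robots_txt(agents=AI_SCRAPER_AGENTS, disallow="/"):
    """Return robots.txt text with one blocking group per user-agent."""
    groups = [f"User-agent: {agent}\nDisallow: {disallow}" for agent in agents]
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(), end="")
```

robots.txt is advisory: it only affects crawlers that choose to honour it.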

Architecture

+-------------------+         +----------------------+
| Next.js Dashboard |  HTTPS  |  FastAPI Backend     |
| (web/)            | <-----> |  (backend/)          |
+-------------------+         +-----------+----------+
                                          |
                                          |  shells out to / queries
                                          v
            +-----------------------------+---------------------+
            |                             |                     |
       +----+----+              +---------+---------+    +------+------+
       | holehe  |              | NumVerify /       |    | HIBP-style  |
       | (120+   |              | APILayer phone    |    | breach data |
       |  sites) |              | reverse lookup    |    | corpus      |
       +---------+              +-------------------+    +-------------+

+----------------------+
| Android app          |
| (android/)           |
|                      |
| Local CA → MITM      |   on-device only
| HTTPS interception   |   (no traffic leaves device)
+----------------------+

The Android app is independent — it doesn't call the backend. It surfaces which third-party tracking domains your phone is talking to by installing a SanitizeMe-controlled CA cert and inspecting outbound HTTPS in-process. Useful for catching things the server-side scanners can't see (in-app trackers in your installed apps).

Repo layout

.
+-- web/                  Next.js dashboard (TypeScript + React)
|   +-- src/app/          App Router pages: scan, report, dashboard
|   +-- package.json
+-- android/              Android app (Kotlin + Gradle)
|   +-- app/src/main/     Activities, HTTPS interceptor, CA cert export
+-- backend/              FastAPI service
|   +-- main.py
|   +-- api/              Route handlers (one file per feature)
|   +-- scrapers/         Account / phone / public-record discovery
|   +-- db/               SQLAlchemy models + DB session
|   +-- data/brokers.json Curated catalogue of 200+ data brokers
|   +-- requirements.txt
+-- README.md

Backend

cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# fill in NUMVERIFY_API_KEY, APILAYER_KEY (optional, for phone discovery)
uvicorn main:app --reload --port 8000

Endpoints (excerpt):

Method  Path                      Purpose
POST    /scan/email               Account + breach discovery for an email
POST    /scan/phone               Reverse phone lookup
POST    /scan/username            Username footprint via OSINT
GET     /brokers                  Full broker catalogue
POST    /optout/{broker_id}       Generate opt-out letter for a broker
POST    /letters/dpdp             Generate DPDP takedown letter
POST    /archive/wayback-removal  Submit a Wayback Machine removal request
GET     /report/{user_id}         Aggregated report (used by the web dashboard)
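
To call the scan endpoint from a script rather than the dashboard, something like the sketch below may work. The JSON body shape ({"email": ...}) is an assumption, as are the helper names build_scan_request and run_scan; check the route handler under backend/api/ for the real request schema.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # uvicorn port from the setup above


def build_scan_request(email, api_url=API_URL):
    """Build (but do not send) a POST /scan/email request.

    The {"email": ...} body is an assumed schema, not taken from the repo.
    """
    body = json.dumps({"email": email}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_url}/scan/email",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_scan(email):
    """Send the request and decode the JSON response (needs the backend up)."""
    with urllib.request.urlopen(build_scan_request(email), timeout=30) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect the request without a running backend.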

Web dashboard

cd web
npm install
npm run dev          # http://localhost:3000

The dashboard hits the FastAPI backend at NEXT_PUBLIC_API_URL (default http://localhost:8000).

Android app

Requires Android Studio, Android SDK 34+, and a device running Android 8.0+ (API 26+).

cd android
./gradlew assembleDebug
adb install app/build/outputs/apk/debug/app-debug.apk

Optional: enable HTTPS interception by exporting the SanitizeMe CA cert from inside the app and installing it under Settings -> Security -> Install Certificate -> CA Certificate. This lets the app inspect outbound HTTPS traffic and surface third-party tracking domains.

Security note. Installing a third-party CA cert means any app on your phone that doesn't pin certificates will trust SanitizeMe to MITM its HTTPS. SanitizeMe runs the interception locally and never routes traffic off-device, but if the device is stolen or compromised with the CA still installed, someone who obtains the CA's private key could in theory intercept that device's HTTPS traffic. Remove the cert when you're done auditing.

External tools

The backend optionally shells out to holehe for forgot-password probing across 120+ sites. Install it on your PATH:

pipx install holehe

If holehe is not on PATH, that scanner is silently skipped and the rest of the pipeline still works.
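
A minimal sketch of that skip-if-missing behaviour, assuming the backend checks PATH before shelling out. The function run_holehe and its `which` parameter are hypothetical, and only the basic `holehe <email>` invocation is used.

```python
import shutil
import subprocess


def run_holehe(email, which=shutil.which, timeout=120):
    """Run holehe for one email, or return None when it is not installed.

    `which` is injectable so the skip path can be exercised without holehe.
    """
    binary = which("holehe")
    if binary is None:
        return None  # scanner skipped; caller carries on with other scanners
    result = subprocess.run(
        [binary, email],  # basic CLI form: holehe <email>
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout
```

Returning None instead of raising keeps the rest of the pipeline running. A caller may still want to log the skip so that a missing install is not mistaken for a clean result.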

Key design decisions

Why local-first, no hosted SaaS? A "personal data removal" SaaS would be sitting on the most sensitive dataset imaginable — every user's full breach + broker exposure inventory. That's a single high-value target for an attacker. Running everything locally means no one else holds the data, and the user controls when (and whether) any of it leaves the laptop.

Why a separate Android app instead of folding it into the web dashboard? Mobile is where the tracking surface is — most third-party trackers are in-app SDKs, not web pixels. The web dashboard can't see what apps your phone is talking to; the Android app can, via the local CA + VpnService route. The two surfaces are complementary, not redundant.

Why curate brokers.json by hand instead of scraping a registry? No reliable registry exists: jurisdictions differ, naming conventions differ, and brokers actively try not to be findable. Hand curation also lets the catalogue carry per-broker metadata (response window, evidence needed, success rate based on the maintainer's own results) that a scrape can't produce.

Why generate letters instead of auto-submitting opt-outs? Half the broker opt-out flows require a CAPTCHA, a notarised ID, or snail mail. The other half work over an HTML form that breaks every six months. Generating a ready-to-send letter (or paste-able form text) is the highest-leverage piece a maintainer can keep working: submitting it costs the user one extra click, but it keeps working when a broker changes its form.
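
As an illustration of letter generation parameterised on user data and a broker's response window, here is a sketch using the standard library's string.Template. The wording is placeholder text, not the templates in backend/api/legal_templates.py, and render_letter and its parameters are hypothetical. The response window is an input; the 30 days in the usage line is an example value, not a statement about any law.

```python
from datetime import date, timedelta
from string import Template

# Placeholder wording for illustration only; not the repo's legal text.
LETTER = Template(
    "To: $broker\n"
    "Date: $sent\n\n"
    "I request erasure of my personal data under $law.\n"
    "Identifiers: $identifiers\n"
    "Please confirm completion by $deadline.\n\n"
    "$name\n"
)


def render_letter(name, identifiers, broker, law, response_days, sent=None):
    """Fill the template; the deadline is the send date plus the window."""
    sent = sent or date.today()
    deadline = sent + timedelta(days=response_days)
    return LETTER.substitute(
        broker=broker,
        sent=sent.isoformat(),
        law=law,
        identifiers=", ".join(identifiers),
        deadline=deadline.isoformat(),
        name=name,
    )


if __name__ == "__main__":
    print(render_letter("A. Kumar", ["a@example.com"], "ExampleBroker", "GDPR", 30))
```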

Why DPDP + GDPR + CCPA in one tool? A typical user has data leaked across all three jurisdictions — the GDPR letter alone doesn't help with a Truecaller (India) listing, the DPDP alone doesn't move a US-only broker. The legal-basis field in brokers.json picks the right framework per broker.
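
A sketch of how a legal-basis field could drive framework selection. The entry shape below (id, region, and a legal_basis list) is an assumption about brokers.json, the broker ids are made up, and group_by_framework is a hypothetical helper.

```python
import json

# Assumed shape of brokers.json entries; real field names may differ.
SAMPLE = """
[
  {"id": "broker-in", "region": "IN", "legal_basis": ["DPDP"]},
  {"id": "broker-eu", "region": "EU", "legal_basis": ["GDPR"]},
  {"id": "broker-us", "region": "US", "legal_basis": ["CCPA"]},
  {"id": "broker-global", "region": "GLOBAL", "legal_basis": ["GDPR", "CCPA"]}
]
"""


def group_by_framework(brokers):
    """Map each framework to the broker ids whose entries list it."""
    groups = {}
    for broker in brokers:
        for law in broker["legal_basis"]:
            groups.setdefault(law, []).append(broker["id"])
    return groups


if __name__ == "__main__":
    for law, ids in group_by_framework(json.loads(SAMPLE)).items():
        print(law, ids)
```

Grouping this way lets one pass over the catalogue decide which letter template each broker gets.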

Known limitations

  • holehe probes leak signal to the target sites. Every "forgot password" probe can be logged by the target. A SOC analyst reviewing password-reset or account-lookup requests on, say, Reddit could see the probe. This is the OSINT trade-off: you're using the same techniques an attacker would use to enumerate your own exposure.
  • Phone-discovery APIs are rate-limited. NumVerify and APILayer free tiers are stingy; production use needs a paid plan or a long sleep between calls.
  • Broker catalogue ages fast. New brokers appear constantly; old ones change their opt-out form URL. Catalogue contributions welcome.
  • No automated success tracking. The dashboard records when a letter was generated, not whether the broker complied. Compliance verification requires the user re-running the scan a few weeks later.
  • The Android CA-cert MITM does not bypass certificate pinning. Apps that pin (most banks, WhatsApp, Signal) will reject the SanitizeMe CA and refuse to talk. That's intended on their side; it just means SanitizeMe can't audit them.
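
For the rate-limit point above, one way to space out phone-lookup calls is a small throttle that enforces a minimum gap between requests. This is a sketch, not code from the repo: the Throttle class is hypothetical, and the right interval depends on the limits documented for your API plan.

```python
import time


class Throttle:
    """Enforce a minimum gap between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call wait() immediately before each lookup. The clock and sleep arguments are injectable so the timing logic can be checked without real delays.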

Status

Local-only. The backend runs on uvicorn :8000 and the dashboard on next dev :3000. There is no hosted instance and no plan for one — see the local-first rationale above.

License

MIT.
