An adversarial, explainable name-screening engine that matches sanctioned names across Arabic and Latin scripts, survives deliberate evasion, and explains every match.
Type a name in Arabic or Latin script. MIRSAD screens it against the open OFAC SDN and UN Consolidated sanctions lists, returns ranked candidate matches with a calibrated confidence score, and shows why each one matched, signal by signal, in enough detail to satisfy both a compliance analyst and a regulator.
Sanctions screening is a never-finished problem in regulated finance, and it is brutal on two fronts:
- More than 90% of screening alerts are false positives. Every one is a human analyst-hour. Every miss is a sanctioned party slipping through and a regulatory penalty.
- Arabic makes it dramatically harder. A single name like محمد romanizes to Mohammed, Muhammad, Mohamad, Mohamed, Muhammed. Particles (Al, Bin, Abu) shift word order. خ, ع, ق have no clean Latin equivalent. English-first tools underperform here, which is exactly the gap.
MIRSAD is the screening core built for that gap, and stress-tested against the thing real criminals actually do: deliberately misspell their names to evade screening.
Screen a name and get back, for each candidate:
- a calibrated match score and confidence band (strong / possible / weak)
- the cross-script name pairing (Arabic ⟷ Latin) when an original-script name is on file
- a per-signal explanation (which features drove the score, positive and negative) plus a plain-language reason
- accept / dismiss for human-in-the-loop review, because the tool flags and a person decides
When the matched entity has no Arabic name on file (OFAC records are Latin-only), the dossier says so plainly rather than faking it.
On the full lists (8,192 individuals, 21,799 aliases including 332 Arabic-script names), evaluated on a held-out, entity-disjoint test split (no entity's aliases straddle train and test):
| Metric | Learned fusion | Fuzzy baseline | Read |
|---|---|---|---|
| AUC | 0.806 | 0.770 | learned ranks better overall |
| Adversarial recall @ rank-1 (Tiers 0 to 3) | 0.95 to 0.99 | 0.90 to 0.94 | learned wins under every evasion tier |
| Recall @ ≤1% false-positive rate | 0.450 | 0.490 | the baseline edges it here |
| Expected Calibration Error | 0.021 | 0.105 before isotonic | scores are well-calibrated |
| Latency per screen | 25 ms median | | real-time on a single machine |
| Blocking recall | 1.000 | | nothing lost before scoring |
The honest headline is two-sided, and that is the point. The learned model wins on ranking quality (AUC) and on ranking the true entity first under every evasion tier, but the simple fuzzy baseline edges it at the strict ≤1% false-positive operating point. That gap is a diagnosis, not an embarrassment: it shows exactly where the Pass-2 deep encoder earns its keep (the extreme low-false-positive tail). For a fiduciary-minded reviewer, naming that honestly is more credible than a single inflated number.
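To make two of the metrics in the table concrete, here is a minimal sketch of how Expected Calibration Error and recall at a false-positive ceiling can be computed from scored pairs. The function names, the equal-width binning, and the default of 10 bins are assumptions for illustration; the repository's own evaluation code may bin or threshold differently.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted mean of |observed match rate - mean score| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    total = len(scores)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(s for s, _ in bucket) / len(bucket)
        match_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(match_rate - mean_score)
    return ece


def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Best recall over all thresholds whose false-positive rate is <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    best = 0.0
    # Quadratic scan, kept simple for readability; fine for a sketch.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best
```

Read this way, the two-sided result is easy to picture: AUC summarizes ranking across every threshold, while `recall_at_fpr` looks only at the thresholds strict enough to keep false positives under the ceiling, so a model can lead on one and trail on the other.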
```mermaid
flowchart LR
    Q["Query name<br/>AR or EN"] --> N["Normalize<br/>diacritics, particles,<br/>cross-script"]
    N --> B["Block<br/>trigram candidate<br/>generation"]
    B --> F["Featurize<br/>edit-distance, phonetic,<br/>token, structural"]
    F --> S["Learned fusion<br/>logistic regression<br/>+ calibration"]
    S --> X["Explain<br/>per-signal<br/>contributions"]
    X --> R["Ranked matches<br/>+ reasons"]
```
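The first two stages of that flow (normalize, then block) can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's code: `normalize`, `trigrams`, `TrigramIndex`, and the `ARABIC_FOLD` table are hypothetical names, the folding choices are one common convention, and the sketch leaves out particle handling and cross-script transliteration.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed folding table: alef wasla -> alef, taa marbuta -> haa, alef maqsura -> yaa.
ARABIC_FOLD = str.maketrans({"\u0671": "\u0627", "\u0629": "\u0647", "\u0649": "\u064a"})


def normalize(name: str) -> str:
    """Strip diacritics, fold Arabic letter variants, lowercase, drop punctuation."""
    text = unicodedata.normalize("NFKD", name)
    # Removing combining marks drops Latin accents and Arabic harakat, and it also
    # folds hamza-carrying alef forms to bare alef, because NFKD decomposes them.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = text.replace("\u0640", "")          # tatweel (elongation stroke)
    text = text.translate(ARABIC_FOLD).lower()
    text = re.sub(r"[^\w\s]", " ", text)       # hyphens and apostrophes become spaces
    return " ".join(text.split())


def trigrams(text: str) -> set[str]:
    """Character trigrams over the padded string, so short tokens still produce grams."""
    padded = f"  {text} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Inverted index from trigram to alias ids, used to shortlist candidates."""

    def __init__(self) -> None:
        self._aliases: list[str] = []
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, alias: str) -> None:
        alias_id = len(self._aliases)
        self._aliases.append(alias)
        for gram in trigrams(normalize(alias)):
            self._postings[gram].add(alias_id)

    def candidates(self, query: str, min_shared: int = 2, limit: int = 50) -> list[str]:
        shared: dict[int, int] = defaultdict(int)
        for gram in trigrams(normalize(query)):
            for alias_id in self._postings.get(gram, ()):
                shared[alias_id] += 1
        ranked = sorted(
            (i for i, count in shared.items() if count >= min_shared),
            key=lambda i: (-shared[i], i),
        )
        return [self._aliases[i] for i in ranked[:limit]]
```

Only the aliases returned by `candidates` would go on to feature extraction and scoring. The `min_shared` and `limit` values are placeholders; in practice they trade blocking recall against the size of the candidate set.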
Three layers, each a thin, single-purpose module:
- Normalization and candidate generation. Ingest the sanctions XML, normalize Arabic and Latin names (strip diacritics, unify alef, hamza, and taa-marbuta forms, handle Al/Bin/Abu particles), then use a trigram inverted index to return a small candidate set, so a query is never compared against all 21,799 aliases.
- Hybrid learned scoring. Eight deterministic features (Jaro-Winkler, Levenshtein, token-set and token-sort, an Arabic-aware phonetic score, particle-aware overlap, length, first-token) feed a logistic-regression fusion that learns how to weight them. The model is trained on real same-entity alias pairs (positives) versus cross-entity pairs that land in the same block (hard negatives).
- Explainability. Every match returns its per-signal contributions (the deterministic signals plus the learned weights) as a human-readable rationale, so an analyst can see and defend exactly why a name was flagged.
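Because the fusion is a logistic regression, each signal's contribution to the log-odds can be read off directly as weight times feature value. The sketch below shows that decomposition. The weights, bias, and the reduced four-feature set are invented for illustration and are not the trained model's values; the repository lists `shap` in its stack, and this sketch does not claim to reproduce how it computes its explanations.

```python
import math

# Hypothetical weights and bias, chosen only to make the example readable.
WEIGHTS = {
    "jaro_winkler": 3.0,
    "token_set": 2.0,
    "first_token": 1.0,
    "length_diff": -1.5,
}
BIAS = -4.0


def explain(features: dict[str, float]) -> tuple[float, list[tuple[str, float]]]:
    """Return the match probability and per-signal log-odds contributions."""
    contributions = {name: WEIGHTS[name] * features[name] for name in WEIGHTS}
    logit = BIAS + sum(contributions.values())
    score = 1.0 / (1.0 + math.exp(-logit))
    # Largest absolute contribution first, so the main drivers lead the rationale.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


score, reasons = explain(
    {"jaro_winkler": 0.9, "token_set": 1.0, "first_token": 1.0, "length_diff": 0.2}
)
for name, value in reasons:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name} {direction} the log-odds by {abs(value):.2f}")
```

The contributions plus the bias sum exactly to the logit, which is what makes a linear fusion straightforward to defend: every point of the score traces back to a named signal, whether it pushed the match up or down.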
Most name-matching demos test on clean variants. Real criminals obfuscate. MIRSAD's benchmark generates evasive variants of held-out sanctioned names across four tiers, each grounded in documented evasion typologies (Wolfsberg, FATF, OFAC), and reports recall under that pressure versus a baseline:
- Tier 0, clean: a legitimate alternate transliteration (Mohammed to Muhammad)
- Tier 1, typo: one or two character edits
- Tier 2, structural: particle drops, token reordering, dropped middle name
- Tier 3, adversarial: homoglyph substitution, cross-script mixing, vowel manipulation, name splitting
Ethics. The variant generator is dual-use (effectively a sanctions-evasion cookbook), so it is internal-only: used by the evaluation harness, never exposed through the API or UI, never shipped as a runnable tool. This README publishes the methodology and the numbers; the generator itself is withheld deliberately.
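The defensive side of Tier 3 can be illustrated without generating any variants. The sketch below flags tokens that mix scripts, which is one way homoglyph substitution and cross-script mixing show up in a query. It is a heuristic under stated assumptions: `scripts_in` and `mixed_script_tokens` are hypothetical helpers, and taking the first word of the Unicode character name as the script label is a rough approximation rather than a full Unicode script lookup.

```python
import unicodedata


def scripts_in(token: str) -> set[str]:
    """Approximate the scripts in a token from each letter's Unicode name."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN", "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_tokens(name: str) -> list[str]:
    """Return the tokens whose letters come from more than one script."""
    return [token for token in name.split() if len(scripts_in(token)) > 1]


# The "a" in the first token is Cyrillic U+0430, visually identical to Latin "a".
suspicious = mixed_script_tokens("Moh\u0430mmed Ali")
```

A name written entirely in Arabic or entirely in Latin passes through untouched; only tokens that blend scripts are flagged, which a screening pipeline could surface as a signal for the analyst.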
- The linear fusion over-weights shared core tokens, so common name components (a lone "Mohammed") trigger false positives. This is visible in the signal panel, not hidden, and it is what the human-in-the-loop review is for.
- The phonetic feature is English-biased and contributes nothing on Arabic script (metaphone returns empty for Arabic). The real cross-script bridge is the Pass-2 Siamese encoder.
- The model trails the baseline at the strict ≤1% false-positive operating point (see above).
- Evaluation uses synthetic adversarial variants for robustness direction and a held-out slice of genuine aliases for real performance. Both are reported.
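The held-out slice only measures real performance if no entity contributes aliases to both sides of the split. A minimal sketch of an entity-disjoint split follows, assuming records arrive as `(entity_id, alias)` tuples; the function name and record shape are hypothetical. scikit-learn's `GroupShuffleSplit` offers the same grouping guarantee, and this plain-Python version exists only to show the idea.

```python
import random
from collections import defaultdict


def entity_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split (entity_id, alias) records so each entity lands wholly in train or test."""
    by_entity = defaultdict(list)
    for entity_id, alias in records:
        by_entity[entity_id].append(alias)

    entity_ids = sorted(by_entity)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(entity_ids)

    n_test = max(1, round(len(entity_ids) * test_fraction))
    test_ids = set(entity_ids[:n_test])

    train = [(e, a) for e, a in records if e not in test_ids]
    test = [(e, a) for e, a in records if e in test_ids]
    return train, test
```

Splitting by alias instead of by entity would let the model see one spelling of a person in training and be tested on another spelling of the same person, which inflates every metric in the table.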
```bash
# 1. Backend (Python 3.12)
cd mirsad
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python data/download.py                     # fetch the open OFAC + UN lists
PYTHONPATH=. python scripts/serve_api.py    # screening API on :8000

# 2. Console (second terminal)
cd ui
npm install
npm run dev                                 # http://localhost:5173
```

Reproduce the evaluation and regenerate the charts:

```bash
PYTHONPATH=. python scripts/run_eval.py
```

Note: the adversarial-tier evaluation calls the internal-only variant generator, which is withheld by design (see The adversarial benchmark). That portion runs only where the local component is present; the honest-alias metrics and every figure driven by the open lists reproduce directly.
```text
mirsad/
├── data/          download scripts + cached sanctions XML (raw data gitignored)
├── ingest/        OFAC + UN XML parsers -> normalized Entity records
├── normalize/     Arabic/Latin normalization (extracted as a reusable skill)
├── match/         features, blocking, learned fusion scorer, SHAP explanation, screen()
├── adversarial/   Tier 0 to 3 red-team generator (internal-only)
├── eval/          recall@FP, adversarial lift, calibration, latency, charts
├── api/           FastAPI /screen + /benchmark
├── ui/            React + Vite + TS + Tailwind "Watchtower" console
├── docs/          architecture, specs, plans, screenshots
├── PRODUCT.md     design context (users, tone)
└── DESIGN.md      the Watchtower design system
```
Python 3.12, lxml, rapidfuzz, jellyfish, scikit-learn, shap, pandas, matplotlib, FastAPI, uvicorn, pytest. Front-end: React, Vite, TypeScript, Tailwind, shadcn/ui, Radix.
All data is open and public (OFAC SDN, UN Consolidated). No bank data, no private feeds, no NDA. This is a portfolio project built to demonstrate the screening problem class, not a deployed compliance system. The code is released under the MIT License.
Concepts explained from first principles, the architecture decisions (ADRs), the full engineering journal, and a self-study curriculum live in the Obsidian vault: ../The Dome/MIRSAD/MIRSAD-MOC.md. See also docs/ARCHITECTURE.md and the design system in DESIGN.md. Inherits workspace standards from ../CLAUDE.md.


