Disclaimer: Entity Resolution eXperiment (ERX) is personal research I do in my free time, and it comes with no checks or guarantees. Please do your own checks if you use this code for your research or projects. All sample data and examples are arbitrary, synthetic data that I generated. No personally identifiable information is in this project.
Objective: This project aims to combine fuzzy matching, graph-based analysis, and feature generation for AML (Anti-Money Laundering) applications.
This system provides three main components:
- Entity Resolution: Identifies and groups similar records into entities with proper classification from multiple source systems
- Graph Generation: Creates a graph in TigerGraph with entities as nodes and transactions as edges
- Feature Generation: Computes graph-based features like PageRank, centrality measures, and risk scores
The system processes data from three source systems:
- Transaction Data (trnx): Originator, beneficiary, and third-party information
- Orbis Data (orbis): Company information and corporate records
- WorldCheck Data (WC): Individual and entity screening records
- Multi-source Entity Resolution: Processes parties from transaction, Orbis, and WorldCheck data
- Optional Field Handling: Gracefully handles missing email, phone, and address data
- Cross-source Entity Mapping: Maintains separate entity records for the same entity across sources
- Entity Classification: Automatically classifies entities as individuals or businesses
- PEP Detection: Identifies Politically Exposed Persons
- Risk Scoring: Calculates risk scores based on various factors
- Graph Analytics: Leverages TigerGraph for advanced graph algorithms
- Feature Engineering: Generates 20+ graph-based features for machine learning
- Comprehensive Reporting: Detailed summaries and risk analysis
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Raw Data      │    │ Entity Resolution│    │   TigerGraph    │
│   (CSV files)   │───▶│     Engine       │───▶│   Database      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │ Feature Generator│    │ Graph Analytics │
                       │ (20+ features)   │    │ (PageRank, etc.)│
                       └──────────────────┘    └─────────────────┘
First, create sample data to test the system:
    python src/data_synthesizer/generate_sample_data.py
This creates:
- data/sample_trnx_large.csv (1M transaction records with 42 fields)
- data/sample_orbis_large.csv (100K Orbis company records)
- data/sample_wc_large.csv (100K WorldCheck screening records)
Create consolidated party data from the three source systems:
    python src/data_synthesizer/generate_party_ref_large.py
This creates:
- data/party_ref_large.csv (consolidated parties from all sources)
Process party reference data to identify and group similar entities:
    python src/run_entity_resolution_optimized.py
This creates:
- data/entity.csv (resolved entities with confidence scores)
For complete functionality with TigerGraph:
    python example_usage.py
The system generates the following categories of features:
- PageRank Score: Entity importance in the transaction network
- Connected Component ID: Community membership
- Degree Centrality: Number of direct connections
- Total Transaction Amount: Sum of all transactions
- Transaction Count: Number of transactions
- Average Transaction Amount: Mean transaction value
- Amount Variance: Variability in transaction amounts
- Suspicious Pattern Score: Detection of structuring patterns
- Unique Currencies/Countries: Geographic and currency diversity
- Direct/Indirect Connections: Network reach
- PEP Connections: Connections to politically exposed persons
- High-Risk Connections: Connections to high-risk entities
- Network Density: Clustering coefficient
- Activity Recency: Days since last transaction
- Transaction Trend: Increasing/decreasing activity
- Recent vs Old Transactions: Temporal distribution
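As an illustration of the transaction-level features listed above (total, count, average, variance, and degree), here is a minimal sketch in plain Python. It is not the project's feature generator, which relies on TigerGraph. The function name `transaction_features`, the `(sender, receiver, amount)` tuple layout, and the choice of population variance are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pvariance


def transaction_features(transactions):
    """Compute per-entity features from (sender, receiver, amount) tuples.

    Hypothetical helper: the tuple layout and feature names are assumptions.
    """
    amounts = defaultdict(list)
    neighbours = defaultdict(set)
    for sender, receiver, amount in transactions:
        for entity in {sender, receiver}:  # a self-transfer is counted once
            amounts[entity].append(amount)
        if sender != receiver:
            neighbours[sender].add(receiver)
            neighbours[receiver].add(sender)
    return {
        entity: {
            "total_amount": sum(values),
            "transaction_count": len(values),
            "average_amount": mean(values),
            # Assumption: population variance, so one transaction gives 0.
            "amount_variance": pvariance(values),
            # Degree here means the number of distinct counterparties.
            "degree": len(neighbours[entity]),
        }
        for entity, values in amounts.items()
    }


sample = [("A", "B", 100.0), ("A", "C", 300.0), ("B", "A", 200.0)]
features = transaction_features(sample)
```

Both incoming and outgoing transactions count toward an entity's totals in this sketch; a directional split would need separate lists per direction.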
The pipeline generates several output files:
- data/entity.csv: Resolved entities with confidence scores and resolved fields
- output/entities_with_features.csv: Complete entity data with all features
- output/features.csv: All generated features
- output/entity_mapping.json: Mapping from party IDs to entity IDs
- output/pipeline_summary.json: Comprehensive pipeline summary
The system processes party data from three source systems and applies the following rules:
- Transaction Data: Extract parties from originator_name, beneficiary_name, TP_originator_name, TP_beneficiary_name
- Orbis Data: Extract parties from company_name
- WorldCheck Data: Extract parties from full_name
Each party record contains:
- party_id: Unique identifier
- name, email, phone, address, country: Contact information (optional fields)
- accounts_list: Account numbers (only for transaction source)
- source_system: Origin system (trnx, orbis, WC)
- source_index_list: References to original source records
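The extraction rules and party record layout above can be sketched roughly as follows. The name columns and field names come from the lists above. The `Party` dataclass, the `extract_parties` helper, the ID scheme, and grouping by exact name are hypothetical choices for illustration, and rows are assumed to be plain dicts with missing values given as `None` or an empty string.

```python
from dataclasses import dataclass, field
from typing import Optional

# Name columns per source system, as listed above.
NAME_COLUMNS = {
    "trnx": ["originator_name", "beneficiary_name",
             "TP_originator_name", "TP_beneficiary_name"],
    "orbis": ["company_name"],
    "WC": ["full_name"],
}


@dataclass
class Party:
    party_id: str
    name: str
    source_system: str
    email: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None
    country: Optional[str] = None
    accounts_list: list = field(default_factory=list)  # trnx source only
    source_index_list: list = field(default_factory=list)


def extract_parties(rows, source_system):
    """Build one Party per distinct name found in the source's name columns."""
    parties = {}
    for index, row in enumerate(rows):
        for column in NAME_COLUMNS[source_system]:
            name = row.get(column)
            if not name:
                continue  # skip empty or missing names
            party = parties.get(name)
            if party is None:
                # Hypothetical ID scheme: source prefix plus running counter.
                party_id = f"{source_system}_{len(parties) + 1}"
                party = Party(party_id, name, source_system)
                parties[name] = party
            if index not in party.source_index_list:
                party.source_index_list.append(index)
    return list(parties.values())


rows = [
    {"originator_name": "Alice Tan", "beneficiary_name": "Bob Lee"},
    {"originator_name": "Alice Tan", "beneficiary_name": ""},
]
trnx_parties = extract_parties(rows, "trnx")
```

Here `source_index_list` records which input rows mentioned each party, so the party can be traced back to its original records.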
- Names: Remove extra spaces and punctuation, convert to lowercase
- Emails: Convert to lowercase, handle NaN values
- Phones: Remove all non-digit characters, handle NaN values
- Addresses: Normalize spaces and convert to lowercase, handle NaN values
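A minimal sketch of these normalization rules, using only the standard library. The function names are hypothetical. Returning an empty string for missing values and trimming whitespace around emails are assumptions of this example, not confirmed project behavior.

```python
import math
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def is_missing(value):
    """True for None and float NaN (as a CSV reader such as pandas may yield)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def normalize_name(value):
    """Strip ASCII punctuation, collapse whitespace, lowercase."""
    if is_missing(value):
        return ""
    return " ".join(str(value).translate(_PUNCT_TABLE).lower().split())


def normalize_email(value):
    """Lowercase; also trims surrounding whitespace (an addition here)."""
    return "" if is_missing(value) else str(value).strip().lower()


def normalize_phone(value):
    """Keep digits only."""
    return "" if is_missing(value) else re.sub(r"\D", "", str(value))


def normalize_address(value):
    """Collapse whitespace and lowercase."""
    return "" if is_missing(value) else " ".join(str(value).lower().split())
```

For example, `normalize_name("  J.  Smith,  Ltd. ")` gives `"j smith ltd"` and `normalize_phone("+65 (9123) 4567")` gives `"6591234567"`.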
For each pair of parties, the following fields are compared with weighted scoring:
- Name: Fuzzy matching (40% weight)
- Email: Exact match gets 1.0, otherwise fuzzy matching (30% weight)
- Phone: Exact match gets 1.0, otherwise fuzzy matching (20% weight)
- Address: Fuzzy matching (10% weight)
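The weighted comparison above might look like the sketch below. The weights are the ones listed. Two things are assumptions: `difflib.SequenceMatcher` stands in for whichever fuzzy matcher the project actually uses, and fields missing on either side are skipped with the remaining weights renormalized. The function names are hypothetical.

```python
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.4, "email": 0.3, "phone": 0.2, "address": 0.1}
EXACT_FIRST = {"email", "phone"}  # exact match scores 1.0 before fuzzy matching


def fuzzy_ratio(a, b):
    """Similarity in [0, 1]; a stand-in for the project's fuzzy matcher."""
    return SequenceMatcher(None, a, b).ratio()


def party_similarity(p, q):
    """Weighted similarity of two normalized party dicts."""
    total = 0.0
    weight_used = 0.0
    for field_name, weight in WEIGHTS.items():
        a = p.get(field_name) or ""
        b = q.get(field_name) or ""
        if not a or not b:
            continue  # assumption: a missing field is skipped, not penalized
        if field_name in EXACT_FIRST and a == b:
            score = 1.0
        else:
            score = fuzzy_ratio(a, b)
        total += weight * score
        weight_used += weight
    # Renormalize over the fields that were actually compared.
    return total / weight_used if weight_used else 0.0
```

Renormalizing means two name-only records can still reach the clustering threshold. If missing fields scored zero instead, a name-only pair could never exceed 0.4.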
- Clustering: Group parties with similarity ≥ 0.7 into entities
- Cross-source Logic: Each entity represents one real-world entity from one source system
- Same entity across sources: Creates separate entity records (no cross-source deduplication)
- Within-source Deduplication: Aggregates multiple references to the same entity within a source
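One way to realize the clustering step is a union-find over all pairs within the same source system, as sketched below. The 0.7 threshold and the no-cross-source rule come from the list above. Treating matches as transitive (single linkage) and comparing every pair are assumptions of this sketch; all-pairs comparison would not scale to large inputs without some form of blocking. `cluster_parties` is a hypothetical name, and the similarity function is passed in as a parameter.

```python
from itertools import combinations

SIMILARITY_THRESHOLD = 0.7


def cluster_parties(parties, similarity):
    """Group party IDs whose similarity is >= threshold, per source system."""
    parent = list(range(len(parties)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(parties)), 2):
        if parties[i]["source_system"] != parties[j]["source_system"]:
            continue  # no cross-source deduplication
        if similarity(parties[i], parties[j]) >= SIMILARITY_THRESHOLD:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, party in enumerate(parties):
        clusters.setdefault(find(i), []).append(party["party_id"])
    return list(clusters.values())


demo_parties = [
    {"party_id": "p1", "name": "acme ltd", "source_system": "orbis"},
    {"party_id": "p2", "name": "acme ltd", "source_system": "orbis"},
    {"party_id": "p3", "name": "acme ltd", "source_system": "WC"},
    {"party_id": "p4", "name": "zeta corp", "source_system": "orbis"},
]
# Toy similarity for the demo: 1.0 on identical names, else 0.0.
demo_clusters = cluster_parties(
    demo_parties, lambda a, b: 1.0 if a["name"] == b["name"] else 0.0
)
```

In the demo, `p1` and `p2` merge, while `p3` stays separate despite the identical name because it comes from a different source.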
Each resolved entity contains:
- entity_id: Unique entity identifier
- party_ids: List of party IDs belonging to this entity
- confidence_score: Resolution confidence (0.0-1.0)
- resolved_name, resolved_email, resolved_phone, resolved_address, resolved_country: Best available values
- source_systems: List of source systems this entity appears in
- records: Complete party data as JSON array
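To make the entity record concrete, the sketch below assembles one from a cluster of party dicts. The source does not say how "best available" values are chosen, so picking the most frequent non-empty value is purely an assumption. `confidence_score` is left out because its formula is not described. `resolve_entity` is a hypothetical helper, and missing values are assumed to be `None` or empty strings.

```python
from collections import Counter

RESOLVED_FIELDS = ["name", "email", "phone", "address", "country"]


def resolve_entity(entity_id, parties):
    """Assemble an entity record from the party dicts in one cluster."""
    entity = {
        "entity_id": entity_id,
        "party_ids": [p["party_id"] for p in parties],
        "source_systems": sorted({p["source_system"] for p in parties}),
        "records": parties,  # would be serialized as a JSON array on output
    }
    for field_name in RESOLVED_FIELDS:
        values = [p[field_name] for p in parties if p.get(field_name)]
        # Assumption: "best available" = most frequent non-empty value.
        best = Counter(values).most_common(1)[0][0] if values else None
        entity[f"resolved_{field_name}"] = best
    return entity


cluster = [
    {"party_id": "p1", "source_system": "trnx", "name": "john smith", "email": None},
    {"party_id": "p2", "source_system": "trnx", "name": "john smith",
     "email": "js@example.com"},
    {"party_id": "p3", "source_system": "trnx", "name": "jon smith"},
]
entity = resolve_entity("E1", cluster)
```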
Among the available graph databases, I chose TigerGraph for scalability reasons. See the comparison of graph databases for more information.
This is a personal research project, and there is no support or licensing.