Horizontal Scaling
Nick edited this page Nov 21, 2025 · 1 revision
Scale PATAS to 10M+ messages/day using multiple instances.
Distributed locks prevent concurrent processing of the same dataset. You cannot simply run 10 instances on the same dataset: they will block each other.
Solution: shard the data across instances:
- Split data into N shards (by message_id or timestamp)
- Each instance processes its own shard with a unique lock key
- Merge results after processing
# Instance 1: message_id 1-1,000,000
patas mine-patterns --days=7 --shard-id=1 --total-shards=10
# Instance 2: message_id 1,000,001-2,000,000
patas mine-patterns --days=7 --shard-id=2 --total-shards=10
Lock key: pattern_mining:7:10:shard:1 (unique per shard)
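The shard layout and lock key above can be sketched in a few lines of Python. This is a minimal illustration, not PATAS code: the helper names `shard_lock_key` and `shard_id_range` are hypothetical, the key layout (days, total shards, shard id) is inferred from the single example key shown, and fixed-size contiguous message_id ranges are an assumption based on the two example instances.

```python
def shard_lock_key(days: int, total_shards: int, shard_id: int) -> str:
    """Build a per-shard lock key (layout inferred from the example key above)."""
    if not 1 <= shard_id <= total_shards:
        raise ValueError("shard_id must be between 1 and total_shards")
    return f"pattern_mining:{days}:{total_shards}:shard:{shard_id}"


def shard_id_range(shard_id: int, shard_size: int) -> tuple[int, int]:
    """Inclusive message_id range for a 1-based shard of fixed size (assumed layout)."""
    if shard_id < 1 or shard_size < 1:
        raise ValueError("shard_id and shard_size must be positive")
    start = (shard_id - 1) * shard_size + 1
    end = shard_id * shard_size
    return start, end


if __name__ == "__main__":
    # Print the lock key and message_id range for the first two of 10 shards.
    for sid in (1, 2):
        print(sid, shard_lock_key(7, 10, sid), shard_id_range(sid, 1_000_000))
```

Because each shard gets a distinct key, two instances working on different shards never contend for the same lock.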
Alternatively, split the data into time windows:
- Instance 1: days 1-2
- Instance 2: days 3-4
- etc.
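One way to compute such windows is sketched below. The function `day_windows` is a hypothetical helper, not part of the PATAS CLI; it assumes days are numbered from 1 and that windows should be contiguous and as even as possible.

```python
def day_windows(total_days: int, instances: int) -> list[tuple[int, int]]:
    """Split days 1..total_days into contiguous inclusive windows, one per instance."""
    if instances < 1 or total_days < instances:
        raise ValueError("need at least one day per instance")
    base, extra = divmod(total_days, instances)
    windows = []
    start = 1
    for i in range(instances):
        # The first `extra` instances take one additional day each.
        length = base + (1 if i < extra else 0)
        windows.append((start, start + length - 1))
        start += length
    return windows


if __name__ == "__main__":
    for instance, (first, last) in enumerate(day_windows(8, 4), start=1):
        print(f"Instance {instance}: days {first}-{last}")
```

When the number of days does not divide evenly, the earlier instances receive the longer windows, so no day is skipped or processed twice.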
Process only new messages (~1.4M/day):
- CPU: 4-8 vCPU
- RAM: 16-32 GB
- Disk: 200 GB SSD
- PostgreSQL: 32-64 GB RAM
- Redis: 4-8 GB RAM
- Time: ~10 hours
Per instance:
- CPU: 4-8 vCPU
- RAM: 16-32 GB
Shared infrastructure:
- PostgreSQL: 64-128 GB RAM
- Redis: 8-16 GB RAM
Time: ~7 hours (parallel)
Sharding maintains quality if results are properly merged:
- Deduplicate patterns across shards
- Use global metrics for quality tiers
- Set a low min_spam_count to catch rare patterns
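The merge step can be sketched as follows. This is an illustration under stated assumptions, not the PATAS implementation: the record fields (`pattern`, `spam_count`, `ham_count`), the function `merge_shard_patterns`, and the `global_min_spam_count` parameter are all hypothetical. It shows why a low per-shard threshold matters: a pattern that is rare in every shard can still clear the threshold once its counts are summed globally.

```python
from collections import defaultdict


def merge_shard_patterns(shard_results, global_min_spam_count):
    """Deduplicate patterns across shards and recompute metrics on global counts."""
    totals = defaultdict(lambda: {"spam_count": 0, "ham_count": 0})
    for patterns in shard_results:
        for p in patterns:
            entry = totals[p["pattern"]]  # same pattern text -> same entry
            entry["spam_count"] += p["spam_count"]
            entry["ham_count"] += p["ham_count"]

    merged = []
    for pattern, counts in totals.items():
        # Apply the threshold to global counts, not per-shard counts.
        if counts["spam_count"] < global_min_spam_count:
            continue
        total = counts["spam_count"] + counts["ham_count"]
        precision = counts["spam_count"] / total if total else 0.0
        merged.append({"pattern": pattern, **counts, "precision": precision})

    # Most frequent spam patterns first.
    merged.sort(key=lambda m: m["spam_count"], reverse=True)
    return merged


if __name__ == "__main__":
    shard_a = [{"pattern": "buy now", "spam_count": 3, "ham_count": 0}]
    shard_b = [{"pattern": "buy now", "spam_count": 4, "ham_count": 1}]
    print(merge_shard_patterns([shard_a, shard_b], global_min_spam_count=5))
```

In this example neither shard alone reaches a count of 5 for the pattern, but the merged total does, so the pattern is kept and its precision is computed from the combined counts.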
P1 (after successful pilot):
- Automatic sharding in CLI/API
- Result merging and deduplication
- Database partitioning
P2 (for 100M+ messages):
- Sharded evaluation
- Read replicas for evaluation queries