Horizontal Scaling

Nick edited this page Nov 21, 2025 · 1 revision

Scale PATAS to 10M+ messages/day using multiple instances.

Current Limitation

Distributed locks prevent concurrent processing of the same dataset. You cannot simply run 10 instances against the same dataset: they will all contend for the same lock and block each other.

Solution: shard the data across instances.

  • Split data into N shards (by message_id or timestamp)
  • Each instance processes its own shard under a unique lock key
  • Merge results after processing

Data Sharding

Approach 1: By message_id

# Instance 1: message_id 1-1,000,000
patas mine-patterns --days=7 --shard-id=1 --total-shards=10

# Instance 2: message_id 1,000,001-2,000,000
patas mine-patterns --days=7 --shard-id=2 --total-shards=10

Lock key for shard 1: pattern_mining:7:10:shard:1 (each shard gets its own key, so instances do not block each other)
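
To make the mapping concrete, here is a minimal Python sketch of how a shard's message_id range and lock key could be derived. It is not the PATAS implementation. The function names, the fixed shard size of 1,000,000 IDs, and the reading of the key as pattern_mining:<days>:<total_shards>:shard:<shard_id> are assumptions inferred from the examples above.

```python
def shard_range(shard_id: int, shard_size: int = 1_000_000) -> tuple[int, int]:
    """Return the inclusive message_id range for a 1-based shard_id.

    shard_size is an assumption taken from the example ranges above.
    """
    if shard_id < 1:
        raise ValueError("shard_id is 1-based")
    start = (shard_id - 1) * shard_size + 1
    end = shard_id * shard_size
    return start, end


def lock_key(days: int, total_shards: int, shard_id: int) -> str:
    """Build a per-shard lock key.

    The field order (days, total shards, shard id) is inferred from the
    example key and may not match the real key layout.
    """
    return f"pattern_mining:{days}:{total_shards}:shard:{shard_id}"


if __name__ == "__main__":
    for sid in (1, 2):
        print(sid, shard_range(sid), lock_key(7, 10, sid))
```

Because every shard produces a different key, two instances only contend for a lock if they are started with the same shard ID.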

Approach 2: By timestamp

Split data into time windows:

  • Instance 1: days 1-2
  • Instance 2: days 3-4
  • etc.
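
The time-window split can be sketched as follows. This is an illustrative helper, not part of the PATAS CLI. The name day_windows and the choice to give earlier instances the extra day when the split is uneven are assumptions.

```python
def day_windows(total_days: int, instances: int) -> list[tuple[int, int]]:
    """Split days 1..total_days into contiguous inclusive windows.

    Earlier instances receive one extra day when total_days is not
    evenly divisible. Instances that would get no days are omitted.
    """
    if total_days < 1 or instances < 1:
        raise ValueError("total_days and instances must be positive")
    base, extra = divmod(total_days, instances)
    windows = []
    start = 1
    for i in range(instances):
        length = base + (1 if i < extra else 0)
        if length == 0:
            break
        windows.append((start, start + length - 1))
        start += length
    return windows


if __name__ == "__main__":
    # A 7-day lookback split across 4 instances.
    for instance, (first, last) in enumerate(day_windows(7, 4), start=1):
        print(f"Instance {instance}: days {first}-{last}")
```

The windows are contiguous and non-overlapping, so each message falls into exactly one instance's window.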

Requirements for 10M Messages/Day

Option 1: Incremental Mining (Recommended)

Process only new messages (~1.4M/day):

  • CPU: 4-8 vCPU
  • RAM: 16-32 GB
  • Disk: 200 GB SSD
  • PostgreSQL: 32-64 GB RAM
  • Redis: 4-8 GB RAM
  • Time: ~10 hours

Option 2: 10 Instances (Parallel)

Per instance:

  • CPU: 4-8 vCPU
  • RAM: 16-32 GB

Shared infrastructure:

  • PostgreSQL: 64-128 GB RAM
  • Redis: 8-16 GB RAM

Time: ~7 hours (parallel)

Quality Impact

Sharding maintains quality if results are properly merged:

  • Deduplicate patterns across shards
  • Use global metrics for quality tiers
  • Set a low min_spam_count per shard so that rare patterns split across shards are not dropped before merging
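
A minimal sketch of such a merge step is shown below. The record layout (pattern, spam_count, ham_count), the function name, and the precision metric are hypothetical and chosen only for illustration. PATAS may store and score patterns differently.

```python
from collections import defaultdict


def merge_shard_results(shard_results, min_spam_count=1):
    """Merge per-shard pattern lists into one deduplicated list.

    shard_results: iterable of lists of dicts with the hypothetical keys
    "pattern", "spam_count" and "ham_count".
    Counts are summed across shards first, and the threshold and the
    precision metric are applied to the global totals afterwards.
    """
    totals = defaultdict(lambda: {"spam_count": 0, "ham_count": 0})
    for patterns in shard_results:
        for p in patterns:
            entry = totals[p["pattern"]]
            entry["spam_count"] += p["spam_count"]
            entry["ham_count"] += p["ham_count"]

    merged = []
    for pattern, counts in totals.items():
        if counts["spam_count"] < min_spam_count:
            continue
        seen = counts["spam_count"] + counts["ham_count"]
        precision = counts["spam_count"] / seen if seen else 0.0
        merged.append({"pattern": pattern, **counts, "precision": precision})
    return sorted(merged, key=lambda m: m["pattern"])


if __name__ == "__main__":
    shard_1 = [{"pattern": "free crypto", "spam_count": 2, "ham_count": 0}]
    shard_2 = [{"pattern": "free crypto", "spam_count": 2, "ham_count": 1}]
    print(merge_shard_results([shard_1, shard_2], min_spam_count=3))
```

In the usage example the pattern has only 2 spam hits in each shard, so a per-shard threshold of 3 would have discarded it. Summing first and thresholding on the global total keeps it.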

Roadmap

P1 (after a successful pilot):

  • Automatic sharding in CLI/API
  • Result merging and deduplication
  • Database partitioning

P2 (for 100M+ messages):

  • Sharded evaluation
  • Read replicas for evaluation queries
