Description
The crawler currently uses a single global sleep (CRAWL_DELAY_SECONDS) between requests. If we scale to multiple threads or crawl one domain heavily, we risk overloading that domain's server.
We need per-domain rate limiting.
Tasks
- If now - last_accessed < MIN_DELAY, sleep or skip to a URL from a different domain.
Acceptance Criteria
- Crawler never hits the same domain more than once every N seconds (configurable).
- Global throughput can increase (by crawling distinct domains in parallel) without violating per-domain limits.
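A minimal sketch of the per-domain check, assuming Python and an in-memory map of last-access times. The names DomainRateLimiter, try_acquire, and pick_next_url are hypothetical and not part of the existing crawler. The clock is injectable so the logic can be exercised without real sleeps.

```python
import threading
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Hypothetical helper: enforces a minimum delay between hits to one domain."""

    def __init__(self, min_delay, clock=time.monotonic):
        self.min_delay = min_delay      # MIN_DELAY in seconds (the configurable N)
        self._clock = clock             # injectable clock, monotonic by default
        self._last_accessed = {}        # domain -> time of the last granted request
        self._lock = threading.Lock()   # guards the map across crawler threads

    def try_acquire(self, url):
        """Atomically claim the URL's domain; return False if it was hit too recently."""
        domain = urlparse(url).netloc.lower()
        with self._lock:
            now = self._clock()
            last = self._last_accessed.get(domain)
            if last is not None and now - last < self.min_delay:
                return False
            self._last_accessed[domain] = now
            return True


def pick_next_url(limiter, pending):
    """Pop and return the first pending URL whose domain is ready, else None.

    Assumes the caller owns `pending` (a list) or guards it with its own lock.
    """
    for i, url in enumerate(pending):
        if limiter.try_acquire(url):
            return pending.pop(i)
    return None
```

When pick_next_url returns None, every pending domain is still inside its delay window, so the worker can sleep briefly and retry. Because the check and the timestamp update happen under one lock, two threads cannot both claim the same domain within the window, while distinct domains proceed in parallel.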