Implement Domain-Level Rate Limiting (Politeness) #27

@Digvijay-x1

Description

The crawler currently uses a global sleep (CRAWL_DELAY_SECONDS) between requests. If we scale to multiple threads or crawl a specific domain heavily, we risk overloading that server.

We need per-domain rate limiting.

Tasks

  • Redis: Use Redis to store the "last accessed time" for each domain.
  • Crawler (C++): Check Redis before fetching. If now - last_accessed < MIN_DELAY, sleep or skip to a URL from a different domain.
  • Queue: Ideally, prioritize URLs from domains we haven't hit recently (requires smarter queue management).
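The check in the Crawler task could look roughly like the sketch below. It is only an illustration: `DomainLimiter`, `try_acquire`, and `wait_time` are hypothetical names, and an in-memory map stands in for Redis so the example is self-contained. With several crawler processes, the same "set only if the domain is not on cooldown" step would need to be atomic in Redis, for example with `SET <key> <value> NX PX <milliseconds>`, which writes the key only if it is absent and lets it expire after the delay.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: remembers the last access time per domain.
// The map is an in-memory stand-in for the Redis store described above.
class DomainLimiter {
public:
    explicit DomainLimiter(std::chrono::milliseconds min_delay)
        : min_delay_(min_delay) {
        assert(min_delay.count() >= 0);
    }

    // Returns true and records `now` if the domain may be fetched.
    // Returns false if the domain was hit less than min_delay ago,
    // in which case the caller can sleep or pick another domain.
    bool try_acquire(const std::string& domain, Clock::time_point now) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = last_access_.find(domain);
        if (it != last_access_.end() && now - it->second < min_delay_) {
            return false;
        }
        last_access_[domain] = now;
        return true;
    }

    // Remaining cooldown for the domain (zero if it can be fetched now).
    std::chrono::milliseconds wait_time(const std::string& domain,
                                        Clock::time_point now) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = last_access_.find(domain);
        if (it == last_access_.end()) {
            return std::chrono::milliseconds(0);
        }
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            now - it->second);
        return elapsed >= min_delay_ ? std::chrono::milliseconds(0)
                                     : min_delay_ - elapsed;
    }

private:
    std::chrono::milliseconds min_delay_;
    mutable std::mutex mu_;
    std::unordered_map<std::string, Clock::time_point> last_access_;
};
```

Passing `now` in explicitly keeps the check and the update in one locked step and makes the behaviour easy to test without real sleeps. A worker would call `try_acquire(domain, Clock::now())` before each fetch and fall back to `wait_time` or to a different domain when it returns false.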

Acceptance Criteria

  • Crawler never hits the same domain more than once every N seconds (configurable).
  • Global throughput can increase (by crawling distinct domains in parallel) without violating per-domain limits.
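One possible way to meet both criteria is to schedule by domain rather than by URL. The sketch below is an assumption about how that could work, not a description of the current queue: `DomainScheduler`, `push`, and `pop` are hypothetical names, time is passed in as milliseconds from an arbitrary epoch, and the class is single-threaded for brevity. Each domain has its own FIFO of URLs, and a min-heap orders domains by the earliest time they may be fetched again, so workers always get a URL from a domain that is off cooldown.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical scheduler. Invariant: a domain has exactly one heap entry
// if and only if it has at least one pending URL.
class DomainScheduler {
public:
    explicit DomainScheduler(Millis min_delay) : min_delay_(min_delay) {
        assert(min_delay.count() >= 0);
    }

    void push(const std::string& domain, std::string url, Millis now) {
        auto& urls = pending_[domain];
        if (urls.empty()) {
            // Domain was idle: schedule it at the earliest permitted time.
            Millis ready = std::max(now, next_allowed_[domain]);
            heap_.emplace(ready, domain);
        }
        urls.push_back(std::move(url));
    }

    // Writes the next fetchable URL to `out` and returns true, or returns
    // false if every pending domain is still on cooldown (or nothing is queued).
    bool pop(Millis now, std::string& out) {
        if (heap_.empty() || heap_.top().first > now) {
            return false;
        }
        std::string domain = heap_.top().second;
        heap_.pop();
        auto& urls = pending_[domain];
        assert(!urls.empty());
        out = std::move(urls.front());
        urls.pop_front();
        next_allowed_[domain] = now + min_delay_;
        if (!urls.empty()) {
            heap_.emplace(next_allowed_[domain], domain);
        }
        return true;
    }

private:
    using Entry = std::pair<Millis, std::string>;
    Millis min_delay_;
    std::unordered_map<std::string, std::deque<std::string>> pending_;
    std::unordered_map<std::string, Millis> next_allowed_;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap_;
};
```

With this layout, two URLs for the same domain are never handed out less than `min_delay` apart, while URLs for other domains can be popped in between. A multi-threaded crawler would still need a lock around `push` and `pop`, or an equivalent structure in Redis such as a sorted set keyed by next-allowed time.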
