Add spider retry middleware and politeness delays per domain #52

@kamsirichard

Description

Context

The spider occasionally fails on network timeouts or rate-limit responses (429) from source sites. These failures are silent: the spider moves on and some opportunities are missed. Scrapy ships a built-in retry middleware and an AutoThrottle extension, but this project does not configure either explicitly.

Task

Enable and tune Scrapy's built-in middleware in scoutbot/settings.py:

1. Retry middleware

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
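
To make the effect of these values concrete, here is a minimal sketch of the retry decision. It is a simplified model, not Scrapy's actual RetryMiddleware code, and the helper names `should_retry` and `attempts_made` are hypothetical. It relies on RETRY_TIMES counting retries in addition to the first download, so 3 allows up to 4 attempts in total. Note also that Scrapy's default RETRY_HTTP_CODES includes 522 and 524, which the list above would drop.

```python
# Simplified model of the retry decision; NOT Scrapy's real middleware code.
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]


def should_retry(status, retries_so_far):
    """Return True if a response with this status would be retried again."""
    if status not in RETRY_HTTP_CODES:
        return False
    return retries_so_far < RETRY_TIMES


def attempts_made(statuses):
    """Count download attempts for a sequence of statuses a server would return."""
    attempts = 0
    for status in statuses:
        attempts += 1
        # The first attempt has zero retries behind it.
        if not should_retry(status, retries_so_far=attempts - 1):
            break
    return attempts


# A server that always answers 429 is tried once plus RETRY_TIMES retries.
print(attempts_made([429] * 10))
# A transient 503 followed by a 200 needs two attempts.
print(attempts_made([503, 200]))
```

After the last retry fails, the request is dropped, so persistent 429s can still lose opportunities; the retry counts in the crawl stats show how often that happens.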

2. AutoThrottle (politeness)

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
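
As a rough illustration of how these four values interact, the sketch below follows the throttling steps described in Scrapy's AutoThrottle documentation: target delay is latency divided by target concurrency, the next delay is the average of the current and target delays, the result is clamped to the configured bounds, and non-200 responses may not lower the delay. It is a simplified model, not the extension's real code. The function name `next_delay` is hypothetical, and DOWNLOAD_DELAY = 0 is an assumption about this project's settings.

```python
# Simplified model of AutoThrottle's delay update; NOT the extension's real code.
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
DOWNLOAD_DELAY = 0.0  # assumed lower bound; check the project's settings


def next_delay(current_delay, latency, status=200):
    """Return the delay to use after a response with the given latency."""
    target = latency / AUTOTHROTTLE_TARGET_CONCURRENCY
    candidate = (current_delay + target) / 2.0
    # Clamp between the minimum and maximum configured delays.
    candidate = min(max(candidate, DOWNLOAD_DELAY), AUTOTHROTTLE_MAX_DELAY)
    # Error responses often come back fast; they must not speed the crawl up.
    if status != 200 and candidate < current_delay:
        return current_delay
    return candidate


delay = AUTOTHROTTLE_START_DELAY
for latency, status in [(4.0, 200), (0.2, 429), (100.0, 200)]:
    delay = next_delay(delay, latency, status)
    print(f"latency={latency}s status={status} -> delay={delay}s")
```

The 429 case is the one that matters for this issue: a fast rate-limit response leaves the delay unchanged instead of shortening it.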

3. Respect robots.txt

ROBOTSTXT_OBEY = True

(Before enabling, check whether any current source would be blocked by its robots.txt.)
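
One way to run that check offline is with the standard library's robots.txt parser. The sketch below is only an approximation: Scrapy uses its own robots.txt parser, which may differ in edge cases such as wildcards. The helper `blocked_urls`, the `scoutbot` user agent, and the sample URLs and rules are all hypothetical placeholders to replace with the project's real sources.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="scoutbot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and source URLs, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

SAMPLE_URLS = [
    "https://example.org/opportunities",
    "https://example.org/private/opportunities",
]

for url in blocked_urls(SAMPLE_ROBOTS, SAMPLE_URLS):
    print("would be blocked:", url)
```

For a real check, download each source's robots.txt once, pass its text to the helper together with the start URLs for that source, and use the same user agent string that the spider sends.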

Files to touch

  • scoutbot/settings.py

Notes

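  • Test by running scrapy crawl opportunities locally and checking the log and the final crawl stats for 429 responses, timeouts, and retry counts
  • AutoThrottle adjusts the download delay based on response latency, so the crawl slows down automatically when a server responds slowly, up to AUTOTHROTTLE_MAX_DELAY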
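
To make the local test easier to read, the stats that Scrapy dumps at the end of a crawl can be reduced to a few numbers. The sketch below assumes the stat key names `retry/count`, `retry/max_reached`, `downloader/response_status_count/429`, and the `downloader/exception_type_count/` prefix; verify these against the actual output of a run. The helper `summarize_failures` and the sample values are hypothetical.

```python
def summarize_failures(stats):
    """Reduce a Scrapy-style stats dict to retry, 429, and timeout counts."""
    exception_prefix = "downloader/exception_type_count/"
    timeouts = sum(
        count
        for key, count in stats.items()
        if key.startswith(exception_prefix)
        and ("Timeout" in key or "TimedOut" in key)
    )
    return {
        "retries": stats.get("retry/count", 0),
        "gave_up": stats.get("retry/max_reached", 0),
        "http_429": stats.get("downloader/response_status_count/429", 0),
        "timeouts": timeouts,
    }


# Made-up numbers shaped like the stats dump at the end of a crawl.
SAMPLE_STATS = {
    "downloader/response_status_count/200": 120,
    "downloader/response_status_count/429": 6,
    "downloader/exception_type_count/twisted.internet.error.TimeoutError": 2,
    "retry/count": 8,
    "retry/max_reached": 1,
}

print(summarize_failures(SAMPLE_STATS))
```

A non-zero `gave_up` value means some requests still failed after all retries, which is the case where opportunities are lost.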

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), performance (Speed, efficiency, or resource improvements)
