Context
The spider occasionally fails on network timeouts or rate-limit responses (429) from source sites. These failures are silent: the spider just moves on, and some opportunities are missed. Scrapy ships built-in retry middleware and an AutoThrottle extension, but neither is configured in this project.
Task
Enable and tune Scrapy's built-in middleware in scoutbot/settings.py:
1. Retry middleware
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
2. AutoThrottle (politeness)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
3. Respect robots.txt
(Check if any current sources would be blocked before enabling)
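In Scrapy the relevant setting is ROBOTSTXT_OBEY = True. One possible way to run the pre-check is sketched below using the standard library's urllib.robotparser. The robots.txt body, the example.org URLs, and the helper name blocked_urls are all hypothetical placeholders; the real source list lives in the spider, and the user agent should match whatever the project sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical source URLs; replace with the spider's real start URLs.
candidate_urls = [
    "https://example.org/opportunities/",
    "https://example.org/private/grants",
]

for url in blocked_urls(SAMPLE_ROBOTS_TXT, candidate_urls):
    print("would be blocked:", url)
```

To check a live site instead of a pasted file, RobotFileParser also offers set_url() and read(), which fetch the site's robots.txt before can_fetch() is called.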
Files to touch
scoutbot/settings.py
Notes
- Test by running scrapy crawl opportunities locally and checking for 429 or timeout errors
- AutoThrottle automatically slows down when servers are responding slowly
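
To make the AutoThrottle note concrete, here is a simplified sketch of the delay adjustment as the Scrapy documentation describes it: the target delay is latency divided by the target concurrency, the next delay is the average of the current delay and that target (never below the target), and the result is clamped to the configured bounds. The function next_delay is a hypothetical standalone helper, not part of Scrapy's API, and min_delay stands in for DOWNLOAD_DELAY, which is assumed to be 0 here.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=2.0, min_delay=0.0, max_delay=10.0):
    """Approximate AutoThrottle's per-response delay update (illustrative only)."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway toward the target, but never settle below it.
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp to the configured bounds (AUTOTHROTTLE_MAX_DELAY at the top).
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses are not allowed to decrease the delay.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


# Start from AUTOTHROTTLE_START_DELAY = 1 and feed in some made-up latencies.
delay = 1.0
for latency, status in [(0.4, 200), (6.0, 200), (0.4, 429), (40.0, 200)]:
    delay = next_delay(delay, latency, status)
    print(f"latency={latency}s status={status} -> delay={delay:.2f}s")
```

With the values proposed above, a slow server (high latency) pushes the delay up toward the 10 second cap, while a fast 429 response leaves the delay where it was instead of lowering it.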