fix(#30): detect Cloudflare challenge pages and retry with premium_proxy #31

Merged
Jing-yilin merged 2 commits into develop from fix/30-crawler-selector
Feb 27, 2026

@Jing-yilin

Problem

Closes #30

When Cloudflare serves a JS challenge interstitial, ScrapingBee returns HTTP 200 with a small challenge HTML page that contains no real project content. The existing retry logic only triggers on HTTP 429/5xx, so challenge pages passed through silently and parsed to 0 campaigns. The nightly crawl sanity check then failed with:

ERROR: crawl sanity check FAILED — only 0 distinct campaigns seen (expected >=50)

Root Cause

CF challenge pages are small (<200KB) and contain challenge-platform/scripts/jsd/main.js. Real Kickstarter discover pages are 500KB+ and contain data-project elements. The two are easy to tell apart by combining response size with the challenge-script marker.
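
A minimal sketch of the size-plus-marker check described here. The function name matches the PR, but the body, the constant names, and the reading of "<200KB" as 200 KiB are assumptions, not the merged code:

```go
package main

import (
	"fmt"
	"strings"
)

// Values come from the PR description; the constant names are hypothetical.
const (
	cfChallengeMarker  = "challenge-platform/scripts/jsd/main.js"
	cfChallengeMaxSize = 200 * 1024 // "<200KB" read as 200 KiB (assumption)
)

// isCFChallengePage reports whether html looks like a Cloudflare challenge
// interstitial: it must be small AND contain the challenge-platform script.
func isCFChallengePage(html string) bool {
	if len(html) >= cfChallengeMaxSize {
		return false // large pages are treated as real content
	}
	return strings.Contains(html, cfChallengeMarker)
}

func main() {
	// Illustrative payloads only, not captured responses.
	challenge := `<html><script src="/cdn-cgi/challenge-platform/scripts/jsd/main.js"></script></html>`
	realPage := strings.Repeat(`<div data-project="{}"></div>`, 20000) + challenge

	fmt.Println(isCFChallengePage(challenge)) // small page with marker
	fmt.Println(isCFChallengePage(realPage))  // large page with marker
	fmt.Println(isCFChallengePage(""))        // empty body
}
```

Requiring both conditions means a full-size page that happens to embed the Cloudflare script is not misclassified as a challenge.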

Fix

scrapingbee_client.go

  • Added isCFChallengePage(html string) bool, which detects challenge pages by the presence of CF's challenge-platform script in a small (<200KB) response
  • In doRequest, a 200 response that is a CF challenge is now treated as a retryable error: it is logged and the loop continues. The existing retry loop escalates to premium_proxy=true on the 4th attempt
  • Added the response size to the success log for easier debugging
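
The retry behaviour in these bullets can be sketched roughly as follows. This is not the merged doRequest: the fetchFunc type, the maxAttempts constant, and the error messages are hypothetical stand-ins, the real HTTP call to ScrapingBee is replaced by an injected function, and any backoff between attempts is omitted:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// Assumption: premium_proxy is switched on for the last (4th) attempt.
const maxAttempts = 4

// fetchFunc stands in for the real ScrapingBee call (hypothetical signature).
type fetchFunc func(premiumProxy bool) (status int, body string, err error)

func isCFChallengePage(html string) bool {
	return len(html) < 200*1024 &&
		strings.Contains(html, "challenge-platform/scripts/jsd/main.js")
}

// doRequest retries on transport errors, 429/5xx, and 200 responses that
// turn out to be Cloudflare challenge pages.
func doRequest(fetch fetchFunc) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		premium := attempt == maxAttempts
		status, body, err := fetch(premium)
		switch {
		case err != nil:
			lastErr = err
		case status == http.StatusTooManyRequests || status >= 500:
			lastErr = fmt.Errorf("retryable status %d", status)
		case status == http.StatusOK && isCFChallengePage(body):
			lastErr = errors.New("cloudflare challenge page")
		case status == http.StatusOK:
			fmt.Printf("attempt %d ok, %d bytes\n", attempt, len(body))
			return body, nil
		default:
			return "", fmt.Errorf("non-retryable status %d", status)
		}
		fmt.Printf("attempt %d failed (premium_proxy=%t): %v\n", attempt, premium, lastErr)
	}
	return "", fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

func main() {
	// Fake fetcher: the standard proxy always gets the challenge page.
	fetch := func(premium bool) (int, string, error) {
		if !premium {
			return 200, `<script src="/cdn-cgi/challenge-platform/scripts/jsd/main.js"></script>`, nil
		}
		return 200, `<div data-project></div>`, nil
	}
	body, err := doRequest(fetch)
	fmt.Println(body, err)
}
```

The key point is that the challenge check sits on the 200 branch, so a challenge page feeds the same retry path that 429/5xx responses already used.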

kickstarter_parser.go

  • Added a warning log when 0 campaigns are parsed from a large (>50KB) page, to surface HTML structure changes early

scrapingbee_client_test.go

  • 5 unit tests covering: a pure challenge page, a large real page with the CF script injected, a normal page, an empty body, and a small non-CF page
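
As an illustration, the five scenarios could be laid out as a case table like the one below. The payloads, the challengeCase type, and the simplified isCFChallengePage body are invented for this sketch. In the actual test file these would presumably be a table-driven test using the testing package; here they run from main so the example stays self-contained:

```go
package main

import (
	"fmt"
	"strings"
)

const marker = "challenge-platform/scripts/jsd/main.js"

// Simplified stand-in for the function under test (assumed behaviour).
func isCFChallengePage(html string) bool {
	return len(html) < 200*1024 && strings.Contains(html, marker)
}

type challengeCase struct {
	name string
	html string
	want bool
}

// challengeCases mirrors the five listed scenarios; the payloads are made up.
func challengeCases() []challengeCase {
	script := `<script src="/cdn-cgi/` + marker + `"></script>`
	large := strings.Repeat(`<div data-project="{}"></div>`, 20000) // well over 500KB
	return []challengeCase{
		{"pure challenge", "<html>" + script + "</html>", true},
		{"large real page with CF script injected", large + script, false},
		{"normal page", large, false},
		{"empty", "", false},
		{"small non-CF page", "<html><body>hello</body></html>", false},
	}
}

func main() {
	for _, tc := range challengeCases() {
		got := isCFChallengePage(tc.html)
		status := "ok"
		if got != tc.want {
			status = "FAIL"
		}
		fmt.Printf("%-42s got=%t want=%t %s\n", tc.name, got, tc.want, status)
	}
}
```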

When Cloudflare serves a JS challenge interstitial (HTTP 200 but no
real content), the existing retry logic was never triggered because
status was 200. Crawl would silently return 0 campaigns and fail the
sanity check.

Fix: isCFChallengePage() detects pages that contain the CF challenge
platform script AND are small (<200KB). Real Kickstarter pages are
500KB+. Detected challenge pages are treated as retryable errors,
which will eventually escalate to premium_proxy on the 4th attempt.

Also adds a parser-level warning log when 0 campaigns are found in a
large page, to surface structural changes earlier.

Adds unit tests for isCFChallengePage.

@Jing-yilin Jing-yilin left a comment

Finding:

  • Medium: backend/internal/service/kickstarter_parser.go:49. This new warning is in the shared parser, so it now fires for every caller that legitimately gets zero campaigns back, not just the nightly crawl. Search() can return empty results for a user query, and the crawl itself uses len(campaigns) == 0 as the normal pagination stop condition. On any of those pages that happen to be larger than 50 KB, we will now log "possible HTML structure change or blocked response" for an expected outcome, which dilutes the signal this change is trying to add. I would move this warning to the crawl call site, or gate it on a stronger blocked-response signal rather than on any large 0-result page.

Codex review (#31) flagged the parser-level warning as too noisy:
it would fire on legitimate empty search results and normal
pagination stop pages. Moved it to DiscoverCampaigns only, where
it reflects an actual crawl anomaly, not a user query.
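
A rough sketch of what a crawl-side warning might look like. Everything here is hypothetical except the >50KB threshold and the wording of the log message quoted in the review: the page type, the lower-case discoverCampaigns function, and the exact gating are assumptions, and the merged DiscoverCampaigns may differ:

```go
package main

import "log"

const largePageBytes = 50 * 1024 // ">50KB" threshold from the PR description

// shouldWarnEmptyPage reports whether a page that parsed to zero campaigns
// is large enough to be suspicious.
func shouldWarnEmptyPage(htmlSize, campaigns int) bool {
	return campaigns == 0 && htmlSize > largePageBytes
}

// page is a hypothetical stand-in for one fetched-and-parsed discover page.
type page struct{ htmlSize, campaigns int }

// discoverCampaigns walks pages until one parses to zero campaigns, which is
// the normal pagination stop. It warns only when that empty page is large.
// Search-style callers never reach this code, so they never see the warning.
func discoverCampaigns(pages []page) (total, warnings int) {
	for i, p := range pages {
		if p.campaigns == 0 {
			if shouldWarnEmptyPage(p.htmlSize, p.campaigns) {
				log.Printf("WARN: page %d: 0 campaigns parsed from %d bytes; possible HTML structure change or blocked response",
					i+1, p.htmlSize)
				warnings++
			}
			break
		}
		total += p.campaigns
	}
	return total, warnings
}

func main() {
	// Two full pages followed by a small empty page: a normal stop, no warning.
	total, warnings := discoverCampaigns([]page{{600_000, 12}, {580_000, 12}, {8_000, 0}})
	log.Printf("total=%d warnings=%d", total, warnings)
}
```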
@Jing-yilin Jing-yilin merged commit aa9ad5e into develop Feb 27, 2026
2 checks passed
@Jing-yilin Jing-yilin deleted the fix/30-crawler-selector branch February 27, 2026 13:38