fix(#30): detect Cloudflare challenge pages and retry with premium_proxy #31
Merged
When Cloudflare serves a JS challenge interstitial (HTTP 200 but no real content), the existing retry logic was never triggered because the status was 200. The crawl would silently return 0 campaigns and fail the sanity check.
Fix: `isCFChallengePage()` detects pages that contain the CF challenge-platform script AND are small (<200KB). Real Kickstarter pages are 500KB+. Detected challenge pages are treated as retryable errors, so repeated challenges escalate to `premium_proxy` on the 4th attempt.
Also adds a parser-level warning log when 0 campaigns are found in a large page, to surface structural changes earlier. Adds unit tests for `isCFChallengePage`.
Jing-yilin
commented
Feb 27, 2026
Contributor
Author
Jing-yilin
left a comment
Finding:
- Medium: `backend/internal/service/kickstarter_parser.go:49` This new warning is in the shared parser, so it now fires for every caller that legitimately gets zero campaigns back, not just the nightly crawl. `Search()` can return empty results for a user query, and the crawl itself uses `len(campaigns) == 0` as the normal pagination stop condition. On any of those pages that happen to be larger than 50 KB, we will now log `possible HTML structure change or blocked response` for an expected outcome, which dilutes the signal this change is trying to add. I would move this warning to the crawl call site or gate it on a stronger blocked-response signal instead of any large 0-result page.
Codex review (#31) flagged the parser-level warning as too noisy: it would fire on legitimate empty search results and normal pagination stop pages. Moved it to DiscoverCampaigns only, where it reflects an actual crawl anomaly, not a user query.
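A rough sketch of what moving the warning to the crawl call site could look like is shown here. Everything in it is an assumption for illustration: `Campaign`, `discoverCampaigns`, `shouldWarnEmptyPage`, and the injected `fetch` and `parse` functions are hypothetical stand-ins, and the 50 KB cutoff is taken from the review comment. The point is only that the parser stays silent and the crawl loop decides whether an empty page is worth logging.

```go
package main

import "log"

// Campaign is a hypothetical stand-in for the parsed project type.
type Campaign struct{ Name string }

// largePageBytes mirrors the 50 KB cutoff mentioned in the review
// (assumed here to mean 50 * 1024 bytes).
const largePageBytes = 50 * 1024

// shouldWarnEmptyPage reports whether an empty parse result is suspicious:
// nothing was parsed, yet the page is large enough to have held content.
func shouldWarnEmptyPage(html string, campaigns []Campaign) bool {
	return len(campaigns) == 0 && len(html) > largePageBytes
}

// discoverCampaigns sketches a crawl loop. The parser itself logs nothing,
// so other callers (for example a user search with no hits) stay quiet.
func discoverCampaigns(
	fetch func(page int) (string, error),
	parse func(html string) []Campaign,
) ([]Campaign, error) {
	var all []Campaign
	for page := 1; ; page++ {
		html, err := fetch(page)
		if err != nil {
			return all, err
		}
		campaigns := parse(html)
		if len(campaigns) == 0 {
			// An empty page is the normal pagination stop; warn only
			// when the page is large enough to look anomalous.
			if shouldWarnEmptyPage(html, campaigns) {
				log.Printf("warning: 0 campaigns in %d-byte page %d: possible HTML structure change or blocked response", len(html), page)
			}
			return all, nil
		}
		all = append(all, campaigns...)
	}
}
```

Note that a large final pagination page would still trigger the warning in this sketch, so gating on a stronger blocked-response signal, as the review suggests, remains an option.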
Problem
Closes #30
When Cloudflare serves a JS challenge interstitial page, ScrapingBee returns HTTP 200 with a small challenge HTML (no real project content). The existing retry logic only triggers on HTTP 429/5xx, so challenge pages passed through silently with 0 parsed campaigns. The nightly crawl sanity check then failed with:
Root Cause
CF challenge pages are small (<200KB) and contain `challenge-platform/scripts/jsd/main.js`. Real Kickstarter discover pages are 500KB+ and contain `data-project` elements. The two are easy to distinguish by size + marker.
Fix
- `scrapingbee_client.go`
  - `isCFChallengePage(html string) bool`: detects challenge pages by the presence of CF's challenge-platform script in a small (<200KB) response
  - `doRequest`: if a 200 response is a CF challenge, it's treated as a retryable error (logged + `continue`). The existing retry loop will escalate to `premium_proxy=true` on the 4th attempt
- `kickstarter_parser.go`
- `scrapingbee_client_test.go`