fix(#30): detect Cloudflare challenge pages and retry with premium_proxy by Jing-yilin · Pull Request #31 · ReScienceLab/KickWatch

Jing-yilin · 2026-02-27T13:27:51Z

Problem

Closes #30

When Cloudflare serves a JS challenge interstitial page, ScrapingBee returns HTTP 200 with a small challenge HTML (no real project content). The existing retry logic only triggers on HTTP 429/5xx, so challenge pages passed through silently with 0 parsed campaigns. The nightly crawl sanity check then failed with:

ERROR: crawl sanity check FAILED — only 0 distinct campaigns seen (expected >=50)

Root Cause

CF challenge pages are small (<200KB) and contain challenge-platform/scripts/jsd/main.js. Real Kickstarter discover pages are 500KB+ and contain data-project elements. The two are easy to distinguish by size + marker.

Fix

scrapingbee_client.go

Added isCFChallengePage(html string) bool — detects challenge pages by the presence of CF's challenge-platform script in a small (<200KB) response
In doRequest, if a 200 response is a CF challenge, it's treated as a retryable error (logged + continue). The existing retry loop will escalate to premium_proxy=true on the 4th attempt
Added response size to success log for easier debugging

kickstarter_parser.go

Added warning log when 0 campaigns are parsed from a large (>50KB) page — surfaces HTML structure changes early

scrapingbee_client_test.go

5 unit tests for isCFChallengePage covering: a pure challenge page, a large real page with the CF script injected, a normal page, an empty response, and a small non-CF page

When Cloudflare serves a JS challenge interstitial (HTTP 200 but no real content), the existing retry logic was never triggered because the status was 200. The crawl would silently return 0 campaigns and fail the sanity check.

Fix: isCFChallengePage() detects pages that contain the CF challenge platform script AND are small (<200KB). Real Kickstarter pages are 500KB+. Detected challenge pages are treated as retryable errors, which will eventually escalate to premium_proxy on the 4th attempt.

Also adds a parser-level warning log when 0 campaigns are found in a large page, to surface structural changes earlier. Adds unit tests for isCFChallengePage.

Jing-yilin

Finding:

Medium: backend/internal/service/kickstarter_parser.go:49

This new warning is in the shared parser, so it now fires for every caller that legitimately gets zero campaigns back, not just the nightly crawl. Search() can return empty results for a user query, and the crawl itself uses len(campaigns) == 0 as the normal pagination stop condition. On any of those pages that happen to be larger than 50 KB, we will now log "possible HTML structure change or blocked response" for an expected outcome, which dilutes the signal this change is trying to add. I would move this warning to the crawl call site, or gate it on a stronger blocked-response signal, instead of firing on any large 0-result page.

Codex review (#31) flagged the parser-level warning as too noisy: it would fire on legitimate empty search results and normal pagination stop pages. Moved it to DiscoverCampaigns only, where it reflects an actual crawl anomaly, not a user query.

Jing-yilin commented Feb 27, 2026


Jing-yilin merged commit aa9ad5e into develop Feb 27, 2026
2 checks passed

Jing-yilin deleted the fix/30-crawler-selector branch February 27, 2026 13:38

Jing-yilin mentioned this pull request Feb 27, 2026

fix: add missing velocity_24h and ple_delta_24h columns to campaigns table #33

Merged

Jing-yilin merged 2 commits into develop from fix/30-crawler-selector
