note: PerplexityBot fetchFailed on WAF-fronted sites is expected, not a bug #28

@BraedenBDev

Summary

PerplexityBot fetches time out against sites fronted by aggressive bot WAFs (Cloudflare Bot Fight Mode, DataDome, Akamai Bot Manager), even on canonical hosts that pass every other bot check. This is site-side behavior, not a crawl-sim defect. Documenting it here so the tool's output isn't misread as a tool bug.

Observation

During a live audit of https://almostimpossible.agency on v1.5.0:

  • googlebot, gptbot, claudebot all returned HTTP 200 with identical server HTML
  • perplexitybot timed out twice at 30s — once on the bare domain, again on https://www.almostimpossible.agency/ directly

The canonical-host fix (#27's companion work in v1.5.0) did not cause this, and it's not a DNS or routing issue. It reproduces on the canonical URL with a normal curl:

curl -I -A "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://www.perplexity.ai/bot)" \
  --max-time 30 https://www.almostimpossible.agency/
# hangs / times out

Why this happens

Bot classification at Cloudflare, Akamai, and DataDome effectively sorts crawlers into three tiers, and PerplexityBot is treated differently from Googlebot:

  • Googlebot: verified via reverse-DNS, allowed through by default
  • GPTBot / ClaudeBot: OpenAI and Anthropic publish crawler IP ranges, which major WAFs recognize
  • PerplexityBot: published IP ranges and verification are newer and less widely recognized by WAF rulesets. Several providers hold the connection open (silent-drop) rather than returning a 403, so the client sees a timeout instead of an error status.
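
To make the silent-drop versus explicit-block distinction above concrete, here is a minimal Python sketch. It is not crawl-sim code: `FetchOutcome`, `classify`, and the label strings are hypothetical names, and treating 401/403/429 as the "explicit block" statuses is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchOutcome:
    """Hypothetical record of one bot fetch (not crawl-sim's real schema)."""
    bot_id: str
    status: Optional[int]  # HTTP status, or None if no response ever arrived
    timed_out: bool


def classify(outcome: FetchOutcome) -> str:
    """Label a fetch outcome by how the site responded."""
    # No response at all before the deadline: the WAF held the connection open.
    if outcome.timed_out and outcome.status is None:
        return "silent_drop"
    # The WAF answered, but refused the request outright (assumed status set).
    if outcome.status in (401, 403, 429):
        return "explicit_block"
    if outcome.status is not None and 200 <= outcome.status < 300:
        return "allowed"
    return "other"
```

With the observation above, the Googlebot fetch would classify as "allowed" and the PerplexityBot fetch as "silent_drop", which is the case a plain status-code check cannot see.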

Surfacing these gaps is exactly why crawl-sim exists. It also means PerplexityBot will legitimately time out on many large-agency sites.

What crawl-sim should do

v1.5.1 already does the right thing:

  • bots.perplexitybot.fetchFailed: true with the curl timeout error text preserved
  • bots.perplexitybot.score: 0 / grade: F
  • warnings[] now surfaces a high-severity weighted_bot_fetch_failed entry (added in #27, "bug: overall composite ignores fetchFailed weighted bots")
  • overall.score correctly drops to reflect the failure (no more false 100/A)
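
For anyone consuming the report programmatically, the fields listed above can be read roughly as follows. This is a sketch only: it assumes the report is a JSON object shaped as `bots.<id>.fetchFailed`, `bots.<id>.score`, and `bots.<id>.grade`, as the paths above suggest. The helper name `describe_failures` and the sample values for the passing bot are made up for illustration.

```python
from typing import Any, Dict, List


def describe_failures(report: Dict[str, Any]) -> List[str]:
    """Return one line per bot whose fetch failed, sorted by bot id."""
    lines = []
    for bot_id, result in sorted(report.get("bots", {}).items()):
        if result.get("fetchFailed"):
            lines.append(
                f"{bot_id}: fetch failed, "
                f"score={result.get('score')}, grade={result.get('grade')}"
            )
    return lines


# Illustrative report fragment; the googlebot values are invented.
sample_report = {
    "bots": {
        "googlebot": {"fetchFailed": False, "score": 100, "grade": "A"},
        "perplexitybot": {"fetchFailed": True, "score": 0, "grade": "F"},
    }
}

for line in describe_failures(sample_report):
    print(line)
```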

Suggested follow-up

Not a bug; consider these enhancements:

  • Add a WAF-detection hint when three bots succeed and one times out. Output something like warnings[].code = "likely_waf_bot_block" with the affected bot id.
  • Document the WAF tier-mismatch in README.md or a docs/waf-behavior.md, so users interpret PerplexityBot fetchFailed correctly without opening new bug reports.
  • Optional: add a retry-with-Googlebot-verification-hint phase that attempts one more fetch with the canonical PerplexityBot UA and a longer timeout, so that a transient WAF challenge can be distinguished from persistent blocking.
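
The first suggestion could be sketched like this in Python. It is a rough heuristic under stated assumptions, not a proposed implementation: `waf_block_hints` is a hypothetical name, the `error` key holding the preserved curl error text is assumed (the real field name may differ), and matching on the substring "timed out" is a guess at how to recognize a timeout in that text.

```python
from typing import Any, Dict, List


def waf_block_hints(report: Dict[str, Any], min_successes: int = 3) -> List[Dict[str, str]]:
    """Emit a likely_waf_bot_block hint when exactly one bot times out
    while at least `min_successes` other bots fetched successfully."""
    bots = report.get("bots", {})
    succeeded = [b for b, r in bots.items() if not r.get("fetchFailed")]
    # "error" is an assumed key for the preserved curl error text.
    timed_out = [
        b for b, r in bots.items()
        if r.get("fetchFailed") and "timed out" in str(r.get("error", "")).lower()
    ]
    failed_total = len(bots) - len(succeeded)
    # Require a single failure overall, and that failure must be a timeout;
    # several failing bots point to a site outage rather than a per-bot rule.
    if len(succeeded) >= min_successes and failed_total == 1 and len(timed_out) == 1:
        return [{"code": "likely_waf_bot_block", "bot": timed_out[0]}]
    return []
```

Requiring exactly one failure keeps the hint conservative: a DNS error or a site-wide outage would fail more than one bot or produce a different error text, and no hint is emitted in those cases.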
