Fix scraping - update for modern DownDetector HTML and improve bot detection bypass #16

Merged
aaryanrr merged 7 commits into main from copilot/fix-scraping-issues
Dec 7, 2025

Contributor

Copilot AI commented Dec 6, 2025

DownDetector changed its HTML structure and added bot detection, which broke the single-selector scraper. Before this change, the CLI either failed silently or raised AttributeError.

Changes

Multi-strategy HTML parsing with backward compatibility:

  • Strategy 1: Original div#company > div.h2.entry-title (preserves old structure)
  • Strategy 2: Direct .entry-title lookup on h1/h2/h3/div (modern pages)
  • Strategy 3: Class pattern matching for "status"/"entry-title" with text validation
  • Strategy 4: Keyword-based h1 fallback (problem, outage, reports, etc.)
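The fallback chain above can be sketched roughly as follows. This is a minimal illustration, not the code in src/Scraper.py: it uses the standard-library html.parser as a stand-in for whatever HTML library the project uses, and the names `extract_status`, `STRATEGIES`, `KEYWORDS`, and the value of `MIN_STATUS_TEXT_LENGTH` are assumptions made for the example.

```python
from html.parser import HTMLParser

MIN_STATUS_TEXT_LENGTH = 10  # assumed value; the PR names the constant but not its value
KEYWORDS = ("problem", "outage", "reports")  # subset of the Strategy 4 keywords


class _Collector(HTMLParser):
    """Record every element with its tag, classes, id, parent, and text."""

    def __init__(self):
        super().__init__()
        self.elements = []  # all elements seen, in document order
        self._stack = []    # elements that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = {
            "tag": tag,
            "classes": (attrs.get("class") or "").split(),
            "id": attrs.get("id"),
            "parent": self._stack[-1] if self._stack else None,
            "text": "",
        }
        self.elements.append(element)
        self._stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag (and anything left open inside it).
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every element that is still open.
        for element in self._stack:
            element["text"] += data


def strategy_original(elements):
    """Strategy 1: div#company > div.h2.entry-title."""
    for el in elements:
        parent = el["parent"]
        if (el["tag"] == "div" and {"h2", "entry-title"} <= set(el["classes"])
                and parent and parent["tag"] == "div" and parent["id"] == "company"):
            return el["text"].strip()
    return None


def strategy_entry_title(elements):
    """Strategy 2: .entry-title on h1/h2/h3/div."""
    for el in elements:
        if el["tag"] in ("h1", "h2", "h3", "div") and "entry-title" in el["classes"]:
            return el["text"].strip()
    return None


def strategy_class_pattern(elements):
    """Strategy 3: class names containing "status"/"entry-title", with text validation."""
    candidates = (
        el["text"].strip() for el in elements
        if any("status" in c or "entry-title" in c for c in el["classes"])
    )
    return next((t for t in candidates if len(t) >= MIN_STATUS_TEXT_LENGTH), None)


def strategy_keyword_h1(elements):
    """Strategy 4: any h1 whose text mentions an outage-related keyword."""
    for el in elements:
        if el["tag"] == "h1":
            text = el["text"].strip()
            lowered = text.lower()  # lowercase once, reuse for every keyword
            if any(keyword in lowered for keyword in KEYWORDS):
                return text
    return None


STRATEGIES = (strategy_original, strategy_entry_title,
              strategy_class_pattern, strategy_keyword_h1)


def extract_status(html):
    """Return the first non-empty status text found by any strategy, else None."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    for strategy in STRATEGIES:
        text = strategy(collector.elements)
        if text:
            return text
    return None
```

Ordering the strategies from most to least specific means an old-style page is still matched by the original selector, while the looser fallbacks only run when nothing stricter matched.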

Bot detection bypass:

  • Updated User-Agent to Chrome 120.0.0.0
  • Added comprehensive browser headers (Accept-*, Sec-Fetch-*, DNT, Connection)
  • Session-based requests for cookie persistence
  • Extracted as module constants (USER_AGENT, DEFAULT_HEADERS)
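As a rough sketch of the header and cookie setup described above: the example below substitutes the standard-library urllib and http.cookiejar for the session object the PR describes, so it runs without third-party packages. The full User-Agent string, every header value, and the helper name `build_opener_with_cookies` are assumptions; the PR only states the Chrome version and the header names.

```python
import http.cookiejar
import urllib.request

# Assumed full string; the PR only states "Chrome 120.0.0.0".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Header names follow the PR's list; the values are typical browser values (assumed).
# Accept-Encoding is left out because urllib does not decompress responses itself.
DEFAULT_HEADERS = {
    "User-Agent": USER_AGENT,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "DNT": "1",
    # urllib sends "Connection: close" regardless; this entry only matters
    # for HTTP clients that honor it.
    "Connection": "keep-alive",
}


def build_opener_with_cookies():
    """Return (opener, jar): an opener that sends DEFAULT_HEADERS and keeps cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = list(DEFAULT_HEADERS.items())
    return opener, jar
```

Reusing one opener (or one session) across requests lets cookies set by the first response be sent on later ones, which is the point of the cookie-persistence bullet.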

Error handling improvements:

  • New NetworkError exception separating network failures from invalid services
  • 10-second timeout with raise_for_status() validation
  • Exception handling in Main.py for better UX
  • Improved error messages: they now include the specific service name that failed, give examples of valid service names (e.g., 'facebook', 'twitter', 'instagram'), and direct users to verify the service on downdetector.com
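The split between network failures and invalid service names might look something like the sketch below. It is illustrative only: the exception names come from this PR, but `fetch_html`, `run_cli`, `build_not_found_message`, the injected callables, and the exit codes are hypothetical, and catching OSError stands in for whatever exceptions the project's HTTP client raises.

```python
REQUEST_TIMEOUT_SECONDS = 10  # the PR states a 10-second timeout


class InvalidServiceName(Exception):
    """The page loaded, but no status could be found for the service."""


class NetworkError(Exception):
    """The page could not be fetched (timeout, DNS failure, HTTP error)."""


def build_not_found_message(service):
    """Build the error text shown when no status is found for a service."""
    return (
        f"Unable to find status for '{service}'. "
        "Please verify the service name is correct "
        "(e.g., 'facebook', 'twitter', 'instagram'). "
        "Visit https://downdetector.com to confirm the service exists."
    )


def fetch_html(url, open_url):
    """Call open_url(url, timeout=...) and convert I/O failures to NetworkError."""
    try:
        return open_url(url, timeout=REQUEST_TIMEOUT_SECONDS)
    except OSError as exc:  # covers urllib's URLError/HTTPError and TimeoutError
        raise NetworkError(f"Could not reach {url}: {exc}") from exc


def run_cli(service, fetch, extract):
    """Print the status or a readable error; return an exit code (assumed values)."""
    try:
        status = extract(fetch(service))
        if not status:
            raise InvalidServiceName(build_not_found_message(service))
    except NetworkError as exc:
        print(f"Network error: {exc}")
        return 2
    except InvalidServiceName as exc:
        print(exc)
        return 1
    print(status)
    return 0
```

Keeping the two exception types separate lets the caller tell the user whether to retry later or to check the service name.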

Code quality:

  • Extracted MIN_STATUS_TEXT_LENGTH constant
  • Generator-based iteration in Strategy 3 for early termination
  • Cached .lower() calls to avoid redundant operations
  • Removed obsolete AttributeError handling
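To illustrate the generator-based early termination and the cached `.lower()` call, here is a small standalone sketch. The function names, the `inspected` parameter (present only to make the laziness observable), and the value of `MIN_STATUS_TEXT_LENGTH` are assumptions, not taken from the PR's diff.

```python
MIN_STATUS_TEXT_LENGTH = 10  # assumed value; the PR does not state it


def first_valid_status(candidates, inspected=None):
    """Return the first stripped candidate that is long enough, else None.

    Candidates are consumed lazily, so items after the first match are
    never touched. `inspected`, if given, records what was consumed.
    """
    def cleaned():
        for raw in candidates:
            if inspected is not None:
                inspected.append(raw)
            yield raw.strip()

    return next((t for t in cleaned() if len(t) >= MIN_STATUS_TEXT_LENGTH), None)


def mentions_keyword(text, keywords):
    """Lowercase the text once and reuse it for every keyword check."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)
```

With `next()` over a generator, the search stops at the first acceptable candidate instead of building a full list of matches first.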

Example

The scraper now handles multiple HTML structures:

# Modern structure (Strategy 2)
<h1 class="entry-title">User reports indicate problems at Facebook</h1>

# Original structure (Strategy 1) 
<div id="company">
  <div class="h2 entry-title">Problems at Facebook</div>
</div>

# Fallback (Strategy 4)
<h1>Major outage at Facebook</h1>

Error message example:

Unable to find status for 'invalid-service'. Please verify the service name is correct (e.g., 'facebook', 'twitter', 'instagram'). Visit https://downdetector.com to confirm the service exists.

All strategies tested with mock responses. CodeQL: 0 alerts.

Original prompt

this cli for down detector does not seem to work anymore. it is not able to scrape the correct status from the down detector website.

do the following things:

  1. figure out why it is unable to scrape the website for fetching the status
  2. make suitable changes to code in order to bypass any bot protections
  3. modify the scraping technique to make it align with the new website structure

Copilot AI and others added 5 commits December 6, 2025 10:08
…tection bypass

Co-authored-by: aaryanrr <73213670+aaryanrr@users.noreply.github.com>
Co-authored-by: aaryanrr <73213670+aaryanrr@users.noreply.github.com>
Co-authored-by: aaryanrr <73213670+aaryanrr@users.noreply.github.com>
…xtract constants

Co-authored-by: aaryanrr <73213670+aaryanrr@users.noreply.github.com>
Co-authored-by: aaryanrr <73213670+aaryanrr@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix down detector CLI scraping issues" to "Fix scraping - update for modern DownDetector HTML and improve bot detection bypass" Dec 6, 2025
Copilot AI requested a review from aaryanrr December 6, 2025 10:21
Comment thread src/Scraper.py Outdated
    print(status_text)
else:
    raise Errors.InvalidServiceName(
        "Could not find status information. The service name may be invalid or the page structure has changed."
Owner

@copilot how about we improve this error message to better indicate the problem to the user?

Contributor Author

Improved the error message in commit e16e998. It now:

  • Shows the specific service name that failed
  • Provides examples of valid service names (facebook, twitter, instagram)
  • Directs users to verify on downdetector.com

Example error message:

Unable to find status for 'invalid-service'. Please verify the service name is correct (e.g., 'facebook', 'twitter', 'instagram'). Visit https://downdetector.com to confirm the service exists.

Owner

@aaryanrr aaryanrr left a comment

Address the comments given in the code, and test the code in an environment first.

Co-authored-by: aaryanrr <73213670+aaryanrr@users.noreply.github.com>
@aaryanrr aaryanrr marked this pull request as ready for review December 7, 2025 20:14
@aaryanrr aaryanrr added the bug (Something isn't working) and enhancement (New feature or request) labels Dec 7, 2025
@aaryanrr aaryanrr merged commit ec3974c into main Dec 7, 2025
3 of 4 checks passed

2 participants