
fix: Improve block detection and browser fallback for small responses #8

Open
eskobar95 wants to merge 44 commits into main from fix/anti-scraping-optimization

Conversation

eskobar95 (Owner) commented Dec 6, 2025

Fix

Problem

The server received small responses (2403 bytes) from Transfermarkt that were not detected as blocks, which resulted in empty results. Normal Transfermarkt pages are 50k+ bytes.

Solution

  1. Increased block detection threshold - from 1000 to 5000 bytes in _detect_block()
  2. Content size validation in request_url_page() - checks the content size and automatically triggers the browser fallback if it is below 5000 bytes (see the sketch below)
  3. Browser response validation - ensures that the browser fallback also returns sufficient content
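
The combined effect of these three points, as a minimal sketch (the function name, the MIN_CONTENT_BYTES constant, and the browser_fetch hook are illustrative; the actual logic lives in _detect_block() and request_url_page() in app/services/base.py):

import requests

MIN_CONTENT_BYTES = 5000  # normal Transfermarkt pages are 50k+ bytes

def fetch_page(url: str, browser_fetch=None) -> str:
    """Fetch a page over plain HTTP, falling back to a browser scraper for tiny responses."""
    html = requests.get(url, timeout=10).text
    if len(html) < MIN_CONTENT_BYTES and "transfermarkt" not in html.lower():
        # A suspiciously small response is most likely a block page, not real content.
        if browser_fetch is None:
            raise RuntimeError(f"Blocked response ({len(html)} bytes) for {url}, no browser fallback available")
        html = browser_fetch(url)
        if not html or len(html) < MIN_CONTENT_BYTES:
            raise RuntimeError(f"Browser fallback also returned too little content for {url}")
    return html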

Changes

  • app/services/base.py:
    • Updated the _detect_block() threshold from 1000 to 5000 bytes
    • Added content size validation in request_url_page()
    • Automatic browser fallback when the content is too small
    • Validation of the browser response size

Testing

  • ✅ Local: Works correctly (normal pages are 50k+ bytes)
  • ✅ Block detection: Now correctly detects small block pages
  • ✅ Browser fallback: Triggers automatically on small responses

Related Issues

Fixes the issue where the server received small block pages (2403 bytes) that were not detected, resulting in empty results for all endpoints (club search, club players, etc.).

Summary by CodeRabbit

  • New Features

    • Added diagnostic /debug/scraping endpoint for troubleshooting data retrieval issues
    • Club player responses now include optional name and season identifiers
    • Improved fallback scraping mechanism with optional browser-based support
  • Bug Fixes

    • Enhanced accuracy of player position and age data extraction through improved selectors
    • Strengthened error handling and validation across search endpoints
  • Chores

    • Added browser automation framework to dependencies


- Add full support for national teams across all club endpoints
- Add new /clubs/{club_id}/competitions endpoint to retrieve club competitions
- Add isNationalTeam field to Club Profile response schema
- Make Club Profile fields optional to accommodate national teams
- Enhance Club Players endpoint to handle national team HTML structure
- Update XPath expressions to support both club and national team structures
- Add intelligent detection logic for national teams
- Maintain backward compatibility with existing club endpoints

This update enables the API to work seamlessly with both regular clubs
and national teams, providing a unified interface for all club-related
data retrieval.
- Add GET /competitions/{competition_id}/seasons endpoint
- Implement TransfermarktCompetitionSeasons service to scrape season data
- Add CompetitionSeason and CompetitionSeasons Pydantic schemas
- Support both cross-year (e.g., 25/26) and single-year (e.g., 2025) seasons
- Handle historical seasons correctly (e.g., 99/00 -> 1999-2000)
- Extract seasons from competition page dropdown/table structure
- Return season_id, season_name, start_year, and end_year for each season
- Sort seasons by start_year descending (newest first)

Closes #[issue-number]
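
The cross-year/single-year season handling listed above could be normalized along these lines; this is only a sketch, and the helper name and the century cutoff (two-digit years of 90 or above map to the 1900s) are assumptions rather than the service's actual code:

def parse_season(season_name: str) -> tuple:
    """Turn a Transfermarkt season label into (start_year, end_year)."""
    if "/" in season_name:
        # Cross-year label such as "25/26" or the historical "99/00".
        start, end = season_name.split("/")
        start_year = int(start) + (1900 if int(start) >= 90 else 2000)
        end_year = int(end) + (1900 if int(end) >= 90 else 2000)
        return start_year, end_year
    # Single-year label such as "2025".
    year = int(season_name)
    return year, year

# parse_season("99/00") -> (1999, 2000); parse_season("2025") -> (2025, 2025)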
- Detect national team competitions (FIWC, EURO, COPA, AFAC, GOCU, AFCN)
- Use /teilnehmer/pokalwettbewerb/ URL for national team competitions
- Handle season_id correctly (year-1 for national teams in URL)
- Add XPath expressions for participants table
- Limit participants to expected tournament size to exclude non-qualified teams
- Make season_id optional in CompetitionClubs schema
- Update Dockerfile PYTHONPATH configuration
- Add length validation for ids and names before zip() to prevent silent data loss
- Raise descriptive ValueError with logging if ids and names mismatch
- Simplify seasonId assignment logic for national teams
- Remove unnecessary try/except block (isdigit() prevents ValueError)
- Clean up unreachable fallback code
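
The length check before zip() amounts to something like the sketch below (the function name is illustrative; the idea is simply to fail loudly instead of letting zip() drop entries):

import logging

logger = logging.getLogger(__name__)

def pair_ids_and_names(ids: list, names: list) -> list:
    """Pair extracted ids and names, refusing to zip() mismatched lists."""
    if len(ids) != len(names):
        # zip() would silently truncate to the shorter list, hiding data loss.
        logger.error("Extracted %d ids but %d names", len(ids), len(names))
        raise ValueError(f"ids/names length mismatch: {len(ids)} vs {len(names)}")
    return list(zip(ids, names))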
- Add tournament size configuration to Settings class with environment variable support
- Replace hardcoded dict with settings.get_tournament_size() method
- Add warning logging when tournament size is not configured (instead of silent truncation)
- Proceed without truncation when size is unavailable (no silent data loss)
- Add validation for tournament sizes (must be positive integers)
- Add comprehensive unit tests for both configured and fallback paths
- Update README.md with new environment variables documentation

This prevents silent truncation when tournament sizes change (e.g., World Cup expanding to 48)
and allows easy configuration via environment variables.
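
A rough sketch of what such a settings-based lookup could look like; settings.get_tournament_size() is the only name taken from the commits, while the environment-variable naming and validation details here are assumptions:

import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)

class Settings:
    def get_tournament_size(self, competition_id: str) -> Optional[int]:
        """Return the expected number of participants, or None if unconfigured."""
        # e.g. TOURNAMENT_SIZE_FIWC=48 could override the World Cup size via the environment
        raw = os.getenv(f"TOURNAMENT_SIZE_{competition_id}")
        if raw is None:
            # No silent truncation: warn and let the caller keep all participants.
            logger.warning("No tournament size configured for %s; skipping truncation", competition_id)
            return None
        size = int(raw)
        if size <= 0:
            raise ValueError(f"Tournament size for {competition_id} must be a positive integer")
        return size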
- Remove extra HTTP request to fetch club profile just to read isNationalTeam
- Set is_national_team=None to let TransfermarktClubPlayers use DOM heuristics
- Remove broad except Exception that silently swallowed all errors
- Improve performance by eliminating redundant network call
- Players class already has robust DOM-based detection for national teams
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Keep imports at module level in clubs/competitions.py (from CodeRabbit review)
- Preserve is_national_team flag logic in clubs/players.py
- Keep name padding in competitions/search.py
- Add .DS_Store to .gitignore
- Remove whitespace from blank lines (W293)
- Add missing trailing commas (COM812)
- Split long XPath lines to comply with E501 line length limit
- Format XPath strings to comply with line length
- Format list comprehensions
- Format is_season condition
- Fix session initialization issue causing all HTTP requests to fail
- Improve block detection to avoid false positives
- Optimize browser scraping delays (reduce from 12-13s to 0.4-0.8s)
- Update XPath definitions for clubs, competitions, and players search
- Fix nationalities parsing in player search (relative to each row)
- Add comprehensive monitoring endpoints
- Update settings for anti-scraping configuration

Performance improvements:
- HTTP success rate: 0% → 100%
- Response time: 12-13s → 0.4-0.8s
- Browser fallback: Always → Never needed
- All endpoints now working correctly
Resolved conflicts:
- app/services/clubs/players.py: Kept improved nationalities parsing with trim()
- app/settings.py: Kept anti-scraping configuration settings
- app/utils/xpath.py: Combined URL from HEAD with robust NAME fallbacks from main
- Fix import sorting
- Add trailing commas
- Replace single quotes with double quotes
- Add noqa comments for long lines (User-Agent strings, XPath definitions)
- Remove unused variables
- Fix whitespace issues
- Change padding logic for players_joined_on, players_joined, and players_signed_from
- Use "" instead of None to match the default value when elements are None
- Fixes CodeRabbit review: inconsistent placeholder values
- Add try/except for playwright import to handle missing dependency
- Make _browser_scraper optional (None if playwright unavailable)
- Add checks in make_request_with_browser_fallback and get_monitoring_stats
- Update test_browser_scraping endpoint to handle missing playwright
- Add playwright to requirements.txt
- App can now start without playwright, browser scraping disabled if unavailable
- Add playwright install chromium step in Dockerfile
- Only runs if playwright is installed (graceful fallback)
- Ensures browser binaries are available for Railway deployment
- Keep optional playwright import in base.py
- Keep playwright availability check in test_browser_scraping endpoint
- Maintains Railway deployment compatibility
- Keep optional playwright import and checks
- Maintain browser scraper optional initialization
- Preserve playwright availability checks in monitoring stats
- All conflicts resolved, ready for merge
- Validate HTTP responses have content before returning
- Validate browser scraping content before using
- Add detailed logging for debugging deployment issues
- Raise proper exceptions when content is empty or invalid
- Helps diagnose why server returns 200 with empty content
- Add /debug/scraping endpoint to test HTTP, browser, and page requests
- Shows content lengths, errors, and availability status
- Helps diagnose why server returns empty responses
- Better error messages in request_url_page and make_request_with_browser_fallback
- Validate page is not None after request_url_page()
- Add exception handling in endpoints to catch and log errors
- Add XPath error handling with detailed error messages
- Warn when search results are empty (helps diagnose Railway issues)
- Prevents silent failures that return 200 with empty data
- Log page HTML length and content validation
- Log XPath extraction results
- Warn when no results found
- Helps diagnose why server returns empty results while local works
- Default to current year if season_id is None (like Club Competitions)
- Add club name and seasonId to response
- Fix URL formatting to prevent 'None' in URL
- Improves __update_season_id to validate extracted season
- Fix DOB_AGE XPath to use TD[5] instead of zentriert[2] selector
- Default season_id to current year if None (prevents 'None' in URL)
- Add club name and seasonId to response
- Resolves issue where 0 players were returned despite finding URLs/names
- Use more specific XPath matching zentriert td with date pattern
- Filters for dates (contains '/') and minimum length to avoid false matches
- Resolves issue where DOB_AGE returned 0 items
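
A date-pattern filter of that kind might look roughly like this in lxml; the exact expression in app/utils/xpath.py may differ, so treat the selector below as an illustration of filtering zentriert cells by a '/' and a minimum length:

from lxml import html

# Illustrative selector: zentriert cells whose text looks like "DD/MM/YYYY (age)".
DOB_AGE_XPATH = (
    "//td[contains(@class, 'zentriert')]"
    "[contains(text(), '/') and string-length(normalize-space(text())) > 8]/text()"
)

def extract_dob_age(page_html: str) -> list:
    tree = html.fromstring(page_html)
    return [value.strip() for value in tree.xpath(DOB_AGE_XPATH)]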
- Add Optional name and season_id fields to ClubPlayers schema
- These fields are set in get_club_players() but were missing from schema
- Resolves issue where name and seasonId were None in API response
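
In schema terms this is just two optional fields, roughly as sketched here; the seasonId alias is an assumption based on the field name used in the API responses:

from typing import Optional
from pydantic import BaseModel, Field

class ClubPlayers(BaseModel):
    # Existing fields omitted; only the newly added optional fields are shown.
    name: Optional[str] = None
    season_id: Optional[str] = Field(default=None, alias="seasonId")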
- Increase block detection threshold from 1000 to 5000 bytes
- Add content size validation in request_url_page to catch small block pages
- Automatically trigger browser fallback when content is suspiciously small
- Normal Transfermarkt pages are 50k+ bytes, so 5k is a safe threshold
- Resolves issue where server received 2403 byte responses (block pages) that weren't detected

coderabbitai Bot commented Dec 6, 2025

Walkthrough

This pull request introduces optional Playwright browser integration for fallback scraping, adds comprehensive error handling and content validation across the API and service layers, enhances XPath selectors for improved robustness, and includes a new debug endpoint for testing scraping functionality. The changes make the system more resilient to failures with diagnostic logging.

Changes

  • Infrastructure & Dependencies (Dockerfile, requirements.txt): Adds conditional Playwright browser installation in Docker and declares the playwright==1.48.0 dependency.
  • API Endpoint Error Handling (app/api/endpoints/clubs.py, app/api/endpoints/competitions.py): Wraps search endpoints with try/except blocks; adds result validation, warning logging for missing/empty results, and exception propagation.
  • Debug & Scraping Endpoint (app/main.py): Introduces a /debug/scraping endpoint that tests HTTP requests, optional browser fallback, and full-page requests; includes content length tracking and error reporting. Adds a Playwright availability guard to the existing test_browser_scraping.
  • Schema Updates (app/schemas/clubs/players.py): Adds two optional fields, name and season_id, both defaulting to None.
  • Core Service Layer – Browser Integration (app/services/base.py): Makes Playwright optional with conditional imports and a PLAYWRIGHT_AVAILABLE flag; converts _browser_scraper to an Optional type; adds content validation thresholds (5000 bytes) and browser fallback logic; enhances error messages with URL context; updates monitoring stats to reflect browser capability.
  • Service Layer – Search & Data Fetching (app/services/clubs/search.py, app/services/competitions/search.py, app/services/clubs/players.py): Adds initialization-time page-load validation with error logging; enhances parsing with diagnostic output and null checks; populates season_id (defaulting to the current year) and the club name in responses.
  • XPath Selectors Refinement (app/utils/xpath.py): Updates the POSITIONS selector to use contains(@class, 'zentriert') instead of an exact class match; refines the DOB_AGE selector to detect date patterns with / and a minimum length for more robust extraction.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring extra attention:

  • app/services/base.py: Complex browser fallback logic with content validation thresholds and conditional Playwright availability handling
  • app/services/clubs/search.py and app/services/competitions/search.py: Enhanced initialization validation and diagnostic logging patterns
  • app/utils/xpath.py: XPath selector refinements—verify new patterns correctly target positions and date-of-birth fields across Transfermarkt pages
  • app/main.py: New debug endpoint logic with multi-step scraping test flow and error aggregation

Poem

🐰 A Playwright takes stage, both optional and keen,
With fallback so graceful when content is lean;
XPath selectors sharper, validation more strong,
Error logs singing their diagnostic song!
The service layer's ready, resilient and bright. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately reflects the primary changes: improving block detection (raising the threshold from 1000 to 5000 bytes) and implementing browser fallback for small responses.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 86.36%, which is sufficient. The required threshold is 80.00%.

Comment @coderabbitai help to get the list of available commands and usage tips.


coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (10)
app/api/endpoints/competitions.py (1)

15-24: Replace print() with proper logging.

The error handling logic is sound, but using print() for logging is not production-ready. Consider using Python's logging module for better observability and log management.

Apply this pattern:

+import logging
+
+logger = logging.getLogger(__name__)
+
 @router.get("/search/{competition_name}", response_model=schemas.CompetitionSearch)
 def search_competitions(competition_name: str, page_number: Optional[int] = 1):
     try:
         tfmkt = TransfermarktCompetitionSearch(query=competition_name, page_number=page_number)
         competitions = tfmkt.search_competitions()
-        # Validate we got actual results
         if not competitions.get("results"):
-            print(f"Warning: No results found for competition search: {competition_name} (page {page_number})")
+            logger.warning(f"No results found for competition search: {competition_name} (page {page_number})")
         return competitions
     except Exception as e:
-        print(f"Error in search_competitions for {competition_name}: {e}")
+        logger.error(f"Error in search_competitions for {competition_name}: {e}", exc_info=True)
         raise
app/services/competitions/search.py (1)

28-35: Good defensive validation, but use proper logging.

The page validation logic is solid and aligns with the PR's robustness goals. However, consider replacing print() with Python's logging module for production-grade observability (similar to the suggestion in app/api/endpoints/competitions.py).

app/api/endpoints/clubs.py (1)

16-25: Replace print() with proper logging and consider DRY refactor.

Similar to app/api/endpoints/competitions.py, this endpoint uses print() for logging. Additionally, the error-handling pattern is duplicated across multiple endpoints. Consider extracting this into a shared decorator or utility function.

Example decorator approach:

import logging
from functools import wraps

logger = logging.getLogger(__name__)

def handle_search_errors(search_type: str):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                if not result.get("results"):
                    logger.warning(f"No results found for {search_type}: {args}")
                return result
            except Exception as e:
                logger.error(f"Error in {search_type}: {e}", exc_info=True)
                raise
        return wrapper
    return decorator

Then apply it to endpoints:

@router.get("/search/{club_name}", ...)
@handle_search_errors("club search")
def search_clubs(club_name: str, page_number: Optional[int] = 1) -> dict:
    tfmkt = TransfermarktClubSearch(query=club_name, page_number=page_number)
    return tfmkt.search_clubs()
Dockerfile (1)

11-12: Conditional Playwright installation is redundant since Playwright is in requirements, but consider documenting Docker image size implications.

Playwright is listed in requirements.txt (line 36), so the conditional import check on line 12 will always succeed—the playwright install chromium command runs regardless. While the approach is harmless, consider documenting this Chromium installation in the README since it significantly increases image size (~200-300 MB). Alternatively, use a multi-stage build to isolate browser dependencies if image size optimization is a priority.

app/services/clubs/search.py (3)

25-35: Init-time page load validation is reasonable but slightly redundant

Calling self.page = self.request_url_page() inside __post_init__ and checking for None gives early failure, which fits with the stricter request_url_page behavior. Note that convert_bsoup_to_page and request_url_page already raise HTTPException when parsing fails, so the explicit if self.page is None/ValueError is mostly defensive. Keeping it is fine, but you could simplify later by relying solely on the base class’ exceptions.


46-63: Defensive page diagnostics look good; consider gating noisy prints

The additional checks around self.page is None and the lxml.etree.tostring/length + “transfermarkt” presence are helpful for diagnosing block/small-content issues and align with the new small-content checks in TransfermarktBase.request_url_page. Just be aware these print calls can be quite noisy in production; if this endpoint gets real traffic, consider routing them through a logger or gating behind a debug flag/env var.


70-74: XPath debug logging is useful for block/HTML changes

Logging the counts for names/URLs/countries and warning when len(clubs_names) == 0 with the actual Clubs.Search.NAMES XPath is very helpful when Transfermarkt changes markup or when block pages slip through. No functional issues here; again, if this endpoint is called frequently, you may eventually want to move these to structured logging rather than raw prints.

app/main.py (1)

151-223: /debug/scraping endpoint is useful; remove unused event loop

The endpoint gives a clear picture of HTTP vs browser fallback vs full page behavior and aligns with the new small‑content checks in TransfermarktBase. One minor nit:

  • In the browser fallback block you do:
import asyncio

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
browser_response = test_base.make_request_with_browser_fallback(use_browser=True)

make_request_with_browser_fallback is synchronous and manages its own event loop internally, so the new_event_loop() and set_event_loop() here are unused and just create an extra loop per call.

You can safely drop those two lines:

-            try:
-                import asyncio
-
-                loop = asyncio.new_event_loop()
-                asyncio.set_event_loop(loop)
-                browser_response = test_base.make_request_with_browser_fallback(use_browser=True)
+            try:
+                browser_response = test_base.make_request_with_browser_fallback(use_browser=True)

Everything else in the endpoint looks consistent with the base service behavior.

app/services/base.py (2)

743-805: Browser fallback validation is stricter; consider minor loop-hygiene tweak

The updated make_request_with_browser_fallback behavior is aligned with the PR goals:

  • Guarding on not PLAYWRIGHT_AVAILABLE or _browser_scraper is None before attempting fallback avoids crashes when Playwright isn’t installed.
  • Treating html_content that is empty or < 100 chars as invalid and raising HTTPException(500, ...) is reasonable for Transfermarkt pages and ensures _monitor.record_browser_request(success=False) is triggered via the except block.
  • Re‑raising the original http_error keeps the HTTP error semantics for callers when browser fallback also fails.

One optional improvement: you’re creating a fresh event loop via asyncio.new_event_loop() and never closing it. In high‑throughput environments this can leak resources. If this code path starts being hit frequently, consider:

loop = asyncio.new_event_loop()
try:
    asyncio.set_event_loop(loop)
    html_content = loop.run_until_complete(_browser_scraper.scrape_with_fallback(url))
finally:
    loop.close()

This keeps the current behavior but cleans up the loop explicitly.


892-895: Block detection threshold and small-content handling meet objectives; consider HTTPException instead of ValueError

  • Raising _detect_block’s size threshold to len(response.text) < 5000 when "transfermarkt" is absent matches your observation that normal pages are 50k+ bytes and will catch more small block pages.
  • request_url_page now re-checks the parsed text via bsoup.get_text() and, on small content (< 5000 and no “transfermarkt”), forces the browser path, then validates that browser HTML is non-empty and also ≥ 5000 chars. That’s consistent with the new block heuristic and should prevent silently returning empty/skeleton pages.
  • Final conversion via convert_bsoup_to_page plus the explicit page is None guard provides clear 500-errors when parsing fails.

One behavior detail to be aware of:

if content_len < 5000 and "transfermarkt" not in content_text.lower():
    ...
    raise ValueError("Content too small, likely blocked")

If both HTTP and browser fallback ultimately fail, this ValueError becomes the http_error that is re‑raised, so callers will see a plain ValueError instead of a HTTPException. For FastAPI endpoints that don’t special‑case it, this still surfaces as a 500, but you lose structured status/detail.

If you want consistent API error typing, you could switch this to HTTPException:

-            if content_len < 5000 and "transfermarkt" not in content_text.lower():
-                print(
-                    f"Suspiciously small content ({content_len} bytes) for {self.URL}, trying browser fallback",
-                )
-                raise ValueError("Content too small, likely blocked")
+            if content_len < 5000 and "transfermarkt" not in content_text.lower():
+                print(
+                    f"Suspiciously small content ({content_len} bytes) for {self.URL}, trying browser fallback",
+                )
+                raise HTTPException(
+                    status_code=403,
+                    detail=f"Content too small ({content_len} bytes) and likely blocked for {self.URL}",
+                )

This keeps behavior the same while preserving HTTP semantics.

Also applies to: 977-1032

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8d35919 and c289405.

📒 Files selected for processing (11)
  • Dockerfile (1 hunks)
  • app/api/endpoints/clubs.py (1 hunks)
  • app/api/endpoints/competitions.py (1 hunks)
  • app/main.py (2 hunks)
  • app/schemas/clubs/players.py (1 hunks)
  • app/services/base.py (10 hunks)
  • app/services/clubs/players.py (3 hunks)
  • app/services/clubs/search.py (2 hunks)
  • app/services/competitions/search.py (1 hunks)
  • app/utils/xpath.py (1 hunks)
  • requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
app/api/endpoints/competitions.py (1)
app/services/competitions/search.py (2)
  • TransfermarktCompetitionSearch (9-117)
  • search_competitions (105-117)
app/api/endpoints/clubs.py (1)
app/services/clubs/search.py (2)
  • TransfermarktClubSearch (9-109)
  • search_clubs (96-109)
app/main.py (1)
app/services/base.py (4)
  • TransfermarktBase (713-1223)
  • make_request (806-868)
  • make_request_with_browser_fallback (743-804)
  • request_url_page (977-1032)
app/services/competitions/search.py (1)
app/services/base.py (1)
  • request_url_page (977-1032)
app/services/clubs/players.py (3)
app/utils/utils.py (1)
  • extract_from_url (19-46)
app/services/base.py (1)
  • get_text_by_xpath (1082-1144)
app/utils/xpath.py (4)
  • Clubs (112-212)
  • Clubs (265-300)
  • Players (1-109)
  • Players (163-205)
app/services/clubs/search.py (2)
app/services/base.py (2)
  • request_url_page (977-1032)
  • get_list_by_xpath (1047-1080)
app/utils/xpath.py (5)
  • Clubs (112-212)
  • Clubs (265-300)
  • Search (59-77)
  • Search (148-161)
  • Search (244-263)
🔇 Additional comments (8)
app/utils/xpath.py (1)

177-181: LGTM! More resilient XPath selectors.

The updated selectors improve robustness:

  • POSITIONS now uses contains(@class, 'zentriert') to handle additional CSS classes
  • DOB_AGE now pattern-matches cells containing date format (DD/MM/YYYY) to reduce false positives

These changes align well with the PR's goal of improving scraping reliability.

app/schemas/clubs/players.py (1)

27-28: LGTM! Clean schema extension.

The new optional fields (name and season_id) extend the schema without breaking backward compatibility. This aligns well with the service layer changes in app/services/clubs/players.py that populate these fields.

app/services/clubs/players.py (3)

43-51: Good defensive extraction with fallback.

The try/except wrapper around season ID extraction is appropriate—it tolerates extraction failures while preserving the original season_id. This aligns well with the PR's robustness improvements.


260-264: LGTM! Response population aligns with schema.

Populating seasonId and name in the response aligns with the schema extensions in app/schemas/clubs/players.py. The defensive check for club_name before assignment is good practice.


30-34: Default to current year is appropriate and well-handled.

Using the current year as the default for season_id is a reasonable approach, and the implementation validates this choice. The code includes error recovery via __update_season_id(), which extracts the actual season from the page and updates the value if it differs from the default. This pattern is applied consistently across both TransfermarktClubPlayers and TransfermarktClubCompetitions, is validated by existing tests with season_id=None, and allows callers to explicitly specify a season when needed. The default fallback is appropriate for a sports data API where the current season is typically the most frequently requested.

app/main.py (1)

63-70: Playwright-availability guard in test endpoint is sound

Checking PLAYWRIGHT_AVAILABLE and _browser_scraper is None before using the browser scraper prevents runtime errors when Playwright isn’t installed or configured, and returning a structured error payload is appropriate for this test route. You might eventually expand the error message to mention that the Playwright browser binaries must also be installed (e.g., playwright install chromium), but functionally this is fine.

app/services/base.py (2)

16-23: Playwright optional integration and monitoring wiring look correct

  • The guarded import with PLAYWRIGHT_AVAILABLE plus async_playwright = None is a solid pattern for making Playwright truly optional.
  • PlaywrightBrowserScraper.scrape_with_browser explicitly raising ImportError when PLAYWRIGHT_AVAILABLE is False prevents accidental usage if someone constructs the class manually.
  • _browser_scraper being Optional[PlaywrightBrowserScraper] and only instantiated when PLAYWRIGHT_AVAILABLE matches the new guards in callers.
  • get_monitoring_stats now reports browser_scraping_available and user-agent/viewport counts in a way that reflects these flags.

No functional issues; this is a good step toward clean optional browser scraping.

Also applies to: 387-403, 565-566, 1215-1222
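
For reference, the guarded-import pattern described here typically looks like the sketch below; the flag and import match the review's description, while the scrape function is a simplified stand-in for the scraper class in app/services/base.py:

try:
    from playwright.async_api import async_playwright
    PLAYWRIGHT_AVAILABLE = True
except ImportError:  # Playwright not installed: browser scraping stays disabled
    async_playwright = None
    PLAYWRIGHT_AVAILABLE = False

async def scrape_with_browser(url: str) -> str:
    """Render a page with Chromium and return its HTML; raises if Playwright is missing."""
    if not PLAYWRIGHT_AVAILABLE:
        raise ImportError("playwright is not installed; browser scraping unavailable")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        content = await page.content()
        await browser.close()
        return content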


1063-1075: Improved XPath error handling is helpful for debugging

Enhancing get_list_by_xpath to:

  • Include URL in the “Page not initialized” error, and
  • Wrap self.page.xpath(xpath) in a try/except that logs the offending XPath + URL and raises an HTTPException with detailed context

will make diagnosing structure changes or invalid XPaths much easier. This is consistent with the new debug endpoints and other error messages; no changes needed.

Comment thread: requirements.txt
watchfiles==1.0.3 ; python_version >= "3.9" and python_version < "4.0"
websockets==14.1 ; python_version >= "3.9" and python_version < "4.0"
wrapt==1.17.0 ; python_version >= "3.9" and python_version < "4.0"
playwright==1.48.0 ; python_version >= "3.9" and python_version < "4.0"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

latest stable version of Playwright Python December 2025

💡 Result:

The latest stable Playwright for Python is v1.56.0 (released Nov 11, 2025). [1][2]

Sources:

  • GitHub releases (microsoft/playwright-python) — v1.56.0 release. [1]
  • Playwright Python release notes — Version 1.56. [2]

🌐 Web query:

Playwright 1.48.0 security vulnerabilities CVE

💡 Result:

Yes — Playwright versions before 1.55.1 (including 1.48.0) are affected by a reported vulnerability tracked as CVE-2025-59288 (improper verification of cryptographic signature in the browser download/install scripts). Upgrade to Playwright 1.55.1 or later to receive the patch. [1][2][3]

Sources:

  • NVD entry for CVE-2025-59288. [1]
  • Wiz vulnerability analysis (summary and fix in 1.55.1). [2]
  • Snyk advisories noting the affected Playwright / playwright-core versions and recommended upgrade. [3]

Upgrade Playwright to version 1.56.0; 1.48.0 is affected by CVE-2025-59288.

Playwright 1.48.0 is significantly outdated (current latest is 1.56.0, released Nov 11, 2025) and contains a security vulnerability (CVE-2025-59288) affecting improper verification of cryptographic signatures in browser download/install scripts. Upgrade to at least 1.55.1, preferably 1.56.0.

🤖 Prompt for AI Agents
In requirements.txt around line 36, the Playwright pin uses 1.48.0 which is
vulnerable (CVE-2025-59288); update the version spec to at least 1.55.1 but
preferably to 1.56.0 (e.g. playwright==1.56.0 ; python_version >= "3.9" and
python_version < "4.0"), preserve the existing environment marker, then
regenerate any lock/constraints files and run your test suite and CI to ensure
compatibility with the new Playwright release.
