fix: Improve block detection and browser fallback for small responses #8
Conversation
- Add full support for national teams across all club endpoints
- Add new /clubs/{club_id}/competitions endpoint to retrieve club competitions
- Add isNationalTeam field to Club Profile response schema
- Make Club Profile fields optional to accommodate national teams
- Enhance Club Players endpoint to handle national team HTML structure
- Update XPath expressions to support both club and national team structures
- Add intelligent detection logic for national teams
- Maintain backward compatibility with existing club endpoints
This update enables the API to work seamlessly with both regular clubs
and national teams, providing a unified interface for all club-related
data retrieval.
- Add GET /competitions/{competition_id}/seasons endpoint
- Implement TransfermarktCompetitionSeasons service to scrape season data
- Add CompetitionSeason and CompetitionSeasons Pydantic schemas
- Support both cross-year (e.g., 25/26) and single-year (e.g., 2025) seasons
- Handle historical seasons correctly (e.g., 99/00 -> 1999-2000)
- Extract seasons from competition page dropdown/table structure
- Return season_id, season_name, start_year, and end_year for each season
- Sort seasons by start_year descending (newest first)
Closes #[issue-number]
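The cross-year and historical season handling above can be sketched roughly as follows. This is an illustration, not the service's actual code; `parse_season` and the two-digit-year pivot of 50 are assumptions:

```python
def parse_season(season_id: str):
    """Parse a Transfermarkt season label into (start_year, end_year).

    Handles single-year seasons ("2025") and cross-year seasons
    ("25/26"), including the century wrap ("99/00" -> 1999-2000).
    """
    if "/" in season_id:
        start = int(season_id.split("/")[0])
        # Two-digit years: assume 20xx below the pivot, 19xx otherwise
        # (the pivot of 50 is an assumption for this sketch).
        start_year = 2000 + start if start < 50 else 1900 + start
        return start_year, start_year + 1  # cross-year seasons span one year
    year = int(season_id)
    return year, year
```

Sorting the resulting tuples by `start_year` descending then gives the newest-first ordering the endpoint returns.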
- Detect national team competitions (FIWC, EURO, COPA, AFAC, GOCU, AFCN)
- Use /teilnehmer/pokalwettbewerb/ URL for national team competitions
- Handle season_id correctly (year-1 for national teams in URL)
- Add XPath expressions for participants table
- Limit participants to expected tournament size to exclude non-qualified teams
- Make season_id optional in CompetitionClubs schema
- Update Dockerfile PYTHONPATH configuration
- Add length validation for ids and names before zip() to prevent silent data loss
- Raise descriptive ValueError with logging if ids and names mismatch
- Simplify seasonId assignment logic for national teams
- Remove unnecessary try/except block (isdigit() prevents ValueError)
- Clean up unreachable fallback code
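The pre-`zip()` length check described above might look roughly like this (the function name and messages are hypothetical, not the actual code):

```python
import logging

logger = logging.getLogger(__name__)

def pair_ids_and_names(ids, names):
    # zip() silently drops trailing items when the lists differ in length,
    # so validate first and fail loudly instead of losing data.
    if len(ids) != len(names):
        logger.error("ids/names length mismatch: %d vs %d", len(ids), len(names))
        raise ValueError(f"ids/names length mismatch: {len(ids)} != {len(names)}")
    return list(zip(ids, names))
```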
- Add tournament size configuration to Settings class with environment variable support
- Replace hardcoded dict with settings.get_tournament_size() method
- Add warning logging when tournament size is not configured (instead of silent truncation)
- Proceed without truncation when size is unavailable (no silent data loss)
- Add validation for tournament sizes (must be positive integers)
- Add comprehensive unit tests for both configured and fallback paths
- Update README.md with new environment variables documentation

This prevents silent truncation when tournament sizes change (e.g., the World Cup expanding to 48 teams) and allows easy configuration via environment variables.
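A minimal sketch of the environment-backed configuration described above, assuming a variable naming scheme like `TOURNAMENT_SIZE_<ID>` (the actual variable names and `Settings` shape in the project may differ):

```python
import logging
import os

logger = logging.getLogger(__name__)

class Settings:
    def get_tournament_size(self, competition_id):
        raw = os.environ.get(f"TOURNAMENT_SIZE_{competition_id}")
        if raw is None:
            # Warn instead of silently truncating when no size is configured.
            logger.warning("No tournament size configured for %s", competition_id)
            return None
        size = int(raw)
        if size <= 0:
            raise ValueError(f"Tournament size must be positive, got {size}")
        return size

def limit_participants(teams, size):
    # Only truncate when a size is known; otherwise keep everything.
    return teams if size is None else teams[:size]
```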
- Remove extra HTTP request to fetch club profile just to read isNationalTeam
- Set is_national_team=None to let TransfermarktClubPlayers use DOM heuristics
- Remove broad except Exception that silently swallowed all errors
- Improve performance by eliminating redundant network call
- Players class already has robust DOM-based detection for national teams
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Keep imports at module level in clubs/competitions.py (from CodeRabbit review)
- Preserve is_national_team flag logic in clubs/players.py
- Keep name padding in competitions/search.py
- Add .DS_Store to .gitignore

- Remove whitespace from blank lines (W293)
- Add missing trailing commas (COM812)
- Split long XPath lines to comply with E501 line length limit

- Format XPath strings to comply with line length
- Format list comprehensions
- Format is_season condition

- Fix session initialization issue causing all HTTP requests to fail
- Improve block detection to avoid false positives
- Optimize browser scraping delays (reduce from 12-13s to 0.4-0.8s)
- Update XPath definitions for clubs, competitions, and players search
- Fix nationalities parsing in player search (relative to each row)
- Add comprehensive monitoring endpoints
- Update settings for anti-scraping configuration

Performance improvements:
- HTTP success rate: 0% → 100%
- Response time: 12-13s → 0.4-0.8s
- Browser fallback: Always → Never needed
- All endpoints now working correctly
Resolved conflicts:
- app/services/clubs/players.py: Kept improved nationalities parsing with trim()
- app/settings.py: Kept anti-scraping configuration settings
- app/utils/xpath.py: Combined URL from HEAD with robust NAME fallbacks from main

- Fix import sorting
- Add trailing commas
- Replace single quotes with double quotes
- Add noqa comments for long lines (User-Agent strings, XPath definitions)
- Remove unused variables
- Fix whitespace issues
- Change padding logic for players_joined_on, players_joined, and players_signed_from
- Use "" instead of None to match the default value when elements are None
- Fixes CodeRabbit review: inconsistent placeholder values
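The padding change can be illustrated with a small helper (the name `pad_values` is hypothetical):

```python
def pad_values(values, target_len, fill=""):
    # Pad with "" rather than None so placeholders match the default value
    # used when an element is missing, keeping the output consistent.
    return values + [fill] * (target_len - len(values))
```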
- Add try/except for playwright import to handle missing dependency
- Make _browser_scraper optional (None if playwright unavailable)
- Add checks in make_request_with_browser_fallback and get_monitoring_stats
- Update test_browser_scraping endpoint to handle missing playwright
- Add playwright to requirements.txt
- App can now start without playwright; browser scraping is disabled if unavailable
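The guarded-import pattern this commit describes looks roughly like the following; the `browser_scraping_enabled` helper is illustrative (the real module also wires up `_browser_scraper` and the monitoring stats):

```python
# Optional dependency: the app must start even when playwright is absent.
try:
    from playwright.async_api import async_playwright
    PLAYWRIGHT_AVAILABLE = True
except ImportError:
    async_playwright = None
    PLAYWRIGHT_AVAILABLE = False

def browser_scraping_enabled(scraper):
    # Callers check this before attempting the browser fallback.
    return PLAYWRIGHT_AVAILABLE and scraper is not None
```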
- Add playwright install chromium step in Dockerfile
- Only runs if playwright is installed (graceful fallback)
- Ensures browser binaries are available for Railway deployment

- Keep optional playwright import in base.py
- Keep playwright availability check in test_browser_scraping endpoint
- Maintains Railway deployment compatibility

- Keep optional playwright import and checks
- Maintain browser scraper optional initialization
- Preserve playwright availability checks in monitoring stats
- All conflicts resolved, ready for merge

- Validate HTTP responses have content before returning
- Validate browser scraping content before using
- Add detailed logging for debugging deployment issues
- Raise proper exceptions when content is empty or invalid
- Helps diagnose why the server returns 200 with empty content

- Add /debug/scraping endpoint to test HTTP, browser, and page requests
- Show content lengths, errors, and availability status
- Helps diagnose why the server returns empty responses
- Better error messages in request_url_page and make_request_with_browser_fallback

- Validate page is not None after request_url_page()
- Add exception handling in endpoints to catch and log errors
- Add XPath error handling with detailed error messages
- Warn when search results are empty (helps diagnose Railway issues)
- Prevents silent failures that return 200 with empty data

- Log page HTML length and content validation
- Log XPath extraction results
- Warn when no results are found
- Helps diagnose why the server returns empty results while local works
- Default to current year if season_id is None (like Club Competitions)
- Add club name and seasonId to response
- Fix URL formatting to prevent 'None' in URL
- Improve __update_season_id to validate the extracted season

- Fix DOB_AGE XPath to use TD[5] instead of the zentriert[2] selector
- Default season_id to current year if None (prevents 'None' in URL)
- Add club name and seasonId to response
- Resolves issue where 0 players were returned despite finding URLs/names
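The current-year fallback can be sketched as below; `resolve_season_id` and the URL shape are illustrative assumptions, not the service's exact code:

```python
from datetime import datetime

def resolve_season_id(season_id=None):
    # Fall back to the current year so the built URL never contains
    # the literal string "None".
    return str(season_id) if season_id is not None else str(datetime.now().year)

def build_players_url(club_id, season_id=None):
    # Hypothetical URL shape, for illustration only.
    return f"/club/{club_id}/kader/saison_id/{resolve_season_id(season_id)}"
```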
- Use a more specific XPath matching zentriert td with a date pattern
- Filter for dates (contains '/') and minimum length to avoid false matches
- Resolves issue where DOB_AGE returned 0 items

- Add Optional name and season_id fields to ClubPlayers schema
- These fields are set in get_club_players() but were missing from the schema
- Resolves issue where name and seasonId were None in the API response

- Increase block detection threshold from 1000 to 5000 bytes
- Add content size validation in request_url_page to catch small block pages
- Automatically trigger browser fallback when content is suspiciously small
- Normal Transfermarkt pages are 50k+ bytes, so 5k is a safe threshold
- Resolves issue where the server received 2403-byte responses (block pages) that weren't detected
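The detection heuristic above can be sketched as follows (the helper name is hypothetical; in the codebase the check lives in `_detect_block()` and `request_url_page()`):

```python
def looks_like_block_page(html, threshold=5000):
    # Normal Transfermarkt pages are 50k+ bytes, while observed block
    # pages were ~2.4k, so a small response that also lacks the site
    # name is treated as a block and triggers the browser fallback.
    return len(html) < threshold and "transfermarkt" not in html.lower()
```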
Walkthrough

This pull request introduces optional Playwright browser integration for fallback scraping, adds comprehensive error handling and content validation across the API and service layers, enhances XPath selectors for improved robustness, and includes a new debug endpoint for testing scraping functionality. The changes make the system more resilient to failures, with diagnostic logging.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (10)
app/api/endpoints/competitions.py (1)
15-24: Replace `print()` with proper logging.

The error handling logic is sound, but using `print()` for logging is not production-ready. Consider using Python's `logging` module for better observability and log management. Apply this pattern:

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 @router.get("/search/{competition_name}", response_model=schemas.CompetitionSearch)
 def search_competitions(competition_name: str, page_number: Optional[int] = 1):
     try:
         tfmkt = TransfermarktCompetitionSearch(query=competition_name, page_number=page_number)
         competitions = tfmkt.search_competitions()
-        # Validate we got actual results
         if not competitions.get("results"):
-            print(f"Warning: No results found for competition search: {competition_name} (page {page_number})")
+            logger.warning(f"No results found for competition search: {competition_name} (page {page_number})")
         return competitions
     except Exception as e:
-        print(f"Error in search_competitions for {competition_name}: {e}")
+        logger.error(f"Error in search_competitions for {competition_name}: {e}", exc_info=True)
         raise
```

app/services/competitions/search.py (1)
28-35: Good defensive validation, but use proper logging.

The page validation logic is solid and aligns with the PR's robustness goals. However, consider replacing `print()` with Python's `logging` module for production-grade observability (similar to the suggestion in `app/api/endpoints/competitions.py`).

app/api/endpoints/clubs.py (1)
16-25: Replace `print()` with proper logging and consider a DRY refactor.

Similar to `app/api/endpoints/competitions.py`, this endpoint uses `print()` for logging. Additionally, the error-handling pattern is duplicated across multiple endpoints. Consider extracting it into a shared decorator or utility function. Example decorator approach:

```python
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def handle_search_errors(search_type: str):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                if not result.get("results"):
                    logger.warning(f"No results found for {search_type}: {args}")
                return result
            except Exception as e:
                logger.error(f"Error in {search_type}: {e}", exc_info=True)
                raise
        return wrapper
    return decorator
```

Then apply it to endpoints:

```python
@router.get("/search/{club_name}", ...)
@handle_search_errors("club search")
def search_clubs(club_name: str, page_number: Optional[int] = 1) -> dict:
    tfmkt = TransfermarktClubSearch(query=club_name, page_number=page_number)
    return tfmkt.search_clubs()
```

Dockerfile (1)
11-12: Conditional Playwright installation is redundant since Playwright is in requirements; consider documenting Docker image size implications.

Playwright is listed in requirements.txt (line 36), so the conditional import check on line 12 will always succeed and the `playwright install chromium` command runs regardless. While the approach is harmless, consider documenting this Chromium installation in the README, since it significantly increases image size (~200-300 MB). Alternatively, use a multi-stage build to isolate browser dependencies if image size optimization is a priority.

app/services/clubs/search.py (3)
25-35: Init-time page load validation is reasonable but slightly redundant.

Calling `self.page = self.request_url_page()` inside `__post_init__` and checking for `None` gives early failure, which fits with the stricter `request_url_page` behavior. Note that `convert_bsoup_to_page` and `request_url_page` already raise `HTTPException` when parsing fails, so the explicit `if self.page is None` / `ValueError` is mostly defensive. Keeping it is fine, but you could simplify later by relying solely on the base class's exceptions.
46-63: Defensive page diagnostics look good; consider gating noisy prints.

The additional checks around `self.page is None` and the `lxml.etree.tostring` length + "transfermarkt" presence are helpful for diagnosing block/small-content issues and align with the new small-content checks in `TransfermarktBase.request_url_page`. Just be aware these prints run on every request; consider gating them behind a debug flag or moving them to structured logging.
70-74: XPath debug logging is useful for block/HTML changes.

Logging the counts for names/URLs/countries and warning when `len(clubs_names) == 0` with the actual `Clubs.Search.NAMES` XPath is very helpful when Transfermarkt changes markup or when block pages slip through. No functional issues here; again, if this endpoint is called frequently, you may eventually want to move these to structured logging rather than raw prints.

app/main.py (1)
151-223: The /debug/scraping endpoint is useful; remove the unused event loop.

The endpoint gives a clear picture of HTTP vs browser fallback vs full page behavior and aligns with the new small-content checks in `TransfermarktBase`. One minor nit: in the browser fallback block you do:

```python
import asyncio

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
browser_response = test_base.make_request_with_browser_fallback(use_browser=True)
```

`make_request_with_browser_fallback` is synchronous and manages its own event loop internally, so the `new_event_loop()` and `set_event_loop()` here are unused and just create an extra loop per call. You can safely drop those two lines:

```diff
-        try:
-            import asyncio
-
-            loop = asyncio.new_event_loop()
-            asyncio.set_event_loop(loop)
-            browser_response = test_base.make_request_with_browser_fallback(use_browser=True)
+        try:
+            browser_response = test_base.make_request_with_browser_fallback(use_browser=True)
```

Everything else in the endpoint looks consistent with the base service behavior.
app/services/base.py (2)
743-805: Browser fallback validation is stricter; consider a minor loop-hygiene tweak.

The updated `make_request_with_browser_fallback` behavior is aligned with the PR goals:

- Guarding on `not PLAYWRIGHT_AVAILABLE or _browser_scraper is None` before attempting fallback avoids crashes when Playwright isn't installed.
- Treating `html_content` that is empty or < 100 chars as invalid and raising `HTTPException(500, ...)` is reasonable for Transfermarkt pages and ensures `_monitor.record_browser_request(success=False)` is triggered via the `except` block.
- Re-raising the original `http_error` keeps the HTTP error semantics for callers when browser fallback also fails.

One optional improvement: you're creating a fresh event loop via `asyncio.new_event_loop()` and never closing it. In high-throughput environments this can leak resources. If this code path starts being hit frequently, consider:

```python
loop = asyncio.new_event_loop()
try:
    asyncio.set_event_loop(loop)
    html_content = loop.run_until_complete(_browser_scraper.scrape_with_fallback(url))
finally:
    loop.close()
```

This keeps the current behavior but cleans up the loop explicitly.
892-895: Block detection threshold and small-content handling meet objectives; consider HTTPException instead of ValueError.

- Raising `_detect_block`'s size threshold to `len(response.text) < 5000` when `"transfermarkt"` is absent matches your observation that normal pages are 50k+ bytes and will catch more small block pages.
- `request_url_page` now re-checks the parsed text via `bsoup.get_text()` and, on small content (< 5000 and no "transfermarkt"), forces the browser path, then validates that the browser HTML is non-empty and also ≥ 5000 chars. That's consistent with the new block heuristic and should prevent silently returning empty/skeleton pages.
- Final conversion via `convert_bsoup_to_page` plus the explicit `page is None` guard provides clear 500 errors when parsing fails.

One behavior detail to be aware of:

```python
if content_len < 5000 and "transfermarkt" not in content_text.lower():
    ...
    raise ValueError("Content too small, likely blocked")
```

If both HTTP and browser fallback ultimately fail, this `ValueError` becomes the `http_error` that is re-raised, so callers will see a plain `ValueError` instead of an `HTTPException`. For FastAPI endpoints that don't special-case it, this still surfaces as a 500, but you lose structured status/detail. If you want consistent API error typing, you could switch this to `HTTPException`:

```diff
-        if content_len < 5000 and "transfermarkt" not in content_text.lower():
-            print(
-                f"Suspiciously small content ({content_len} bytes) for {self.URL}, trying browser fallback",
-            )
-            raise ValueError("Content too small, likely blocked")
+        if content_len < 5000 and "transfermarkt" not in content_text.lower():
+            print(
+                f"Suspiciously small content ({content_len} bytes) for {self.URL}, trying browser fallback",
+            )
+            raise HTTPException(
+                status_code=403,
+                detail=f"Content too small ({content_len} bytes) and likely blocked for {self.URL}",
+            )
```

This keeps the detection behavior the same while preserving HTTP semantics.
Also applies to: 977-1032
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- `Dockerfile` (1 hunks)
- `app/api/endpoints/clubs.py` (1 hunks)
- `app/api/endpoints/competitions.py` (1 hunks)
- `app/main.py` (2 hunks)
- `app/schemas/clubs/players.py` (1 hunks)
- `app/services/base.py` (10 hunks)
- `app/services/clubs/players.py` (3 hunks)
- `app/services/clubs/search.py` (2 hunks)
- `app/services/competitions/search.py` (1 hunks)
- `app/utils/xpath.py` (1 hunks)
- `requirements.txt` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
app/api/endpoints/competitions.py (1)
- app/services/competitions/search.py (2): `TransfermarktCompetitionSearch` (9-117), `search_competitions` (105-117)

app/api/endpoints/clubs.py (1)
- app/services/clubs/search.py (2): `TransfermarktClubSearch` (9-109), `search_clubs` (96-109)

app/main.py (1)
- app/services/base.py (4): `TransfermarktBase` (713-1223), `make_request` (806-868), `make_request_with_browser_fallback` (743-804), `request_url_page` (977-1032)

app/services/competitions/search.py (1)
- app/services/base.py (1): `request_url_page` (977-1032)

app/services/clubs/players.py (3)
- app/utils/utils.py (1): `extract_from_url` (19-46)
- app/services/base.py (1): `get_text_by_xpath` (1082-1144)
- app/utils/xpath.py (4): `Clubs` (112-212), `Clubs` (265-300), `Players` (1-109), `Players` (163-205)

app/services/clubs/search.py (2)
- app/services/base.py (2): `request_url_page` (977-1032), `get_list_by_xpath` (1047-1080)
- app/utils/xpath.py (5): `Clubs` (112-212), `Clubs` (265-300), `Search` (59-77), `Search` (148-161), `Search` (244-263)
🔇 Additional comments (8)
app/utils/xpath.py (1)
177-181: LGTM! More resilient XPath selectors.

The updated selectors improve robustness:

- `POSITIONS` now uses `contains(@class, 'zentriert')` to handle additional CSS classes
- `DOB_AGE` now pattern-matches cells containing the date format (DD/MM/YYYY) to reduce false positives

These changes align well with the PR's goal of improving scraping reliability.
app/schemas/clubs/players.py (1)
27-28: LGTM! Clean schema extension.

The new optional fields (`name` and `season_id`) extend the schema without breaking backward compatibility. This aligns well with the service layer changes in `app/services/clubs/players.py` that populate these fields.

app/services/clubs/players.py (3)
43-51: Good defensive extraction with fallback.

The try/except wrapper around season ID extraction is appropriate: it tolerates extraction failures while preserving the original `season_id`. This aligns well with the PR's robustness improvements.
260-264: LGTM! Response population aligns with schema.

Populating `seasonId` and `name` in the response aligns with the schema extensions in `app/schemas/clubs/players.py`. The defensive check for `club_name` before assignment is good practice.
30-34: Defaulting to the current year is appropriate and well-handled.

Using the current year as the default for `season_id` is a reasonable approach, and the implementation validates this choice. The code includes error recovery via `__update_season_id()`, which extracts the actual season from the page and updates the value if it differs from the default. This pattern is applied consistently across both `TransfermarktClubPlayers` and `TransfermarktClubCompetitions`, is validated by existing tests with `season_id=None`, and allows callers to explicitly specify a season when needed. The default fallback is appropriate for a sports data API where the current season is typically the most frequently requested.

app/main.py (1)
63-70: Playwright-availability guard in test endpoint is sound.

Checking `PLAYWRIGHT_AVAILABLE` and `_browser_scraper is None` before using the browser scraper prevents runtime errors when Playwright isn't installed or configured, and returning a structured error payload is appropriate for this test route. You might eventually expand the error message to mention that the Playwright browser binaries must also be installed (e.g., `playwright install chromium`), but functionally this is fine.

app/services/base.py (2)
16-23: Playwright optional integration and monitoring wiring look correct.

- The guarded import with `PLAYWRIGHT_AVAILABLE` plus `async_playwright = None` is a solid pattern for making Playwright truly optional.
- `PlaywrightBrowserScraper.scrape_with_browser` explicitly raising `ImportError` when `PLAYWRIGHT_AVAILABLE` is False prevents accidental usage if someone constructs the class manually.
- `_browser_scraper` being `Optional[PlaywrightBrowserScraper]` and only instantiated when `PLAYWRIGHT_AVAILABLE` matches the new guards in callers.
- `get_monitoring_stats` now reports `browser_scraping_available` and user-agent/viewport counts in a way that reflects these flags.

No functional issues; this is a good step toward clean optional browser scraping.
Also applies to: 387-403, 565-566, 1215-1222
1063-1075: Improved XPath error handling is helpful for debugging.

Enhancing `get_list_by_xpath` to:

- include `URL` in the "Page not initialized" error, and
- wrap `self.page.xpath(xpath)` in a try/except that logs the offending XPath + URL and raises an `HTTPException` with detailed context

will make diagnosing structure changes or invalid XPaths much easier. This is consistent with the new debug endpoints and other error messages; no changes needed.
```text
watchfiles==1.0.3 ; python_version >= "3.9" and python_version < "4.0"
websockets==14.1 ; python_version >= "3.9" and python_version < "4.0"
wrapt==1.17.0 ; python_version >= "3.9" and python_version < "4.0"
playwright==1.48.0 ; python_version >= "3.9" and python_version < "4.0"
```
🧩 Analysis chain
🌐 Web query:
latest stable version of Playwright Python December 2025
💡 Result:
The latest stable Playwright for Python is v1.56.0 (released Nov 11, 2025). [1][2]
Sources:
- GitHub releases (microsoft/playwright-python) — v1.56.0 release. [1]
- Playwright Python release notes — Version 1.56. [2]
🌐 Web query:
Playwright 1.48.0 security vulnerabilities CVE
💡 Result:
Yes — Playwright versions before 1.55.1 (including 1.48.0) are affected by a reported vulnerability tracked as CVE-2025-59288 (improper verification of cryptographic signature in the browser download/install scripts). Upgrade to Playwright 1.55.1 or later to receive the patch. [1][2][3]
Sources:
- NVD entry for CVE-2025-59288. [1]
- Wiz vulnerability analysis (summary and fix in 1.55.1). [2]
- Snyk advisories noting the affected Playwright / playwright-core versions and recommended upgrade. [3]
Upgrade Playwright to version 1.56.0; 1.48.0 is affected by CVE-2025-59288.
Playwright 1.48.0 is significantly outdated (current latest is 1.56.0, released Nov 11, 2025) and contains a security vulnerability (CVE-2025-59288) affecting improper verification of cryptographic signatures in browser download/install scripts. Upgrade to at least 1.55.1, preferably 1.56.0.
🤖 Prompt for AI Agents
In requirements.txt around line 36, the Playwright pin uses 1.48.0 which is
vulnerable (CVE-2025-59288); update the version spec to at least 1.55.1 but
preferably to 1.56.0 (e.g. playwright==1.56.0 ; python_version >= "3.9" and
python_version < "4.0"), preserve the existing environment marker, then
regenerate any lock/constraints files and run your test suite and CI to ensure
compatibility with the new Playwright release.
Fix

Problem

The server received small responses (2403 bytes) from Transfermarkt that were not detected as blocks, which resulted in empty results. Normal Transfermarkt pages are 50k+ bytes.

Solution

- Raised the `_detect_block()` size threshold so small block pages are detected

Changes

- `app/services/base.py`: raised the `_detect_block()` threshold from 1000 to 5000 bytes; added small-content validation in `request_url_page()`

Testing

Related Issues

Fixes the issue where the server received small block pages (2403 bytes) that went undetected, which resulted in empty results for all endpoints (club search, club players, etc.).
Summary by CodeRabbit
New Features
Bug Fixes
Chores