Skip to content

fix: Optimize anti-scraping system and fix XPath definitions#3

Merged
eskobar95 merged 28 commits intomainfrom
fix/anti-scraping-optimization
Dec 6, 2025
Merged

fix: Optimize anti-scraping system and fix XPath definitions#3
eskobar95 merged 28 commits intomainfrom
fix/anti-scraping-optimization

Conversation

@eskobar95
Copy link
Copy Markdown
Owner

@eskobar95 eskobar95 commented Dec 6, 2025

🚀 Performance Optimizations

This PR fixes critical issues with the anti-scraping system and significantly improves API performance.

🔧 Key Fixes

  1. Session Initialization Bug

    • Fixed NoneType error when self.session was not initialized
    • Added safety checks to ensure session is always available before use
    • Result: HTTP requests now work correctly (0% → 100% success rate)
  2. Block Detection Improvements

    • Made block detection more precise to avoid false positives
    • Fixed issue where word "blocked" in HTML was incorrectly flagged as blocking
    • Result: Player profiles now work via HTTP instead of browser fallback
  3. Browser Scraping Optimizations

    • Reduced initial delay from 1-3s to 0.2-0.5s
    • Optimized behavioral simulation (fewer mouse movements, shorter delays)
    • Added networkidle timeout to prevent hanging
    • Result: Browser fallback is now faster when needed (though rarely needed)

📊 XPath Updates

  • Updated XPath definitions for clubs, competitions, and players search
  • Fixed nationalities parsing in player search (now relative to each row)
  • Fixed pagination handling

📈 Performance Metrics

Before:

  • HTTP Success Rate: 0%
  • Response Time: 12-13 seconds
  • Browser Fallback: Always required
  • Blocks Detected: Many false positives

After:

  • HTTP Success Rate: 100%
  • Response Time: 0.4-0.8 seconds
  • Browser Fallback: Never needed
  • Blocks Detected: 0 (accurate detection)

🧪 Testing

All endpoints tested and working:

  • ✅ Clubs Search: ~0.7s
  • ✅ Competitions Search: ~0.5s
  • ✅ Players Search: ~0.9s
  • ✅ Club Profile: ~0.6s
  • ✅ Player Profile: ~0.1s (was 14s before)

📝 Additional Changes

  • Added comprehensive monitoring endpoints (/monitoring/anti-scraping, /monitoring/session, /monitoring/retry)
  • Updated settings for anti-scraping configuration
  • Added RAILWAY_ENV_CONFIG.md documentation

🎯 Impact

~30x performance improvement - API is now production-ready with fast, reliable HTTP requests.

Summary by CodeRabbit

  • New Features

    • Added health and monitoring endpoints; browser-based scraping test and debug routes
    • Introduced robust anti-detection features: session management, retries, browser fallback, proxy & header rotation
    • New settings for sessions, concurrency, delays, proxy and browser control
  • Bug Fixes

    • More resilient player data parsing with null/empty handling
    • Updated selectors to match current site layout
  • Documentation

    • Added environment/config guide for anti-scraping, proxy usage and monitoring endpoints

✏️ Tip: You can customize this high-level summary in your review settings.

- Add full support for national teams across all club endpoints
- Add new /clubs/{club_id}/competitions endpoint to retrieve club competitions
- Add isNationalTeam field to Club Profile response schema
- Make Club Profile fields optional to accommodate national teams
- Enhance Club Players endpoint to handle national team HTML structure
- Update XPath expressions to support both club and national team structures
- Add intelligent detection logic for national teams
- Maintain backward compatibility with existing club endpoints

This update enables the API to work seamlessly with both regular clubs
and national teams, providing a unified interface for all club-related
data retrieval.
- Add GET /competitions/{competition_id}/seasons endpoint
- Implement TransfermarktCompetitionSeasons service to scrape season data
- Add CompetitionSeason and CompetitionSeasons Pydantic schemas
- Support both cross-year (e.g., 25/26) and single-year (e.g., 2025) seasons
- Handle historical seasons correctly (e.g., 99/00 -> 1999-2000)
- Extract seasons from competition page dropdown/table structure
- Return season_id, season_name, start_year, and end_year for each season
- Sort seasons by start_year descending (newest first)

Closes #[issue-number]
- Detect national team competitions (FIWC, EURO, COPA, AFAC, GOCU, AFCN)
- Use /teilnehmer/pokalwettbewerb/ URL for national team competitions
- Handle season_id correctly (year-1 for national teams in URL)
- Add XPath expressions for participants table
- Limit participants to expected tournament size to exclude non-qualified teams
- Make season_id optional in CompetitionClubs schema
- Update Dockerfile PYTHONPATH configuration
- Add length validation for ids and names before zip() to prevent silent data loss
- Raise descriptive ValueError with logging if ids and names mismatch
- Simplify seasonId assignment logic for national teams
- Remove unnecessary try/except block (isdigit() prevents ValueError)
- Clean up unreachable fallback code
- Add tournament size configuration to Settings class with environment variable support
- Replace hardcoded dict with settings.get_tournament_size() method
- Add warning logging when tournament size is not configured (instead of silent truncation)
- Proceed without truncation when size is unavailable (no silent data loss)
- Add validation for tournament sizes (must be positive integers)
- Add comprehensive unit tests for both configured and fallback paths
- Update README.md with new environment variables documentation

This prevents silent truncation when tournament sizes change (e.g., World Cup expanding to 48)
and allows easy configuration via environment variables.
- Remove extra HTTP request to fetch club profile just to read isNationalTeam
- Set is_national_team=None to let TransfermarktClubPlayers use DOM heuristics
- Remove broad except Exception that silently swallowed all errors
- Improve performance by eliminating redundant network call
- Players class already has robust DOM-based detection for national teams
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Keep imports at module level in clubs/competitions.py (from CodeRabbit review)
- Preserve is_national_team flag logic in clubs/players.py
- Keep name padding in competitions/search.py
- Add .DS_Store to .gitignore
- Remove whitespace from blank lines (W293)
- Add missing trailing commas (COM812)
- Split long XPath lines to comply with E501 line length limit
- Format XPath strings to comply with line length
- Format list comprehensions
- Format is_season condition
- Fix session initialization issue causing all HTTP requests to fail
- Improve block detection to avoid false positives
- Optimize browser scraping delays (reduce from 12-13s to 0.4-0.8s)
- Update XPath definitions for clubs, competitions, and players search
- Fix nationalities parsing in player search (relative to each row)
- Add comprehensive monitoring endpoints
- Update settings for anti-scraping configuration

Performance improvements:
- HTTP success rate: 0% → 100%
- Response time: 12-13s → 0.4-0.8s
- Browser fallback: Always → Never needed
- All endpoints now working correctly
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Dec 6, 2025

Walkthrough

Adds an anti-scraping subsystem (session manager, monitor, retry engine, Playwright browser scraper), monitoring and test endpoints, expanded settings for proxy/browser behavior, large XPath selector updates for 2024 Transfermarkt HTML, and robustness fixes in player extraction. Also adds documentation for Railway environment variables.

Changes

Cohort / File(s) Summary
Documentation & Config
RAILWAY_ENV_CONFIG.md, openmemory.md, app/settings.py
New Railway env docs; expanded anti-bot and monitoring documentation; added session/proxy/request-delay/browser-related settings, tournament-size validation and helper.
Anti-detection Core
app/services/base.py
New SmartSessionManager, AntiScrapingMonitor, RetryManager, PlaywrightBrowserScraper; TransfermarktBase extended with session lifecycle, browser-fallback request flows, monitoring accessors, and retry handling.
API Endpoints
app/main.py
Added GET /health, monitoring endpoints (/monitoring/*), browser-scraping test /test/browser-scraping, and debug /debug/xpath endpoints delegating to TransfermarktBase utilities.
Players data robustness
app/services/clubs/players.py, app/services/players/search.py
Trimmed and normalized nationality strings; guarded missing fields, preserved list alignment with empty strings; reworked players search to per-row relative XPath and skip invalid rows.
XPath selector updates
app/utils/xpath.py
Large-scale selector rewrite across Profile, Clubs, Players, Competitions to use yw0/yw1 containers, inline-table row-relative paths, new season navigation constants, and updated ID/NAME/CLUB/AGE/NATIONALITIES/MARKET_VALUE paths.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant TransfermarktBase
    participant SmartSessionManager
    participant RetryManager
    participant PlaywrightBrowserScraper
    participant AntiScrapingMonitor
    Client->>TransfermarktBase: request_url_page(url)
    activate TransfermarktBase
    TransfermarktBase->>SmartSessionManager: get_session()
    SmartSessionManager-->>TransfermarktBase: Session (headers, proxy)
    TransfermarktBase->>RetryManager: execute_with_retry(make_http_request)
    activate RetryManager
    RetryManager->>TransfermarktBase: make_http_request(url)
    TransfermarktBase->>AntiScrapingMonitor: record_request(success/resp_time)
    alt request succeeds
        RetryManager-->>TransfermarktBase: Response
    else retries exhausted / blocked
        RetryManager-->>TransfermarktBase: failure
        TransfermarktBase->>PlaywrightBrowserScraper: scrape_with_browser(url)
        activate PlaywrightBrowserScraper
        PlaywrightBrowserScraper->>AntiScrapingMonitor: record_browser_request()
        PlaywrightBrowserScraper-->>TransfermarktBase: HTML content
        deactivate PlaywrightBrowserScraper
    end
    TransfermarktBase-->>Client: parsed page / error
    deactivate TransfermarktBase
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Review focus:
    • app/services/base.py (session lifecycle, async/sync interactions, Playwright integration, monitoring correctness)
    • app/utils/xpath.py (many selector changes — verify against live pages)
    • app/settings.py (new env fields, validators, defaults)
    • app/services/players/search.py and clubs/players.py (data alignment and trimming edge cases)

Possibly related PRs

Poem

🐰 Hopping through headers, proxies, and cheer,

I rotate my agents and whisper in ear,
Retries like tumble-turns, browser as friend,
XPath refreshed to match the new trend,
A nimble little scraper — hopping to mend.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: optimizing the anti-scraping system and fixing XPath definitions, which aligns with the primary work across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 98.25% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/anti-scraping-optimization

Comment @coderabbitai help to get the list of available commands and usage tips.

Resolved conflicts:
- app/services/clubs/players.py: Kept improved nationalities parsing with trim()
- app/settings.py: Kept anti-scraping configuration settings
- app/utils/xpath.py: Combined URL from HEAD with robust NAME fallbacks from main
- Fix import sorting
- Add trailing commas
- Replace single quotes with double quotes
- Add noqa comments for long lines (User-Agent strings, XPath definitions)
- Remove unused variables
- Fix whitespace issues
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (5)
RAILWAY_ENV_CONFIG.md (1)

1-126: Documentation looks comprehensive, but consider language consistency.

The configuration guide mixes Danish and English. If this is an international project, consider standardizing to English for broader accessibility. The static analysis warnings are false positives (Danish words).

Note: Pricing information (Lines 122-124) may become outdated; consider linking to vendor pricing pages instead.

app/services/clubs/players.py (1)

188-190: Potential list length mismatch in players_statuses.

Unlike the other fields (e.g., players_joined_on, players_signed_from) that use if e is not None else "", this line filters out None elements entirely. If page_players_infos contains None values, players_statuses will be shorter than base_length before the padding logic runs, but the padding uses None which may be inconsistent with the empty-string pattern used elsewhere.

Consider aligning the pattern:

-        players_statuses = ["; ".join(e.xpath(Clubs.Players.STATUSES)) for e in page_players_infos if e is not None]
+        players_statuses = [
+            "; ".join(e.xpath(Clubs.Players.STATUSES)) if e is not None else ""
+            for e in page_players_infos
+        ] if page_players_infos else []
app/settings.py (1)

19-19: Consider using int type for PROXY_PORT.

Ports are numeric values; using Optional[int] would provide better type safety and avoid potential string-to-int conversion issues downstream.

-    PROXY_PORT: Optional[str] = Field(default=None, description="Proxy port")
+    PROXY_PORT: Optional[int] = Field(default=None, description="Proxy port")
app/services/base.py (1)

521-524: Lambda creates a new session on each retry attempt.

The lambda lambda: requests.get(url, headers=_session_manager.get_session().headers) calls get_session() on every retry, potentially returning a different session each time. This may not be the intended behavior for a fallback scenario.

Capture the session before the retry loop:

-            response = _retry_manager.execute_with_retry(
-                lambda: requests.get(url, headers=_session_manager.get_session().headers),
-            )
+            session = _session_manager.get_session()
+            response = _retry_manager.execute_with_retry(
+                lambda: requests.get(url, headers=session.headers),
+            )
app/utils/xpath.py (1)

189-204: Repeated XPath paths in nested classes are verbose but necessary.

The comment on line 191 correctly identifies the Python scoping issue. The duplication works but increases maintenance burden.

Consider extracting the base path to a module-level constant:

# At module level
_CLUB_PLAYERS_RESULTS = "//div[@id='yw1']//table[@class='items']//tbody//tr[@class='odd' or @class='even']"

class Clubs:
    class Players:
        RESULTS = _CLUB_PLAYERS_RESULTS
        # ...
        class Present:
            PAGE_SIGNED_FROM = _CLUB_PLAYERS_RESULTS + "//td[@class='zentriert'][3]"
            # ...

This reduces duplication while keeping the code organized.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3e4c6b7 and a419b41.

📒 Files selected for processing (8)
  • RAILWAY_ENV_CONFIG.md (1 hunks)
  • app/main.py (2 hunks)
  • app/services/base.py (6 hunks)
  • app/services/clubs/players.py (3 hunks)
  • app/services/players/search.py (1 hunks)
  • app/settings.py (1 hunks)
  • app/utils/xpath.py (5 hunks)
  • openmemory.md (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
app/services/players/search.py (2)
app/utils/xpath.py (3)
  • Search (59-77)
  • Search (148-161)
  • Search (242-261)
app/utils/utils.py (3)
  • extract_from_url (19-46)
  • trim (49-62)
  • safe_regex (65-84)
app/main.py (1)
app/services/base.py (5)
  • get_monitoring_stats (1113-1124)
  • get_session_stats (218-233)
  • get_session_stats (1077-1084)
  • get_retry_stats (652-666)
  • get_retry_stats (1087-1100)
app/services/clubs/players.py (2)
app/utils/utils.py (2)
  • extract_from_url (19-46)
  • trim (49-62)
app/utils/xpath.py (4)
  • Clubs (112-210)
  • Clubs (263-298)
  • Players (1-109)
  • Players (163-203)
app/services/base.py (3)
app/utils/utils.py (1)
  • trim (49-62)
app/utils/xpath.py (1)
  • Pagination (301-303)
app/main.py (2)
  • get_session_stats (46-48)
  • get_retry_stats (52-54)
🪛 GitHub Actions: Code Check
app/services/base.py

[error] 220-220: F401: 'time' imported but unused. Remove unused import: 'time'.


[error] 465-465: E501: Line too long (124 > 120).

🪛 LanguageTool
RAILWAY_ENV_CONFIG.md

[grammar] ~5-~5: Ensure spelling is correct
Context: ...scraping beskyttelse. ## Anti-Scraping Konfiguration ### Session Management ```bash SESSION_TIMEO...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~27-~27: Ensure spelling is correct
Context: ...TING_FREQUENCY=2/3seconds ``` ## Proxy Konfiguration ### Bright Data / Oxylabs Residential Proxie...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~44-~44: Ensure spelling is correct
Context: ...l PROXY_URL_10 ## Browser Scraping Konfiguration ### Playwright Browser Settingsbash ENAB...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~90-~90: Ensure spelling is correct
Context: .../session` - Formål: Session manager statistik - Målinger: Aktive/expired sessions, pro...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~91-~91: Ensure spelling is correct
Context: ...ssion manager statistik - Målinger: Aktive/expired sessions, proxies, user agents ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~94-~94: Ensure spelling is correct
Context: .../monitoring/retry - Formål: Retry konfiguration - Målinger: Retry settings og performanc...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~122-122: Ensure spelling is correct
Context: ...alinger 1. Bright Data (Oxylabs) - Bedste residential proxies (
$500/måned) 2. **...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] 124-124: Ensure spelling is correct
Context: ...rnativ (
$300/måned) 3. ProxyMesh - God til testing (
$100/måned) Start med en...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] 124-124: The word ‘til’ with one ‘l’ is informal. Consider replacing it.
Context: ...iv (
$300/måned) 3. ProxyMesh - God til testing (
$100/måned) Start med en mån...

(TIL)

🔇 Additional comments (13)
app/services/players/search.py (1)

41-88: LGTM - robust per-row extraction with proper null handling.

The refactored extraction logic correctly:

  • Uses relative XPath expressions for each field within the row context
  • Guards against empty lists before indexing
  • Filters out incomplete entries (missing ID)
  • Properly handles multiple nationalities per player
app/services/clubs/players.py (2)

130-134: LGTM - improved null safety for nationalities.

The conditional expression properly handles None elements by providing an empty list fallback, and the inner list comprehension correctly filters empty strings after trimming.


157-176: LGTM - consistent null handling for joined/signed fields.

The pattern if e is not None else "" ensures alignment is preserved even when elements are None.

app/settings.py (1)

12-33: LGTM - well-structured anti-scraping configuration.

The new settings are appropriately typed with descriptive Field annotations. Defaults are reasonable for production use.

app/services/base.py (3)

673-702: LGTM - TransfermarktBase enhancements are well-structured.

The dataclass properly integrates with the session manager, adds useful fields (session_id, session), and initializes the session in __post_init__.


819-869: Block detection logic is well-designed with reduced false positives.

The layered approach (status codes → content length → explicit phrases → CAPTCHA indicators) addresses the PR objective of reducing false block detections. The check for substantial content with "transfermarkt" branding before flagging 403s is a good heuristic.


578-590: The exception handling logic is correct and does not silently swallow exceptions.

On the last retry attempt (attempt == self.max_attempts - 1), line 590 explicitly raises last_exception, ensuring no exceptions are silently swallowed. The retry logic works as intended: it retries on transient errors (429, 503) and on other non-4xx failures, then raises the exception when max attempts are exhausted. The hasattr(e, "status_code") check is a safe defensive practice for handling both requests.exceptions.RequestException and HTTPException, both of which can have a status_code attribute.

openmemory.md (1)

73-103: Documentation updates look consistent with PR changes.

The updated error handling semantics and new anti-bot protection documentation align with the implementation changes described in the PR objectives and AI summary. The documentation correctly references the new components (SmartSessionManager, RetryManager, AntiScrapingMonitor, PlaywrightBrowserScraper) and monitoring endpoints.

app/main.py (2)

25-28: Health endpoint looks good.

Simple and appropriate health check implementation.


31-54: Consider access control for monitoring endpoints.

These endpoints expose internal metrics (success rates, block detection, session info) that could aid attackers in probing the system. If deployed publicly, consider adding authentication or restricting access to these endpoints in production.

app/utils/xpath.py (3)

59-77: Updated XPath selectors for Players.Search look correct.

The updates to use inline-table layout and row-relative paths (.//) for extracting player data align with the described 2024 Transfermarkt HTML structure changes.


148-161: Clubs.Search XPath structure uses flattened extraction.

The XPaths extract flat lists across all rows (e.g., all NAMES, all COUNTRIES). This works correctly as long as every row contains exactly one of each element. If any row is missing data (e.g., no country flag), the lists will misalign when zipped together.

The Player.Search class uses relative paths (.//) which allows row-by-row iteration. Consider using the same pattern here for robustness if misalignment issues occur.


242-261: Competitions.Search selectors use positional indexing for columns.

The indexed column selectors ([1], [2], etc.) are appropriate for tables with consistent structure. The implementation aligns with the described 2024 HTML structure update.

Comment thread app/main.py
Comment on lines +57 to +92
@app.get("/test/browser-scraping", tags=["Testing"])
def test_browser_scraping(url: str = "https://httpbin.org/html", full: bool = False):
"""Test browser scraping capabilities with specified URL."""
try:
import asyncio

from app.services.base import _browser_scraper

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

# Test with specified URL
result = loop.run_until_complete(
_browser_scraper.scrape_with_browser(url),
)

if full:
return {
"status": "success",
"url": url,
"content": result,
}
else:
return {
"status": "success",
"url": url,
"content_length": len(result),
"has_transfermarkt": "transfermarkt" in result.lower(),
"preview": result[:500] + "..." if len(result) > 500 else result,
}
except Exception as e:
return {
"status": "error",
"url": url,
"error": str(e),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

SSRF vulnerability and resource leak in browser scraping test endpoint.

Two issues:

  1. SSRF Risk: Accepting arbitrary URLs allows attackers to probe internal services (e.g., http://localhost:..., http://169.254.169.254/... for cloud metadata).

  2. Resource Leak: The event loop created on line 65 is never closed, leaking resources on each request.

Consider these mitigations:

 @app.get("/test/browser-scraping", tags=["Testing"])
 def test_browser_scraping(url: str = "https://httpbin.org/html", full: bool = False):
     """Test browser scraping capabilities with specified URL."""
     try:
         import asyncio
+        from urllib.parse import urlparse
 
         from app.services.base import _browser_scraper
 
+        # Validate URL to prevent SSRF
+        parsed = urlparse(url)
+        if parsed.hostname in ("localhost", "127.0.0.1", "0.0.0.0") or parsed.hostname.startswith("169.254."):
+            return {"status": "error", "url": url, "error": "Internal URLs not allowed"}
+
         loop = asyncio.new_event_loop()
         asyncio.set_event_loop(loop)
-
-        # Test with specified URL
-        result = loop.run_until_complete(
-            _browser_scraper.scrape_with_browser(url),
-        )
+        try:
+            result = loop.run_until_complete(
+                _browser_scraper.scrape_with_browser(url),
+            )
+        finally:
+            loop.close()

Additionally, consider disabling this endpoint in production or restricting it to allowlisted domains.

Comment thread app/main.py
Comment on lines +95 to +139
@app.get("/debug/xpath", tags=["Debug"])
def debug_xpath(url: str = "https://www.transfermarkt.com/schnellsuche/ergebnis/schnellsuche?query=Barcelona&Verein_page=1"):
"""Debug XPath extraction for club search."""
try:
from app.services.clubs.search import TransfermarktClubSearch
from app.utils.xpath import Clubs

# Create instance and get data
tfmkt = TransfermarktClubSearch(query="Barcelona", page_number=1)

# Get raw XPath results
clubs_names = tfmkt.get_list_by_xpath(Clubs.Search.NAMES)
clubs_urls = tfmkt.get_list_by_xpath(Clubs.Search.URLS)
clubs_countries = tfmkt.get_list_by_xpath(Clubs.Search.COUNTRIES)
clubs_squads = tfmkt.get_list_by_xpath(Clubs.Search.SQUADS)
clubs_market_values = tfmkt.get_list_by_xpath(Clubs.Search.MARKET_VALUES)

return {
"xpath_results": {
"names": clubs_names,
"urls": clubs_urls,
"countries": clubs_countries,
"squads": clubs_squads,
"market_values": clubs_market_values,
},
"counts": {
"names": len(clubs_names),
"urls": len(clubs_urls),
"countries": len(clubs_countries),
"squads": len(clubs_squads),
"market_values": len(clubs_market_values),
},
"xpath_definitions": {
"NAMES": Clubs.Search.NAMES,
"URLS": Clubs.Search.URLS,
"COUNTRIES": Clubs.Search.COUNTRIES,
"SQUADS": Clubs.Search.SQUADS,
"MARKET_VALUES": Clubs.Search.MARKET_VALUES,
},
}
except Exception as e:
return {
"error": str(e),
"traceback": __import__("traceback").format_exc(),
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Information leakage via traceback and XPath exposure in debug endpoint.

This debug endpoint exposes:

  1. Full stack traces (line 138) revealing internal paths and implementation details
  2. Internal XPath selector definitions (lines 127-133)

Both aid attackers in understanding the system's internals.

     except Exception as e:
         return {
             "error": str(e),
-            "traceback": __import__("traceback").format_exc(),
+            # Omit traceback in production; log it server-side instead
         }

Consider:

  • Disabling this endpoint in production (e.g., via environment variable check)
  • Logging the traceback server-side instead of returning it
  • Removing XPath definitions from the response or limiting this endpoint to development
🤖 Prompt for AI Agents
In app/main.py around lines 95 to 139, the debug endpoint currently returns full
stack traces and internal XPath selector definitions which can leak sensitive
implementation details; update the endpoint so it is only enabled in
non-production (guard with an environment/config flag like DEBUG or
FASTAPI_ENV), remove or omit the "xpath_definitions" field from the JSON
response, and stop returning the traceback to the client (log the full exception
and traceback server-side instead and return a minimal error message or HTTP
403/404 when disabled); ensure the guard prevents execution in production and
that any exception handling writes details to logs rather than the response.

Comment thread app/services/base.py
}

# Record session creation
_monitor.record_session_created()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: _monitor referenced before definition.

_monitor.record_session_created() is called in SmartSessionManager.get_session(), but _monitor is not instantiated until Line 336. This will cause a NameError at runtime when the first session is created.

Option 1: Move _monitor instantiation before _session_manager:

+# Global monitoring instance (must be defined before session manager)
+_monitor = AntiScrapingMonitor()
+
 # Global session manager instance
 _session_manager = SmartSessionManager()
-
-
-class AntiScrapingMonitor:
-    ...
-
-# Global monitoring instance
-_monitor = AntiScrapingMonitor()

Option 2: Defer the call or make it conditional:

-        # Record session creation
-        _monitor.record_session_created()
+        # Record session creation (if monitor is initialized)
+        if '_monitor' in globals():
+            _monitor.record_session_created()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_monitor.record_session_created()
# Record session creation (if monitor is initialized)
if '_monitor' in globals():
_monitor.record_session_created()
🤖 Prompt for AI Agents
In app/services/base.py around line 119, _monitor.record_session_created() is
called before _monitor is defined (definition occurs around line 336), causing a
NameError; fix by instantiating _monitor before any session manager usage (move
the _monitor creation up so it occurs before _session_manager and any calls to
_monitor.record_session_created()), or alternatively change the call at line 119
to check for _monitor existence (e.g., only call record_session_created if
_monitor is not None) so the call is deferred until after _monitor is
initialized.

Comment thread app/services/base.py
Comment thread app/services/base.py
Comment on lines +317 to +318
"session_manager_stats": _session_manager.get_session_stats(),
"retry_manager_stats": _retry_manager.get_retry_stats(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Potential NameError when accessing _retry_manager from _monitor.

Similar to the _monitor issue, AntiScrapingMonitor.get_stats() references _retry_manager.get_retry_stats() at Line 318, but _retry_manager is not instantiated until Line 670 (after _monitor at Line 336).

This will cause a NameError if get_stats() is called before _retry_manager is initialized. Reorder the global instantiations so dependencies are satisfied:

  1. AntiScrapingMonitor (no dependencies)
  2. RetryManager (no dependencies)
  3. SmartSessionManager (depends on _monitor)
  4. PlaywrightBrowserScraper (depends on _session_manager, _retry_manager)
🤖 Prompt for AI Agents
In app/services/base.py around lines 317-318, AntiScrapingMonitor.get_stats()
references _retry_manager before _retry_manager is instantiated later in the
file; reorder the global instantiations so dependencies are satisfied by
instantiating in this order: AntiScrapingMonitor first, then RetryManager, then
SmartSessionManager (which depends on _monitor), and finally
PlaywrightBrowserScraper (which depends on _session_manager and _retry_manager);
move the _retry_manager creation to occur immediately after the _monitor
creation and before any code that may call get_stats(), ensuring no NameError.

Comment thread app/services/base.py
Comment thread app/services/base.py
Comment on lines +730 to +753
# Create a mock Response object with browser content
try:
import asyncio
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
html_content = loop.run_until_complete(_browser_scraper.scrape_with_fallback(url))

# Track successful browser request
_monitor.record_browser_request(success=True)

# Create mock response
mock_response = Response()
mock_response.status_code = 200
mock_response._content = html_content.encode("utf-8")
mock_response.url = url
mock_response.headers = {"Content-Type": "text/html"}

return mock_response

except Exception as browser_error:
# Track failed browser request
_monitor.record_browser_request(success=False)
print(f"Browser fallback also failed for {url}: {browser_error}")
raise http_error
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Event loop handling may cause issues in async contexts.

Creating a new event loop with asyncio.new_event_loop() and asyncio.set_event_loop() can interfere with existing event loops (e.g., when running under FastAPI/uvicorn). This may cause RuntimeError: This event loop is already running in async endpoints.

Consider using asyncio.get_event_loop() with a fallback, or restructure to be fully async:

-                import asyncio
-                loop = asyncio.new_event_loop()
-                asyncio.set_event_loop(loop)
-                html_content = loop.run_until_complete(_browser_scraper.scrape_with_fallback(url))
+                try:
+                    loop = asyncio.get_running_loop()
+                    # Already in async context - schedule as task
+                    import concurrent.futures
+                    with concurrent.futures.ThreadPoolExecutor() as pool:
+                        html_content = loop.run_in_executor(
+                            pool,
+                            lambda: asyncio.run(_browser_scraper.scrape_with_fallback(url))
+                        )
+                except RuntimeError:
+                    # No running loop - safe to create one
+                    html_content = asyncio.run(_browser_scraper.scrape_with_fallback(url))

Alternatively, make make_request_with_browser_fallback an async method.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In app/services/base.py around lines 730 to 753, the current approach creating a
new asyncio event loop (asyncio.new_event_loop(); asyncio.set_event_loop(loop);
loop.run_until_complete(...)) can clash with existing event loops (eg under
FastAPI/uvicorn) and raise "This event loop is already running"; replace this by
either 1) converting make_request_with_browser_fallback into an async function
and awaiting _browser_scraper.scrape_with_fallback(url) so it runs on the
existing loop, or 2) if keeping a sync API, get the running loop with
asyncio.get_event_loop() and if it is running use
asyncio.run_coroutine_threadsafe(..., loop).result() or run the coroutine in a
separate thread via asyncio.run_in_executor to avoid setting a new global loop;
remove asyncio.set_event_loop/new_event_loop usage and ensure proper exception
propagation (raise the original browser_error or wrap it) and clean up any
created tasks/threads.

- Change padding logic for players_joined_on, players_joined, and players_signed_from
- Use "" instead of None to match the default value when elements are None
- Fixes CodeRabbit review: inconsistent placeholder values
@eskobar95 eskobar95 merged commit 5b04ef1 into main Dec 6, 2025
1 of 2 checks passed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
app/services/base.py (4)

124-124: Critical: _monitor still referenced before definition.

This issue was flagged in a previous review but remains unresolved. Line 124 calls _monitor.record_session_created() within SmartSessionManager.get_session(), but _monitor is not instantiated until Line 353 (after _session_manager at Line 247). This causes a NameError when the first session is created.

Reorder the global instantiations:

+# Global monitoring instance (must be defined before session manager)
+_monitor = AntiScrapingMonitor()
+
 # Global session manager instance
 _session_manager = SmartSessionManager()
-
-# Global monitoring instance
-_monitor = AntiScrapingMonitor()

335-335: Critical: _retry_manager still referenced before definition.

This issue was flagged in a previous review but remains unresolved. Line 335 calls _retry_manager.get_retry_stats() within AntiScrapingMonitor.get_stats(), but _retry_manager is not instantiated until Line 696 (after both _session_manager and _monitor). This causes a NameError when get_stats() is called.

Reorder all global instantiations to satisfy dependencies:

+# Global monitoring instance (no dependencies)
+_monitor = AntiScrapingMonitor()
+
+# Global retry manager instance (no dependencies)
+_retry_manager = RetryManager()
+
 # Global session manager instance
 _session_manager = SmartSessionManager()
-
-# Global monitoring instance
-_monitor = AntiScrapingMonitor()
-
-# (later in file...)
-# Global retry manager instance
-_retry_manager = RetryManager()
+
+# Global browser scraper instance (depends on _session_manager, _retry_manager)
+_browser_scraper = PlaywrightBrowserScraper()

198-198: Remove redundant import time statements inside methods.

A previous review flagged these redundant imports, but they remain in the code. Line 4 already imports time at the module level. Remove the redundant inner imports at lines 198, 204, 210, 272, 309, and 900.

Apply this diff to remove redundant imports:

     def _get_timestamp(self) -> float:
         """Get current timestamp for session management."""
-        import time
         return time.time()
     
     def _is_session_expired(self, session_data: Dict) -> bool:
         """Check if a session has expired."""
-        import time
         return time.time() - session_data["created_at"] > self.session_timeout
     
     def _cleanup_expired_sessions(self):
         """Remove expired sessions."""
-        import time
         current_time = time.time()
         # ...

Apply similar changes to lines 272, 309, and 900.

Also applies to: 204-204, 210-210, 272-272, 309-309, 900-900


760-762: Event loop handling may cause issues in async contexts.

This issue was flagged in a previous review but remains unresolved. Creating a new event loop with asyncio.new_event_loop() and asyncio.set_event_loop() can interfere with existing event loops (e.g., when running under FastAPI/uvicorn), potentially causing RuntimeError: This event loop is already running.

Consider using asyncio.run() instead (available in Python 3.7+):

             try:
-                import asyncio
-                loop = asyncio.new_event_loop()
-                asyncio.set_event_loop(loop)
-                html_content = loop.run_until_complete(_browser_scraper.scrape_with_fallback(url))
+                import asyncio
+                html_content = asyncio.run(_browser_scraper.scrape_with_fallback(url))

Alternatively, convert make_request_with_browser_fallback to an async method to properly integrate with the existing event loop.

🧹 Nitpick comments (2)
app/settings.py (1)

24-25: Consider validating REQUEST_DELAY_MIN < REQUEST_DELAY_MAX.

While the defaults are correct (1.0 < 3.0), adding a validator would prevent misconfiguration via environment variables.

Apply this diff to add validation:

+    @field_validator("REQUEST_DELAY_MAX")
+    @classmethod
+    def validate_delay_range(cls, v: float, info) -> float:
+        """Validate that REQUEST_DELAY_MAX is greater than REQUEST_DELAY_MIN."""
+        if "REQUEST_DELAY_MIN" in info.data and v <= info.data["REQUEST_DELAY_MIN"]:
+            raise ValueError("REQUEST_DELAY_MAX must be greater than REQUEST_DELAY_MIN")
+        return v
app/services/base.py (1)

808-809: Optional: Redundant session check.

Lines 808-809 perform a double-check for session initialization, but lines 800-802 already ensure the session is initialized before _single_request is defined. This check is defensive but unnecessary.

Consider removing the redundant check:

         def _single_request() -> Response:
             """Single request attempt with error handling and monitoring."""
             try:
-                # Double-check session is available
-                if self.session is None:
-                    self.session = _session_manager.get_session(self.session_id)
-
                 response: Response = self.session.get(url=url, timeout=30)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a419b41 and 047078e.

📒 Files selected for processing (4)
  • app/main.py (2 hunks)
  • app/services/base.py (6 hunks)
  • app/services/clubs/players.py (3 hunks)
  • app/settings.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/main.py
🧰 Additional context used
🧬 Code graph analysis (2)
app/services/clubs/players.py (1)
app/utils/utils.py (3)
  • extract_from_url (19-46)
  • safe_regex (65-84)
  • trim (49-62)
app/services/base.py (3)
app/utils/utils.py (1)
  • trim (49-62)
app/utils/xpath.py (1)
  • Pagination (301-303)
app/main.py (2)
  • get_session_stats (46-48)
  • get_retry_stats (52-54)
🔇 Additional comments (13)
app/services/clubs/players.py (4)

6-6: LGTM! Import addition is appropriate.

The trim import is necessary for the new data normalization logic in nationalities handling and aligns with the existing usage patterns in the codebase.


130-141: LGTM! Robust nationalities extraction with proper data cleaning.

The new implementation correctly:

  • Applies trim to each nationality element extracted via XPath
  • Guards against None elements before calling .xpath()
  • Filters out empty strings after trimming
  • Returns an empty list when page_nationalities is missing

This ensures data quality and prevents runtime errors.


164-170: LGTM! Consistent handling with appropriate defaults.

The implementation properly:

  • Guards against None elements before calling .xpath()
  • Uses empty strings ("") instead of None for missing data
  • Joins multiple values with "; " separator
  • Returns an empty list when source data is unavailable

This maintains data consistency and alignment with the base length.


172-186: LGTM! Consistent pattern for joined and signedFrom fields.

Both players_joined and players_signed_from follow the same robust pattern:

  • None guards to prevent errors
  • Empty string defaults for missing data
  • Semicolon-separated joins for multiple values
  • Proper alignment with base length

This ensures consistency across all player data fields.

app/settings.py (2)

47-60: LGTM! Tournament size validation is correct.

The validator properly:

  • Allows None values for optional tournament sizes
  • Enforces positive integers when values are provided
  • Applies to all tournament size fields

62-80: LGTM! Clean tournament size lookup implementation.

The method provides a simple, efficient way to retrieve tournament sizes by competition ID with a safe None default for unknown competitions.

app/services/base.py (7)

608-612: Verify status code access for different exception types.

Line 608 uses hasattr(e, "status_code"), but this may not work correctly for requests.exceptions.RequestException, which stores the status code in e.response.status_code rather than directly on the exception.

Verify the correct way to access status codes from both exception types and adjust the logic:

# For requests exceptions: e.response.status_code
# For HTTPException: e.status_code

Consider this pattern:

                 # Don't retry on client errors (4xx) except rate limits
-                if hasattr(e, "status_code") and 400 <= e.status_code < 500:
+                status_code = None
+                if hasattr(e, "status_code"):
+                    status_code = e.status_code
+                elif hasattr(e, "response") and hasattr(e.response, "status_code"):
+                    status_code = e.response.status_code
+                
+                if status_code and 400 <= status_code < 500:
-                    if e.status_code in [429, 503]:  # Rate limit or service unavailable
+                    if status_code in [429, 503]:  # Rate limit or service unavailable
                         continue  # Retry these
                     else:
                         raise  # Don't retry other 4xx errors

Also applies to: 667-671


846-896: LGTM! Improved block detection reduces false positives.

The _detect_block method implements a sophisticated multi-stage approach:

  1. Status code checks with special handling for 403 responses containing valid content
  2. Detection of suspiciously short responses
  3. Explicit block phrases only checked when status suggests blocking (≥400)
  4. CAPTCHA indicators with contextual checks

This aligns with the PR objective to reduce false positives (e.g., from the word "blocked" appearing in normal HTML).


1006-1010: LGTM! Page initialization guards prevent runtime errors.

The new guards in get_list_by_xpath and get_text_by_xpath properly check if self.page is initialized before attempting XPath operations, raising clear HTTPException(500) errors if the page is None. This prevents cryptic AttributeError exceptions and provides better error context.

Also applies to: 1049-1053


1099-1105: LGTM! Session rotation method is straightforward.

The rotate_session method correctly creates a new session with a fresh UUID and retrieves it from the session manager, providing a clean way to force session rotation when encountering rate limits or blocks.


1107-1155: LGTM! Static accessor methods provide clean monitoring API.

The new static methods (get_session_stats, get_retry_stats, has_proxy_support, get_monitoring_stats) provide a clean public API for accessing anti-scraping subsystem statistics and configuration. The delegation to global singletons is appropriate, and get_monitoring_stats usefully augments the monitor data with browser scraper details.


904-928: LGTM! Enhanced request methods with browser fallback.

Both request_url_bsoup and request_url_page now implement intelligent fallback logic:

  1. Try standard HTTP request first
  2. Fall back to browser scraping on failure
  3. Re-raise original HTTP error if both methods fail

This provides robustness against anti-scraping measures while maintaining performance by preferring fast HTTP requests.

Also applies to: 952-975


944-950: LGTM! Added safety check for HTML parsing failures.

The new guard in convert_bsoup_to_page properly handles the case where etree.HTML() returns None (which can happen with malformed HTML), raising a clear HTTPException(500) with an informative error message instead of allowing a None value to propagate.

Comment thread app/services/base.py
"--no-first-run",
"--no-zygote",
"--disable-gpu",
"--disable-web-security",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Security concern: --disable-web-security weakens browser isolation.

The --disable-web-security flag disables CORS and other security boundaries. While this may be necessary for some scraping scenarios, it creates risk if the scraper processes untrusted URLs or if the browser context is reused across different security zones.

Consider whether this flag is strictly necessary. If you must disable web security:

  • Document why it's required
  • Ensure the scraper only processes trusted Transfermarkt URLs
  • Never reuse browser contexts across different domains
  • Consider removing this flag and handling CORS issues differently
                 args=[
                     "--no-sandbox",
                     "--disable-setuid-sandbox",
                     "--disable-dev-shm-usage",
                     "--disable-accelerated-2d-canvas",
                     "--no-first-run",
                     "--no-zygote",
                     "--disable-gpu",
-                    "--disable-web-security",
                     "--disable-features=VizDisplayCompositor",
                 ],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"--disable-web-security",
args=[
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
"--disable-accelerated-2d-canvas",
"--no-first-run",
"--no-zygote",
"--disable-gpu",
"--disable-features=VizDisplayCompositor",
],
🤖 Prompt for AI Agents
In app/services/base.py around line 403, the Chromium launch args include
"--disable-web-security", which weakens browser isolation; remove this flag
unless absolutely required, and if it must remain, add an inline comment and doc
explaining why, restrict the scraper to a validated whitelist of Transfermarkt
URLs before loading any page, ensure a new isolated browser context is created
per domain (never reuse contexts across domains), and add runtime guard checks
that prevent processing untrusted URLs so the flag is only effective for
known-safe targets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant