Skip to content

fix: Add name and seasonId fields to ClubPlayers schema#7

Merged
eskobar95 merged 42 commits intomainfrom
fix/anti-scraping-optimization
Dec 6, 2025
Merged

fix: Add name and seasonId fields to ClubPlayers schema#7
eskobar95 merged 42 commits intomainfrom
fix/anti-scraping-optimization

Conversation

@eskobar95
Copy link
Copy Markdown
Owner

@eskobar95 eskobar95 commented Dec 6, 2025

Fix

Club Players Schema Update

  • Added missing fields - Added name and season_id fields to ClubPlayers schema
  • These fields are set in get_club_players() but were missing from the Pydantic schema
  • Resolves issue where name and seasonId were null in API response despite being set in service layer

Changes

  • app/schemas/clubs/players.py: Added name: Optional[str] = None and season_id: Optional[str] = None to ClubPlayers class

Testing

  • ✅ Schema validation passes with new fields
  • ✅ Local: Returns name='AC Milan' and seasonId='2024' correctly
  • ⚠️ Server: Currently returns null (will be fixed after deployment)

Related

Follow-up to PR #6 - adds schema fields that were missing for the response fields added in that PR.

Summary by CodeRabbit

  • New Features

    • Added debug endpoint for troubleshooting scraping issues and diagnosing page retrieval performance
    • ClubPlayers responses now include club name and season information
  • Bug Fixes

    • Improved robustness of player position and age data extraction with more flexible selectors
    • Enhanced error handling and logging for search endpoints with better diagnostics
    • Strengthened browser fallback mechanism with availability checks and explicit error reporting
  • Chores

    • Added Playwright dependency for browser-based content retrieval

✏️ Tip: You can customize this high-level summary in your review settings.

- Add full support for national teams across all club endpoints
- Add new /clubs/{club_id}/competitions endpoint to retrieve club competitions
- Add isNationalTeam field to Club Profile response schema
- Make Club Profile fields optional to accommodate national teams
- Enhance Club Players endpoint to handle national team HTML structure
- Update XPath expressions to support both club and national team structures
- Add intelligent detection logic for national teams
- Maintain backward compatibility with existing club endpoints

This update enables the API to work seamlessly with both regular clubs
and national teams, providing a unified interface for all club-related
data retrieval.
- Add GET /competitions/{competition_id}/seasons endpoint
- Implement TransfermarktCompetitionSeasons service to scrape season data
- Add CompetitionSeason and CompetitionSeasons Pydantic schemas
- Support both cross-year (e.g., 25/26) and single-year (e.g., 2025) seasons
- Handle historical seasons correctly (e.g., 99/00 -> 1999-2000)
- Extract seasons from competition page dropdown/table structure
- Return season_id, season_name, start_year, and end_year for each season
- Sort seasons by start_year descending (newest first)

Closes #[issue-number]
- Detect national team competitions (FIWC, EURO, COPA, AFAC, GOCU, AFCN)
- Use /teilnehmer/pokalwettbewerb/ URL for national team competitions
- Handle season_id correctly (year-1 for national teams in URL)
- Add XPath expressions for participants table
- Limit participants to expected tournament size to exclude non-qualified teams
- Make season_id optional in CompetitionClubs schema
- Update Dockerfile PYTHONPATH configuration
- Add length validation for ids and names before zip() to prevent silent data loss
- Raise descriptive ValueError with logging if ids and names mismatch
- Simplify seasonId assignment logic for national teams
- Remove unnecessary try/except block (isdigit() prevents ValueError)
- Clean up unreachable fallback code
- Add tournament size configuration to Settings class with environment variable support
- Replace hardcoded dict with settings.get_tournament_size() method
- Add warning logging when tournament size is not configured (instead of silent truncation)
- Proceed without truncation when size is unavailable (no silent data loss)
- Add validation for tournament sizes (must be positive integers)
- Add comprehensive unit tests for both configured and fallback paths
- Update README.md with new environment variables documentation

This prevents silent truncation when tournament sizes change (e.g., World Cup expanding to 48)
and allows easy configuration via environment variables.
- Remove extra HTTP request to fetch club profile just to read isNationalTeam
- Set is_national_team=None to let TransfermarktClubPlayers use DOM heuristics
- Remove broad except Exception that silently swallowed all errors
- Improve performance by eliminating redundant network call
- Players class already has robust DOM-based detection for national teams
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Move datetime and HTTPException imports from method level to module level
- Improves code readability and marginally improves performance
- Follows Python best practices for import organization
- Keep imports at module level in clubs/competitions.py (from CodeRabbit review)
- Preserve is_national_team flag logic in clubs/players.py
- Keep name padding in competitions/search.py
- Add .DS_Store to .gitignore
- Remove whitespace from blank lines (W293)
- Add missing trailing commas (COM812)
- Split long XPath lines to comply with E501 line length limit
- Format XPath strings to comply with line length
- Format list comprehensions
- Format is_season condition
- Fix session initialization issue causing all HTTP requests to fail
- Improve block detection to avoid false positives
- Optimize browser scraping delays (reduce from 12-13s to 0.4-0.8s)
- Update XPath definitions for clubs, competitions, and players search
- Fix nationalities parsing in player search (relative to each row)
- Add comprehensive monitoring endpoints
- Update settings for anti-scraping configuration

Performance improvements:
- HTTP success rate: 0% → 100%
- Response time: 12-13s → 0.4-0.8s
- Browser fallback: Always → Never needed
- All endpoints now working correctly
Resolved conflicts:
- app/services/clubs/players.py: Kept improved nationalities parsing with trim()
- app/settings.py: Kept anti-scraping configuration settings
- app/utils/xpath.py: Combined URL from HEAD with robust NAME fallbacks from main
- Fix import sorting
- Add trailing commas
- Replace single quotes with double quotes
- Add noqa comments for long lines (User-Agent strings, XPath definitions)
- Remove unused variables
- Fix whitespace issues
- Change padding logic for players_joined_on, players_joined, and players_signed_from
- Use "" instead of None to match the default value when elements are None
- Fixes CodeRabbit review: inconsistent placeholder values
- Add try/except for playwright import to handle missing dependency
- Make _browser_scraper optional (None if playwright unavailable)
- Add checks in make_request_with_browser_fallback and get_monitoring_stats
- Update test_browser_scraping endpoint to handle missing playwright
- Add playwright to requirements.txt
- App can now start without playwright, browser scraping disabled if unavailable
- Add playwright install chromium step in Dockerfile
- Only runs if playwright is installed (graceful fallback)
- Ensures browser binaries are available for Railway deployment
- Keep optional playwright import in base.py
- Keep playwright availability check in test_browser_scraping endpoint
- Maintains Railway deployment compatibility
- Keep optional playwright import and checks
- Maintain browser scraper optional initialization
- Preserve playwright availability checks in monitoring stats
- All conflicts resolved, ready for merge
- Validate HTTP responses have content before returning
- Validate browser scraping content before using
- Add detailed logging for debugging deployment issues
- Raise proper exceptions when content is empty or invalid
- Helps diagnose why server returns 200 with empty content
- Add /debug/scraping endpoint to test HTTP, browser, and page requests
- Shows content lengths, errors, and availability status
- Helps diagnose why server returns empty responses
- Better error messages in request_url_page and make_request_with_browser_fallback
- Validate page is not None after request_url_page()
- Add exception handling in endpoints to catch and log errors
- Add XPath error handling with detailed error messages
- Warn when search results are empty (helps diagnose Railway issues)
- Prevents silent failures that return 200 with empty data
- Log page HTML length and content validation
- Log XPath extraction results
- Warn when no results found
- Helps diagnose why server returns empty results while local works
- Default to current year if season_id is None (like Club Competitions)
- Add club name and seasonId to response
- Fix URL formatting to prevent 'None' in URL
- Improves __update_season_id to validate extracted season
- Fix DOB_AGE XPath to use TD[5] instead of zentriert[2] selector
- Default season_id to current year if None (prevents 'None' in URL)
- Add club name and seasonId to response
- Resolves issue where 0 players were returned despite finding URLs/names
- Use more specific XPath matching zentriert td with date pattern
- Filters for dates (contains '/') and minimum length to avoid false matches
- Resolves issue where DOB_AGE returned 0 items
- Add Optional name and season_id fields to ClubPlayers schema
- These fields are set in get_club_players() but were missing from schema
- Resolves issue where name and seasonId were None in API response
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Dec 6, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This PR adds Playwright browser scraping capability with graceful fallback handling, introduces a new diagnostic debug endpoint for multi-stage scraping metrics, enhances error handling and logging across search and competition endpoints, updates XPath selectors for robustness, expands the club player schema with season and name fields, and adds Playwright as a conditional dependency.

Changes

Cohort / File(s) Summary
Infrastructure & Dependencies
Dockerfile, requirements.txt
Adds conditional Playwright browser installation in Docker build and declares playwright==1.48.0 as a Python 3.9+ dependency.
Error Handling & Logging
app/api/endpoints/clubs.py, app/api/endpoints/competitions.py
Wraps search handlers in try/except blocks; logs warnings when "results" are empty and re-raises exceptions for proper error propagation.
Search Service Robustness
app/services/clubs/search.py, app/services/competitions/search.py
Adds try/except guards on page loading in __post_init__, explicit validation that page is not None, detailed debug logging for content inspection, and XPath extraction diagnostics.
New Debug Endpoint
app/main.py
Introduces /debug/scraping endpoint that orchestrates HTTP request, optional Playwright browser fallback, and full page retrieval, returning structured diagnostic metrics; enhances test_browser_scraping with availability guard.
Playwright Integration
app/services/base.py
Implements guarded Playwright import with PLAYWRIGHT_AVAILABLE flag, conditional _browser_scraper initialization, runtime ImportError raising when unavailable, enhanced monitoring stats, and expanded content validation.
Data Model & Schema
app/schemas/clubs/players.py
Adds optional name and season_id fields to ClubPlayers schema with default None values.
Service Logic Updates
app/services/clubs/players.py
Defaults season_id to current year in __post_init__, adds guarded extraction in __update_season_id, retrieves and includes club name and season ID in get_club_players response.
XPath Selector Improvements
app/utils/xpath.py
Updates player position and date-of-birth selectors from index-based to pattern-based matching; POSITIONS now uses substring containment for class matching, DOB_AGE matches date patterns with slash validation.

Sequence Diagram

sequenceDiagram
    participant Client
    participant DebugEndpoint as /debug/scraping Endpoint
    participant HTTPScraper as HTTP Scraper
    participant BrowserScraper as Browser Scraper<br/>(Playwright)
    participant PageProcessor as Page Processor

    Client->>DebugEndpoint: GET /debug/scraping?url=...
    
    DebugEndpoint->>HTTPScraper: scrape_http(url)
    HTTPScraper-->>DebugEndpoint: html_content or error
    
    rect rgb(200, 220, 255)
        note over DebugEndpoint,BrowserScraper: Browser Fallback (if needed)
        alt Playwright Available & HTTP Failed
            DebugEndpoint->>BrowserScraper: scrape_with_browser(url)
            BrowserScraper->>BrowserScraper: Launch browser<br/>Navigate page
            BrowserScraper-->>DebugEndpoint: full_page_content
        else Playwright Unavailable
            DebugEndpoint-->>DebugEndpoint: Report<br/>browser_available=False
        end
    end

    DebugEndpoint->>PageProcessor: Parse & retrieve full content
    PageProcessor-->>DebugEndpoint: Parsed results
    
    DebugEndpoint-->>Client: {<br/>  http_success, http_content_length,<br/>  browser_success, browser_content_length,<br/>  full_success, total_content_length<br/>}
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • app/services/base.py: Conditional Playwright import and _browser_scraper initialization logic—verify graceful fallback doesn't mask underlying failures
    • app/services/clubs/search.py: Multiple debug logging blocks and page serialization logic—ensure performance impact is acceptable and logging doesn't leak sensitive data
    • app/utils/xpath.py: Updated XPath patterns for player position and DOB extraction—verify new substring/pattern matching produces consistent results across varied HTML structures
    • app/main.py: New debug endpoint and browser fallback orchestration—ensure error paths and metric reporting are exhaustive

Poem

🐰 Playwright now stands by, ready to catch what HTTP might miss,
With graceful fallbacks and logs that won't dismiss,
Debug endpoints divine, metrics precise and bright,
XPaths refined by pattern, not fragile by sight.
Robust and aware, we scrape through the night! 🌙✨

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/anti-scraping-optimization

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 057f689 and 58dda40.

📒 Files selected for processing (11)
  • Dockerfile (1 hunks)
  • app/api/endpoints/clubs.py (1 hunks)
  • app/api/endpoints/competitions.py (1 hunks)
  • app/main.py (2 hunks)
  • app/schemas/clubs/players.py (1 hunks)
  • app/services/base.py (8 hunks)
  • app/services/clubs/players.py (3 hunks)
  • app/services/clubs/search.py (2 hunks)
  • app/services/competitions/search.py (1 hunks)
  • app/utils/xpath.py (1 hunks)
  • requirements.txt (1 hunks)

Comment @coderabbitai help to get the list of available commands and usage tips.

@eskobar95 eskobar95 merged commit 8d35919 into main Dec 6, 2025
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant