MarsToGotlibre/isu-score-parser

isu-score-parser

isu-score-parser is a versatile tool designed to extract and structure figure skating data, including results, metadata, and technical protocols.

While it is optimized for Synchronized Skating, it supports most artistic disciplines. The project is split into two independent modules, so you can use only what you need.

Note

Maintenance is primarily focused on Synchronized Skating. While other disciplines are generally supported, full compatibility is not guaranteed if their PDF or page structures deviate significantly from the tested standards.

For sample outputs, see the test-outputs folder in the test branch.

Event Data Scraper

Extracts locations, results, and panel data from ISU event pages.

Features

  • Backward compatible: Supports both modern and legacy competition page layouts.
  • Wayback Machine Integration: Automatically explores archived pages and their relative links using the archive.org API.
  • Robust Parsing: Handles various statuses (Ranked, Withdrawn, Did Not Reach Final).
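
As an illustration of the Wayback Machine lookup, here is a minimal sketch that assumes the public availability endpoint (https://archive.org/wayback/available). The README does not say which archive.org endpoint the scraper actually calls, and the function names below are hypothetical, not taken from this project.

```python
from urllib.parse import urlencode

# Public availability endpoint of the Wayback Machine (assumed here;
# the scraper itself may rely on a different archive.org endpoint).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL asking for the snapshot closest to `timestamp`."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD or YYYYMMDDhhmmss
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return the archived URL from a decoded API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

With requests installed, the decoded response could be obtained with requests.get(availability_url(url), timeout=30).json() and passed to closest_snapshot.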

Installation

pip install requests lxml pandas beautifulsoup4 regex

Usage

1. Scrape an event page

Extract metadata and results. You can also trigger the PDF download immediately with the -d flag.

python3 main.py event scrape <url> [OPTIONS]
Option              Description
-d, --download-pdf  Download the score PDFs found during scraping.
-o, --output-dir    Output directory, created if it doesn't exist. If not specified, a generic directory is created.

2. Download PDFs from a JSON output

If you already have a JSON result from a previous scrape, use this to fetch the PDF files.

python3 main.py event dl <FILE.json> [OPTIONS]
Option            Description
-o, --output-dir  Output directory name, created if it doesn't exist. Defaults to the same directory as the JSON file.

Examples

python3 main.py event dl example.json -o Directory
python3 main.py event scrape https://example.com -o Directory
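
The layout of the scraper's JSON output is not documented here, so the following is only a sketch of how PDF links could be collected from such a file without knowing its schema: it walks the decoded JSON and keeps every string ending in .pdf. The function names are hypothetical, and the real dl command may locate links differently.

```python
import json
from pathlib import Path


def find_pdf_links(node):
    """Recursively collect every string ending in .pdf from decoded JSON."""
    if isinstance(node, str):
        return [node] if node.lower().endswith(".pdf") else []
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        return [link for child in node for link in find_pdf_links(child)]
    return []  # numbers, booleans and null carry no links


def load_pdf_links(json_path):
    """Read a JSON file from a previous scrape and list its PDF links."""
    text = Path(json_path).read_text(encoding="utf-8")
    return find_pdf_links(json.loads(text))
```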

Extract PDF scores

A tool to extract score tables from synchro skating score PDFs using Python. The extracted score tables are stored in JSON files and can be supplemented with competition info from a YAML file passed to the parser. The parser also supports other artistic disciplines.

Features:

  • Backward compatible (back to 2005)
  • Multiple discipline support
    • base value bonus support
  • Deduction votes support
  • No call support

Installation

Requires:

  • Python 3.10+

  • Python dependencies:

    • pandas (2.3.3)
    • camelot-py (1.0.9)
    • pdfplumber (0.11.9)
    • PyYAML (6.0.3) (optional: install it if you intend to use a YAML file to complete your output)
pip install pandas "camelot-py[base]" pdfplumber pyyaml

Usage

Use the following options to parse your PDF:

Option        Required  Description
-p, --pdf     yes       PDF file path
-y, --yaml    no        YAML file path used to complete the competition info
-b, --begin   yes       First page to parse
-e, --end     no        Last page to parse. If not specified, only the first page entered is parsed.
-o, --output  no        Output directory, created if it doesn't exist. If not specified, a generic output directory is created for the generated JSON files.

Usage:

python3 main.py [OPTIONS]
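
The -b/-e semantics above can be sketched as a small helper that turns the two values into a page-range string. This is an assumption-laden illustration: page_spec is a hypothetical name, and the "3-10" format is the one accepted by the pages argument of camelot.read_pdf, which the parser may or may not use in this way.

```python
def page_spec(begin, end=None):
    """Turn --begin/--end values into a page-range string such as "3-10".

    When `end` is omitted, only the `begin` page is selected, matching the
    documented behaviour of the -e option.
    """
    if begin < 1:
        raise ValueError("page numbers start at 1")
    if end is None:
        return str(begin)
    if end < begin:
        raise ValueError("--end must not be smaller than --begin")
    return f"{begin}-{end}"
```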

Add info to the generated JSON files

Provide a YAML file following this pattern:

schema_version: 1
competition:
  name: ISU World Synchronized Skating Championships
  location:
    country: SWE
    city: Stockholm
  date: 2018-04-06
season: 2017-2018
source_url: example.org

None of the entries (except schema_version) are required when parsing. You can remove any for which data is missing.

Future Objectives

  • Possibility to parse multiple PDFs at a time.
  • Connect the two modules
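
A minimal sketch of how such a file could be loaded and attached to parsed scores, assuming PyYAML is installed. The helper names and the "competition_info" key are hypothetical and do not come from the parser. One detail worth knowing: yaml.safe_load reads an unquoted value such as 2018-04-06 as a date object, which json.dumps cannot serialize without a custom hook.

```python
import json
from datetime import date


def load_info(yaml_path):
    """Load the competition info file (requires the optional PyYAML)."""
    import yaml  # imported lazily because PyYAML is an optional dependency

    with open(yaml_path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def validate_info(info):
    """Only schema_version is mandatory; every other entry may be absent."""
    if "schema_version" not in info:
        raise ValueError("schema_version is required")
    return info


def _encode(value):
    """json.dumps hook: convert date objects to ISO strings."""
    if isinstance(value, date):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def attach_info(scores, info):
    """Return JSON text with the info stored under a hypothetical key."""
    merged = dict(scores)
    merged["competition_info"] = validate_info(info)
    return json.dumps(merged, default=_encode, indent=2)
```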
