|
| 1 | +## Tools Directory Usage |
| 2 | + |
| 3 | +The `tools` directory contains utility scripts for working with OSL (Open Sports Lab) datasets, in particular for downloading annotated datasets and their associated videos from Hugging Face. This section explains what the scripts do and how to run them. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +### 1. Download OSL Dataset and Videos from Hugging Face |
| 8 | + |
| 9 | +**Script:** `tools/download_osl_hf.py` |
| 10 | + |
| 11 | +This script automates the download of an OSL-format JSON file (annotation file) and all referenced videos from a Hugging Face dataset repository. |
| 12 | + |
| 13 | +#### **Features** |
| 14 | + |
| 15 | +* Downloads a specific OSL JSON annotation file. |
| 16 | +* Parses the JSON to identify referenced video files and downloads them as well. |
| 17 | +* Supports a “dry run” that lists the files that would be downloaded and their total size, without downloading anything. |
| 18 | + |
| 19 | +#### **Requirements** |
| 20 | + |
| 21 | +* Python 3.x |
| 22 | +* `huggingface_hub` Python package (install with `pip install huggingface_hub`) |
| 23 | + |
| 24 | +#### **Usage** |
| 25 | + |
| 26 | +**Basic Command:** |
| 27 | + |
| 28 | +```bash |
| 29 | +python tools/download_osl_hf.py \ |
| 30 | + --url https://huggingface.co/datasets/<user>/<dataset>/blob/main/<file.json> \ |
| 31 | + --output-dir <output_directory> |
| 32 | +``` |
| 33 | + |
| 34 | +**Example:** |
| 35 | + |
| 36 | +```bash |
| 37 | +python tools/download_osl_hf.py \ |
| 38 | + --url https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json \ |
| 39 | + --output-dir /Users/giancos/Documents/HistWC/ |
| 40 | +``` |
| 41 | + |
| 42 | +**Arguments:** |
| 43 | + |
| 44 | +* `--url`: (required) The Hugging Face URL of the OSL JSON file, in the “blob/main/...” form shown in the web interface. |
| 45 | +* `--output-dir`: (optional) Path to the directory where the dataset and videos should be downloaded. Defaults to `downloaded_data` if not specified. |
| 46 | +* `--dry-run`: (optional) If provided, lists all files that would be downloaded and their total size, but does not download any files. |
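
The three arguments above map naturally onto an `argparse` definition. The following is a minimal sketch of how such a command line could be declared, not the actual contents of `tools/download_osl_hf.py`; the function name `build_parser` and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Download an OSL JSON file and its referenced videos."
    )
    # Required: URL of the OSL JSON file in "blob/main/..." form.
    parser.add_argument("--url", required=True, help="Hugging Face URL of the OSL JSON file")
    # Optional: falls back to the documented default directory.
    parser.add_argument("--output-dir", default="downloaded_data", help="Download destination")
    # Optional flag: False unless --dry-run is passed.
    parser.add_argument("--dry-run", action="store_true", help="List files and total size only")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.output_dir, args.dry_run)
```

With this layout, omitting `--output-dir` yields `downloaded_data`, and `--dry-run` is a plain switch that takes no value.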
| 47 | + |
| 48 | +**Dry Run Example:** |
| 49 | + |
| 50 | +```bash |
| 51 | +python tools/download_osl_hf.py \ |
| 52 | + --url https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json \ |
| 53 | + --output-dir /Users/giancos/Documents/HistWC/ \ |
| 54 | + --dry-run |
| 55 | +``` |
| 56 | + |
| 57 | +--- |
| 58 | + |
| 59 | +### 2. Zip the folder |
| 60 | + |
| 61 | +```bash |
| 62 | +zip -r DatasetAnnotationTool.zip * |
| 63 | +``` |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +### **Notes** |
| 68 | + |
| 69 | +* The script automatically converts Hugging Face “blob” URLs to the proper “resolve” format for direct file access. |
| 70 | +* After downloading, the output directory will contain the JSON annotation and all video files referenced in it, keeping the original folder structure. |
| 71 | +* For datasets with a large number of videos, downloads are parallelized for efficiency. |
| 72 | +* If a video is missing from the repository, it is reported (especially useful in dry-run mode). |
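
The “blob” to “resolve” conversion mentioned in the notes can be illustrated with a short sketch. This is not the script's own code: the helper names `parse_hf_blob_url` and `to_resolve_url` are hypothetical, and the sketch assumes a dataset URL whose revision (for example `main`) contains no slashes.

```python
from urllib.parse import urlparse


def parse_hf_blob_url(url: str) -> dict:
    """Split a Hugging Face dataset file URL into repo id, revision, and file path."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path: datasets/<user>/<dataset>/blob/<revision>/<path/to/file>
    if len(parts) < 6 or parts[0] != "datasets" or parts[3] not in ("blob", "resolve"):
        raise ValueError(f"Not a Hugging Face dataset file URL: {url}")
    return {
        "repo_id": f"{parts[1]}/{parts[2]}",
        "revision": parts[4],  # assumption: revision has no "/" in it
        "filename": "/".join(parts[5:]),
    }


def to_resolve_url(url: str) -> str:
    """Rewrite a web-interface 'blob' URL into the direct-download 'resolve' form."""
    info = parse_hf_blob_url(url)
    return (
        f"https://huggingface.co/datasets/{info['repo_id']}"
        f"/resolve/{info['revision']}/{info['filename']}"
    )


if __name__ == "__main__":
    print(to_resolve_url(
        "https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json"
    ))
```

The parsed `repo_id`, `revision`, and `filename` are also the pieces that `huggingface_hub.hf_hub_download` expects (together with `repo_type="dataset"`), so the same split could feed either a direct HTTP download or the library call.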