Commit ca8c8ff

Added Tools README.md
1 parent ba5c9f2 commit ca8c8ff

2 files changed

Lines changed: 75 additions & 0 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -211,3 +211,6 @@ __marimo__/
 KQL
 .DS_Store
 **/.DS_Store
+
+
+DatasetAnnotationTool.zip

tools/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
## Tools Directory Usage

The `tools` directory contains utility scripts to help you work with OSL (Open Sports Lab) datasets, particularly for downloading annotated datasets and associated videos from Hugging Face. Below you'll find an explanation and usage instructions.

---

### 1. Download OSL Dataset and Videos from Hugging Face

**Script:** `tools/download_osl_hf.py`

This script automates the download of an OSL-format JSON file (annotation file) and all referenced videos from a Hugging Face dataset repository.

#### **Features**

* Downloads a specific OSL JSON annotation file.
* Parses the JSON to identify referenced video files and downloads them as well.
* Can perform a “dry run” to show which files would be downloaded and their total size, without actually downloading.
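
The second feature, finding the videos an annotation file refers to, could be sketched as follows. The OSL schema is not described in this README, so the sketch assumes nothing about key names: it walks the whole parsed JSON and collects every string ending in a common video extension. The name `find_video_paths`, the extension list, and the sample data are illustrative assumptions, not the script's actual logic.

```python
import json

# Assumed set of video extensions; the real script may rely on the OSL schema instead.
VIDEO_EXTENSIONS = (".mp4", ".mkv", ".avi", ".mov", ".webm")


def find_video_paths(node):
    """Recursively collect strings that look like video paths, in order, without duplicates."""
    found = []

    def walk(value):
        if isinstance(value, dict):
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)
        elif isinstance(value, str) and value.lower().endswith(VIDEO_EXTENSIONS):
            if value not in found:
                found.append(value)

    walk(node)
    return found


# Hypothetical annotation snippet, used only to exercise the walker.
sample = json.loads(
    '{"videos": [{"path": "finals/1966.mp4"}, {"path": "finals/1970.MKV"}], "labels": ["goal"]}'
)
print(find_video_paths(sample))
```

A schema-aware version would read the specific fields that hold video paths, which avoids picking up unrelated strings that happen to end in a video extension.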

#### **Requirements**

* Python 3.x
* `huggingface_hub` Python package (install with `pip install huggingface_hub`)

#### **Usage**

**Basic Command:**

```bash
python tools/download_osl_hf.py \
  --url https://huggingface.co/datasets/<user>/<dataset>/blob/main/<file.json> \
  --output-dir <output_directory>
```

**Example:**

```bash
python tools/download_osl_hf.py \
  --url https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json \
  --output-dir /Users/giancos/Documents/HistWC/
```

**Arguments:**

* `--url`: (required) The direct Hugging Face URL of the OSL JSON file (should be in “blob/main/...” form, like you see in the web interface).
* `--output-dir`: (optional) Path to the directory where the dataset and videos should be downloaded. Defaults to `downloaded_data` if not specified.
* `--dry-run`: (optional) If provided, lists all files that would be downloaded and total size, but does not actually download any files.
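
For readers who want to see how these three arguments fit together, here is a minimal `argparse` sketch that mirrors the list above. The flag names and the `downloaded_data` default come from this README; the function name `build_parser` and the help strings are assumptions, and the real script may define its interface differently.

```python
import argparse


def build_parser():
    """Build a parser matching the documented flags (a sketch, not the script's own code)."""
    parser = argparse.ArgumentParser(
        description="Download an OSL JSON file and its referenced videos from Hugging Face."
    )
    parser.add_argument("--url", required=True,
                        help="Hugging Face blob URL of the OSL JSON file")
    parser.add_argument("--output-dir", default="downloaded_data",
                        help="Directory that receives the JSON file and videos")
    parser.add_argument("--dry-run", action="store_true",
                        help="List files and total size without downloading")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--url",
    "https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json",
    "--dry-run",
])
print(args.url, args.output_dir, args.dry_run)
```

Because `--output-dir` is omitted in this call, `args.output_dir` falls back to `downloaded_data`, and `args.dry_run` is `True` only because the flag is present.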

**Dry Run Example:**

```bash
python tools/download_osl_hf.py \
  --url https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json \
  --output-dir /Users/giancos/Documents/HistWC/ \
  --dry-run
```
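
The README does not show what a dry run prints, so the following is only a sketch of how such a report could be assembled. It assumes the file sizes are already known as a mapping from path to size in bytes (for instance from the Hub's file metadata), with `None` marking a file that is missing from the repository. The function `summarize_dry_run`, the output layout, and the sample file names are all hypothetical.

```python
def summarize_dry_run(sizes):
    """Return (report_lines, total_bytes, missing_paths) for a {path: size_or_None} mapping."""
    lines, missing, total = [], [], 0
    for path, size in sizes.items():
        if size is None:
            # A size of None stands for "referenced in the JSON but absent from the repo".
            missing.append(path)
            lines.append(f"MISSING      {path}")
        else:
            total += size
            lines.append(f"{size / 1e6:8.1f} MB  {path}")
    present = len(sizes) - len(missing)
    lines.append(f"Total: {total / 1e6:.1f} MB in {present} file(s), {len(missing)} missing")
    return lines, total, missing


# Hypothetical sizes, used only to exercise the function.
planned = {
    "HistWC-finals.json": 120_000,
    "finals/a.mp4": 250_000_000,
    "finals/b.mp4": None,
}
lines, total, missing = summarize_dry_run(planned)
print("\n".join(lines))
```

Keeping the summary as a pure function makes it easy to check the totals without touching the network.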

---

### 2. Zip the folder

```bash
zip -r DatasetAnnotationTool.zip *
```

---

### **Notes**

* The script automatically converts Hugging Face “blob” URLs to the proper “resolve” format for direct file access.
* After downloading, the output directory will contain the JSON annotation and all video files referenced in it, keeping the original folder structure.
* For datasets with a large number of videos, downloads will be parallelized for efficiency.
* If a video is missing in the repo, it will be reported (especially useful in dry run mode).
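
Two of the notes above lend themselves to a short illustration: the “blob” to “resolve” conversion and parallel downloads with missing-file reporting. The sketch below uses only the standard library. The names `blob_to_resolve_url`, `download_all`, and `fake_fetch`, the worker count, and the sample paths are assumptions made for illustration; the script's own implementation may differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlsplit, urlunsplit


def blob_to_resolve_url(url):
    """Rewrite a Hugging Face '/blob/' web URL into its '/resolve/' direct-download form."""
    parts = urlsplit(url)
    # Replace only the first '/blob/' segment; later ones could be real folder names.
    new_path = parts.path.replace("/blob/", "/resolve/", 1)
    return urlunsplit(parts._replace(path=new_path))


def download_all(paths, fetch, max_workers=4):
    """Call fetch(path) for every path in parallel; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                succeeded.append(path)
            except Exception as error:  # report the failure instead of aborting the batch
                failed.append((path, str(error)))
    return sorted(succeeded), sorted(failed)


def fake_fetch(path):
    """Stand-in for a real download; pretends one file is absent from the repo."""
    if path.endswith("missing.mp4"):
        raise FileNotFoundError(f"{path} not found in repo")


resolved = blob_to_resolve_url(
    "https://huggingface.co/datasets/OpenSportsLab/HistWC/blob/main/HistWC-finals.json"
)
ok, errors = download_all(["a.mp4", "b.mp4", "missing.mp4"], fake_fetch)
print(resolved)
print(ok, errors)
```

In real use, `fetch` could wrap a call such as `huggingface_hub.hf_hub_download` with `repo_type="dataset"`; it is passed in as a parameter here so the sketch runs offline.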
