Download all images per gbifID with -NN suffix, plus circuit breaker … by Jiahao419 · Pull Request #34 · gardoslab/herbdl

Jiahao419 · 2026-05-21T14:51:14Z

This is a consolidation of changes I made to image_install_parallel.py while taking over the pipeline. The new SQLite-tracked downloader in gardoslab/herbaria-data#1 is based on this same logic — opening this PR mainly to record the work in the original repo's history, as Tom suggested.

What changed

Download every image per gbifID, not just the first. Each image is saved as <gbifID>-00.jpg, <gbifID>-01.jpg, etc. A gbifID is written to processed_ids.txt only when all of its images succeed; partial failures go to failed_ids.txt for retry.
Hierarchical storage path — <base>/<id[:3]>/<id[3:6]>/<id>-NN.jpg — to avoid millions of files in a single directory.
Per-host circuit breaker (threshold raised from 50 → 500) plus a timed cooldown for HTTP 429 / timeouts, so one bad host (e.g. swbiodiversity.org, mnhn.fr) can't block the whole run.
IIIF manifest parsing: manifests are expanded into direct image URLs at 1600 / 1200 / 800 px so the downloader actually has something to fetch.
Retry-failed-first on startup: IDs in failed_ids.txt are tried before new ones; an ID is removed from failed_ids only on full success.
Removed the duplicate check against legacy datasets (GBIF-F24, harvard-herbaria) — it was loading those multimedia.txt files at startup, which contributed to the early memory issues.
.gitignore: exclude local scratch files (backups, dry runs, sorted/merged checkpoint copies).
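
As a rough illustration of the naming and storage scheme described above, here is a minimal sketch. The helper names `image_path` and `classify` are hypothetical and are not taken from image_install_parallel.py; treating a gbifID with zero images as failed is also an assumption.

```python
from pathlib import Path


def image_path(base: str, gbif_id: str, index: int) -> Path:
    """Build <base>/<id[:3]>/<id[3:6]>/<id>-NN.jpg for one image of a gbifID.

    Hypothetical helper: the two directory levels come from the first six
    characters of the ID, so no single directory holds millions of files.
    """
    return Path(base) / gbif_id[:3] / gbif_id[3:6] / f"{gbif_id}-{index:02d}.jpg"


def classify(results: list[bool]) -> str:
    """Return 'processed' only when every image of a gbifID succeeded.

    Any partial failure sends the ID to the failed list for retry.
    Assumption: an ID with no image results at all is treated as failed.
    """
    return "processed" if results and all(results) else "failed"
```

With these assumptions, the first image of gbifID 1234567890 under /data would land at /data/123/456/1234567890-00.jpg, and an ID with results `[True, False]` would be classified as failed.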

Status

Last batch run on this code (job 5752217, 5/20–5/21) added ~620k processed IDs on top of the inherited baseline (13.6M processed, 19.3M failed).
Memory issues were resolved separately by submitting with -l mem_per_core=8G; no code change needed.
Job has been canceled so the new SQLite-tracked pipeline can take over.

…and IIIF support

- Save each gbifID's images as <id>-00.jpg, <id>-01.jpg... and mark a gbifID processed only when every image succeeds (partial failures go to failed_ids for retry).
- Add hierarchical storage path (<base>/<prefix1>/<prefix2>/<id>-NN.jpg) to avoid millions of files in a single directory.
- Add a per-host circuit breaker (threshold 500) and a timed cooldown for 429/timeout responses, so one bad host can't block the whole run.
- Parse IIIF manifests into direct image URLs (1600/1200/800 variants).
- On startup, retry IDs from failed_ids.txt first; remove them only on full success.
- Drop the duplicate-check against legacy datasets; disable it for now.
- .gitignore: exclude local scratch files (backups, dry runs, sorted/merged checkpoint copies).
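
The per-host circuit breaker and timed cooldown could look roughly like the sketch below. This is not the code from the PR: the class name `HostBreaker`, the 300-second cooldown, the injectable clock, and the choice to reset a host's failure count on success are all assumptions made for illustration. Only the threshold of 500 comes from the description.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


class HostBreaker:
    """Hypothetical per-host circuit breaker with a timed cooldown."""

    def __init__(self, threshold=500, cooldown_s=300.0, clock=time.monotonic):
        self.threshold = threshold      # failures before a host is skipped
        self.cooldown_s = cooldown_s    # assumed pause after 429 / timeout
        self.clock = clock              # injectable so tests can fake time
        self.failures = defaultdict(int)
        self.cooling_until = {}

    @staticmethod
    def host(url):
        # Key the breaker on the network location, e.g. "mnhn.fr".
        return urlparse(url).netloc.lower()

    def allow(self, url):
        """Return True if the URL's host may be contacted right now."""
        h = self.host(url)
        if self.failures[h] >= self.threshold:
            return False  # breaker is open: skip this host for the run
        return self.clock() >= self.cooling_until.get(h, float("-inf"))

    def record_failure(self, url, throttled=False):
        """Count a failure; throttled=True (429 or timeout) starts a cooldown."""
        h = self.host(url)
        self.failures[h] += 1
        if throttled:
            self.cooling_until[h] = self.clock() + self.cooldown_s

    def record_success(self, url):
        # Assumption: a success resets the count, so only failures since
        # the last success are counted against the threshold.
        self.failures[self.host(url)] = 0
```

A worker would call `allow(url)` before each request and skip or requeue the URL when it returns False, so a host that keeps returning 429 only delays its own images rather than the whole run.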

trgardos marked this pull request as draft June 4, 2026 16:30

Jiahao419 wants to merge 1 commit into main from multi-image-download
