Download all images per gbifID with -NN suffix, plus circuit breaker …#34

Draft
Jiahao419 wants to merge 1 commit into main from multi-image-download

@Jiahao419
Collaborator

This is a consolidation of changes I made to image_install_parallel.py while taking over the pipeline. The new SQLite-tracked downloader in gardoslab/herbaria-data#1 is based on this same logic. I am opening this PR mainly to record the work in the original repo's history, as Tom suggested.

What changed

  • Download every image per gbifID, not just the first. Each image is saved as <gbifID>-00.jpg, <gbifID>-01.jpg, etc. A gbifID is written to processed_ids.txt only when all of its images succeed; partial failures go to failed_ids.txt for retry.
  • Hierarchical storage path, <base>/<id[:3]>/<id[3:6]>/<id>-NN.jpg, to avoid millions of files in a single directory.
  • Per-host circuit breaker (threshold raised from 50 → 500) plus a timed cooldown for HTTP 429 / timeouts, so one bad host (e.g. swbiodiversity.org, mnhn.fr) can't block the whole run.
  • IIIF manifest parsing: manifests are expanded into direct image URLs at 1600 / 1200 / 800 px so the downloader actually has something to fetch.
  • Retry-failed-first on startup: IDs in failed_ids.txt are tried before new ones; an ID is removed from failed_ids only on full success.
  • Removed the duplicate check against legacy datasets (GBIF-F24, harvard-herbaria) — it was loading those multimedia.txt files at startup, which contributed to the early memory issues.
  • .gitignore: exclude local scratch files (backups, dry runs, sorted/merged checkpoint copies).
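
For reviewers who want the naming and checkpoint rules in one place, here is a minimal sketch of the path layout and the all-or-nothing bookkeeping described above. It is not the code from image_install_parallel.py: the helper names `image_path` and `checkpoint_file` are hypothetical, and treating a gbifID with no image results as failed is an assumption of this sketch.

```python
from pathlib import Path


def image_path(base: str, gbif_id: str, index: int) -> Path:
    """Build <base>/<id[:3]>/<id[3:6]>/<id>-NN.jpg for one image of a gbifID."""
    return Path(base) / gbif_id[:3] / gbif_id[3:6] / f"{gbif_id}-{index:02d}.jpg"


def checkpoint_file(results: list[bool]) -> str:
    """Choose the checkpoint file for a gbifID from its per-image outcomes.

    A gbifID counts as processed only if every image succeeded; any
    failure sends it to the retry list. An empty result list is treated
    as a failure here (assumption, not stated in the PR).
    """
    return "processed_ids.txt" if results and all(results) else "failed_ids.txt"
```

For example, the second image of gbifID 1234567890 under a base of /data would land at /data/123/456/1234567890-01.jpg. The PR does not say how IDs shorter than six characters are handled, so the sketch does not special-case them.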

Status

  • Last batch run on this code (job 5752217, 5/20–5/21) added ~620k processed IDs on top of the inherited baseline (13.6M processed, 19.3M failed).
  • Memory issues were resolved separately by submitting with -l mem_per_core=8G; no code change needed.
  • Job has been canceled so the new SQLite-tracked pipeline can take over.

…and IIIF support

- Save each gbifID's images as <id>-00.jpg, <id>-01.jpg... and mark a
  gbifID processed only when every image succeeds (partial failures go
  to failed_ids for retry).
- Add hierarchical storage path (<base>/<prefix1>/<prefix2>/<id>-NN.jpg)
  to avoid millions of files in a single directory.
- Add a per-host circuit breaker (threshold 500) and a timed cooldown
  for 429/timeout responses, so one bad host can't block the whole run.
- Parse IIIF manifests into direct image URLs (1600/1200/800 variants).
- On startup, retry IDs from failed_ids.txt first; remove them only on
  full success.
- Drop the duplicate check against legacy datasets (disabled for now).
- .gitignore: exclude local scratch files (backups, dry runs, sorted/
  merged checkpoint copies).
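
The per-host circuit breaker and cooldown in the commit message can be pictured roughly as follows. This is an illustrative sketch, not the implementation in this commit: the class name `HostBreaker`, the 300-second cooldown, the injectable `clock`, and resetting the failure count on success are all assumptions. Only the threshold of 500 and the 429/timeout trigger come from the description.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


class HostBreaker:
    """Per-host failure counter with a timed cooldown (illustrative only)."""

    def __init__(self, threshold=500, cooldown_s=300.0, clock=time.monotonic):
        self.threshold = threshold    # failures before a host is skipped entirely
        self.cooldown_s = cooldown_s  # assumed pause after a 429 or timeout
        self.clock = clock            # injectable so the logic can be tested
        self.failures = defaultdict(int)
        self.paused_until = {}

    @staticmethod
    def host(url):
        return urlparse(url).netloc

    def allow(self, url):
        """Return True if a request to this URL's host may be attempted now."""
        h = self.host(url)
        if self.failures[h] >= self.threshold:
            return False  # breaker is open for this host
        return self.clock() >= self.paused_until.get(h, float("-inf"))

    def record_failure(self, url, throttled=False):
        """Count a failure; start a cooldown if it was a 429 or a timeout."""
        h = self.host(url)
        self.failures[h] += 1
        if throttled:
            self.paused_until[h] = self.clock() + self.cooldown_s

    def record_success(self, url):
        """Reset the host's failure count (assumed behaviour)."""
        self.failures[self.host(url)] = 0
```

A worker would call `allow(url)` before each request and skip or defer the URL when it returns False, so a throttling host only delays its own URLs while other hosts keep downloading.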

@trgardos marked this pull request as draft June 4, 2026 16:30