switch to sqlite database based download tracking by trgardos · Pull Request #1 · gardoslab/herbaria-data

trgardos · 2026-05-21T13:39:18Z

No description provided.


Rework the SQLite download tracking to count and download one file per distinct image instead of one per multimedia.txt row.

- download_db: images table is now one row per distinct image (image_no, image_key, packed candidate urls). canonical_image_key() collapses a IIIF manifest and the resolution variants of one specimen photo to a single key.
- init_download_db: ingest groups multimedia rows into distinct images, ordering candidate URLs highest-resolution-first.
- image_install_db: downloads one file per distinct image; detects HTML/text responses (Content-Type header + body sniff) and captures the page text into error_detail for follow-up; keeps undecodable camera-raw DNG as <id>-NN.dng flagged raw_unprocessed instead of discarding it.
- status_report: distinct-image counts, non-image-responses-by-host section, raw_unprocessed count.
- init_download_db.sh: take the build mode (--reset / --legacy-only) from the qsub command line.

The images table schema changed, so this requires a --reset rebuild.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
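The HTML/text detection described for image_install_db (Content-Type header plus a body sniff) could look roughly like the sketch below. The function name `looks_like_non_image`, the list of content types, and the 512-byte sniff window are assumptions for illustration, not the code in this PR.

```python
# Hypothetical helper: not the PR's implementation, only an illustration of
# "Content-Type header + body sniff" for spotting HTML/text responses.
TEXTUAL_TYPES = {"application/json", "application/xhtml+xml"}
HTML_PREFIXES = (b"<!doctype html", b"<html")


def looks_like_non_image(content_type, body):
    """Return True when a response is probably an HTML/text page, not an image."""
    # Step 1: trust an explicit textual Content-Type (ignore "; charset=..." etc.).
    ctype = (content_type or "").split(";")[0].strip().lower()
    if ctype.startswith("text/") or ctype in TEXTUAL_TYPES:
        return True
    # Step 2: sniff the first bytes, since some hosts label error pages as images.
    head = body[:512].lstrip().lower()
    return head.startswith(HTML_PREFIXES)
```

A caller would pass the response header value and the first bytes of the body, and store the decoded page text in error_detail when the function returns True.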

The previous ingest used df.groupby([gbifID, image_no]).agg("\n".join, image_key="first", host="first") to build one row per distinct image. With ~52M groups and a Python callable, that one statement did not finish inside the 12 h qsub limit -- the job died before ever reaching the first INSERT, so nothing was committed.

Replace it with a single-pass streaming groupby that walks the already-sorted rows in native Python lists and emits one INSERT batch per chunk of distinct images (commit every 200k images, so partial progress is saved on a future kill too).

Replace the pandas gbif_ids build with one SQL `INSERT INTO gbif_ids SELECT gbif_id, COUNT(*) FROM images GROUP BY gbif_id` which rides the (gbif_id, image_no) primary-key index.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
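A minimal sketch of the streaming approach, assuming rows arrive as `(gbif_id, image_no, url)` tuples already sorted by the first two fields. It uses `itertools.groupby` as a stand-in for the PR's hand-written loop, and the two-table schema here is a simplified, hypothetical version of the real one.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Simplified, hypothetical schema (the real tables carry more columns).
INGEST_DEMO_SCHEMA = """
CREATE TABLE images (gbif_id INTEGER, image_no INTEGER, urls TEXT,
                     PRIMARY KEY (gbif_id, image_no));
CREATE TABLE gbif_ids (gbif_id INTEGER PRIMARY KEY, n_images INTEGER);
"""
INSERT_IMAGE = "INSERT INTO images (gbif_id, image_no, urls) VALUES (?, ?, ?)"


def ingest_sorted_rows(conn, rows, batch_size=200_000):
    """Group pre-sorted rows into distinct images and insert them in batches."""
    batch, total = [], 0
    for (gbif_id, image_no), group in groupby(rows, key=itemgetter(0, 1)):
        # Candidate URLs for one distinct image are packed newline-joined.
        batch.append((gbif_id, image_no, "\n".join(r[2] for r in group)))
        if len(batch) >= batch_size:
            conn.executemany(INSERT_IMAGE, batch)
            conn.commit()  # partial progress survives a killed job
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(INSERT_IMAGE, batch)
        conn.commit()
        total += len(batch)
    # One SQL statement replaces the pandas aggregation for gbif_ids.
    conn.execute(
        "INSERT INTO gbif_ids (gbif_id, n_images) "
        "SELECT gbif_id, COUNT(*) FROM images GROUP BY gbif_id"
    )
    conn.commit()
    return total
```

Because each group is consumed as it is produced, memory use is bounded by the batch size rather than by the number of groups.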

The final step of the legacy import rolled images.status up into gbif_ids.status via one giant UPDATE with three correlated COUNT subqueries plus an `IN (SELECT DISTINCT gbif_id FROM images ...)` filter. SQLite's planner pessimised it badly enough that it did not finish in 24 h, and because the whole thing was a single transaction, nothing was committed and no progress was visible.

Replace it with _finalize_gbif_ids_status(): one SELECT gbif_id, COUNT(*) FROM images WHERE status='success' GROUP BY gbif_id to materialise per-gbifID success counts, an in-memory n_images map from gbif_ids, then chunked UPDATEs of 50k rows each with a commit and a progress print per batch. A kill mid-finalize now leaves a partially updated table and re-running is a safe idempotent retry.

Add --finalize-only so the finalize can be re-run without redoing the disk re-scan -- useful when import_legacy completed its per-image flushes but the recompute did not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
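The chunked finalize can be sketched as follows. The status labels (`pending`, `partial`, `success`), the demo schema, and the function name without the leading underscore are assumptions; only the overall shape (one grouped SELECT, an in-memory map, committed 50k-row UPDATE batches) comes from the commit message.

```python
import sqlite3

# Simplified, hypothetical schema for the demonstration.
FINALIZE_DEMO_SCHEMA = """
CREATE TABLE images (gbif_id INTEGER, image_no INTEGER, status TEXT,
                     PRIMARY KEY (gbif_id, image_no));
CREATE TABLE gbif_ids (gbif_id INTEGER PRIMARY KEY, n_images INTEGER,
                       status TEXT DEFAULT 'pending');
"""


def finalize_gbif_ids_status(conn, chunk_size=50_000):
    """Roll images.status up into gbif_ids.status in committed chunks."""
    # One grouped scan instead of correlated subqueries per row.
    success = dict(conn.execute(
        "SELECT gbif_id, COUNT(*) FROM images "
        "WHERE status='success' GROUP BY gbif_id"
    ))
    n_images = dict(conn.execute("SELECT gbif_id, n_images FROM gbif_ids"))
    updates = []
    for gbif_id, total in n_images.items():
        done = success.get(gbif_id, 0)
        # Status names below are illustrative, not the PR's vocabulary.
        status = "pending" if done == 0 else ("success" if done >= total else "partial")
        updates.append((status, gbif_id))
    for start in range(0, len(updates), chunk_size):
        chunk = updates[start:start + chunk_size]
        conn.executemany("UPDATE gbif_ids SET status = ? WHERE gbif_id = ?", chunk)
        conn.commit()  # a kill here loses at most one chunk
        print(f"finalized {start + len(chunk)}/{len(updates)}", flush=True)
    return len(updates)
```

Each UPDATE sets an absolute value derived from the images table, so running the function again after an interruption produces the same result.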

Update DEPLOYMENT.md for the recent init_download_db.py changes:

- Phase 1 qsub example now passes --reset and explains the wrapper forwards "$@" through, so --legacy-only and --finalize-only work the same way.
- Options table gains a --finalize-only row.
- "Re-running is safe" rewritten as three resumable stages: ingest (--reset), legacy import (--legacy-only), finalize (--finalize-only).
- Troubleshooting gets a row for jobs killed during "Recomputing gbifID statuses".

Cross-link gbif-metadata-download.md to DEPLOYMENT.md so the GBIF query how-to ends pointing at the project-specific build/download pipeline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

status_report.py was opening the DB read-write with no busy timeout, so a brief WAL contention window from the running downloader could surface as `sqlite3.OperationalError: disk I/O error` mid-report. Open the connection read-only via the URI form and set a 60 s busy timeout so SQLite waits through a writer checkpoint instead of erroring out.

Add db_integrity_check.sh: a qsub wrapper that runs PRAGMA integrity_check (read-only against the live DB, safe alongside an active writer in WAL mode) and prints fresh row counts at the end so results can be cross-checked against status_report.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
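For reference, a read-only URI connection with a busy timeout can be opened with the standard `sqlite3` module roughly as below. The helper name `open_readonly` is hypothetical, and setting the timeout through `sqlite3.connect(timeout=...)` is an assumption; the PR may use `PRAGMA busy_timeout` instead.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path, busy_timeout_s=60.0):
    """Open an existing SQLite file read-only, waiting on locks up to the timeout."""
    # as_uri() yields a percent-encoded file:// URI; mode=ro forbids writes.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True, timeout=busy_timeout_s)
```

With `mode=ro`, any accidental write from the report raises `sqlite3.OperationalError` instead of touching the live database.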

Brief WAL checkpoint races on GPFS occasionally surface as transient `disk I/O error` or `file is not a database` when reading the live DB while image_install_db is writing. PRAGMA integrity_check confirmed the DB itself is fine; the read is just unlucky.

Add _q_all() / _q_one() helpers that wrap conn.execute().fetchall() / .fetchone() with up to three attempts, sleeping 0.5 s and 2 s between tries on sqlite3.OperationalError / DatabaseError. Switch every query call site in main() to go through them. If all three attempts still fail, the original exception propagates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
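A retry wrapper along the lines described might look like this sketch. The injectable `sleep` parameter and the module-level `RETRY_DELAYS` constant are additions for testability, not details confirmed by the PR.

```python
import sqlite3
import time

RETRY_DELAYS = (0.5, 2.0)  # seconds to sleep after the first and second failure


def q_all(conn, sql, params=(), delays=RETRY_DELAYS, sleep=time.sleep):
    """Run a query and fetch all rows, retrying transient SQLite errors."""
    for delay in delays:
        try:
            return conn.execute(sql, params).fetchall()
        except sqlite3.DatabaseError:
            # OperationalError is a subclass of DatabaseError, so this covers both.
            sleep(delay)
    # Third and final attempt: any exception now propagates to the caller.
    return conn.execute(sql, params).fetchall()
```

A `q_one` variant would differ only in calling `.fetchone()` instead of `.fetchall()`.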

The `images` column is `urls` (plural -- newline-joined candidate URLs for a distinct image, since IIIF manifest + resolution variants collapse into one row). The example was carrying over the old per-URL column name and would have failed with "no such column: url".

Also add LIMIT 50 so the example does not dump every retryable row.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the Pushover POST in send_notification() with a Slack incoming webhook POST. SLACK_WEBHOOK_URL is read from .env; without it set the function logs a one-line warning and silently no-ops so the downloader keeps working without notifications. A try/except wraps the POST so a Slack hiccup never interrupts the caller.

Update the README:

- Add a "Push notifications (optional)" callout in the image_install_db.py section that points at the notifications.py setup.
- Rewrite the notifications.py setup as a five-step Slack walkthrough (create Slack app -> enable Incoming Webhooks -> add to channel -> paste URL into .env -> chmod 600 + verify with a one-liner).
- Add a db_integrity_check.sh section describing when/how to run the integrity check (pause downloader -> qdel -> qsub -> resume).
- Fix the example query to use `urls` (plural) and add LIMIT.

Add `.env` to .gitignore so the Slack webhook URL never gets committed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
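A webhook-based send_notification() with the described no-op and try/except behaviour could be sketched as follows. This version uses only the standard library and assumes SLACK_WEBHOOK_URL has already been loaded from .env into the process environment; the boolean return value and the `timeout` parameter are illustrative additions.

```python
import json
import logging
import os
import urllib.request


def send_notification(message, timeout=10):
    """POST a message to a Slack incoming webhook; never raise to the caller."""
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        # No webhook configured: warn once per call and carry on.
        logging.warning("SLACK_WEBHOOK_URL not set; skipping notification")
        return False
    try:
        # Slack incoming webhooks accept a JSON body with a "text" field.
        request = urllib.request.Request(
            url,
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except Exception as exc:  # a Slack hiccup must not interrupt the downloader
        logging.warning("Slack notification failed: %s", exc)
        return False
```

Returning a boolean instead of raising lets the downloader log the outcome without adding its own error handling around every call.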

trgardos added 3 commits May 21, 2026 09:13

ignore pycache

f6b4b4c

new sqlite based download tracking

5f70e57

ignore generated summary files

0b59c51

Jiahao419 mentioned this pull request May 21, 2026

Download all images per gbifID with -NN suffix, plus circuit breaker … gardoslab/herbdl#34

Draft

trgardos and others added 16 commits May 21, 2026 13:37

ignore batch scheduler log files

8f3e299

add legacy-only mode to finish reading processed IDs, and allow concu…

8a9d326

…rrent read-only access to db

doc on how to download GBIF metadata

5abbaea

flush printing so it shows up in logs right away

ed94e60

add schema description for occurrence.txt and multimedia.txt files

26c4169

document the location of the other processed_ids.txt file

0197d99

add documentation on db_integrity_check.sh

9ec6b41

change slack notification interval to 10000

7811b97

trgardos wants to merge 19 commits into main from db
