…rrent read-only access to db
Rework the SQLite download tracking to count and download one file per distinct image instead of one per multimedia.txt row.

- download_db: images table is now one row per distinct image (image_no, image_key, packed candidate urls). canonical_image_key() collapses a IIIF manifest and the resolution variants of one specimen photo to a single key.
- init_download_db: ingest groups multimedia rows into distinct images, ordering candidate URLs highest-resolution-first.
- image_install_db: downloads one file per distinct image; detects HTML/text responses (Content-Type header + body sniff) and captures the page text into error_detail for follow-up; keeps undecodable camera-raw DNG as <id>-NN.dng flagged raw_unprocessed instead of discarding it.
- status_report: distinct-image counts, non-image-responses-by-host section, raw_unprocessed count.
- init_download_db.sh: take the build mode (--reset / --legacy-only) from the qsub command line.

The images table schema changed, so this requires a --reset rebuild.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
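For reviewers, the "Content-Type header + body sniff" check could look roughly like the sketch below. This is not the code in image_install_db; the function name `looks_like_text_response`, the list of media types, and the 512-byte sniff window are all assumptions made for illustration.

```python
from typing import Optional

# Media types treated as "not an image" (an assumed list, not taken from the PR).
_TEXT_MEDIA_TYPES = {"application/json", "application/xhtml+xml"}
# Leading byte patterns that suggest an HTML page (also an assumption).
_HTML_PREFIXES = (b"<!doctype html", b"<html")


def looks_like_text_response(content_type: Optional[str], body: bytes) -> bool:
    """Return True when a response is probably an HTML/text page, not image bytes."""
    # Header check: drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type.startswith("text/") or media_type in _TEXT_MEDIA_TYPES:
        return True
    # Body sniff: some servers send an error page under an image Content-Type,
    # so inspect the first bytes regardless of what the header claimed.
    head = body[:512].lstrip().lower()
    return head.startswith(_HTML_PREFIXES)
```

A caller would run this before attempting to decode the payload and, on a True result, store the decoded page text in error_detail rather than writing a file.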
The previous ingest used df.groupby([gbifID, image_no]).agg("\n".join,
image_key="first", host="first") to build one row per distinct image.
With ~52M groups and a Python callable, that one statement did not
finish inside the 12 h qsub limit -- the job died before ever reaching
the first INSERT, so nothing was committed.
Replace it with a single-pass streaming groupby that walks the already-
sorted rows in native Python lists and emits one INSERT batch per chunk
of distinct images (commit every 200k images, so partial progress is
saved on a future kill too). Replace the pandas gbif_ids build with one
SQL `INSERT INTO gbif_ids SELECT gbif_id, COUNT(*) FROM images
GROUP BY gbif_id` which rides the (gbif_id, image_no) primary-key index.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
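A minimal sketch of the single-pass streaming groupby described above, for readers who want the shape of it without opening the diff. It assumes the rows arrive already sorted by (gbif_id, image_no) as 4-tuples; the function name `ingest_sorted_rows`, the tuple layout, and the exact column names are hypothetical rather than copied from init_download_db.py.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter


def ingest_sorted_rows(conn, rows, batch_size=200_000):
    """Insert one images row per distinct (gbif_id, image_no); return rows inserted.

    rows: iterable of (gbif_id, image_no, image_key, url) tuples, already
    sorted by (gbif_id, image_no). itertools.groupby only merges adjacent
    rows, which is why the sort order matters.
    """
    insert_sql = (
        "INSERT INTO images (gbif_id, image_no, image_key, urls) VALUES (?, ?, ?, ?)"
    )
    batch, total = [], 0
    for (gbif_id, image_no), group in groupby(rows, key=itemgetter(0, 1)):
        members = list(group)
        image_key = members[0][2]                   # same as agg "first"
        urls = "\n".join(m[3] for m in members)     # newline-packed candidates
        batch.append((gbif_id, image_no, image_key, urls))
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            conn.commit()                           # partial progress survives a kill
            total += len(batch)
            batch.clear()
    if batch:                                       # flush the final short batch
        conn.executemany(insert_sql, batch)
        conn.commit()
        total += len(batch)
    # Build gbif_ids in SQL instead of pandas, as the commit message describes.
    conn.execute(
        "INSERT INTO gbif_ids (gbif_id, n_images) "
        "SELECT gbif_id, COUNT(*) FROM images GROUP BY gbif_id"
    )
    conn.commit()
    return total
```

Memory stays bounded by one batch plus one group, and no per-group pandas callable is involved.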
The final step of the legacy import rolled images.status up into gbif_ids.status via one giant UPDATE with three correlated COUNT subqueries plus an `IN (SELECT DISTINCT gbif_id FROM images ...)` filter. SQLite's planner pessimised it badly enough that it did not finish in 24 h, and because the whole thing was a single transaction, nothing was committed and no progress was visible.

Replace it with _finalize_gbif_ids_status(): one SELECT gbif_id, COUNT(*) FROM images WHERE status='success' GROUP BY gbif_id to materialise per-gbifID success counts, an in-memory n_images map from gbif_ids, then chunked UPDATEs of 50k rows each with a commit and a progress print per batch. A kill mid-finalize now leaves a partially updated table and re-running is a safe idempotent retry.

Add --finalize-only so the finalize can be re-run without redoing the disk re-scan -- useful when import_legacy completed its per-image flushes but the recompute did not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
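The chunked finalize could be sketched as follows. Only the overall shape (one GROUP BY, an in-memory n_images map, 50k-row UPDATE batches with a commit and a print each) comes from the commit message. The roll-up rule and the 'pending' / 'partial' labels are placeholders, because the message names only the 'success' status; the function name is also hypothetical.

```python
import sqlite3


def finalize_gbif_ids_status(conn, chunk_size=50_000):
    """Recompute gbif_ids.status from images.status in committed chunks."""
    # One aggregate pass instead of correlated subqueries per gbif_id.
    success = dict(conn.execute(
        "SELECT gbif_id, COUNT(*) FROM images "
        "WHERE status='success' GROUP BY gbif_id"
    ))
    n_images = dict(conn.execute("SELECT gbif_id, n_images FROM gbif_ids"))

    updates = []
    for gbif_id, n in n_images.items():
        ok = success.get(gbif_id, 0)
        # Assumed roll-up rule, for illustration only.
        if ok == 0:
            status = "pending"
        elif ok >= n:
            status = "success"
        else:
            status = "partial"
        updates.append((status, gbif_id))

    for start in range(0, len(updates), chunk_size):
        chunk = updates[start:start + chunk_size]
        conn.executemany("UPDATE gbif_ids SET status = ? WHERE gbif_id = ?", chunk)
        conn.commit()  # each batch is durable, so a kill loses at most one chunk
        print(f"finalize: {start + len(chunk)}/{len(updates)} gbifIDs updated")
    return len(updates)
```

Because every UPDATE writes a value derived only from the current images table, running the function again after an interruption produces the same final state.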
Update DEPLOYMENT.md for the recent init_download_db.py changes:

- Phase 1 qsub example now passes --reset and explains the wrapper forwards "$@" through, so --legacy-only and --finalize-only work the same way.
- Options table gains a --finalize-only row.
- "Re-running is safe" rewritten as three resumable stages: ingest (--reset), legacy import (--legacy-only), finalize (--finalize-only).
- Troubleshooting gets a row for jobs killed during "Recomputing gbifID statuses".

Cross-link gbif-metadata-download.md to DEPLOYMENT.md so the GBIF query how-to ends pointing at the project-specific build/download pipeline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
status_report.py was opening the DB read-write with no busy timeout, so a brief WAL contention window from the running downloader could surface as `sqlite3.OperationalError: disk I/O error` mid-report. Open the connection read-only via the URI form and set a 60 s busy timeout so SQLite waits through a writer checkpoint instead of erroring out.

Add db_integrity_check.sh: a qsub wrapper that runs PRAGMA integrity_check (read-only against the live DB, safe alongside an active writer in WAL mode) and prints fresh row counts at the end so results can be cross-checked against status_report.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
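For reference, a read-only URI connection with a 60 s busy timeout can be opened with the standard sqlite3 module roughly as below. The helper name `open_readonly` is hypothetical, and status_report.py may build the URI differently.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path, busy_timeout_s=60.0):
    """Open an SQLite database read-only with a busy timeout, via the URI form."""
    # as_uri() yields a percent-encoded file:// URI; mode=ro forbids writes.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    # timeout is in seconds: how long to wait on a locked database
    # before raising sqlite3.OperationalError.
    return sqlite3.connect(uri, uri=True, timeout=busy_timeout_s)
```

Any write attempted through this connection raises sqlite3.OperationalError, so a reporting script cannot modify the live DB by accident.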
Brief WAL checkpoint races on GPFS occasionally surface as transient `disk I/O error` or `file is not a database` when reading the live DB while image_install_db is writing. PRAGMA integrity_check confirmed the DB itself is fine; the read is just unlucky.

Add _q_all() / _q_one() helpers that wrap conn.execute().fetchall() / .fetchone() with up to three attempts, sleeping 0.5 s and 2 s between tries on sqlite3.OperationalError / DatabaseError. Switch every query call site in main() to go through them. If all three attempts still fail, the original exception propagates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `images` column is `urls` (plural -- newline-joined candidate URLs for a distinct image, since IIIF manifest + resolution variants collapse into one row). The example was carrying over the old per-URL column name and would have failed with "no such column: url". Also add LIMIT 50 so the example does not dump every retryable row.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the Pushover POST in send_notification() with a Slack incoming webhook POST. SLACK_WEBHOOK_URL is read from .env; without it set the function logs a one-line warning and silently no-ops so the downloader keeps working without notifications. A try/except wraps the POST so a Slack hiccup never interrupts the caller.

Update the README:

- Add a "Push notifications (optional)" callout in the image_install_db.py section that points at the notifications.py setup.
- Rewrite the notifications.py setup as a five-step Slack walkthrough (create Slack app -> enable Incoming Webhooks -> add to channel -> paste URL into .env -> chmod 600 + verify with a one-liner).
- Add a db_integrity_check.sh section describing when/how to run the integrity check (pause downloader -> qdel -> qsub -> resume).
- Fix the example query to use `urls` (plural) and add LIMIT.

Add `.env` to .gitignore so the Slack webhook URL never gets committed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
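The webhook behaviour described above could be approximated with the standard library as in this sketch. It assumes the contents of .env have already been loaded into the process environment (the loading mechanism is not shown in this PR text), and the `webhook_url` and `timeout` parameters plus the boolean return value are illustrative additions.

```python
import json
import logging
import os
import urllib.request

log = logging.getLogger(__name__)


def send_notification(message, webhook_url=None, timeout=10):
    """POST a message to a Slack incoming webhook; never raise to the caller."""
    url = webhook_url or os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        # No webhook configured: warn once per call and carry on.
        log.warning("SLACK_WEBHOOK_URL not set; skipping notification")
        return False
    # Slack incoming webhooks accept a JSON body with a "text" field.
    payload = json.dumps({"text": message}).encode("utf-8")
    try:
        request = urllib.request.Request(
            url,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except Exception as exc:  # a Slack hiccup must not interrupt the downloader
        log.warning("Slack notification failed: %s", exc)
        return False
```

Building the Request inside the try block means a malformed URL in .env is also logged and swallowed rather than raised.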
No description provided.