
switch to sqlite database based download tracking#1

Open
trgardos wants to merge 19 commits into main from db

@trgardos
Contributor

No description provided.

trgardos and others added 16 commits May 21, 2026 13:37
Rework the SQLite download tracking to count and download one file per
distinct image instead of one per multimedia.txt row.

- download_db: images table is now one row per distinct image
  (image_no, image_key, packed candidate urls). canonical_image_key()
  collapses a IIIF manifest and the resolution variants of one specimen
  photo to a single key.
- init_download_db: ingest groups multimedia rows into distinct images,
  ordering candidate URLs highest-resolution-first.
- image_install_db: downloads one file per distinct image; detects
  HTML/text responses (Content-Type header + body sniff) and captures
  the page text into error_detail for follow-up; keeps undecodable
  camera-raw DNG as <id>-NN.dng flagged raw_unprocessed instead of
  discarding it.
- status_report: distinct-image counts, non-image-responses-by-host
  section, raw_unprocessed count.
- init_download_db.sh: take the build mode (--reset / --legacy-only)
  from the qsub command line.

The images table schema changed, so this requires a --reset rebuild.
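
The non-image detection described for image_install_db (Content-Type header plus body sniff) could look roughly like the sketch below. The function name, the 512-byte sniff window, and the list of markers are assumptions for illustration, not the code in this PR.

```python
from typing import Optional

# Hypothetical markers that suggest an HTML page rather than image bytes.
_HTML_MARKERS = (b"<!doctype html", b"<html", b"<head", b"<body")


def looks_like_html_or_text(content_type: Optional[str], body: bytes) -> bool:
    """Return True if a response is probably an HTML/text page, not an image."""
    # Header check: drop parameters such as "; charset=utf-8" before comparing.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime.startswith("text/") or mime == "application/xhtml+xml":
        return True
    # Body sniff: some servers label an error page as image/jpeg, so also
    # look at the first bytes of the payload (assumed 512-byte window).
    head = body[:512].lstrip().lower()
    return head.startswith(_HTML_MARKERS)
```

A caller would presumably store the decoded page text in error_detail when this returns True, instead of writing the bytes to disk as an image.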

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous ingest used df.groupby([gbifID, image_no]).agg("\n".join,
image_key="first", host="first") to build one row per distinct image.
With ~52M groups and a Python callable, that one statement did not
finish inside the 12 h qsub limit -- the job died before ever reaching
the first INSERT, so nothing was committed.

Replace it with a single-pass streaming groupby that walks the already-
sorted rows in native Python lists and emits one INSERT batch per chunk
of distinct images (commit every 200k images, so partial progress is
saved on a future kill too). Replace the pandas gbif_ids build with one
SQL `INSERT INTO gbif_ids SELECT gbif_id, COUNT(*) FROM images
GROUP BY gbif_id` which rides the (gbif_id, image_no) primary-key index.
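
A minimal sketch of the single-pass approach, assuming rows arrive as (gbif_id, image_no, image_key, url) tuples already sorted by (gbif_id, image_no). The table layout, the n_images column name, and the function names are illustrative guesses based on this message, not the actual init_download_db.py code.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter


def create_schema(conn):
    """Assumed minimal schema; the real tables have more columns."""
    conn.execute(
        "CREATE TABLE images (gbif_id INTEGER, image_no INTEGER, image_key TEXT,"
        " urls TEXT, PRIMARY KEY (gbif_id, image_no))"
    )
    conn.execute("CREATE TABLE gbif_ids (gbif_id INTEGER PRIMARY KEY, n_images INTEGER)")


def ingest_sorted_rows(conn, rows, batch_size=200_000):
    """Insert one row per distinct image and return how many were inserted."""
    insert_sql = "INSERT INTO images (gbif_id, image_no, image_key, urls) VALUES (?, ?, ?, ?)"
    batch, total = [], 0
    # groupby only merges adjacent rows, which is why the input must be sorted.
    for (gbif_id, image_no), group in groupby(rows, key=itemgetter(0, 1)):
        group = list(group)
        urls = "\n".join(row[3] for row in group)  # newline-joined candidates
        batch.append((gbif_id, image_no, group[0][2], urls))
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            conn.commit()  # partial progress survives a later kill
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(insert_sql, batch)
        conn.commit()
        total += len(batch)
    # Build gbif_ids in SQL rather than in pandas.
    conn.execute(
        "INSERT INTO gbif_ids (gbif_id, n_images)"
        " SELECT gbif_id, COUNT(*) FROM images GROUP BY gbif_id"
    )
    conn.commit()
    return total
```

Memory stays bounded by batch_size because only the current group and the pending batch are held in Python lists at any time.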

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The final step of the legacy import rolled images.status up into
gbif_ids.status via one giant UPDATE with three correlated COUNT
subqueries plus an `IN (SELECT DISTINCT gbif_id FROM images ...)`
filter. SQLite's planner pessimised it badly enough that it did not
finish in 24 h, and because the whole thing was a single transaction,
nothing was committed and no progress was visible.

Replace it with _finalize_gbif_ids_status(): one
  SELECT gbif_id, COUNT(*) FROM images WHERE status='success' GROUP BY gbif_id
to materialise per-gbifID success counts, an in-memory n_images map
from gbif_ids, then chunked UPDATEs of 50k rows each with a commit and
a progress print per batch. A kill mid-finalize now leaves a partially
updated table and re-running is a safe idempotent retry.
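
The chunked finalize might be sketched as follows. The status labels ("pending", "partial", "success") and the rollup rule are placeholders, since this message does not spell out the real rules; only the overall shape (one GROUP BY, an in-memory map, committed batches) follows the description above.

```python
import sqlite3


def finalize_gbif_ids_status(conn, chunk_size=50_000):
    """Roll images.status up into gbif_ids.status in committed batches."""
    # One aggregate query instead of correlated subqueries per row.
    success = dict(conn.execute(
        "SELECT gbif_id, COUNT(*) FROM images WHERE status='success' GROUP BY gbif_id"
    ))
    n_images = dict(conn.execute("SELECT gbif_id, n_images FROM gbif_ids"))

    updates = []
    for gbif_id, expected in n_images.items():
        ok = success.get(gbif_id, 0)
        # Hypothetical rollup rule, for illustration only.
        if ok == 0:
            status = "pending"
        elif ok < expected:
            status = "partial"
        else:
            status = "success"
        updates.append((status, gbif_id))

    for start in range(0, len(updates), chunk_size):
        chunk = updates[start:start + chunk_size]
        conn.executemany("UPDATE gbif_ids SET status = ? WHERE gbif_id = ?", chunk)
        conn.commit()  # each batch is durable on its own
        print(f"finalized {start + len(chunk)}/{len(updates)} gbifIDs")
    return len(updates)
```

Because every UPDATE writes a value derived only from the current images table, running the function again after a kill rewrites the same values, which is what makes the retry idempotent.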

Add --finalize-only so the finalize can be re-run without redoing the
disk re-scan -- useful when import_legacy completed its per-image
flushes but the recompute did not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update DEPLOYMENT.md for the recent init_download_db.py changes:
- Phase 1 qsub example now passes --reset and explains the wrapper
  forwards "$@" through, so --legacy-only and --finalize-only work the
  same way.
- Options table gains a --finalize-only row.
- "Re-running is safe" rewritten as three resumable stages: ingest
  (--reset), legacy import (--legacy-only), finalize (--finalize-only).
- Troubleshooting gets a row for jobs killed during "Recomputing gbifID
  statuses".

Cross-link gbif-metadata-download.md to DEPLOYMENT.md so the GBIF query
how-to ends pointing at the project-specific build/download pipeline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
status_report.py was opening the DB read-write with no busy timeout, so
a brief WAL contention window from the running downloader could surface
as `sqlite3.OperationalError: disk I/O error` mid-report. Open the
connection read-only via the URI form and set a 60 s busy timeout so
SQLite waits through a writer checkpoint instead of erroring out.
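
For reference, a read-only connection with a busy timeout can be opened with the standard sqlite3 module roughly like this. The helper name is hypothetical; the `mode=ro` URI parameter and the `timeout` argument are standard.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path, busy_timeout_s=60.0):
    """Open an existing SQLite file read-only with a busy timeout."""
    # as_uri() yields a percent-encoded file:// URI; mode=ro forbids writes.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    # timeout is how long sqlite3 waits on a locked database before raising.
    return sqlite3.connect(uri, uri=True, timeout=busy_timeout_s)
```

Any write attempted through this connection raises sqlite3.OperationalError, so a reporting script cannot modify the live database by accident.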

Add db_integrity_check.sh: a qsub wrapper that runs PRAGMA
integrity_check (read-only against the live DB, safe alongside an
active writer in WAL mode) and prints fresh row counts at the end so
results can be cross-checked against status_report.py.
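
The check itself is a single PRAGMA. The sketch below shows the same idea from Python rather than from the shell wrapper; the function name and the table list are assumptions.

```python
import sqlite3


def integrity_report(conn, tables=("images", "gbif_ids")):
    """Run PRAGMA integrity_check and collect row counts for cross-checking."""
    # A healthy database returns exactly one row containing the string "ok".
    findings = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    ok = findings == ["ok"]
    # Table names come from a fixed tuple, not user input, so formatting is safe here.
    counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}
    return ok, findings, counts
```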

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brief WAL checkpoint races on GPFS occasionally surface as transient
`disk I/O error` or `file is not a database` when reading the live DB
while image_install_db is writing. PRAGMA integrity_check confirmed the
DB itself is fine; the read is just unlucky.

Add _q_all() / _q_one() helpers that wrap conn.execute().fetchall() /
.fetchone() with up to three attempts, sleeping 0.5 s and 2 s between
tries on sqlite3.OperationalError / DatabaseError. Switch every query
call site in main() to go through them. If all three attempts still
fail, the original exception propagates.
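
A sketch of the retry wrapper, assuming the helpers take the connection as an argument. The real _q_all() / _q_one() signatures are not shown in this message, so the names and parameters below are illustrative.

```python
import sqlite3
import time

RETRY_SLEEPS = (0.5, 2.0)  # two sleeps between tries means three attempts


def with_retry(fn, sleeps=RETRY_SLEEPS):
    """Call fn(), retrying on transient SQLite errors."""
    for delay in sleeps:
        try:
            return fn()
        except sqlite3.DatabaseError:  # OperationalError is a subclass
            time.sleep(delay)
    # Final attempt: any exception raised here propagates to the caller.
    return fn()


def q_all(conn, sql, params=(), sleeps=RETRY_SLEEPS):
    return with_retry(lambda: conn.execute(sql, params).fetchall(), sleeps)


def q_one(conn, sql, params=(), sleeps=RETRY_SLEEPS):
    return with_retry(lambda: conn.execute(sql, params).fetchone(), sleeps)
```

Re-executing the whole statement on each attempt, rather than only the fetch, keeps the retry safe for read-only queries because there is no partially consumed cursor to resume.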

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `images` column is `urls` (plural -- newline-joined candidate URLs
for a distinct image, since IIIF manifest + resolution variants collapse
into one row). The example was carrying over the old per-URL column name
and would have failed with "no such column: url". Also add LIMIT 50 so
the example does not dump every retryable row.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the Pushover POST in send_notification() with a Slack incoming
webhook POST. SLACK_WEBHOOK_URL is read from .env; if it is not set, the
function logs a one-line warning and returns without sending, so the downloader
keeps working without notifications. A try/except wraps the POST so a
Slack hiccup never interrupts the caller.
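
A rough sketch of that behaviour using only the standard library. It assumes the .env file has already been loaded into the process environment by some other means; the boolean return value and the 10 s timeout are illustrative choices, not necessarily what notifications.py does.

```python
import json
import logging
import os
import urllib.request

log = logging.getLogger("notifications")


def send_notification(message, timeout=10):
    """POST a message to a Slack incoming webhook; never raise to the caller."""
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        log.warning("SLACK_WEBHOOK_URL not set; skipping notification")
        return False
    try:
        # Slack incoming webhooks accept a JSON body with a "text" field.
        payload = json.dumps({"text": message}).encode("utf-8")
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except Exception as exc:  # a Slack or network failure must not stop the downloader
        log.warning("Slack notification failed: %s", exc)
        return False
```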

Update the README:
- Add a "Push notifications (optional)" callout in the image_install_db.py
  section that points at the notifications.py setup.
- Rewrite the notifications.py setup as a five-step Slack walkthrough
  (create Slack app -> enable Incoming Webhooks -> add to channel ->
   paste URL into .env -> chmod 600 + verify with a one-liner).
- Add a db_integrity_check.sh section describing when/how to run the
  integrity check (pause downloader -> qdel -> qsub -> resume).
- Fix the example query to use `urls` (plural) and add LIMIT.

Add `.env` to .gitignore so the Slack webhook URL never gets committed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>