This directory contains utility scripts for managing herbarium specimen images, including downloading, processing, organizing, and labeling datasets.
Purpose: Primary script for downloading herbarium specimen images from GBIF (Global Biodiversity Information Facility) multimedia datasets.
Key Features:
- Parallel downloading with ThreadPoolExecutor (5 workers)
- Host-based rate limiting and circuit breaker pattern
- Duplicate detection across multiple GBIF datasets
- IIIF (International Image Interoperability Framework) manifest support
- Automatic image resizing to 1024px max dimension
- Checkpoint system for resumable downloads
- Failed download tracking
- Hierarchical directory organization (3-digit prefix structure)
Usage:
python image_install_parallel.py [-c COUNTRY_CODE]
Configuration:
- Input: /projectnb/herbdl/data/GBIF-F25/multimedia.txt
- Output: /projectnb/herbdl/data/GBIF-F25h/
- Logs: /projectnb/herbdl/logs/image_install_*.log
- Checkpoints: processed_ids.txt, failed_ids.txt
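The checkpoint files make a run resumable: IDs already recorded are skipped on restart. The following is a minimal sketch of that idea, assuming one GBIF ID per line in processed_ids.txt. The helper names (`load_ids`, `record_id`, `pending_ids`) are hypothetical and are not taken from image_install_parallel.py.

```python
from pathlib import Path


def load_ids(path: Path) -> set[str]:
    """Return the IDs recorded in a checkpoint file (empty set if it is missing)."""
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def record_id(path: Path, gbif_id: str) -> None:
    """Append one ID immediately so progress survives an interruption."""
    with path.open("a") as handle:
        handle.write(f"{gbif_id}\n")


def pending_ids(all_ids: list[str], processed_path: Path) -> list[str]:
    """Keep only the IDs that no earlier run has recorded as processed."""
    done = load_ids(processed_path)
    return [gbif_id for gbif_id in all_ids if gbif_id not in done]
```

Appending after every completed download, rather than rewriting the file at the end, is what lets an interrupted job pick up where it stopped.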
Advanced Features:
- Host cooldown on rate limiting (429 errors): 30 minutes default
- Host cooldown on timeouts: 60 minutes
- Circuit breaker: Permanently blocks hosts after 50+ errors
- Multiple URL fallback per GBIF ID
- Retry strategy with backoff for 500-level errors
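The cooldown and circuit breaker rules above can be sketched as a small per-host guard. This is an illustration under stated assumptions, not the script's actual code: the class name `HostGuard` and its methods are hypothetical, and the sketch assumes that rate-limit and timeout events also count toward the 50-error threshold.

```python
import threading
import time
from urllib.parse import urlparse

RATE_LIMIT_COOLDOWN_S = 30 * 60  # 429 responses: 30 minutes
TIMEOUT_COOLDOWN_S = 60 * 60     # timeouts: 60 minutes
MAX_ERRORS_PER_HOST = 50         # circuit breaker threshold


class HostGuard:
    """Tracks cooldowns and error counts per host (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()  # the downloader runs several worker threads
        self._cooldown_until = {}
        self._errors = {}

    def allowed(self, url):
        """Return True if the URL's host is neither blocked nor cooling down."""
        host = urlparse(url).netloc
        with self._lock:
            if self._errors.get(host, 0) >= MAX_ERRORS_PER_HOST:
                return False  # circuit open: host is blocked for the rest of the run
            return self._clock() >= self._cooldown_until.get(host, float("-inf"))

    def report_error(self, url, cooldown_s=0.0):
        """Count one error and optionally extend the host's cooldown."""
        host = urlparse(url).netloc
        with self._lock:
            self._errors[host] = self._errors.get(host, 0) + 1
            if cooldown_s > 0:
                until = self._clock() + cooldown_s
                previous = self._cooldown_until.get(host, float("-inf"))
                self._cooldown_until[host] = max(until, previous)

    def report_rate_limited(self, url):
        self.report_error(url, RATE_LIMIT_COOLDOWN_S)

    def report_timeout(self, url):
        self.report_error(url, TIMEOUT_COOLDOWN_S)
```

A worker would call `allowed(url)` before each request and fall back to the next URL for the same GBIF ID when it returns False.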
Purpose: SCC job submission wrapper for image_install_parallel.py.
Usage:
qsub -N image_install -l h_rt=48:00:00 -pe omp 16 -P herbdl -m beas -M your_email@bu.edu image_install.sh
Purpose: Core image processing utilities used by other scripts.
Functions:
- get_file_size_in_mb(file_path): Returns file size in megabytes
- resize_with_aspect_ratio(image_path, output_path, max_size=1600, format="JPEG", quality=85):
  - Downscales images preserving aspect ratio
  - Handles alpha channels (RGBA/LA/P with transparency)
  - Converts to RGB JPEG format
  - Returns (changed: bool, final_size: tuple)
Key Features:
- Safe alpha channel removal with white background
- Progressive JPEG encoding
- Optimized output
- LANCZOS resampling for high-quality downscaling
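The aspect-ratio arithmetic behind the downscaling step can be sketched without any imaging library. The function name `fit_within` is hypothetical, and rounding to the nearest pixel is an assumption. The actual resize_with_aspect_ratio additionally handles alpha channels and JPEG encoding through Pillow.

```python
def fit_within(width: int, height: int, max_size: int = 1600) -> tuple[int, int]:
    """Scale (width, height) so that the longer side is at most max_size."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already small enough; never upscale
    scale = max_size / longest
    # Both sides use the same factor, which preserves the aspect ratio.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 3200x1600 image with the default limit becomes 1600x800, while an 800x600 image is left unchanged.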
Purpose: Batch resize images in a directory using parallel processing.
Configuration:
- Input directory: /projectnb/herbdl/data/harvard-herbaria/images
- Target size: Images > 2MB are resized
- Workers: 10 parallel threads
- Log: image_resize.log
Usage:
python resize_images.py
Purpose: SCC job submission wrapper for resize_images.py.
Usage:
qsub -l h_rt=24:00:00 -pe omp 10 -P herbdl -m beas -M your_email@bu.edu image_resize.sh
Purpose: Compress images to 2MB target size and measure quality degradation using PSNR (Peak Signal-to-Noise Ratio).
How it works:
- Reads images from the ./images/ directory
- Compresses each to 2MB using ImageMagick convert
- Saves compressed versions to the ./compressed/ directory
- Calculates PSNR using FFmpeg and logs to ./logs/
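For reference, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value (255 for 8-bit images) and MSE is the mean squared error between the original and compressed pixels. The script itself delegates this calculation to FFmpeg. The sketch below only illustrates the formula on flat sequences of pixel values, and the function name `psnr` is hypothetical.

```python
import math


def psnr(original, compressed, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(compressed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, compressed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no degradation
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values mean the compressed image is closer to the original.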
Dependencies: ImageMagick, FFmpeg
Usage:
./compression.sh
Purpose: Reorganize images from flat directory structure into hierarchical structure based on GBIF IDs.
How it works:
- Reads images from source directory
- Uses GBIF ID from filename (must be numeric)
- Creates hierarchical structure: prefix1/prefix2/filename.jpg
  - prefix1: First 3 digits of GBIF ID
  - prefix2: Digits 4-6 of GBIF ID
- Skips non-numeric filenames
Configuration:
- Source: /projectnb/herbdl/data/GBIF-F25/images
- Destination: /projectnb/herbdl/data/GBIF-F25h
- Supported formats: jpg, jpeg, png, tif, tiff (case-insensitive)
Example:
Image: 1234567.jpg
→ Moved to: 123/456/1234567.jpg
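The mapping from filename to destination path can be sketched as follows. The function name `hierarchical_path` is hypothetical. The sketch also assumes that IDs shorter than six digits are skipped, which this README does not specify.

```python
from pathlib import PurePosixPath

SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def hierarchical_path(filename: str):
    """Return prefix1/prefix2/filename for a numeric GBIF filename, else None."""
    path = PurePosixPath(filename)
    stem = path.stem
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        return None  # unsupported format (the check is case-insensitive)
    if not (stem.isascii() and stem.isdigit()):
        return None  # non-numeric filenames are skipped
    if len(stem) < 6:
        return None  # assumption: too short to form both prefixes
    return PurePosixPath(stem[:3], stem[3:6], path.name)
```

Calling `hierarchical_path("1234567.jpg")` yields the path 123/456/1234567.jpg, matching the example above.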
Usage:
python reorganize_images.py
Purpose: Process Kaggle Herbarium 2021 and 2022 metadata to create labeled training/validation datasets.
What it does:
- Loads metadata JSON files from Kaggle Herbarium competitions
- Extracts taxonomic information (family, genus, species)
- Generates natural language captions for each specimen
- Encodes scientific names as numeric labels
- Creates 80/20 train/validation splits
- Exports to CSV and JSON formats
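An 80/20 split can be sketched with the standard library alone. This is an assumption-laden illustration: the notebook lists scikit-learn as a dependency and may well use it for splitting, possibly with stratification, whereas the hypothetical `train_val_split` below is a plain seeded shuffle.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=42):
    """Shuffle reproducibly, then split into (train, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

With ten records this returns eight for training and two for validation.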
Output Files:
- train_2022.csv, val_2022.csv (Herbarium 2022)
- train_2021.csv, val_2021.csv (Herbarium 2021)
- JSON versions for direct use with HuggingFace datasets
Columns:
- image_id: Unique identifier
- filename: Relative path to image
- caption: Natural language description
- scientificName: Family + Genus + Species
- family, genus, species: Taxonomic labels
- scientificNameEncoded: Numeric label for classification
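The relationship between scientificName and scientificNameEncoded can be sketched as below. Both helper names are hypothetical, joining the three ranks with spaces is an assumption, and the sorted-index encoding only mirrors what a typical label encoder does. The notebook may use a different mechanism.

```python
def scientific_name(family: str, genus: str, species: str) -> str:
    """Combine the three taxonomic ranks into one label string."""
    return f"{family} {genus} {species}"


def encode_labels(names):
    """Map each distinct name to an integer index over the sorted class list."""
    classes = sorted(set(names))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in names], classes
```

Identical names always receive the same integer, so the encoded column can serve directly as a classification target.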
Caption Format:
"This is an image of species {species}, in the genus {genus} of family {family}. It is part of the collection of institution {institution}."
Purpose: Validate image URLs from GBIF multimedia.txt files to identify broken links.
How it works:
- Reads multimedia.txt with GBIF IDs and image URLs
- Makes HEAD/GET requests to verify accessibility
- Checks Content-Type headers for valid image types
- Logs invalid links and their GBIF IDs
- Uses parallel processing (10 workers)
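The Content-Type check can be sketched as a small predicate. The name `is_image_content_type` is hypothetical, and accepting any `image/*` media type is an assumption, since the exact list of types link_check.py treats as valid is not given here.

```python
def is_image_content_type(content_type):
    """Return True if a Content-Type header value names an image media type."""
    if not content_type:
        return False  # missing header: treat the link as invalid
    # Drop parameters such as "; charset=binary" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")
```

A response with status 200 but a `text/html` body, such as a landing page served in place of the image, would be rejected by this check.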
Configuration:
- Input: /projectnb/herbdl/data/harvard-herbaria/gbif/multimedia.txt
- Log: link_check.log
- Retry strategy: Up to 5 retries with backoff
Usage:
python link_check.py
Purpose: Send push notifications via Pushover API for long-running job monitoring.
Setup:
- Create a .env file with:
PUSHOVER_API_TOKEN=your_token_here
PUSHOVER_USER_KEY=your_user_key_here
Function:
send_notification(title, message)
Usage Example:
from notifications import send_notification
send_notification("Image Installation", "Downloaded 50,000 images")
Integration: Used by image_install_parallel.py to send progress updates every 50,000 images.
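For orientation, a notification function of this shape could look like the sketch below. It is not the repository's implementation: it reads the two credentials from environment variables directly rather than loading .env with python-dotenv, it uses urllib instead of requests, and `build_payload` is a hypothetical helper. The endpoint and the `token`, `user`, `title`, and `message` fields follow the Pushover Message API.

```python
import os
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_payload(title, message, env=os.environ):
    """Assemble the form fields for one Pushover message."""
    return {
        "token": env["PUSHOVER_API_TOKEN"],
        "user": env["PUSHOVER_USER_KEY"],
        "title": title,
        "message": message,
    }


def send_notification(title, message):
    """POST the message to Pushover; return True on HTTP 200."""
    data = urllib.parse.urlencode(build_payload(title, message)).encode()
    request = urllib.request.Request(PUSHOVER_URL, data=data)  # data makes it a POST
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Keeping payload construction separate from the network call makes the credential handling easy to check without sending a real message.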
# Submit parallel download job
qsub -N image_install -l h_rt=48:00:00 -pe omp 16 -P herbdl image_install.sh
# Monitor progress in logs
tail -f /projectnb/herbdl/logs/image_install_*.log
# Reorganize flat structure to hierarchical
python reorganize_images.py
# Submit resize job
qsub -l h_rt=24:00:00 -pe omp 10 -P herbdl image_resize.sh
# Run labeling notebook
jupyter notebook labeling.ipynb
# Check for broken URLs
python link_check.py
Images are organized by GBIF ID prefix for efficient filesystem access:
/projectnb/herbdl/data/GBIF-F25h/
├── 000/
│ ├── 000/
│ │ ├── 000000.jpg
│ │ ├── 000001.jpg
│ ├── 001/
│ │ ├── 000001000.jpg
├── 001/
│ ├── 000/
│ ├── 001/
This structure prevents issues with directories containing millions of files.
Python Libraries:
- pandas
- PIL (Pillow)
- requests
- scikit-learn (for labeling)
- python-dotenv (for notifications)
System Tools:
- ImageMagick (for compression.sh)
- FFmpeg (for compression.sh)
- All scripts are designed for use on Boston University's Shared Computing Cluster (SCC)
- Many scripts use parallel processing for performance
- Checkpoint files enable resumable operations after interruptions
- Always verify paths before running scripts to avoid data loss