Utils Directory

This directory contains utility scripts for managing herbarium specimen images, including downloading, processing, organizing, and labeling datasets.

Scripts Overview

Image Download & Installation

`image_install_parallel.py`

Purpose: Primary script for downloading herbarium specimen images from GBIF (Global Biodiversity Information Facility) multimedia datasets.

Key Features:

Parallel downloading with ThreadPoolExecutor (5 workers)
Host-based rate limiting and circuit breaker pattern
Duplicate detection across multiple GBIF datasets
IIIF (International Image Interoperability Framework) manifest support
Automatic image resizing to 1024px max dimension
Checkpoint system for resumable downloads
Failed download tracking
Hierarchical directory organization (3-digit prefix structure)

Usage:

python image_install_parallel.py [-c COUNTRY_CODE]

Configuration:

Input: /projectnb/herbdl/data/GBIF-F25/multimedia.txt
Output: /projectnb/herbdl/data/GBIF-F25h/
Logs: /projectnb/herbdl/logs/image_install_*.log
Checkpoints: processed_ids.txt, failed_ids.txt

Advanced Features:

Host cooldown on rate limiting (429 errors): 30 minutes default
Host cooldown on timeouts: 60 minutes
Circuit breaker: Permanently blocks hosts after 50+ errors
Multiple URL fallback per GBIF ID
Retry strategy with backoff for 500-level errors

`image_install.sh`

Purpose: SCC job submission wrapper for image_install_parallel.py.

Usage:

qsub -N image_install -l h_rt=48:00:00 -pe omp 16 -P herbdl -m beas -M your_email@bu.edu image_install.sh

Image Processing

`image_utils.py`

Purpose: Core image processing utilities used by other scripts.

Functions:

get_file_size_in_mb(file_path): Returns file size in megabytes
resize_with_aspect_ratio(image_path, output_path, max_size=1600, format="JPEG", quality=85):
- Downscales images preserving aspect ratio
- Handles alpha channels (RGBA/LA/P with transparency)
- Converts to RGB JPEG format
- Returns (changed: bool, final_size: tuple)

Key Features:

Safe alpha channel removal with white background
Progressive JPEG encoding
Optimized output
LANCZOS resampling for high-quality downscaling

`resize_images.py`

Purpose: Batch resize images in a directory using parallel processing.

Configuration:

Input directory: /projectnb/herbdl/data/harvard-herbaria/images
Target size: Images > 2MB are resized
Workers: 10 parallel threads
Log: image_resize.log

Usage:

python resize_images.py

`image_resize.sh`

Purpose: SCC job submission wrapper for resize_images.py.

Usage:

qsub -l h_rt=24:00:00 -pe omp 10 -P herbdl -m beas -M your_email@bu.edu image_resize.sh

`compression.sh`

Purpose: Compress images to 2MB target size and measure quality degradation using PSNR (Peak Signal-to-Noise Ratio).

How it works:

Reads images from ./images/ directory
Compresses each to 2MB using ImageMagick convert
Saves compressed versions to ./compressed/ directory
Calculates PSNR using FFmpeg and logs to ./logs/

Dependencies: ImageMagick, FFmpeg

Usage:

./compression.sh

Image Organization

`reorganize_images.py`

Purpose: Reorganize images from flat directory structure into hierarchical structure based on GBIF IDs.

How it works:

Reads images from source directory
Uses GBIF ID from filename (must be numeric)
Creates hierarchical structure: prefix1/prefix2/filename.jpg
- prefix1: First 3 digits of GBIF ID
- prefix2: Digits 4-6 of GBIF ID
Skips non-numeric filenames

Configuration:

Source: /projectnb/herbdl/data/GBIF-F25/images
Destination: /projectnb/herbdl/data/GBIF-F25h
Supported formats: jpg, jpeg, png, tif, tiff (case-insensitive)

Example:

Image: 1234567.jpg
→ Moved to: 123/456/1234567.jpg

Usage:

python reorganize_images.py

Dataset Labeling

`labeling.ipynb`

Purpose: Process Kaggle Herbarium 2021 and 2022 metadata to create labeled training/validation datasets.

What it does:

Loads metadata JSON files from Kaggle Herbarium competitions
Extracts taxonomic information (family, genus, species)
Generates natural language captions for each specimen
Encodes scientific names as numeric labels
Creates 80/20 train/validation splits
Exports to CSV and JSON formats

Output Files:

train_2022.csv, val_2022.csv (Herbarium 2022)
train_2021.csv, val_2021.csv (Herbarium 2021)
JSON versions for direct use with HuggingFace datasets

Columns:

image_id: Unique identifier
filename: Relative path to image
caption: Natural language description
scientificName: Family + Genus + Species
family, genus, species: Taxonomic labels
scientificNameEncoded: Numeric label for classification

Caption Format:

"This is an image of species {species}, in the genus {genus} of family {family}. It is part of the collection of institution {institution}."

Validation & Monitoring

`link_check.py`

Purpose: Validate image URLs from GBIF multimedia.txt files to identify broken links.

How it works:

Reads multimedia.txt with GBIF IDs and image URLs
Makes HEAD/GET requests to verify accessibility
Checks Content-Type headers for valid image types
Logs invalid links and their GBIF IDs
Uses parallel processing (10 workers)

Configuration:

Input: /projectnb/herbdl/data/harvard-herbaria/gbif/multimedia.txt
Log: link_check.log
Retry strategy: Up to 5 retries with backoff

Usage:

python link_check.py

`notifications.py`

Purpose: Send push notifications via Pushover API for long-running job monitoring.

Setup:

Create a .env file with:

PUSHOVER_API_TOKEN=your_token_here
PUSHOVER_USER_KEY=your_user_key_here

Function:

send_notification(title, message)

Usage Example:

from notifications import send_notification
send_notification("Image Installation", "Downloaded 50,000 images")

Integration: Used by image_install_parallel.py to send progress updates every 50,000 images.

Common Workflows

1. Download GBIF Images

# Submit parallel download job
qsub -N image_install -l h_rt=48:00:00 -pe omp 16 -P herbdl image_install.sh

# Monitor progress in logs
tail -f /projectnb/herbdl/logs/image_install_*.log

2. Organize Downloaded Images

# Reorganize flat structure to hierarchical
python reorganize_images.py

3. Batch Resize Images

# Submit resize job
qsub -l h_rt=24:00:00 -pe omp 10 -P herbdl image_resize.sh

4. Create Training Datasets

# Run labeling notebook
jupyter notebook labeling.ipynb

5. Validate Image Links

# Check for broken URLs
python link_check.py

Directory Structures

Hierarchical Image Storage

Images are organized by GBIF ID prefix for efficient filesystem access:

/projectnb/herbdl/data/GBIF-F25h/
├── 000/
│   ├── 000/
│   │   ├── 000000.jpg
│   │   ├── 000001.jpg
│   ├── 001/
│   │   ├── 000001000.jpg
├── 001/
│   ├── 000/
│   ├── 001/

This structure prevents issues with directories containing millions of files.

Dependencies

Python Libraries:

pandas
PIL (Pillow)
requests
scikit-learn (for labeling)
python-dotenv (for notifications)

System Tools:

ImageMagick (for compression.sh)
FFmpeg (for compression.sh)

Notes

All scripts are designed for use on Boston University's Shared Computing Cluster (SCC)
Many scripts use parallel processing for performance
Checkpoint files enable resumable operations after interruptions
Always verify paths before running scripts to avoid data loss

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Utils Directory

Scripts Overview

Image Download & Installation

`image_install_parallel.py`

`image_install.sh`

Image Processing

`image_utils.py`

`resize_images.py`

`image_resize.sh`

`compression.sh`

Image Organization

`reorganize_images.py`

Dataset Labeling

`labeling.ipynb`

Validation & Monitoring

`link_check.py`

`notifications.py`

Common Workflows

1. Download GBIF Images

2. Organize Downloaded Images

3. Batch Resize Images

4. Create Training Datasets

5. Validate Image Links

Directory Structures

Hierarchical Image Storage

Dependencies

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
README.md		README.md
analyze_image_progress.py		analyze_image_progress.py
analyze_progress.sh		analyze_progress.sh
compression.sh		compression.sh
image_install.sh		image_install.sh
image_install_parallel.py		image_install_parallel.py
image_resize.sh		image_resize.sh
image_utils.py		image_utils.py
labeling.ipynb		labeling.ipynb
link_check.py		link_check.py
notifications.py		notifications.py
reorganize_images.py		reorganize_images.py
requirements.txt		requirements.txt
resize_images.py		resize_images.py

Folders and files

Latest commit

History

Repository files navigation

Utils Directory

Scripts Overview

Image Download & Installation

image_install_parallel.py

image_install.sh

Image Processing

image_utils.py

resize_images.py

image_resize.sh

compression.sh

Image Organization

reorganize_images.py

Dataset Labeling

labeling.ipynb

Validation & Monitoring

link_check.py

notifications.py

Common Workflows

1. Download GBIF Images

2. Organize Downloaded Images

3. Batch Resize Images

4. Create Training Datasets

5. Validate Image Links

Directory Structures

Hierarchical Image Storage

Dependencies

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`image_install_parallel.py`

`image_install.sh`

`image_utils.py`

`resize_images.py`

`image_resize.sh`

`compression.sh`

`reorganize_images.py`

`labeling.ipynb`

`link_check.py`

`notifications.py`

Packages