SEABAD: Southeast Asian Bird Audio Detection Dataset

A dataset of 50,000 automatically curated 3-second clips spanning 1,677 Southeast Asian bird species, standardized to 16 kHz mono for binary bird presence–absence detection on edge devices.

📦 Dataset: zenodo.org/records/18290494
📄 Paper: SEABAD: A Tropical Bird Audio Detection Dataset for Passive Acoustic Monitoring (2025)
💻 Code: github.com/mun3im/seabad


Overview

Passive acoustic monitoring (PAM) enables large-scale biodiversity assessment, but most recordings contain non-informative audio. Bird audio detection (BAD), which determines bird presence without species classification, can suppress non-target recordings on-device, extending deployment time and reducing annotation burden.

SEABAD addresses critical gaps in existing datasets:

  • ✅ Tropical soundscape coverage (Southeast Asia)
  • ✅ 3-second clip length matching edge-AI inference windows
  • ✅ Diversity-aware species balancing (1,677 species, Gini 0.519)
  • ✅ Multi-source negative curation (environmental + field recordings)

| Property | Value |
|---|---|
| Total clips | 50,000 |
| Positive (bird present) | 25,000 |
| Negative (bird absent) | 25,000 |
| Unique bird species | 1,677 |
| Clip duration | 3 seconds |
| Sample rate | 16 kHz mono |
| Bit depth | 16-bit PCM |
| Geography | Indonesia, Malaysia, Thailand, Singapore, Brunei |
| Train / Val / Test split | 40k / 5k / 5k (80% / 10% / 10%) |

🎯 Baseline Validation Results

Standard CNN architectures (ImageNet pre-trained, fine-tuned on SEABAD) averaged across 3 random seeds (42, 100, 786):

| Model | Params | Accuracy | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| MobileNetV3-Small† | 1.1M | 99.57 ± 0.25% | 0.9985 ± 0.0002 | 0.9956 ± 0.0012 | 0.9957 ± 0.0008 | 0.9957 ± 0.0025 |
| EfficientNetB0 | 4.4M | 99.49 ± 0.23% | 0.9991 ± 0.0004 | 0.9959 ± 0.0018 | 0.9939 ± 0.0051 | 0.9949 ± 0.0023 |
| VGG16 | 14.9M | 99.61 ± 0.03% | 0.9995 ± 0.0001 | 0.9960 ± 0.0014 | 0.9963 ± 0.0010 | 0.9961 ± 0.0025 |
| ResNet50 | 24.2M | 99.73 ± 0.02% | 0.9992 ± 0.0003 | 0.9965 ± 0.0013 | 0.9980 ± 0.0012 | 0.9973 ± 0.0019 |

† Primary baseline for edge deployment

Key Findings

  • ✅ All models achieve >99.4% accuracy with low variance across seeds (std ≤0.25%)
  • ✅ MobileNetV3-Small is only 0.16 percentage points behind ResNet50 but has 22× fewer parameters
  • ✅ Results are consistent across architectures and random seeds, indicating stable training
  • ✅ Consistent performance across models supports the quality of the dataset
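The MobileNetV3-Small versus ResNet50 comparison can be rechecked from the table values. A small sketch of the arithmetic (the dictionary layout and variable names are illustrative only):

```python
# Mean accuracy (%) and parameter count (millions), copied from the baseline table.
results = {
    "MobileNetV3-Small": {"params_m": 1.1, "acc": 99.57},
    "EfficientNetB0": {"params_m": 4.4, "acc": 99.49},
    "VGG16": {"params_m": 14.9, "acc": 99.61},
    "ResNet50": {"params_m": 24.2, "acc": 99.73},
}

small = results["MobileNetV3-Small"]
large = results["ResNet50"]

# Accuracy gap in percentage points and parameter-count ratio.
acc_gap_pp = round(large["acc"] - small["acc"], 2)
param_ratio = round(large["params_m"] / small["params_m"])

print(f"accuracy gap: {acc_gap_pp} pp, parameter ratio: {param_ratio}x")
```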

See validation/README.md for detailed experimental setup and analysis.


📂 Repository Structure

seabad/
├── positive-label-curation/   # Stages 1–9: Xeno-Canto bird clips
│   ├── Stage1_xc_fetch_bird_metadata.py
│   ├── Stage2_analyze_metadata.py
│   ├── Stage3_download_and_convert.py
│   ├── Stage4_deduplicate_flac.py
│   ├── Stage5_extract_wav_from_flac.py
│   ├── Stage6_balance_species.py
│   ├── Stage7_qa_spectrograms.py
│   ├── Stage8_adjust_onset.py
│   └── Stage9_qa_apply_corrections.py
│
├── negative-sample-curation/  # Stages 1–6: Non-bird clips
│   ├── Stage1_extract_birdvox.py
│   ├── Stage2_extract_freefield.py
│   ├── Stage3_extract_warblr.py
│   ├── Stage4_extract_fsc22.py
│   ├── Stage5_extract_esc50.py
│   └── Stage6_extract_datasec.py
│
└── validation/                # CNN baseline training & evaluation
    ├── validate_seabad_pretrained.py
    ├── run_all_cnn.sh
    ├── utils.py
    └── README.md

🔧 Curation Pipeline

Six-Stage Methodology

  1. Metadata Acquisition: Xeno-Canto API query for Southeast Asian bird species
  2. Sound Acquisition: Download + convert to 16 kHz mono FLAC
  3. Acoustic Deduplication: FAISS approximate nearest-neighbor on mel embeddings
  4. Segment Extraction: RMS-based sliding window with minimum separation
  5. Species Balancing: Diversity-aware balancing using MiniBatch K-Means + salience ranking
  6. Quality Assurance: Manual audit with interactive correction tool
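To make the segment-extraction step more concrete, here is a minimal pure-Python sketch of an RMS-based sliding window with a minimum separation between selected clips. It is not the code in Stage5_extract_wav_from_flac.py; the hop size, the greedy loudest-first ranking, and the `max_clips` parameter are assumptions made for illustration.

```python
import math


def frame_rms(samples, start, length):
    """Root-mean-square amplitude of samples[start:start + length]."""
    window = samples[start:start + length]
    return math.sqrt(sum(x * x for x in window) / len(window))


def pick_segments(samples, sr=16000, clip_s=3.0, hop_s=0.5,
                  min_sep_s=1.5, max_clips=3):
    """Greedily pick the loudest clip-length windows whose start times
    are at least min_sep_s apart. Returns start times in seconds."""
    clip_len = int(clip_s * sr)
    hop = int(hop_s * sr)
    starts = range(0, len(samples) - clip_len + 1, hop)
    # Rank candidate windows from loudest to quietest.
    ranked = sorted(starts, key=lambda s: frame_rms(samples, s, clip_len),
                    reverse=True)
    chosen = []
    for s in ranked:
        if all(abs(s - c) >= min_sep_s * sr for c in chosen):
            chosen.append(s)
        if len(chosen) == max_clips:
            break
    return sorted(c / sr for c in chosen)


# Toy signal at a 100 Hz "sample rate": silence with a 1-second burst at 4-5 s.
toy = [0.0] * 400 + [1.0] * 100 + [0.0] * 500
print(pick_segments(toy, sr=100, max_clips=2))
```

On real 16 kHz audio a vectorized implementation (for example with NumPy) would be far faster; the list-based version only shows the selection logic.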

Positive Curation Highlights

  • Acoustic deduplication: FAISS cosine similarity on mel-spectrogram embeddings
    • 13 near-duplicates identified and removed (0.03% of 38,481 recordings)
  • Diversity-aware segment extraction:
    • RMS sliding window with 1.5s minimum separation
    • One representative clip per source recording
  • Species balancing:
    • MiniBatch K-Means (5 clusters/species) + salience-ranked priority queue
    • Gini coefficient reduced from 0.601 → 0.519 (13.7% reduction)
    • Mean samples/species: 14.9 (preserves all 1,677 species)
  • Quality assurance:
    • 1,000 clips manually audited (Cochran n=639 for 95% CI)
    • 97.8% accuracy (22 corrections: 15 onset, 6 noise, 1 false positive)
    • 92.1% rated quality A/B by Xeno-Canto community
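The species-balancing idea can be sketched with a priority queue. The code below is an assumption-heavy illustration, not the logic of Stage6_balance_species.py: it assumes each candidate clip already carries a species name, a cluster id (for example from MiniBatch K-Means), and a salience score, and it always serves the species with the fewest selected clips, taking the most salient clip of each cluster before revisiting a cluster. The function names and tuple layout are hypothetical.

```python
import heapq
from collections import defaultdict


def order_within_species(clips):
    """Interleave clusters: best clip of every cluster first, then the
    second best of every cluster, and so on. clips: (cluster, salience, id)."""
    by_cluster = defaultdict(list)
    for cluster, salience, clip_id in clips:
        by_cluster[cluster].append((salience, clip_id))
    for group in by_cluster.values():
        group.sort(reverse=True)
    ordered, rank = [], 0
    while True:
        layer = [g[rank] for g in by_cluster.values() if rank < len(g)]
        if not layer:
            return ordered
        ordered.extend(clip_id for _, clip_id in sorted(layer, reverse=True))
        rank += 1


def balance(candidates, budget):
    """candidates: (species, cluster, salience, clip_id) tuples.
    Returns up to `budget` clip ids, favouring under-represented species."""
    per_species = defaultdict(list)
    for species, cluster, salience, clip_id in candidates:
        per_species[species].append((cluster, salience, clip_id))
    queues = {sp: order_within_species(c) for sp, c in per_species.items()}
    # Heap entries are (clips already taken, species): fewest taken pops first.
    heap = [(0, sp) for sp in sorted(queues)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        taken, sp = heapq.heappop(heap)
        selected.append(queues[sp][taken])
        if taken + 1 < len(queues[sp]):
            heapq.heappush(heap, (taken + 1, sp))
    return selected


# Hypothetical candidates: a common species "A" and a rare species "B".
toy_candidates = [
    ("A", 0, 0.9, "a1"), ("A", 0, 0.8, "a2"),
    ("A", 1, 0.5, "a3"), ("A", 1, 0.4, "a4"),
    ("B", 0, 0.3, "b1"),
]
print(balance(toy_candidates, budget=3))
```

In this toy run the rare species keeps its only clip, and the common species contributes one clip from each of its two clusters before any cluster is used twice.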

Negative Curation Highlights

Sources totaling 25,000 bird-absent clips:

| Dataset | Clips | Description |
|---|---|---|
| BirdVox-DCASE-20k | 9,983 | Northeast USA field recordings |
| Freefield1010 | 5,755 | Global crowdsourced environmental audio |
| Warblr | 1,950 | UK field recordings |
| FSC-22 | 1,875 | Forest sounds (mammals, insects, weather) |
| ESC-50 | 444 | Urban/mechanical/human sounds (high RMS) |
| DataSEC | 3,597 | Mediterranean soundscapes |

All clips:

  • Resampled to 16 kHz mono
  • Centered 3-second extraction
  • No normalization (preserves naturalistic amplitude distribution)
  • Avian classes excluded from general sound datasets
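Centered extraction, as listed above, amounts to taking the middle 3 seconds of each source recording. A minimal sketch follows; zero-padding recordings shorter than 3 seconds is an assumption, since the handling of short files is not stated here.

```python
def center_crop(samples, sr=16000, clip_s=3.0):
    """Return the middle clip_s seconds of samples.
    Shorter input is zero-padded symmetrically (an assumption)."""
    target = int(sr * clip_s)
    if len(samples) >= target:
        start = (len(samples) - target) // 2
        return list(samples[start:start + target])
    pad = target - len(samples)
    left = pad // 2
    return [0] * left + list(samples) + [0] * (pad - left)


# Toy example at 2 samples per second, so a 3-second clip is 6 samples.
print(center_crop(list(range(10)), sr=2))
```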

🚀 Quick Start

Prerequisites

# Python dependencies
pip install pandas requests librosa soundfile tqdm scikit-learn matplotlib faiss-cpu numpy tensorflow

# System dependencies
# macOS: brew install ffmpeg
# Linux: apt-get install ffmpeg

For GPU FAISS acceleration: replace faiss-cpu with faiss-gpu.


1️⃣ Run Positive Pipeline

cd positive-label-curation

# Stage 1: Fetch metadata from Xeno-Canto API (~10 min)
python Stage1_xc_fetch_bird_metadata.py --country all

# Stage 2: Analyze metadata (optional statistics)
python Stage2_analyze_metadata.py

# Stage 3: Download and convert to FLAC (~2-6 hours)
python Stage3_download_and_convert.py

# Stage 4: Deduplicate using FAISS (~30 min)
python Stage4_deduplicate_flac.py --quarantine-all

# Stage 5: Extract 3s clips from FLAC (~1 hour)
python Stage5_extract_wav_from_flac.py --no-quarantine

# Stage 6: Balance species distribution (~15 min)
python Stage6_balance_species.py

# Stage 7: Generate QA spectrograms for manual review
python Stage7_qa_spectrograms.py

# Stage 8: Interactive onset correction tool (optional)
python Stage8_adjust_onset.py

# Stage 9: Apply corrections from QA
python Stage9_qa_apply_corrections.py

2️⃣ Run Negative Pipeline

cd negative-sample-curation

# Extract from DCASE 2018 datasets
python Stage1_extract_birdvox.py
python Stage2_extract_freefield.py
python Stage3_extract_warblr.py

# Extract from environmental sound datasets
python Stage4_extract_fsc22.py
python Stage5_extract_esc50.py
python Stage6_extract_datasec.py

Note: Ensure source datasets are downloaded and paths are configured in each script.


3️⃣ Run Baseline Validation

cd validation

# Train single model
python validate_seabad_pretrained.py --model mobilenetv3s --seed 42

# Train all models across all seeds
./run_all_cnn.sh

# Train specific seeds only
./run_all_cnn.sh 42 100

Supported models: mobilenetv3s, resnet50, vgg16, efficientnetb0

See validation/README.md for detailed usage and configuration.


📦 Pre-Compiled Dataset (Zenodo)

Download the complete 50,000-clip dataset: zenodo.org/records/18290494

Includes:

  • ✅ 50,000 × 3-second clips (WAV, 16 kHz mono, 16-bit PCM)
  • ✅ Train / validation / test CSVs (80/10/10 stratified split)
  • ✅ Full provenance metadata:
    • Xeno-Canto recording IDs
    • GPS coordinates
    • Original licenses (CC BY, CC BY-NC-SA, etc.)
    • Species taxonomy (IOC World Bird List)
    • Quality ratings
    • Source dataset attribution (for negative samples)
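For readers unfamiliar with stratified splitting, the sketch below shows one way an 80/10/10 split can preserve the positive/negative ratio in every subset. It is illustrative only: the published CSVs define the official split, and stratifying on the binary label (rather than on species or source) is an assumption.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Map each clip id to 'train', 'val', or 'test', splitting each
    label group separately so class proportions are preserved."""
    by_label = defaultdict(list)
    for clip_id, label in labels.items():
        by_label[label].append(clip_id)
    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        for i, clip_id in enumerate(ids):
            if i < n_train:
                assignment[clip_id] = "train"
            elif i < n_train + n_val:
                assignment[clip_id] = "val"
            else:
                assignment[clip_id] = "test"
    return assignment


# Hypothetical clip ids: 50 positives (label 1) and 50 negatives (label 0).
toy_labels = {f"pos_{i}": 1 for i in range(50)}
toy_labels.update({f"neg_{i}": 0 for i in range(50)})
split = stratified_split(toy_labels)
```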

🎛️ Audio Processing Standards

All clips standardized to:

| Parameter | Value | Rationale |
|---|---|---|
| Sample rate | 16,000 Hz | AudioMoth default, Nyquist covers 0-8 kHz |
| Channels | Mono | Edge deployment memory constraint |
| Duration | 3.0 seconds | BirdNET inference window |
| Format | WAV PCM_16 | Lossless, hardware-compatible |
| Normalization | None | Preserves naturalistic amplitude for energy-based detection |
| Source format | FLAC PCM_16 | Intermediate lossless storage |
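A downloaded clip can be checked against these standards with Python's standard-library `wave` module. The helper below is a hypothetical convenience function, not part of the repository; it reports any mismatch in sample rate, channel count, sample width, or length.

```python
import wave


def check_clip(path, sr=16000, channels=1, sample_width=2, duration_s=3.0):
    """Return a list of mismatches between a WAV file and the expected format.
    An empty list means the clip conforms."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sr:
            problems.append(f"sample rate {wav.getframerate()} != {sr}")
        if wav.getnchannels() != channels:
            problems.append(f"channels {wav.getnchannels()} != {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(
                f"sample width {wav.getsampwidth()} bytes != {sample_width}")
        expected_frames = int(sr * duration_s)
        if wav.getnframes() != expected_frames:
            problems.append(f"frames {wav.getnframes()} != {expected_frames}")
    return problems
```

Usage: `check_clip("some_clip.wav")` returns `[]` for a conforming 16 kHz, mono, 16-bit, 3-second file (the file name here is a placeholder).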

📊 Dataset Statistics

Geographic Distribution (Post-Balancing)

| Country | Clips | Percentage |
|---|---|---|
| Indonesia | 9,155 | 36.6% |
| Malaysia | 8,400 | 33.6% |
| Thailand | 5,996 | 24.0% |
| Singapore | 1,388 | 5.6% |
| Brunei | 61 | 0.2% |

Species Diversity

  • Total species: 1,677
  • Mean samples per species: 14.9
  • Gini coefficient: 0.519 (post-balancing)
  • Acoustic clusters: 3,553 (via K-Means)
  • Mean cluster salience: 0.378
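The Gini coefficient quoted above measures how unevenly clips are spread across species: 0 means every species has the same number of clips, and values near 1 mean a few species dominate. A minimal sketch of the standard sorted-rank formula, G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n, is shown below; the toy counts are hypothetical and do not reproduce the SEABAD figures.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum with 1-based ranks over the ascending-sorted counts.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical clips-per-species counts.
print(gini([5, 5, 5, 5]))    # perfectly even
print(gini([0, 0, 0, 10]))   # all clips in one of four species
```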

Quality Metrics

  • A/B quality rating: 92.1% (Xeno-Canto community ratings)
  • Manual QA accuracy: 97.8% (n=1,000)
  • Deduplication rate: 0.03% (13/38,481)
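The deduplication rate counts recordings whose embeddings are nearly identical to another recording's. The sketch below illustrates the idea with a brute-force cosine-similarity search in pure Python; it stands in for the FAISS nearest-neighbor search used in the pipeline, and the 0.99 threshold, the toy embeddings, and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def near_duplicates(embeddings, threshold=0.99):
    """Return (id_a, id_b, similarity) for every pair at or above threshold.
    Brute force, O(n^2): fine for a demo, too slow for tens of thousands."""
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs


# Hypothetical 3-dimensional embeddings; real mel embeddings are much larger.
toy_embeddings = {
    "rec_a": [1.0, 0.0, 0.2],
    "rec_b": [0.98, 0.01, 0.21],
    "rec_c": [0.0, 1.0, 0.0],
}
dupes = near_duplicates(toy_embeddings)

# Rate reported above: 13 of 38,481 recordings.
rate_pct = 100 * 13 / 38481
print(dupes, f"{rate_pct:.2f}%")
```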

📜 License

Curation Code

MIT License: free to use, modify, and distribute with attribution.

Audio Clips

  • Positive samples: Inherit original Xeno-Canto Creative Commons licenses
    • CC BY, CC BY-SA, CC BY-NC, CC BY-NC-SA
    • Full attribution metadata included in dataset
    • See Zenodo for per-recording licenses
  • Negative samples: Subject to licenses of source datasets:
    • BirdVox, Freefield1010, Warblr (DCASE 2018)
    • FSC-22, ESC-50, DataSEC
    • See source dataset documentation for terms

Important: When using SEABAD, you must:

  1. Credit original Xeno-Canto recordists (metadata provided)
  2. Respect original Creative Commons license terms
  3. Cite SEABAD dataset and paper

📚 Citation

If you use SEABAD or this curation pipeline, please cite:

@article{seabad2025,
  title   = {{SEABAD}: A Tropical Bird Audio Detection Dataset for Passive Acoustic Monitoring},
  author  = {Author Names},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025},
  url     = {https://github.com/mun3im/seabad}
}

@dataset{seabad_zenodo2025,
  title   = {{SEABAD} Dataset: 50,000 Southeast Asian Bird Audio Clips},
  year    = {2025},
  url     = {https://zenodo.org/records/18290494},
  note    = {50,000 curated 3-second clips spanning 1,677 Southeast Asian bird species}
}

Please also credit:

  • Xeno-Canto and original recordists
  • Source datasets for negative samples (BirdVox, Freefield1010, Warblr, FSC-22, ESC-50, DataSEC)

🀝 Contributing

We welcome contributions to improve the curation pipeline or extend SEABAD to other regions!

  • Issues: Report bugs or request features via GitHub Issues
  • Pull requests: Submit improvements to scripts or documentation
  • Regional adaptations: Contact us if you're adapting the pipeline for other tropical/temperate regions

🔗 Related Resources


📧 Contact

For questions about SEABAD or the curation methodology:


Built with ❤️ for tropical biodiversity monitoring

About

This repo contains scripts to develop the SEABAD dataset and the SEABADnet lightweight Southeast Asian bird presence detection models.
