A dataset of 50,000 automatically curated 3-second clips spanning 1,677 Southeast Asian bird species, standardized to 16 kHz mono for binary bird presenceβabsence detection on edge devices.
π¦ Dataset: zenodo.org/records/18290494 π Paper: SEABAD: A Tropical Bird Audio Detection Dataset for Passive Acoustic Monitoring (2025) π» Code: github.com/mun3im/seabad
Passive acoustic monitoring (PAM) enables large-scale biodiversity assessment, but most recordings contain non-informative audio. Bird audio detection (BAD)βdetermining bird presence without species classificationβcan suppress non-target recordings on-device, extending deployment and reducing annotation burden.
SEABAD addresses critical gaps in existing datasets:
- β Tropical soundscape coverage (Southeast Asia)
- β 3-second clip length matching edge-AI inference windows
- β Diversity-aware species balancing (1,677 species, Gini 0.519)
- β Multi-source negative curation (environmental + field recordings)
| Property | Value |
|---|---|
| Total clips | 50,000 |
| Positive (bird present) | 25,000 |
| Negative (bird absent) | 25,000 |
| Unique bird species | 1,677 |
| Clip duration | 3 seconds |
| Sample rate | 16 kHz mono |
| Bit depth | 16-bit PCM |
| Geography | Indonesia, Malaysia, Thailand, Singapore, Brunei |
| Train / Val / Test split | 40k / 5k / 5k (80% / 10% / 10%) |
Standard CNN architectures (ImageNet pre-trained, fine-tuned on SEABAD) averaged across 3 random seeds (42, 100, 786):
| Model | Params | Accuracy | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| MobileNetV3-Smallβ | 1.1M | 99.57 Β± 0.25% | 0.9985 Β± 0.0002 | 0.9956 Β± 0.0012 | 0.9957 Β± 0.0008 | 0.9957 Β± 0.0025 |
| EfficientNetB0 | 4.4M | 99.49 Β± 0.23% | 0.9991 Β± 0.0004 | 0.9959 Β± 0.0018 | 0.9939 Β± 0.0051 | 0.9949 Β± 0.0023 |
| VGG16 | 14.9M | 99.61 Β± 0.03% | 0.9995 Β± 0.0001 | 0.9960 Β± 0.0014 | 0.9963 Β± 0.0010 | 0.9961 Β± 0.0025 |
| ResNet50 | 24.2M | 99.73 Β± 0.02% | 0.9992 Β± 0.0003 | 0.9965 Β± 0.0013 | 0.9980 Β± 0.0012 | 0.9973 Β± 0.0019 |
β Primary baseline for edge deployment
- β All models achieve >99.4% accuracy with minimal variance (std <0.25%)
- β MobileNetV3-Small is only 0.16% behind ResNet50 but has 22Γ fewer parameters
- β Excellent training stability confirmed across diverse architectures and random seeds
- β High dataset quality validated by consistent performance
See validation/README.md for detailed experimental setup and analysis.
seabad/
βββ positive-label-curation/ # Stages 1β9: Xeno-Canto bird clips
β βββ Stage1_xc_fetch_bird_metadata.py
β βββ Stage2_analyze_metadata.py
β βββ Stage3_download_and_convert.py
β βββ Stage4_deduplicate_flac.py
β βββ Stage5_extract_wav_from_flac.py
β βββ Stage6_balance_species.py
β βββ Stage7_qa_spectrograms.py
β βββ Stage8_adjust_onset.py
β βββ Stage9_qa_apply_corrections.py
β
βββ negative-sample-curation/ # Stages 1β6: Non-bird clips
β βββ Stage1_extract_birdvox.py
β βββ Stage2_extract_freefield.py
β βββ Stage3_extract_warblr.py
β βββ Stage4_extract_fsc22.py
β βββ Stage5_extract_esc50.py
β βββ Stage6_extract_datasec.py
β
βββ validation/ # CNN baseline training & evaluation
βββ validate_seabad_pretrained.py
βββ run_all_cnn.sh
βββ utils.py
βββ README.md
- Metadata Acquisition β Xeno-Canto API query for Southeast Asian bird species
- Sound Acquisition β Download + convert to 16 kHz mono FLAC
- Acoustic Deduplication β FAISS approximate nearest-neighbor on mel embeddings
- Segment Extraction β RMS-based sliding window with minimum separation
- Species Balancing β Diversity-aware balancing using MiniBatch K-Means + salience ranking
- Quality Assurance β Manual audit with interactive correction tool
- Acoustic deduplication: FAISS cosine similarity on mel-spectrogram embeddings
- 13 near-duplicates identified and removed (0.03% of 38,481 recordings)
- Diversity-aware segment extraction:
- RMS sliding window with 1.5s minimum separation
- One representative clip per source recording
- Species balancing:
- MiniBatch K-Means (5 clusters/species) + salience-ranked priority queue
- Gini coefficient reduced from 0.601 β 0.519 (13.7% reduction)
- Mean samples/species: 14.9 (preserves all 1,677 species)
- Quality assurance:
- 1,000 clips manually audited (Cochran n=639 for 95% CI)
- 97.8% accuracy (22 corrections: 15 onset, 6 noise, 1 false positive)
- 92.1% rated quality A/B by Xeno-Canto community
Sources totaling 25,000 bird-absent clips:
| Dataset | Clips | Description |
|---|---|---|
| BirdVox-DCASE-20k | 9,983 | Northeast USA field recordings |
| Freefield1010 | 5,755 | Global crowdsourced environmental audio |
| Warblr | 1,950 | UK field recordings |
| FSC-22 | 1,875 | Forest sounds (mammals, insects, weather) |
| ESC-50 | 444 | Urban/mechanical/human sounds (high RMS) |
| DataSEC | 3,597 | Mediterranean soundscapes |
All clips:
- Resampled to 16 kHz mono
- Centered 3-second extraction
- No normalization (preserves naturalistic amplitude distribution)
- Avian classes excluded from general sound datasets
# Python dependencies
pip install pandas requests librosa soundfile tqdm scikit-learn matplotlib faiss-cpu numpy tensorflow
# System dependencies
# macOS: brew install ffmpeg
# Linux: apt-get install ffmpegFor GPU FAISS acceleration: replace faiss-cpu with faiss-gpu.
cd positive-label-curation
# Stage 1: Fetch metadata from Xeno-Canto API (~10 min)
python Stage1_xc_fetch_bird_metadata.py --country all
# Stage 2: Analyze metadata (optional statistics)
python Stage2_analyze_metadata.py
# Stage 3: Download and convert to FLAC (~2-6 hours)
python Stage3_download_and_convert.py
# Stage 4: Deduplicate using FAISS (~30 min)
python Stage4_deduplicate_flac.py --quarantine-all
# Stage 5: Extract 3s clips from FLAC (~1 hour)
python Stage5_extract_wav_from_flac.py --no-quarantine
# Stage 6: Balance species distribution (~15 min)
python Stage6_balance_species.py
# Stage 7: Generate QA spectrograms for manual review
python Stage7_qa_spectrograms.py
# Stage 8: Interactive onset correction tool (optional)
python Stage8_adjust_onset.py
# Stage 9: Apply corrections from QA
python Stage9_qa_apply_corrections.pycd negative-sample-curation
# Extract from DCASE 2018 datasets
python Stage1_extract_birdvox.py
python Stage2_extract_freefield.py
python Stage3_extract_warblr.py
# Extract from environmental sound datasets
python Stage4_extract_fsc22.py
python Stage5_extract_esc50.py
python Stage6_extract_datasec.pyNote: Ensure source datasets are downloaded and paths are configured in each script.
cd validation
# Train single model
python validate_seabad_pretrained.py --model mobilenetv3s --seed 42
# Train all models across all seeds
./run_all_cnn.sh
# Train specific seeds only
./run_all_cnn.sh 42 100Supported models: mobilenetv3s, resnet50, vgg16, efficientnetb0
See validation/README.md for detailed usage and configuration.
Download the complete 50,000-clip dataset: zenodo.org/records/18290494
- β 50,000 Γ 3-second clips (WAV, 16 kHz mono, 16-bit PCM)
- β Train / validation / test CSVs (80/10/10 stratified split)
- β
Full provenance metadata:
- Xeno-Canto recording IDs
- GPS coordinates
- Original licenses (CC BY, CC BY-NC-SA, etc.)
- Species taxonomy (IOC World Bird List)
- Quality ratings
- Source dataset attribution (for negative samples)
All clips standardized to:
| Parameter | Value | Rationale |
|---|---|---|
| Sample rate | 16,000 Hz | AudioMoth default, Nyquist covers 0-8 kHz |
| Channels | Mono | Edge deployment memory constraint |
| Duration | 3.0 seconds | BirdNET inference window |
| Format | WAV PCM_16 | Lossless, hardware-compatible |
| Normalization | None | Preserves naturalistic amplitude for energy-based detection |
| Source format | FLAC PCM_16 | Intermediate lossless storage |
| Country | Clips | Percentage |
|---|---|---|
| Indonesia | 9,155 | 36.6% |
| Malaysia | 8,400 | 33.6% |
| Thailand | 5,996 | 24.0% |
| Singapore | 1,388 | 5.6% |
| Brunei | 61 | 0.2% |
- Total species: 1,677
- Mean samples per species: 14.9
- Gini coefficient: 0.519 (post-balancing)
- Acoustic clusters: 3,553 (via K-Means)
- Mean cluster salience: 0.378
- A/B quality rating: 92.1% (Xeno-Canto community ratings)
- Manual QA accuracy: 97.8% (n=1,000)
- Deduplication rate: 0.03% (13/38,481)
MIT License β Free to use, modify, and distribute with attribution.
- Positive samples: Inherit original Xeno-Canto Creative Commons licenses
- CC BY, CC BY-SA, CC BY-NC, CC BY-NC-SA
- Full attribution metadata included in dataset
- See Zenodo for per-recording licenses
- Negative samples: Subject to licenses of source datasets:
- BirdVox, Freefield1010, Warblr (DCASE 2018)
- FSC-22, ESC-50, DataSEC
- See source dataset documentation for terms
Important: When using SEABAD, you must:
- Credit original Xeno-Canto recordists (metadata provided)
- Respect original Creative Commons license terms
- Cite SEABAD dataset and paper
If you use SEABAD or this curation pipeline, please cite:
@article{seabad2025,
title = {{SEABAD}: A Tropical Bird Audio Detection Dataset for Passive Acoustic Monitoring},
author = {Author Names},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2025},
url = {https://github.com/mun3im/seabad}
}
@dataset{seabad_zenodo2025,
title = {{SEABAD} Dataset: 50,000 Southeast Asian Bird Audio Clips},
year = {2025},
url = {https://zenodo.org/records/18290494},
note = {50,000 curated 3-second clips spanning 1,677 Southeast Asian bird species}
}Please also credit:
- Xeno-Canto and original recordists
- Source datasets for negative samples (BirdVox, Freefield1010, Warblr, FSC-22, ESC-50, DataSEC)
We welcome contributions to improve the curation pipeline or extend SEABAD to other regions!
- Issues: Report bugs or request features via GitHub Issues
- Pull requests: Submit improvements to scripts or documentation
- Regional adaptations: Contact us if you're adapting the pipeline for other tropical/temperate regions
- Xeno-Canto: https://www.xeno-canto.org/
- AudioMoth: https://www.openacousticdevices.info/
- BirdNET: https://birdnet.cornell.edu/
- DCASE 2018 Task 3: http://dcase.community/challenge2018/task-bird-audio-detection
For questions about SEABAD or the curation methodology:
- Open an issue: github.com/mun3im/seabad/issues
- Email: [contact information]
Built with β€οΈ for tropical biodiversity monitoring