A general-purpose audio transcription and sonic archival toolkit.
Aural Archive is a suite of tools designed to facilitate the collection, transcription, and time-coding of "found sound," field recordings, and digital artifacts. Built for researchers, ethnomusicologists, and sound artists, it provides a structured pipeline for organizing audio data and extracting meaningful segments with archival precision.
This toolkit is designed to:
- Archive audio from diverse digital repositories and local field recordings into structured research projects.
- Transcribe spoken word and sonic events with high-fidelity time-codes using local AI models.
- Analyze long-form recordings to identify and preserve specific "found sound" moments.
- Preserve context via JSON sidecar metadata and an SQLite-based state journal.
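As a rough illustration of the sidecar idea, the sketch below writes a `.info.json` file next to an audio file. The `write_sidecar` helper and every field name in it (`source_url`, `sha256`, `captured_at`, and so on) are assumptions made for this example, not the toolkit's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(audio_path: Path, source_url: str) -> Path:
    """Write a hypothetical <stem>.info.json sidecar beside an audio file."""
    data = audio_path.read_bytes()
    metadata = {
        # All keys are illustrative; the real sidecar schema may differ.
        "file": audio_path.name,
        "source_url": source_url,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    # clip.wav -> clip.info.json, stored in the same directory as the audio.
    sidecar = audio_path.with_name(audio_path.stem + ".info.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Keeping a checksum and capture time beside the recording means the provenance travels with the file even if the project database is lost.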
- 🔍 Research-First Acquisition: The `capture` command allows for targeted, ethical gathering of audio artifact metadata and binary data.
- 🔌 Plugin Architecture: Modular `Extractor` system for easily adding new acquisition sources (YouTube, local, HTML, etc.).
- 📜 SQLite State Journal: Robust state management ensures idempotency, crash recovery, and a clear audit trail of all archival jobs.
- 🎙️ Time-Coded Transcription:
  - OpenAI Whisper: Local, high-accuracy speech-to-text with segment-level timestamps.
  - Multi-Strategy: Support for YouTube captions, Whisper, and Gemini Vision transcripts.
- 📁 Advanced Organization: Automatic generation of `.info.json` sidecar metadata and generic, research-oriented project structures.
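To make the idempotency and crash-recovery claims concrete, here is a minimal sketch of how an SQLite job journal could behave. The `jobs` table, its columns, the state names, and the helper functions are all hypothetical; the project's real journal may be organized quite differently.

```python
import sqlite3

# Hypothetical schema: one row per acquisition source, keyed by the source itself.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    source     TEXT PRIMARY KEY,
    state      TEXT NOT NULL DEFAULT 'queued',
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_journal(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the journal database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, source: str) -> bool:
    """Queue a source once; return False if it was already journaled."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO jobs (source) VALUES (?)", (source,)
        )
    return cur.rowcount == 1


def mark(conn: sqlite3.Connection, source: str, state: str) -> None:
    """Record a state transition for one job."""
    with conn:
        conn.execute(
            "UPDATE jobs SET state = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE source = ?",
            (state, source),
        )


def recover(conn: sqlite3.Connection) -> int:
    """After a crash, return jobs left in 'running' to 'queued'."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET state = 'queued' WHERE state = 'running'"
        )
    return cur.rowcount
```

Because the source is the primary key, queueing the same URL twice is a no-op, and any job left in `running` by an interrupted process can be found and retried on the next start.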
git clone https://github.com/eric-rolph/aural-archive.git
cd aural-archive
.\setup.ps1
.\venv\Scripts\Activate.ps1
python -m media_harvest init archival-study-001
# Interactive search & select
python -m media_harvest capture -p archival-study-001 --mode search
python -m media_harvest transcribe -p archival-study-001
# Read transcript + metadata directly in terminal
python -m media_harvest view -p archival-study-001 --num 1 --meta
# Check archival metrics & level
python -m media_harvest stats -p archival-study-001

| Command | Description |
|---|---|
| `init <name>` | Initialize a new archival project with templates |
| `capture -p <name>` | Acquire audio artifacts (`--mode search`, `batch`, or `url`) |
| `journal -p <name>` | Inspect and manage the archival queue and job states |
| `transcribe -p <name>` | Generate time-coded transcripts for all project audio |
| `view -p <name>` | View transcripts and metadata sidecars in the terminal |
| `stats -p <name>` | View deep archival metrics and storage statistics |
| `extract -p <name>` | Cut precise clips based on `extractions.json` |
| `doctor` | Check health of dependencies (FFmpeg, yt-dlp, etc.) |
| `list` | List all active archival projects |
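The `extract` step can be pictured as turning each entry in `extractions.json` into one FFmpeg invocation. The sketch below assumes a hypothetical entry shape (`name`, `source`, and `start`/`end` in seconds); the file's real schema is not documented here, and `clip_commands` is an illustrative helper rather than part of `media_harvest`.

```python
import json
from pathlib import Path


def clip_commands(extractions_json: str, samples_dir: Path) -> list[list[str]]:
    """Build one ffmpeg argument list per extraction entry (assumed schema)."""
    commands = []
    for entry in json.loads(extractions_json):
        start, end = float(entry["start"]), float(entry["end"])
        if end <= start:
            raise ValueError(f"{entry['name']}: end must be after start")
        out = samples_dir / f"{entry['name']}.wav"
        commands.append([
            "ffmpeg",
            "-i", entry["source"],   # input recording
            "-ss", f"{start:.3f}",   # clip start, in seconds
            "-to", f"{end:.3f}",     # clip end, in seconds
            str(out),                # output format inferred from the extension
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Placing `-ss` and `-to` after `-i` makes FFmpeg decode up to the cut points, which is slower than input seeking but keeps the boundaries accurate.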
aural-archive/
├── media_harvest/ # Core library & plugins
├── projects/ # Archival research projects
│ ├── archival-study-001/ # User-defined study
│ │ ├── presets.json # Capture parameters
│ │ ├── extractions.json # Sample definitions
│ │ ├── output/ # Archived recordings & metadata
│ │ └── samples/ # Extracted sonic events
├── LICENSE.md # MIT + Ethical Archival Notice
├── setup.ps1
└── README.md
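For readers who want to see the per-project layout as code, the following sketch creates the same folders and files with `pathlib`. It only approximates what `init` might do: the `scaffold_project` helper and the empty default contents of `presets.json` and `extractions.json` are assumptions.

```python
import json
from pathlib import Path


def scaffold_project(root: Path, name: str) -> Path:
    """Create projects/<name>/ with the folders and files shown in the tree."""
    project = root / "projects" / name
    for sub in ("output", "samples"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    # Placeholder templates; the real defaults are not documented here.
    templates = (("presets.json", {}), ("extractions.json", []))
    for filename, default in templates:
        target = project / filename
        if not target.exists():  # never overwrite a researcher's edits
            target.write_text(json.dumps(default, indent=2), encoding="utf-8")
    return project
```

Running it twice leaves existing files untouched, so it is safe to call on a project that already contains work.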
Aural Archive is designed for researchers and archivists to manage recordings in accordance with applicable laws.
- Personal/Research Use Only: Intended for public domain content or content you have a legal right to access.
- Respect ToS: Users are responsible for complying with the Terms of Service of all repositories accessed.
- No Infringement: We do not condone or support the use of this tool for copyright infringement.
Distributed under the MIT License with an included Ethical Archival Research Notice. See LICENSE.md for details.