DataHackIL · shaypal5 · May 12, 2026 · May 12, 2026
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -0,0 +1,29 @@
+name: Deploy docs
+
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - "docs/**"
+      - "mkdocs.yml"
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install MkDocs Material
+        run: pip install mkdocs-material
+
+      - name: Deploy to GitHub Pages
+        run: mkdocs gh-deploy --force
diff --git a/docs/assets/logo.svg b/docs/assets/logo.svg
diff --git a/docs/audio-format.md b/docs/audio-format.md
@@ -0,0 +1,132 @@
+# Audio Format
+
+All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus.
+
+---
+
+## Format requirements
+
+| Property | Value |
+|----------|-------|
+| Sample rate | 16 000 Hz |
+| Channels | 1 (mono) |
+| Bit depth | 16-bit PCM |
+| Peak level | ≤ –1.0 dBFS (safety ceiling) |
+| Duration | ≥ 3.0 s |
+| Encoding | WAV (no lossy formats) |
+
+```python
+import soundfile as sf
+import numpy as np
+
+path = "data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav"
+wav, sr = sf.read(path)
+assert sr == 16000
+assert wav.ndim == 1               # mono
+assert wav.dtype == np.float64     # soundfile returns float64 by default
+assert len(wav) / sr >= 3.0        # minimum duration in seconds
+
+# -1.0 dBFS is a linear amplitude of 10 ** (-1.0 / 20) ≈ 0.891, not 1.0
+peak_ceiling = 10 ** (-1.0 / 20)
+assert np.abs(wav).max() <= peak_ceiling + 1e-4  # small slack for 16-bit rounding
+
+# Check format info
+info = sf.info(path)
+print(info.subtype)  # PCM_16
+```
+
+---
+
+## Normalization pipeline
+
+Each clip passes through two normalization steps:
+
+```
+TTS render (float32, arbitrary loudness)
+    ↓
+[1] Per-turn RMS gain (M3a)        — preserves inter-turn contrast
+    ↓
+[2] Single global peak gain         — lands absolute peak at target_peak_dbfs
+    ↓
+[3] Safety limiter                  — hard ceiling at –1.0 dBFS (guaranteed no-op for in-spec targets ≤ –1.5)
+    ↓
+Tier B only: room IR + device → renormalize to same target
+    ↓
+Output WAV
+```
+
+### Step 1 — Per-turn RMS gain (M3a)
+
+Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with.
+
+??? info "Why per-turn RMS matters"
+    Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data.
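As an illustration of Step 1, the sketch below scales one turn so that its RMS lands on a target level. The function names and the `TARGET_RMS_DBFS` values are hypothetical, chosen only to show a quieter I1 and a louder I5; the pipeline's real per-intensity targets are not stated in this document.

```python
import numpy as np

# Hypothetical per-intensity RMS targets in dBFS (not the pipeline's values).
TARGET_RMS_DBFS = {1: -30.0, 3: -24.0, 5: -18.0}


def rms_dbfs(x: np.ndarray) -> float:
    """RMS level of a float signal in dBFS (full scale = 1.0)."""
    rms = float(np.sqrt(np.mean(np.square(x))))
    return float(20.0 * np.log10(max(rms, 1e-12)))  # floor avoids log10(0)


def apply_turn_rms_gain(turn: np.ndarray, intensity: int) -> np.ndarray:
    """Scale one turn so its RMS matches the target for its intensity."""
    gain_db = TARGET_RMS_DBFS[intensity] - rms_dbfs(turn)
    return turn * 10.0 ** (gain_db / 20.0)


# Two synthetic turns with identical raw loudness, as flat TTS output would be.
t = np.arange(16000) / 16000.0
raw = 0.1 * np.sin(2 * np.pi * 220.0 * t)
calm = apply_turn_rms_gain(raw, intensity=1)
escalated = apply_turn_rms_gain(raw, intensity=5)
```

After the gain, the two turns differ in RMS by exactly the gap between their targets, which is the loudness gradient the raw TTS output lacks.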
+
+### Step 2 — Single global peak gain
+
+A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved.
+
+The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`.
+The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`.
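A minimal sketch of Step 2, assuming float audio with full scale at 1.0. The helper names are hypothetical; only the idea (one gain for the whole mix, with the measured peak reported afterwards) comes from the text above. The synthetic mix also shows that the RMS ratio between a quiet and a loud turn is unchanged by the gain.

```python
import numpy as np


def peak_dbfs(x: np.ndarray) -> float:
    """Absolute peak of a float signal in dBFS (full scale = 1.0)."""
    return float(20.0 * np.log10(max(float(np.max(np.abs(x))), 1e-12)))


def apply_global_peak_gain(mix: np.ndarray, target_peak_dbfs: float = -2.0):
    """Apply one gain to the whole mix; return (output, measured peak in dBFS)."""
    gain_db = target_peak_dbfs - peak_dbfs(mix)
    out = mix * 10.0 ** (gain_db / 20.0)
    return out, peak_dbfs(out)


def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.square(x))))


# Synthetic mix: a quiet turn followed by a louder one.
t = np.arange(8000) / 16000.0
quiet = 0.05 * np.sin(2 * np.pi * 200.0 * t)
loud = 0.40 * np.sin(2 * np.pi * 200.0 * t)
mix = np.concatenate([quiet, loud])

out, measured = apply_global_peak_gain(mix, target_peak_dbfs=-2.0)
ratio_before = rms(loud) / rms(quiet)
ratio_after = rms(out[8000:]) / rms(out[:8000])
```

Here `measured` plays the role of `normalized_dbfs`, and `ratio_before` equals `ratio_after` up to floating-point error.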
+
+### Step 3 — Safety limiter
+
+The limiter enforces a hard ceiling at –1.0 dBFS. Every in-spec target (range: [–12.0, –1.5] dBFS) places the peak below that ceiling, so the limiter is a guaranteed no-op in normal operation. It exists as a safety rail against misconfiguration.
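To make the no-op claim concrete, the sketch below models the limiter as a plain hard clip. That is an assumption: the document only says the ceiling is hard, not how SynthBanshee implements it. `safety_limit` is a hypothetical name.

```python
import numpy as np

CEILING_DBFS = -1.0
CEILING_LINEAR = 10.0 ** (CEILING_DBFS / 20.0)  # about 0.891


def safety_limit(x: np.ndarray) -> np.ndarray:
    """Hard-clip samples to the ceiling (an assumed, simplest-possible limiter)."""
    return np.clip(x, -CEILING_LINEAR, CEILING_LINEAR)


t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 220.0 * t)
in_spec = 10.0 ** (-2.0 / 20.0) * tone  # peak near -2.0 dBFS, below the ceiling
too_hot = 1.0 * tone                    # peak near 0 dBFS, a misconfigured case

limited_in_spec = safety_limit(in_spec)
limited_hot = safety_limit(too_hot)
```

The in-spec signal passes through unchanged, while the misconfigured one is held at the ceiling.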
+
+---
+
+## Silence padding
+
+Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`.
+
+Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output.
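The relationship between the pad and the timestamps can be sketched as follows. This is not the body of `preprocess()`: the function name, the `onset`/`offset` keys, and the use of digital zeros in place of ambient silence are all assumptions made for illustration.

```python
import numpy as np

PAD_S = 0.5  # minimum head/tail pad stated above


def pad_and_shift(wav, sr, events, pad_s=PAD_S):
    """Pad both ends and move event times into the padded clip's timeline."""
    pad = np.zeros(int(round(pad_s * sr)), dtype=wav.dtype)
    padded = np.concatenate([pad, wav, pad])
    shifted = [
        {**e, "onset": e["onset"] + pad_s, "offset": e["offset"] + pad_s}
        for e in events
    ]
    return padded, shifted


sr = 16000
raw = np.ones(3 * sr, dtype=np.float32)  # 3.0 s stand-in for raw TTS output
raw_events = [{"label": "turn_1", "onset": 0.0, "offset": 1.25}]
padded, events = pad_and_shift(raw, sr, raw_events)
```

Consumers of the delivered files do not need to apply this shift themselves; the sketch only shows why a timestamp of 0.5 s in the transcript corresponds to the first sample of speech.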
+
+---
+
+## Dirty files
+
+`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising.
+
+The `dirty_file_path` field in ClipMetadata gives the repo-relative path:
+```
+"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav"
+```
+
+Dirty files are useful for:
+- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`)
+- Checking raw TTS prosody before processing
+- Re-running preprocessing with different parameters
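For the first use case, a small sketch: derive the dirty path from a clip ID using the naming shown above, and check a clip's measured peak against its recorded `normalized_dbfs`. The helper names and the 0.1 dB tolerance are hypothetical.

```python
from pathlib import PurePosixPath

import numpy as np


def dirty_path_for(clip_id: str) -> str:
    """Repo-relative POSIX path of a clip's dirty file, per the naming above."""
    return str(PurePosixPath("assets/speech/dirty") / f"{clip_id}_dirty.wav")


def peak_dbfs(x: np.ndarray) -> float:
    """Absolute peak of a float signal in dBFS (full scale = 1.0)."""
    return float(20.0 * np.log10(max(float(np.max(np.abs(x))), 1e-12)))


def peak_matches_metadata(wav: np.ndarray, normalized_dbfs: float,
                          tol_db: float = 0.1) -> bool:
    """True if the measured peak agrees with the recorded value within tol_db."""
    return bool(abs(peak_dbfs(wav) - normalized_dbfs) <= tol_db)


# Synthetic stand-ins; real arrays would come from soundfile.read().
processed = np.array([0.0, 10.0 ** (-2.0 / 20.0), -0.1])
dirty = np.array([0.0, 0.5, -0.1])
```

With real data, `peak_dbfs(dirty)` and `peak_dbfs(processed)` give the before and after peaks, and a `False` from `peak_matches_metadata` points at a metadata or normalization problem.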
+
+!!! warning "Do not modify dirty files"
+    The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups.
+
+---
+
+## TTS backends
+
+| Backend | Voices | Clips in delivery-003 |
+|---------|--------|----------------------|
+| Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 |
+| Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 |
+
+The backend per speaker is recorded in `generation_metadata.tts_backend`:
+```json
+"tts_backend": {
+    "AGG_M_30-45_002": "google",
+    "VIC_F_25-40_003": "google"
+}
+```
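A hedged sketch of how per-backend clip counts could be derived from this field. It assumes each clip's metadata has been loaded into a dict with the nesting shown above; `clips_by_tts_backend` is a hypothetical helper, and the `VIC_F_25-40_001` speaker ID in the first sample clip is invented for the example.

```python
from collections import Counter


def clips_by_tts_backend(clips):
    """Count clips per backend; a clip that mixes backends counts once for each."""
    counts = Counter()
    for clip in clips:
        backends = set(clip["generation_metadata"]["tts_backend"].values())
        counts.update(backends)
    return dict(counts)


sample_clips = [
    {"generation_metadata": {"tts_backend": {
        "AGG_M_30-45_001": "azure", "VIC_F_25-40_001": "azure"}}},
    {"generation_metadata": {"tts_backend": {
        "AGG_M_30-45_002": "google", "VIC_F_25-40_003": "google"}}},
]
summary = clips_by_tts_backend(sample_clips)
```

Taking the set of values first means a two-speaker clip rendered entirely on one backend is counted once, not twice.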
+
+??? info "Azure SSML cache"
+    SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters).
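The cache key described above can be sketched with the standard library. Only the SHA-256-of-SSML idea comes from the text; the UTF-8 encoding, the `<digest>.wav` file naming, and the function name are assumptions.

```python
import hashlib
from pathlib import PurePosixPath


def ssml_cache_path(ssml: str, cache_dir: str = "assets/speech") -> PurePosixPath:
    """Map a fully rendered SSML string to a cache file path (assumed naming)."""
    digest = hashlib.sha256(ssml.encode("utf-8")).hexdigest()
    return PurePosixPath(cache_dir) / f"{digest}.wav"


ssml_a = '<speak><voice name="he-IL-AvriNeural">shalom</voice></speak>'
ssml_b = '<speak><voice name="he-IL-HilaNeural">shalom</voice></speak>'

path_a = ssml_cache_path(ssml_a)
path_b = ssml_cache_path(ssml_b)
```

Because the key covers the whole SSML string, any change to voice, prosody, or text produces a different path, while an identical string always maps to the same file. It also shows why editing a cached `.wav` by hand is unsafe: the key reflects the SSML input only, so a modified file would still be returned for the same SSML.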
+
+---
+
+## Known audio quirks
+
+### `vic_f0_high` — Google Chirp HD female F0 baseline
+
+The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated.
+
+Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend.
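One possible reading of "calibrate per backend" is to express F0 in semitones relative to each backend's own median, so that a baseline offset between voices drops out. The sketch below is an illustration under that assumption, not the QA procedure behind M10a; the F0 values are synthetic and the function name is hypothetical.

```python
import numpy as np


def f0_semitones_per_backend(f0_hz, backends):
    """Express each F0 value in semitones relative to its backend's median."""
    f0 = np.asarray(f0_hz, dtype=float)
    labels = np.asarray(backends)
    out = np.empty_like(f0)
    for backend in np.unique(labels):
        mask = labels == backend
        reference = np.median(f0[mask])  # per-backend baseline
        out[mask] = 12.0 * np.log2(f0[mask] / reference)
    return out


# Synthetic values only; these are not measurements from the corpus.
f0_hz = [200.0, 220.0, 250.0, 275.0]
backends = ["azure", "azure", "google", "google"]
relative = f0_semitones_per_backend(f0_hz, backends)
```

In this toy input the second backend sits a constant ratio above the first, and the per-backend normalization maps both pairs to the same relative values.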
+
+### `quality_flags: ["emotion_downgrade"]`
+
+Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended.
+
+In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+.
diff --git a/docs/deliveries.md b/docs/deliveries.md
@@ -0,0 +1,86 @@
+# Deliveries
+
+All data deliveries are logged here. Each entry links to per-delivery notes with clip counts, QA findings, known limitations, and the SynthBanshee commit that produced the batch.
+
+---
+
+## Delivery 003 — multi-project, multi-voice
+
+**Date:** 2026-05-12 · **Status:** provisional · **PR:** [#5](https://github.com/DataHackIL/avdp-synth-corpus/pull/5)
+
+This is the current working delivery. It replaces delivery-002.
+
+### At a glance
+
+| | |
+|---|---|
+| Clips | 20 |
+| Total duration | ~41.6 min |
+| Projects | `she_proves` (12) + `elephant_in_the_room` (8) |
+| Tiers | A (12 clean) + B (8 room-augmented) |
+| TTS backends | Azure (18) + Google Chirp 3 HD (2) |
+| Validation failures | 0 / 20 |
+| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) |
+
+[Full notes](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [QA report](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json)
+
+### QA findings — closed (vs. delivery-002)
+
+| Finding | Delivery-002 | Delivery-003 |
+|---------|:---:|:---:|
+| `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity |
+| `warn_no_overlap` | 4 clips | **0** — overlap_ratio 100% on I4+ clips |
+| `warn_emotion_downgrade` | 4 clips | **0** — emotion_downgrade_ratio 0% |
+| `generation_metadata` absent | present in 0 of 8 clips | **20 of 20** carry the full block |
+| `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files |
+| `normalized_dbfs` hardcoded `-1.0` | all 8 clips | **fixed** — now the measured peak |
+
+Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](https://github.com/DataHackIL/SynthBanshee/pull/110)/[#111](https://github.com/DataHackIL/SynthBanshee/pull/111)/[#112](https://github.com/DataHackIL/SynthBanshee/pull/112)):
+
+| Finding | Resolution |
+|---------|-----------|
+| `single_backend` false positive | `qa.py` now derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` |
+| Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX strings |
+| Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks |
+
+### QA findings — open
+
+| Finding | Detail |
+|---------|--------|
+| `low_voice_diversity_male` | 2 voice families per gender; threshold ≥ 3 |
+| `low_voice_diversity_female` | 2 voice families per gender; threshold ≥ 3 |
+| `vic_f0_high` (2 clips) | `sp_sv_a_0003_00` and `sp_it_a_0003_00` — Google Chirp HD female F0 runs higher than Azure Hila reference |
+| `quality_flagged_clips: 15` | Mostly from prosody cap activations at I3+; expected behaviour |
+
+### Known limitations
+
+- **Speaker-disjoint splits not feasible.** 4 unique speaker personas across 20 clips; all clips are `split: train`.
+- **Two new speaker directories.** `agg_m_30-45_002/` and `ben_m_40-55_003/` appear for the first time in this delivery, so code that hardcodes `agg_m_30-45_001/` will miss them.
+- **One room type.** All 8 Elephant Tier B clips use `clinic_office`. Future deliveries will add `welfare_office` and `open_office`.
+- **Toy corpus only.** 20 clips is not sufficient for training production models.
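To avoid hardcoding a single speaker directory, clips can be discovered by globbing. The sketch assumes the `data/he/<speaker_dir>/<clip_id>.wav` layout implied by the example path in the audio-format page; the helper name is hypothetical, and the demo builds a throwaway tree with placeholder placements, not the real corpus layout.

```python
import tempfile
from pathlib import Path


def clips_by_speaker_dir(root):
    """Map each speaker directory name under root to its sorted clip IDs."""
    found = {}
    for wav in sorted(Path(root).glob("*/*.wav")):
        found.setdefault(wav.parent.name, []).append(wav.stem)
    return found


# Demo on a temporary tree; directory/clip pairings here are placeholders.
with tempfile.TemporaryDirectory() as root:
    for speaker_dir, clip_id in [
        ("agg_m_30-45_001", "sp_sv_a_0001_00"),
        ("agg_m_30-45_002", "sp_sv_a_0003_00"),
        ("ben_m_40-55_003", "demo_clip_00"),
    ]:
        target = Path(root) / speaker_dir
        target.mkdir(parents=True, exist_ok=True)
        (target / f"{clip_id}.wav").touch()
    layout = clips_by_speaker_dir(root)
```

Against a checkout of the corpus, the call would be `clips_by_speaker_dir("data/he")`, which picks up new speaker directories without code changes.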
+
+### What this delivery exercises
+
+1. Full `ClipMetadata` schema including `generation_metadata`, `voice_family`, and (for Tier B) the populated `acoustic_scene` block
+2. Per-surface casing rules: UPPERCASE `speaker_id`, lowercase paths and clip IDs
+3. `has_violence` derivation from events: NEG clips are correctly `false` even at `max_intensity ≥ 3`
+4. Multi-project layout under a single `data/he/` root
+5. Multi-backend provenance: `generation_metadata.tts_backend` per speaker
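The casing rules in item 2 can be illustrated with a small path builder. It assumes that a speaker's directory name is simply the lowercased `speaker_id`, which matches the IDs and directories quoted in these pages but is not stated as a rule; both function names are hypothetical.

```python
def speaker_dir_for(speaker_id: str) -> str:
    """Lowercase directory name for an UPPERCASE speaker_id (assumed mapping)."""
    if speaker_id != speaker_id.upper():
        raise ValueError(f"speaker_id must be UPPERCASE: {speaker_id!r}")
    return speaker_id.lower()


def clip_path_for(speaker_id: str, clip_id: str, root: str = "data/he") -> str:
    """Build a lowercase, repo-relative POSIX path for a clip's WAV file."""
    if clip_id != clip_id.lower():
        raise ValueError(f"clip_id must be lowercase: {clip_id!r}")
    return f"{root}/{speaker_dir_for(speaker_id)}/{clip_id}.wav"


example = clip_path_for("AGG_M_30-45_001", "sp_sv_a_0001_00")
```

Rejecting wrongly cased input, instead of silently normalizing it, surfaces metadata that breaks the per-surface rule.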
+
+---
+
+## Delivery log
+
+| # | Date | Slug | Project | Tier | Clips | Duration | Status |
+|---|------|------|---------|------|------:|------:|--------|
+| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | provisional |
+| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | superseded |
+| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | superseded |
+
+## Status definitions
+
+| Status | Meaning |
+|--------|---------|
+| `provisional` | Wet-test batch; not yet approved for model training |
+| `approved` | QA passed; cleared for training use |
+| `superseded` | Replaced by a later delivery with the same scenes at higher quality |