diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000..c300fe0
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,29 @@
+name: Deploy docs
+
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - "docs/**"
+      - "mkdocs.yml"
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install MkDocs Material
+        run: pip install mkdocs-material
+
+      - name: Deploy to GitHub Pages
+        run: mkdocs gh-deploy --force
diff --git a/docs/assets/logo.svg b/docs/assets/logo.svg
new file mode 100644
index 0000000..138e42d
--- /dev/null
+++ b/docs/assets/logo.svg
@@ -0,0 +1,10 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 48 48" fill="none">
+  <circle cx="24" cy="24" r="22" fill="#00897B" opacity="0.15"/>
+  <circle cx="24" cy="24" r="16" fill="#00897B" opacity="0.25"/>
+  <!-- Waveform bars -->
+  <rect x="10" y="20" width="4" height="8" rx="2" fill="#00897B"/>
+  <rect x="16" y="14" width="4" height="20" rx="2" fill="#00897B"/>
+  <rect x="22" y="10" width="4" height="28" rx="2" fill="#00ACC1"/>
+  <rect x="28" y="16" width="4" height="16" rx="2" fill="#00897B"/>
+  <rect x="34" y="21" width="4" height="6" rx="2" fill="#00897B"/>
+</svg>
diff --git a/docs/audio-format.md b/docs/audio-format.md
new file mode 100644
index 0000000..b0173a9
--- /dev/null
+++ b/docs/audio-format.md
@@ -0,0 +1,132 @@
+# Audio Format
+
+All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus.
+
+---
+
+## Format requirements
+
+| Property | Value |
+|----------|-------|
+| Sample rate | 16 000 Hz |
+| Channels | 1 (mono) |
+| Bit depth | 16-bit PCM |
+| Peak level | ≤ –1.0 dBFS (safety ceiling) |
+| Duration | ≥ 3.0 s |
+| Encoding | WAV (no lossy formats) |
+
+```python
+import soundfile as sf
+import numpy as np
+
+wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
+assert sr == 16000
+assert wav.ndim == 1               # mono
+assert wav.dtype == np.float64     # soundfile returns float64 by default
+assert np.abs(wav).max() <= 10 ** (-1.0 / 20)  # -1.0 dBFS ≈ 0.891 linear amplitude
+
+# Check format info
+info = sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
+print(info.subtype)  # PCM_16
+```
+
+---
+
+## Normalization pipeline
+
+Each clip passes through two normalization steps and a safety limiter:
+
+```
+TTS render (float32, arbitrary loudness)
+    ↓
+[1] Per-turn RMS gain (M3a)        — preserves inter-turn contrast
+    ↓
+[2] Single global peak gain         — lands absolute peak at target_peak_dbfs
+    ↓
+[3] Safety limiter                  — clips at ≤ –1.0 dBFS (guaranteed no-op for in-spec targets ≤ –1.5 dBFS)
+    ↓
+Tier B only: room IR + device → renormalize to same target
+    ↓
+Output WAV
+```
+
+### Step 1 — Per-turn RMS gain (M3a)
+
+Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with.
+
+??? info "Why per-turn RMS matters"
+    Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data.
+
+### Step 2 — Single global peak gain
+
+A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved.
+
+The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`.
+The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`.
+
+### Step 3 — Safety limiter
+
+A hard ceiling at –1.0 dBFS. For in-spec targets (range: [–12.0, –1.5] dBFS), this is a guaranteed no-op. It exists as a safety rail against misconfiguration.
+
+---
+
+## Silence padding
+
+Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`.
+
+Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output.
+
+---
+
+## Dirty files
+
+`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising.
+
+The `dirty_file_path` field in ClipMetadata gives the repo-relative path:
+```
+"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav"
+```
+
+Dirty files are useful for:
+- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`)
+- Checking raw TTS prosody before processing
+- Re-running preprocessing with different parameters
+
+!!! warning "Do not modify dirty files"
+    The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups.
+
+---
+
+## TTS backends
+
+| Backend | Voices | Clips in delivery-003 |
+|---------|--------|----------------------|
+| Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 |
+| Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 |
+
+The backend per speaker is recorded in `generation_metadata.tts_backend`:
+```json
+"tts_backend": {
+    "AGG_M_30-45_002": "google",
+    "VIC_F_25-40_003": "google"
+}
+```
+
+??? info "Azure SSML cache"
+    SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters).
+
+---
+
+## Known audio quirks
+
+### `vic_f0_high` — Google Chirp HD female F0 baseline
+
+The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated.
+
+Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend.
+
+### `quality_flags: ["emotion_downgrade"]`
+
+Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended.
+
+In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+.
diff --git a/docs/deliveries.md b/docs/deliveries.md
new file mode 100644
index 0000000..71dac3b
--- /dev/null
+++ b/docs/deliveries.md
@@ -0,0 +1,86 @@
+# Deliveries
+
+All data deliveries are logged here. Each entry links to per-delivery notes with clip counts, QA findings, known limitations, and the SynthBanshee commit that produced the batch.
+
+---
+
+## Delivery 003 — multi-project, multi-voice
+
+**Date:** 2026-05-12 · **Status:** provisional · **PR:** [#5](https://github.com/DataHackIL/avdp-synth-corpus/pull/5)
+
+This is the current working delivery. It replaces delivery-002.
+
+### At a glance
+
+| | |
+|---|---|
+| Clips | 20 |
+| Total duration | ~41.6 min |
+| Projects | `she_proves` (12) + `elephant_in_the_room` (8) |
+| Tiers | A (12 clean) + B (8 room-augmented) |
+| TTS backends | Azure (18) + Google Chirp 3 HD (2) |
+| Validation failures | 0 / 20 |
+| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) |
+
+[Full notes](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [QA report](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json)
+
+### QA findings — closed (vs. delivery-002)
+
+| Finding | Delivery-002 | Delivery-003 |
+|---------|:---:|:---:|
+| `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity |
+| `warn_no_overlap` | 4 clips | **0** — overlap_ratio 100% on I4+ clips |
+| `warn_emotion_downgrade` | 4 clips | **0** — emotion_downgrade_ratio 0% |
+| `generation_metadata` absent | 0 of 8 clips carry the block | **20 of 20** carry the full block |
+| `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files |
+| `normalized_dbfs` hardcoded `-1.0` | all 8 clips | **fixed** — now the measured peak |
+
+Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](https://github.com/DataHackIL/SynthBanshee/pull/110)/[#111](https://github.com/DataHackIL/SynthBanshee/pull/111)/[#112](https://github.com/DataHackIL/SynthBanshee/pull/112)):
+
+| Finding | Resolution |
+|---------|-----------|
+| `single_backend` false positive | `qa.py` now derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` |
+| Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX strings |
+| Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks |
+
+### QA findings — open
+
+| Finding | Detail |
+|---------|--------|
+| `low_voice_diversity_male` | 2 voice families per gender; threshold ≥ 3 |
+| `low_voice_diversity_female` | 2 voice families per gender; threshold ≥ 3 |
+| `vic_f0_high` (2 clips) | `sp_sv_a_0003_00` and `sp_it_a_0003_00` — Google Chirp HD female F0 runs higher than Azure Hila reference |
+| `quality_flagged_clips: 15` | Mostly from prosody cap activations at I3+; expected behaviour |
+
+### Known limitations
+
+- **Speaker-disjoint splits not feasible.** 4 unique speaker personas across 20 clips; all clips are `split: train`.
+- **Two new speaker directories.** `agg_m_30-45_002/` and `ben_m_40-55_003/` are first appearances; code that hardcodes `agg_m_30-45_001/` will miss them.
+- **One room type.** All 8 Elephant Tier B clips use `clinic_office`. Future deliveries will add `welfare_office` and `open_office`.
+- **Toy corpus only.** 20 clips is not sufficient for training production models.
+
+### What this delivery exercises
+
+1. Full `ClipMetadata` schema including `generation_metadata`, `voice_family`, and (for Tier B) the populated `acoustic_scene` block
+2. Per-surface casing rules: UPPERCASE `speaker_id`, lowercase paths and clip IDs
+3. `has_violence` derivation from events: NEG clips are correctly `false` even at `max_intensity ≥ 3`
+4. Multi-project layout under a single `data/he/` root
+5. Multi-backend provenance: `generation_metadata.tts_backend` per speaker
+
+---
+
+## Delivery log
+
+| # | Date | Slug | Project | Tier | Clips | Duration | Status |
+|---|------|------|---------|------|------:|------:|--------|
+| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | provisional |
+| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | superseded |
+| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | superseded |
+
+## Status definitions
+
+| Status | Meaning |
+|--------|---------|
+| `provisional` | Wet-test batch; not yet approved for model training |
+| `approved` | QA passed; cleared for training use |
+| `superseded` | Replaced by a later delivery with the same scenes at higher quality |
diff --git a/docs/elephant.md b/docs/elephant.md
new file mode 100644
index 0000000..fa62e35
--- /dev/null
+++ b/docs/elephant.md
@@ -0,0 +1,178 @@
+# Elephant in the Room Guide
+
+**Elephant in the Room (הפיל שבחדר)** is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat.
+
+**Optimization target: high precision.** False alarms erode trust with security staff and social workers alike.
+
+---
+
+## Scene structure
+
+| Property | Value |
+|----------|-------|
+| Duration | 1–4 minutes |
+| Tier | B (room IR + device + noise augmentation) |
+| Alert window | Final 40% of the clip |
+| Device profile | `pi_budget_mic` |
+| Room types | `clinic_office`, `welfare_office`, `open_office` |
+| Language | Hebrew (`he`) |
+
+The alert-in-final-40% constraint reflects real-world deployment: the device picks up normal consultation audio before a client becomes threatening. The model must recognize genuine escalation from a baseline of professional interaction.
+
+??? info "Tier B acoustic augmentation pipeline"
+    Tier B clips go through three augmentation steps after TTS rendering and preprocessing:
+
+    1. **Room impulse response (IR)** — the clean speech is convolved with a synthetic room IR (generated by `pyroomacoustics` image-source method) to simulate the acoustic of the target room type.
+    2. **Device frequency response** — the `pi_budget_mic` profile applies the frequency response of a budget Raspberry Pi microphone capsule.
+    3. **Background noise injection** — ambient noise events (HVAC hum, equipment sounds) are mixed in at specified SNR levels.
+
+    After augmentation, the clip is renormalized to the same peak target (–2.0 dBFS) via the shared `peak_normalize_to_target` helper — so all tiers exit at the same absolute peak level.
+
+---
+
+## Speaker pair
+
+Delivery-003 has one Elephant speaker pair.
+
+| Speaker dir | Male speaker | Female speaker | Backend |
+|-------------|--------------|----------------|---------|
+| `ben_m_40-55_003/` | `BEN_M_40-55_003` → `he-IL-AvriNeural` | `SW_F_30-45_001` → `he-IL-HilaNeural` | Azure |
+
+The roles are **BEN (beneficiary/client, male) + SW (social worker, female)** — matching the most common demographic in Israeli welfare/clinic settings.
+
+!!! note "`ben_m_40-55_003/` is a new speaker directory in delivery-003"
+    Downstream code that hardcoded `agg_m_30-45_001/` for She-Proves will not find these clips. Use `manifest.csv` or filter by `meta["project"] == "elephant_in_the_room"`.
+
+---
+
+## The `acoustic_scene` block
+
+This is the key difference between Tier A and Tier B metadata. For Elephant clips, `acoustic_scene` is fully populated:
+
+```json
+"acoustic_scene": {
+    "room_type": "clinic_office",
+    "device": "pi_budget_mic",
+    "ir_source": "pyroomacoustics_ism",
+    "snr_db_actual": 11.2,
+    "speaker_distance_meters": 1.2,
+    "background_events": [
+        {"type": "hvac_hum",   "onset": 0.0,     "offset": 147.0, "level_db": -37.4},
+        {"type": "ACOU_SLAM",  "onset": 72.164,  "offset": 72.476, "level_db": 9.9},
+        {"type": "ACOU_FALL",  "onset": 97.57,   "offset": 98.473, "level_db": 9.6}
+    ]
+}
+```
+
+| Field | Meaning |
+|-------|---------|
+| `room_type` | Simulated room environment |
+| `device` | Microphone/device profile applied |
+| `ir_source` | Method used to generate room IR |
+| `snr_db_actual` | Measured speech-to-noise ratio after mixing |
+| `speaker_distance_meters` | Simulated speaker-to-mic distance |
+| `background_events` | Non-speech acoustic events: type, timestamps, level |
+
+??? info "What is `pyroomacoustics_ism`?"
+    The image-source method (ISM) is an algorithm for computing room impulse responses by reflecting a virtual point source off the room's walls. `pyroomacoustics` is a Python library that implements it.
+
+    The resulting IR simulates how sound travels from a speaker to a microphone in a room of specified dimensions and surface absorption coefficients — giving the audio the characteristic reverb of the target room type without recording in a real room.
+
+??? info "Background event types"
+    | Type | Description |
+    |------|-------------|
+    | `hvac_hum` | Constant HVAC/ventilation hum (low level, full duration) |
+    | `ACOU_SLAM` | Door slam or hard object impact (brief, high level) |
+    | `ACOU_FALL` | Object falling or being thrown (brief, high level) |
+
+    `ACOU_*` events are also tagged as `EventLabel` entries in the `.jsonl` strong labels with `tier1_category: "ACOU"`. This means they contribute to `weak_label.violence_categories` even in SV/IT clips where the primary violence is verbal or physical.
+
+---
+
+## Clips in delivery-003
+
+`data/he/ben_m_40-55_003/`
+
+| Clip ID | Typology | `has_violence` | Duration | SNR (dB) |
+|---------|----------|:---:|------:|:---:|
+| `el_sv_b_0001_00` | SV | ✓ | 2m 27.0s | ~11 |
+| `el_sv_b_0002_00` | SV | ✓ | 2m 18.5s | ~11 |
+| `el_it_b_0001_00` | IT | ✓ | 2m 30.0s | ~11 |
+| `el_it_b_0002_00` | IT | ✓ | 2m 31.6s | ~11 |
+| `el_neg_b_0001_00` | NEG | — | 1m 53.8s | ~11 |
+| `el_neg_b_0002_00` | NEG | — | 2m 54.6s | ~11 |
+| `el_neu_b_0001_00` | NEU | — | 1m 56.9s | ~11 |
+| `el_neu_b_0002_00` | NEU | — | 1m 19.7s | ~11 |
+
+All 8 clips are Tier B with `device: pi_budget_mic` and `room_type: clinic_office`.
+
+---
+
+## Loading Elephant clips
+
+```python
+import json
+import soundfile as sf
+import numpy as np
+import pandas as pd
+from pathlib import Path
+
+root = Path(".")
+df = pd.read_csv("data/he/manifest.csv")
+el_clips = df[df["project"] == "elephant_in_the_room"]
+
+# Load audio + metadata for a Tier B clip
+clip_id = "el_sv_b_0001_00"
+wav, sr = sf.read(root / f"data/he/ben_m_40-55_003/{clip_id}.wav")
+meta = json.loads((root / f"data/he/ben_m_40-55_003/{clip_id}.json").read_text())
+
+# Inspect acoustic scene
+scene = meta["acoustic_scene"]
+print(f"Room: {scene['room_type']}  Device: {scene['device']}  SNR: {scene['snr_db_actual']} dB")
+# Room: clinic_office  Device: pi_budget_mic  SNR: 11.2 dB
+
+# Find background acoustic events
+for evt in scene["background_events"]:
+    print(f"{evt['type']}: {evt['onset']:.1f}s – {evt['offset']:.1f}s  @ {evt['level_db']} dB")
+# hvac_hum: 0.0s – 147.0s  @ -37.4 dB
+# ACOU_SLAM: 72.2s – 72.5s  @ 9.9 dB
+# ACOU_FALL: 97.6s – 98.5s  @ 9.6 dB
+
+# Get alert window (final 40%)
+duration = meta["duration_seconds"]
+alert_start = duration * 0.60
+print(f"Alert window: {alert_start:.1f}s – {duration:.1f}s")
+
+# Filter strong labels to alert window only
+events = [json.loads(line) for line in
+          (root / f"data/he/ben_m_40-55_003/{clip_id}.jsonl").read_text().splitlines() if line.strip()]
+alert_events = [e for e in events if e["onset"] >= alert_start]
+```
+
+---
+
+## Guidance for model training
+
+!!! warning "This is a toy corpus — not for production training"
+    8 Elephant clips from 1 speaker pair in 1 room type is insufficient for training. This delivery exists to bootstrap your data pipeline and acoustic-scene parsing code.
+
+**High-precision orientation:**
+
+- **NEG clips are essential.** Your precision target means you must not fire on `el_neg_b_*` clips — intense speech in a clinic room with background noise, but no violence. Train hard against these.
+- **The alert-in-final-40% window** is where violence events concentrate. Consider a sliding-window detector that scores the final portion of each clip more aggressively than the opening.
+- **SNR is ~11 dB.** This is a realistic but challenging condition for acoustic feature extraction. Verify that your features (MFCCs, log-mel, etc.) are robust at this SNR before comparing with She-Proves Tier A results.
+
+**Tier B–specific features:**
+
+- `acoustic_scene.snr_db_actual` gives you the ground-truth SNR per clip — useful for SNR-conditioned training or evaluation stratification.
+- `background_events` timestamps let you train event detectors separately from the speech violence detector.
+- `acoustic_scene.room_type` will diversify across room types at scale (`clinic_office`, `welfare_office`, `open_office`). Future deliveries will include all three.
+
+**What delivery-003 doesn't cover:**
+
+- Only `clinic_office` room type (all 8 clips)
+- Only one speaker pair (BEN_M_40-55_003 + SW_F_30-45_001)
+- No test/val split (4 unique speakers total; all are `split: train`)
+- No SNR variation (all clips are ~11 dB)
+
+Plan for room-type diversity, SNR stratification, and speaker-disjoint splits at scale.
diff --git a/docs/getting-started.md b/docs/getting-started.md
new file mode 100644
index 0000000..0310028
--- /dev/null
+++ b/docs/getting-started.md
@@ -0,0 +1,183 @@
+# Getting Started
+
+This guide walks through loading and using clips from the corpus in Python. All paths are relative to the repository root.
+
+## Prerequisites
+
+```bash
+pip install soundfile numpy pandas pydantic
+```
+
+??? note "Optional: full SynthBanshee schema"
+    If you want strict Pydantic validation against the full `ClipMetadata` schema:
+    ```bash
+    git clone https://github.com/DataHackIL/SynthBanshee
+    cd SynthBanshee && pip install -e .
+    ```
+    This gives you `from synthbanshee.labels.schema import ClipMetadata` and `validate_clip()`.
+    For most DS workflows, plain `json.loads()` is sufficient.
+
+## Clone the corpus
+
+```bash
+git clone https://github.com/DataHackIL/avdp-synth-corpus.git
+cd avdp-synth-corpus
+```
+
+The repository contains the audio files directly (no LFS). Total size is moderate: the delivery-003 WAVs under `data/he/` come to roughly 80 MB (~41.6 min of 16 kHz mono 16-bit PCM).
+
+---
+
+## Load a single clip
+
+```python
+import json
+from pathlib import Path
+import soundfile as sf
+import numpy as np
+
+root = Path(".")  # run from repo root
+
+clip_id = "sp_sv_a_0001_00"
+speaker_dir = root / "data/he/agg_m_30-45_001"
+
+# --- Audio ---
+wav, sr = sf.read(speaker_dir / f"{clip_id}.wav")
+# wav: float64 array, shape (N,). sr: always 16000.
+
+print(f"Duration: {len(wav)/sr:.1f}s  Sample rate: {sr}  Peak: {np.abs(wav).max():.4f}")
+# Duration: 110.5s  Sample rate: 16000  Peak: 0.7943
+
+# --- Weak labels (ClipMetadata) ---
+meta = json.loads((speaker_dir / f"{clip_id}.json").read_text())
+wl = meta["weak_label"]
+print(f"Typology: {meta['violence_typology']}  has_violence: {wl['has_violence']}  "
+      f"max_intensity: {wl['max_intensity']}")
+# Typology: SV  has_violence: True  max_intensity: 5
+
+# --- Transcript ---
+transcript = (speaker_dir / f"{clip_id}.txt").read_text(encoding="utf-8")
+print(transcript[:200])  # Hebrew turns with timestamps
+```
+
+??? info "Why is the peak ~0.79 (–2.0 dBFS) not 1.0?"
+    All clips are peak-normalized to a **–2.0 dBFS target** (≈ 0.794 linear), not to full scale (0 dBFS = 1.0 linear).
+    This leaves 1 dB of margin below the safety limiter ceiling (–1.0 dBFS ≈ 0.891 linear).
+    `preprocessing_applied.normalized_dbfs` in the JSON records the measured peak.
+    See [Audio Format](audio-format.md) for the full normalization pipeline.
+
+---
+
+## Load strong-label events
+
+```python
+import jsonlines  # pip install jsonlines
+
+events = []
+with jsonlines.open(speaker_dir / f"{clip_id}.jsonl") as reader:
+    for evt in reader:
+        events.append(evt)
+
+# Or without jsonlines:
+events = [
+    json.loads(line)
+    for line in (speaker_dir / f"{clip_id}.jsonl").read_text().splitlines()
+    if line.strip()
+]
+
+for evt in events[:3]:
+    print(f"[{evt['onset']:.1f}s – {evt['offset']:.1f}s] "
+          f"{evt['tier1_category']}/{evt['tier2_subtype']}  I{evt['intensity']}")
+# [0.8s – 10.1s] VERB/VERB_SHOUT  I2
+# [10.5s – 18.7s] VERB/VERB_SHOUT  I2
+# [18.3s – 29.7s] VERB/VERB_THREAT  I3
+```
+
+??? info "What are tier1_category and tier2_subtype?"
+    Strong labels follow a three-level taxonomy:
+
+    **Typology** (clip-level): `SV` · `IT` · `NEG` · `NEU`
+
+    **Tier 1 category** (event-level): `VERB` · `DIST` · `PHYS` · `EMOT` · `ACOU` · `NONE`
+
+    **Tier 2 subtype** (event-level): e.g. `VERB_SHOUT`, `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD`, `ACOU_SLAM`
+
+    See [Label Taxonomy](taxonomy.md) for the full table and has_violence derivation rule.
+
+---
+
+## Work with the manifest
+
+`data/he/manifest.csv` is a flat summary of all clips. It's the fastest entry point for filtering and dataset construction.
+
+```python
+import pandas as pd
+
+df = pd.read_csv("data/he/manifest.csv")
+print(df.columns.tolist())
+# ['clip_id', 'project', 'violence_typology', 'tier', 'duration_seconds',
+#  'speaker_ids', 'voice_families', 'has_violence', 'max_intensity',
+#  'quality_flags', 'split', 'wav_path', 'strong_labels_path']
+
+# Filter by project
+she_proves_clips = df[df["project"] == "she_proves"]
+
+# Filter by typology
+sv_clips = df[df["violence_typology"] == "SV"]
+
+# High-intensity violent clips only
+high_intensity = df[(df["has_violence"]) & (df["max_intensity"] >= 4)]
+
+# Load audio for a manifest row
+row = df.iloc[0]
+wav, sr = sf.read(row["wav_path"])  # paths are repo-relative POSIX strings
+```
+
+!!! warning "`speaker_ids` and `voice_families` are pipe-delimited"
+    These columns contain multiple values joined by `|`:
+    ```python
+    speakers = row["speaker_ids"].split("|")
+    # ['AGG_M_30-45_001', 'VIC_F_25-40_002']
+    ```
+
+!!! note "All clips are `split: train` in delivery-003"
+    The corpus has only 4 unique speaker personas across 20 clips — speaker-disjoint splits are not feasible at this scale. When the corpus scales, speaker-disjoint train/val/test splits will be assigned by SynthBanshee. Until then, treat this as an unpartitioned pool.
+
+---
+
+## Find a clip's speaker directory
+
+Clip IDs follow the pattern `{project_prefix}_{typology}_{tier}_{scene_num}_{take}`. The on-disk directory is the **lowercase** form of the first speaker ID listed in `speakers[]`:
+
+```python
+def clip_dir(root: Path, clip_id: str, meta: dict) -> Path:
+    first_speaker = meta["speakers"][0]["speaker_id"]
+    return root / "data" / meta["language"] / first_speaker.lower()
+```
+
+| clip_id | speaker_dir |
+|---------|-------------|
+| `sp_sv_a_0001_00` | `data/he/agg_m_30-45_001/` |
+| `sp_sv_a_0003_00` | `data/he/agg_m_30-45_002/` |
+| `el_sv_b_0001_00` | `data/he/ben_m_40-55_003/` |
+
+Or use `manifest.csv` directly — `wav_path` already contains the full repo-relative path.
+
+---
+
+## Validate a clip
+
+If you have SynthBanshee installed:
+
+```bash
+synthbanshee validate data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav
+```
+
+This checks: all four files present, WAV format (16 kHz mono), peak ≤ –1.0 dBFS, duration ≥ 3 s, JSON parses as `ClipMetadata`.
+
+To run QA over the entire language directory:
+
+```bash
+synthbanshee qa-report data/he/
+synthbanshee qa-report data/he/ --run-summary   # adds corpus-level aggregates
+```
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..7b5b4d2
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,127 @@
+# avdp-synth-corpus
+
+**Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline (AVDP)**
+
+Generated by [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) · Hebrew (he-IL) · 16 kHz mono 16-bit PCM
+
+---
+
+!!! warning "Toy corpus — not approved for model training"
+    All current deliveries are provisional wet-test batches for spec validation and pipeline bootstrapping.
+    The `split` field in `manifest.csv` is informational only. **Do not train production models on this data.**
+    See [Deliveries](deliveries.md) for the full status of each batch.
+
+---
+
+## What is this?
+
+This repository contains **synthetic Hebrew audio clips** representing domestic-violence and threat scenarios, produced by a text-to-speech pipeline with automatic prosody modelling and acoustic augmentation.
+
+Two downstream products consume this data:
+
+=== "She-Proves"
+
+    A smartphone app that passively monitors audio for domestic violence incidents and preserves evidence for legal use. High-recall orientation — better to flag and review than to miss.
+
+    → [She-Proves team guide](she-proves.md)
+
+=== "Elephant in the Room"
+
+    A Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. High-precision orientation — false alarms erode trust.
+
+    → [Elephant in the Room team guide](elephant.md)
+
+---
+
+## Current delivery at a glance
+
+**Delivery 003 — multi-project, multi-voice** · 2026-05-12 · provisional
+
+| Dimension | Value |
+|-----------|-------|
+| Clips | 20 |
+| Total duration | ~41.6 min |
+| Projects | `she_proves` (12 clips) + `elephant_in_the_room` (8 clips) |
+| Tiers | A — clean (12) + B — room-augmented (8) |
+| TTS backends | Azure (18 clips) + Google Chirp 3 HD (2 clips) |
+| Validation failures | 0 / 20 |
+| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) |
+
+Full breakdown: [Deliveries](deliveries.md) · [She-Proves clips](she-proves.md#clips-in-delivery-003) · [Elephant clips](elephant.md#clips-in-delivery-003)
+
+---
+
+## Repository layout
+
+```
+data/
+  he/                        # ISO 639-1 language code
+    {speaker_dir}/           # e.g. agg_m_30-45_001/  (lowercase of first speaker ID)
+      {clip_id}.wav          # 16 kHz mono 16-bit PCM
+      {clip_id}.txt          # per-turn transcript with onset/offset markers
+      {clip_id}.json         # ClipMetadata (weak labels, provenance, speaker info)
+      {clip_id}.jsonl        # EventLabel records — one JSON object per line
+    manifest.csv             # flat summary of all clips under data/he/
+
+assets/
+  speech/                    # SHA-256-keyed per-utterance WAV cache (do not modify)
+    dirty/                   # pre-preprocessing WAVs, retained per spec
+  scripts/                   # SHA-256-keyed LLM script cache (do not modify)
+
+deliveries/
+  {slug}/
+    metadata.yaml            # structured delivery record
+    notes.md                 # narrative QA notes and known limitations
+    qa-report.json           # synthbanshee qa-report output
+```
+
+??? info "Why are there four files per clip?"
+    - **`.wav`** — the audio, spec-compliant (normalized, padded, validated)
+    - **`.txt`** — the transcript with turn-level onset/offset markers, used as ASR reference
+    - **`.json`** — `ClipMetadata`: weak labels (`has_violence`, `max_intensity`), speaker list, acoustic scene, provenance (`generation_metadata`)
+    - **`.jsonl`** — `EventLabel` records: one line per strong-label event with category, subtype, onset, offset, intensity, emotional state
+
+    You only need `.wav` + `.json` for most training pipelines. Add `.jsonl` when you need per-event strong labels or onset/offset supervision.
+
+---
+
+## Where to start
+
+| I want to… | Go to |
+|------------|-------|
+| Load my first clip in Python | [Getting Started → Load a clip](getting-started.md#load-a-single-clip) |
+| Understand what the labels mean | [Label Taxonomy](taxonomy.md) |
+| Parse `ClipMetadata` with Pydantic | [Schema Reference](schema.md) |
+| Work with She-Proves scenes | [She-Proves guide](she-proves.md) |
+| Work with Elephant Tier B audio | [Elephant in the Room guide](elephant.md) |
+| Understand the audio normalization | [Audio Format](audio-format.md) |
+| Check current quality status | [Deliveries](deliveries.md) |
+
+---
+
+## Quick snippet
+
+```python
+import json
+from pathlib import Path
+import soundfile as sf
+
+root = Path(".")  # repo root
+
+# Load a clip
+wav, sr = sf.read(root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
+meta = json.loads((root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text())
+
+print(f"Duration: {len(wav)/sr:.1f}s  has_violence: {meta['weak_label']['has_violence']}")
+# Duration: 110.5s  has_violence: True
+```
+
+For manifest-level operations:
+
+```python
+import pandas as pd
+
+df = pd.read_csv("data/he/manifest.csv")
+violent = df[df["has_violence"] == True]
+print(violent[["clip_id", "project", "violence_typology", "duration_seconds"]].to_string())
+```
diff --git a/docs/schema.md b/docs/schema.md
new file mode 100644
index 0000000..98a10fd
--- /dev/null
+++ b/docs/schema.md
@@ -0,0 +1,219 @@
+# Schema Reference
+
+Every clip's `.json` file contains a `ClipMetadata` object. The authoritative Pydantic model is in [SynthBanshee `synthbanshee/labels/schema.py`](https://github.com/DataHackIL/SynthBanshee/blob/main/synthbanshee/labels/schema.py).
+
+---
+
+## Loading with Pydantic
+
+```python
+from synthbanshee.labels.schema import ClipMetadata  # requires SynthBanshee installed
+from pathlib import Path
+
+meta = ClipMetadata.model_validate_json(
+    Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text()
+)
+print(meta.clip_id, meta.violence_typology, meta.weak_label.has_violence)
+# sp_sv_a_0001_00 SV True
+```
+
+Plain JSON (no SynthBanshee required):
+
+```python
+import json
+from pathlib import Path
+
+meta = json.loads(Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text())
+```
+
+---
+
+## Top-level `ClipMetadata` fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `clip_id` | `str` | Lowercase ASCII clip identifier, e.g. `sp_sv_a_0001_00` |
+| `project` | `str` | `she_proves` or `elephant_in_the_room` |
+| `language` | `str` | ISO 639-1, always `"he"` |
+| `violence_typology` | `str` | `SV` / `IT` / `NEG` / `NEU` — see [taxonomy](taxonomy.md) |
+| `tier` | `str` | `"A"` (clean) or `"B"` (room-augmented) |
+| `duration_seconds` | `float` | Duration of the processed WAV |
+| `sample_rate` | `int` | Always `16000` |
+| `channels` | `int` | Always `1` |
+| `is_synthetic` | `bool` | Always `true` in this corpus |
+| `generator_version` | `str` | SynthBanshee semver, e.g. `"0.1.0"` |
+| `generation_date` | `str` | ISO 8601 date of generation |
+| `random_seed` | `int` | Scene-level RNG seed for reproducibility |
+| `scene_config` | `str` | Relative path to the scene YAML in SynthBanshee |
+| `transcript_path` | `str` | Repo-relative POSIX path to the `.txt` transcript |
+| `dirty_file_path` | `str` | Repo-relative POSIX path to the pre-preprocessing WAV |
+| `speakers` | `list[SpeakerInfo]` | Speaker metadata — see below |
+| `weak_label` | `WeakLabel` | Clip-level summary labels |
+| `generation_metadata` | `GenerationMetadata \| null` | Pipeline provenance — see below |
+| `preprocessing_applied` | `PreprocessingApplied` | What preprocessing steps ran |
+| `acoustic_scene` | `AcousticScene` | Room/device augmentation (Tier B) |
+| `quality_flags` | `list[str]` | QA flags, e.g. `["emotion_downgrade"]` |
+| `snr_db_estimated` | `float \| null` | Estimated SNR (not always populated) |
+| `annotator_confidence` | `float` | Auto-label confidence, 0–1 (auto-generated: always `1.0`) |
+| `iaa_reviewed` | `bool` | Whether inter-annotator agreement review was done |
+| `she_proves_meta` | `null` | Reserved for She-Proves–specific metadata (future) |
+| `elephant_meta` | `null` | Reserved for Elephant–specific metadata (future) |
+
+---
+
+## `SpeakerInfo`
+
+One entry per speaker in `speakers[]`.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `speaker_id` | `str` | UPPERCASE persona ID, e.g. `AGG_M_30-45_001` |
+| `role` | `str` | `AGG` (aggressor), `VIC` (victim), `SW` (social worker), `BEN` (beneficiary/client) |
+| `gender` | `str` | `"male"` or `"female"` |
+| `age_range` | `str` | e.g. `"30-45"` |
+| `tts_voice_id` | `str` | TTS voice identifier, e.g. `"he-IL-AvriNeural"` |
+| `voice_family` | `str` | Same as `tts_voice_id` (may diverge in future) |
+
+??? info "Speaker ID casing convention"
+    The `speaker_id` field in JSON is always **UPPERCASE**: `AGG_M_30-45_001`.
+    The on-disk directory is **lowercase**: `agg_m_30-45_001/`.
+    This is a deliberate per-surface casing rule — see [SynthBanshee spec §2.5](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#25-filename-constraints).
+
+---
+
+## `WeakLabel`
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `has_violence` | `bool` | `any(e.tier1_category != "NONE" for e in events)` — see [taxonomy](taxonomy.md#has_violence-the-correct-derivation) |
+| `violence_typology` | `str` | Mirrors top-level `violence_typology` |
+| `max_intensity` | `int` | Highest per-turn intensity across the clip (1–5) |
+| `violence_categories` | `list[str]` | Distinct `tier1_category` values observed in events |
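+
+The field descriptions above can be turned into a quick consistency check. The following is a minimal sketch, not SynthBanshee's own code: `derive_weak_label` is a hypothetical helper, it assumes every turn is represented by an event in the `.jsonl` file, and whether `NONE` is kept in the stored `violence_categories` is not specified on this page.
+
+```python
+def derive_weak_label(events):
+    """Recompute clip-level fields from a list of EventLabel dicts.
+
+    Hypothetical validation helper; assumes each dict carries
+    `tier1_category` and `intensity` as described on this page.
+    """
+    categories = {e["tier1_category"] for e in events}
+    return {
+        # True as soon as one event is anything other than NONE.
+        "has_violence": any(c != "NONE" for c in categories),
+        # Highest event intensity, or None for an empty event list.
+        "max_intensity": max((e["intensity"] for e in events), default=None),
+        # Distinct categories observed (keeping NONE is an assumption).
+        "violence_categories": sorted(categories),
+    }
+
+
+events = [
+    {"tier1_category": "NONE", "intensity": 2},
+    {"tier1_category": "VERB", "intensity": 4},
+]
+print(derive_weak_label(events))
+```
+
+Comparing the result with `meta["weak_label"]` for each clip is one way to spot drift between the `.json` and `.jsonl` files.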
+
+---
+
+## `GenerationMetadata`
+
+Present on all delivery-003 clips; may be `null` on older clips.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `pipeline_version` | `str` | SynthBanshee semver |
+| `tts_backend` | `dict[str, str]` | Speaker ID → `"azure"` or `"google"` |
+| `voice_family` | `dict[str, str]` | Speaker ID → voice family string |
+| `mix_mode_used` | `str` | `"sequential"` (turns in order) or `"overlapping"` |
+| `normalization_strategy` | `str` | `"per_turn_rms_v2_target_peak"` |
+| `loudness_target_peak_dbfs` | `float` | Configured peak target, e.g. `-2.0` |
+| `breathiness_applied` | `bool` | Whether breathiness augmentation was applied |
+| `effective_prosody_caps` | `list[ProsodyCap]` | Per-turn cap activations at I3–I5 |
+| `speaker_state_serialized` | `dict[str, SpeakerState]` | Final prosody state per speaker |
+| `prosody_controller_version` | `str \| null` | Version of the prosody controller |
+| `text_normalization_version` | `str \| null` | Version of text normalization |
+| `timing_controller_version` | `str \| null` | Version of timing controller |
+
+### `ProsodyCap` (entry in `effective_prosody_caps`)
+
+| Field | Description |
+|-------|-------------|
+| `turn_index` | Zero-based turn index |
+| `intensity` | Intensity score for that turn |
+| `dim` | `"pitch"` or `"rate"` |
+| `pre_cap` | Prosody value before capping (semitones for pitch, ratio for rate) |
+| `post_cap` | Prosody value after capping |
+
+### `SpeakerState` (entry in `speaker_state_serialized`)
+
+| Field | Description |
+|-------|-------------|
+| `pitch_offset_st` | Final pitch offset in semitones |
+| `rate_offset` | Final speaking rate multiplier |
+| `volume_offset_db` | Final volume offset in dB |
+| `breathiness_level` | Breathiness level 0–1 |
+
+---
+
+## `PreprocessingApplied`
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `resampled_to_16k` | `bool` | Whether sample rate conversion ran |
+| `downmixed_to_mono` | `bool` | Whether channel downmix ran |
+| `normalized_dbfs` | `float` | **Measured** peak dBFS of the output WAV (not the target) |
+| `silence_padded` | `bool` | Whether silence padding was applied |
+| `denoised` | `bool` | Whether denoising ran |
+| `spectral_filtered` | `bool` | Whether spectral filtering ran |
+
+!!! note "`normalized_dbfs` is the measured peak, not the target"
+    Use `generation_metadata.loudness_target_peak_dbfs` for the configured target.
+    Use `preprocessing_applied.normalized_dbfs` to verify the actual output peak.
+    On delivery-003, both should be very close to `-2.0` (within floating-point precision).
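+
+To verify the measured peak yourself, a sketch along these lines may help. It assumes the samples are floats scaled to [-1.0, 1.0], which is what `soundfile.read` returns by default; `peak_dbfs` is a hypothetical helper name, not part of SynthBanshee.
+
+```python
+import numpy as np
+
+
+def peak_dbfs(samples):
+    """Peak level in dBFS for float samples scaled to [-1.0, 1.0]."""
+    peak = float(np.max(np.abs(samples)))
+    # Digital silence has no finite level in decibels.
+    if peak == 0.0:
+        return float("-inf")
+    return float(20.0 * np.log10(peak))
+
+
+# Half of full scale is about 6 dB below 0 dBFS.
+print(round(peak_dbfs(np.array([0.0, 0.5, -0.25])), 2))
+```
+
+With `wav` loaded as in the quick snippet on the home page, `peak_dbfs(wav)` can be compared against `meta["preprocessing_applied"]["normalized_dbfs"]`.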
+
+---
+
+## `AcousticScene`
+
+Populated for Tier B clips. Null fields indicate Tier A (no augmentation).
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `room_type` | `str \| null` | e.g. `"clinic_office"`, `"welfare_office"`, `"open_office"` |
+| `device` | `str \| null` | e.g. `"pi_budget_mic"` |
+| `ir_source` | `str \| null` | Room impulse response source, e.g. `"pyroomacoustics_ism"` |
+| `snr_db_actual` | `float \| null` | Actual SNR after augmentation (dB) |
+| `speaker_distance_meters` | `float \| null` | Simulated speaker distance from microphone |
+| `background_events` | `list[BackgroundEvent]` | Non-speech acoustic events added |
+
+### `BackgroundEvent`
+
+| Field | Description |
+|-------|-------------|
+| `type` | `"hvac_hum"`, `"ACOU_SLAM"`, `"ACOU_FALL"`, etc. |
+| `onset` | Start time in seconds |
+| `offset` | End time in seconds |
+| `level_db` | Relative level of the event (dB) |
+
+---
+
+## `EventLabel` (`.jsonl` rows)
+
+One JSON object per line. Each represents a single labelled event within the clip.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `event_id` | `str` | `{clip_id}_EVT_{index:03d}` |
+| `clip_id` | `str` | Parent clip ID |
+| `onset` | `float` | Event start time in seconds (in the processed WAV) |
+| `offset` | `float` | Event end time in seconds |
+| `tier1_category` | `str` | `VERB` / `DIST` / `PHYS` / `EMOT` / `ACOU` / `NONE` |
+| `tier2_subtype` | `str` | e.g. `VERB_SHOUT`, `PHYS_HARD` |
+| `intensity` | `int` | Turn intensity 1–5 |
+| `speaker_id` | `str` | UPPERCASE speaker persona ID |
+| `speaker_role` | `str` | `AGG`, `VIC`, `SW`, `BEN` |
+| `emotional_state` | `str` | e.g. `"anger"`, `"fear"`, `"desperation"`, `"neutral"` |
+| `confidence` | `float` | Auto-label confidence (always `1.0` for auto-generated) |
+| `label_source` | `str` | `"auto"` for all current clips |
+| `iaa_reviewed` | `bool` | Always `false` in current deliveries |
+| `truncated` | `bool` | Whether the event was cut short by a turn boundary |
+| `notes` | `str \| null` | Annotator notes |
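+
+Because `onset` and `offset` are given in seconds of the processed WAV, cutting an event out of the audio is a matter of converting them to sample indices. A minimal sketch follows; `load_events` and `event_sample_range` are hypothetical helper names, and rounding to the nearest sample is an assumption.
+
+```python
+import json
+from pathlib import Path
+
+
+def load_events(jsonl_path):
+    """Parse a .jsonl file into a list of dicts, skipping blank lines."""
+    text = Path(jsonl_path).read_text(encoding="utf-8")
+    return [json.loads(line) for line in text.splitlines() if line.strip()]
+
+
+def event_sample_range(event, sample_rate=16000):
+    """Convert onset/offset in seconds to (start, end) sample indices."""
+    start = int(round(event["onset"] * sample_rate))
+    end = int(round(event["offset"] * sample_rate))
+    return start, end
+
+
+start, end = event_sample_range({"onset": 1.5, "offset": 2.25})
+print(start, end)
+```
+
+Given a loaded `wav` array, `wav[start:end]` is then the audio for that event.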
+
+---
+
+## Manifest CSV columns
+
+`data/he/manifest.csv` — one row per clip.
+
+| Column | Type | Notes |
+|--------|------|-------|
+| `clip_id` | str | Matches JSON `clip_id` |
+| `project` | str | `she_proves` / `elephant_in_the_room` |
+| `violence_typology` | str | `SV` / `IT` / `NEG` / `NEU` |
+| `tier` | str | `A` / `B` |
+| `duration_seconds` | float | |
+| `speaker_ids` | str | Pipe-delimited, e.g. `AGG_M_30-45_001\|VIC_F_25-40_002` |
+| `voice_families` | str | Pipe-delimited, matches `speaker_ids` order |
+| `has_violence` | bool | See [taxonomy](taxonomy.md#has_violence-the-correct-derivation) |
+| `max_intensity` | int | 1–5 |
+| `quality_flags` | str | Comma-delimited flag list |
+| `split` | str | `train` / `val` / `test` — all `train` in delivery-003 |
+| `wav_path` | str | Repo-relative POSIX path |
+| `strong_labels_path` | str | Repo-relative POSIX path to `.jsonl` |
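+
+The delimited columns arrive as plain strings, so they usually need splitting before use. The sketch below is one way to do that with pandas; `split_manifest_lists` is a hypothetical helper, and treating empty cells as empty lists is an assumption about how the manifest encodes "no flags".
+
+```python
+import pandas as pd
+
+
+def _split(value, sep):
+    """Split a delimited cell into a list; empty or missing cells give []."""
+    if not isinstance(value, str) or not value:
+        return []
+    return [part.strip() for part in value.split(sep) if part.strip()]
+
+
+def split_manifest_lists(df):
+    """Return a copy of the manifest with list-valued columns parsed."""
+    out = df.copy()
+    for col in ("speaker_ids", "voice_families"):
+        out[col] = out[col].apply(lambda v: _split(v, "|"))  # pipe-delimited
+    out["quality_flags"] = out["quality_flags"].apply(lambda v: _split(v, ","))
+    return out
+```
+
+Typical use would be `df = split_manifest_lists(pd.read_csv("data/he/manifest.csv"))`, after which `df["quality_flags"].apply(lambda flags: "vic_f0_high" in flags)` selects flagged clips.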
diff --git a/docs/she-proves.md b/docs/she-proves.md
new file mode 100644
index 0000000..fef7983
--- /dev/null
+++ b/docs/she-proves.md
@@ -0,0 +1,133 @@
+# She-Proves Team Guide
+
+She-Proves is a smartphone app that **passively monitors audio for domestic violence incidents** and preserves evidence for legal use.
+
+**Optimization target: high recall.** It is better to flag an incident for review than to miss one.
+
+---
+
+## Scene structure
+
+| Property | Value |
+|----------|-------|
+| Duration | 3–6 minutes (the delivery-003 clips listed below are shorter, roughly 1.5–2.5 minutes) |
+| Tier | A (clean — no room processing) |
+| Pre-incident window | ≥ 60% of clip duration before the first violence event |
+| Device profile | `phone_in_pocket`, `phone_on_table`, `phone_in_hand` |
+| Room types | apartment rooms (living room, bedroom, kitchen) |
+| Language | Hebrew (`he`) |
+
+The long pre-incident window reflects real-world deployment: the app is always listening, and incidents are rare. Models trained on this data should handle extended periods of mundane speech before a rapid escalation.
+
+??? info "Tier A — what does 'clean' mean?"
+    Tier A clips have **no acoustic augmentation** — no room impulse response convolution, no device frequency response, no background noise injection. The audio is the direct TTS-mixer output after preprocessing: peak-normalized, silence-padded, 16 kHz mono 16-bit PCM.
+
+    For Tier A, `acoustic_scene.room_type`, `device`, `ir_source`, and `snr_db_actual` are all `null`.
+
+    Tier B (used by Elephant) adds all of the above. See [Elephant in the Room](elephant.md) for details.
+
+---
+
+## Speaker pairs
+
+Delivery-003 has two She-Proves speaker pairs — one per TTS backend.
+
+| Pair | Speaker dir | Male speaker | Female speaker | Backend |
+|------|-------------|--------------|----------------|---------|
+| Azure | `agg_m_30-45_001/` | `AGG_M_30-45_001` → `he-IL-AvriNeural` | `VIC_F_25-40_002` → `he-IL-HilaNeural` | Azure |
+| Google Chirp HD | `agg_m_30-45_002/` | `AGG_M_30-45_002` → `he-IL-Chirp3-HD-Achird` | `VIC_F_25-40_003` → `he-IL-Chirp3-HD-Achernar` | Google |
+
+Both pairs play **AGG (aggressor, male) + VIC (victim, female)** roles. The Google pair was added in delivery-003 specifically to introduce backend diversity.
+
+!!! note "Two speaker directories"
+    Clips from the Azure pair live under `data/he/agg_m_30-45_001/`.
+    Clips from the Google pair live under `data/he/agg_m_30-45_002/`.
+    Downstream code that hardcodes `agg_m_30-45_001/` will miss the Google clips.
+    Use `manifest.csv` or filter `meta["generation_metadata"]["tts_backend"]` to find both.
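+
+One way to avoid hardcoding a speaker directory is to discover clips by pattern. The sketch below assumes every clip's `.json` sits exactly one level below `data/he/`, as in the layout shown on the home page; `find_clip_metadata` and `clip_backends` are hypothetical helper names.
+
+```python
+import json
+from pathlib import Path
+
+
+def find_clip_metadata(data_root):
+    """Return every clip .json one level below data_root, sorted by path."""
+    return sorted(Path(data_root).glob("*/*.json"))
+
+
+def clip_backends(meta):
+    """Set of TTS backends used in a clip; empty if provenance is null."""
+    generation = meta.get("generation_metadata") or {}
+    return set((generation.get("tts_backend") or {}).values())
+
+
+for path in find_clip_metadata("data/he"):
+    meta = json.loads(path.read_text(encoding="utf-8"))
+    print(path.stem, sorted(clip_backends(meta)))
+```
+
+The `or {}` guards matter because `generation_metadata` may be `null` on older clips.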
+
+---
+
+## Clips in delivery-003
+
+### Azure pair — 10 clips
+
+`data/he/agg_m_30-45_001/`
+
+| Clip ID | Typology | `has_violence` | Duration |
+|---------|----------|:---:|------:|
+| `sp_sv_a_0001_00` | SV | ✓ | 1m 50.5s |
+| `sp_sv_a_0002_00` | SV | ✓ | 1m 32.1s |
+| `sp_it_a_0001_00` | IT | ✓ | 2m 23.8s |
+| `sp_it_a_0002_00` | IT | ✓ | 2m 19.7s |
+| `sp_neg_a_0001_00` | NEG | — | 1m 58.8s |
+| `sp_neg_a_0002_00` | NEG | — | 1m 47.8s |
+| `sp_neg_a_0003_00` | NEG | — | 2m 26.3s |
+| `sp_neu_a_0001_00` | NEU | — | 1m 59.2s |
+| `sp_neu_a_0002_00` | NEU | — | 2m 09.0s |
+| `sp_neu_a_0003_00` | NEU | — | 1m 45.1s |
+
+### Google Chirp HD pair — 2 clips
+
+`data/he/agg_m_30-45_002/`
+
+| Clip ID | Typology | `has_violence` | Duration | Note |
+|---------|----------|:---:|------:|------|
+| `sp_sv_a_0003_00` | SV | ✓ | 1m 42.8s | `vic_f0_high` flag |
+| `sp_it_a_0003_00` | IT | ✓ | 1m 53.9s | `vic_f0_high` flag |
+
+The `vic_f0_high` flag on the Google clips indicates the female voice (`he-IL-Chirp3-HD-Achernar`) has a higher F0 baseline than the Azure Hila reference. See [Audio Format → vic_f0_high](audio-format.md#vic_f0_high-google-chirp-hd-female-f0-baseline).
+
+---
+
+## Loading She-Proves clips
+
+```python
+import json
+import soundfile as sf
+import pandas as pd
+from pathlib import Path
+
+root = Path(".")
+
+# Via manifest — easiest
+df = pd.read_csv(root / "data/he/manifest.csv")
+# .copy() so the "backend" column added below does not write into a view of df
+sp_clips = df[df["project"] == "she_proves"].copy()
+
+# Load all She-Proves audio
+wavs = {}
+for _, row in sp_clips.iterrows():
+    wav, sr = sf.read(root / row["wav_path"])
+    wavs[row["clip_id"]] = wav
+
+# Filter to violent She-Proves clips only
+sp_violent = sp_clips[sp_clips["has_violence"] == True]
+
+# Get per-backend split
+sp_clips["backend"] = sp_clips["voice_families"].apply(
+    lambda v: "google" if "Chirp" in v else "azure"
+)
+print(sp_clips.groupby("backend")["clip_id"].count())
+# azure    10
+# google    2
+```
+
+---
+
+## Guidance for model training
+
+!!! warning "This is a toy corpus — not for production training"
+    12 She-Proves clips (10 Azure + 2 Google) are not enough for training a production model. Use this delivery to validate your data pipeline and schema parsing. Full-scale data follows.
+
+**High-recall orientation:**
+
+- **NEG clips are your hardest negatives.** They contain intense speech (raised voices, arguments, crying) with `has_violence: false`. Your recall model must not fire on them.
+- **The pre-incident window** (at least the first 60% of the clip, per the scene structure above) will look like NEU/low-intensity speech. Include it in your training windows; models that only see escalated segments will miss early warning signals.
+- **Per-turn intensity** in the `.jsonl` events gives you fine-grained supervision beyond binary `has_violence`. Consider training an intensity regressor as an auxiliary objective.
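+
+To check how long the pre-incident window actually is for a given clip, the first non-`NONE` event onset can be divided by the clip duration. This is a minimal sketch under the assumption that "first violence event" means the earliest event whose `tier1_category` is not `NONE`; `pre_incident_fraction` is a hypothetical helper name.
+
+```python
+def pre_incident_fraction(events, duration_seconds):
+    """Fraction of the clip that precedes the first non-NONE event.
+
+    Returns None when the clip has no such event (NEG/NEU clips)
+    or when the duration is not positive.
+    """
+    onsets = [e["onset"] for e in events if e["tier1_category"] != "NONE"]
+    if not onsets or duration_seconds <= 0:
+        return None
+    return min(onsets) / duration_seconds
+
+
+events = [
+    {"onset": 5.0, "tier1_category": "NONE"},
+    {"onset": 80.0, "tier1_category": "PHYS"},
+    {"onset": 70.0, "tier1_category": "VERB"},
+]
+print(pre_incident_fraction(events, 100.0))
+```
+
+Here `events` would come from the clip's `.jsonl` file and `duration_seconds` from its `.json` metadata.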
+
+**Backend diversity:**
+
+The 2 Google Chirp HD clips expose your feature extractor to a different F0 baseline and spectral profile. At small scale, they're useful for checking that your features don't overfit to Azure voice characteristics.
+
+**Speaker splits:**
+
+All 12 clips come from just two speaker pairs: four speaker IDs in total, one AGG and one VIC voice per TTS backend. That is too few speakers for a speaker-disjoint split in this delivery. Re-evaluate when the corpus scales to 100+ speakers.
diff --git a/docs/taxonomy.md b/docs/taxonomy.md
new file mode 100644
index 0000000..8154a06
--- /dev/null
+++ b/docs/taxonomy.md
@@ -0,0 +1,126 @@
+# Label Taxonomy
+
+Labels follow a three-level hierarchy. The **source of truth** is `taxonomy.yaml` in the [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) repo. Never derive labels from field names alone — always read from the actual data.
+
+---
+
+## Violence typologies (clip-level)
+
+The `violence_typology` field classifies the overall scenario of the clip.
+
+| Typology | Full name | Description |
+|----------|-----------|-------------|
+| `SV` | Severe Violence | Physical violence, life-threatening escalation |
+| `IT` | Intimate Terrorism | Systematic coercive control, repeated verbal/emotional abuse |
+| `NEG` | Negative / Confusor | Acoustically intense but non-violent — anger, argument, distress, crying |
+| `NEU` | Neutral | Calm or mundane conversation with no violence markers |
+
+??? info "Why NEG is not the same as non-violent IT/SV"
+    NEG clips are designed as **hard negatives** — they sound intense and may have raised voices, crying, or confrontational tone, but no actual violence occurs. Their purpose is to train models to distinguish acoustic distress from violence.
+
+    Models that rely only on loudness or emotional tone will misclassify NEG clips. This is by design.
+
+---
+
+## `has_violence` — the correct derivation
+
+`has_violence` is a **derived convenience field** computed from the strong-label events, not from typology:
+
+```python
+has_violence = any(e["tier1_category"] != "NONE" for e in events)
+```
+
+This means:
+
+- `NEG` clips are **always** `has_violence: false`, regardless of `max_intensity` — by definition, every event in a NEG clip lands `tier1_category: "NONE"`.
+- A `NEU` clip with even one stray non-NONE event would be `has_violence: true` (shouldn't happen in a well-labelled corpus, but the rule is defensive).
+
+!!! danger "Do not re-derive `has_violence` from typology + intensity"
+    ```python
+    # WRONG: bypasses the event labels (intensity-based shortcuts additionally misfire on NEG clips)
+    has_violence = typology in ("SV", "IT")
+
+    # CORRECT
+    has_violence = any(e["tier1_category"] != "NONE" for e in events)
+    ```
+    The taxonomy columns are the ground truth. `has_violence` exists only for fast filtering and baseline modelling — never use it as the sole training label.
+
+---
+
+## Tier 1 categories (event-level)
+
+Each `EventLabel` in the `.jsonl` file has a `tier1_category`:
+
+| Category | Description | Example contexts |
+|----------|-------------|-----------------|
+| `VERB` | Verbal violence — threats, shouting, demeaning language | Arguments, intimidation |
+| `DIST` | Distress vocalisations — screaming, crying under duress | Peak escalation turns |
+| `PHYS` | Physical violence cues — impact sounds, struggle | Severe violence scenes |
+| `EMOT` | Emotional manipulation — guilt-tripping, gaslighting | IT/coercive control |
+| `ACOU` | Acoustic events — object impacts, slams, falls | Background events in Tier B |
+| `NONE` | No violence — ambient speech, neutral turns | All NEU/NEG events |
+
+??? info "ACOU vs DIST"
+    `ACOU` captures **non-vocal acoustic cues** — a door slam, an object falling, an impact sound. These appear in Tier B clips as `background_events` in the `acoustic_scene` block.
+
+    `DIST` captures **vocal distress** — screams, panic vocalisations, crying under coercion.
+
+---
+
+## Tier 2 subtypes (event-level)
+
+| Tier 1 | Tier 2 subtype | Description |
+|--------|----------------|-------------|
+| VERB | `VERB_SHOUT` | Raised or shouted speech |
+| VERB | `VERB_THREAT` | Direct verbal threats |
+| VERB | `VERB_INSULT` | Demeaning or insulting language |
+| DIST | `DIST_SCREAM` | Distress scream or panic vocalisation |
+| DIST | `DIST_CRY` | Crying or sobbing under duress |
+| PHYS | `PHYS_HARD` | Hard physical impact cue |
+| PHYS | `PHYS_SOFT` | Softer physical contact cue |
+| EMOT | `EMOT_GASLIGHT` | Gaslighting or reality-denial |
+| EMOT | `EMOT_GUILT` | Guilt-tripping or emotional coercion |
+| ACOU | `ACOU_SLAM` | Object slam or door slam |
+| ACOU | `ACOU_FALL` | Object falling or thrown |
+| NONE | `NONE_AMBIENT` | Regular ambient speech or neutral turn |
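+
+In the table above, every subtype name starts with its Tier 1 category followed by an underscore. Assuming that naming pattern holds across the data (the authoritative list is `taxonomy.yaml`), a cheap sanity check on an event could look like this; `subtype_matches_category` is a hypothetical helper name.
+
+```python
+TIER1_CATEGORIES = {"VERB", "DIST", "PHYS", "EMOT", "ACOU", "NONE"}
+
+
+def subtype_matches_category(event):
+    """True if tier2_subtype is prefixed by the event's tier1_category."""
+    prefix = event["tier2_subtype"].split("_", 1)[0]
+    return prefix in TIER1_CATEGORIES and prefix == event["tier1_category"]
+
+
+print(subtype_matches_category(
+    {"tier1_category": "VERB", "tier2_subtype": "VERB_SHOUT"}
+))
+```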
+
+---
+
+## Intensity scale (turn-level)
+
+Intensity is scored 1–5 per dialogue turn. It controls prosody generation (pitch, rate, volume) and determines which tier1/tier2 labels are applied.
+
+| Score | Label | Description | Prosody profile |
+|-------|-------|-------------|----------------|
+| 1 | Low tension | Calm conversation, mild undercurrent | Near-neutral |
+| 2 | Moderate tension | Noticeable friction, raised voices | Slightly raised pitch/rate |
+| 3 | Active conflict | Clear verbal aggression or intimidation | Elevated pitch, faster rate |
+| 4 | Escalated violence | Physical or high-intensity verbal violence | High pitch, fast rate, volume up |
+| 5 | Extreme / life-threatening | Severe physical violence, panic | Maximally expressive (capped) |
+
+??? info "The prosody cap at I4–I5"
+    At intensity 4–5, the LLM-generated prosody values are capped before SSML rendering to prevent Whisper transcription failures and maintain naturalness. The cap values are:
+
+    - **Pitch:** max +2.0 semitones (post-cap)
+    - **Rate:** range [0.85, 1.20] (post-cap)
+
+    Any cap activation is recorded in `generation_metadata.effective_prosody_caps` per turn. You'll see many activations at I4–I5 in delivery-003 — this is expected. The cap was calibrated in a listening test in May 2026 (SynthBanshee PR #87).
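+
+To make the cap concrete, the sketch below clamps one turn's values using the limits quoted above and emits records shaped like the `ProsodyCap` entries in the schema reference. It is an illustration only, not SynthBanshee's implementation: `apply_prosody_cap` and its `min_intensity` parameter are hypothetical, and applying the cap from intensity 4 upward follows this page's wording.
+
+```python
+PITCH_MAX_ST = 2.0          # maximum pitch offset in semitones (post-cap)
+RATE_RANGE = (0.85, 1.20)   # allowed speaking-rate ratio (post-cap)
+
+
+def apply_prosody_cap(turn_index, intensity, pitch_st, rate, min_intensity=4):
+    """Clamp pitch/rate for one turn and list any cap activations."""
+    if intensity < min_intensity:
+        return pitch_st, rate, []
+    capped = {
+        "pitch": min(pitch_st, PITCH_MAX_ST),
+        "rate": min(max(rate, RATE_RANGE[0]), RATE_RANGE[1]),
+    }
+    requested = {"pitch": pitch_st, "rate": rate}
+    # One record per dimension whose value was actually changed.
+    caps = [
+        {"turn_index": turn_index, "intensity": intensity, "dim": dim,
+         "pre_cap": requested[dim], "post_cap": capped[dim]}
+        for dim in ("pitch", "rate")
+        if capped[dim] != requested[dim]
+    ]
+    return capped["pitch"], capped["rate"], caps
+
+
+print(apply_prosody_cap(turn_index=3, intensity=5, pitch_st=3.5, rate=1.4))
+```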
+
+---
+
+## Distribution in delivery-003
+
+| Typology | Clips | Projects | Tiers |
+|----------|------:|---------|-------|
+| SV | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
+| IT | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
+| NEG | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
+| NEU | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
+
+Intensity distribution across all 20 clips:
+
+| Max intensity | Clips |
+|:---:|:---:|
+| 5 | 10 |
+| 3 | 4 |
+| 2 | 6 |
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..11dbc73
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,91 @@
+site_name: avdp-synth-corpus
+site_description: Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline — consumer guide for She-Proves and Elephant in the Room teams
+site_url: https://datahackil.github.io/avdp-synth-corpus/
+repo_url: https://github.com/DataHackIL/avdp-synth-corpus
+repo_name: DataHackIL/avdp-synth-corpus
+edit_uri: edit/main/docs/
+
+theme:
+  name: material
+  logo: assets/logo.svg
+  favicon: assets/logo.svg
+  palette:
+    - scheme: default
+      primary: teal
+      accent: cyan
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+    - scheme: slate
+      primary: teal
+      accent: cyan
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+  features:
+    - navigation.tabs
+    - navigation.tabs.sticky
+    - navigation.sections
+    - navigation.expand
+    - navigation.indexes
+    - navigation.top
+    - toc.follow
+    - search.suggest
+    - search.highlight
+    - search.share
+    - content.code.copy
+    - content.code.annotate
+    - content.tabs.link
+    - announce.dismiss
+
+markdown_extensions:
+  - admonition
+  - pymdownx.details
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format
+  - pymdownx.highlight:
+      anchor_linenums: true
+      line_spans: __span
+      pygments_lang_class: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - pymdownx.tabbed:
+      alternate_style: true
+  - pymdownx.emoji:
+      emoji_index: !!python/name:material.extensions.emoji.twemoji
+      emoji_generator: !!python/name:material.extensions.emoji.to_svg
+  - tables
+  - attr_list
+  - md_in_html
+  - toc:
+      permalink: true
+  - def_list
+
+plugins:
+  - search:
+      lang: en
+
+nav:
+  - Home: index.md
+  - Getting Started: getting-started.md
+  - Team Guides:
+    - She-Proves: she-proves.md
+    - Elephant in the Room: elephant.md
+  - Reference:
+    - Label Taxonomy: taxonomy.md
+    - Schema Reference: schema.md
+    - Audio Format: audio-format.md
+  - Deliveries: deliveries.md
+
+extra:
+  social:
+    - icon: fontawesome/brands/github
+      link: https://github.com/DataHackIL/avdp-synth-corpus
+      name: avdp-synth-corpus on GitHub
+    - icon: fontawesome/brands/github
+      link: https://github.com/DataHackIL/SynthBanshee
+      name: SynthBanshee pipeline on GitHub
+  generator: false