diff --git a/docs/assets/extra.css b/docs/assets/extra.css
new file mode 100644
index 0000000..a6e2c4c
--- /dev/null
+++ b/docs/assets/extra.css
@@ -0,0 +1,60 @@
+/* Status pill used in headers and front-page hero */
+.status-pill {
+  display: inline-block;
+  padding: 0.15rem 0.55rem;
+  border-radius: 0.4rem;
+  font-size: 0.72rem;
+  font-weight: 600;
+  letter-spacing: 0.03em;
+  text-transform: uppercase;
+  vertical-align: middle;
+  margin: 0 0.25rem;
+}
+.status-pill.provisional { background: #FFB300; color: #3E2723; }
+.status-pill.approved    { background: #43A047; color: white; }
+.status-pill.superseded  { background: #BDBDBD; color: #424242; }
+
+/* Cards used on the home page to replace the tabbed "What is this?" widget */
+.team-cards {
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  gap: 1rem;
+  margin: 1.25rem 0 1.5rem;
+}
+@media (max-width: 720px) {
+  .team-cards { grid-template-columns: 1fr; }
+}
+.team-card {
+  border: 1px solid var(--md-default-fg-color--lightest);
+  border-radius: 0.45rem;
+  padding: 1rem 1.1rem;
+  background: var(--md-default-bg-color);
+  transition: transform 0.15s ease, box-shadow 0.15s ease;
+}
+.team-card:hover {
+  transform: translateY(-2px);
+  box-shadow: 0 6px 18px rgba(0,0,0,0.06);
+}
+.team-card h3 {
+  margin: 0 0 0.35rem;
+  font-size: 1rem;
+  color: var(--md-primary-fg-color);
+}
+.team-card .tagline {
+  font-size: 0.78rem;
+  color: var(--md-default-fg-color--light);
+  text-transform: uppercase;
+  letter-spacing: 0.05em;
+  margin-bottom: 0.5rem;
+}
+.team-card p { margin: 0.4rem 0; font-size: 0.92rem; }
+.team-card a.card-link {
+  display: inline-block;
+  margin-top: 0.5rem;
+  font-weight: 600;
+  font-size: 0.9rem;
+}
+
+/* Tighter table look for reference pages */
+.md-typeset table:not([class]) { font-size: 0.78rem; }
+.md-typeset table:not([class]) code { font-size: 0.78rem; }
diff --git a/docs/assets/sp_sv_a_0001_00_waveform.png b/docs/assets/sp_sv_a_0001_00_waveform.png
new file mode 100644
index 0000000..f7bf4fe
Binary files /dev/null and b/docs/assets/sp_sv_a_0001_00_waveform.png differ
diff --git a/docs/audio-format.md b/docs/audio-format.md
index b0173a9..61912f8 100644
--- a/docs/audio-format.md
+++ b/docs/audio-format.md
@@ -1,110 +1,75 @@
 # Audio Format
 
-All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus.
+The five facts you need to use the data, then optional detail on how it gets that way.
 
 ---
 
-## Format requirements
+## What you need to know
 
-| Property | Value |
-|----------|-------|
-| Sample rate | 16 000 Hz |
-| Channels | 1 (mono) |
-| Bit depth | 16-bit PCM |
-| Peak level | ≤ –1.0 dBFS (safety ceiling) |
-| Duration | ≥ 3.0 s |
-| Encoding | WAV (no lossy formats) |
+| Fact | Value | Why it matters |
+|------|-------|----------------|
+| **Sample rate** | 16 000 Hz | Always. Compute your features at this rate. |
+| **Channels / depth** | mono / 16-bit PCM WAV | `wav.ndim == 1`. No lossy formats anywhere. |
+| **Peak level** | ≤ –1.0 dBFS (target –2.0 dBFS) | `np.abs(wav).max() ≈ 0.79`, **not** 1.0. |
+| **Silence pad** | ≥ 0.5 s at head and tail | Onset/offset timestamps **already account for it** — no shift needed. |
+| **Duration** | ≥ 3.0 s | Hard minimum; clips below it are rejected. |
 
 ```python
-import soundfile as sf
-import numpy as np
+import soundfile as sf, numpy as np
 
 wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
 assert sr == 16000
-assert wav.ndim == 1               # mono
-assert wav.dtype == np.float64     # soundfile returns float64 by default
-assert np.abs(wav).max() <= 1.0   # -1.0 dBFS ≈ linear amplitude 1.0
-
-# Check format info
-info = sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
-print(info.subtype)  # PCM_16
+assert wav.ndim == 1
+assert wav.dtype == np.float64           # soundfile default
+assert np.abs(wav).max() <= 10 ** (-1.0 / 20) + 1e-4   # safety ceiling: -1.0 dBFS ≈ 0.891 linear, plus 16-bit rounding slack
+print(sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav").subtype)   # PCM_16
 ```
 
 ---
 
-## Normalization pipeline
-
-Each clip passes through two normalization steps:
-
-```
-TTS render (float32, arbitrary loudness)
-    ↓
-[1] Per-turn RMS gain (M3a)        — preserves inter-turn contrast
-    ↓
-[2] Single global peak gain         — lands absolute peak at target_peak_dbfs
-    ↓
-[3] Safety limiter                  — clips at ≤ –1.0 dBFS (guaranteed no-op for target ≥ –12.0)
-    ↓
-Tier B only: room IR + device → renormalize to same target
-    ↓
-Output WAV
-```
-
-### Step 1 — Per-turn RMS gain (M3a)
-
-Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with.
-
-??? info "Why per-turn RMS matters"
-    Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data.
-
-### Step 2 — Single global peak gain
-
-A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved.
+## Two peak fields, two meanings
 
-The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`.
-The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`.
+Every clip records two related loudness values:
 
-### Step 3 — Safety limiter
+| Field | Set by | What it is |
+|-------|--------|------------|
+| `generation_metadata.loudness_target_peak_dbfs` | The pipeline config | **Configured** peak target (default –2.0 dBFS) |
+| `preprocessing_applied.normalized_dbfs` | Measurement at write time | **Measured** post-preprocess peak of the actual WAV |
 
-A hard ceiling at –1.0 dBFS. For in-spec targets (range: [–12.0, –1.5] dBFS), this is a guaranteed no-op. It exists as a safety rail against misconfiguration.
+If the two disagree by more than a fraction of a dB, normalization has gone wrong somewhere, so the pair doubles as a diagnostic check.
 
 ---
 
-## Silence padding
+## Known audio quirks
 
-Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`.
+### `vic_f0_high` on the 2 Google clips
 
-Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output.
+`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`). Its F0 baseline runs measurably higher than the Azure reference voice (`he-IL-HilaNeural`), against which the QA F0 thresholds were calibrated.
 
----
+**What to do about it:** nothing. The flag fires correctly; the audio is fine. If you compute F0-derived features, calibrate per backend (`generation_metadata.tts_backend`) — or just use spectral features that aren't sensitive to baseline F0. Don't exclude these two clips: they're the only backend diversity you have in this delivery.
 
-## Dirty files
+### `quality_flags: ["emotion_downgrade"]`
 
-`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising.
+The pipeline detected that the TTS engine produced slightly less intense prosody than the SSML asked for at high-intensity turns. The audio is still valid; the prosody is just a touch tamer than the scene intended. In delivery-003, 15 of 20 clips carry at least one quality flag, mostly this one, so treat it as expected behaviour rather than a defect signal.
 
-The `dirty_file_path` field in ClipMetadata gives the repo-relative path:
-```
-"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav"
-```
+### Dirty files
 
-Dirty files are useful for:
-- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`)
-- Checking raw TTS prosody before processing
-- Re-running preprocessing with different parameters
+The pre-preprocessing WAV is retained at `assets/speech/dirty/{clip_id}_dirty.wav`. Its path is recorded in `dirty_file_path`. These files are the raw TTS-mixer outputs before normalization, padding, or denoising — useful for diagnosing the pipeline, not for training.
 
-!!! warning "Do not modify dirty files"
-    The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups.
+!!! warning "Don't modify files under `assets/`"
+    `assets/speech/` is the SynthBanshee SHA-256 SSML cache. Renaming or editing any file there will break cache lookups and force a paid re-synthesis on next run.
 
 ---
 
 ## TTS backends
 
-| Backend | Voices | Clips in delivery-003 |
-|---------|--------|----------------------|
+| Backend | Voices in delivery-003 | Clips |
+|---------|-----------------------|------:|
 | Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 |
 | Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 |
 
-The backend per speaker is recorded in `generation_metadata.tts_backend`:
+Per-speaker backend is in `generation_metadata.tts_backend`:
+
 ```json
 "tts_backend": {
     "AGG_M_30-45_002": "google",
@@ -112,21 +77,29 @@ The backend per speaker is recorded in `generation_metadata.tts_backend`:
 }
 ```
 
-??? info "Azure SSML cache"
-    SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters).
+Azure renders are byte-stable in practice: re-rendering the same SSML returns byte-identical WAVs, served from the SHA-256 cache. Google Chirp 3 HD is not: it produces minor bit-level variation on each synthesis at the same parameters. If your experiment needs byte-stable reproducibility, expect the Google clips to differ slightly between fresh generations, even though peak / RMS / duration stay within tolerance.
 
 ---
 
-## Known audio quirks
+## How the normalization actually works
 
-### `vic_f0_high` — Google Chirp HD female F0 baseline
+You don't need this to consume the data. Open the section below if you're debugging loudness drift, building a comparable pipeline, or just curious.
 
-The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated.
+??? info "The normalization pipeline (3 stages)"
+    ```
+    TTS render  →  per-turn RMS gain  →  single global peak gain  →  safety limiter  →  Tier B: room IR + device + noise → renormalize  →  output WAV
+    ```
 
-Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend.
+    **Stage 1: per-turn RMS gain.** Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This creates the calm-to-loud gradient you'd expect — a whispered I1 turn stays quieter than a shouted I5 turn. Without this step, raw Azure and Google output is nearly constant-loudness regardless of the requested prosody style.
 
-### `quality_flags: ["emotion_downgrade"]`
+    **Stage 2: single global peak gain.** A single multiplicative gain lands the clip's absolute peak at `loudness_target_peak_dbfs` (default –2.0 dBFS). Because it's one gain applied to the whole mix, every per-turn RMS ratio from Stage 1 survives unchanged.
+
+    **Stage 3: safety limiter.** A hard ceiling at –1.0 dBFS. For in-spec targets in `[-12.0, -1.5]` dBFS, this is always a no-op. It exists as a safety rail against config drift.
+
+    **Tier B post-processing.** Room IR convolution, device frequency response (e.g. `pi_budget_mic`), and background-noise injection happen after Stage 3. The same `peak_normalize_to_target` helper then renormalizes so every tier exits at the same absolute peak, which keeps Tier A and Tier B comparable on the loudness dimension.
 
-Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended.
+??? info "Why per-turn RMS gain matters"
+    Without it, the TTS engine produces flat RMS across turns regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests a "shout" style or sets `prosody volume="+50%"`. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient between calm and escalated turns — without it, your model has nothing to learn loudness escalation from.
 
-In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+.
+??? info "Why peak normalize to –2.0 dBFS instead of 0 dBFS"
+    The 2 dB of headroom buys safety against any later processing step that might add 1–2 dB of gain (room IR convolution can do this). A peak at 0 dBFS would clip as soon as any gain is added; a peak at –1.0 dBFS sits right on the limiter ceiling and leaves no headroom below it. –2.0 is the conservative middle.
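A minimal sketch of the three normalization stages described in `docs/audio-format.md` above, to make the dBFS arithmetic concrete. It is not SynthBanshee's implementation: the function names, the toy sine-wave "turns", the per-intensity RMS targets (0.05 and 0.20), the use of `np.clip` as the limiter, and the 0.1 dB diagnostic tolerance are all assumptions chosen for illustration. Only the –2.0 dBFS target and the –1.0 dBFS ceiling come from the docs.

```python
import numpy as np


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a peak level in dBFS to a linear full-scale amplitude."""
    return 10.0 ** (dbfs / 20.0)


def peak_dbfs(wav: np.ndarray) -> float:
    """Measured absolute peak of a float waveform, in dBFS."""
    return float(20.0 * np.log10(np.max(np.abs(wav))))


def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a waveform."""
    return float(np.sqrt(np.mean(np.square(x))))


def normalize(turns, rms_targets, target_peak_dbfs=-2.0, ceiling_dbfs=-1.0):
    # Stage 1: per-turn RMS gain, one gain per dialogue turn.
    staged = [t * (target / rms(t)) for t, target in zip(turns, rms_targets)]
    mix = np.concatenate(staged)
    # Stage 2: one global gain that lands the absolute peak on the target.
    mix = mix * (dbfs_to_linear(target_peak_dbfs) / np.max(np.abs(mix)))
    # Stage 3: hard ceiling (assumed here to be a plain clip).
    ceiling = dbfs_to_linear(ceiling_dbfs)
    return np.clip(mix, -ceiling, ceiling)


sr = 16_000
t = np.arange(sr) / sr
# Two toy "turns" with identical raw loudness, like flat TTS output.
calm = 0.3 * np.sin(2 * np.pi * 220 * t)
loud = 0.3 * np.sin(2 * np.pi * 330 * t)

out = normalize([calm, loud], rms_targets=[0.05, 0.20])

configured = -2.0                      # loudness_target_peak_dbfs
measured = peak_dbfs(out)              # what normalized_dbfs would record
ratio = rms(out[sr:]) / rms(out[:sr])  # loud turn vs. calm turn

print(f"-2.0 dBFS as linear amplitude: {dbfs_to_linear(-2.0):.3f}")
print(f"measured peak: {measured:.2f} dBFS, turn RMS ratio: {ratio:.2f}")
# Diagnostic check: configured and measured peaks should agree closely.
assert abs(measured - configured) < 0.1
```

Because Stage 2 is a single multiplicative gain, the 4:1 RMS ratio set in Stage 1 survives, and the measured peak lands on the configured target. That agreement is what the two peak fields are meant to expose.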
diff --git a/docs/deliveries.md b/docs/deliveries.md
index 71dac3b..12c9e2a 100644
--- a/docs/deliveries.md
+++ b/docs/deliveries.md
@@ -1,16 +1,14 @@
 # Deliveries
 
-All data deliveries are logged here. Each entry links to per-delivery notes with clip counts, QA findings, known limitations, and the SynthBanshee commit that produced the batch.
+What's currently in the corpus, what's missing, and what changed in the latest batch. One row per data delivery in the log at the bottom.
 
 ---
 
-## Delivery 003 — multi-project, multi-voice
+## Current delivery — 003
 
-**Date:** 2026-05-12 · **Status:** provisional · **PR:** [#5](https://github.com/DataHackIL/avdp-synth-corpus/pull/5)
+<span class="status-pill provisional">provisional · 2026-05-12</span> [`#5`](https://github.com/DataHackIL/avdp-synth-corpus/pull/5) · slug: `multi-project-multi-voice` · supersedes delivery-002.
 
-This is the current working delivery. It replaces delivery-002.
-
-### At a glance
+### What's in it
 
 | | |
 |---|---|
@@ -19,53 +17,74 @@ This is the current working delivery. It replaces delivery-002.
 | Projects | `she_proves` (12) + `elephant_in_the_room` (8) |
 | Tiers | A (12 clean) + B (8 room-augmented) |
 | TTS backends | Azure (18) + Google Chirp 3 HD (2) |
+| Unique speaker personas | 6 (4 in She-Proves, 2 in Elephant) |
 | Validation failures | 0 / 20 |
 | Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) |
 
-[Full notes](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [QA report](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json)
+Authoritative records: [`metadata.yaml`](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/metadata.yaml) · [`notes.md`](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [`qa-report.json`](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json).
 
-### QA findings — closed (vs. delivery-002)
+### Known limitations
 
-| Finding | Delivery-002 | Delivery-003 |
-|---------|:---:|:---:|
-| `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity |
-| `warn_no_overlap` | 4 clips | **0** — overlap_ratio 100% on I4+ clips |
-| `warn_emotion_downgrade` | 4 clips | **0** — emotion_downgrade_ratio 0% |
-| `generation_metadata` absent | 0 of 8 clips | **20 of 20** carry the full block |
-| `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files |
-| `normalized_dbfs` hardcoded `-1.0` | all 8 clips | **fixed** — now the measured peak |
+- **All clips are `split: train`.** Only 6 unique speaker personas across 20 clips (4 in She-Proves, 2 in Elephant), so speaker-disjoint partitioning isn't feasible at this scale.
+- **One room type for Elephant.** All 8 Tier-B clips use `clinic_office`. `welfare_office` and `open_office` are in the pipeline but not exercised yet.
+- **No device augmentation for She-Proves.** No `phone_in_pocket` (or similar) profile has been applied yet; Tier-A clips are clean, not phone-captured.
+- **Voice diversity is low.** 2 voice families per gender; the QA threshold for "diverse" is ≥3.
+- **Toy-batch scale.** 20 clips is enough to wire up consumer plumbing. Not enough to train a production model.
 
-Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](https://github.com/DataHackIL/SynthBanshee/pull/110)/[#111](https://github.com/DataHackIL/SynthBanshee/pull/111)/[#112](https://github.com/DataHackIL/SynthBanshee/pull/112)):
+### Open QA flags
 
-| Finding | Resolution |
-|---------|-----------|
-| `single_backend` false positive | `qa.py` now derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` |
-| Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX strings |
-| Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks |
+| Flag | Detail | What to do about it |
+|------|--------|---------------------|
+| `low_voice_diversity_male` | 2 male voice families across the corpus (threshold ≥3) | Track per-voice eval separately; expect feature overfit to AvriNeural until more voices land |
+| `low_voice_diversity_female` | 2 female voice families across the corpus (threshold ≥3) | Same: track per-voice eval separately until more voices land |
+| `vic_f0_high` (per-clip × 2) | `sp_sv_a_0003_00`, `sp_it_a_0003_00` — Google Chirp HD female F0 above Azure baseline | **Nothing.** Don't exclude the clips. Calibrate F0 features per backend if you compute them. See [Audio Format](audio-format.md#vic_f0_high-on-the-2-google-clips). |
+| `quality_flagged_clips: 15` | Mostly `emotion_downgrade` from prosody cap activations at I3+ | Don't reflexively filter these out — they pass validation. See [Common mistakes #7](gotchas.md#7-quality_flags-doesnt-mean-broken). |
 
-### QA findings — open
+### Distribution
 
-| Finding | Detail |
-|---------|--------|
-| `low_voice_diversity_male` | 2 voice families per gender; threshold ≥ 3 |
-| `low_voice_diversity_female` | 2 voice families per gender; threshold ≥ 3 |
-| `vic_f0_high` (2 clips) | `sp_sv_a_0003_00` and `sp_it_a_0003_00` — Google Chirp HD female F0 runs higher than Azure Hila reference |
-| `quality_flagged_clips: 15` | Mostly from prosody cap activations at I3+; expected behaviour |
+| Typology | Tier A (She-Proves) | Tier B (Elephant) | Total |
+|----------|:--:|:--:|:--:|
+| `SV`  | 3 | 2 | 5 |
+| `IT`  | 3 | 2 | 5 |
+| `NEG` | 3 | 2 | 5 |
+| `NEU` | 3 | 2 | 5 |
 
-### Known limitations
+`max_intensity` across the 20 clips: I5 = 10 clips · I3 = 4 clips · I2 = 6 clips.
+
+---
+
+## What this delivery exercises
+
+Use this list to check that your consumer code handles the schema features this delivery was designed to cover:
+
+1. Full `ClipMetadata` schema — including the `generation_metadata` block and (for Tier B) populated `acoustic_scene`.
+2. Per-surface casing rules — UPPERCASE `speaker_id`, lowercase paths and clip IDs.
+3. `has_violence` derivation from events — NEG clips correctly `false` even at `max_intensity ≥ 3`.
+4. Multi-project layout under a single `data/he/` root.
+5. Multi-backend provenance — `generation_metadata.tts_backend` differs per speaker.
+
+---
+
+## What changed vs delivery-002
 
-- **Speaker-disjoint splits not feasible.** 4 unique speaker personas across 20 clips; all clips are `split: train`.
-- **Two speaker directories only.** `agg_m_30-45_002/` and `ben_m_40-55_003/` are first appearances — code hardcoding `agg_m_30-45_001/` will miss them.
-- **One room type.** All 8 Elephant Tier B clips use `clinic_office`. Future deliveries will add `welfare_office` and `open_office`.
-- **Toy corpus only.** 20 clips is not sufficient for training production models.
+??? abstract "Closed QA findings (vs. delivery-002)"
+    | Finding | Delivery-002 | Delivery-003 |
+    |---------|:---:|:---:|
+    | `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity |
+    | `warn_no_overlap` | 4 clips | **0** — turn-overlap fires on I4+ clips |
+    | `warn_emotion_downgrade` | 4 clips | **0** |
+    | `generation_metadata` absent | 0 of 8 clips had it | **20 of 20** carry the full block |
+    | `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files |
+    | `normalized_dbfs` hardcoded `-1.0` | all 8 clips | Records the measured peak |
 
-### What this delivery exercises
+??? abstract "Closed by the 2026-05-12 schema-shift regen"
+    Three SynthBanshee PRs landed alongside the regen ([#110](https://github.com/DataHackIL/SynthBanshee/pull/110) / [#111](https://github.com/DataHackIL/SynthBanshee/pull/111) / [#112](https://github.com/DataHackIL/SynthBanshee/pull/112)):
 
-1. Full `ClipMetadata` schema including `generation_metadata`, `voice_family`, and (for Tier B) the populated `acoustic_scene` block
-2. Per-surface casing rules: UPPERCASE `speaker_id`, lowercase paths and clip IDs
-3. `has_violence` derivation from events: NEG clips are correctly `false` even at `max_intensity ≥ 3`
-4. Multi-project layout under a single `data/he/` root
-5. Multi-backend provenance: `generation_metadata.tts_backend` per speaker
+    | Finding | Resolution |
+    |---------|-----------|
+    | `single_backend` false positive | `qa.py` derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` |
+    | Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX |
+    | Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks |
 
 ---
 
@@ -73,14 +92,14 @@ Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](http
 
 | # | Date | Slug | Project | Tier | Clips | Duration | Status |
 |---|------|------|---------|------|------:|------:|--------|
-| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | provisional |
-| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | superseded |
-| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | superseded |
+| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | <span class="status-pill provisional">provisional</span> |
+| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | <span class="status-pill superseded">superseded</span> |
+| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | <span class="status-pill superseded">superseded</span> |
 
 ## Status definitions
 
 | Status | Meaning |
 |--------|---------|
-| `provisional` | Wet-test batch; not yet approved for model training |
+| `provisional` | Preview batch; consumer-integration only, not approved for training |
 | `approved` | QA passed; cleared for training use |
-| `superseded` | Replaced by a later delivery with the same scenes at higher quality |
+| `superseded` | Replaced by a later delivery covering the same scenes at higher quality |
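One way to act on the "calibrate per backend" advice in `docs/deliveries.md` above is to group clips by the backends listed in `generation_metadata.tts_backend` before computing any per-backend statistics. The sketch below is an illustration, not the logic in `qa.py`: the metadata is a trimmed in-memory stand-in, the `el_example_00` clip ID and the `AGG_M_30-45_001` speaker ID are hypothetical, and `group_by_backend` is a made-up helper name.

```python
from collections import defaultdict

# Toy stand-ins for per-clip metadata JSON; only the field used here is shown.
# "el_example_00" and "AGG_M_30-45_001" are hypothetical illustration values.
clips = [
    {"clip_id": "sp_sv_a_0001_00",
     "generation_metadata": {"tts_backend": {"AGG_M_30-45_001": "azure"}}},
    {"clip_id": "sp_sv_a_0003_00",
     "generation_metadata": {"tts_backend": {"AGG_M_30-45_002": "google"}}},
    {"clip_id": "el_example_00",
     "generation_metadata": {"tts_backend": {"BEN_M_40-55_003": "azure",
                                             "SW_F_30-45_001": "azure"}}},
]


def group_by_backend(clips):
    """Map each TTS backend to the IDs of clips with a speaker rendered by it.

    Assumption: a clip is listed once under every distinct backend that
    appears among its speakers.
    """
    groups = defaultdict(list)
    for clip in clips:
        backends = set(clip["generation_metadata"]["tts_backend"].values())
        for backend in sorted(backends):
            groups[backend].append(clip["clip_id"])
    return dict(groups)


by_backend = group_by_backend(clips)
for backend, clip_ids in sorted(by_backend.items()):
    print(backend, len(clip_ids), clip_ids)
```

With the groups in hand, F0-derived features can be normalized within each backend, and evaluation can be reported per backend rather than by dropping the flagged clips.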
diff --git a/docs/elephant.md b/docs/elephant.md
index fa62e35..2409795 100644
--- a/docs/elephant.md
+++ b/docs/elephant.md
@@ -1,178 +1,155 @@
 # Elephant in the Room Guide
 
-**Elephant in the Room (הפיל שבחדר)** is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat.
+Elephant in the Room (הפיל שבחדר) is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. **Optimisation target: high precision** — false alarms erode trust with the security team and the workers they protect.
 
-**Optimization target: high precision.** False alarms erode trust with security staff and social workers alike.
+This page covers only what differs between Elephant clips and the rest of the corpus. For shared concepts (schema, labels, audio format) follow the cross-links.
 
 ---
 
-## Scene structure
+## Scene profile
 
-| Property | Value |
-|----------|-------|
-| Duration | 1–4 minutes |
-| Tier | B (room IR + device + noise augmentation) |
-| Alert window | Final 40% of the clip |
-| Device profile | `pi_budget_mic` |
-| Room types | `clinic_office`, `welfare_office`, `open_office` |
-| Language | Hebrew (`he`) |
+| | |
+|---|---|
+| Project code | `elephant_in_the_room` (clip-id prefix `el_*`) |
+| Tier | B — room IR + device profile + background noise applied |
+| Duration | 1–4 min |
+| Alert window | Final 40% of the clip — violence events concentrate here |
+| Device | `pi_budget_mic` |
+| Room types | `clinic_office`, `welfare_office`, `open_office` (only `clinic_office` in delivery-003) |
 
-The alert-in-final-40% constraint reflects real-world deployment: the device picks up normal consultation audio before a client becomes threatening. The model must recognize genuine escalation from a baseline of professional interaction.
-
-??? info "Tier B acoustic augmentation pipeline"
-    Tier B clips go through three augmentation steps after TTS rendering and preprocessing:
-
-    1. **Room impulse response (IR)** — the clean speech is convolved with a synthetic room IR (generated by `pyroomacoustics` image-source method) to simulate the acoustic of the target room type.
-    2. **Device frequency response** — the `pi_budget_mic` profile applies the frequency response of a budget Raspberry Pi microphone capsule.
-    3. **Background noise injection** — ambient noise events (HVAC hum, equipment sounds) are mixed in at specified SNR levels.
-
-    After augmentation, the clip is renormalized to the same peak target (–2.0 dBFS) via the shared `peak_normalize_to_target` helper — so all tiers exit at the same absolute peak level.
+The alert-in-final-40% constraint mirrors real-world deployment: the device picks up normal consultation audio for most of the session before any threat emerges. The model must distinguish escalation from a baseline of routine professional interaction.
 
 ---
 
-## Speaker pair
+## What Tier B adds (and why)
 
-Delivery-003 has one Elephant speaker pair.
+Tier B clips run through three augmentation stages after preprocessing. This is what separates them from She-Proves (Tier A) clips.
 
-| Speaker dir | Male speaker | Female speaker | Backend |
-|-------------|--------------|----------------|---------|
-| `ben_m_40-55_003/` | `BEN_M_40-55_003` → `he-IL-AvriNeural` | `SW_F_30-45_001` → `he-IL-HilaNeural` | Azure |
+| Stage | What it adds | Where to find it in metadata |
+|-------|--------------|------------------------------|
+| Room IR convolution | Reverb of a real-sounding room | `acoustic_scene.room_type`, `ir_source` |
+| Device profile | Frequency response of a budget Pi microphone | `acoustic_scene.device` |
+| Background noise injection | HVAC hum + occasional `ACOU_*` events | `acoustic_scene.background_events` |
 
-The roles are **BEN (beneficiary/client, male) + SW (social worker, female)** — matching the most common demographic in Israeli welfare/clinic settings.
+After augmentation the clip is renormalised to the same peak target (–2.0 dBFS) as Tier A, so the two tiers are comparable on the loudness dimension.
 
-!!! note "`ben_m_40-55_003/` is a new speaker directory in delivery-003"
-    Downstream code that hardcoded `agg_m_30-45_001/` for She-Proves will not find these clips. Use `manifest.csv` or filter by `meta["project"] == "elephant_in_the_room"`.
+!!! info "What `pyroomacoustics_ism` is"
+    The image-source method (ISM) synthesises a room impulse response by mirroring the sound source across the walls of a modelled room; each mirror-image source contributes one delayed, attenuated reflection. [`pyroomacoustics`](https://pyroomacoustics.readthedocs.io/) is the Python library that implements it. The resulting IR, when convolved with clean speech, makes the speech sound as if it was recorded in the modelled room, without needing a real recording.
 
 ---
 
-## The `acoustic_scene` block
+## The `acoustic_scene` field
 
-This is the key difference between Tier A and Tier B metadata. For Elephant clips, `acoustic_scene` is fully populated:
+For Tier A clips this is all `null` / empty. For Elephant clips it's fully populated:
 
 ```json
 "acoustic_scene": {
-    "room_type": "clinic_office",
-    "device": "pi_budget_mic",
-    "ir_source": "pyroomacoustics_ism",
-    "snr_db_actual": 11.2,
-    "speaker_distance_meters": 1.2,
-    "background_events": [
-        {"type": "hvac_hum",   "onset": 0.0,     "offset": 147.0, "level_db": -37.4},
-        {"type": "ACOU_SLAM",  "onset": 72.164,  "offset": 72.476, "level_db": 9.9},
-        {"type": "ACOU_FALL",  "onset": 97.57,   "offset": 98.473, "level_db": 9.6}
-    ]
+  "room_type":                "clinic_office",
+  "device":                   "pi_budget_mic",
+  "ir_source":                "pyroomacoustics_ism",
+  "snr_db_actual":            11.2,
+  "speaker_distance_meters":  1.2,
+  "background_events": [
+    {"type": "hvac_hum",  "onset":  0.000, "offset": 147.031, "level_db": -37.4},
+    {"type": "ACOU_SLAM", "onset": 72.164, "offset":  72.476, "level_db":   9.9},
+    {"type": "ACOU_FALL", "onset": 97.570, "offset":  98.473, "level_db":   9.6}
+  ]
 }
 ```
 
-| Field | Meaning |
-|-------|---------|
-| `room_type` | Simulated room environment |
-| `device` | Microphone/device profile applied |
-| `ir_source` | Method used to generate room IR |
-| `snr_db_actual` | Measured speech-to-noise ratio after mixing |
-| `speaker_distance_meters` | Simulated speaker-to-mic distance |
-| `background_events` | Non-speech acoustic events: type, timestamps, level |
+| Field | What it tells you |
+|-------|-------------------|
+| `room_type` | Modelled room (`clinic_office` / `welfare_office` / `open_office`) |
+| `device` | Microphone profile applied (`pi_budget_mic`) |
+| `ir_source` | How the room IR was generated (currently always `pyroomacoustics_ism`) |
+| `snr_db_actual` | Measured speech-to-noise ratio in dB **after** mixing — your ground truth for SNR-stratified eval |
+| `speaker_distance_meters` | Simulated distance from speaker to microphone |
+| `background_events` | List of non-speech acoustic events: `hvac_hum` (constant low-level), `ACOU_SLAM` / `ACOU_FALL` (brief, high-level) |
 
-??? info "What is `pyroomacoustics_ism`?"
-    The image-source method (ISM) is an algorithm for computing room impulse responses by reflecting a virtual point source off the room's walls. `pyroomacoustics` is a Python library that implements it.
+!!! info "`ACOU_*` events are double-recorded"
+    Each `ACOU_SLAM` / `ACOU_FALL` event lives in **both** `acoustic_scene.background_events` (with `level_db` mixing metadata) **and** the `.jsonl` strong-label file (as a regular `EventLabel` with `tier1_category: "ACOU"`). The two views are deliberate — the first carries audio-level provenance, the second is the supervision target. If you train an event detector, use the `.jsonl` view.
+
+---
+
+## Speaker pair
 
-    The resulting IR simulates how sound travels from a speaker to a microphone in a room of specified dimensions and surface absorption coefficients — giving the audio the characteristic reverb of the target room type without recording in a real room.
+One pair in delivery-003. Roles match Israeli welfare/clinic demographics: BEN (client/service-user, male) + SW (social worker, female).
 
-??? info "Background event types"
-    | Type | Description |
-    |------|-------------|
-    | `hvac_hum` | Constant HVAC/ventilation hum (low level, full duration) |
-    | `ACOU_SLAM` | Door slam or hard object impact (brief, high level) |
-    | `ACOU_FALL` | Object falling or being thrown (brief, high level) |
+Speaker directory: `data/he/ben_m_40-55_003/`
 
-    `ACOU_*` events are also tagged as `EventLabel` entries in the `.jsonl` strong labels with `tier1_category: "ACOU"`. This means they contribute to `weak_label.violence_categories` even in SV/IT clips where the primary violence is verbal or physical.
+| Role | speaker_id | TTS voice |
+|------|-----------|-----------|
+| BEN | `BEN_M_40-55_003` | `he-IL-AvriNeural` |
+| SW  | `SW_F_30-45_001`  | `he-IL-HilaNeural` |
+
+Both speakers use the Azure backend. See [Glossary — Speaker roles](glossary.md#speaker-roles) if `BEN` and `SW` are new abbreviations.
 
 ---
 
 ## Clips in delivery-003
 
-`data/he/ben_m_40-55_003/`
+**8 clips · ~18 min · 4 violent (SV + IT), 4 non-violent (NEG + NEU) · all `room_type: clinic_office`, all `device: pi_budget_mic`, SNR ~11 dB**
 
-| Clip ID | Typology | `has_violence` | Duration | SNR (dB) |
-|---------|----------|:---:|------:|:---:|
-| `el_sv_b_0001_00` | SV | ✓ | 2m 27.0s | ~11 |
-| `el_sv_b_0002_00` | SV | ✓ | 2m 18.5s | ~11 |
-| `el_it_b_0001_00` | IT | ✓ | 2m 30.0s | ~11 |
-| `el_it_b_0002_00` | IT | ✓ | 2m 31.6s | ~11 |
-| `el_neg_b_0001_00` | NEG | — | 1m 53.8s | ~11 |
-| `el_neg_b_0002_00` | NEG | — | 2m 54.6s | ~11 |
-| `el_neu_b_0001_00` | NEU | — | 1m 56.9s | ~11 |
-| `el_neu_b_0002_00` | NEU | — | 1m 19.7s | ~11 |
+??? abstract "Full clip listing"
+    All in `data/he/ben_m_40-55_003/`:
 
-All 8 clips are Tier B with `device: pi_budget_mic` and `room_type: clinic_office`.
+    | Clip ID | Typology | violent | Duration |
+    |---------|----------|:---:|---------:|
+    | `el_sv_b_0001_00`  | SV  | ✓ | 2m 27.0s |
+    | `el_sv_b_0002_00`  | SV  | ✓ | 2m 18.5s |
+    | `el_it_b_0001_00`  | IT  | ✓ | 2m 30.0s |
+    | `el_it_b_0002_00`  | IT  | ✓ | 2m 31.6s |
+    | `el_neg_b_0001_00` | NEG | — | 1m 53.8s |
+    | `el_neg_b_0002_00` | NEG | — | 2m 54.6s |
+    | `el_neu_b_0001_00` | NEU | — | 1m 56.9s |
+    | `el_neu_b_0002_00` | NEU | — | 1m 19.7s |
 
 ---
 
-## Loading Elephant clips
+## Loading and inspecting an Elephant clip
 
 ```python
-import json
-import soundfile as sf
-import numpy as np
-import pandas as pd
+import pandas as pd, soundfile as sf, json
 from pathlib import Path
 
 root = Path(".")
 df = pd.read_csv("data/he/manifest.csv")
-el_clips = df[df["project"] == "elephant_in_the_room"]
+el = df[df["project"] == "elephant_in_the_room"]    # 8 rows
 
-# Load audio + metadata for a Tier B clip
-clip_id = "el_sv_b_0001_00"
-wav, sr = sf.read(root / f"data/he/ben_m_40-55_003/{clip_id}.wav")
-meta = json.loads((root / f"data/he/ben_m_40-55_003/{clip_id}.json").read_text())
+# Pick one clip
+row = el.iloc[0]
+wav, sr = sf.read(root / row.wav_path)
+meta = json.loads((root / row.wav_path).with_suffix(".json").read_text())
 
-# Inspect acoustic scene
+# Acoustic scene
 scene = meta["acoustic_scene"]
-print(f"Room: {scene['room_type']}  Device: {scene['device']}  SNR: {scene['snr_db_actual']} dB")
-# Room: clinic_office  Device: pi_budget_mic  SNR: 11.2 dB
+print(f"{scene['room_type']}  {scene['device']}  SNR {scene['snr_db_actual']} dB  "
+      f"dist {scene['speaker_distance_meters']} m")
 
-# Find background acoustic events
+# Background acoustic events
 for evt in scene["background_events"]:
-    print(f"{evt['type']}: {evt['onset']:.1f}s – {evt['offset']:.1f}s  @ {evt['level_db']} dB")
-# hvac_hum: 0.0s – 147.0s  @ -37.4 dB
-# ACOU_SLAM: 72.2s – 72.5s  @ 9.9 dB
-# ACOU_FALL: 97.6s – 98.5s  @ 9.6 dB
+    print(f"  {evt['type']:10s}  {evt['onset']:6.1f}s – {evt['offset']:6.1f}s  @ {evt['level_db']:+5.1f} dB")
 
-# Get alert window (final 40%)
+# Alert window (final 40%) — for sliding-window evaluation
 duration = meta["duration_seconds"]
-alert_start = duration * 0.60
-print(f"Alert window: {alert_start:.1f}s – {duration:.1f}s")
+alert_start = 0.60 * duration
 
-# Filter strong labels to alert window only
-events = [json.loads(l) for l in
-          (root / f"data/he/ben_m_40-55_003/{clip_id}.jsonl").read_text().splitlines()]
+events = [json.loads(l) for l in (root / row.strong_labels_path).read_text().splitlines() if l.strip()]
 alert_events = [e for e in events if e["onset"] >= alert_start]
+print(f"alert window: {alert_start:.1f}s – {duration:.1f}s   "
+      f"{len(alert_events)} events fire in window  "
+      f"(of {len(events)} total)")
 ```
 
 ---
 
-## Guidance for model training
-
-!!! warning "This is a toy corpus — not for production training"
-    8 Elephant clips from 1 speaker pair in 1 room type is insufficient for training. This delivery exists to bootstrap your data pipeline and acoustic-scene parsing code.
-
-**High-precision orientation:**
-
-- **NEG clips are essential.** Your precision target means you must not fire on `el_neg_b_*` clips — intense speech in a clinic room with background noise, but no violence. Train hard against these.
-- **The alert-in-final-40% window** is where violence events concentrate. Consider a sliding-window detector that scores the final portion of each clip more aggressively than the opening.
-- **SNR is ~11 dB.** This is a realistic but challenging condition for acoustic feature extraction. Verify that your features (MFCCs, log-mel, etc.) are robust at this SNR before comparing with She-Proves Tier A results.
-
-**Tier B–specific features:**
-
-- `acoustic_scene.snr_db_actual` gives you the ground-truth SNR per clip — useful for SNR-conditioned training or evaluation stratification.
-- `background_events` timestamps let you train event detectors separately from the speech violence detector.
-- `acoustic_scene.room_type` will diversify across room types at scale (`clinic_office`, `welfare_office`, `open_office`). Future deliveries will include all three.
-
-**What delivery-003 doesn't cover:**
+## Training-time notes (specific to this project)
 
-- Only `clinic_office` room type (all 8 clips)
-- Only one speaker pair (BEN_M_40-55_003 + SW_F_30-45_001)
-- No test/val split (4 unique speakers total; all are `split: train`)
-- SNR variation (all ~11 dB)
+- **NEG clips are essential for precision.** `el_neg_b_*` is intense speech in a clinic room with background noise but no violence. If your detector fires on these, security stops trusting it. Train hard against these.
+- **The alert-in-final-40% structure is exploitable.** Consider a sliding-window detector that biases toward the final 40% of each clip, or use the window structure as a positional feature. Don't reward early firing.
+- **SNR ~11 dB is challenging.** Verify your features (MFCCs, log-mel, etc.) are robust here before comparing with She-Proves Tier A results. SNR is recorded per clip (`acoustic_scene.snr_db_actual`) — use it for SNR-stratified eval.
+- **`ACOU_*` events double as strong labels.** You can train an event detector on `ACOU_SLAM` / `ACOU_FALL` separately from the speech-violence detector and ensemble them.
+- **What delivery-003 *doesn't* cover:** only `clinic_office`, only one speaker pair (BEN+SW), only Azure backend, SNR essentially constant at ~11 dB. Plan for room diversity, SNR stratification, and speaker-disjoint splits when scaling.
 
-Plan for room-type diversity, SNR stratification, and speaker disjoint splits at scale.
+!!! warning "Still a small test batch"
+    8 clips, 1 room type, 1 speaker pair, 1 SNR is enough to wire up data loaders and acoustic-scene parsing. It is not enough to train a production model. Build the plumbing; wait for the real batch.
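Note on the `ACOU_*` double-recording described in the Elephant guide above: the two views can be cross-checked mechanically. The sketch below is a minimal illustration, not part of SynthBanshee. It assumes that a strong label's `tier2_subtype` equals the background event's `type` string and that onsets and offsets agree within a small tolerance. The helper name `unmatched_acou_events` and the `tol` parameter are hypothetical, and the toy data only mirrors the shape of the JSON example.

```python
def unmatched_acou_events(scene, events, tol=0.05):
    """Return ACOU_* background events that have no matching strong label.

    Assumption: a match means same subtype string and onset/offset
    within `tol` seconds. Adjust if the real files differ.
    """
    strong = [e for e in events if e.get("tier1_category") == "ACOU"]
    missing = []
    for bg in scene.get("background_events", []):
        if not bg["type"].startswith("ACOU_"):
            continue  # skip ambience such as hvac_hum
        hit = any(
            e.get("tier2_subtype") == bg["type"]
            and abs(e["onset"] - bg["onset"]) <= tol
            and abs(e["offset"] - bg["offset"]) <= tol
            for e in strong
        )
        if not hit:
            missing.append(bg)
    return missing


# Toy data shaped like the acoustic_scene example (values illustrative).
scene = {
    "background_events": [
        {"type": "hvac_hum", "onset": 0.0, "offset": 147.031, "level_db": -37.4},
        {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9},
        {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6},
    ]
}
events = [
    {"tier1_category": "VERB", "tier2_subtype": "VERB_SHOUT", "onset": 5.0, "offset": 9.0},
    {"tier1_category": "ACOU", "tier2_subtype": "ACOU_SLAM", "onset": 72.164, "offset": 72.476},
]

missing = unmatched_acou_events(scene, events)
print([bg["type"] for bg in missing])  # ['ACOU_FALL']
```

In real use, `scene` would come from `meta["acoustic_scene"]` and `events` from the clip's `.jsonl`; an empty result means both views agree for that clip.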
diff --git a/docs/getting-started.md b/docs/getting-started.md
index 0310028..a7f9cbb 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -1,183 +1,163 @@
-# Getting Started
+# Start here
 
-This guide walks through loading and using clips from the corpus in Python. All paths are relative to the repository root.
+Your first 10 minutes with the corpus. By the end you'll have cloned it, verified the clone, loaded one clip with its labels, and seen what's in a transcript.
 
-## Prerequisites
+---
+
+## 1. Clone
 
 ```bash
-pip install soundfile numpy pandas pydantic
+git clone https://github.com/DataHackIL/avdp-synth-corpus.git
+cd avdp-synth-corpus
 ```
 
-??? note "Optional: full SynthBanshee schema"
-    If you want strict Pydantic validation against the full `ClipMetadata` schema:
-    ```bash
-    git clone https://github.com/DataHackIL/SynthBanshee
-    cd SynthBanshee && pip install -e .
-    ```
-    This gives you `from synthbanshee.labels.schema import ClipMetadata` and `validate_clip()`.
-    For most DS workflows, plain `json.loads()` is sufficient.
+No Git LFS. Total size is a few hundred megabytes for delivery-003 — the audio lives in `data/he/`, the SSML caches live in `assets/`.
 
-## Clone the corpus
+---
+
+## 2. Verify the clone
 
 ```bash
-git clone https://github.com/DataHackIL/avdp-synth-corpus.git
-cd avdp-synth-corpus
+find data/he -name "*.wav" | wc -l    # expect 20
+wc -l data/he/manifest.csv            # expect 21 (header + 20 rows)
 ```
 
-The repository contains the audio files directly (no LFS). Total size is moderate — `data/he/` is roughly a few hundred MB for delivery-003.
+If those numbers don't match, the clone is incomplete — `git lfs pull` is not the answer (we don't use LFS). Re-clone.
 
 ---
 
-## Load a single clip
+## 3. Install the minimal Python deps
+
+```bash
+pip install soundfile numpy pandas
+```
+
+That's enough for everything on this page. `pydantic` is only needed if you want strict schema validation; `jsonlines` only if you prefer it to the one-liner that reads `.jsonl` directly.
+
+??? note "When you'd want the full SynthBanshee install"
+    If you want `from synthbanshee.labels.schema import ClipMetadata` for strict Pydantic validation, or `synthbanshee qa-report` to re-run QA over the data directory:
+    ```bash
+    git clone https://github.com/DataHackIL/SynthBanshee
+    cd SynthBanshee && pip install -e .
+    ```
+    For consuming the corpus, `json.loads()` is fine and is what the examples below use.
+
+---
+
+## 4. Load one clip end-to-end
+
+The path on disk is **lowercase** even though the speaker ID in JSON is **UPPERCASE**; that's [Gotcha #4](gotchas.md#4-uppercase-in-json-lowercase-on-disk).
 
 ```python
 import json
 from pathlib import Path
 import soundfile as sf
-import numpy as np
-
-root = Path(".")  # run from repo root
 
+root = Path(".")                          # repo root
+clip_dir = root / "data/he/agg_m_30-45_001"
 clip_id = "sp_sv_a_0001_00"
-speaker_dir = root / "data/he/agg_m_30-45_001"
 
-# --- Audio ---
-wav, sr = sf.read(speaker_dir / f"{clip_id}.wav")
-# wav: float64 array, shape (N,). sr: always 16000.
+# Audio
+wav, sr = sf.read(clip_dir / f"{clip_id}.wav")
+assert sr == 16000 and wav.ndim == 1      # always 16 kHz mono
 
-print(f"Duration: {len(wav)/sr:.1f}s  Sample rate: {sr}  Peak: {np.abs(wav).max():.4f}")
-# Duration: 110.5s  Sample rate: 16000  Peak: 0.7943
-
-# --- Weak labels (ClipMetadata) ---
-meta = json.loads((speaker_dir / f"{clip_id}.json").read_text())
-wl = meta["weak_label"]
-print(f"Typology: {meta['violence_typology']}  has_violence: {wl['has_violence']}  "
-      f"max_intensity: {wl['max_intensity']}")
-# Typology: SV  has_violence: True  max_intensity: 5
-
-# --- Transcript ---
-transcript = (speaker_dir / f"{clip_id}.txt").read_text(encoding="utf-8")
-print(transcript[:200])  # Hebrew turns with timestamps
+print(f"duration={len(wav)/sr:.1f}s  peak={abs(wav).max():.3f}")
+# duration=110.5s  peak=0.794
 ```
 
-??? info "Why is the peak ~0.79 (–2.0 dBFS) not 1.0?"
-    All clips are peak-normalized to a **–2.0 dBFS target** (not –1.0 dBFS = 1.0 linear).
-    This gives 2 dB of headroom above the safety limiter ceiling (–1.0 dBFS).
-    `preprocessing_applied.normalized_dbfs` in the JSON records the measured peak.
-    See [Audio Format](audio-format.md) for the full normalization pipeline.
-
----
+!!! info "Why is the peak 0.794 and not 1.0?"
+    Clips are normalized to a **–2.0 dBFS peak target**, which is roughly 0.79 linear amplitude. Use `generation_metadata.loudness_target_peak_dbfs` to read the configured target and `preprocessing_applied.normalized_dbfs` to read the measured output peak. Full detail: [Audio Format](audio-format.md).
 
-## Load strong-label events
+Clip-level labels (weak labels):
 
 ```python
-import jsonlines  # pip install jsonlines
+meta = json.loads((clip_dir / f"{clip_id}.json").read_text())
+wl = meta["weak_label"]
+print(f"typology={meta['violence_typology']}  has_violence={wl['has_violence']}  "
+      f"intensity_max={wl['max_intensity']}  categories={wl['violence_categories']}")
+# typology=SV  has_violence=True  intensity_max=5  categories=['DIST', 'PHYS', 'VERB']
+```
 
-events = []
-with jsonlines.open(speaker_dir / f"{clip_id}.jsonl") as reader:
-    for evt in reader:
-        events.append(evt)
+Event-level labels (strong labels):
 
-# Or without jsonlines:
+```python
 events = [
     json.loads(line)
-    for line in (speaker_dir / f"{clip_id}.jsonl").read_text().splitlines()
+    for line in (clip_dir / f"{clip_id}.jsonl").read_text().splitlines()
     if line.strip()
 ]
 
 for evt in events[:3]:
-    print(f"[{evt['onset']:.1f}s – {evt['offset']:.1f}s] "
-          f"{evt['tier1_category']}/{evt['tier2_subtype']}  I{evt['intensity']}")
-# [0.8s – 10.1s] VERB/VERB_SHOUT  I2
-# [10.5s – 18.7s] VERB/VERB_SHOUT  I2
-# [18.3s – 29.7s] VERB/VERB_THREAT  I3
+    print(f"[{evt['onset']:5.1f}s – {evt['offset']:5.1f}s]  "
+          f"{evt['speaker_role']}  {evt['tier1_category']}/{evt['tier2_subtype']}  "
+          f"I{evt['intensity']}")
+# [  0.8s –  10.1s]  AGG  VERB/VERB_SHOUT  I2
+# [ 10.5s –  18.7s]  VIC  VERB/VERB_SHOUT  I2
+# [ 18.3s –  29.7s]  AGG  VERB/VERB_THREAT  I3
 ```
 
-??? info "What are tier1_category and tier2_subtype?"
-    Strong labels follow a three-level taxonomy:
+The full 14-event escalation arc for this clip is the one visualised on the [home page](index.md#see-it-first) — verbal → distress → physical → settle.
+
+---
+
+## 5. Read a transcript
 
-    **Typology** (clip-level): `SV` · `IT` · `NEG` · `NEU`
+`.txt` files are turn-major with a small header block per turn. They use UTF-8 Hebrew and are intended both for human reading and as ASR reference.
 
-    **Tier 1 category** (event-level): `VERB` · `DIST` · `PHYS` · `EMOT` · `ACOU` · `NONE`
+```
+[CLIP_ID: sp_sv_a_0001_00]
+[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07]
+מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית.
+[ACTION: VERB_SHOUT | INTENSITY: 2]
+[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74]
+עבדתי עד שש היום. עשיתי מה שהספקתי...
+```
 
-    **Tier 2 subtype** (event-level): e.g. `VERB_SHOUT`, `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD`, `ACOU_SLAM`
+!!! note "Hebrew is right-to-left; some terminals mis-render it"
+    macOS Terminal.app handles it correctly; older Windows consoles don't. If transcripts look reversed or garbled, view the `.txt` in an editor (VS Code, BBEdit) rather than `cat`.
 
-    See [Label Taxonomy](taxonomy.md) for the full table and has_violence derivation rule.
+Timestamps in the header are already relative to the **final processed WAV** — they include the 0.5 s silence pad at the head. No shift needed.
 
 ---
 
-## Work with the manifest
+## 6. Work from the manifest, not from hardcoded paths
 
-`data/he/manifest.csv` is a flat summary of all clips. It's the fastest entry point for filtering and dataset construction.
+`data/he/manifest.csv` is one row per clip. It's the fastest entry point for filtering and the safest way to find files (because [a hardcoded speaker directory will miss two of the three directories](gotchas.md#2-dont-hardcode-speaker-directory-paths)).
 
 ```python
-import pandas as pd
+import pandas as pd, soundfile as sf
 
 df = pd.read_csv("data/he/manifest.csv")
-print(df.columns.tolist())
+df.columns.tolist()
 # ['clip_id', 'project', 'violence_typology', 'tier', 'duration_seconds',
 #  'speaker_ids', 'voice_families', 'has_violence', 'max_intensity',
 #  'quality_flags', 'split', 'wav_path', 'strong_labels_path']
 
-# Filter by project
-she_proves_clips = df[df["project"] == "she_proves"]
+# Filter
+violent  = df[df["has_violence"]]                                 # 10 clips
+elephant = df[df["project"] == "elephant_in_the_room"]            # 8 clips
+sv_high  = df[(df["violence_typology"] == "SV") & (df["max_intensity"] >= 4)]
 
-# Filter by typology
-sv_clips = df[df["violence_typology"] == "SV"]
-
-# High-intensity violent clips only
-high_intensity = df[(df["has_violence"]) & (df["max_intensity"] >= 4)]
-
-# Load audio for a manifest row
+# Load audio for any manifest row — wav_path is already repo-relative POSIX
 row = df.iloc[0]
-wav, sr = sf.read(row["wav_path"])  # paths are repo-relative POSIX strings
-```
-
-!!! warning "`speaker_ids` and `voice_families` are pipe-delimited"
-    These columns contain multiple values joined by `|`:
-    ```python
-    speakers = row["speaker_ids"].split("|")
-    # ['AGG_M_30-45_001', 'VIC_F_25-40_002']
-    ```
-
-!!! note "All clips are `split: train` in delivery-003"
-    The corpus has only 4 unique speaker personas across 20 clips — speaker-disjoint splits are not feasible at this scale. When the corpus scales, speaker-disjoint train/val/test splits will be assigned by SynthBanshee. Until then, treat this as an unpartitioned pool.
-
----
-
-## Find a clip's speaker directory
-
-Clip IDs follow the pattern `{project_prefix}_{typology}_{tier}_{scene_num}_{take}`. The on-disk directory is the **lowercase** form of the first speaker ID listed in `speakers[]`:
-
-```python
-def clip_dir(root: Path, clip_id: str, meta: dict) -> Path:
-    first_speaker = meta["speakers"][0]["speaker_id"]
-    return root / "data" / meta["language"] / first_speaker.lower()
+wav, sr = sf.read(row["wav_path"])
+speakers = row["speaker_ids"].split("|")        # pipe-delimited!
+voices   = row["voice_families"].split("|")     # same order as speaker_ids
 ```
 
-| clip_id | speaker_dir |
-|---------|-------------|
-| `sp_sv_a_0001_00` | `data/he/agg_m_30-45_001/` |
-| `sp_sv_a_0003_00` | `data/he/agg_m_30-45_002/` |
-| `el_sv_b_0001_00` | `data/he/ben_m_40-55_003/` |
-
-Or use `manifest.csv` directly — `wav_path` already contains the full repo-relative path.
+!!! warning "`speaker_ids` and `voice_families` are pipe-delimited strings"
+    They are not CSV-nested lists. Split on `|`.
 
 ---
 
-## Validate a clip
-
-If you have SynthBanshee installed:
-
-```bash
-synthbanshee validate data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav
-```
-
-This checks: all four files present, WAV format (16 kHz mono), peak ≤ –1.0 dBFS, duration ≥ 3 s, JSON parses as `ClipMetadata`.
-
-To run QA over the entire language directory:
-
-```bash
-synthbanshee qa-report data/he/
-synthbanshee qa-report data/he/ --run-summary   # adds corpus-level aggregates
-```
+## 7. Where to go next
+
+| You're about to… | Read |
+|------------------|------|
+| Write data-loading code | [Common mistakes](gotchas.md) (2 min read; saves debugging) |
+| Look up a field in `.json` | [Schema Reference](schema.md) |
+| Understand `has_violence` semantics | [Label Taxonomy](taxonomy.md) |
+| Look up a term (F0, SSML, IR, BEN…) | [Glossary](glossary.md) |
+| Work specifically with phone-app data | [She-Proves guide](she-proves.md) |
+| Work specifically with Tier B / room-augmented audio | [Elephant in the Room guide](elephant.md) |
+| Verify a clip is spec-compliant | `synthbanshee validate <path>` (requires SynthBanshee installed) |
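Note on section 5 of the getting-started page above: the transcript headers are regular enough to parse with one regular expression. The sketch below is a minimal, hedged example. The header grammar is inferred only from the single excerpt shown in that section, so the pattern, the helper name `parse_turns`, and the choice to ignore `[ACTION: ...]` lines are assumptions rather than a specification. The sample uses ASCII placeholder text in place of the Hebrew turns.

```python
import re

# Assumed header shape, taken from the excerpt in the docs:
# [SPEAKER: <id> | ROLE: <role> | ONSET: <sec> | OFFSET: <sec>]
SPEAKER_RE = re.compile(
    r"^\[SPEAKER: (?P<speaker_id>\S+) \| ROLE: (?P<role>\w+) "
    r"\| ONSET: (?P<onset>[\d.]+) \| OFFSET: (?P<offset>[\d.]+)\]$"
)


def parse_turns(text):
    """Split a transcript into turns; bracketed non-speaker lines are skipped."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        m = SPEAKER_RE.match(line)
        if m:
            turns.append({
                "speaker_id": m["speaker_id"],
                "role": m["role"],
                "onset": float(m["onset"]),
                "offset": float(m["offset"]),
                "text": "",
            })
        elif line and not line.startswith("[") and turns:
            # Plain line: append to the text of the current turn.
            turns[-1]["text"] = (turns[-1]["text"] + " " + line).strip()
    return turns


sample = """[CLIP_ID: sp_sv_a_0001_00]
[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07]
first turn text
[ACTION: VERB_SHOUT | INTENSITY: 2]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74]
second turn text
"""

turns = parse_turns(sample)
for t in turns:
    print(t["role"], t["onset"], t["offset"], t["text"])
```

For ASR scoring, the `text` of each turn is the reference string and `onset`/`offset` index directly into the processed WAV.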
diff --git a/docs/glossary.md b/docs/glossary.md
new file mode 100644
index 0000000..7a18b7e
--- /dev/null
+++ b/docs/glossary.md
@@ -0,0 +1,110 @@
+# Glossary
+
+Abbreviations and jargon that show up across the corpus and on this site, in one place.
+
+---
+
+## Speaker roles
+
+The role of each speaker is encoded in the speaker_id prefix and in `speakers[].role`.
+
+| Code | Stands for | Used in |
+|------|-----------|---------|
+| `AGG` | **Aggressor** — the perpetrator in a domestic-violence scene | She-Proves clips (`AGG_M_30-45_*`) |
+| `VIC` | **Victim** — the target of violence in a domestic-violence scene | She-Proves clips (`VIC_F_25-40_*`) |
+| `BEN` | **Beneficiary / client** — a service-user in a welfare or clinic setting (the threatening party in Elephant scenes) | Elephant clips (`BEN_M_40-55_*`) |
+| `SW` | **Social Worker** — the threatened professional in Elephant scenes | Elephant clips (`SW_F_30-45_*`) |
+
+The role determines the prosody profile, scene position, and which `tier1_category` events the speaker can produce.
+
+---
+
+## Project codes
+
+| Code | Project | Clip ID prefix |
+|------|---------|----------------|
+| `she_proves` | She-Proves smartphone app | `sp_*` |
+| `elephant_in_the_room` | Elephant in the Room (clinic/welfare device) | `el_*` |
+
+---
+
+## Violence typology
+
+The clip-level `violence_typology` field — not an ordered scale. See [Label Taxonomy](taxonomy.md) for details.
+
+| Code | Stands for |
+|------|------------|
+| `SV` | Severe Violence |
+| `IT` | Intimate Terrorism |
+| `NEG` | Negative confusor (sounds intense, no violence) |
+| `NEU` | Neutral |
+
+---
+
+## Tier 1 event category
+
+The event-level `tier1_category` field on each `EventLabel`.
+
+| Code | Stands for |
+|------|------------|
+| `VERB` | Verbal violence (shouting, threats, insults) |
+| `DIST` | Distress vocalisations (screaming, crying under duress) |
+| `PHYS` | Physical violence cues (impact sounds, struggle) |
+| `EMOT` | Emotional manipulation (gaslighting, guilt-tripping) |
+| `ACOU` | Acoustic non-vocal events (slams, falls) |
+| `NONE` | Ambient / neutral / no violence cue |
+
+---
+
+## Tier codes
+
+| Code | Meaning |
+|------|---------|
+| `A` | Clean audio — no room IR, no device profile, no background noise |
+| `B` | Room IR + device profile + background noise injection |
+
+---
+
+## Audio jargon
+
+| Term | Meaning |
+|------|---------|
+| **F0** | Fundamental frequency — the lowest frequency of a periodic signal; for voice, the pitch. Reported per speaker in some QA outputs. |
+| **dBFS** | Decibels relative to full scale — 0 dBFS is the maximum amplitude representable by the format; –2 dBFS is ~80% of full amplitude. |
+| **Peak normalization** | Applying a single gain to the whole signal so its absolute maximum matches a target level. |
+| **RMS** | Root-mean-square — a measure of average signal energy. SynthBanshee uses per-turn RMS gain to enforce the loudness gradient between calm and escalated turns. |
+| **SNR** | Signal-to-noise ratio — speech level minus background-noise level, in dB. Recorded in `acoustic_scene.snr_db_actual` for Tier B clips. |
+| **IR** | Impulse response — a recording of how a room (or microphone, or speaker) responds to an idealised pulse. Convolving clean speech with a room IR makes it sound like it was recorded in that room. |
+| **ISM** | Image-source method — an algorithm for synthetically generating room IRs by reflecting virtual sound sources off room walls. Implemented by `pyroomacoustics`. |
+| **SSML** | Speech Synthesis Markup Language — an XML dialect that controls TTS output (pitch, rate, emphasis, breaks, voice). Azure and Google both accept SSML. |
+| **TTS** | Text-to-speech — the generation of audio from a text prompt. |
+| **Prosody** | The patterns of stress, intonation, pitch, and rate that make speech expressive (vs. flat). |
+| **Prosody cap** | A safety clamp applied by SynthBanshee to LLM-suggested prosody values to prevent unnatural extremes (pitch ≤ +2 st, rate ∈ [0.85, 1.20]). |
+| **Whisper** | OpenAI's open-weight ASR model, used internally as a sanity check that synthesised audio is still transcribable. |
+
+---
+
+## Pipeline / corpus jargon
+
+| Term | Meaning |
+|------|---------|
+| **Dirty file** | The pre-preprocessing WAV (raw TTS-mixer output, before normalization and padding). Retained under `assets/speech/dirty/{clip_id}_dirty.wav`. |
+| **Generation metadata** | The `generation_metadata` field — pipeline provenance: which TTS backend was used, which voice family, what mix mode, etc. |
+| **Manifest** | The flat CSV summary at `data/he/manifest.csv` — one row per clip, columns for filtering. |
+| **Strong labels** | Event-level labels in `.jsonl` files — one `EventLabel` object per labelled event, with onset/offset/category. |
+| **Weak labels** | Clip-level summary labels in `.json` — `has_violence`, `max_intensity`, `violence_typology`, `violence_categories`. |
+| **Quality flag** | A soft warning in `quality_flags` (e.g. `emotion_downgrade`). Doesn't fail validation; flags audio worth a second look. |
+| **Delivery** | A merged data batch under `deliveries/{slug}/`. Each delivery records its SynthBanshee commit, metadata, and per-batch QA notes. |
+
+---
+
+## Hebrew TTS voice IDs
+
+The four voices used in delivery-003:
+
+| Voice ID | Gender | Backend |
+|----------|:---:|---------|
+| `he-IL-AvriNeural` | M | Azure |
+| `he-IL-HilaNeural` | F | Azure |
+| `he-IL-Chirp3-HD-Achird` | M | Google Chirp 3 HD |
+| `he-IL-Chirp3-HD-Achernar` | F | Google Chirp 3 HD |
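Note on the **dBFS** glossary entry above: the relationship between dBFS and linear amplitude is a one-line formula, sketched here for readers who want to check the "~80%" figure themselves. The function names are illustrative and not part of SynthBanshee.

```python
import math


def dbfs_to_linear(dbfs):
    """Convert a peak level in dBFS to linear amplitude (0 dBFS -> 1.0)."""
    return 10 ** (dbfs / 20)


def linear_to_dbfs(amplitude):
    """Convert a positive linear amplitude to dBFS."""
    return 20 * math.log10(amplitude)


print(round(dbfs_to_linear(-2.0), 4))    # 0.7943  (the corpus peak target)
print(round(dbfs_to_linear(-1.0), 4))    # 0.8913
print(round(linear_to_dbfs(0.7943), 2))  # -2.0
```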
diff --git a/docs/gotchas.md b/docs/gotchas.md
new file mode 100644
index 0000000..816c90b
--- /dev/null
+++ b/docs/gotchas.md
@@ -0,0 +1,136 @@
+# Common mistakes
+
+Read this once before you write code against the corpus. Two minutes here saves a debugging session later.
+
+---
+
+## 1. Don't derive `has_violence` from typology
+
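+Typology is a scene-level label, not the ground truth. Intensity is no better: a rule keyed on `max_intensity` will misclassify every `NEG` clip that reaches 3.
+
+```python
+# WRONG: typology is a clip-level scene label, not the event-level ground truth
+has_violence = typology in ("SV", "IT")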
+
+# CORRECT — uses the event-level ground truth
+has_violence = any(e["tier1_category"] != "NONE" for e in events)
+```
+
+`has_violence` in `weak_label` is **derived from strong-label events**, not from typology. NEG clips can have `max_intensity = 3` (raised voices, distress) and still be `has_violence: false` because every one of their events lands `tier1_category: "NONE"` by design. That's the whole point of NEG: hard negatives that sound intense but aren't violent.
+
+---
+
+## 2. Don't hardcode speaker directory paths
+
+There's already more than one. Delivery-003 has three speaker directories under `data/he/`:
+
+```
+data/he/agg_m_30-45_001/   # She-Proves, Azure pair
+data/he/agg_m_30-45_002/   # She-Proves, Google Chirp HD pair  (new in delivery-003)
+data/he/ben_m_40-55_003/   # Elephant in the Room, Azure pair  (new in delivery-003)
+```
+
+Code that hardcodes `data/he/agg_m_30-45_001/` will miss the other two directories, which hold half of the 20 clips. Use `manifest.csv` (the `wav_path` column is repo-relative POSIX), or derive the directory from the first entry in `speakers[]`:
+
+```python
+speaker_dir = root / "data" / meta["language"] / meta["speakers"][0]["speaker_id"].lower()
+```
+
+---
+
+## 3. Audio peak is ~0.79, not 1.0
+
+Clips are normalized to a **–2.0 dBFS peak target**, about 0.794 linear (full scale, 0 dBFS, is linear 1.0). Loading a clip and expecting full-range float values will surprise you:
+
+```python
+wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
+print(np.abs(wav).max())  # ~0.7943, not 1.0
+```
+
+The –2 dBFS target sits 1 dB below the safety limiter ceiling at –1.0 dBFS, and 2 dB below full scale. The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`; the measured peak is recorded in `preprocessing_applied.normalized_dbfs`.
+
+---
+
+## 4. UPPERCASE in JSON, lowercase on disk
+
+The same speaker has two surface forms:
+
+| Surface | Form | Example |
+|---------|------|---------|
+| JSON field (`speaker_id`, `speakers[].speaker_id`) | **UPPERCASE** | `AGG_M_30-45_001` |
+| Filesystem directory | **lowercase** | `agg_m_30-45_001/` |
+| `clip_id` (everywhere) | **lowercase** | `sp_sv_a_0001_00` |
+
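+If you build a dict keyed on speaker IDs from JSON and then try to look up paths with the same string, you'll get a `FileNotFoundError` on a case-sensitive filesystem (typical on Linux); a case-insensitive filesystem hides the bug until the code runs elsewhere. Always `.lower()` when converting from JSON to a path.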
+
+---
+
+## 5. NEG is not "violent at low intensity"
+
+The four violence typologies are **not** an ordered scale.
+
+| | |
+|---|---|
+| `SV` | Severe Violence — physical attacks, life-threatening |
+| `IT` | Intimate Terrorism — sustained coercive control, repeated abuse |
+| `NEG` | **Negative confusor** — sounds intense, no violence (hard negative) |
+| `NEU` | Neutral — mundane conversation |
+
+A NEG clip is **not** "a milder SV." It is acoustic distress that a naive model would mistake for violence. Treating NEG as a positive class will tank your precision.
+
+---
+
+## 6. All clips are `split: train` in delivery-003
+
+The `split` column exists in `manifest.csv`, but there are only 4 unique speaker personas across all 20 clips. Speaker-disjoint train/val/test partitioning isn't feasible at this scale — every clip is therefore assigned `split: train`. **Don't trust the `split` column as a usable partition.** Treat the whole corpus as an unpartitioned pool for now. SynthBanshee will assign meaningful splits once the speaker pool grows.
+
+---
+
+## 7. `quality_flags` doesn't mean "broken"
+
+About 15 of 20 clips in delivery-003 carry at least one `quality_flags` entry — usually `emotion_downgrade` (the TTS produced slightly less intense prosody than the SSML asked for at high-intensity turns). These clips are still validated and spec-compliant; the flag is a soft hint, not a failure. Don't filter them out reflexively.
+
+The hard line is `synthbanshee validate` — a clip either passes or doesn't. If it's in the corpus, it passed.
+
+---
+
+## 8. The 2 Google clips have a `vic_f0_high` flag — that's expected
+
+`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`), whose fundamental-frequency baseline runs higher than the Azure reference voice the QA thresholds were calibrated against. The flag is fired correctly; the audio is fine. **Don't exclude these clips on the basis of this flag** — your model needs the backend diversity. If you compute F0-derived features, calibrate per backend.
+
+---
+
+## 9. Timestamps already account for silence padding
+
+Every clip has ≥0.5 s of silence at head and tail. **Onset/offset timestamps in `.txt` and `.jsonl` are already shifted** to refer to positions in the final processed WAV. You don't need to add the pad — read the timestamp, slice the WAV, done.
+
+---
+
+## 10. The `.json` and `.jsonl` files aren't the same thing
+
+| File | Contains | When to load |
+|------|----------|--------------|
+| `{clip_id}.json` | `ClipMetadata` — one object per clip: weak labels, speakers, provenance, acoustic scene | Always |
+| `{clip_id}.jsonl` | `EventLabel` records — one JSON object per **line**, one per labelled event in the clip | When you need per-event strong labels (onset/offset/category) |
+
+Calling `json.loads()` on the full contents of a `.jsonl` file raises `json.JSONDecodeError` as soon as the file holds more than one record. Parse it one line at a time.
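+
+A small sketch of line-by-line parsing, using an in-memory string as a stand-in for a real `{clip_id}.jsonl` file. The helper name `parse_event_labels` is hypothetical, and the sample rows are trimmed to a few fields with illustrative values.
+
+```python
+import json
+
+
+def parse_event_labels(jsonl_text: str) -> list[dict]:
+    """Parse JSON Lines text: one JSON object per non-empty line."""
+    return [json.loads(line) for line in jsonl_text.split("\n") if line.strip()]
+
+
+# Two-row stand-in for a .jsonl file (illustrative values, trimmed fields).
+sample = (
+    '{"event_id": "sp_sv_a_0001_00_EVT_001", "tier1_category": "VERB", "onset": 0.76}\n'
+    '{"event_id": "sp_sv_a_0001_00_EVT_004", "tier1_category": "DIST", "onset": 36.736}\n'
+)
+events = parse_event_labels(sample)
+print([e["tier1_category"] for e in events])
+
+# On disk: parse_event_labels(Path(strong_labels_path).read_text(encoding="utf-8"))
+```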
+
+---
+
+## Quick verification
+
+Use these snippets to confirm a fresh clone is intact:
+
+```bash
+find data/he -name "*.wav" | wc -l     # expect 20
+wc -l data/he/manifest.csv              # expect 21 (header + 20 rows)
+```
+
+```python
+import pandas as pd
+df = pd.read_csv("data/he/manifest.csv")
+assert len(df) == 20
+assert set(df["tier"]) == {"A", "B"}
+assert set(df["violence_typology"]) == {"SV", "IT", "NEG", "NEU"}
+assert df["has_violence"].sum() == 10   # 5 SV + 5 IT
+```
diff --git a/docs/index.md b/docs/index.md
index 7b5b4d2..aa280c8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,127 +1,82 @@
-# avdp-synth-corpus
+# AVDP Synthetic Corpus
 
-**Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline (AVDP)**
+**Synthetic Hebrew audio clips for the Audio Violence Detection Pipeline.**
+Hebrew (he-IL) · 16 kHz mono 16-bit PCM · generated by [SynthBanshee](https://github.com/DataHackIL/SynthBanshee).
 
-Generated by [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) · Hebrew (he-IL) · 16 kHz mono 16-bit PCM
+<span class="status-pill provisional">delivery 003 · 2026-05-12 · provisional</span>
+20 clips · ~41.6 min · `she_proves` (12) + `elephant_in_the_room` (8) · Azure (18) + Google (2) · 0 validation failures.
 
 ---
 
-!!! warning "Toy corpus — not approved for model training"
-    All current deliveries are provisional wet-test batches for spec validation and pipeline bootstrapping.
-    The `split` field in `manifest.csv` is informational only. **Do not train production models on this data.**
-    See [Deliveries](deliveries.md) for the full status of each batch.
+## See it first
 
----
-
-## What is this?
-
-This repository contains **synthetic Hebrew audio clips** representing domestic-violence and threat scenarios, produced by a text-to-speech pipeline with automatic prosody modelling and acoustic augmentation.
+A real clip from the corpus — Severe Violence scene, two speakers, with strong-label events overlaid on the waveform:
 
-Two downstream products consume this data:
+![Waveform of sp_sv_a_0001_00 with event boundaries](assets/sp_sv_a_0001_00_waveform.png)
 
-=== "She-Proves"
+You can read the typical escalation arc directly: the argument starts out verbal (`VERB`, blue), escalates into distress vocalisations (`DIST`, orange) around 36 s, then into physical-violence cues (`PHYS`, red) around 71 s. Intensity badges (`I2` → `I5`) follow the same curve.
 
-    A smartphone app that passively monitors audio for domestic violence incidents and preserves evidence for legal use. High-recall orientation — better to flag and review than to miss.
-
-    → [She-Proves team guide](she-proves.md)
+---
 
-=== "Elephant in the Room"
+## Load a clip in 4 lines
 
-    A Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. High-precision orientation — false alarms erode trust.
+```python
+import json, soundfile as sf
+wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
+meta    = json.load(open("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json", encoding="utf-8"))
+print(f"{len(wav)/sr:.1f}s  has_violence={meta['weak_label']['has_violence']}  "
+      f"intensity_max={meta['weak_label']['max_intensity']}")
+# 110.5s  has_violence=True  intensity_max=5
+```
 
-    → [Elephant in the Room team guide](elephant.md)
+For everything else: [Start here →](getting-started.md)
 
 ---
 
-## Current delivery at a glance
+## Two consumer teams
 
-**Delivery 003 — multi-project, multi-voice** · 2026-05-12 · provisional
+<div class="team-cards" markdown>
 
-| Dimension | Value |
-|-----------|-------|
-| Clips | 20 |
-| Total duration | ~41.6 min |
-| Projects | `she_proves` (12 clips) + `elephant_in_the_room` (8 clips) |
-| Tiers | A — clean (12) + B — room-augmented (8) |
-| TTS backends | Azure (18 clips) + Google Chirp 3 HD (2 clips) |
-| Validation failures | 0 / 20 |
-| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) |
+<div class="team-card" markdown>
+<div class="tagline">smartphone app</div>
+### She-Proves
+Passively monitors audio on a phone for domestic-violence incidents and preserves the evidence for legal use. **High-recall** orientation: better to flag and review than to miss.
 
-Full breakdown: [Deliveries](deliveries.md) · [She-Proves clips](she-proves.md#clips-in-delivery-003) · [Elephant clips](elephant.md#clips-in-delivery-003)
+12 clips · Tier A (clean audio) · scenes 3–6 min · phone-pocket device profile planned (not active in delivery-003).
 
----
+[She-Proves guide →](she-proves.md){ .card-link }
+</div>
 
-## Repository layout
+<div class="team-card" markdown>
+<div class="tagline">raspberry pi · clinic / welfare office</div>
+### Elephant in the Room
+A Pi-class device that alerts security when a social worker is under threat. **High-precision** orientation — false alarms erode trust.
 
-```
-data/
-  he/                        # ISO 639-1 language code
-    {speaker_dir}/           # e.g. agg_m_30-45_001/  (lowercase of first speaker ID)
-      {clip_id}.wav          # 16 kHz mono 16-bit PCM
-      {clip_id}.txt          # per-turn transcript with onset/offset markers
-      {clip_id}.json         # ClipMetadata (weak labels, provenance, speaker info)
-      {clip_id}.jsonl        # EventLabel records — one JSON object per line
-    manifest.csv             # flat summary of all clips under data/he/
-
-assets/
-  speech/                    # SHA-256-keyed per-utterance WAV cache (do not modify)
-    dirty/                   # pre-preprocessing WAVs, retained per spec
-  scripts/                   # SHA-256-keyed LLM script cache (do not modify)
-
-deliveries/
-  {slug}/
-    metadata.yaml            # structured delivery record
-    notes.md                 # narrative QA notes and known limitations
-    qa-report.json           # synthbanshee qa-report output
-```
+8 clips · Tier B (room IR + budget mic + noise) · scenes 1–4 min · alert in final 40%.
 
-??? info "Why are there four files per clip?"
-    - **`.wav`** — the audio, spec-compliant (normalized, padded, validated)
-    - **`.txt`** — the transcript with turn-level onset/offset markers, used as ASR reference
-    - **`.json`** — `ClipMetadata`: weak labels (`has_violence`, `max_intensity`), speaker list, acoustic scene, provenance (`generation_metadata`)
-    - **`.jsonl`** — `EventLabel` records: one line per strong-label event with category, subtype, onset, offset, intensity, emotional state
+[Elephant guide →](elephant.md){ .card-link }
+</div>
 
-    You only need `.wav` + `.json` for most training pipelines. Add `.jsonl` when you need per-event strong labels or onset/offset supervision.
+</div>
 
 ---
 
-## Where to start
+## Where to go
 
-| I want to… | Go to |
-|------------|-------|
-| Load my first clip in Python | [Getting Started → Load a clip](getting-started.md#load-a-single-clip) |
-| Understand what the labels mean | [Label Taxonomy](taxonomy.md) |
-| Parse `ClipMetadata` with Pydantic | [Schema Reference](schema.md) |
-| Work with She-Proves scenes | [She-Proves guide](she-proves.md) |
-| Work with Elephant Tier B audio | [Elephant in the Room guide](elephant.md) |
-| Understand the audio normalization | [Audio Format](audio-format.md) |
-| Check current quality status | [Deliveries](deliveries.md) |
+| | |
+|---|---|
+| **First time here** | [Start here](getting-started.md) — clone, load one clip, read its labels |
+| **About to write code** | [Common mistakes](gotchas.md) — read this once; it'll save you a few |
+| **Decoding a label** | [Label Taxonomy](taxonomy.md) — typologies, categories, `has_violence` rule |
+| **Decoding a JSON field** | [Schema Reference](schema.md) — annotated `ClipMetadata` example |
+| **Working with team data** | [She-Proves](she-proves.md) · [Elephant](elephant.md) |
+| **Looking up a term** | [Glossary](glossary.md) — F0, SSML, IR, AGG/VIC/SW/BEN, etc. |
+| **Checking what's current** | [Deliveries](deliveries.md) — current batch, known gaps |
 
 ---
 
-## Quick snippet
+!!! warning "This is a small test batch, not training data"
+    All current deliveries are preview batches for verifying that downstream data-loading code works before the full dataset arrives. The `split` column in `manifest.csv` is informational only — all 20 clips are `split: train` because there aren't enough unique speakers for a disjoint partition at this scale. **Do not train production models on this corpus.**
 
-```python
-import json
-from pathlib import Path
-import soundfile as sf
-
-root = Path(".")  # repo root
-
-# Load a clip
-wav, sr = sf.read(root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
-meta = json.loads((root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text())
-
-print(f"Duration: {len(wav)/sr:.1f}s  has_violence: {meta['weak_label']['has_violence']}")
-# Duration: 110.5s  has_violence: True
-```
-
-For manifest-level operations:
-
-```python
-import pandas as pd
-
-df = pd.read_csv("data/he/manifest.csv")
-violent = df[df["has_violence"] == True]
-print(violent[["clip_id", "project", "violence_typology", "duration_seconds"]].to_string())
-```
+!!! info "What's *not* in this corpus"
+    No real human recordings (synthetic TTS only) · no Arabic or English (Hebrew only) · no inter-annotator agreement metrics (labels are auto-generated by SynthBanshee) · no demographic detail beyond `gender` + `age_range`. Scripts are LLM-generated in Hebrew, not human-written. See [Glossary](glossary.md) for what each abbreviation means.
diff --git a/docs/schema.md b/docs/schema.md
index 98a10fd..dccbf1e 100644
--- a/docs/schema.md
+++ b/docs/schema.md
@@ -1,219 +1,226 @@
 # Schema Reference
 
-Every clip's `.json` file contains a `ClipMetadata` object. The authoritative Pydantic model is in [SynthBanshee `synthbanshee/labels/schema.py`](https://github.com/DataHackIL/SynthBanshee/blob/main/synthbanshee/labels/schema.py).
+A real `ClipMetadata` JSON, fully annotated. Click the `+` markers to jump to a field's explanation. Fields are ordered by how often you'll actually use them: top-level → labels → speakers → augmentation (Tier B only) → provenance (diagnostic, usually skip).
 
----
-
-## Loading with Pydantic
-
-```python
-from synthbanshee.labels.schema import ClipMetadata  # requires SynthBanshee installed
-from pathlib import Path
-
-meta = ClipMetadata.model_validate_json(
-    Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text()
-)
-print(meta.clip_id, meta.violence_typology, meta.weak_label.has_violence)
-# sp_sv_a_0001_00 SV True
-```
-
-Plain JSON (no SynthBanshee required):
-
-```python
-import json
-from pathlib import Path
-
-meta = json.loads(Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text())
-```
+The authoritative Pydantic model lives in [SynthBanshee `synthbanshee/labels/schema.py`](https://github.com/DataHackIL/SynthBanshee/blob/main/synthbanshee/labels/schema.py). For day-to-day consumer work, `json.loads()` is fine.
 
 ---
 
-## Top-level `ClipMetadata` fields
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `clip_id` | `str` | Lowercase ASCII clip identifier, e.g. `sp_sv_a_0001_00` |
-| `project` | `str` | `she_proves` or `elephant_in_the_room` |
-| `language` | `str` | ISO 639-1, always `"he"` |
-| `violence_typology` | `str` | `SV` / `IT` / `NEG` / `NEU` — see [taxonomy](taxonomy.md) |
-| `tier` | `str` | `"A"` (clean) or `"B"` (room-augmented) |
-| `duration_seconds` | `float` | Duration of the processed WAV |
-| `sample_rate` | `int` | Always `16000` |
-| `channels` | `int` | Always `1` |
-| `is_synthetic` | `bool` | Always `true` in this corpus |
-| `generator_version` | `str` | SynthBanshee semver, e.g. `"0.1.0"` |
-| `generation_date` | `str` | ISO 8601 date of generation |
-| `random_seed` | `int` | Scene-level RNG seed for reproducibility |
-| `scene_config` | `str` | Relative path to the scene YAML in SynthBanshee |
-| `transcript_path` | `str` | Repo-relative POSIX path to the `.txt` transcript |
-| `dirty_file_path` | `str` | Repo-relative POSIX path to the pre-preprocessing WAV |
-| `speakers` | `list[SpeakerInfo]` | Speaker metadata — see below |
-| `weak_label` | `WeakLabel` | Clip-level summary labels |
-| `generation_metadata` | `GenerationMetadata \| null` | Pipeline provenance — see below |
-| `preprocessing_applied` | `PreprocessingApplied` | What preprocessing steps ran |
-| `acoustic_scene` | `AcousticScene` | Room/device augmentation (Tier B) |
-| `quality_flags` | `list[str]` | QA flags, e.g. `["emotion_downgrade"]` |
-| `snr_db_estimated` | `float \| null` | Estimated SNR (not always populated) |
-| `annotator_confidence` | `float` | Auto-label confidence, 0–1 (auto-generated: always `1.0`) |
-| `iaa_reviewed` | `bool` | Whether inter-annotator agreement review was done |
-| `she_proves_meta` | `null` | Reserved for She-Proves–specific metadata (future) |
-| `elephant_meta` | `null` | Reserved for Elephant–specific metadata (future) |
-
----
-
-## `SpeakerInfo`
-
-One entry per speaker in `speakers[]`.
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `speaker_id` | `str` | UPPERCASE persona ID, e.g. `AGG_M_30-45_001` |
-| `role` | `str` | `AGG` (aggressor), `VIC` (victim), `SW` (social worker), `BEN` (beneficiary/client) |
-| `gender` | `str` | `"male"` or `"female"` |
-| `age_range` | `str` | e.g. `"30-45"` |
-| `tts_voice_id` | `str` | TTS voice identifier, e.g. `"he-IL-AvriNeural"` |
-| `voice_family` | `str` | Same as `tts_voice_id` (may diverge in future) |
-
-??? info "Speaker ID casing convention"
-    The `speaker_id` field in JSON is always **UPPERCASE**: `AGG_M_30-45_001`.
-    The on-disk directory is **lowercase**: `agg_m_30-45_001/`.
-    This is a deliberate per-surface casing rule — see [SynthBanshee spec §2.5](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#25-filename-constraints).
-
----
-
-## `WeakLabel`
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `has_violence` | `bool` | `any(e.tier1_category != "NONE" for e in events)` — see [taxonomy](taxonomy.md#has_violence-the-correct-derivation) |
-| `violence_typology` | `str` | Mirrors top-level `violence_typology` |
-| `max_intensity` | `int` | Highest per-turn intensity across the clip (1–5) |
-| `violence_categories` | `list[str]` | Distinct `tier1_category` values observed in events |
-
----
+## Annotated example
+
+```json
+{
+  "clip_id": "sp_sv_a_0001_00",                       // (1)!
+  "project": "she_proves",                            // (2)!
+  "language": "he",                                   // (3)!
+  "violence_typology": "SV",                          // (4)!
+  "tier": "A",                                        // (5)!
+  "duration_seconds": 110.46,                         // (6)!
+  "sample_rate": 16000,                               // (7)!
+  "channels": 1,
+  "is_synthetic": true,                               // (8)!
+
+  "weak_label": {                                     // (9)!
+    "has_violence": true,
+    "violence_typology": "SV",
+    "max_intensity": 5,
+    "violence_categories": ["DIST", "PHYS", "VERB"]
+  },
+
+  "speakers": [                                       // (10)!
+    {
+      "speaker_id": "AGG_M_30-45_001",
+      "role": "AGG",
+      "gender": "male",
+      "age_range": "30-45",
+      "tts_voice_id": "he-IL-AvriNeural",
+      "voice_family": "he-IL-AvriNeural"
+    },
+    {
+      "speaker_id": "VIC_F_25-40_002",
+      "role": "VIC",
+      "gender": "female",
+      "age_range": "25-40",
+      "tts_voice_id": "he-IL-HilaNeural",
+      "voice_family": "he-IL-HilaNeural"
+    }
+  ],
+
+  "transcript_path":  "data/he/agg_m_30-45_001/sp_sv_a_0001_00.txt",        // (11)!
+  "dirty_file_path":  "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav",      // (12)!
+
+  "quality_flags": ["emotion_downgrade"],             // (13)!
+
+  "acoustic_scene": {                                 // (14)!
+    "room_type": null,
+    "device": null,
+    "ir_source": null,
+    "snr_db_actual": null,
+    "speaker_distance_meters": null,
+    "background_events": []
+  },
+
+  "preprocessing_applied": {                          // (15)!
+    "resampled_to_16k":    true,
+    "downmixed_to_mono":   true,
+    "normalized_dbfs":    -2.0000002,
+    "silence_padded":      true,
+    "denoised":            true,
+    "spectral_filtered":   true
+  },
+
+  "generation_metadata": { /* ...see below... */ },   // (16)!
+
+  "generator_version": "0.1.0",                       // (17)!
+  "generation_date":   "2026-05-12",
+  "random_seed":       1201,
+  "scene_config":      "configs/scenes/she_proves/sp_sv_a_0001.yaml",
+  "snr_db_estimated":  null,                          // (18)!
+  "annotator_confidence": 1.0,                        // (19)!
+  "iaa_reviewed":         false,
+  "she_proves_meta":      null,                       // (20)!
+  "elephant_meta":        null
+}
+```
 
-## `GenerationMetadata`
-
-Present on all delivery-003 clips; may be `null` on older clips.
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `pipeline_version` | `str` | SynthBanshee semver |
-| `tts_backend` | `dict[str, str]` | Speaker ID → `"azure"` or `"google"` |
-| `voice_family` | `dict[str, str]` | Speaker ID → voice family string |
-| `mix_mode_used` | `str` | `"sequential"` (turns in order) or `"overlapping"` |
-| `normalization_strategy` | `str` | `"per_turn_rms_v2_target_peak"` |
-| `loudness_target_peak_dbfs` | `float` | Configured peak target, e.g. `-2.0` |
-| `breathiness_applied` | `bool` | Whether breathiness augmentation was applied |
-| `effective_prosody_caps` | `list[ProsodyCap]` | Per-turn cap activations at I3–I5 |
-| `speaker_state_serialized` | `dict[str, SpeakerState]` | Final prosody state per speaker |
-| `prosody_controller_version` | `str \| null` | Version of the prosody controller |
-| `text_normalization_version` | `str \| null` | Version of text normalization |
-| `timing_controller_version` | `str \| null` | Version of timing controller |
-
-### `ProsodyCap` (entry in `effective_prosody_caps`)
-
-| Field | Description |
-|-------|-------------|
-| `turn_index` | Zero-based turn index |
-| `intensity` | Intensity score for that turn |
-| `dim` | `"pitch"` or `"rate"` |
-| `pre_cap` | Prosody value before capping (semitones for pitch, ratio for rate) |
-| `post_cap` | Prosody value after capping |
-
-### `SpeakerState` (entry in `speaker_state_serialized`)
-
-| Field | Description |
-|-------|-------------|
-| `pitch_offset_st` | Final pitch offset in semitones |
-| `rate_offset` | Final speaking rate multiplier |
-| `volume_offset_db` | Final volume offset in dB |
-| `breathiness_level` | Breathiness level 0–1 |
+1.  Lowercase ASCII clip identifier. Pattern: `{project_prefix}_{typology}_{tier}_{scene_num}_{take}`.
+2.  `she_proves` or `elephant_in_the_room`. Determines the clip-id prefix (`sp_*` / `el_*`) and which `*_meta` field is reserved for the clip's project-specific metadata (both are `null` in current deliveries).
+3.  ISO 639-1 — always `"he"` in this corpus.
+4.  `SV` · `IT` · `NEG` · `NEU`. **Not** an ordered scale — see [Label Taxonomy](taxonomy.md). `NEG` is the hard-negative class (sounds intense, not violent).
+5.  `"A"` (clean, TTS only) or `"B"` (room IR + device profile + background noise applied). Determines whether `acoustic_scene` is populated.
+6.  Duration of the final processed WAV, **including** the silence pad (≥ 0.5 s) on each end.
+7.  Always 16000. Channels always 1. Format always 16-bit PCM WAV.
+8.  Always `true` in this corpus. The field exists because future real-recording deliveries will set it `false`.
+9.  Clip-level summary labels. `has_violence` is derived from events: `any(e.tier1_category != "NONE" for e in events)`. Don't derive it from typology; see [Gotcha #1](gotchas.md#1-dont-derive-has_violence-from-typology).
+10. One entry per speaker. The on-disk directory is **`speakers[0].speaker_id.lower()`** — UPPERCASE in JSON, lowercase on disk ([Gotcha #4](gotchas.md#4-uppercase-in-json-lowercase-on-disk)).
+11. Repo-relative POSIX path to the `.txt` transcript.
+12. Repo-relative POSIX path to the pre-preprocessing ("dirty") WAV, retained per spec. Useful for diagnosing normalization issues. **Don't modify it**: `assets/` is managed by SynthBanshee.
+13. Soft warnings. Don't filter on these reflexively; they don't fail validation ([Gotcha #7](gotchas.md#7-quality_flags-doesnt-mean-broken)). Most common: `emotion_downgrade` (TTS produced slightly less intense prosody than requested), `vic_f0_high` (Google female F0 above Azure baseline; expected on the 2 Google clips).
+14. Populated for Tier B (Elephant) clips; all `null` / empty for Tier A. See [Elephant guide](elephant.md#the-acoustic_scene-field).
+15. Records *what* preprocessing ran. `normalized_dbfs` is the **measured** post-preprocess peak — pair with `generation_metadata.loudness_target_peak_dbfs` (the configured target) to diagnose loudness drift.
+16. Pipeline provenance. Always present on delivery-003 clips; may be `null` on older clips. Expanded below.
+17. SynthBanshee version that produced this clip. Together with `random_seed` and `scene_config`, it makes the scene reproducible.
+18. Estimated SNR — not populated for any current delivery. Use `acoustic_scene.snr_db_actual` for Tier B.
+19. Auto-label confidence; always `1.0` because labels are generated by the pipeline (not human-annotated). `iaa_reviewed` is always `false` for the same reason.
+20. Reserved for per-project metadata. Always `null` in current deliveries.
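+
+The derivation rule in annotation 9 can be sketched as a one-line check. The function name `derive_has_violence` is hypothetical, and the event dicts below are trimmed stand-ins for `.jsonl` rows rather than real corpus data.
+
+```python
+def derive_has_violence(events: list[dict]) -> bool:
+    """True if any event carries a tier1_category other than "NONE"."""
+    return any(e["tier1_category"] != "NONE" for e in events)
+
+
+# Illustrative event lists, trimmed to the one field the rule reads.
+violent = [{"tier1_category": "NONE"}, {"tier1_category": "VERB"}]
+calm = [{"tier1_category": "NONE"}, {"tier1_category": "NONE"}]
+print(derive_has_violence(violent), derive_has_violence(calm))
+```
+
+Comparing this value against `weak_label.has_violence` for each clip is a cheap consistency check on a fresh download.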
 
 ---
 
-## `PreprocessingApplied`
+## `generation_metadata` — pipeline provenance
+
+Expanded view of field (16). Use this block for diagnostics, not for filtering training data.
+
+```json
+{
+  "pipeline_version":      "0.1.0",
+  "tts_backend":           {"AGG_M_30-45_001": "azure",  "VIC_F_25-40_002": "azure"},
+  "voice_family":          {"AGG_M_30-45_001": "he-IL-AvriNeural", "VIC_F_25-40_002": "he-IL-HilaNeural"},
+  "mix_mode_used":         "sequential",
+  "normalization_strategy":  "per_turn_rms_v2_target_peak",   // internal version string; informational
+  "loudness_target_peak_dbfs": -2.0,
+  "breathiness_applied":   false,
+  "effective_prosody_caps": [                                  // per-turn cap activations (most common at I3+)
+    {"turn_index": 1, "intensity": 2, "dim": "rate",  "pre_cap": 0.912, "post_cap": 0.95},
+    {"turn_index": 4, "intensity": 4, "dim": "pitch", "pre_cap": 2.348, "post_cap": 2.0}
+  ],
+  "speaker_state_serialized": {
+    "AGG_M_30-45_001": {"pitch_offset_st": 1.40, "rate_offset": 1.14, "volume_offset_db":  3.80, "breathiness_level": 0.0},
+    "VIC_F_25-40_002": {"pitch_offset_st": 0.56, "rate_offset": 0.89, "volume_offset_db": -2.58, "breathiness_level": 0.0}
+  }
+}
+```
 
-| Field | Type | Description |
-|-------|------|-------------|
-| `resampled_to_16k` | `bool` | Whether sample rate conversion ran |
-| `downmixed_to_mono` | `bool` | Whether channel downmix ran |
-| `normalized_dbfs` | `float` | **Measured** peak dBFS of the output WAV (not the target) |
-| `silence_padded` | `bool` | Whether silence padding was applied |
-| `denoised` | `bool` | Whether denoising ran |
-| `spectral_filtered` | `bool` | Whether spectral filtering ran |
+| Field | What it tells you |
+|-------|-------------------|
+| `tts_backend` | Per-speaker dict mapping speaker_id → `"azure"` or `"google"`. The corpus-level backend distribution is derived from this — don't look for a top-level `tts_engine` field, it was removed. |
+| `voice_family` | Per-speaker dict mapping speaker_id → voice ID. Currently identical to `speakers[].tts_voice_id`. |
+| `mix_mode_used` | `"sequential"` (turns in order) or `"overlapping"` (turns can overlap at I4+). Read it per clip rather than inferring it from typology or intensity: the SV example above peaks at I5 and still records `"sequential"`. |
+| `loudness_target_peak_dbfs` | The **configured** peak target (`-2.0` dBFS by default). Pair with `preprocessing_applied.normalized_dbfs` (the measured peak) to detect drift. |
+| `effective_prosody_caps` | Per-turn list of cap activations, recorded whenever the LLM-suggested pitch or rate fell outside the safety bounds and was clamped. Most common at I3+ in this delivery. `pre_cap` keeps the value the LLM asked for, so the intended "uncapped" prosody can be recovered. |
+| `speaker_state_serialized` | Final per-speaker prosody offset. Used for reproducing a scene with the same speaker drift. |
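+
+As a hedged sketch of the two diagnostics this table describes, the helpers below compute loudness drift for one clip and tally clips by backend. The function names are hypothetical, and grouping a clip by the *set* of backends its speakers used is an assumption about how the corpus-level distribution is counted.
+
+```python
+from collections import Counter
+
+
+def loudness_drift_db(meta: dict) -> float:
+    """Measured peak minus configured target, in dB (0.0 means on target)."""
+    measured = meta["preprocessing_applied"]["normalized_dbfs"]
+    target = meta["generation_metadata"]["loudness_target_peak_dbfs"]
+    return measured - target
+
+
+def backend_mix(metas: list[dict]) -> Counter:
+    """Count clips by the set of TTS backends their speakers used."""
+    counts = Counter()
+    for meta in metas:
+        gen = meta.get("generation_metadata") or {}  # may be null on older clips
+        backends = sorted(set((gen.get("tts_backend") or {}).values()))
+        counts["+".join(backends) or "unknown"] += 1
+    return counts
+
+
+# Trimmed stand-in for one ClipMetadata dict, using values from the example above.
+meta = {
+    "preprocessing_applied": {"normalized_dbfs": -2.0000002},
+    "generation_metadata": {
+        "loudness_target_peak_dbfs": -2.0,
+        "tts_backend": {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "azure"},
+    },
+}
+print(abs(loudness_drift_db(meta)) < 1e-6, dict(backend_mix([meta])))
+```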
 
-!!! note "`normalized_dbfs` is the measured peak, not the target"
-    Use `generation_metadata.loudness_target_peak_dbfs` for the configured target.
-    Use `preprocessing_applied.normalized_dbfs` to verify the actual output peak.
-    On delivery-003, both should be very close to `–2.0` (within floating-point precision).
+??? info "Internal version-string fields"
+    `normalization_strategy`, `prosody_controller_version`, `text_normalization_version`, `timing_controller_version` are internal version strings. They're recorded for provenance but you won't filter on them as a consumer.
 
 ---
 
-## `AcousticScene`
-
-Populated for Tier B clips. Null fields indicate Tier A (no augmentation).
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `room_type` | `str \| null` | e.g. `"clinic_office"`, `"welfare_office"`, `"open_office"` |
-| `device` | `str \| null` | e.g. `"pi_budget_mic"` |
-| `ir_source` | `str \| null` | Room impulse response source, e.g. `"pyroomacoustics_ism"` |
-| `snr_db_actual` | `float \| null` | Actual SNR after augmentation (dB) |
-| `speaker_distance_meters` | `float \| null` | Simulated speaker distance from microphone |
-| `background_events` | `list[BackgroundEvent]` | Non-speech acoustic events added |
-
-### `BackgroundEvent`
-
-| Field | Description |
-|-------|-------------|
-| `type` | `"hvac_hum"`, `"ACOU_SLAM"`, `"ACOU_FALL"`, etc. |
-| `onset` | Start time in seconds |
-| `offset` | End time in seconds |
-| `level_db` | Relative level of the event (dB) |
-
----
+## `EventLabel` — `.jsonl` rows
+
+One JSON object per line. One line per labelled event. Read line by line: `json.loads()` on the whole file raises `json.JSONDecodeError` once it holds more than one event.
+
+```json
+{
+  "event_id":       "sp_sv_a_0001_00_EVT_004",
+  "clip_id":        "sp_sv_a_0001_00",
+  "onset":          36.736,
+  "offset":         46.552,
+  "tier1_category": "DIST",
+  "tier2_subtype":  "DIST_SCREAM",
+  "intensity":      4,
+  "speaker_id":     "AGG_M_30-45_001",
+  "speaker_role":   "AGG",
+  "emotional_state": "anger",
+  "confidence":      1.0,
+  "label_source":    "auto",
+  "iaa_reviewed":    false,
+  "truncated":       false,
+  "notes":           null
+}
+```
 
-## `EventLabel` (`.jsonl` rows)
-
-One JSON object per line. Each represents a single labelled event within the clip.
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `event_id` | `str` | `{clip_id}_EVT_{index:03d}` |
-| `clip_id` | `str` | Parent clip ID |
-| `onset` | `float` | Event start time in seconds (in the processed WAV) |
-| `offset` | `float` | Event end time in seconds |
-| `tier1_category` | `str` | `VERB` / `DIST` / `PHYS` / `EMOT` / `ACOU` / `NONE` |
-| `tier2_subtype` | `str` | e.g. `VERB_SHOUT`, `PHYS_HARD` |
-| `intensity` | `int` | Turn intensity 1–5 |
-| `speaker_id` | `str` | UPPERCASE speaker persona ID |
-| `speaker_role` | `str` | `AGG`, `VIC`, `SW`, `BEN` |
-| `emotional_state` | `str` | e.g. `"anger"`, `"fear"`, `"desperation"`, `"neutral"` |
-| `confidence` | `float` | Auto-label confidence (always `1.0` for auto-generated) |
-| `label_source` | `str` | `"auto"` for all current clips |
-| `iaa_reviewed` | `bool` | Always `false` in current deliveries |
-| `truncated` | `bool` | Whether the event was cut short by a turn boundary |
-| `notes` | `str \| null` | Annotator notes |
+| Field | Notes |
+|-------|-------|
+| `onset` / `offset` | Seconds in the **final processed WAV**. Already shifted to account for the leading silence pad (≥ 0.5 s). |
+| `tier1_category` | `VERB` · `DIST` · `PHYS` · `EMOT` · `ACOU` · `NONE`. See [Label Taxonomy](taxonomy.md). |
+| `tier2_subtype`  | e.g. `VERB_SHOUT`, `DIST_SCREAM`, `PHYS_HARD`, `ACOU_SLAM`. |
+| `intensity`      | The intensity of the turn the event belongs to (1–5). |
+| `speaker_id`     | UPPERCASE. Matches one of `ClipMetadata.speakers[].speaker_id`. |
+| `speaker_role`   | `AGG` · `VIC` · `SW` · `BEN`. See [Glossary](glossary.md#speaker-roles). |
+| `emotional_state` | Free-text label of speaker emotion at this turn (e.g. `"anger"`, `"fear"`, `"desperation"`, `"neutral"`). |
+| `confidence`     | Always `1.0` (labels are auto-generated). |
+| `label_source`   | Always `"auto"`. |
+| `iaa_reviewed`   | Always `false` in current deliveries — no human inter-annotator agreement review yet. |
+| `truncated`      | `true` if the event was cut short by a turn boundary. |
 
 ---
 
 ## Manifest CSV columns
 
-`data/he/manifest.csv` — one row per clip.
+`data/he/manifest.csv` — one row per clip, the fastest entry point for filtering.
 
 | Column | Type | Notes |
 |--------|------|-------|
-| `clip_id` | str | Matches JSON `clip_id` |
+| `clip_id` | str | Matches `ClipMetadata.clip_id` |
 | `project` | str | `she_proves` / `elephant_in_the_room` |
 | `violence_typology` | str | `SV` / `IT` / `NEG` / `NEU` |
 | `tier` | str | `A` / `B` |
-| `duration_seconds` | float | |
-| `speaker_ids` | str | Pipe-delimited, e.g. `AGG_M_30-45_001\|VIC_F_25-40_002` |
-| `voice_families` | str | Pipe-delimited, matches `speaker_ids` order |
-| `has_violence` | bool | See [taxonomy](taxonomy.md#has_violence-the-correct-derivation) |
+| `duration_seconds` | float | Final WAV duration including pads |
+| `speaker_ids` | str | **Pipe-delimited**, e.g. `AGG_M_30-45_001\|VIC_F_25-40_002` |
+| `voice_families` | str | **Pipe-delimited**, same order as `speaker_ids` |
+| `has_violence` | bool | Derived from events — see [Gotcha #1](gotchas.md#1-dont-derive-has_violence-from-typology) |
 | `max_intensity` | int | 1–5 |
-| `quality_flags` | str | Comma-delimited flag list |
-| `split` | str | `train` / `val` / `test` — all `train` in delivery-003 |
-| `wav_path` | str | Repo-relative POSIX path |
-| `strong_labels_path` | str | Repo-relative POSIX path to `.jsonl` |
+| `quality_flags` | str | Comma-delimited soft warnings |
+| `split` | str | `train` / `val` / `test` — **all `train`** in delivery-003 ([Gotcha #6](gotchas.md#6-all-clips-are-split-train-in-delivery-003)) |
+| `wav_path` | str | Repo-relative POSIX path to the `.wav` |
+| `strong_labels_path` | str | Repo-relative POSIX path to the `.jsonl` |
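+
+A short sketch of unpacking the delimited columns, assuming the separator stored in the CSV is a plain `|` (the backslash in the table above is only Markdown escaping). The helper names are hypothetical, and the sample row is trimmed to the columns used here.
+
+```python
+import csv
+import io
+
+
+def speaker_voice_map(row: dict) -> dict[str, str]:
+    """Zip the pipe-delimited speaker_ids and voice_families columns of one row."""
+    speakers = row["speaker_ids"].split("|")
+    voices = row["voice_families"].split("|")
+    if len(speakers) != len(voices):
+        raise ValueError(f"column length mismatch in {row['clip_id']}")
+    return dict(zip(speakers, voices))
+
+
+def flag_list(row: dict) -> list[str]:
+    """Split the comma-delimited quality_flags column; an empty cell gives []."""
+    return [f.strip() for f in row["quality_flags"].split(",") if f.strip()]
+
+
+# Trimmed stand-in for manifest.csv (the real file has more columns).
+sample = (
+    "clip_id,speaker_ids,voice_families,quality_flags\n"
+    "sp_sv_a_0001_00,AGG_M_30-45_001|VIC_F_25-40_002,"
+    "he-IL-AvriNeural|he-IL-HilaNeural,emotion_downgrade\n"
+)
+row = next(csv.DictReader(io.StringIO(sample)))
+print(speaker_voice_map(row), flag_list(row))
+```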
+
+---
+
+## Transcript file format (`.txt`)
+
+Plain UTF-8. One turn = one header line + one or more text lines + one action line. Hebrew text only; no Latin script in the body.
+
+```
+[CLIP_ID: sp_sv_a_0001_00]
+[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07]
+מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית.
+[ACTION: VERB_SHOUT | INTENSITY: 2]
+[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74]
+עבדתי עד שש היום. עשיתי מה שהספקתי...
+[ACTION: VERB_SHOUT | INTENSITY: 2]
+```
+
+- The first line is a single `[CLIP_ID: ...]` header.
+- Each subsequent turn is a `[SPEAKER: ... | ROLE: ... | ONSET: ... | OFFSET: ...]` line, the Hebrew text, then `[ACTION: <tier2_subtype> | INTENSITY: 1–5]`.
+- `ONSET` / `OFFSET` are in seconds, relative to the final processed WAV (they already include the leading pad).
+- The `.jsonl` strong labels are the canonical source for events; the transcript is for human reading and as an ASR reference.
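+
+For readers who do want to parse the transcript, the layout above could be read with a sketch like this one. The regular expressions are inferred from the single example shown and are an assumption, not a published grammar; `parse_transcript` is a hypothetical name.
+
+```python
+import re
+
+SPEAKER_RE = re.compile(
+    r"\[SPEAKER: (?P<speaker_id>\S+) \| ROLE: (?P<role>\w+) \| "
+    r"ONSET: (?P<onset>[\d.]+) \| OFFSET: (?P<offset>[\d.]+)\]"
+)
+ACTION_RE = re.compile(r"\[ACTION: (?P<subtype>\w+) \| INTENSITY: (?P<intensity>\d)\]")
+
+
+def parse_transcript(text: str) -> list[dict]:
+    """Return one dict per turn: speaker header, text lines, closing action line."""
+    turns, current = [], None
+    for raw in text.split("\n"):
+        line = raw.strip()
+        if not line or line.startswith("[CLIP_ID:"):
+            continue
+        speaker = SPEAKER_RE.fullmatch(line)
+        action = ACTION_RE.fullmatch(line)
+        if speaker:
+            current = {
+                "speaker_id": speaker["speaker_id"],
+                "role": speaker["role"],
+                "onset": float(speaker["onset"]),
+                "offset": float(speaker["offset"]),
+                "text_lines": [],
+            }
+        elif action and current is not None:
+            current["subtype"] = action["subtype"]
+            current["intensity"] = int(action["intensity"])
+            turns.append(current)  # the action line closes the turn
+            current = None
+        elif current is not None:
+            current["text_lines"].append(line)
+    return turns
+
+
+sample = """[CLIP_ID: sp_sv_a_0001_00]
+[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07]
+מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית.
+[ACTION: VERB_SHOUT | INTENSITY: 2]
+[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74]
+עבדתי עד שש היום. עשיתי מה שהספקתי...
+[ACTION: VERB_SHOUT | INTENSITY: 2]
+"""
+turns = parse_transcript(sample)
+print(len(turns), turns[0]["speaker_id"], turns[0]["onset"])
+```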
diff --git a/docs/she-proves.md b/docs/she-proves.md
index fef7983..6be24c7 100644
--- a/docs/she-proves.md
+++ b/docs/she-proves.md
@@ -1,133 +1,117 @@
-# She-Proves Team Guide
+# She-Proves Guide
 
-She-Proves is a smartphone app that **passively monitors audio for domestic violence incidents** and preserves evidence for legal use.
+She-Proves is a smartphone app that passively monitors audio for domestic-violence incidents and preserves evidence for legal use. **Optimisation target: high recall** — better to flag for review than to miss.
 
-**Optimization target: high recall.** It is better to flag an incident for review than to miss one.
+This page is the *differential* between She-Proves clips and the rest of the corpus. For shared concepts (schema, labels, audio format) follow the cross-links.
 
 ---
 
-## Scene structure
+## Scene profile
 
-| Property | Value |
-|----------|-------|
-| Duration | 3–6 minutes |
-| Tier | A (clean — no room processing) |
-| Pre-incident window | ≥ 60% of clip duration before the first violence event |
-| Device profile | `phone_in_pocket`, `phone_on_table`, `phone_in_hand` |
-| Room types | apartment rooms (living room, bedroom, kitchen) |
-| Language | Hebrew (`he`) |
+| | |
+|---|---|
+| Project code | `she_proves` (clip-id prefix `sp_*`) |
+| Tier | A — clean audio, no room/device augmentation |
+| Duration | 3–6 min |
+| Pre-incident window | ≥ 60% of clip is normal speech before the first violence event |
+| Device | `phone_in_pocket`, `phone_on_table`, `phone_in_hand` (planned; not active in delivery-003) |
+| Room | Apartment (living room, bedroom, kitchen) — planned; not active in delivery-003 |
 
-The long pre-incident window reflects real-world deployment: the app is always listening, and incidents are rare. Models trained on this data should handle extended periods of mundane speech before a rapid escalation.
+The long pre-incident window is intentional. In deployment the app is always listening; incidents are rare. A model trained only on escalation segments never sees the gradual buildup that can precede an incident, so it cannot learn to use it.
 
-??? info "Tier A — what does 'clean' mean?"
-    Tier A clips have **no acoustic augmentation** — no room impulse response convolution, no device frequency response, no background noise injection. The audio is the direct TTS-mixer output after preprocessing: peak-normalized, silence-padded, 16 kHz mono 16-bit PCM.
+!!! note "What 'Tier A' means here"
+    Tier A audio is the direct TTS-mixer output after preprocessing — peak-normalised, silence-padded, 16 kHz mono PCM. No room IR, no microphone profile, no background noise. `acoustic_scene.room_type`, `device`, `ir_source`, and `snr_db_actual` are all `null` for every Tier A clip.
 
-    For Tier A, `acoustic_scene.room_type`, `device`, `ir_source`, and `snr_db_actual` are all `null`.
-
-    Tier B (used by Elephant) adds all of the above. See [Elephant in the Room](elephant.md) for details.
+    Delivery-003 has no Tier-A device augmentation yet (the `phone_in_pocket` etc. profiles exist in the pipeline but aren't applied at this stage). When that's added in a future delivery, the `acoustic_scene` block will start carrying `device` while keeping `room_type` null.
 
 ---
 
 ## Speaker pairs
 
-Delivery-003 has two She-Proves speaker pairs — one per TTS backend.
+Two pairs in delivery-003, one per TTS backend. Both pairs play the **AGG (aggressor, male) + VIC (victim, female)** roles.
+
+=== "Azure pair (10 clips)"
+    Speaker directory: `data/he/agg_m_30-45_001/`
+
+    | Role | speaker_id | TTS voice |
+    |------|-----------|-----------|
+    | AGG  | `AGG_M_30-45_001` | `he-IL-AvriNeural` |
+    | VIC  | `VIC_F_25-40_002` | `he-IL-HilaNeural` |
+
+=== "Google Chirp HD pair (2 clips)"
+    Speaker directory: `data/he/agg_m_30-45_002/`
 
-| Pair | Speaker dir | Male speaker | Female speaker | Backend |
-|------|-------------|--------------|----------------|---------|
-| Azure | `agg_m_30-45_001/` | `AGG_M_30-45_001` → `he-IL-AvriNeural` | `VIC_F_25-40_002` → `he-IL-HilaNeural` | Azure |
-| Google Chirp HD | `agg_m_30-45_002/` | `AGG_M_30-45_002` → `he-IL-Chirp3-HD-Achird` | `VIC_F_25-40_003` → `he-IL-Chirp3-HD-Achernar` | Google |
+    | Role | speaker_id | TTS voice |
+    |------|-----------|-----------|
+    | AGG  | `AGG_M_30-45_002` | `he-IL-Chirp3-HD-Achird`   |
+    | VIC  | `VIC_F_25-40_003` | `he-IL-Chirp3-HD-Achernar` |
 
-Both pairs play **AGG (aggressor, male) + VIC (victim, female)** roles. The Google pair was added in delivery-003 specifically to introduce backend diversity.
+    The Google pair was added in delivery-003 specifically to introduce backend diversity. Both clips carry a `vic_f0_high` flag — see [Audio Format](audio-format.md#vic_f0_high-on-the-2-google-clips).
 
-!!! note "Two speaker directories"
-    Clips from the Azure pair live under `data/he/agg_m_30-45_001/`.
-    Clips from the Google pair live under `data/he/agg_m_30-45_002/`.
-    Downstream code that hardcodes `agg_m_30-45_001/` will miss the Google clips.
-    Use `manifest.csv` or filter `meta["generation_metadata"]["tts_backend"]` to find both.
+[Gotcha #2: don't hardcode `agg_m_30-45_001/`](gotchas.md#2-dont-hardcode-speaker-directory-paths) — three speaker directories exist now, including one for Elephant. Filter on `manifest.csv["project"] == "she_proves"` or on `meta["project"]`.
 
 ---
 
 ## Clips in delivery-003
 
-### Azure pair — 10 clips
+**12 clips · ~24 min · 6 violent (`SV` + `IT`), 6 non-violent (`NEG` + `NEU`)**
 
-`data/he/agg_m_30-45_001/`
+??? abstract "Full clip listing"
+    Azure pair, `data/he/agg_m_30-45_001/` — 10 clips:
 
-| Clip ID | Typology | `has_violence` | Duration |
-|---------|----------|:---:|------:|
-| `sp_sv_a_0001_00` | SV | ✓ | 1m 50.5s |
-| `sp_sv_a_0002_00` | SV | ✓ | 1m 32.1s |
-| `sp_it_a_0001_00` | IT | ✓ | 2m 23.8s |
-| `sp_it_a_0002_00` | IT | ✓ | 2m 19.7s |
-| `sp_neg_a_0001_00` | NEG | — | 1m 58.8s |
-| `sp_neg_a_0002_00` | NEG | — | 1m 47.8s |
-| `sp_neg_a_0003_00` | NEG | — | 2m 26.3s |
-| `sp_neu_a_0001_00` | NEU | — | 1m 59.2s |
-| `sp_neu_a_0002_00` | NEU | — | 2m 09.0s |
-| `sp_neu_a_0003_00` | NEU | — | 1m 45.1s |
+    | Clip ID | Typology | violent | Duration |
+    |---------|----------|:---:|---------:|
+    | `sp_sv_a_0001_00`  | SV  | ✓ | 1m 50.5s |
+    | `sp_sv_a_0002_00`  | SV  | ✓ | 1m 32.1s |
+    | `sp_it_a_0001_00`  | IT  | ✓ | 2m 23.8s |
+    | `sp_it_a_0002_00`  | IT  | ✓ | 2m 19.7s |
+    | `sp_neg_a_0001_00` | NEG | — | 1m 58.8s |
+    | `sp_neg_a_0002_00` | NEG | — | 1m 47.8s |
+    | `sp_neg_a_0003_00` | NEG | — | 2m 26.3s |
+    | `sp_neu_a_0001_00` | NEU | — | 1m 59.2s |
+    | `sp_neu_a_0002_00` | NEU | — | 2m 09.0s |
+    | `sp_neu_a_0003_00` | NEU | — | 1m 45.1s |
 
-### Google Chirp HD pair — 2 clips
+    Google Chirp HD pair, `data/he/agg_m_30-45_002/` — 2 clips:
 
-`data/he/agg_m_30-45_002/`
+    | Clip ID | Typology | violent | Duration | Flags |
+    |---------|----------|:---:|---------:|-------|
+    | `sp_sv_a_0003_00` | SV | ✓ | 1m 42.8s | `vic_f0_high` |
+    | `sp_it_a_0003_00` | IT | ✓ | 1m 53.9s | `vic_f0_high` |
 
-| Clip ID | Typology | `has_violence` | Duration | Note |
-|---------|----------|:---:|------:|------|
-| `sp_sv_a_0003_00` | SV | ✓ | 1m 42.8s | `vic_f0_high` flag |
-| `sp_it_a_0003_00` | IT | ✓ | 1m 53.9s | `vic_f0_high` flag |
-
-The `vic_f0_high` flag on the Google clips indicates the female voice (`he-IL-Chirp3-HD-Achernar`) has a higher F0 baseline than the Azure Hila reference. See [Audio Format → vic_f0_high](audio-format.md#vic_f0_high-google-chirp-hd-female-f0-baseline).
+The waveform on the [home page](index.md#see-it-first) is `sp_sv_a_0001_00` — a worked example of an SV escalation arc in this project's data.
 
 ---
 
-## Loading She-Proves clips
+## Loading just the She-Proves clips
 
 ```python
-import json
-import soundfile as sf
-import pandas as pd
+import pandas as pd
+import soundfile as sf
 from pathlib import Path
 
 root = Path(".")
-
-# Via manifest — easiest
 df = pd.read_csv("data/he/manifest.csv")
-sp_clips = df[df["project"] == "she_proves"]
-
-# Load all She-Proves audio
-wavs = {}
-for _, row in sp_clips.iterrows():
-    wav, sr = sf.read(root / row["wav_path"])
-    wavs[row["clip_id"]] = wav
-
-# Filter to violent She-Proves clips only
-sp_violent = sp_clips[sp_clips["has_violence"] == True]
-
-# Get per-backend split
-sp_clips["backend"] = sp_clips["voice_families"].apply(
-    lambda v: "google" if "Chirp" in v else "azure"
-)
-print(sp_clips.groupby("backend")["clip_id"].count())
-# azure    10
-# google    2
-```
+sp = df[df["project"] == "she_proves"]                   # 12 rows
 
----
-
-## Guidance for model training
-
-!!! warning "This is a toy corpus — not for production training"
-    12 She-Proves clips (10 Azure + 2 Google) are not enough for training a production model. Use this delivery to validate your data pipeline and schema parsing. Full-scale data follows.
+# Tag backend per row (Google clips have "Chirp" in voice_families)
+sp = sp.assign(backend=sp["voice_families"].str.contains("Chirp").map({True: "google", False: "azure"}))
+print(sp.groupby("backend")["clip_id"].count())
+# azure     10
+# google     2
 
-**High-recall orientation:**
-
-- **NEG clips are your hardest negatives.** They contain intense speech (raised voices, arguments, crying) with `has_violence: false`. Your recall model must not fire on them.
-- **The pre-incident window** (first 60% of the clip) will look like NEU/low-intensity speech. Include it in your training windows — models that only see escalated segments will miss early warning signals.
-- **Per-turn intensity** in the `.jsonl` events gives you fine-grained supervision beyond binary `has_violence`. Consider training an intensity regressor as an auxiliary objective.
+# Load audio for each row
+audio = {row.clip_id: sf.read(root / row.wav_path) for row in sp.itertuples()}
+```
 
-**Backend diversity:**
+---
 
-The 2 Google Chirp HD clips expose your feature extractor to a different F0 baseline and spectral profile. At small scale, they're useful for checking that your features don't overfit to Azure voice characteristics.
+## Training-time notes (specific to this project)
 
-**Speaker splits:**
+- **NEG clips are your hardest negatives.** `sp_neg_a_*` clips have raised voices, distress, and arguments, yet carry `has_violence: false`. A recall-tuned model that fires on them will tank precision. See [Gotcha #1](gotchas.md#1-dont-derive-has_violence-from-typology) and [Gotcha #5](gotchas.md#5-neg-is-not-violent-at-low-intensity).
+- **Use the pre-incident window.** At least the first 60% of each violent clip looks like NEU-grade speech. Train across the full clip, not only on escalation segments, because the early-warning signal lives in the buildup.
+- **Per-turn intensity is a useful auxiliary objective.** `EventLabel.intensity` gives turn-level supervision beyond binary `has_violence`. An intensity regressor trained alongside the classifier may help the classifier; validate the gain on your own eval set.
+- **Only 2 voice families per gender in this delivery** (`low_voice_diversity_*` is flagged at the corpus level). Expect your acoustic features to over-fit to AvriNeural and HilaNeural — track per-voice eval separately when the corpus grows.
+- **No device/room augmentation yet on She-Proves clips.** When the `phone_in_pocket` profile activates in a future delivery, your model will see substantially more high-frequency roll-off and handling noise than what's in delivery-003.
 
-All 12 clips share 2 unique speaker personas (4 if you count Azure+Google pairs separately). There are not enough speakers for a speaker-disjoint split in this delivery. Re-evaluate when the corpus scales to 100+ speakers.
+!!! warning "Still a small test batch"
+    12 clips and 4 voices are enough to wire up your data loaders, label parsers, and evaluation harness. They are not enough to train a production model. Build the plumbing; wait for the real batch.
diff --git a/docs/taxonomy.md b/docs/taxonomy.md
index 8154a06..973f5cd 100644
--- a/docs/taxonomy.md
+++ b/docs/taxonomy.md
@@ -1,126 +1,137 @@
 # Label Taxonomy
 
-Labels follow a three-level hierarchy. The **source of truth** is `taxonomy.yaml` in the [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) repo. Never derive labels from field names alone — always read from the actual data.
+Three levels: clip-level **typology** → event-level **tier 1 category** → event-level **tier 2 subtype**. Plus a per-turn **intensity** (1–5) that drives prosody. The source of truth is `taxonomy.yaml` in [SynthBanshee](https://github.com/DataHackIL/SynthBanshee).
 
 ---
 
-## Violence typologies (clip-level)
+## Violence typology (clip-level)
 
-The `violence_typology` field classifies the overall scenario of the clip.
+The `violence_typology` field. **Not** an ordered scale.
 
-| Typology | Full name | Description |
-|----------|-----------|-------------|
-| `SV` | Severe Violence | Physical violence, life-threatening escalation |
-| `IT` | Intimate Terrorism | Systematic coercive control, repeated verbal/emotional abuse |
-| `NEG` | Negative / Confusor | Acoustically intense but non-violent — anger, argument, distress, crying |
-| `NEU` | Neutral | Calm or mundane conversation with no violence markers |
+| Code | Name | What it sounds like |
+|------|------|---------------------|
+| `SV` | Severe Violence | Physical violence, life-threatening escalation. `tier1_category` includes `PHYS`, `DIST`, often `VERB`. |
+| `IT` | Intimate Terrorism | Sustained coercive control, repeated verbal/emotional abuse — typically without physical attack. Heavy on `VERB` and `EMOT`. |
+| `NEG` | Negative confusor | Acoustically intense but non-violent — anger, argument, distress, crying. **Hard negative class.** All events are `tier1_category: "NONE"`. |
+| `NEU` | Neutral | Calm or mundane conversation. No violence markers. |
 
-??? info "Why NEG is not the same as non-violent IT/SV"
-    NEG clips are designed as **hard negatives** — they sound intense and may have raised voices, crying, or confrontational tone, but no actual violence occurs. Their purpose is to train models to distinguish acoustic distress from violence.
-
-    Models that rely only on loudness or emotional tone will misclassify NEG clips. This is by design.
+!!! danger "NEG is the trap"
+    A NEG clip can have raised voices, crying, and `max_intensity: 3`. It will *sound* like violence to a model that only listens for loudness or emotional tone. But it is by definition `has_violence: false` — its purpose is to teach your model the difference between distress and violence. Training NEG as a positive class will collapse your precision. See [Gotcha #5](gotchas.md#5-neg-is-not-violent-at-low-intensity).
 
 ---
 
-## `has_violence` — the correct derivation
-
-`has_violence` is a **derived convenience field** computed from the strong-label events, not from typology:
+## `has_violence` — derived from events
 
 ```python
 has_violence = any(e["tier1_category"] != "NONE" for e in events)
 ```
 
-This means:
+That's the rule. Three consequences worth knowing:
 
-- `NEG` clips are **always** `has_violence: false`, regardless of `max_intensity` — by definition, every event in a NEG clip lands `tier1_category: "NONE"`.
-- A `NEU` clip with even one stray non-NONE event would be `has_violence: true` (shouldn't happen in a well-labelled corpus, but the rule is defensive).
+- **NEG clips are always `has_violence: false`** — every event in a NEG clip has `tier1_category: "NONE"` by construction, even when `max_intensity` is high.
+- **NEU clips are always `has_violence: false`** for the same reason.
+- An `SV` or `IT` clip is `has_violence: true` because at least one event has a non-NONE category.
 
-!!! danger "Do not re-derive `has_violence` from typology + intensity"
+!!! danger "Don't derive `has_violence` from typology or intensity"
     ```python
-    # WRONG — will misclassify every NEG clip
-    has_violence = typology in ("SV", "IT")
-
-    # CORRECT
-    has_violence = any(e["tier1_category"] != "NONE" for e in events)
+    has_violence = typology in ("SV", "IT")     # WRONG — works on this corpus but fragile
+    has_violence = max_intensity >= 3           # VERY WRONG — fires on every NEG clip
     ```
-    The taxonomy columns are the ground truth. `has_violence` exists only for fast filtering and baseline modelling — never use it as the sole training label.
+    The event-level taxonomy is the ground truth. `weak_label.has_violence` exists for fast filtering and baseline modelling only — never as the sole training label. Train on the strong-label events when you can.
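+
+A minimal sketch of recomputing the flag from a clip's strong labels, assuming each non-empty line of the `.jsonl` is one JSON-encoded `EventLabel` carrying a `tier1_category` key; `load_events` and `derive_has_violence` are illustrative helper names, not part of SynthBanshee:
+
+```python
+import json
+from pathlib import Path
+
+def load_events(jsonl_path):
+    """Read one JSON object per non-empty line of a strong-label file."""
+    with Path(jsonl_path).open(encoding="utf-8") as fh:
+        return [json.loads(line) for line in fh if line.strip()]
+
+def derive_has_violence(events):
+    """True if any event carries a non-NONE tier 1 category."""
+    return any(e["tier1_category"] != "NONE" for e in events)
+
+# In-memory stand-ins for two clips (toy events, not corpus data).
+neg_like = [{"tier1_category": "NONE"}, {"tier1_category": "NONE"}]
+sv_like = [{"tier1_category": "NONE"}, {"tier1_category": "VERB"}]
+print(derive_has_violence(neg_like), derive_has_violence(sv_like))
+```
+
+Running `derive_has_violence(load_events(path))` over each `strong_labels_path` and comparing with the manifest's `has_violence` column is a cheap consistency check for a new delivery.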
 
 ---
 
 ## Tier 1 categories (event-level)
 
-Each `EventLabel` in the `.jsonl` file has a `tier1_category`:
+The `tier1_category` field on each `EventLabel`. Six values.
 
-| Category | Description | Example contexts |
-|----------|-------------|-----------------|
-| `VERB` | Verbal violence — threats, shouting, demeaning language | Arguments, intimidation |
-| `DIST` | Distress vocalisations — screaming, crying under duress | Peak escalation turns |
-| `PHYS` | Physical violence cues — impact sounds, struggle | Severe violence scenes |
-| `EMOT` | Emotional manipulation — guilt-tripping, gaslighting | IT/coercive control |
-| `ACOU` | Acoustic events — object impacts, slams, falls | Background events in Tier B |
-| `NONE` | No violence — ambient speech, neutral turns | All NEU/NEG events |
+| Category | What it covers | Where it shows up |
+|----------|----------------|-------------------|
+| `VERB` | Verbal violence — threats, shouting, demeaning language | Most violent clips, all intensity levels |
+| `DIST` | Distress vocalisations — screaming, crying under duress | I3+ turns in SV/IT, peak escalation |
+| `PHYS` | Physical violence cues — impact sounds, struggle | I4+ turns in SV clips |
+| `EMOT` | Emotional manipulation — gaslighting, guilt-tripping | IT clips, coercive control turns |
+| `ACOU` | Acoustic non-vocal events — slams, falls | Tier B clips, recorded in `acoustic_scene.background_events` |
+| `NONE` | Ambient speech / neutral turn | All NEU clips, all NEG clips, calm turns in SV/IT |
 
-??? info "ACOU vs DIST"
-    `ACOU` captures **non-vocal acoustic cues** — a door slam, an object falling, an impact sound. These appear in Tier B clips as `background_events` in the `acoustic_scene` block.
+!!! info "ACOU vs DIST"
+    `ACOU` is **non-vocal** acoustic: a door slam, an object hitting the floor. `DIST` is **vocal distress**: a scream, crying. A scene where someone throws a glass and the victim screams will have an `ACOU_FALL` event for the thrown glass and a `DIST_SCREAM` event for the scream.
 
-    `DIST` captures **vocal distress** — screams, panic vocalisations, crying under coercion.
+    Tier B Elephant clips inject `ACOU_*` events as part of room augmentation; they show up both in `acoustic_scene.background_events` (with audio-level metadata) and in `.jsonl` strong labels (as labelled events). Tier A She-Proves clips can't produce ACOU events — there's no room-augmentation stage to add them.
 
 ---
 
 ## Tier 2 subtypes (event-level)
 
-| Tier 1 | Tier 2 subtype | Description |
-|--------|----------------|-------------|
-| VERB | `VERB_SHOUT` | Raised or shouted speech |
-| VERB | `VERB_THREAT` | Direct verbal threats |
-| VERB | `VERB_INSULT` | Demeaning or insulting language |
-| DIST | `DIST_SCREAM` | Distress scream or panic vocalisation |
-| DIST | `DIST_CRY` | Crying or sobbing under duress |
-| PHYS | `PHYS_HARD` | Hard physical impact cue |
-| PHYS | `PHYS_SOFT` | Softer physical contact cue |
-| EMOT | `EMOT_GASLIGHT` | Gaslighting or reality-denial |
-| EMOT | `EMOT_GUILT` | Guilt-tripping or emotional coercion |
-| ACOU | `ACOU_SLAM` | Object slam or door slam |
-| ACOU | `ACOU_FALL` | Object falling or thrown |
-| NONE | `NONE_AMBIENT` | Regular ambient speech or neutral turn |
+| Tier 1 | Tier 2 | Description |
+|--------|--------|-------------|
+| `VERB` | `VERB_SHOUT` | Raised or shouted speech |
+| `VERB` | `VERB_THREAT` | Direct verbal threats |
+| `VERB` | `VERB_INSULT` | Demeaning or insulting language |
+| `DIST` | `DIST_SCREAM` | Distress scream or panic vocalisation |
+| `DIST` | `DIST_CRY` | Crying or sobbing under duress |
+| `PHYS` | `PHYS_HARD` | Hard physical impact cue |
+| `PHYS` | `PHYS_SOFT` | Softer physical contact cue |
+| `EMOT` | `EMOT_GASLIGHT` | Gaslighting or reality-denial |
+| `EMOT` | `EMOT_GUILT` | Guilt-tripping or emotional coercion |
+| `ACOU` | `ACOU_SLAM` | Object slam or door slam |
+| `ACOU` | `ACOU_FALL` | Object falling or thrown |
+| `NONE` | `NONE_AMBIENT` | Regular ambient speech or neutral turn |
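+
+The table can double as a validation map when parsing events. A minimal sketch, assuming the event's subtype is stored under a `tier2_subtype` key (the key name is an assumption; check the schema reference); `tier1_of` and `is_consistent` are illustrative names:
+
+```python
+# Tier 2 subtype -> tier 1 category, copied from the table above.
+TIER2_TO_TIER1 = {
+    "VERB_SHOUT": "VERB",
+    "VERB_THREAT": "VERB",
+    "VERB_INSULT": "VERB",
+    "DIST_SCREAM": "DIST",
+    "DIST_CRY": "DIST",
+    "PHYS_HARD": "PHYS",
+    "PHYS_SOFT": "PHYS",
+    "EMOT_GASLIGHT": "EMOT",
+    "EMOT_GUILT": "EMOT",
+    "ACOU_SLAM": "ACOU",
+    "ACOU_FALL": "ACOU",
+    "NONE_AMBIENT": "NONE",
+}
+
+def tier1_of(tier2):
+    """Look up the tier 1 category for a tier 2 subtype; unknown subtypes raise."""
+    try:
+        return TIER2_TO_TIER1[tier2]
+    except KeyError:
+        raise ValueError(f"unknown tier 2 subtype: {tier2!r}") from None
+
+def is_consistent(event):
+    """Check that an event's tier 1 category matches its tier 2 subtype."""
+    return tier1_of(event["tier2_subtype"]) == event["tier1_category"]
+```
+
+Failing loudly on an unknown subtype is deliberate: `taxonomy.yaml` is the source of truth, so a new subtype should prompt an update to the map rather than pass silently.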
 
 ---
 
 ## Intensity scale (turn-level)
 
-Intensity is scored 1–5 per dialogue turn. It controls prosody generation (pitch, rate, volume) and determines which tier1/tier2 labels are applied.
+Each turn has an `intensity` in `[1, 5]`. It controls prosody generation (pitch, rate, volume) and the LLM script tone.
+
+| Score | Label | What's happening |
+|-------|-------|------------------|
+| 1 | Low tension | Calm conversation, mild undercurrent |
+| 2 | Moderate tension | Noticeable friction, raised voices |
+| 3 | Active conflict | Clear verbal aggression or intimidation |
+| 4 | Escalated violence | Physical or high-intensity verbal violence |
+| 5 | Extreme | Severe physical violence, panic, imminent danger |
+
+### How intensity and typology relate
+
+They are correlated but not the same.
+
+| Typology | Typical `max_intensity` range | Why |
+|----------|:-----------------------------:|-----|
+| `NEU` | 1–2 | Mundane conversation by definition |
+| `NEG` | 2–3 | Distressed but non-violent; intensity rises with shouting/crying, but every event stays `tier1_category: "NONE"` |
+| `IT` | 3–5 | Sustained verbal/emotional aggression; can hit I5 on threats without physical violence |
+| `SV` | 4–5 | Physical escalation requires I4+ turns |
 
-| Score | Label | Description | Prosody profile |
-|-------|-------|-------------|----------------|
-| 1 | Low tension | Calm conversation, mild undercurrent | Near-neutral |
-| 2 | Moderate tension | Noticeable friction, raised voices | Slightly raised pitch/rate |
-| 3 | Active conflict | Clear verbal aggression or intimidation | Elevated pitch, faster rate |
-| 4 | Escalated violence | Physical or high-intensity verbal violence | High pitch, fast rate, volume up |
-| 5 | Extreme / life-threatening | Severe physical violence, panic | Maximally expressive (capped) |
+In delivery-003 the actual distribution across all 20 clips is 10 clips at `max_intensity` 5, 4 at 3, and 6 at 2. Useful for designing stratified eval splits: if you want a balanced eval set across intensity *and* typology, you'll need to upsample (or wait for more data).
 
-??? info "The prosody cap at I4–I5"
-    At intensity 4–5, the LLM-generated prosody values are capped before SSML rendering to prevent Whisper transcription failures and maintain naturalness. The cap values are:
+??? info "What is the prosody cap?"
+    At I3+, the LLM-suggested prosody values are clamped before SSML rendering to keep speech natural and transcribable by Whisper:
 
-    - **Pitch:** max +2.0 semitones (post-cap)
-    - **Rate:** range [0.85, 1.20] (post-cap)
+    - **Pitch:** capped at +2.0 semitones (post-cap)
+    - **Rate:** clamped to [0.85, 1.20]
 
-    Any cap activation is recorded in `generation_metadata.effective_prosody_caps` per turn. You'll see many activations at I4–I5 in delivery-003 — this is expected. The cap was calibrated in a listening test in May 2026 (SynthBanshee PR #87).
+    When clamping fires, the pre- and post-cap values are recorded per turn in `generation_metadata.effective_prosody_caps`. You'll see many activations at I4–I5 in delivery-003 — that's the intended behaviour, calibrated by listening test in May 2026 (SynthBanshee PR #87).
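+
+The clamp itself is simple arithmetic. A minimal sketch of the two limits listed above, written for illustration only and not taken from SynthBanshee; it assumes pitch is expressed in semitones and rate as a multiplier, and it applies no lower pitch bound because none is stated here:
+
+```python
+PITCH_MAX_SEMITONES = 2.0
+RATE_MIN, RATE_MAX = 0.85, 1.20
+
+def apply_prosody_cap(pitch_semitones, rate):
+    """Return (capped_pitch, capped_rate, fired) for one turn's suggested prosody."""
+    capped_pitch = min(pitch_semitones, PITCH_MAX_SEMITONES)
+    capped_rate = min(max(rate, RATE_MIN), RATE_MAX)
+    fired = capped_pitch != pitch_semitones or capped_rate != rate
+    return capped_pitch, capped_rate, fired
+
+# A hypothetical high-intensity suggestion: both limits activate.
+print(apply_prosody_cap(3.5, 1.4))
+```
+
+The `fired` flag mirrors the idea of recording an activation per turn; the actual layout of `effective_prosody_caps` is defined by the schema, not by this sketch.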
 
 ---
 
-## Distribution in delivery-003
+## Where the labels come from
+
+- **Strong labels (`.jsonl`)** are generated by SynthBanshee from the LLM-authored script — the LLM produces turn-level intensity and an action tag (`VERB_SHOUT`, `DIST_SCREAM`, …), SynthBanshee converts them into `EventLabel` records.
+- **Weak labels (`.json` → `weak_label`)** are derived from the strong labels by aggregation.
+- **No human annotation has happened.** `confidence` is always `1.0`; `label_source` is always `"auto"`; `iaa_reviewed` is always `false`. Future deliveries may introduce human review on a subset — they'll set `iaa_reviewed: true` per clip when that happens.
+
+The scripts themselves are LLM-generated Hebrew dialogue, conditioned on the scene YAML and persona definitions in SynthBanshee. They are **not** transcripts of real conversations.
 
-| Typology | Clips | Projects | Tiers |
-|----------|------:|---------|-------|
-| SV | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
-| IT | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
-| NEG | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
-| NEU | 5 | she_proves (3) + elephant (2) | A (3) + B (2) |
+---
+
+## Distribution in delivery-003
 
-Intensity distribution across all 20 clips:
+| Typology | Clips | Tier A (she_proves) | Tier B (elephant) |
+|----------|------:|:-------------------:|:------------------:|
+| `SV` | 5 | 3 | 2 |
+| `IT` | 5 | 3 | 2 |
+| `NEG` | 5 | 3 | 2 |
+| `NEU` | 5 | 3 | 2 |
 
-| Max intensity | Clips |
-|:---:|:---:|
-| 5 | 10 |
-| 3 | 4 |
-| 2 | 6 |
+Balanced across typology within each project (3 clips per typology in She-Proves, 2 in Elephant), so the projects themselves are 12 vs 8 clips. Not balanced across speakers; see [Deliveries](deliveries.md#known-limitations).
diff --git a/mkdocs.yml b/mkdocs.yml
index 11dbc73..d6b6d04 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,5 +1,5 @@
-site_name: avdp-synth-corpus
-site_description: Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline — consumer guide for She-Proves and Elephant in the Room teams
+site_name: AVDP Synthetic Corpus
+site_description: Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline — consumer guide for the She-Proves and Elephant in the Room teams
 site_url: https://datahackil.github.io/avdp-synth-corpus/
 repo_url: https://github.com/DataHackIL/avdp-synth-corpus
 repo_name: DataHackIL/avdp-synth-corpus
@@ -26,7 +26,6 @@ theme:
     - navigation.tabs
     - navigation.tabs.sticky
     - navigation.sections
-    - navigation.expand
     - navigation.indexes
     - navigation.top
     - toc.follow
@@ -38,6 +37,9 @@ theme:
     - content.tabs.link
     - announce.dismiss
 
+extra_css:
+  - assets/extra.css
+
 markdown_extensions:
   - admonition
   - pymdownx.details
@@ -63,6 +65,7 @@ markdown_extensions:
   - toc:
       permalink: true
   - def_list
+  - abbr
 
 plugins:
   - search:
@@ -70,7 +73,8 @@ plugins:
 
 nav:
   - Home: index.md
-  - Getting Started: getting-started.md
+  - Start here: getting-started.md
+  - Common mistakes: gotchas.md
   - Team Guides:
     - She-Proves: she-proves.md
     - Elephant in the Room: elephant.md
@@ -78,6 +82,7 @@ nav:
     - Label Taxonomy: taxonomy.md
     - Schema Reference: schema.md
     - Audio Format: audio-format.md
+    - Glossary: glossary.md
   - Deliveries: deliveries.md
 
 extra: