DataHackIL · shaypal5 · May 12, 2026 · May 12, 2026
diff --git a/docs/assets/extra.css b/docs/assets/extra.css
@@ -0,0 +1,60 @@
+/* Status pill used in headers and front-page hero */
+.status-pill {
+  display: inline-block;
+  padding: 0.15rem 0.55rem;
+  border-radius: 0.4rem;
+  font-size: 0.72rem;
+  font-weight: 600;
+  letter-spacing: 0.03em;
+  text-transform: uppercase;
+  vertical-align: middle;
+  margin: 0 0.25rem;
+}
+.status-pill.provisional { background: #FFB300; color: #3E2723; }
+.status-pill.approved    { background: #43A047; color: white; }
+.status-pill.superseded  { background: #BDBDBD; color: #424242; }
+
+/* Cards used on the home page to replace tabbed "What is this?" widget */
+.team-cards {
+  display: grid;
+  grid-template-columns: 1fr 1fr;
+  gap: 1rem;
+  margin: 1.25rem 0 1.5rem;
+}
+@media (max-width: 720px) {
+  .team-cards { grid-template-columns: 1fr; }
+}
+.team-card {
+  border: 1px solid var(--md-default-fg-color--lightest);
+  border-radius: 0.45rem;
+  padding: 1rem 1.1rem;
+  background: var(--md-default-bg-color);
+  transition: transform 0.15s ease, box-shadow 0.15s ease;
+}
+.team-card:hover {
+  transform: translateY(-2px);
+  box-shadow: 0 6px 18px rgba(0,0,0,0.06);
+}
+.team-card h3 {
+  margin: 0 0 0.35rem;
+  font-size: 1rem;
+  color: var(--md-primary-fg-color);
+}
+.team-card .tagline {
+  font-size: 0.78rem;
+  color: var(--md-default-fg-color--light);
+  text-transform: uppercase;
+  letter-spacing: 0.05em;
+  margin-bottom: 0.5rem;
+}
+.team-card p { margin: 0.4rem 0; font-size: 0.92rem; }
+.team-card a.card-link {
+  display: inline-block;
+  margin-top: 0.5rem;
+  font-weight: 600;
+  font-size: 0.9rem;
+}
+
+/* Tighter table look for reference pages */
+.md-typeset table:not([class]) { font-size: 0.78rem; }
+.md-typeset table:not([class]) code { font-size: 0.78rem; }
diff --git a/docs/assets/sp_sv_a_0001_00_waveform.png b/docs/assets/sp_sv_a_0001_00_waveform.png
diff --git a/docs/audio-format.md b/docs/audio-format.md
@@ -1,132 +1,105 @@
 # Audio Format
 
-All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus.
+The handful of facts you need to use the data, then optional detail on how the audio gets that way.
 
 ---
 
-## Format requirements
+## What you need to know
 
-| Property | Value |
-|----------|-------|
-| Sample rate | 16 000 Hz |
-| Channels | 1 (mono) |
-| Bit depth | 16-bit PCM |
-| Peak level | ≤ –1.0 dBFS (safety ceiling) |
-| Duration | ≥ 3.0 s |
-| Encoding | WAV (no lossy formats) |
+| Fact | Value | Why it matters |
+|------|-------|----------------|
+| **Sample rate** | 16 000 Hz | Always. Set up your feature extraction for this rate. |
+| **Channels / depth** | mono / 16-bit PCM WAV | `wav.ndim == 1`. No lossy formats anywhere. |
+| **Peak level** | ≤ –1.0 dBFS (target –2.0 dBFS) | `np.abs(wav).max() ≈ 0.79`, **not** 1.0. |
+| **Silence pad** | ≥ 0.5 s at head and tail | Onset/offset timestamps **already account for it** — no shift needed. |
+| **Duration** | ≥ 3.0 s | Hard minimum; clips below it are rejected. |
 
 ```python
-import soundfile as sf
-import numpy as np
+import soundfile as sf, numpy as np
 
 wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
 assert sr == 16000
-assert wav.ndim == 1               # mono
-assert wav.dtype == np.float64     # soundfile returns float64 by default
-assert np.abs(wav).max() <= 1.0   # -1.0 dBFS ≈ linear amplitude 1.0
-
-# Check format info
-info = sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
-print(info.subtype)  # PCM_16
+assert wav.ndim == 1
+assert wav.dtype == np.float64           # soundfile default
+assert np.abs(wav).max() <= 10 ** (-1.0 / 20) + 1e-4   # safety ceiling: -1.0 dBFS ≈ 0.891 linear (plus int16 rounding slack)
+print(sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav").subtype)   # PCM_16
 ```
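The linear figures quoted in the table follow from the standard dBFS-to-amplitude conversion. Below is a minimal sketch, assuming float waveforms with full scale at 1.0 (as `soundfile` returns them). `dbfs_to_linear` and `peak_dbfs` are illustrative helper names, not part of the corpus tooling, and the synthetic tone only stands in for a real clip.

```python
import numpy as np


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a peak level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def peak_dbfs(wav: np.ndarray) -> float:
    """Absolute peak of a float waveform, expressed in dBFS."""
    peak = float(np.max(np.abs(wav)))
    return float("-inf") if peak == 0.0 else float(20.0 * np.log10(peak))


# Synthetic stand-in for a clip: a 440 Hz tone peaking at the -2.0 dBFS target.
sr = 16000
t = np.arange(sr) / sr
wav = dbfs_to_linear(-2.0) * np.sin(2 * np.pi * 440 * t)

print(round(dbfs_to_linear(-2.0), 3))  # 0.794 (the target peak)
print(round(dbfs_to_linear(-1.0), 3))  # 0.891 (the safety ceiling)
print(round(peak_dbfs(wav), 2))        # measured peak of the stand-in tone, in dBFS
```

On a real clip, `peak_dbfs(wav)` should land close to the configured target and never above the –1.0 dBFS ceiling.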
 
 ---
 
-## Normalization pipeline
-
-Each clip passes through two normalization steps:
-
-```
-TTS render (float32, arbitrary loudness)
-    ↓
-[1] Per-turn RMS gain (M3a)        — preserves inter-turn contrast
-    ↓
-[2] Single global peak gain         — lands absolute peak at target_peak_dbfs
-    ↓
-[3] Safety limiter                  — clips at ≤ –1.0 dBFS (guaranteed no-op for target ≥ –12.0)
-    ↓
-Tier B only: room IR + device → renormalize to same target
-    ↓
-Output WAV
-```
-
-### Step 1 — Per-turn RMS gain (M3a)
-
-Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with.
-
-??? info "Why per-turn RMS matters"
-    Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data.
-
-### Step 2 — Single global peak gain
-
-A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved.
+## Two peak fields, two meanings
 
-The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`.
-The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`.
+Every clip records two related loudness values:
 
-### Step 3 — Safety limiter
+| Field | Set by | What it is |
+|-------|--------|------------|
+| `generation_metadata.loudness_target_peak_dbfs` | The pipeline config | **Configured** peak target (default –2.0 dBFS) |
+| `preprocessing_applied.normalized_dbfs` | Measurement at write time | **Measured** post-preprocess peak of the actual WAV |
 
-A hard ceiling at –1.0 dBFS. For in-spec targets (range: [–12.0, –1.5] dBFS), this is a guaranteed no-op. It exists as a safety rail against misconfiguration.
+If those two disagree by more than a fraction of a dB, something is wrong with normalization, so comparing them makes a useful diagnostic check.
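The comparison between the two peak fields can be sketched as follows. This assumes the clip metadata has already been loaded into a nested `dict` with the two field paths named above. The 0.5 dB tolerance is an assumed reading of "a fraction of a dB", and the sample records are made up for illustration.

```python
def peak_mismatch_db(meta: dict) -> float:
    """Absolute gap between the configured and the measured peak, in dB."""
    target = meta["generation_metadata"]["loudness_target_peak_dbfs"]
    measured = meta["preprocessing_applied"]["normalized_dbfs"]
    return abs(measured - target)


def peak_is_consistent(meta: dict, tol_db: float = 0.5) -> bool:
    """True when the measured peak sits within tol_db of the configured target.

    tol_db is an assumed threshold; tune it to your own tolerance.
    """
    return peak_mismatch_db(meta) <= tol_db


# Hypothetical metadata records, for illustration only.
ok_meta = {
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
    "preprocessing_applied": {"normalized_dbfs": -2.03},
}
bad_meta = {
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
    "preprocessing_applied": {"normalized_dbfs": -4.8},
}

print(peak_is_consistent(ok_meta))   # True
print(peak_is_consistent(bad_meta))  # False
```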
 
 ---
 
-## Silence padding
+## Known audio quirks
 
-Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`.
+### `vic_f0_high` on the two Google clips
 
-Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output.
+`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`). Its F0 baseline runs measurably higher than the Azure reference voice (`he-IL-HilaNeural`), against which the QA F0 thresholds were calibrated.
 
----
+**What to do about it:** nothing. The flag fires correctly; the audio is fine. If you compute F0-derived features, calibrate per backend (`generation_metadata.tts_backend`) — or just use spectral features that aren't sensitive to baseline F0. Don't exclude these two clips: they're the only backend diversity you have in this delivery.
 
-## Dirty files
+### `quality_flags: ["emotion_downgrade"]`
 
-`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising.
+The pipeline detected that the TTS engine produced slightly less intense prosody than the SSML asked for at high-intensity turns. The audio is still valid; the prosody is just a touch tamer than the scene intended. About 15 of 20 clips in delivery-003 carry this flag — it's not a defect signal.
 
-The `dirty_file_path` field in ClipMetadata gives the repo-relative path:
-```
-"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav"
-```
+### Dirty files
 
-Dirty files are useful for:
-- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`)
-- Checking raw TTS prosody before processing
-- Re-running preprocessing with different parameters
+The pre-preprocessing WAV is retained at `assets/speech/dirty/{clip_id}_dirty.wav`. Its path is recorded in `dirty_file_path`. These files are the raw TTS-mixer outputs before normalization, padding, or denoising — useful for diagnosing the pipeline, not for training.
 
-!!! warning "Do not modify dirty files"
-    The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups.
+!!! warning "Don't modify files under `assets/`"
+    `assets/speech/` is the SynthBanshee SHA-256 SSML cache. Renaming or editing any file there will break cache lookups and force a paid re-synthesis on next run.
 
 ---
 
 ## TTS backends
 
-| Backend | Voices | Clips in delivery-003 |
-|---------|--------|----------------------|
+| Backend | Voices in delivery-003 | Clips |
+|---------|-----------------------|------:|
 | Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 |
 | Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 |
 
-The backend per speaker is recorded in `generation_metadata.tts_backend`:
+Per-speaker backend is in `generation_metadata.tts_backend`:
+
 ```json
 "tts_backend": {
     "AGG_M_30-45_002": "google",
     "VIC_F_25-40_003": "google"
 }
 ```
 
-??? info "Azure SSML cache"
-    SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters).
+Azure is deterministic: re-rendering the same SSML returns byte-identical WAVs (via the SHA-256 cache). Google Chirp 3 HD is not: it produces minor bit-level variation on each synthesis at the same parameters. If your experiment depends on byte-stable reproducibility, expect the Google clips to re-render slightly differently between fresh generations, even though peak / RMS / duration stay within tolerance.
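One way to tell "byte-identical" apart from "equivalent within tolerance" is sketched below. It assumes you compare two renders of the same clip as float arrays. The helper names and the 1% relative tolerance are hypothetical, and the second render is simulated by adding a tiny perturbation rather than by calling any TTS backend.

```python
import hashlib

import numpy as np


def pcm16_sha256(wav: np.ndarray) -> str:
    """SHA-256 over the 16-bit PCM samples of a float waveform."""
    pcm = np.round(np.clip(wav, -1.0, 1.0) * 32767).astype(np.int16)
    return hashlib.sha256(pcm.tobytes()).hexdigest()


def summary(wav: np.ndarray, sr: int) -> dict:
    """Peak, RMS and duration: the stats expected to stay stable across renders."""
    return {
        "peak": float(np.max(np.abs(wav))),
        "rms": float(np.sqrt(np.mean(np.square(wav)))),
        "duration_s": len(wav) / sr,
    }


def within_tolerance(a: np.ndarray, b: np.ndarray, sr: int, rel_tol: float = 0.01) -> bool:
    """True when every summary stat of b is within rel_tol of the same stat of a."""
    sa, sb = summary(a, sr), summary(b, sr)
    return all(abs(sa[k] - sb[k]) <= rel_tol * max(abs(sa[k]), 1e-12) for k in sa)


sr = 16000
t = np.arange(3 * sr) / sr
render_a = 0.5 * np.sin(2 * np.pi * 200 * t)
# Simulated second render: same signal plus a few LSBs of variation.
rng = np.random.default_rng(0)
render_b = render_a + 1e-4 * rng.standard_normal(render_a.shape)

print(pcm16_sha256(render_a) == pcm16_sha256(render_b))  # False: bytes differ
print(within_tolerance(render_a, render_b, sr))          # True: stats agree
```

For experiments that need exact reproducibility, pinning the delivered WAV files by hash is more reliable than regenerating them.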
 
 ---
 
-## Known audio quirks
+## How the normalization actually works
 
-### `vic_f0_high` — Google Chirp HD female F0 baseline
+You don't need this to consume the data. Open the section below if you're debugging loudness drift, building a comparable pipeline, or just curious.
 
-The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated.
+??? info "The normalization pipeline (3 stages)"
+    ```
+    TTS render  →  per-turn RMS gain  →  single global peak gain  →  safety limiter  →  Tier B: room IR + device + noise → renormalize  →  output WAV
+    ```
 
-Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend.
+    **Stage 1: per-turn RMS gain.** Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This creates the calm-to-loud gradient you'd expect — a whispered I1 turn stays quieter than a shouted I5 turn. Without this step, raw Azure and Google output is nearly constant-loudness regardless of the requested prosody style.
 
-### `quality_flags: ["emotion_downgrade"]`
+    **Stage 2: single global peak gain.** A single multiplicative gain lands the clip's absolute peak at `loudness_target_peak_dbfs` (default –2.0 dBFS). Because it's one gain applied to the whole mix, every per-turn RMS ratio from Stage 1 survives unchanged.
+
+    **Stage 3: safety limiter.** A hard ceiling at –1.0 dBFS. For in-spec targets in `[-12.0, -1.5]` dBFS, this is always a no-op. It exists as a safety rail against config drift.
+
+    **Tier B post-processing.** Room IR convolution, device frequency response (e.g. `pi_budget_mic`), and background-noise injection happen after Stage 3. Then the same `peak_normalize_to_target` helper renormalizes so every tier exits at the same absolute peak, which keeps Tier A and Tier B comparable on the loudness dimension.
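The three stages can be illustrated with a small NumPy sketch. This is not the SynthBanshee implementation. The function names, the per-intensity RMS targets (–30 and –18 dBFS) and the hard-clip limiter are assumptions chosen to mirror the description above, and the two sine "turns" stand in for TTS output.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.square(x))))


def per_turn_rms_gain(turns, target_rms_dbfs):
    """Stage 1: scale each turn so its RMS hits its own target (in dBFS)."""
    scaled = []
    for turn, target in zip(turns, target_rms_dbfs):
        gain = 10.0 ** (target / 20.0) / max(rms(turn), 1e-12)
        scaled.append(turn * gain)
    return scaled


def global_peak_gain(mix: np.ndarray, target_peak_dbfs: float = -2.0) -> np.ndarray:
    """Stage 2: one gain for the whole mix, landing the peak at the target."""
    peak = max(float(np.max(np.abs(mix))), 1e-12)
    return mix * (10.0 ** (target_peak_dbfs / 20.0) / peak)


def safety_limiter(mix: np.ndarray, ceiling_dbfs: float = -1.0) -> np.ndarray:
    """Stage 3: hard ceiling (modeled here as a plain clip, an assumption)."""
    ceiling = 10.0 ** (ceiling_dbfs / 20.0)
    return np.clip(mix, -ceiling, ceiling)


sr = 16000
t = np.arange(sr) / sr
# Two equally loud stand-in turns, as a flat-loudness TTS engine might emit.
calm = 0.3 * np.sin(2 * np.pi * 220 * t)
loud = 0.3 * np.sin(2 * np.pi * 330 * t)

staged = per_turn_rms_gain([calm, loud], target_rms_dbfs=[-30.0, -18.0])
normalized = global_peak_gain(np.concatenate(staged), target_peak_dbfs=-2.0)
limited = safety_limiter(normalized)

contrast_db = 20 * np.log10(rms(normalized[sr:]) / rms(normalized[:sr]))
print(round(contrast_db, 2))                 # 12.0: the Stage 1 contrast survives Stage 2
print(np.array_equal(limited, normalized))   # True: limiter is a no-op at -2.0 dBFS
```

Because Stage 2 multiplies the whole mix by a single number, the 12 dB gap set in Stage 1 is unchanged, which is the property the text relies on.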
 
-Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended.
+??? info "Why per-turn RMS gain matters"
+    Without it, the TTS engine produces flat RMS across turns regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests a "shout" style or sets `prosody volume="+50%"`. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient between calm and escalated turns — without it, your model has nothing to learn loudness escalation from.
 
-In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+.
+??? info "Why peak normalize to –2.0 dBFS instead of 0 dBFS"
+    The 2 dB of headroom buys safety against any later processing step that might add 1–2 dB of gain (room IR convolution can do this). A peak at 0 dBFS would clip as soon as any gain is added; a peak at –1.0 dBFS sits right on the limiter ceiling, leaving no margin before the limiter engages. –2.0 is the conservative middle.