Merged
60 changes: 60 additions & 0 deletions docs/assets/extra.css
@@ -0,0 +1,60 @@
/* Status pill used in headers and front-page hero */
.status-pill {
  display: inline-block;
  padding: 0.15rem 0.55rem;
  border-radius: 0.4rem;
  font-size: 0.72rem;
  font-weight: 600;
  letter-spacing: 0.03em;
  text-transform: uppercase;
  vertical-align: middle;
  margin: 0 0.25rem;
}
.status-pill.provisional { background: #FFB300; color: #3E2723; }
.status-pill.approved { background: #43A047; color: white; }
.status-pill.superseded { background: #BDBDBD; color: #424242; }

/* Cards used on the home page to replace tabbed "What is this?" widget */
.team-cards {
  display: grid;
  grid-template-columns: 1fr 1fr;
  gap: 1rem;
  margin: 1.25rem 0 1.5rem;
}
@media (max-width: 720px) {
  .team-cards { grid-template-columns: 1fr; }
}
.team-card {
  border: 1px solid var(--md-default-fg-color--lightest);
  border-radius: 0.45rem;
  padding: 1rem 1.1rem;
  background: var(--md-default-bg-color);
  transition: transform 0.15s ease, box-shadow 0.15s ease;
}
.team-card:hover {
  transform: translateY(-2px);
  box-shadow: 0 6px 18px rgba(0,0,0,0.06);
}
.team-card h3 {
  margin: 0 0 0.35rem;
  font-size: 1rem;
  color: var(--md-primary-fg-color);
}
.team-card .tagline {
  font-size: 0.78rem;
  color: var(--md-default-fg-color--light);
  text-transform: uppercase;
  letter-spacing: 0.05em;
  margin-bottom: 0.5rem;
}
.team-card p { margin: 0.4rem 0; font-size: 0.92rem; }
.team-card a.card-link {
  display: inline-block;
  margin-top: 0.5rem;
  font-weight: 600;
  font-size: 0.9rem;
}

/* Tighter table look for reference pages */
.md-typeset table:not([class]) { font-size: 0.78rem; }
.md-typeset table:not([class]) code { font-size: 0.78rem; }
Binary file added docs/assets/sp_sv_a_0001_00_waveform.png
131 changes: 52 additions & 79 deletions docs/audio-format.md
@@ -1,132 +1,105 @@
# Audio Format

All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus.
The key facts you need to use the data, then optional detail on how it gets that way.

---

## Format requirements
## What you need to know

| Property | Value |
|----------|-------|
| Sample rate | 16 000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit PCM |
| Peak level | ≤ –1.0 dBFS (safety ceiling) |
| Duration | ≥ 3.0 s |
| Encoding | WAV (no lossy formats) |
| Fact | Value | Why it matters |
|------|-------|----------------|
| **Sample rate** | 16 000 Hz | Always. Extract your features at this rate. |
| **Channels / depth** | mono / 16-bit PCM WAV | `wav.ndim == 1`. No lossy formats anywhere. |
| **Peak level** | ≤ –1.0 dBFS (target –2.0 dBFS) | `np.abs(wav).max() ≈ 0.79`, **not** 1.0. |
| **Silence pad** | ≥ 0.5 s at head and tail | Onset/offset timestamps **already account for it** — no shift needed. |
| **Duration** | ≥ 3.0 s | Hard minimum; clips below it are rejected. |

```python
import soundfile as sf
import numpy as np
import soundfile as sf, numpy as np

wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
assert sr == 16000
assert wav.ndim == 1 # mono
assert wav.dtype == np.float64 # soundfile returns float64 by default
assert np.abs(wav).max() <= 1.0 # loose bound: -1.0 dBFS ≈ linear amplitude 0.891

# Check format info
info = sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")
print(info.subtype) # PCM_16
assert wav.ndim == 1
assert wav.dtype == np.float64 # soundfile default
assert np.abs(wav).max() <= 1.0 # loose bound; the -1.0 dBFS safety ceiling is ≈ 0.891 linear
print(sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav").subtype) # PCM_16
```
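The dBFS figures in the table map to linear amplitudes through the standard `20 * log10` relation. A minimal sketch of that conversion follows; the helper names `dbfs_to_linear` and `linear_to_dbfs` are illustrative and are not part of the corpus tooling.

```python
import math


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a peak level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def linear_to_dbfs(amplitude: float) -> float:
    """Convert a linear peak amplitude to dBFS; silence maps to -inf."""
    if amplitude <= 0.0:
        return float("-inf")
    return 20.0 * math.log10(amplitude)


target = dbfs_to_linear(-2.0)   # default peak target, about 0.794
ceiling = dbfs_to_linear(-1.0)  # safety ceiling, about 0.891

print(f"target peak  ~ {target:.3f}")
print(f"ceiling peak ~ {ceiling:.3f}")
```

This is why a clip normalized to the default target peaks near 0.79 in float samples rather than near 1.0.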

---

## Normalization pipeline

Each clip passes through two normalization steps:

```
TTS render (float32, arbitrary loudness)
[1] Per-turn RMS gain (M3a): preserves inter-turn contrast
[2] Single global peak gain: lands absolute peak at target_peak_dbfs
[3] Safety limiter: clips at ≤ –1.0 dBFS (guaranteed no-op for in-spec targets, i.e. –12.0 to –1.5 dBFS)
Tier B only: room IR + device → renormalize to same target
Output WAV
```

### Step 1 — Per-turn RMS gain (M3a)

Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with.

??? info "Why per-turn RMS matters"
Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data.

### Step 2 — Single global peak gain

A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved.
## Two peak fields, two meanings

The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`.
The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`.
Every clip records two related loudness values:

### Step 3 — Safety limiter
| Field | Set by | What it is |
|-------|--------|------------|
| `generation_metadata.loudness_target_peak_dbfs` | The pipeline config | **Configured** peak target (default –2.0 dBFS) |
| `preprocessing_applied.normalized_dbfs` | Measurement at write time | **Measured** post-preprocess peak of the actual WAV |

A hard ceiling at –1.0 dBFS. For in-spec targets (range: [–12.0, –1.5] dBFS), this is a guaranteed no-op. It exists as a safety rail against misconfiguration.
If those two disagree by more than a fraction of a dB, something is wrong with normalization. Useful as a diagnostic check.
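The comparison of the two fields can be sketched as below. This assumes the clip metadata has been loaded into a nested `dict` that mirrors the dotted field paths; the function names and the 0.1 dB tolerance are hypothetical choices for illustration, not values from the pipeline.

```python
def peak_mismatch_db(meta: dict) -> float:
    """Absolute gap in dB between the configured and the measured peak."""
    target = meta["generation_metadata"]["loudness_target_peak_dbfs"]
    measured = meta["preprocessing_applied"]["normalized_dbfs"]
    return abs(measured - target)


def peak_fields_agree(meta: dict, tolerance_db: float = 0.1) -> bool:
    """True when the measured peak is within tolerance of the target.

    tolerance_db is an assumed threshold; tune it to your own data.
    """
    return peak_mismatch_db(meta) <= tolerance_db


# Illustrative metadata records, not real clips from the corpus.
ok_clip = {
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
    "preprocessing_applied": {"normalized_dbfs": -2.03},
}
drifted_clip = {
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
    "preprocessing_applied": {"normalized_dbfs": -4.5},
}

print(peak_fields_agree(ok_clip), peak_fields_agree(drifted_clip))
```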

---

## Silence padding
## Known audio quirks

Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`.
### `vic_f0_high` on the 2 Google clips

Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output.
`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`). Its F0 baseline runs measurably higher than the Azure reference voice (`he-IL-HilaNeural`), against which the QA F0 thresholds were calibrated.

---
**What to do about it:** nothing. The flag fires correctly; the audio is fine. If you compute F0-derived features, calibrate per backend (`generation_metadata.tts_backend`) — or just use spectral features that aren't sensitive to baseline F0. Don't exclude these two clips: they're the only backend diversity you have in this delivery.
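One way to calibrate per backend is to z-score F0 within each backend group, so a higher baseline on one backend does not read as a feature. The sketch below assumes you already have one F0 value per clip plus its backend label; the function name and the sample numbers are made up for illustration and are not measurements from the corpus.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_f0_per_backend(samples):
    """Z-score F0 values within each backend group.

    samples: list of (backend, f0_hz) pairs.
    Returns z-scores in the same order as the input.
    """
    by_backend = defaultdict(list)
    for backend, f0_hz in samples:
        by_backend[backend].append(f0_hz)

    # Per-backend mean and population standard deviation.
    stats = {b: (mean(v), pstdev(v)) for b, v in by_backend.items()}

    z_scores = []
    for backend, f0_hz in samples:
        mu, sigma = stats[backend]
        # A group with zero spread carries no information; map it to 0.
        z_scores.append((f0_hz - mu) / sigma if sigma > 0 else 0.0)
    return z_scores


# Hypothetical F0 values in Hz.
demo = [("azure", 200.0), ("azure", 220.0), ("google", 240.0), ("google", 260.0)]
print(normalize_f0_per_backend(demo))
```

With only two Google clips in this delivery, per-backend statistics for that group are thin, so treat any calibration built on them as provisional.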

## Dirty files
### `quality_flags: ["emotion_downgrade"]`

`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising.
The pipeline detected that the TTS engine produced slightly less intense prosody than the SSML asked for at high-intensity turns. The audio is still valid; the prosody is just a touch tamer than the scene intended. About 15 of 20 clips in delivery-003 carry this flag — it's not a defect signal.

The `dirty_file_path` field in ClipMetadata gives the repo-relative path:
```
"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav"
```
### Dirty files

Dirty files are useful for:
- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`)
- Checking raw TTS prosody before processing
- Re-running preprocessing with different parameters
The pre-preprocessing WAV is retained at `assets/speech/dirty/{clip_id}_dirty.wav`. Its path is recorded in `dirty_file_path`. These files are the raw TTS-mixer outputs before normalization, padding, or denoising — useful for diagnosing the pipeline, not for training.

!!! warning "Do not modify dirty files"
The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups.
!!! warning "Don't modify files under `assets/`"
`assets/speech/` is the SynthBanshee SHA-256 SSML cache. Renaming or editing any file there will break cache lookups and force a paid re-synthesis on next run.

---

## TTS backends

| Backend | Voices | Clips in delivery-003 |
|---------|--------|----------------------|
| Backend | Voices in delivery-003 | Clips |
|---------|-----------------------|------:|
| Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 |
| Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 |

The backend per speaker is recorded in `generation_metadata.tts_backend`:
Per-speaker backend is in `generation_metadata.tts_backend`:

```json
"tts_backend": {
  "AGG_M_30-45_002": "google",
  "VIC_F_25-40_003": "google"
}

??? info "Azure SSML cache"
SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters).
Azure is deterministic: re-rendering the same SSML returns byte-identical WAVs (via the SHA-256 cache). Google Chirp 3 HD is not: each synthesis at the same parameters produces minor bit-level variation. If you need byte-stable reproducibility for an experiment, expect the Google clips to re-render slightly differently between fresh generations, even though peak / RMS / duration stay within tolerance.
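If byte stability matters for an experiment, comparing SHA-256 digests of two renders of the same clip is a quick check. This is a minimal sketch using only the standard library; the helper names are hypothetical and it hashes the WAV file bytes, which is separate from how SynthBanshee keys its own cache.

```python
import hashlib
from pathlib import Path


def sha256_of_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path) -> str:
    """Hex SHA-256 digest of a file's full contents."""
    return sha256_of_bytes(Path(path).read_bytes())


def same_render(path_a, path_b) -> bool:
    """True when two WAV files are byte-identical."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Two Azure renders of the same SSML should compare equal under this check; two fresh Google renders may not, even when the audio is perceptually the same.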

---

## Known audio quirks
## How the normalization actually works

### `vic_f0_high` — Google Chirp HD female F0 baseline
You don't need this to consume the data. Open the section below if you're debugging loudness drift, building a comparable pipeline, or just curious.

The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated.
??? info "The normalization pipeline (3 stages)"
```
TTS render → per-turn RMS gain → single global peak gain → safety limiter → Tier B: room IR + device + noise → renormalize → output WAV
```

Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend.
**Stage 1: per-turn RMS gain.** Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This creates the calm-to-loud gradient you'd expect — a whispered I1 turn stays quieter than a shouted I5 turn. Without this step, raw Azure and Google output is nearly constant-loudness regardless of the requested prosody style.

### `quality_flags: ["emotion_downgrade"]`
**Stage 2: single global peak gain.** A single multiplicative gain lands the clip's absolute peak at `loudness_target_peak_dbfs` (default –2.0 dBFS). Because it's one gain applied to the whole mix, every per-turn RMS ratio from Stage 1 survives unchanged.

**Stage 3: safety limiter.** A hard ceiling at –1.0 dBFS. For in-spec targets in `[-12.0, -1.5]` dBFS, this is always a no-op. It exists as a safety rail against config drift.

**Tier B post-processing.** Room IR convolution, device frequency response (e.g. `pi_budget_mic`), and background-noise injection happen after Stage 3. Then the same `peak_normalize_to_target` helper renormalises so every tier exits at the same absolute peak — Tier A and Tier B are comparable on the loudness dimension.
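The three stages can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SynthBanshee implementation: turns are simply concatenated, the limiter is modelled as a hard clip, and the per-turn RMS targets passed in are hypothetical values. The function and parameter names are also invented for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def db_to_linear(db):
    """Convert dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (db / 20.0)


def normalize_clip(turns, turn_rms_targets_dbfs,
                   target_peak_dbfs=-2.0, ceiling_dbfs=-1.0):
    """Apply per-turn RMS gain, one global peak gain, then a hard ceiling."""
    # Stage 1: bring each turn to its own RMS target (assumed values).
    mix = []
    for samples, target_db in zip(turns, turn_rms_targets_dbfs):
        level = rms(samples)
        gain = db_to_linear(target_db) / level if level > 0 else 1.0
        mix.extend(s * gain for s in samples)

    # Stage 2: a single gain for the whole mix, so RMS ratios survive.
    peak = max((abs(s) for s in mix), default=0.0)
    if peak > 0:
        global_gain = db_to_linear(target_peak_dbfs) / peak
        mix = [s * global_gain for s in mix]

    # Stage 3: hard ceiling; a no-op when the target sits below it.
    ceiling = db_to_linear(ceiling_dbfs)
    return [max(-ceiling, min(ceiling, s)) for s in mix]


# Two toy turns: a quiet one and a louder one, 10 dB apart in RMS target.
calm = [0.1, -0.1, 0.1, -0.1]
loud = [0.5, -0.5, 0.25, -0.25]
out = normalize_clip([calm, loud], [-30.0, -20.0])
print(max(abs(s) for s in out))
```

Because Stage 2 multiplies every sample by the same factor, the 10 dB gap between the two turns set in Stage 1 is still present in the output.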

Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended.
??? info "Why per-turn RMS gain matters"
Without it, the TTS engine produces flat RMS across turns regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests a "shout" style or sets `prosody volume="+50%"`. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient between calm and escalated turns — without it, your model has nothing to learn loudness escalation from.

In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+.
??? info "Why peak normalize to –2.0 dBFS instead of 0 dBFS"
The 2 dB of headroom buys safety against any later processing step that might add 1–2 dB of gain (room IR convolution can do this). Peak at 0 dBFS would clip; peak at –1.0 dBFS leaves no headroom for the limiter. –2.0 is the conservative middle.