diff --git a/docs/assets/extra.css b/docs/assets/extra.css new file mode 100644 index 0000000..a6e2c4c --- /dev/null +++ b/docs/assets/extra.css @@ -0,0 +1,60 @@ +/* Status pill used in headers and front-page hero */ +.status-pill { + display: inline-block; + padding: 0.15rem 0.55rem; + border-radius: 0.4rem; + font-size: 0.72rem; + font-weight: 600; + letter-spacing: 0.03em; + text-transform: uppercase; + vertical-align: middle; + margin: 0 0.25rem; +} +.status-pill.provisional { background: #FFB300; color: #3E2723; } +.status-pill.approved { background: #43A047; color: white; } +.status-pill.superseded { background: #BDBDBD; color: #424242; } + +/* Cards used on the home page to replace tabbed "What is this?" widget */ +.team-cards { + display: grid; + grid-template-columns: 1fr 1fr; + gap: 1rem; + margin: 1.25rem 0 1.5rem; +} +@media (max-width: 720px) { + .team-cards { grid-template-columns: 1fr; } +} +.team-card { + border: 1px solid var(--md-default-fg-color--lightest); + border-radius: 0.45rem; + padding: 1rem 1.1rem; + background: var(--md-default-bg-color); + transition: transform 0.15s ease, box-shadow 0.15s ease; +} +.team-card:hover { + transform: translateY(-2px); + box-shadow: 0 6px 18px rgba(0,0,0,0.06); +} +.team-card h3 { + margin: 0 0 0.35rem; + font-size: 1rem; + color: var(--md-primary-fg-color); +} +.team-card .tagline { + font-size: 0.78rem; + color: var(--md-default-fg-color--light); + text-transform: uppercase; + letter-spacing: 0.05em; + margin-bottom: 0.5rem; +} +.team-card p { margin: 0.4rem 0; font-size: 0.92rem; } +.team-card a.card-link { + display: inline-block; + margin-top: 0.5rem; + font-weight: 600; + font-size: 0.9rem; +} + +/* Tighter table look for reference pages */ +.md-typeset table:not([class]) { font-size: 0.78rem; } +.md-typeset table:not([class]) code { font-size: 0.78rem; } diff --git a/docs/assets/sp_sv_a_0001_00_waveform.png b/docs/assets/sp_sv_a_0001_00_waveform.png new file mode 100644 index 0000000..f7bf4fe Binary 
files /dev/null and b/docs/assets/sp_sv_a_0001_00_waveform.png differ diff --git a/docs/audio-format.md b/docs/audio-format.md index b0173a9..61912f8 100644 --- a/docs/audio-format.md +++ b/docs/audio-format.md @@ -1,110 +1,75 @@ # Audio Format -All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus. +The key facts you need to use the data, then optional detail on how it gets that way. --- -## Format requirements +## What you need to know -| Property | Value | -|----------|-------| -| Sample rate | 16 000 Hz | -| Channels | 1 (mono) | -| Bit depth | 16-bit PCM | -| Peak level | ≤ –1.0 dBFS (safety ceiling) | -| Duration | ≥ 3.0 s | -| Encoding | WAV (no lossy formats) | +| Fact | Value | Why it matters | +|------|-------|----------------| +| **Sample rate** | 16 000 Hz | Always. Configure your feature extraction for this rate. | +| **Channels / depth** | mono / 16-bit PCM WAV | `wav.ndim == 1`. No lossy formats anywhere. | +| **Peak level** | ≤ –1.0 dBFS (target –2.0 dBFS) | `np.abs(wav).max() ≈ 0.79`, **not** 1.0. | +| **Silence pad** | ≥ 0.5 s at head and tail | Onset/offset timestamps **already account for it** — no shift needed. | +| **Duration** | ≥ 3.0 s | Hard minimum; clips below it are rejected. 
| ```python -import soundfile as sf -import numpy as np +import soundfile as sf, numpy as np wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") assert sr == 16000 -assert wav.ndim == 1 # mono -assert wav.dtype == np.float64 # soundfile returns float64 by default -assert np.abs(wav).max() <= 1.0 # -1.0 dBFS ≈ linear amplitude 1.0 - -# Check format info -info = sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") -print(info.subtype) # PCM_16 +assert wav.ndim == 1 +assert wav.dtype == np.float64 # soundfile default +assert np.abs(wav).max() <= 10 ** (-1.0 / 20) + 1e-4 # safety ceiling at -1.0 dBFS (≈ 0.891 linear), small tolerance for 16-bit rounding +print(sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav").subtype) # PCM_16 ``` --- -## Normalization pipeline - -Each clip passes through two normalization steps: - -``` -TTS render (float32, arbitrary loudness) - ↓ -[1] Per-turn RMS gain (M3a) — preserves inter-turn contrast - ↓ -[2] Single global peak gain — lands absolute peak at target_peak_dbfs - ↓ -[3] Safety limiter — clips at ≤ –1.0 dBFS (guaranteed no-op for target ≥ –12.0) - ↓ -Tier B only: room IR + device → renormalize to same target - ↓ -Output WAV -``` - -### Step 1 — Per-turn RMS gain (M3a) - -Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with. - -??? info "Why per-turn RMS matters" - Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data. 
- -### Step 2 — Single global peak gain - -A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved. +## Two peak fields, two meanings -The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`. -The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`. +Every clip records two related loudness values: -### Step 3 — Safety limiter +| Field | Set by | What it is | +|-------|--------|------------| +| `generation_metadata.loudness_target_peak_dbfs` | The pipeline config | **Configured** peak target (default –2.0 dBFS) | +| `preprocessing_applied.normalized_dbfs` | Measurement at write time | **Measured** post-preprocess peak of the actual WAV | -A hard ceiling at –1.0 dBFS. For in-spec targets (range: [–12.0, –1.5] dBFS), this is a guaranteed no-op. It exists as a safety rail against misconfiguration. +If those two disagree by more than a fraction of a dB, something is wrong with normalization. Useful as a diagnostic check. --- -## Silence padding +## Known audio quirks -Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`. +### `vic_f0_high` on the 2 Google clips -Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output. +`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`). Its F0 baseline runs measurably higher than the Azure reference voice (`he-IL-HilaNeural`), against which the QA F0 thresholds were calibrated. ---- +**What to do about it:** nothing. The flag fires correctly; the audio is fine. 
If you compute F0-derived features, calibrate per backend (`generation_metadata.tts_backend`) — or just use spectral features that aren't sensitive to baseline F0. Don't exclude these two clips: they're the only backend diversity you have in this delivery. -## Dirty files +### `quality_flags: ["emotion_downgrade"]` -`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising. +The pipeline detected that the TTS engine produced slightly less intense prosody than the SSML asked for at high-intensity turns. The audio is still valid; the prosody is just a touch tamer than the scene intended. About 15 of 20 clips in delivery-003 carry this flag — it's not a defect signal. -The `dirty_file_path` field in ClipMetadata gives the repo-relative path: -``` -"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav" -``` +### Dirty files -Dirty files are useful for: -- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`) -- Checking raw TTS prosody before processing -- Re-running preprocessing with different parameters +The pre-preprocessing WAV is retained at `assets/speech/dirty/{clip_id}_dirty.wav`. Its path is recorded in `dirty_file_path`. These files are the raw TTS-mixer outputs before normalization, padding, or denoising — useful for diagnosing the pipeline, not for training. -!!! warning "Do not modify dirty files" - The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups. +!!! warning "Don't modify files under `assets/`" + `assets/speech/` is the SynthBanshee SHA-256 SSML cache. Renaming or editing any file there will break cache lookups and force a paid re-synthesis on next run. 
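The diagnostic described under "Two peak fields, two meanings" can be sketched roughly as follows. This is a minimal illustration, not pipeline code: the nested dictionary layout is assumed from the dotted field names, the metadata values are made up, and the 0.5 dB tolerance is a hypothetical stand-in for "a fraction of a dB".

```python
import math


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a peak level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def linear_to_dbfs(peak: float) -> float:
    """Convert a positive linear peak amplitude to dBFS."""
    return 20.0 * math.log10(peak)


def peak_mismatch_db(meta: dict) -> float:
    """Absolute gap in dB between the configured and the measured peak."""
    target = meta["generation_metadata"]["loudness_target_peak_dbfs"]
    measured = meta["preprocessing_applied"]["normalized_dbfs"]
    return abs(target - measured)


# Hypothetical metadata excerpt; values are illustrative, not from a real clip.
example_meta = {
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
    "preprocessing_applied": {"normalized_dbfs": -2.03},
}

TOLERANCE_DB = 0.5  # assumed threshold, not a documented pipeline constant

print(round(dbfs_to_linear(-2.0), 3))  # 0.794, the default target
print(round(dbfs_to_linear(-1.0), 3))  # 0.891, the safety ceiling
print(peak_mismatch_db(example_meta) <= TOLERANCE_DB)  # True for this example
```

The same conversion applies to a dirty file: measuring its raw peak and comparing it with `normalized_dbfs` shows how much gain normalization applied.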
--- ## TTS backends -| Backend | Voices | Clips in delivery-003 | -|---------|--------|----------------------| +| Backend | Voices in delivery-003 | Clips | +|---------|-----------------------|------:| | Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 | | Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 | -The backend per speaker is recorded in `generation_metadata.tts_backend`: +Per-speaker backend is in `generation_metadata.tts_backend`: + ```json "tts_backend": { "AGG_M_30-45_002": "google", @@ -112,21 +77,29 @@ The backend per speaker is recorded in `generation_metadata.tts_backend`: } ``` -??? info "Azure SSML cache" - SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters). +Azure is deterministic — re-rendering the same SSML returns byte-identical WAVs (via the SHA-256 cache). Google Chirp 3 HD is not — it produces minor bit-level variation on each synthesis at the same parameters. If you need byte-stable reproducibility for an experiment, you may see the Google clips re-render slightly differently between fresh generations even though peak / RMS / duration stay within tolerance. --- -## Known audio quirks +## How the normalization actually works -### `vic_f0_high` — Google Chirp HD female F0 baseline +You don't need this to consume the data. Open the section below if you're debugging loudness drift, building a comparable pipeline, or just curious. -The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. 
This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated. +??? info "The normalization pipeline (3 stages)" + ``` + TTS render → per-turn RMS gain → single global peak gain → safety limiter → Tier B: room IR + device + noise → renormalize → output WAV + ``` -Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend. + **Stage 1: per-turn RMS gain.** Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This creates the calm-to-loud gradient you'd expect — a whispered I1 turn stays quieter than a shouted I5 turn. Without this step, raw Azure and Google output is nearly constant-loudness regardless of the requested prosody style. -### `quality_flags: ["emotion_downgrade"]` + **Stage 2: single global peak gain.** A single multiplicative gain lands the clip's absolute peak at `loudness_target_peak_dbfs` (default –2.0 dBFS). Because it's one gain applied to the whole mix, every per-turn RMS ratio from Stage 1 survives unchanged. + + **Stage 3: safety limiter.** A hard ceiling at –1.0 dBFS. For in-spec targets in `[-12.0, -1.5]` dBFS, this is always a no-op. It exists as a safety rail against config drift. + + **Tier B post-processing.** Room IR convolution, device frequency response (e.g. `pi_budget_mic`), and background-noise injection happen after Stage 3. Then the same `peak_normalize_to_target` helper renormalises so every tier exits at the same absolute peak — Tier A and Tier B are comparable on the loudness dimension. -Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. 
Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended. +??? info "Why per-turn RMS gain matters" + Without it, the TTS engine produces flat RMS across turns regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests a "shout" style or sets `prosody volume="+50%"`. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient between calm and escalated turns — without it, your model has nothing to learn loudness escalation from. -In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+. +??? info "Why peak normalize to –2.0 dBFS instead of 0 dBFS" + The 2 dB of headroom buys safety against any later processing step that might add 1–2 dB of gain (room IR convolution can do this). Peak at 0 dBFS would clip; peak at –1.0 dBFS leaves no headroom for the limiter. –2.0 is the conservative middle. diff --git a/docs/deliveries.md b/docs/deliveries.md index 71dac3b..12c9e2a 100644 --- a/docs/deliveries.md +++ b/docs/deliveries.md @@ -1,16 +1,14 @@ # Deliveries -All data deliveries are logged here. Each entry links to per-delivery notes with clip counts, QA findings, known limitations, and the SynthBanshee commit that produced the batch. +What's currently in the corpus, what's missing, and what changed in the latest batch. One row per data delivery in the log at the bottom. --- -## Delivery 003 — multi-project, multi-voice +## Current delivery — 003 -**Date:** 2026-05-12 · **Status:** provisional · **PR:** [#5](https://github.com/DataHackIL/avdp-synth-corpus/pull/5) +provisional · 2026-05-12 [`#5`](https://github.com/DataHackIL/avdp-synth-corpus/pull/5) · slug: `multi-project-multi-voice` · supersedes delivery-002. -This is the current working delivery. It replaces delivery-002. - -### At a glance +### What's in it | | | |---|---| @@ -19,53 +17,74 @@ This is the current working delivery. 
It replaces delivery-002. | Projects | `she_proves` (12) + `elephant_in_the_room` (8) | | Tiers | A (12 clean) + B (8 room-augmented) | | TTS backends | Azure (18) + Google Chirp 3 HD (2) | +| Unique speaker personas | 6 (4 in She-Proves, 2 in Elephant) | | Validation failures | 0 / 20 | | Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) | -[Full notes](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [QA report](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json) +Authoritative records: [`metadata.yaml`](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/metadata.yaml) · [`notes.md`](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [`qa-report.json`](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json). -### QA findings — closed (vs. delivery-002) +### Known limitations -| Finding | Delivery-002 | Delivery-003 | -|---------|:---:|:---:| -| `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity | -| `warn_no_overlap` | 4 clips | **0** — overlap_ratio 100% on I4+ clips | -| `warn_emotion_downgrade` | 4 clips | **0** — emotion_downgrade_ratio 0% | -| `generation_metadata` absent | 0 of 8 clips | **20 of 20** carry the full block | -| `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files | -| `normalized_dbfs` hardcoded `-1.0` | all 8 clips | **fixed** — now the measured peak | +- **All clips are `split: train`.** Only 4 unique speaker personas across 20 clips — speaker-disjoint partitioning isn't feasible at this scale. +- **One room type for Elephant.** All 8 Tier-B clips use `clinic_office`. `welfare_office` and `open_office` are in the pipeline but not exercised yet. 
+- **One device profile for She-Proves.** No `phone_in_pocket` etc. augmentation applied yet — Tier-A clips are clean, not phone-captured. +- **Voice diversity is low.** 2 voice families per gender; the QA threshold for "diverse" is ≥3. +- **Toy-batch scale.** 20 clips is enough to wire up consumer plumbing. Not enough to train a production model. -Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](https://github.com/DataHackIL/SynthBanshee/pull/110)/[#111](https://github.com/DataHackIL/SynthBanshee/pull/111)/[#112](https://github.com/DataHackIL/SynthBanshee/pull/112)): +### Open QA flags -| Finding | Resolution | -|---------|-----------| -| `single_backend` false positive | `qa.py` now derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` | -| Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX strings | -| Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks | +| Flag | Detail | What to do about it | +|------|--------|---------------------| +| `low_voice_diversity_male` | 2 male voice families across the corpus (threshold ≥3) | Track per-voice eval separately; expect feature overfit to AvriNeural until more voices land | +| `low_voice_diversity_female` | Same, for female voices | Same | +| `vic_f0_high` (per-clip × 2) | `sp_sv_a_0003_00`, `sp_it_a_0003_00` — Google Chirp HD female F0 above Azure baseline | **Nothing.** Don't exclude the clips. Calibrate F0 features per backend if you compute them. See [Audio Format](audio-format.md#vic_f0_high-on-the-2-google-clips). | +| `quality_flagged_clips: 15` | Mostly `emotion_downgrade` from prosody cap activations at I3+ | Don't reflexively filter these out — they pass validation. See [Common mistakes #7](gotchas.md#7-quality_flags-doesnt-mean-broken). 
| -### QA findings — open +### Distribution -| Finding | Detail | -|---------|--------| -| `low_voice_diversity_male` | 2 voice families per gender; threshold ≥ 3 | -| `low_voice_diversity_female` | 2 voice families per gender; threshold ≥ 3 | -| `vic_f0_high` (2 clips) | `sp_sv_a_0003_00` and `sp_it_a_0003_00` — Google Chirp HD female F0 runs higher than Azure Hila reference | -| `quality_flagged_clips: 15` | Mostly from prosody cap activations at I3+; expected behaviour | +| Typology | Tier A (She-Proves) | Tier B (Elephant) | Total | +|----------|:--:|:--:|:--:| +| `SV` | 3 | 2 | 5 | +| `IT` | 3 | 2 | 5 | +| `NEG` | 3 | 2 | 5 | +| `NEU` | 3 | 2 | 5 | -### Known limitations +`max_intensity` across the 20 clips: I5 = 10 clips · I3 = 4 clips · I2 = 6 clips. + +--- + +## What this delivery exercises + +Use these to check your consumer code on the schema features the delivery was designed to cover: + +1. Full `ClipMetadata` schema — including the `generation_metadata` block and (for Tier B) populated `acoustic_scene`. +2. Per-surface casing rules — UPPERCASE `speaker_id`, lowercase paths and clip IDs. +3. `has_violence` derivation from events — NEG clips correctly `false` even at `max_intensity ≥ 3`. +4. Multi-project layout under a single `data/he/` root. +5. Multi-backend provenance — `generation_metadata.tts_backend` differs per speaker. + +--- + +## What changed vs delivery-002 -- **Speaker-disjoint splits not feasible.** 4 unique speaker personas across 20 clips; all clips are `split: train`. -- **Two speaker directories only.** `agg_m_30-45_002/` and `ben_m_40-55_003/` are first appearances — code hardcoding `agg_m_30-45_001/` will miss them. -- **One room type.** All 8 Elephant Tier B clips use `clinic_office`. Future deliveries will add `welfare_office` and `open_office`. -- **Toy corpus only.** 20 clips is not sufficient for training production models. +??? abstract "Closed QA findings (vs. 
delivery-002)" + | Finding | Delivery-002 | Delivery-003 | + |---------|:---:|:---:| + | `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity | + | `warn_no_overlap` | 4 clips | **0** — turn-overlap fires on I4+ clips | + | `warn_emotion_downgrade` | 4 clips | **0** | + | `generation_metadata` absent | 0 of 8 clips had it | **20 of 20** carry the full block | + | `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files | + | `normalized_dbfs` hardcoded `-1.0` | all 8 clips | Records the measured peak | -### What this delivery exercises +??? abstract "Closed by the 2026-05-12 schema-shift regen" + Three SynthBanshee PRs landed alongside the regen ([#110](https://github.com/DataHackIL/SynthBanshee/pull/110) / [#111](https://github.com/DataHackIL/SynthBanshee/pull/111) / [#112](https://github.com/DataHackIL/SynthBanshee/pull/112)): -1. Full `ClipMetadata` schema including `generation_metadata`, `voice_family`, and (for Tier B) the populated `acoustic_scene` block -2. Per-surface casing rules: UPPERCASE `speaker_id`, lowercase paths and clip IDs -3. `has_violence` derivation from events: NEG clips are correctly `false` even at `max_intensity ≥ 3` -4. Multi-project layout under a single `data/he/` root -5. 
Multi-backend provenance: `generation_metadata.tts_backend` per speaker + | Finding | Resolution | + |---------|-----------| + | `single_backend` false positive | `qa.py` derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` | + | Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX | + | Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks | --- @@ -73,14 +92,14 @@ Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](http | # | Date | Slug | Project | Tier | Clips | Duration | Status | |---|------|------|---------|------|------:|------:|--------| -| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | provisional | -| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | superseded | -| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | superseded | +| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | provisional | +| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | superseded | +| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | superseded | ## Status definitions | Status | Meaning | |--------|---------| -| `provisional` 
| Wet-test batch; not yet approved for model training | +| `provisional` | Preview batch; consumer-integration only, not approved for training | | `approved` | QA passed; cleared for training use | -| `superseded` | Replaced by a later delivery with the same scenes at higher quality | +| `superseded` | Replaced by a later delivery covering the same scenes at higher quality | diff --git a/docs/elephant.md b/docs/elephant.md index fa62e35..2409795 100644 --- a/docs/elephant.md +++ b/docs/elephant.md @@ -1,178 +1,155 @@ # Elephant in the Room Guide -**Elephant in the Room (הפיל שבחדר)** is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. +Elephant in the Room (הפיל שבחדר) is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. **Optimisation target: high precision** — false alarms erode trust with the security team and the workers they protect. -**Optimization target: high precision.** False alarms erode trust with security staff and social workers alike. +This page is the *differential* between Elephant clips and the rest of the corpus. For shared concepts (schema, labels, audio format) follow the cross-links. 
--- -## Scene structure +## Scene profile -| Property | Value | -|----------|-------| -| Duration | 1–4 minutes | -| Tier | B (room IR + device + noise augmentation) | -| Alert window | Final 40% of the clip | -| Device profile | `pi_budget_mic` | -| Room types | `clinic_office`, `welfare_office`, `open_office` | -| Language | Hebrew (`he`) | +| | | +|---|---| +| Project code | `elephant_in_the_room` (clip-id prefix `el_*`) | +| Tier | B — room IR + device profile + background noise applied | +| Duration | 1–4 min | +| Alert window | Final 40% of the clip — violence events concentrate here | +| Device | `pi_budget_mic` | +| Room types | `clinic_office`, `welfare_office`, `open_office` (only `clinic_office` in delivery-003) | -The alert-in-final-40% constraint reflects real-world deployment: the device picks up normal consultation audio before a client becomes threatening. The model must recognize genuine escalation from a baseline of professional interaction. - -??? info "Tier B acoustic augmentation pipeline" - Tier B clips go through three augmentation steps after TTS rendering and preprocessing: - - 1. **Room impulse response (IR)** — the clean speech is convolved with a synthetic room IR (generated by `pyroomacoustics` image-source method) to simulate the acoustic of the target room type. - 2. **Device frequency response** — the `pi_budget_mic` profile applies the frequency response of a budget Raspberry Pi microphone capsule. - 3. **Background noise injection** — ambient noise events (HVAC hum, equipment sounds) are mixed in at specified SNR levels. - - After augmentation, the clip is renormalized to the same peak target (–2.0 dBFS) via the shared `peak_normalize_to_target` helper — so all tiers exit at the same absolute peak level. +The alert-in-final-40% constraint mirrors real-world deployment: the device picks up normal consultation audio for most of the session before any threat emerges. 
The model must distinguish escalation from a baseline of routine professional interaction. --- -## Speaker pair +## What Tier B adds (and why) -Delivery-003 has one Elephant speaker pair. +Tier B clips run through three augmentation stages after preprocessing. This is what separates them from She-Proves (Tier A) clips. -| Speaker dir | Male speaker | Female speaker | Backend | -|-------------|--------------|----------------|---------| -| `ben_m_40-55_003/` | `BEN_M_40-55_003` → `he-IL-AvriNeural` | `SW_F_30-45_001` → `he-IL-HilaNeural` | Azure | +| Stage | What it adds | Where to find it in metadata | +|-------|--------------|------------------------------| +| Room IR convolution | Reverb of a real-sounding room | `acoustic_scene.room_type`, `ir_source` | +| Device profile | Frequency response of a budget Pi microphone | `acoustic_scene.device` | +| Background noise injection | HVAC hum + occasional `ACOU_*` events | `acoustic_scene.background_events` | -The roles are **BEN (beneficiary/client, male) + SW (social worker, female)** — matching the most common demographic in Israeli welfare/clinic settings. +After augmentation the clip is renormalised to the same peak target (–2.0 dBFS) as Tier A, so the two tiers are comparable on the loudness dimension. -!!! note "`ben_m_40-55_003/` is a new speaker directory in delivery-003" - Downstream code that hardcoded `agg_m_30-45_001/` for She-Proves will not find these clips. Use `manifest.csv` or filter by `meta["project"] == "elephant_in_the_room"`. +!!! info "What `pyroomacoustics_ism` is" + The image-source method (ISM) synthesises a room impulse response by simulating a virtual point source reflecting off the walls of a modelled room. [`pyroomacoustics`](https://pyroomacoustics.readthedocs.io/) is the Python library that implements it. The resulting IR, when convolved with clean speech, makes the speech sound like it was recorded in the modelled room — without needing a real recording. 
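To make the convolve-then-renormalise idea concrete, here is a small standard-library sketch. It does not use `pyroomacoustics`: the impulse response is a hand-written toy (a direct path plus two decaying reflections), and `peak_normalize` is a hypothetical stand-in for the pipeline's `peak_normalize_to_target` helper, whose real signature is not shown in these docs.

```python
import math


def convolve(signal, ir):
    """Direct-form linear convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def peak_normalize(samples, target_dbfs=-2.0):
    """Apply one global gain so the absolute peak lands at target_dbfs."""
    peak = max(abs(x) for x in samples)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = 10.0 ** (target_dbfs / 20.0) / peak
    return [x * gain for x in samples]


SR = 16000
# 10 ms of a 440 Hz tone standing in for clean speech.
dry = [0.5 * math.sin(2 * math.pi * 440 * n / SR) for n in range(160)]
# Toy IR (assumption): direct sound, then two weaker, delayed reflections.
toy_ir = [1.0, 0.0, 0.0, 0.6, 0.0, 0.3]

wet = peak_normalize(convolve(dry, toy_ir))

print(len(wet))                             # 165 = 160 + 6 - 1
print(round(max(abs(x) for x in wet), 3))   # 0.794, i.e. -2.0 dBFS
```

Because the final step is a single gain, the reverberant clip ends at the same absolute peak as a clean one, which is what keeps the two tiers comparable on loudness.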
--- -## The `acoustic_scene` block +## The `acoustic_scene` field -This is the key difference between Tier A and Tier B metadata. For Elephant clips, `acoustic_scene` is fully populated: +For Tier A clips this is all `null` / empty. For Elephant clips it's fully populated: ```json "acoustic_scene": { - "room_type": "clinic_office", - "device": "pi_budget_mic", - "ir_source": "pyroomacoustics_ism", - "snr_db_actual": 11.2, - "speaker_distance_meters": 1.2, - "background_events": [ - {"type": "hvac_hum", "onset": 0.0, "offset": 147.0, "level_db": -37.4}, - {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9}, - {"type": "ACOU_FALL", "onset": 97.57, "offset": 98.473, "level_db": 9.6} - ] + "room_type": "clinic_office", + "device": "pi_budget_mic", + "ir_source": "pyroomacoustics_ism", + "snr_db_actual": 11.2, + "speaker_distance_meters": 1.2, + "background_events": [ + {"type": "hvac_hum", "onset": 0.000, "offset": 147.031, "level_db": -37.4}, + {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9}, + {"type": "ACOU_FALL", "onset": 97.570, "offset": 98.473, "level_db": 9.6} + ] } ``` -| Field | Meaning | -|-------|---------| -| `room_type` | Simulated room environment | -| `device` | Microphone/device profile applied | -| `ir_source` | Method used to generate room IR | -| `snr_db_actual` | Measured speech-to-noise ratio after mixing | -| `speaker_distance_meters` | Simulated speaker-to-mic distance | -| `background_events` | Non-speech acoustic events: type, timestamps, level | +| Field | What it tells you | +|-------|-------------------| +| `room_type` | Modelled room (`clinic_office` / `welfare_office` / `open_office`) | +| `device` | Microphone profile applied (`pi_budget_mic`) | +| `ir_source` | How the room IR was generated (currently always `pyroomacoustics_ism`) | +| `snr_db_actual` | Measured speech-to-noise ratio in dB **after** mixing — your ground truth for SNR-stratified eval | +| `speaker_distance_meters` | 
Simulated distance from speaker to microphone | +| `background_events` | List of non-speech acoustic events: `hvac_hum` (constant low-level), `ACOU_SLAM` / `ACOU_FALL` (brief, high-level) | -??? info "What is `pyroomacoustics_ism`?" - The image-source method (ISM) is an algorithm for computing room impulse responses by reflecting a virtual point source off the room's walls. `pyroomacoustics` is a Python library that implements it. +!!! info "`ACOU_*` events are double-recorded" + Each `ACOU_SLAM` / `ACOU_FALL` event lives in **both** `acoustic_scene.background_events` (with `level_db` mixing metadata) **and** the `.jsonl` strong-label file (as a regular `EventLabel` with `tier1_category: "ACOU"`). The two views are deliberate — the first carries audio-level provenance, the second is the supervision target. If you train an event detector, use the `.jsonl` view. + +--- + +## Speaker pair - The resulting IR simulates how sound travels from a speaker to a microphone in a room of specified dimensions and surface absorption coefficients — giving the audio the characteristic reverb of the target room type without recording in a real room. +One pair in delivery-003. Roles match Israeli welfare/clinic demographics: BEN (client/service-user, male) + SW (social worker, female). -??? info "Background event types" - | Type | Description | - |------|-------------| - | `hvac_hum` | Constant HVAC/ventilation hum (low level, full duration) | - | `ACOU_SLAM` | Door slam or hard object impact (brief, high level) | - | `ACOU_FALL` | Object falling or being thrown (brief, high level) | +Speaker directory: `data/he/ben_m_40-55_003/` - `ACOU_*` events are also tagged as `EventLabel` entries in the `.jsonl` strong labels with `tier1_category: "ACOU"`. This means they contribute to `weak_label.violence_categories` even in SV/IT clips where the primary violence is verbal or physical. 
+| Role | speaker_id | TTS voice | +|------|-----------|-----------| +| BEN | `BEN_M_40-55_003` | `he-IL-AvriNeural` | +| SW | `SW_F_30-45_001` | `he-IL-HilaNeural` | + +Both speakers use the Azure backend. See [Glossary — Speaker roles](glossary.md#speaker-roles) if `BEN` and `SW` are new abbreviations. --- ## Clips in delivery-003 -`data/he/ben_m_40-55_003/` +**8 clips · ~17 min · 4 violent (SV + IT), 4 non-violent (NEG + NEU) · all `room_type: clinic_office`, all `device: pi_budget_mic`, SNR ~11 dB** -| Clip ID | Typology | `has_violence` | Duration | SNR (dB) | -|---------|----------|:---:|------:|:---:| -| `el_sv_b_0001_00` | SV | ✓ | 2m 27.0s | ~11 | -| `el_sv_b_0002_00` | SV | ✓ | 2m 18.5s | ~11 | -| `el_it_b_0001_00` | IT | ✓ | 2m 30.0s | ~11 | -| `el_it_b_0002_00` | IT | ✓ | 2m 31.6s | ~11 | -| `el_neg_b_0001_00` | NEG | — | 1m 53.8s | ~11 | -| `el_neg_b_0002_00` | NEG | — | 2m 54.6s | ~11 | -| `el_neu_b_0001_00` | NEU | — | 1m 56.9s | ~11 | -| `el_neu_b_0002_00` | NEU | — | 1m 19.7s | ~11 | +??? abstract "Full clip listing" + All in `data/he/ben_m_40-55_003/`: -All 8 clips are Tier B with `device: pi_budget_mic` and `room_type: clinic_office`. 
+ | Clip ID | Typology | violent | Duration | + |---------|----------|:---:|---------:| + | `el_sv_b_0001_00` | SV | ✓ | 2m 27.0s | + | `el_sv_b_0002_00` | SV | ✓ | 2m 18.5s | + | `el_it_b_0001_00` | IT | ✓ | 2m 30.0s | + | `el_it_b_0002_00` | IT | ✓ | 2m 31.6s | + | `el_neg_b_0001_00` | NEG | — | 1m 53.8s | + | `el_neg_b_0002_00` | NEG | — | 2m 54.6s | + | `el_neu_b_0001_00` | NEU | — | 1m 56.9s | + | `el_neu_b_0002_00` | NEU | — | 1m 19.7s | --- -## Loading Elephant clips +## Loading and inspecting an Elephant clip ```python -import json -import soundfile as sf -import numpy as np -import pandas as pd +import pandas as pd, soundfile as sf, json from pathlib import Path root = Path(".") df = pd.read_csv("data/he/manifest.csv") -el_clips = df[df["project"] == "elephant_in_the_room"] +el = df[df["project"] == "elephant_in_the_room"] # 8 rows -# Load audio + metadata for a Tier B clip -clip_id = "el_sv_b_0001_00" -wav, sr = sf.read(root / f"data/he/ben_m_40-55_003/{clip_id}.wav") -meta = json.loads((root / f"data/he/ben_m_40-55_003/{clip_id}.json").read_text()) +# Pick one clip +row = el.iloc[0] +wav, sr = sf.read(root / row.wav_path) +meta = json.loads((root / row.wav_path).with_suffix(".json").read_text()) -# Inspect acoustic scene +# Acoustic scene scene = meta["acoustic_scene"] -print(f"Room: {scene['room_type']} Device: {scene['device']} SNR: {scene['snr_db_actual']} dB") -# Room: clinic_office Device: pi_budget_mic SNR: 11.2 dB +print(f"{scene['room_type']} {scene['device']} SNR {scene['snr_db_actual']} dB " + f"dist {scene['speaker_distance_meters']} m") -# Find background acoustic events +# Background acoustic events for evt in scene["background_events"]: - print(f"{evt['type']}: {evt['onset']:.1f}s – {evt['offset']:.1f}s @ {evt['level_db']} dB") -# hvac_hum: 0.0s – 147.0s @ -37.4 dB -# ACOU_SLAM: 72.2s – 72.5s @ 9.9 dB -# ACOU_FALL: 97.6s – 98.5s @ 9.6 dB + print(f" {evt['type']:10s} {evt['onset']:6.1f}s – {evt['offset']:6.1f}s @ {evt['level_db']:+5.1f} dB") 
-# Get alert window (final 40%) +# Alert window (final 40%) — for sliding-window evaluation duration = meta["duration_seconds"] -alert_start = duration * 0.60 -print(f"Alert window: {alert_start:.1f}s – {duration:.1f}s") +alert_start = 0.60 * duration -# Filter strong labels to alert window only -events = [json.loads(l) for l in - (root / f"data/he/ben_m_40-55_003/{clip_id}.jsonl").read_text().splitlines()] +events = [json.loads(l) for l in (root / row.strong_labels_path).read_text().splitlines() if l.strip()] alert_events = [e for e in events if e["onset"] >= alert_start] +print(f"alert window: {alert_start:.1f}s – {duration:.1f}s " + f"{len(alert_events)} events fire in window " + f"(of {len(events)} total)") ``` --- -## Guidance for model training - -!!! warning "This is a toy corpus — not for production training" - 8 Elephant clips from 1 speaker pair in 1 room type is insufficient for training. This delivery exists to bootstrap your data pipeline and acoustic-scene parsing code. - -**High-precision orientation:** - -- **NEG clips are essential.** Your precision target means you must not fire on `el_neg_b_*` clips — intense speech in a clinic room with background noise, but no violence. Train hard against these. -- **The alert-in-final-40% window** is where violence events concentrate. Consider a sliding-window detector that scores the final portion of each clip more aggressively than the opening. -- **SNR is ~11 dB.** This is a realistic but challenging condition for acoustic feature extraction. Verify that your features (MFCCs, log-mel, etc.) are robust at this SNR before comparing with She-Proves Tier A results. - -**Tier B–specific features:** - -- `acoustic_scene.snr_db_actual` gives you the ground-truth SNR per clip — useful for SNR-conditioned training or evaluation stratification. -- `background_events` timestamps let you train event detectors separately from the speech violence detector. 
-- `acoustic_scene.room_type` will diversify across room types at scale (`clinic_office`, `welfare_office`, `open_office`). Future deliveries will include all three. - -**What delivery-003 doesn't cover:** +## Training-time notes (specific to this project) -- Only `clinic_office` room type (all 8 clips) -- Only one speaker pair (BEN_M_40-55_003 + SW_F_30-45_001) -- No test/val split (4 unique speakers total; all are `split: train`) -- SNR variation (all ~11 dB) +- **NEG clips are essential for precision.** `el_neg_b_*` is intense speech in a clinic room with background noise but no violence. If your detector fires on these, security stops trusting it. Train hard against these. +- **The alert-in-final-40% structure is exploitable.** Consider a sliding-window detector that biases toward the back half of each clip — or use the window structure as a positional feature. Don't reward early firing. +- **SNR ~11 dB is challenging.** Verify your features (MFCCs, log-mel, etc.) are robust here before comparing with She-Proves Tier A results. SNR is recorded per clip (`acoustic_scene.snr_db_actual`) — use it for SNR-stratified eval. +- **`ACOU_*` events double as strong labels.** You can train an event detector on `ACOU_SLAM` / `ACOU_FALL` separately from the speech-violence detector and ensemble them. +- **What delivery-003 *doesn't* cover:** only `clinic_office`, only one speaker pair (BEN+SW), only Azure backend, SNR essentially constant at ~11 dB. Plan for room diversity, SNR stratification, and speaker-disjoint splits when scaling. -Plan for room-type diversity, SNR stratification, and speaker disjoint splits at scale. +!!! warning "Still a small test batch" + 8 clips, 1 room type, 1 speaker pair, 1 SNR is enough to wire up data loaders and acoustic-scene parsing. It is not enough to train a production model. Build the plumbing; wait for the real batch. 
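The SNR-stratified evaluation mentioned in the training notes above could be sketched roughly as follows. This is a minimal illustration, not part of SynthBanshee: the helper names `snr_bucket` and `accuracy_by_snr` and the bin edges are hypothetical, and the sketch assumes you have already paired each clip's `acoustic_scene.snr_db_actual` with a per-clip correctness flag from your own detector.

```python
from collections import defaultdict


def snr_bucket(snr_db, edges=(5.0, 10.0, 15.0, 20.0)):
    """Return a label such as '10-15 dB' for the bin containing snr_db.

    The bin edges are hypothetical; pick edges that suit your own data.
    """
    if snr_db < edges[0]:
        return f"<{edges[0]:g} dB"
    for lo, hi in zip(edges, edges[1:]):
        if lo <= snr_db < hi:
            return f"{lo:g}-{hi:g} dB"
    return f">={edges[-1]:g} dB"


def accuracy_by_snr(records):
    """Compute per-bucket accuracy.

    records: iterable of (snr_db, is_correct) pairs, one per clip, where
    snr_db would come from acoustic_scene.snr_db_actual.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for snr_db, is_correct in records:
        bucket = snr_bucket(snr_db)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

With delivery-003 every Elephant clip sits at roughly 11 dB, so all eight clips would fall into a single bucket; the stratification only becomes informative once later batches vary the SNR.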
diff --git a/docs/getting-started.md b/docs/getting-started.md index 0310028..a7f9cbb 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -1,183 +1,163 @@ -# Getting Started +# Start here -This guide walks through loading and using clips from the corpus in Python. All paths are relative to the repository root. +Your first 10 minutes with the corpus. By the end you'll have cloned it, verified the clone, loaded one clip with its labels, and seen what's in a transcript. -## Prerequisites +--- + +## 1. Clone ```bash -pip install soundfile numpy pandas pydantic +git clone https://github.com/DataHackIL/avdp-synth-corpus.git +cd avdp-synth-corpus ``` -??? note "Optional: full SynthBanshee schema" - If you want strict Pydantic validation against the full `ClipMetadata` schema: - ```bash - git clone https://github.com/DataHackIL/SynthBanshee - cd SynthBanshee && pip install -e . - ``` - This gives you `from synthbanshee.labels.schema import ClipMetadata` and `validate_clip()`. - For most DS workflows, plain `json.loads()` is sufficient. +No Git LFS. Total size is a few hundred megabytes for delivery-003 — the audio lives in `data/he/`, the SSML caches live in `assets/`. -## Clone the corpus +--- + +## 2. Verify the clone ```bash -git clone https://github.com/DataHackIL/avdp-synth-corpus.git -cd avdp-synth-corpus +find data/he -name "*.wav" | wc -l # expect 20 +wc -l data/he/manifest.csv # expect 21 (header + 20 rows) ``` -The repository contains the audio files directly (no LFS). Total size is moderate — `data/he/` is roughly a few hundred MB for delivery-003. +If those numbers don't match, the clone is incomplete — `git lfs pull` is not the answer (we don't use LFS). Re-clone. --- -## Load a single clip +## 3. Install the minimal Python deps + +```bash +pip install soundfile numpy pandas +``` + +That's enough for everything on this page. 
`pydantic` is only needed if you want strict schema validation; `jsonlines` only if you prefer it to the one-liner that reads `.jsonl` directly. + +??? note "When you'd want the full SynthBanshee install" + If you want `from synthbanshee.labels.schema import ClipMetadata` for strict Pydantic validation, or `synthbanshee qa-report` to re-run QA over the data directory: + ```bash + git clone https://github.com/DataHackIL/SynthBanshee + cd SynthBanshee && pip install -e . + ``` + For consuming the corpus, `json.loads()` is fine and is what the examples below use. + +--- + +## 4. Load one clip end-to-end + +The path on disk is **lowercase** even though the speaker ID in JSON is **UPPERCASE**; see [Gotcha #4](gotchas.md#4-uppercase-in-json-lowercase-on-disk). ```python import json from pathlib import Path import soundfile as sf -import numpy as np - -root = Path(".") # run from repo root +root = Path(".") # repo root +clip_dir = root / "data/he/agg_m_30-45_001" clip_id = "sp_sv_a_0001_00" -speaker_dir = root / "data/he/agg_m_30-45_001" -# --- Audio --- -wav, sr = sf.read(speaker_dir / f"{clip_id}.wav") -# wav: float64 array, shape (N,). sr: always 16000. +# Audio +wav, sr = sf.read(clip_dir / f"{clip_id}.wav") +assert sr == 16000 and wav.ndim == 1 # always 16 kHz mono -print(f"Duration: {len(wav)/sr:.1f}s Sample rate: {sr} Peak: {np.abs(wav).max():.4f}") -# Duration: 110.5s Sample rate: 16000 Peak: 0.7943 - -# --- Weak labels (ClipMetadata) --- -meta = json.loads((speaker_dir / f"{clip_id}.json").read_text()) -wl = meta["weak_label"] -print(f"Typology: {meta['violence_typology']} has_violence: {wl['has_violence']} " - f"max_intensity: {wl['max_intensity']}") -# Typology: SV has_violence: True max_intensity: 5 - -# --- Transcript --- -transcript = (speaker_dir / f"{clip_id}.txt").read_text(encoding="utf-8") -print(transcript[:200]) # Hebrew turns with timestamps +print(f"duration={len(wav)/sr:.1f}s peak={abs(wav).max():.3f}") +# duration=110.5s peak=0.794 ``` -??? 
info "Why is the peak ~0.79 (–2.0 dBFS) not 1.0?" - All clips are peak-normalized to a **–2.0 dBFS target** (not –1.0 dBFS = 1.0 linear). - This gives 2 dB of headroom above the safety limiter ceiling (–1.0 dBFS). - `preprocessing_applied.normalized_dbfs` in the JSON records the measured peak. - See [Audio Format](audio-format.md) for the full normalization pipeline. - ---- +!!! info "Why is the peak 0.794 and not 1.0?" + Clips are normalized to a **–2.0 dBFS peak target**, which is roughly 0.79 linear amplitude. Use `generation_metadata.loudness_target_peak_dbfs` to read the configured target and `preprocessing_applied.normalized_dbfs` to read the measured output peak. Full detail: [Audio Format](audio-format.md). -## Load strong-label events +Clip-level labels (weak labels): ```python -import jsonlines # pip install jsonlines +meta = json.loads((clip_dir / f"{clip_id}.json").read_text()) +wl = meta["weak_label"] +print(f"typology={meta['violence_typology']} has_violence={wl['has_violence']} " + f"intensity_max={wl['max_intensity']} categories={wl['violence_categories']}") +# typology=SV has_violence=True intensity_max=5 categories=['DIST', 'PHYS', 'VERB'] +``` -events = [] -with jsonlines.open(speaker_dir / f"{clip_id}.jsonl") as reader: - for evt in reader: - events.append(evt) +Event-level labels (strong labels): -# Or without jsonlines: +```python events = [ json.loads(line) - for line in (speaker_dir / f"{clip_id}.jsonl").read_text().splitlines() + for line in (clip_dir / f"{clip_id}.jsonl").read_text().splitlines() if line.strip() ] for evt in events[:3]: - print(f"[{evt['onset']:.1f}s – {evt['offset']:.1f}s] " - f"{evt['tier1_category']}/{evt['tier2_subtype']} I{evt['intensity']}") -# [0.8s – 10.1s] VERB/VERB_SHOUT I2 -# [10.5s – 18.7s] VERB/VERB_SHOUT I2 -# [18.3s – 29.7s] VERB/VERB_THREAT I3 + print(f"[{evt['onset']:5.1f}s – {evt['offset']:5.1f}s] " + f"{evt['speaker_role']} {evt['tier1_category']}/{evt['tier2_subtype']} " + f"I{evt['intensity']}") +# [ 
0.8s – 10.1s] AGG VERB/VERB_SHOUT I2 +# [ 10.5s – 18.7s] VIC VERB/VERB_SHOUT I2 +# [ 18.3s – 29.7s] AGG VERB/VERB_THREAT I3 ``` -??? info "What are tier1_category and tier2_subtype?" - Strong labels follow a three-level taxonomy: +The full 14-event escalation arc for this clip is the one visualised on the [home page](index.md#see-it-first) — verbal → distress → physical → settle. + +--- + +## 5. Read a transcript - **Typology** (clip-level): `SV` · `IT` · `NEG` · `NEU` +`.txt` files are turn-major with a small header block per turn. They use UTF-8 Hebrew and are intended both for human reading and as ASR reference. - **Tier 1 category** (event-level): `VERB` · `DIST` · `PHYS` · `EMOT` · `ACOU` · `NONE` +``` +[CLIP_ID: sp_sv_a_0001_00] +[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07] +מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית. +[ACTION: VERB_SHOUT | INTENSITY: 2] +[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74] +עבדתי עד שש היום. עשיתי מה שהספקתי... +``` - **Tier 2 subtype** (event-level): e.g. `VERB_SHOUT`, `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD`, `ACOU_SLAM` +!!! note "Hebrew is right-to-left; some terminals mis-render it" + macOS Terminal.app handles it correctly; older Windows consoles don't. If transcripts look reversed or garbled, view the `.txt` in an editor (VS Code, BBEdit) rather than `cat`. - See [Label Taxonomy](taxonomy.md) for the full table and has_violence derivation rule. +Timestamps in the header are already relative to the **final processed WAV** — they include the 0.5 s silence pad at the head. No shift needed. --- -## Work with the manifest +## 6. Work from the manifest, not from hardcoded paths -`data/he/manifest.csv` is a flat summary of all clips. It's the fastest entry point for filtering and dataset construction. +`data/he/manifest.csv` is one row per clip. 
It's the fastest entry point for filtering and the safest way to find files (because [hardcoded speaker directories will miss two-thirds of the clips](gotchas.md#2-dont-hardcode-speaker-directory-paths)). ```python -import pandas as pd +import pandas as pd, soundfile as sf df = pd.read_csv("data/he/manifest.csv") -print(df.columns.tolist()) +df.columns.tolist() # ['clip_id', 'project', 'violence_typology', 'tier', 'duration_seconds', # 'speaker_ids', 'voice_families', 'has_violence', 'max_intensity', # 'quality_flags', 'split', 'wav_path', 'strong_labels_path'] -# Filter by project -she_proves_clips = df[df["project"] == "she_proves"] +# Filter +violent = df[df["has_violence"]] # 10 clips +elephant = df[df["project"] == "elephant_in_the_room"] # 8 clips +sv_high = df[(df["violence_typology"] == "SV") & (df["max_intensity"] >= 4)] -# Filter by typology -sv_clips = df[df["violence_typology"] == "SV"] - -# High-intensity violent clips only -high_intensity = df[(df["has_violence"]) & (df["max_intensity"] >= 4)] - -# Load audio for a manifest row +# Load audio for any manifest row — wav_path is already repo-relative POSIX row = df.iloc[0] -wav, sr = sf.read(row["wav_path"]) # paths are repo-relative POSIX strings -``` - -!!! warning "`speaker_ids` and `voice_families` are pipe-delimited" - These columns contain multiple values joined by `|`: - ```python - speakers = row["speaker_ids"].split("|") - # ['AGG_M_30-45_001', 'VIC_F_25-40_002'] - ``` - -!!! note "All clips are `split: train` in delivery-003" - The corpus has only 4 unique speaker personas across 20 clips — speaker-disjoint splits are not feasible at this scale. When the corpus scales, speaker-disjoint train/val/test splits will be assigned by SynthBanshee. Until then, treat this as an unpartitioned pool. - ---- - -## Find a clip's speaker directory - -Clip IDs follow the pattern `{project_prefix}_{typology}_{tier}_{scene_num}_{take}`. 
The on-disk directory is the **lowercase** form of the first speaker ID listed in `speakers[]`: - -```python -def clip_dir(root: Path, clip_id: str, meta: dict) -> Path: - first_speaker = meta["speakers"][0]["speaker_id"] - return root / "data" / meta["language"] / first_speaker.lower() +wav, sr = sf.read(row["wav_path"]) +speakers = row["speaker_ids"].split("|") # pipe-delimited! +voices = row["voice_families"].split("|") # same order as speaker_ids ``` -| clip_id | speaker_dir | -|---------|-------------| -| `sp_sv_a_0001_00` | `data/he/agg_m_30-45_001/` | -| `sp_sv_a_0003_00` | `data/he/agg_m_30-45_002/` | -| `el_sv_b_0001_00` | `data/he/ben_m_40-55_003/` | - -Or use `manifest.csv` directly — `wav_path` already contains the full repo-relative path. +!!! warning "`speaker_ids` and `voice_families` are pipe-delimited strings" + They are not CSV-nested lists. Split on `|`. --- -## Validate a clip - -If you have SynthBanshee installed: - -```bash -synthbanshee validate data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav -``` - -This checks: all four files present, WAV format (16 kHz mono), peak ≤ –1.0 dBFS, duration ≥ 3 s, JSON parses as `ClipMetadata`. - -To run QA over the entire language directory: - -```bash -synthbanshee qa-report data/he/ -synthbanshee qa-report data/he/ --run-summary # adds corpus-level aggregates -``` +## 7. 
Where to go next + +| You're about to… | Read | +|------------------|------| +| Write data-loading code | [Common mistakes](gotchas.md) (2 min read; saves debugging) | +| Look up a field in `.json` | [Schema Reference](schema.md) | +| Understand `has_violence` semantics | [Label Taxonomy](taxonomy.md) | +| Look up a term (F0, SSML, IR, BEN…) | [Glossary](glossary.md) | +| Work specifically with phone-app data | [She-Proves guide](she-proves.md) | +| Work specifically with Tier B / room-augmented audio | [Elephant in the Room guide](elephant.md) | +| Verify a clip is spec-compliant | `synthbanshee validate ` (requires SynthBanshee installed) | diff --git a/docs/glossary.md b/docs/glossary.md new file mode 100644 index 0000000..7a18b7e --- /dev/null +++ b/docs/glossary.md @@ -0,0 +1,110 @@ +# Glossary + +Abbreviations and jargon that show up across the corpus and on this site, in one place. + +--- + +## Speaker roles + +The role of each speaker is encoded in the speaker_id prefix and in `speakers[].role`. + +| Code | Stands for | Used in | +|------|-----------|---------| +| `AGG` | **Aggressor** — the perpetrator in a domestic-violence scene | She-Proves clips (`AGG_M_30-45_*`) | +| `VIC` | **Victim** — the target of violence in a domestic-violence scene | She-Proves clips (`VIC_F_25-40_*`) | +| `BEN` | **Beneficiary / client** — a service-user in a welfare or clinic setting (the threatening party in Elephant scenes) | Elephant clips (`BEN_M_40-55_*`) | +| `SW` | **Social Worker** — the threatened professional in Elephant scenes | Elephant clips (`SW_F_30-45_*`) | + +The role determines the prosody profile, scene position, and which `tier1_category` events the speaker can produce. 
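Because the role is encoded in the `speaker_id` prefix, a small parser can recover it without opening the JSON. The sketch below is illustrative only: it assumes every ID follows the `ROLE_GENDER_AGEMIN-AGEMAX_INDEX` shape seen in the examples above (such as `AGG_M_30-45_001`), and both the regular expression and the `parse_speaker_id` helper are hypothetical, not part of SynthBanshee.

```python
import re

# Assumed ID shape, inferred from examples such as AGG_M_30-45_001 and SW_F_30-45_001.
SPEAKER_ID = re.compile(
    r"^(?P<role>[A-Z]+)_(?P<gender>[MF])_(?P<age_min>\d+)-(?P<age_max>\d+)_(?P<index>\d+)$"
)


def parse_speaker_id(speaker_id):
    """Split a speaker ID into role, gender, age range, and index.

    Accepts either the UPPERCASE JSON form or the lowercase directory form.
    Raises ValueError if the ID does not match the assumed shape.
    """
    match = SPEAKER_ID.match(speaker_id.upper())
    if match is None:
        raise ValueError(f"unrecognised speaker_id: {speaker_id!r}")
    parts = match.groupdict()
    return {
        "role": parts["role"],
        "gender": parts["gender"],
        "age_range": (int(parts["age_min"]), int(parts["age_max"])),
        "index": int(parts["index"]),
    }
```

Upper-casing the input first means the same helper works on directory names such as `ben_m_40-55_003`, which are the lowercase form of the first speaker ID.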
+ +--- + +## Project codes + +| Code | Project | Clip ID prefix | +|------|---------|----------------| +| `she_proves` | She-Proves smartphone app | `sp_*` | +| `elephant_in_the_room` | Elephant in the Room (clinic/welfare device) | `el_*` | + +--- + +## Violence typology + +The clip-level `violence_typology` field — not an ordered scale. See [Label Taxonomy](taxonomy.md) for details. + +| Code | Stands for | +|------|------------| +| `SV` | Severe Violence | +| `IT` | Intimate Terrorism | +| `NEG` | Negative confusor (sounds intense, no violence) | +| `NEU` | Neutral | + +--- + +## Tier 1 event category + +The event-level `tier1_category` field on each `EventLabel`. + +| Code | Stands for | +|------|------------| +| `VERB` | Verbal violence (shouting, threats, insults) | +| `DIST` | Distress vocalisations (screaming, crying under duress) | +| `PHYS` | Physical violence cues (impact sounds, struggle) | +| `EMOT` | Emotional manipulation (gaslighting, guilt-tripping) | +| `ACOU` | Acoustic non-vocal events (slams, falls) | +| `NONE` | Ambient / neutral / no violence cue | + +--- + +## Tier codes + +| Code | Meaning | +|------|---------| +| `A` | Clean audio — no room IR, no device profile, no background noise | +| `B` | Room IR + device profile + background noise injection | + +--- + +## Audio jargon + +| Term | Meaning | +|------|---------| +| **F0** | Fundamental frequency — the lowest frequency of a periodic signal; for voice, the pitch. Reported per speaker in some QA outputs. | +| **dBFS** | Decibels relative to full scale — 0 dBFS is the maximum amplitude representable by the format; –2 dBFS is ~80% of full amplitude. | +| **Peak normalization** | Applying a single gain to the whole signal so its absolute maximum matches a target level. | +| **RMS** | Root-mean-square — a measure of average signal energy. SynthBanshee uses per-turn RMS gain to enforce the loudness gradient between calm and escalated turns. 
| +| **SNR** | Signal-to-noise ratio — speech level minus background-noise level, in dB. Recorded in `acoustic_scene.snr_db_actual` for Tier B clips. | +| **IR** | Impulse response — a recording of how a room (or microphone, or speaker) responds to an idealised pulse. Convolving clean speech with a room IR makes it sound like it was recorded in that room. | +| **ISM** | Image-source method — an algorithm for synthetically generating room IRs by reflecting virtual sound sources off room walls. Implemented by `pyroomacoustics`. | +| **SSML** | Speech Synthesis Markup Language — an XML dialect that controls TTS output (pitch, rate, emphasis, breaks, voice). Azure and Google both accept SSML. | +| **TTS** | Text-to-speech — the generation of audio from a text prompt. | +| **Prosody** | The patterns of stress, intonation, pitch, and rate that make speech expressive (vs. flat). | +| **Prosody cap** | A safety clamp applied by SynthBanshee to LLM-suggested prosody values to prevent unnatural extremes (pitch ≤ +2 st, rate ∈ [0.85, 1.20]). | +| **Whisper** | OpenAI's open-weight ASR model, used internally as a sanity check that synthesised audio is still transcribable. | + +--- + +## Pipeline / corpus jargon + +| Term | Meaning | +|------|---------| +| **Dirty file** | The pre-preprocessing WAV (raw TTS-mixer output, before normalization and padding). Retained under `assets/speech/dirty/{clip_id}_dirty.wav`. | +| **Generation metadata** | The `generation_metadata` field — pipeline provenance: which TTS backend was used, which voice family, what mix mode, etc. | +| **Manifest** | The flat CSV summary at `data/he/manifest.csv` — one row per clip, columns for filtering. | +| **Strong labels** | Event-level labels in `.jsonl` files — one `EventLabel` object per labelled event, with onset/offset/category. | +| **Weak labels** | Clip-level summary labels in `.json` — `has_violence`, `max_intensity`, `violence_typology`, `violence_categories`. 
| +| **Quality flag** | A soft warning in `quality_flags` (e.g. `emotion_downgrade`). Doesn't fail validation; flags audio worth a second look. | +| **Delivery** | A merged data batch under `deliveries/{slug}/`. Each delivery records its SynthBanshee commit, metadata, and per-batch QA notes. | + +--- + +## Hebrew TTS voice IDs + +The four voices used in delivery-003: + +| Voice ID | Gender | Backend | +|----------|:---:|---------| +| `he-IL-AvriNeural` | M | Azure | +| `he-IL-HilaNeural` | F | Azure | +| `he-IL-Chirp3-HD-Achird` | M | Google Chirp 3 HD | +| `he-IL-Chirp3-HD-Achernar` | F | Google Chirp 3 HD | diff --git a/docs/gotchas.md b/docs/gotchas.md new file mode 100644 index 0000000..816c90b --- /dev/null +++ b/docs/gotchas.md @@ -0,0 +1,136 @@ +# Common mistakes + +Read this once before you write code against the corpus. Two minutes here saves a debugging session later. + +--- + +## 1. Don't derive `has_violence` from typology + +This will misclassify every `NEG` clip: + +```python +# WRONG — NEG clips will look violent because of their max_intensity +has_violence = typology in ("SV", "IT") + +# CORRECT — uses the event-level ground truth +has_violence = any(e["tier1_category"] != "NONE" for e in events) +``` + +`has_violence` in `weak_label` is **derived from strong-label events**, not from typology. NEG clips can have `max_intensity = 3` (raised voices, distress) and still be `has_violence: false` because every one of their events lands `tier1_category: "NONE"` by design. That's the whole point of NEG: hard negatives that sound intense but aren't violent. + +--- + +## 2. Don't hardcode speaker directory paths + +There's already more than one. 
Delivery-003 has three speaker directories under `data/he/`: + +``` +data/he/agg_m_30-45_001/ # She-Proves, Azure pair +data/he/agg_m_30-45_002/ # She-Proves, Google Chirp HD pair (new in delivery-003) +data/he/ben_m_40-55_003/ # Elephant in the Room, Azure pair (new in delivery-003) +``` + +Code that hardcodes `data/he/agg_m_30-45_001/` will miss two-thirds of the clips. Use `manifest.csv` (the `wav_path` column is repo-relative POSIX), or derive the directory from the first entry in `speakers[]`: + +```python +speaker_dir = root / "data" / meta["language"] / meta["speakers"][0]["speaker_id"].lower() +``` + +--- + +## 3. Audio peak is ~0.79, not 1.0 + +Clips are normalized to a **–2.0 dBFS peak target**, which is about 0.794 in linear amplitude (full scale, 0 dBFS, is linear 1.0). Loading a clip and expecting full-range float values will surprise you: + +```python +import numpy as np +import soundfile as sf +wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") +print(np.abs(wav).max()) # ~0.7943, not 1.0 +``` + +The –2 dBFS target sits 2 dB below full scale and 1 dB below the safety limiter ceiling at –1.0 dBFS. The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`; the measured peak is recorded in `preprocessing_applied.normalized_dbfs`. + +--- + +## 4. UPPERCASE in JSON, lowercase on disk + +The same speaker has two surface forms: + +| Surface | Form | Example | +|---------|------|---------| +| JSON field (`speaker_id`, `speakers[].speaker_id`) | **UPPERCASE** | `AGG_M_30-45_001` | +| Filesystem directory | **lowercase** | `agg_m_30-45_001/` | +| `clip_id` (everywhere) | **lowercase** | `sp_sv_a_0001_00` | + +If you build a dict keyed on speaker IDs from JSON and then try to look up paths with the same string, you'll get a `FileNotFoundError`. Always `.lower()` when converting from JSON to a path. + +--- + +## 5. NEG is not "violent at low intensity" + +The four violence typologies are **not** an ordered scale. 
+ +| | | +|---|---| +| `SV` | Severe Violence — physical attacks, life-threatening | +| `IT` | Intimate Terrorism — sustained coercive control, repeated abuse | +| `NEG` | **Negative confusor** — sounds intense, no violence (hard negative) | +| `NEU` | Neutral — mundane conversation | + +A NEG clip is **not** "a milder SV." It is acoustic distress that a naive model would mistake for violence. Treating NEG as a positive class will tank your precision. + +--- + +## 6. All clips are `split: train` in delivery-003 + +The `split` column exists in `manifest.csv`, but there are only 4 unique speaker personas across all 20 clips. Speaker-disjoint train/val/test partitioning isn't feasible at this scale — every clip is therefore assigned `split: train`. **Don't trust the `split` column as a usable partition.** Treat the whole corpus as an unpartitioned pool for now. SynthBanshee will assign meaningful splits once the speaker pool grows. + +--- + +## 7. `quality_flags` doesn't mean "broken" + +About 15 of 20 clips in delivery-003 carry at least one `quality_flags` entry — usually `emotion_downgrade` (the TTS produced slightly less intense prosody than the SSML asked for at high-intensity turns). These clips are still validated and spec-compliant; the flag is a soft hint, not a failure. Don't filter them out reflexively. + +The hard line is `synthbanshee validate` — a clip either passes or doesn't. If it's in the corpus, it passed. + +--- + +## 8. The 2 Google clips have a `vic_f0_high` flag — that's expected + +`sp_sv_a_0003_00` and `sp_it_a_0003_00` use the Google Chirp 3 HD female voice (`he-IL-Chirp3-HD-Achernar`), whose fundamental-frequency baseline runs higher than the Azure reference voice the QA thresholds were calibrated against. The flag is fired correctly; the audio is fine. **Don't exclude these clips on the basis of this flag** — your model needs the backend diversity. If you compute F0-derived features, calibrate per backend. + +--- + +## 9. 
Timestamps already account for silence padding + +Every clip has ≥0.5 s of silence at head and tail. **Onset/offset timestamps in `.txt` and `.jsonl` are already shifted** to refer to positions in the final processed WAV. You don't need to add the pad — read the timestamp, slice the WAV, done. + +--- + +## 10. The `.json` and `.jsonl` files aren't the same thing + +| File | Contains | When to load | +|------|----------|--------------| +| `{clip_id}.json` | `ClipMetadata` — one object per clip: weak labels, speakers, provenance, acoustic scene | Always | +| `{clip_id}.jsonl` | `EventLabel` records — one JSON object per **line**, one per labelled event in the clip | When you need per-event strong labels (onset/offset/category) | + +If you `json.loads()` the `.jsonl` you'll get an error. Read line by line. + +--- + +## Quick verification + +Use these snippets to confirm a fresh clone is intact: + +```bash +find data/he -name "*.wav" | wc -l # expect 20 +wc -l data/he/manifest.csv # expect 21 (header + 20 rows) +``` + +```python +import pandas as pd +df = pd.read_csv("data/he/manifest.csv") +assert len(df) == 20 +assert set(df["tier"]) == {"A", "B"} +assert set(df["violence_typology"]) == {"SV", "IT", "NEG", "NEU"} +assert df["has_violence"].sum() == 10 # 5 SV + 5 IT +``` diff --git a/docs/index.md b/docs/index.md index 7b5b4d2..aa280c8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,127 +1,82 @@ -# avdp-synth-corpus +# AVDP Synthetic Corpus -**Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline (AVDP)** +**Synthetic Hebrew audio clips for the Audio Violence Detection Pipeline.** +Hebrew (he-IL) · 16 kHz mono 16-bit PCM · generated by [SynthBanshee](https://github.com/DataHackIL/SynthBanshee). 
-Generated by [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) · Hebrew (he-IL) · 16 kHz mono 16-bit PCM +delivery 003 · 2026-05-12 · provisional +20 clips · ~41.6 min · `she_proves` (12) + `elephant_in_the_room` (8) · Azure (18) + Google (2) · 0 validation failures. --- -!!! warning "Toy corpus — not approved for model training" - All current deliveries are provisional wet-test batches for spec validation and pipeline bootstrapping. - The `split` field in `manifest.csv` is informational only. **Do not train production models on this data.** - See [Deliveries](deliveries.md) for the full status of each batch. +## See it first ---- - -## What is this? - -This repository contains **synthetic Hebrew audio clips** representing domestic-violence and threat scenarios, produced by a text-to-speech pipeline with automatic prosody modelling and acoustic augmentation. +A real clip from the corpus — Severe Violence scene, two speakers, with strong-label events overlaid on the waveform: -Two downstream products consume this data: +![Waveform of sp_sv_a_0001_00 with event boundaries](assets/sp_sv_a_0001_00_waveform.png) -=== "She-Proves" +You can read the typical escalation arc directly: an argument starts as verbal (`VERB`, blue), peaks into distress vocalisations (`DIST`, orange) around 36s, then into physical-violence cues (`PHYS`, red) around 71s. Intensity badges (`I2` → `I5`) follow the same curve. - A smartphone app that passively monitors audio for domestic violence incidents and preserves evidence for legal use. High-recall orientation — better to flag and review than to miss. - - → [She-Proves team guide](she-proves.md) +--- -=== "Elephant in the Room" +## Load a clip in 4 lines - A Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. High-precision orientation — false alarms erode trust. 
+```python +import json, soundfile as sf +wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") +meta = json.loads(open("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read()) +print(f"{len(wav)/sr:.1f}s has_violence={meta['weak_label']['has_violence']} " + f"intensity_max={meta['weak_label']['max_intensity']}") +# 110.5s has_violence=True intensity_max=5 +``` - → [Elephant in the Room team guide](elephant.md) +For everything else: [Start here →](getting-started.md) --- -## Current delivery at a glance +## Two consumer teams -**Delivery 003 — multi-project, multi-voice** · 2026-05-12 · provisional +
-| Dimension | Value | -|-----------|-------| -| Clips | 20 | -| Total duration | ~41.6 min | -| Projects | `she_proves` (12 clips) + `elephant_in_the_room` (8 clips) | -| Tiers | A — clean (12) + B — room-augmented (8) | -| TTS backends | Azure (18 clips) + Google Chirp 3 HD (2 clips) | -| Validation failures | 0 / 20 | -| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) | +
+
smartphone app
+### She-Proves +Passively monitors a phone for domestic-violence incidents and preserves audio evidence for legal use. **High-recall** orientation — better to flag and review than to miss. -Full breakdown: [Deliveries](deliveries.md) · [She-Proves clips](she-proves.md#clips-in-delivery-003) · [Elephant clips](elephant.md#clips-in-delivery-003) +12 clips · Tier A (clean audio) · scenes 3–6 min · phone-pocket device profile. ---- +[She-Proves guide →](she-proves.md){ .card-link } +
-## Repository layout +
+
raspberry pi · clinic / welfare office
+### Elephant in the Room +A Pi-class device that alerts security when a social worker is under threat. **High-precision** orientation — false alarms erode trust. -``` -data/ - he/ # ISO 639-1 language code - {speaker_dir}/ # e.g. agg_m_30-45_001/ (lowercase of first speaker ID) - {clip_id}.wav # 16 kHz mono 16-bit PCM - {clip_id}.txt # per-turn transcript with onset/offset markers - {clip_id}.json # ClipMetadata (weak labels, provenance, speaker info) - {clip_id}.jsonl # EventLabel records — one JSON object per line - manifest.csv # flat summary of all clips under data/he/ - -assets/ - speech/ # SHA-256-keyed per-utterance WAV cache (do not modify) - dirty/ # pre-preprocessing WAVs, retained per spec - scripts/ # SHA-256-keyed LLM script cache (do not modify) - -deliveries/ - {slug}/ - metadata.yaml # structured delivery record - notes.md # narrative QA notes and known limitations - qa-report.json # synthbanshee qa-report output -``` +8 clips · Tier B (room IR + budget mic + noise) · scenes 1–4 min · alert in final 40%. -??? info "Why are there four files per clip?" - - **`.wav`** — the audio, spec-compliant (normalized, padded, validated) - - **`.txt`** — the transcript with turn-level onset/offset markers, used as ASR reference - - **`.json`** — `ClipMetadata`: weak labels (`has_violence`, `max_intensity`), speaker list, acoustic scene, provenance (`generation_metadata`) - - **`.jsonl`** — `EventLabel` records: one line per strong-label event with category, subtype, onset, offset, intensity, emotional state +[Elephant guide →](elephant.md){ .card-link } +
- You only need `.wav` + `.json` for most training pipelines. Add `.jsonl` when you need per-event strong labels or onset/offset supervision. +
--- -## Where to start +## Where to go -| I want to… | Go to | -|------------|-------| -| Load my first clip in Python | [Getting Started → Load a clip](getting-started.md#load-a-single-clip) | -| Understand what the labels mean | [Label Taxonomy](taxonomy.md) | -| Parse `ClipMetadata` with Pydantic | [Schema Reference](schema.md) | -| Work with She-Proves scenes | [She-Proves guide](she-proves.md) | -| Work with Elephant Tier B audio | [Elephant in the Room guide](elephant.md) | -| Understand the audio normalization | [Audio Format](audio-format.md) | -| Check current quality status | [Deliveries](deliveries.md) | +| | | +|---|---| +| **First time here** | [Start here](getting-started.md) — clone, load one clip, read its labels | +| **About to write code** | [Common mistakes](gotchas.md) — read this once; it'll save you a few | +| **Decoding a label** | [Label Taxonomy](taxonomy.md) — typologies, categories, `has_violence` rule | +| **Decoding a JSON field** | [Schema Reference](schema.md) — annotated `ClipMetadata` example | +| **Working with team data** | [She-Proves](she-proves.md) · [Elephant](elephant.md) | +| **Looking up a term** | [Glossary](glossary.md) — F0, SSML, IR, AGG/VIC/SW/BEN, etc. | +| **Checking what's current** | [Deliveries](deliveries.md) — current batch, known gaps | --- -## Quick snippet +!!! warning "This is a small test batch, not training data" + All current deliveries are preview batches for verifying that downstream data-loading code works before the full dataset arrives. The `split` column in `manifest.csv` is informational only — all 20 clips are `split: train` because there aren't enough unique speakers for a disjoint partition at this scale. 
**Do not train production models on this corpus.** -```python -import json -from pathlib import Path -import soundfile as sf - -root = Path(".") # repo root - -# Load a clip -wav, sr = sf.read(root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") -meta = json.loads((root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text()) - -print(f"Duration: {len(wav)/sr:.1f}s has_violence: {meta['weak_label']['has_violence']}") -# Duration: 110.5s has_violence: True -``` - -For manifest-level operations: - -```python -import pandas as pd - -df = pd.read_csv("data/he/manifest.csv") -violent = df[df["has_violence"] == True] -print(violent[["clip_id", "project", "violence_typology", "duration_seconds"]].to_string()) -``` +!!! info "What's *not* in this corpus" + No real human recordings (synthetic TTS only) · no Arabic or English (Hebrew only) · no inter-annotator agreement metrics (labels are auto-generated by SynthBanshee) · no demographic detail beyond `gender` + `age_range`. Scripts are LLM-generated in Hebrew, not human-written. See [Glossary](glossary.md) for what each abbreviation means. diff --git a/docs/schema.md b/docs/schema.md index 98a10fd..dccbf1e 100644 --- a/docs/schema.md +++ b/docs/schema.md @@ -1,219 +1,226 @@ # Schema Reference -Every clip's `.json` file contains a `ClipMetadata` object. The authoritative Pydantic model is in [SynthBanshee `synthbanshee/labels/schema.py`](https://github.com/DataHackIL/SynthBanshee/blob/main/synthbanshee/labels/schema.py). +A real `ClipMetadata` JSON, fully annotated. Click the `+` markers to jump to a field's explanation. Fields are ordered by how often you'll actually use them: top-level → labels → speakers → augmentation (Tier B only) → provenance (diagnostic, usually skip). 
---- - -## Loading with Pydantic - -```python -from synthbanshee.labels.schema import ClipMetadata # requires SynthBanshee installed -from pathlib import Path - -meta = ClipMetadata.model_validate_json( - Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text() -) -print(meta.clip_id, meta.violence_typology, meta.weak_label.has_violence) -# sp_sv_a_0001_00 SV True -``` - -Plain JSON (no SynthBanshee required): - -```python -import json -from pathlib import Path - -meta = json.loads(Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text()) -``` +The authoritative Pydantic model lives in [SynthBanshee `synthbanshee/labels/schema.py`](https://github.com/DataHackIL/SynthBanshee/blob/main/synthbanshee/labels/schema.py). For day-to-day consumer work, `json.loads()` is fine. --- -## Top-level `ClipMetadata` fields - -| Field | Type | Description | -|-------|------|-------------| -| `clip_id` | `str` | Lowercase ASCII clip identifier, e.g. `sp_sv_a_0001_00` | -| `project` | `str` | `she_proves` or `elephant_in_the_room` | -| `language` | `str` | ISO 639-1, always `"he"` | -| `violence_typology` | `str` | `SV` / `IT` / `NEG` / `NEU` — see [taxonomy](taxonomy.md) | -| `tier` | `str` | `"A"` (clean) or `"B"` (room-augmented) | -| `duration_seconds` | `float` | Duration of the processed WAV | -| `sample_rate` | `int` | Always `16000` | -| `channels` | `int` | Always `1` | -| `is_synthetic` | `bool` | Always `true` in this corpus | -| `generator_version` | `str` | SynthBanshee semver, e.g. 
`"0.1.0"` | -| `generation_date` | `str` | ISO 8601 date of generation | -| `random_seed` | `int` | Scene-level RNG seed for reproducibility | -| `scene_config` | `str` | Relative path to the scene YAML in SynthBanshee | -| `transcript_path` | `str` | Repo-relative POSIX path to the `.txt` transcript | -| `dirty_file_path` | `str` | Repo-relative POSIX path to the pre-preprocessing WAV | -| `speakers` | `list[SpeakerInfo]` | Speaker metadata — see below | -| `weak_label` | `WeakLabel` | Clip-level summary labels | -| `generation_metadata` | `GenerationMetadata \| null` | Pipeline provenance — see below | -| `preprocessing_applied` | `PreprocessingApplied` | What preprocessing steps ran | -| `acoustic_scene` | `AcousticScene` | Room/device augmentation (Tier B) | -| `quality_flags` | `list[str]` | QA flags, e.g. `["emotion_downgrade"]` | -| `snr_db_estimated` | `float \| null` | Estimated SNR (not always populated) | -| `annotator_confidence` | `float` | Auto-label confidence, 0–1 (auto-generated: always `1.0`) | -| `iaa_reviewed` | `bool` | Whether inter-annotator agreement review was done | -| `she_proves_meta` | `null` | Reserved for She-Proves–specific metadata (future) | -| `elephant_meta` | `null` | Reserved for Elephant–specific metadata (future) | - ---- - -## `SpeakerInfo` - -One entry per speaker in `speakers[]`. - -| Field | Type | Description | -|-------|------|-------------| -| `speaker_id` | `str` | UPPERCASE persona ID, e.g. `AGG_M_30-45_001` | -| `role` | `str` | `AGG` (aggressor), `VIC` (victim), `SW` (social worker), `BEN` (beneficiary/client) | -| `gender` | `str` | `"male"` or `"female"` | -| `age_range` | `str` | e.g. `"30-45"` | -| `tts_voice_id` | `str` | TTS voice identifier, e.g. `"he-IL-AvriNeural"` | -| `voice_family` | `str` | Same as `tts_voice_id` (may diverge in future) | - -??? info "Speaker ID casing convention" - The `speaker_id` field in JSON is always **UPPERCASE**: `AGG_M_30-45_001`. 
- The on-disk directory is **lowercase**: `agg_m_30-45_001/`. - This is a deliberate per-surface casing rule — see [SynthBanshee spec §2.5](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#25-filename-constraints). - ---- - -## `WeakLabel` - -| Field | Type | Description | -|-------|------|-------------| -| `has_violence` | `bool` | `any(e.tier1_category != "NONE" for e in events)` — see [taxonomy](taxonomy.md#has_violence-the-correct-derivation) | -| `violence_typology` | `str` | Mirrors top-level `violence_typology` | -| `max_intensity` | `int` | Highest per-turn intensity across the clip (1–5) | -| `violence_categories` | `list[str]` | Distinct `tier1_category` values observed in events | - ---- +## Annotated example + +```json +{ + "clip_id": "sp_sv_a_0001_00", // (1)! + "project": "she_proves", // (2)! + "language": "he", // (3)! + "violence_typology": "SV", // (4)! + "tier": "A", // (5)! + "duration_seconds": 110.46, // (6)! + "sample_rate": 16000, // (7)! + "channels": 1, + "is_synthetic": true, // (8)! + + "weak_label": { // (9)! + "has_violence": true, + "violence_typology": "SV", + "max_intensity": 5, + "violence_categories": ["DIST", "PHYS", "VERB"] + }, + + "speakers": [ // (10)! + { + "speaker_id": "AGG_M_30-45_001", + "role": "AGG", + "gender": "male", + "age_range": "30-45", + "tts_voice_id": "he-IL-AvriNeural", + "voice_family": "he-IL-AvriNeural" + }, + { + "speaker_id": "VIC_F_25-40_002", + "role": "VIC", + "gender": "female", + "age_range": "25-40", + "tts_voice_id": "he-IL-HilaNeural", + "voice_family": "he-IL-HilaNeural" + } + ], + + "transcript_path": "data/he/agg_m_30-45_001/sp_sv_a_0001_00.txt", // (11)! + "dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav", // (12)! + + "quality_flags": ["emotion_downgrade"], // (13)! + + "acoustic_scene": { // (14)! 
+ "room_type": null, + "device": null, + "ir_source": null, + "snr_db_actual": null, + "speaker_distance_meters": null, + "background_events": [] + }, + + "preprocessing_applied": { // (15)! + "resampled_to_16k": true, + "downmixed_to_mono": true, + "normalized_dbfs": -2.0000002, + "silence_padded": true, + "denoised": true, + "spectral_filtered": true + }, + + "generation_metadata": { /* ...see below... */ }, // (16)! + + "generator_version": "0.1.0", // (17)! + "generation_date": "2026-05-12", + "random_seed": 1201, + "scene_config": "configs/scenes/she_proves/sp_sv_a_0001.yaml", + "snr_db_estimated": null, // (18)! + "annotator_confidence": 1.0, // (19)! + "iaa_reviewed": false, + "she_proves_meta": null, // (20)! + "elephant_meta": null +} +``` -## `GenerationMetadata` - -Present on all delivery-003 clips; may be `null` on older clips. - -| Field | Type | Description | -|-------|------|-------------| -| `pipeline_version` | `str` | SynthBanshee semver | -| `tts_backend` | `dict[str, str]` | Speaker ID → `"azure"` or `"google"` | -| `voice_family` | `dict[str, str]` | Speaker ID → voice family string | -| `mix_mode_used` | `str` | `"sequential"` (turns in order) or `"overlapping"` | -| `normalization_strategy` | `str` | `"per_turn_rms_v2_target_peak"` | -| `loudness_target_peak_dbfs` | `float` | Configured peak target, e.g. 
`-2.0` | -| `breathiness_applied` | `bool` | Whether breathiness augmentation was applied | -| `effective_prosody_caps` | `list[ProsodyCap]` | Per-turn cap activations at I3–I5 | -| `speaker_state_serialized` | `dict[str, SpeakerState]` | Final prosody state per speaker | -| `prosody_controller_version` | `str \| null` | Version of the prosody controller | -| `text_normalization_version` | `str \| null` | Version of text normalization | -| `timing_controller_version` | `str \| null` | Version of timing controller | - -### `ProsodyCap` (entry in `effective_prosody_caps`) - -| Field | Description | -|-------|-------------| -| `turn_index` | Zero-based turn index | -| `intensity` | Intensity score for that turn | -| `dim` | `"pitch"` or `"rate"` | -| `pre_cap` | Prosody value before capping (semitones for pitch, ratio for rate) | -| `post_cap` | Prosody value after capping | - -### `SpeakerState` (entry in `speaker_state_serialized`) - -| Field | Description | -|-------|-------------| -| `pitch_offset_st` | Final pitch offset in semitones | -| `rate_offset` | Final speaking rate multiplier | -| `volume_offset_db` | Final volume offset in dB | -| `breathiness_level` | Breathiness level 0–1 | +1. Lowercase ASCII clip identifier. Pattern: `{project_prefix}_{typology}_{tier}_{scene_num}_{take}`. +2. `she_proves` or `elephant_in_the_room`. Determines the clip-id prefix (`sp_*` / `el_*`) and which `*_meta` field is reserved for the clip (both are still `null` in current deliveries; see annotation 20). +3. ISO 639-1 — always `"he"` in this corpus. +4. `SV` · `IT` · `NEG` · `NEU`. **Not** an ordered scale — see [Label Taxonomy](taxonomy.md). `NEG` is the hard-negative class (sounds intense, not violent). +5. `"A"` (clean, TTS only) or `"B"` (room IR + device profile + background noise applied). Determines whether `acoustic_scene` is populated. +6. Duration of the final processed WAV, **including** the 0.5 s silence pad on each end. +7. Always 16000. Channels always 1. Format always 16-bit PCM WAV. +8. Always `true` in this corpus. 
The field exists because future real-recording deliveries will set it `false`. +9. Clip-level summary labels. `has_violence` is derived from events: `any(e.tier1_category != "NONE")`. Don't derive it from typology — see [Gotcha #1](gotchas.md#1-dont-derive-has_violence-from-typology). +10. One entry per speaker. The on-disk directory is **`speakers[0].speaker_id.lower()`** — UPPERCASE in JSON, lowercase on disk ([Gotcha #4](gotchas.md#4-uppercase-in-json-lowercase-on-disk)). +11. Repo-relative POSIX path to the `.txt` transcript. +12. Repo-relative POSIX path to the pre-preprocessing ("dirty") WAV, retained per spec. Useful for diagnosing normalization issues. **Don't modify** — `assets/` is managed by SynthBanshee ([Gotcha #7](gotchas.md#7-quality_flags-doesnt-mean-broken)). +13. Soft warnings. Don't filter on these reflexively — they don't fail validation. Most common: `emotion_downgrade` (TTS produced slightly less intense prosody than requested), `vic_f0_high` (Google female F0 above Azure baseline; expected on the 2 Google clips). +14. Populated for Tier B (Elephant) clips; all `null` / empty for Tier A. See [Elephant guide](elephant.md#the-acoustic_scene-field). +15. Records *what* preprocessing ran. `normalized_dbfs` is the **measured** post-preprocess peak — pair with `generation_metadata.loudness_target_peak_dbfs` (the configured target) to diagnose loudness drift. +16. Pipeline provenance. Always present on delivery-003 clips; may be `null` on older clips. Expanded below. +17. SynthBanshee version that produced this clip. Combined with `random_seed` + `scene_config`, scenes are reproducible. +18. Estimated SNR — not populated for any current delivery. Use `acoustic_scene.snr_db_actual` for Tier B. +19. Auto-label confidence; always `1.0` because labels are generated by the pipeline (not human-annotated). `iaa_reviewed` is always `false` for the same reason. +20. Reserved for per-project metadata. Always `null` in current deliveries. 
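
As a minimal sketch of how the annotations above might be used with plain `json.loads()` output, the helpers below derive the on-disk speaker directory (annotation 10), compare the measured peak with the configured target (annotations 15 and 16), and check whether `acoustic_scene` should be populated (annotations 5 and 14). The helper names, the trimmed `SAMPLE_JSON`, and the 0.01 dB tolerance are illustrative assumptions, not part of SynthBanshee; the field names and values are taken from the annotated example.

```python
import json
from typing import Optional

# Trimmed copy of the annotated example; only the fields used below are kept.
SAMPLE_JSON = """
{
  "clip_id": "sp_sv_a_0001_00",
  "tier": "A",
  "speakers": [
    {"speaker_id": "AGG_M_30-45_001", "role": "AGG"},
    {"speaker_id": "VIC_F_25-40_002", "role": "VIC"}
  ],
  "acoustic_scene": {"room_type": null, "device": null, "background_events": []},
  "preprocessing_applied": {"normalized_dbfs": -2.0000002},
  "generation_metadata": {"loudness_target_peak_dbfs": -2.0}
}
"""


def speaker_dir(meta: dict) -> str:
    """Hypothetical helper: directory name is the first speaker ID, lowercased."""
    return meta["speakers"][0]["speaker_id"].lower()


def loudness_drift_db(meta: dict) -> Optional[float]:
    """Hypothetical helper: measured peak minus configured target, in dB.

    Returns None when generation_metadata is null (possible on older clips).
    """
    provenance = meta.get("generation_metadata")
    if provenance is None:
        return None
    measured = meta["preprocessing_applied"]["normalized_dbfs"]
    return measured - provenance["loudness_target_peak_dbfs"]


def expects_acoustic_scene(meta: dict) -> bool:
    """Hypothetical helper: only Tier B clips carry a populated acoustic_scene."""
    return meta["tier"] == "B"


meta = json.loads(SAMPLE_JSON)
print(speaker_dir(meta))             # agg_m_30-45_001
print(expects_acoustic_scene(meta))  # False
drift = loudness_drift_db(meta)
print(drift is not None and abs(drift) < 0.01)  # True
```

Keeping the `None` branch in `loudness_drift_db` matters if the same code is later pointed at older clips, where `generation_metadata` may be `null`.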
--- -## `PreprocessingApplied` +## `generation_metadata` — pipeline provenance + +Expanded view of field (16). Use this block for diagnostics, not for filtering training data. + +```json +{ + "pipeline_version": "0.1.0", + "tts_backend": {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "azure"}, + "voice_family": {"AGG_M_30-45_001": "he-IL-AvriNeural", "VIC_F_25-40_002": "he-IL-HilaNeural"}, + "mix_mode_used": "sequential", + "normalization_strategy": "per_turn_rms_v2_target_peak", // internal version string; informational + "loudness_target_peak_dbfs": -2.0, + "breathiness_applied": false, + "effective_prosody_caps": [ // per-turn cap activations at I3+ + {"turn_index": 1, "intensity": 2, "dim": "rate", "pre_cap": 0.912, "post_cap": 0.95}, + {"turn_index": 4, "intensity": 4, "dim": "pitch", "pre_cap": 2.348, "post_cap": 2.0} + ], + "speaker_state_serialized": { + "AGG_M_30-45_001": {"pitch_offset_st": 1.40, "rate_offset": 1.14, "volume_offset_db": 3.80, "breathiness_level": 0.0}, + "VIC_F_25-40_002": {"pitch_offset_st": 0.56, "rate_offset": 0.89, "volume_offset_db": -2.58, "breathiness_level": 0.0} + } +} +``` -| Field | Type | Description | -|-------|------|-------------| -| `resampled_to_16k` | `bool` | Whether sample rate conversion ran | -| `downmixed_to_mono` | `bool` | Whether channel downmix ran | -| `normalized_dbfs` | `float` | **Measured** peak dBFS of the output WAV (not the target) | -| `silence_padded` | `bool` | Whether silence padding was applied | -| `denoised` | `bool` | Whether denoising ran | -| `spectral_filtered` | `bool` | Whether spectral filtering ran | +| Field | What it tells you | +|-------|-------------------| +| `tts_backend` | Per-speaker dict mapping speaker_id → `"azure"` or `"google"`. The corpus-level backend distribution is derived from this — don't look for a top-level `tts_engine` field, it was removed. | +| `voice_family` | Per-speaker dict mapping speaker_id → voice ID. Currently identical to `speakers[].tts_voice_id`. 
| +| `mix_mode_used` | `"sequential"` (turns in order) or `"overlapping"` (turns can overlap at I4+). All delivery-003 violent clips use `"overlapping"` at high intensity; calm clips use `"sequential"`. | +| `loudness_target_peak_dbfs` | The **configured** peak target (–2.0 dBFS by default). Pair with `preprocessing_applied.normalized_dbfs` (the measured peak) to detect drift. | +| `effective_prosody_caps` | Per-turn list of cap activations — when the LLM-suggested pitch or rate exceeded the safety cap. Common at I3+ in this delivery. Recording them lets you compute the "uncapped" prosody the LLM intended. | +| `speaker_state_serialized` | Final per-speaker prosody offset. Used for reproducing a scene with the same speaker drift. | -!!! note "`normalized_dbfs` is the measured peak, not the target" - Use `generation_metadata.loudness_target_peak_dbfs` for the configured target. - Use `preprocessing_applied.normalized_dbfs` to verify the actual output peak. - On delivery-003, both should be very close to `–2.0` (within floating-point precision). +??? info "Internal version-string fields" + `normalization_strategy`, `prosody_controller_version`, `text_normalization_version`, `timing_controller_version` are internal version strings. They're recorded for provenance but you won't filter on them as a consumer. --- -## `AcousticScene` - -Populated for Tier B clips. Null fields indicate Tier A (no augmentation). - -| Field | Type | Description | -|-------|------|-------------| -| `room_type` | `str \| null` | e.g. `"clinic_office"`, `"welfare_office"`, `"open_office"` | -| `device` | `str \| null` | e.g. `"pi_budget_mic"` | -| `ir_source` | `str \| null` | Room impulse response source, e.g. 
`"pyroomacoustics_ism"` | -| `snr_db_actual` | `float \| null` | Actual SNR after augmentation (dB) | -| `speaker_distance_meters` | `float \| null` | Simulated speaker distance from microphone | -| `background_events` | `list[BackgroundEvent]` | Non-speech acoustic events added | - -### `BackgroundEvent` - -| Field | Description | -|-------|-------------| -| `type` | `"hvac_hum"`, `"ACOU_SLAM"`, `"ACOU_FALL"`, etc. | -| `onset` | Start time in seconds | -| `offset` | End time in seconds | -| `level_db` | Relative level of the event (dB) | - ---- +## `EventLabel` — `.jsonl` rows + +One JSON object per line. One line per labelled event. Read line-by-line — `json.loads()` on the whole file errors. + +```json +{ + "event_id": "sp_sv_a_0001_00_EVT_004", + "clip_id": "sp_sv_a_0001_00", + "onset": 36.736, + "offset": 46.552, + "tier1_category": "DIST", + "tier2_subtype": "DIST_SCREAM", + "intensity": 4, + "speaker_id": "AGG_M_30-45_001", + "speaker_role": "AGG", + "emotional_state": "anger", + "confidence": 1.0, + "label_source": "auto", + "iaa_reviewed": false, + "truncated": false, + "notes": null +} +``` -## `EventLabel` (`.jsonl` rows) - -One JSON object per line. Each represents a single labelled event within the clip. - -| Field | Type | Description | -|-------|------|-------------| -| `event_id` | `str` | `{clip_id}_EVT_{index:03d}` | -| `clip_id` | `str` | Parent clip ID | -| `onset` | `float` | Event start time in seconds (in the processed WAV) | -| `offset` | `float` | Event end time in seconds | -| `tier1_category` | `str` | `VERB` / `DIST` / `PHYS` / `EMOT` / `ACOU` / `NONE` | -| `tier2_subtype` | `str` | e.g. `VERB_SHOUT`, `PHYS_HARD` | -| `intensity` | `int` | Turn intensity 1–5 | -| `speaker_id` | `str` | UPPERCASE speaker persona ID | -| `speaker_role` | `str` | `AGG`, `VIC`, `SW`, `BEN` | -| `emotional_state` | `str` | e.g. 
`"anger"`, `"fear"`, `"desperation"`, `"neutral"` | -| `confidence` | `float` | Auto-label confidence (always `1.0` for auto-generated) | -| `label_source` | `str` | `"auto"` for all current clips | -| `iaa_reviewed` | `bool` | Always `false` in current deliveries | -| `truncated` | `bool` | Whether the event was cut short by a turn boundary | -| `notes` | `str \| null` | Annotator notes | +| Field | Notes | +|-------|-------| +| `onset` / `offset` | Seconds in the **final processed WAV**. Already shifted to account for the 0.5 s leading silence pad. | +| `tier1_category` | `VERB` · `DIST` · `PHYS` · `EMOT` · `ACOU` · `NONE`. See [Label Taxonomy](taxonomy.md). | +| `tier2_subtype` | e.g. `VERB_SHOUT`, `DIST_SCREAM`, `PHYS_HARD`, `ACOU_SLAM`. | +| `intensity` | The intensity of the turn the event belongs to (1–5). | +| `speaker_id` | UPPERCASE. Matches one of `ClipMetadata.speakers[].speaker_id`. | +| `speaker_role` | `AGG` · `VIC` · `SW` · `BEN`. See [Glossary](glossary.md#speaker-roles). | +| `emotional_state` | Free-text label of speaker emotion at this turn (e.g. `"anger"`, `"fear"`, `"desperation"`, `"neutral"`). | +| `confidence` | Always `1.0` (labels are auto-generated). | +| `label_source` | Always `"auto"`. | +| `iaa_reviewed` | Always `false` in current deliveries — no human inter-annotator agreement review yet. | +| `truncated` | `true` if the event was cut short by a turn boundary. | --- ## Manifest CSV columns -`data/he/manifest.csv` — one row per clip. +`data/he/manifest.csv` — one row per clip, the fastest entry point for filtering. | Column | Type | Notes | |--------|------|-------| -| `clip_id` | str | Matches JSON `clip_id` | +| `clip_id` | str | Matches `ClipMetadata.clip_id` | | `project` | str | `she_proves` / `elephant_in_the_room` | | `violence_typology` | str | `SV` / `IT` / `NEG` / `NEU` | | `tier` | str | `A` / `B` | -| `duration_seconds` | float | | -| `speaker_ids` | str | Pipe-delimited, e.g. 
`AGG_M_30-45_001\|VIC_F_25-40_002` | -| `voice_families` | str | Pipe-delimited, matches `speaker_ids` order | -| `has_violence` | bool | See [taxonomy](taxonomy.md#has_violence-the-correct-derivation) | +| `duration_seconds` | float | Final WAV duration including pads | +| `speaker_ids` | str | **Pipe-delimited.** `AGG_M_30-45_001\|VIC_F_25-40_002` | +| `voice_families` | str | **Pipe-delimited**, same order as `speaker_ids` | +| `has_violence` | bool | Derived from events — see [Gotcha #1](gotchas.md#1-dont-derive-has_violence-from-typology) | | `max_intensity` | int | 1–5 | -| `quality_flags` | str | Comma-delimited flag list | -| `split` | str | `train` / `val` / `test` — all `train` in delivery-003 | -| `wav_path` | str | Repo-relative POSIX path | -| `strong_labels_path` | str | Repo-relative POSIX path to `.jsonl` | +| `quality_flags` | str | Comma-delimited soft warnings | +| `split` | str | `train` / `val` / `test` — **all `train`** in delivery-003 ([Gotcha #6](gotchas.md#6-all-clips-are-split-train-in-delivery-003)) | +| `wav_path` | str | Repo-relative POSIX path to the `.wav` | +| `strong_labels_path` | str | Repo-relative POSIX path to the `.jsonl` | + +--- + +## Transcript file format (`.txt`) + +Plain UTF-8. One turn = one header line + one or more text lines + one action line. Hebrew text only; no Latin script in the body. + +``` +[CLIP_ID: sp_sv_a_0001_00] +[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.76 | OFFSET: 10.07] +מה זה הארוחה הזאת? שאלתי אותך דבר אחד פשוט, לעשות ארוחת ערב נורמלית. +[ACTION: VERB_SHOUT | INTENSITY: 2] +[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 10.49 | OFFSET: 18.74] +עבדתי עד שש היום. עשיתי מה שהספקתי... +[ACTION: VERB_SHOUT | INTENSITY: 2] +``` + +- The first line is a single `[CLIP_ID: ...]` header. +- Each subsequent turn is a `[SPEAKER: ... | ROLE: ... | ONSET: ... | OFFSET: ...]` line, the Hebrew text, then `[ACTION: {tier2_subtype} | INTENSITY: 1–5]`. 
+- `ONSET` / `OFFSET` are in seconds, relative to the final processed WAV (already include the leading pad). +- The `.jsonl` strong labels are the canonical source for events; the transcript is for human reading and as an ASR reference. diff --git a/docs/she-proves.md b/docs/she-proves.md index fef7983..6be24c7 100644 --- a/docs/she-proves.md +++ b/docs/she-proves.md @@ -1,133 +1,117 @@ -# She-Proves Team Guide +# She-Proves Guide -She-Proves is a smartphone app that **passively monitors audio for domestic violence incidents** and preserves evidence for legal use. +She-Proves is a smartphone app that passively monitors audio for domestic-violence incidents and preserves evidence for legal use. **Optimisation target: high recall** — better to flag for review than to miss. -**Optimization target: high recall.** It is better to flag an incident for review than to miss one. +This page is the *differential* between She-Proves clips and the rest of the corpus. For shared concepts (schema, labels, audio format) follow the cross-links. 
--- -## Scene structure +## Scene profile -| Property | Value | -|----------|-------| -| Duration | 3–6 minutes | -| Tier | A (clean — no room processing) | -| Pre-incident window | ≥ 60% of clip duration before the first violence event | -| Device profile | `phone_in_pocket`, `phone_on_table`, `phone_in_hand` | -| Room types | apartment rooms (living room, bedroom, kitchen) | -| Language | Hebrew (`he`) | +| | | +|---|---| +| Project code | `she_proves` (clip-id prefix `sp_*`) | +| Tier | A — clean audio, no room/device augmentation | +| Duration | 3–6 min | +| Pre-incident window | ≥ 60% of clip is normal speech before the first violence event | +| Device | `phone_in_pocket`, `phone_on_table`, `phone_in_hand` (planned; not active in delivery-003) | +| Room | Apartment (living room, bedroom, kitchen) — planned; not active in delivery-003 | -The long pre-incident window reflects real-world deployment: the app is always listening, and incidents are rare. Models trained on this data should handle extended periods of mundane speech before a rapid escalation. +The long pre-incident window is intentional. In deployment the app is always listening; incidents are rare. A model trained only on escalation segments will miss the gradual-buildup signal that precedes most domestic-violence events. -??? info "Tier A — what does 'clean' mean?" - Tier A clips have **no acoustic augmentation** — no room impulse response convolution, no device frequency response, no background noise injection. The audio is the direct TTS-mixer output after preprocessing: peak-normalized, silence-padded, 16 kHz mono 16-bit PCM. +!!! note "What 'Tier A' means here" + Tier A audio is the direct TTS-mixer output after preprocessing — peak-normalised, silence-padded, 16 kHz mono PCM. No room IR, no microphone profile, no background noise. `acoustic_scene.room_type`, `device`, `ir_source`, and `snr_db_actual` are all `null` for every Tier A clip. 
- For Tier A, `acoustic_scene.room_type`, `device`, `ir_source`, and `snr_db_actual` are all `null`. - - Tier B (used by Elephant) adds all of the above. See [Elephant in the Room](elephant.md) for details. + Delivery-003 has no Tier-A device augmentation yet (the `phone_in_pocket` etc. profiles exist in the pipeline but aren't applied at this stage). When that's added in a future delivery, the `acoustic_scene` block will start carrying `device` while keeping `room_type` null. --- ## Speaker pairs -Delivery-003 has two She-Proves speaker pairs — one per TTS backend. +Two pairs in delivery-003, one per TTS backend. Both pairs play the **AGG (aggressor, male) + VIC (victim, female)** roles. + +=== "Azure pair (10 clips)" + Speaker directory: `data/he/agg_m_30-45_001/` + + | Role | speaker_id | TTS voice | + |------|-----------|-----------| + | AGG | `AGG_M_30-45_001` | `he-IL-AvriNeural` | + | VIC | `VIC_F_25-40_002` | `he-IL-HilaNeural` | + +=== "Google Chirp HD pair (2 clips)" + Speaker directory: `data/he/agg_m_30-45_002/` -| Pair | Speaker dir | Male speaker | Female speaker | Backend | -|------|-------------|--------------|----------------|---------| -| Azure | `agg_m_30-45_001/` | `AGG_M_30-45_001` → `he-IL-AvriNeural` | `VIC_F_25-40_002` → `he-IL-HilaNeural` | Azure | -| Google Chirp HD | `agg_m_30-45_002/` | `AGG_M_30-45_002` → `he-IL-Chirp3-HD-Achird` | `VIC_F_25-40_003` → `he-IL-Chirp3-HD-Achernar` | Google | + | Role | speaker_id | TTS voice | + |------|-----------|-----------| + | AGG | `AGG_M_30-45_002` | `he-IL-Chirp3-HD-Achird` | + | VIC | `VIC_F_25-40_003` | `he-IL-Chirp3-HD-Achernar` | -Both pairs play **AGG (aggressor, male) + VIC (victim, female)** roles. The Google pair was added in delivery-003 specifically to introduce backend diversity. + The Google pair was added in delivery-003 specifically to introduce backend diversity. Both clips carry a `vic_f0_high` flag — see [Audio Format](audio-format.md#vic_f0_high-on-the-2-google-clips). -!!! 
note "Two speaker directories" - Clips from the Azure pair live under `data/he/agg_m_30-45_001/`. - Clips from the Google pair live under `data/he/agg_m_30-45_002/`. - Downstream code that hardcodes `agg_m_30-45_001/` will miss the Google clips. - Use `manifest.csv` or filter `meta["generation_metadata"]["tts_backend"]` to find both. +[Gotcha #2: don't hardcode `agg_m_30-45_001/`](gotchas.md#2-dont-hardcode-speaker-directory-paths) — three speaker directories exist now, including one for Elephant. Filter on the `project` column of `manifest.csv` (`project == "she_proves"`) or on `meta["project"]`. --- ## Clips in delivery-003 -### Azure pair — 10 clips +**12 clips · ~23.8 min · 6 violent (`SV` + `IT`), 6 non-violent (`NEG` + `NEU`)** -`data/he/agg_m_30-45_001/` +??? abstract "Full clip listing" + Azure pair, `data/he/agg_m_30-45_001/` — 10 clips: -| Clip ID | Typology | `has_violence` | Duration | -|---------|----------|:---:|------:| -| `sp_sv_a_0001_00` | SV | ✓ | 1m 50.5s | -| `sp_sv_a_0002_00` | SV | ✓ | 1m 32.1s | -| `sp_it_a_0001_00` | IT | ✓ | 2m 23.8s | -| `sp_it_a_0002_00` | IT | ✓ | 2m 19.7s | -| `sp_neg_a_0001_00` | NEG | — | 1m 58.8s | -| `sp_neg_a_0002_00` | NEG | — | 1m 47.8s | -| `sp_neg_a_0003_00` | NEG | — | 2m 26.3s | -| `sp_neu_a_0001_00` | NEU | — | 1m 59.2s | -| `sp_neu_a_0002_00` | NEU | — | 2m 09.0s | -| `sp_neu_a_0003_00` | NEU | — | 1m 45.1s | + | Clip ID | Typology | violent | Duration | + |---------|----------|:---:|---------:| + | `sp_sv_a_0001_00` | SV | ✓ | 1m 50.5s | + | `sp_sv_a_0002_00` | SV | ✓ | 1m 32.1s | + | `sp_it_a_0001_00` | IT | ✓ | 2m 23.8s | + | `sp_it_a_0002_00` | IT | ✓ | 2m 19.7s | + | `sp_neg_a_0001_00` | NEG | — | 1m 58.8s | + | `sp_neg_a_0002_00` | NEG | — | 1m 47.8s | + | `sp_neg_a_0003_00` | NEG | — | 2m 26.3s | + | `sp_neu_a_0001_00` | NEU | — | 1m 59.2s | + | `sp_neu_a_0002_00` | NEU | — | 2m 09.0s | + | `sp_neu_a_0003_00` | NEU | — | 1m 45.1s | -### Google Chirp HD pair — 2 clips + Google Chirp HD pair, `data/he/agg_m_30-45_002/` — 2 
clips: -`data/he/agg_m_30-45_002/` + | Clip ID | Typology | violent | Duration | Flags | + |---------|----------|:---:|---------:|-------| + | `sp_sv_a_0003_00` | SV | ✓ | 1m 42.8s | `vic_f0_high` | + | `sp_it_a_0003_00` | IT | ✓ | 1m 53.9s | `vic_f0_high` | -| Clip ID | Typology | `has_violence` | Duration | Note | -|---------|----------|:---:|------:|------| -| `sp_sv_a_0003_00` | SV | ✓ | 1m 42.8s | `vic_f0_high` flag | -| `sp_it_a_0003_00` | IT | ✓ | 1m 53.9s | `vic_f0_high` flag | - -The `vic_f0_high` flag on the Google clips indicates the female voice (`he-IL-Chirp3-HD-Achernar`) has a higher F0 baseline than the Azure Hila reference. See [Audio Format → vic_f0_high](audio-format.md#vic_f0_high-google-chirp-hd-female-f0-baseline). +The waveform on the [home page](index.md#see-it-first) is `sp_sv_a_0001_00` — a worked example of an SV escalation arc in this project's data. --- -## Loading She-Proves clips +## Loading just the She-Proves clips ```python -import json -import soundfile as sf -import pandas as pd +import pandas as pd, soundfile as sf, json from pathlib import Path root = Path(".") - -# Via manifest — easiest df = pd.read_csv("data/he/manifest.csv") -sp_clips = df[df["project"] == "she_proves"] - -# Load all She-Proves audio -wavs = {} -for _, row in sp_clips.iterrows(): - wav, sr = sf.read(root / row["wav_path"]) - wavs[row["clip_id"]] = wav - -# Filter to violent She-Proves clips only -sp_violent = sp_clips[sp_clips["has_violence"] == True] - -# Get per-backend split -sp_clips["backend"] = sp_clips["voice_families"].apply( - lambda v: "google" if "Chirp" in v else "azure" -) -print(sp_clips.groupby("backend")["clip_id"].count()) -# azure 10 -# google 2 -``` +sp = df[df["project"] == "she_proves"] # 12 rows ---- - -## Guidance for model training - -!!! warning "This is a toy corpus — not for production training" - 12 She-Proves clips (10 Azure + 2 Google) are not enough for training a production model. 
Use this delivery to validate your data pipeline and schema parsing. Full-scale data follows. +# Tag backend per row (Google clips have "Chirp" in voice_families) +sp = sp.assign(backend=sp["voice_families"].str.contains("Chirp").map({True: "google", False: "azure"})) +print(sp.groupby("backend")["clip_id"].count()) +# azure 10 +# google 2 -**High-recall orientation:** - -- **NEG clips are your hardest negatives.** They contain intense speech (raised voices, arguments, crying) with `has_violence: false`. Your recall model must not fire on them. -- **The pre-incident window** (first 60% of the clip) will look like NEU/low-intensity speech. Include it in your training windows — models that only see escalated segments will miss early warning signals. -- **Per-turn intensity** in the `.jsonl` events gives you fine-grained supervision beyond binary `has_violence`. Consider training an intensity regressor as an auxiliary objective. +# Load audio for each row; sf.read returns a (samples, sample_rate) tuple +audio = {row.clip_id: sf.read(root / row.wav_path) for row in sp.itertuples()} +``` -**Backend diversity:** +--- -The 2 Google Chirp HD clips expose your feature extractor to a different F0 baseline and spectral profile. At small scale, they're useful for checking that your features don't overfit to Azure voice characteristics. +## Training-time notes (specific to this project) -**Speaker splits:** +- **NEG clips are your hardest negatives.** `sp_neg_a_*` clips have raised voices, distress, arguments — and `has_violence: false`. A recall-tuned model that fires on these will tank precision. See [Gotcha #1](gotchas.md#1-dont-derive-has_violence-from-typology) and [Gotcha #5](gotchas.md#5-neg-is-not-violent-at-low-intensity). +- **Use the pre-incident window.** The first 60% of each violent clip looks like NEU-grade speech. Train across the full clip, not only on escalation segments — early-warning signal lives in the buildup. 
+- **Per-turn intensity is a useful auxiliary objective.** `EventLabel.intensity` gives turn-level supervision beyond binary `has_violence`. An intensity regressor trained alongside the classifier can improve the latter. +- **Only 2 voice families per gender in this delivery** (`low_voice_diversity_*` is flagged at the corpus level). Expect your acoustic features to over-fit to AvriNeural and HilaNeural — track per-voice eval separately when the corpus grows. +- **No device/room augmentation yet on She-Proves clips.** When the `phone_in_pocket` profile activates in a future delivery, your model will see substantially more high-frequency roll-off and handling noise than what's in delivery-003. -All 12 clips share 2 unique speaker personas (4 if you count Azure+Google pairs separately). There are not enough speakers for a speaker-disjoint split in this delivery. Re-evaluate when the corpus scales to 100+ speakers. +!!! warning "Still a small test batch" + 12 clips and 4 voices are enough to wire up your data loaders, label parsers, and evaluation harness. They are not enough to train a production model. Build the plumbing; wait for the real batch. diff --git a/docs/taxonomy.md b/docs/taxonomy.md index 8154a06..973f5cd 100644 --- a/docs/taxonomy.md +++ b/docs/taxonomy.md @@ -1,126 +1,137 @@ # Label Taxonomy -Labels follow a three-level hierarchy. The **source of truth** is `taxonomy.yaml` in the [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) repo. Never derive labels from field names alone — always read from the actual data. +Three levels: clip-level **typology** → event-level **tier 1 category** → event-level **tier 2 subtype**. Plus a per-turn **intensity** (1–5) that drives prosody. The source of truth is `taxonomy.yaml` in [SynthBanshee](https://github.com/DataHackIL/SynthBanshee). --- -## Violence typologies (clip-level) +## Violence typology (clip-level) -The `violence_typology` field classifies the overall scenario of the clip. 
+The `violence_typology` field. **Not** an ordered scale. -| Typology | Full name | Description | -|----------|-----------|-------------| -| `SV` | Severe Violence | Physical violence, life-threatening escalation | -| `IT` | Intimate Terrorism | Systematic coercive control, repeated verbal/emotional abuse | -| `NEG` | Negative / Confusor | Acoustically intense but non-violent — anger, argument, distress, crying | -| `NEU` | Neutral | Calm or mundane conversation with no violence markers | +| Code | Name | What it sounds like | +|------|------|---------------------| +| `SV` | Severe Violence | Physical violence, life-threatening escalation. `tier1_category` includes `PHYS`, `DIST`, often `VERB`. | +| `IT` | Intimate Terrorism | Sustained coercive control, repeated verbal/emotional abuse — typically without physical attack. Heavy on `VERB` and `EMOT`. | +| `NEG` | Negative confusor | Acoustically intense but non-violent — anger, argument, distress, crying. **Hard negative class.** All events are `tier1_category: "NONE"`. | +| `NEU` | Neutral | Calm or mundane conversation. No violence markers. | -??? info "Why NEG is not the same as non-violent IT/SV" - NEG clips are designed as **hard negatives** — they sound intense and may have raised voices, crying, or confrontational tone, but no actual violence occurs. Their purpose is to train models to distinguish acoustic distress from violence. - - Models that rely only on loudness or emotional tone will misclassify NEG clips. This is by design. +!!! danger "NEG is the trap" + A NEG clip can have raised voices, crying, and `max_intensity: 3`. It will *sound* like violence to a model that only listens for loudness or emotional tone. But it is by definition `has_violence: false` — its purpose is to teach your model the difference between distress and violence. Training NEG as a positive class will collapse your precision. See [Gotcha #5](gotchas.md#5-neg-is-not-violent-at-low-intensity). 
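The typology rules above lend themselves to a quick consistency check on incoming metadata. The following is a minimal sketch, assuming each clip is available as a flat dict carrying `clip_id`, `violence_typology`, and `has_violence`; the real metadata may nest these fields differently, and the helper name `find_mislabelled_clips` and the sample records are hypothetical.

```python
# Typologies that must never be labelled violent, per the table above.
NON_VIOLENT_TYPOLOGIES = {"NEG", "NEU"}


def find_mislabelled_clips(clips):
    """Return clip_ids whose typology is NEG/NEU but has_violence is true."""
    return [
        clip["clip_id"]
        for clip in clips
        if clip["violence_typology"] in NON_VIOLENT_TYPOLOGIES
        and clip["has_violence"]
    ]


# Hypothetical flat records for illustration only (not taken from the corpus).
clips = [
    {"clip_id": "clip_neg", "violence_typology": "NEG", "has_violence": False},
    {"clip_id": "clip_sv", "violence_typology": "SV", "has_violence": True},
    {"clip_id": "clip_bad", "violence_typology": "NEG", "has_violence": True},
]

print(find_mislabelled_clips(clips))  # ['clip_bad']
```

An empty result is what a well-formed delivery should produce; any returned ID points at a clip worth inspecting before training.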
--- -## `has_violence` — the correct derivation - -`has_violence` is a **derived convenience field** computed from the strong-label events, not from typology: +## `has_violence` — derived from events ```python has_violence = any(e["tier1_category"] != "NONE" for e in events) ``` -This means: +That's the rule. Three consequences worth knowing: -- `NEG` clips are **always** `has_violence: false`, regardless of `max_intensity` — by definition, every event in a NEG clip lands `tier1_category: "NONE"`. -- A `NEU` clip with even one stray non-NONE event would be `has_violence: true` (shouldn't happen in a well-labelled corpus, but the rule is defensive). +- **NEG clips are always `has_violence: false`** — every event in a NEG clip has `tier1_category: "NONE"` by construction, even when `max_intensity` is high. +- **NEU clips are always `has_violence: false`** for the same reason. +- A `SV` or `IT` clip is `has_violence: true` because at least one event has a non-NONE category. -!!! danger "Do not re-derive `has_violence` from typology + intensity" +!!! danger "Don't derive `has_violence` from typology or intensity" ```python - # WRONG — will misclassify every NEG clip - has_violence = typology in ("SV", "IT") - - # CORRECT - has_violence = any(e["tier1_category"] != "NONE" for e in events) + has_violence = typology in ("SV", "IT") # WRONG — works on this corpus but fragile + has_violence = max_intensity >= 3 # VERY WRONG — fires on every NEG clip ``` - The taxonomy columns are the ground truth. `has_violence` exists only for fast filtering and baseline modelling — never use it as the sole training label. + The event-level taxonomy is the ground truth. `weak_label.has_violence` exists for fast filtering and baseline modelling only — never as the sole training label. Train on the strong-label events when you can. --- ## Tier 1 categories (event-level) -Each `EventLabel` in the `.jsonl` file has a `tier1_category`: +The `tier1_category` field on each `EventLabel`. Six values. 
-| Category | Description | Example contexts | -|----------|-------------|-----------------| -| `VERB` | Verbal violence — threats, shouting, demeaning language | Arguments, intimidation | -| `DIST` | Distress vocalisations — screaming, crying under duress | Peak escalation turns | -| `PHYS` | Physical violence cues — impact sounds, struggle | Severe violence scenes | -| `EMOT` | Emotional manipulation — guilt-tripping, gaslighting | IT/coercive control | -| `ACOU` | Acoustic events — object impacts, slams, falls | Background events in Tier B | -| `NONE` | No violence — ambient speech, neutral turns | All NEU/NEG events | +| Category | What it covers | Where it shows up | +|----------|----------------|-------------------| +| `VERB` | Verbal violence — threats, shouting, demeaning language | Most violent clips, all intensity levels | +| `DIST` | Distress vocalisations — screaming, crying under duress | I3+ turns in SV/IT, peak escalation | +| `PHYS` | Physical violence cues — impact sounds, struggle | I4+ turns in SV clips | +| `EMOT` | Emotional manipulation — gaslighting, guilt-tripping | IT clips, coercive control turns | +| `ACOU` | Acoustic non-vocal events — slams, falls | Tier B clips, recorded in `acoustic_scene.background_events` | +| `NONE` | Ambient speech / neutral turn | All NEU clips, all NEG clips, calm turns in SV/IT | -??? info "ACOU vs DIST" - `ACOU` captures **non-vocal acoustic cues** — a door slam, an object falling, an impact sound. These appear in Tier B clips as `background_events` in the `acoustic_scene` block. +!!! info "ACOU vs DIST" + `ACOU` covers **non-vocal** acoustic cues — a door slam, an object hitting the floor. `DIST` covers **vocal distress** — a scream, crying. A scene where someone throws a glass and the victim screams will have an `ACOU_FALL` event for the glass and a `DIST_SCREAM` event for the scream. - `DIST` captures **vocal distress** — screams, panic vocalisations, crying under coercion. 
+ Tier B Elephant clips inject `ACOU_*` events as part of room augmentation; they show up both in `acoustic_scene.background_events` (with audio-level metadata) and in `.jsonl` strong labels (as labelled events). Tier A She-Proves clips can't produce ACOU events — there's no room-augmentation stage to add them. --- ## Tier 2 subtypes (event-level) -| Tier 1 | Tier 2 subtype | Description | -|--------|----------------|-------------| -| VERB | `VERB_SHOUT` | Raised or shouted speech | -| VERB | `VERB_THREAT` | Direct verbal threats | -| VERB | `VERB_INSULT` | Demeaning or insulting language | -| DIST | `DIST_SCREAM` | Distress scream or panic vocalisation | -| DIST | `DIST_CRY` | Crying or sobbing under duress | -| PHYS | `PHYS_HARD` | Hard physical impact cue | -| PHYS | `PHYS_SOFT` | Softer physical contact cue | -| EMOT | `EMOT_GASLIGHT` | Gaslighting or reality-denial | -| EMOT | `EMOT_GUILT` | Guilt-tripping or emotional coercion | -| ACOU | `ACOU_SLAM` | Object slam or door slam | -| ACOU | `ACOU_FALL` | Object falling or thrown | -| NONE | `NONE_AMBIENT` | Regular ambient speech or neutral turn | +| Tier 1 | Tier 2 | Description | +|--------|--------|-------------| +| `VERB` | `VERB_SHOUT` | Raised or shouted speech | +| `VERB` | `VERB_THREAT` | Direct verbal threats | +| `VERB` | `VERB_INSULT` | Demeaning or insulting language | +| `DIST` | `DIST_SCREAM` | Distress scream or panic vocalisation | +| `DIST` | `DIST_CRY` | Crying or sobbing under duress | +| `PHYS` | `PHYS_HARD` | Hard physical impact cue | +| `PHYS` | `PHYS_SOFT` | Softer physical contact cue | +| `EMOT` | `EMOT_GASLIGHT` | Gaslighting or reality-denial | +| `EMOT` | `EMOT_GUILT` | Guilt-tripping or emotional coercion | +| `ACOU` | `ACOU_SLAM` | Object slam or door slam | +| `ACOU` | `ACOU_FALL` | Object falling or thrown | +| `NONE` | `NONE_AMBIENT` | Regular ambient speech or neutral turn | --- ## Intensity scale (turn-level) -Intensity is scored 1–5 per dialogue turn. 
It controls prosody generation (pitch, rate, volume) and determines which tier1/tier2 labels are applied. +Each turn has an `intensity` in `[1, 5]`. It controls prosody generation (pitch, rate, volume) and the LLM script tone. + +| Score | Label | What's happening | +|-------|-------|------------------| +| 1 | Low tension | Calm conversation, mild undercurrent | +| 2 | Moderate tension | Noticeable friction, raised voices | +| 3 | Active conflict | Clear verbal aggression or intimidation | +| 4 | Escalated violence | Physical or high-intensity verbal violence | +| 5 | Extreme | Severe physical violence, panic, imminent danger | + +### How intensity and typology relate + +They are correlated but not the same. + +| Typology | Typical `max_intensity` range | Why | +|----------|:-----------------------------:|-----| +| `NEU` | 1–2 | Mundane conversation by definition | +| `NEG` | 2–3 | Distressed but non-violent; intensity rises with shouting/crying, but no PHYS/DIST events fire | +| `IT` | 3–5 | Sustained verbal/emotional aggression; can hit I5 on threats without physical violence | +| `SV` | 4–5 | Physical escalation requires I4+ turns | -| Score | Label | Description | Prosody profile | -|-------|-------|-------------|----------------| -| 1 | Low tension | Calm conversation, mild undercurrent | Near-neutral | -| 2 | Moderate tension | Noticeable friction, raised voices | Slightly raised pitch/rate | -| 3 | Active conflict | Clear verbal aggression or intimidation | Elevated pitch, faster rate | -| 4 | Escalated violence | Physical or high-intensity verbal violence | High pitch, fast rate, volume up | -| 5 | Extreme / life-threatening | Severe physical violence, panic | Maximally expressive (capped) | +In delivery-003 the actual distribution is `max_intensity` 5 = 10 clips, 3 = 4 clips, 2 = 6 clips. Useful for designing stratified eval splits: if you want a balanced eval set across intensity *and* typology, you'll need to upsample (or wait for more data). -??? 
info "The prosody cap at I4–I5" - At intensity 4–5, the LLM-generated prosody values are capped before SSML rendering to prevent Whisper transcription failures and maintain naturalness. The cap values are: +??? info "What is the prosody cap?" + At I3+, the LLM-suggested prosody values are clamped before SSML rendering to keep speech natural and transcribable by Whisper: - - **Pitch:** max +2.0 semitones (post-cap) - - **Rate:** range [0.85, 1.20] (post-cap) + - **Pitch:** capped at +2.0 semitones (post-cap) + - **Rate:** clamped to [0.85, 1.20] - Any cap activation is recorded in `generation_metadata.effective_prosody_caps` per turn. You'll see many activations at I4–I5 in delivery-003 — this is expected. The cap was calibrated in a listening test in May 2026 (SynthBanshee PR #87). + When clamping fires, the pre- and post-cap values are recorded per turn in `generation_metadata.effective_prosody_caps`. You'll see many activations at I4–I5 in delivery-003 — that's the intended behaviour, calibrated by listening test in May 2026 (SynthBanshee PR #87). --- -## Distribution in delivery-003 +## Where the labels come from + +- **Strong labels (`.jsonl`)** are generated by SynthBanshee from the LLM-authored script — the LLM produces turn-level intensity and an action tag (`VERB_SHOUT`, `DIST_SCREAM`, …), SynthBanshee converts them into `EventLabel` records. +- **Weak labels (`.json` → `weak_label`)** are derived from the strong labels by aggregation. +- **No human annotation has happened.** `confidence` is always `1.0`; `label_source` is always `"auto"`; `iaa_reviewed` is always `false`. Future deliveries may introduce human review on a subset — they'll set `iaa_reviewed: true` per clip when that happens. + +The scripts themselves are LLM-generated Hebrew dialogue, conditioned on the scene YAML and persona definitions in SynthBanshee. They are **not** transcripts of real conversations. 
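The strong-to-weak aggregation described above can be sketched in a few lines. This is an illustration under stated assumptions, not SynthBanshee's implementation: it assumes each line of the `.jsonl` file is one JSON object with `tier1_category` and `intensity` keys, and that `max_intensity` is the highest per-event intensity in the clip. The helper names `load_events` and `aggregate_weak_label` and the sample events are hypothetical.

```python
import json
from pathlib import Path


def load_events(jsonl_path):
    """Read one event record per non-empty line of a strong-label .jsonl file."""
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def aggregate_weak_label(events):
    """Collapse event-level labels into clip-level weak-label fields."""
    return {
        # Rule stated in this document: any non-NONE event marks the clip violent.
        "has_violence": any(e["tier1_category"] != "NONE" for e in events),
        # Assumption: max_intensity is the highest per-event intensity.
        "max_intensity": max((e["intensity"] for e in events), default=0),
    }


# Hypothetical events for illustration only (not taken from the corpus).
neg_like = [
    {"tier1_category": "NONE", "intensity": 2},
    {"tier1_category": "NONE", "intensity": 3},
]
sv_like = [
    {"tier1_category": "NONE", "intensity": 1},
    {"tier1_category": "VERB", "intensity": 4},
    {"tier1_category": "PHYS", "intensity": 5},
]

print(aggregate_weak_label(neg_like))  # {'has_violence': False, 'max_intensity': 3}
print(aggregate_weak_label(sv_like))   # {'has_violence': True, 'max_intensity': 5}
```

Comparing the recomputed values with the shipped `weak_label` block for each clip is a cheap way to confirm that a parser reads the event files the way the corpus intends.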
-| Typology | Clips | Projects | Tiers | -|----------|------:|---------|-------| -| SV | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | -| IT | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | -| NEG | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | -| NEU | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | +--- + +## Distribution in delivery-003 -Intensity distribution across all 20 clips: +| Typology | Clips | Tier A (she_proves) | Tier B (elephant) | +|----------|------:|:-------------------:|:------------------:| +| `SV` | 5 | 3 | 2 | +| `IT` | 5 | 3 | 2 | +| `NEG` | 5 | 3 | 2 | +| `NEU` | 5 | 3 | 2 | -| Max intensity | Clips | -|:---:|:---:| -| 5 | 10 | -| 3 | 4 | -| 2 | 6 | +Balanced across typology and across project. Not balanced across speakers — see [Deliveries](deliveries.md#known-limitations). diff --git a/mkdocs.yml b/mkdocs.yml index 11dbc73..d6b6d04 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,5 +1,5 @@ -site_name: avdp-synth-corpus -site_description: Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline — consumer guide for She-Proves and Elephant in the Room teams +site_name: AVDP Synthetic Corpus +site_description: Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline — consumer guide for the She-Proves and Elephant in the Room teams site_url: https://datahackil.github.io/avdp-synth-corpus/ repo_url: https://github.com/DataHackIL/avdp-synth-corpus repo_name: DataHackIL/avdp-synth-corpus @@ -26,7 +26,6 @@ theme: - navigation.tabs - navigation.tabs.sticky - navigation.sections - - navigation.expand - navigation.indexes - navigation.top - toc.follow @@ -38,6 +37,9 @@ theme: - content.tabs.link - announce.dismiss +extra_css: + - assets/extra.css + markdown_extensions: - admonition - pymdownx.details @@ -63,6 +65,7 @@ markdown_extensions: - toc: permalink: true - def_list + - abbr plugins: - search: @@ -70,7 +73,8 @@ plugins: nav: - Home: index.md - - Getting Started: getting-started.md + - Start 
here: getting-started.md + - Common mistakes: gotchas.md - Team Guides: - She-Proves: she-proves.md - Elephant in the Room: elephant.md @@ -78,6 +82,7 @@ nav: - Label Taxonomy: taxonomy.md - Schema Reference: schema.md - Audio Format: audio-format.md + - Glossary: glossary.md - Deliveries: deliveries.md extra: