diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml new file mode 100644 index 0000000..c300fe0 --- /dev/null +++ b/.github/workflows/docs.yml @@ -0,0 +1,29 @@ +name: Deploy docs + +on: + push: + branches: + - main + paths: + - "docs/**" + - "mkdocs.yml" + workflow_dispatch: + +permissions: + contents: write + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: "3.12" + + - name: Install MkDocs Material + run: pip install mkdocs-material + + - name: Deploy to GitHub Pages + run: mkdocs gh-deploy --force diff --git a/docs/assets/logo.svg b/docs/assets/logo.svg new file mode 100644 index 0000000..138e42d --- /dev/null +++ b/docs/assets/logo.svg @@ -0,0 +1,10 @@ + + + + + + + + + + diff --git a/docs/audio-format.md b/docs/audio-format.md new file mode 100644 index 0000000..b0173a9 --- /dev/null +++ b/docs/audio-format.md @@ -0,0 +1,132 @@ +# Audio Format + +All clips in the corpus conform to the following hard constraints. Clips that fail these checks are rejected at generation time and will not appear in the corpus. 
+ +--- + +## Format requirements + +| Property | Value | +|----------|-------| +| Sample rate | 16 000 Hz | +| Channels | 1 (mono) | +| Bit depth | 16-bit PCM | +| Peak level | ≤ –1.0 dBFS (safety ceiling) | +| Duration | ≥ 3.0 s | +| Encoding | WAV (no lossy formats) | + +```python +import soundfile as sf +import numpy as np + +wav, sr = sf.read("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") +assert sr == 16000 +assert wav.ndim == 1 # mono +assert wav.dtype == np.float64 # soundfile returns float64 by default +assert np.abs(wav).max() <= 10 ** (-1.0 / 20) # -1.0 dBFS ≈ linear amplitude 0.891 + +# Check format info +info = sf.info("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") +print(info.subtype) # PCM_16 +``` + +--- + +## Normalization pipeline + +Each clip passes through two normalization steps followed by a safety limiter: + +``` +TTS render (float32, arbitrary loudness) + ↓ +[1] Per-turn RMS gain (M3a) — preserves inter-turn contrast + ↓ +[2] Single global peak gain — lands absolute peak at target_peak_dbfs + ↓ +[3] Safety limiter — clips at ≤ –1.0 dBFS (guaranteed no-op for in-spec targets, –12.0 to –1.5 dBFS) + ↓ +Tier B only: room IR + device → renormalize to same target + ↓ +Output WAV +``` + +### Step 1 — Per-turn RMS gain (M3a) + +Each dialogue turn is gain-adjusted so its RMS matches a per-intensity target. This preserves the acoustic contrast between calm and escalated turns — a whispered turn at I1 stays quieter than a shouted turn at I5 — while giving the subsequent global normalization a stable peak-to-RMS ratio to work with. + +??? info "Why per-turn RMS matters" + Without per-turn normalization, the TTS engine produces flat RMS across intensities regardless of the requested prosody. The raw Azure and Google outputs are nearly constant-loudness even when the SSML requests "shout" style. Per-turn RMS gain is the mechanism that creates the acoustic loudness gradient you expect to see in the data. 
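The first two stages of the pipeline diagram above can be illustrated with a minimal sketch. This is not SynthBanshee's implementation: the function names and the per-intensity RMS targets (0.05 and 0.20) are hypothetical, chosen only to show that a single global gain leaves the per-turn RMS ratio untouched.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a signal (linear amplitude)."""
    return float(np.sqrt(np.mean(np.square(x))))


def apply_rms_gain(turn: np.ndarray, target_rms: float) -> np.ndarray:
    """Scale one dialogue turn so its RMS matches target_rms."""
    current = rms(turn)
    if current == 0.0:
        return turn.copy()  # silent turn: nothing to scale
    return turn * (target_rms / current)


def global_peak_gain(mix: np.ndarray, target_peak_dbfs: float) -> np.ndarray:
    """Apply one gain to the whole mix so its absolute peak hits the target."""
    peak = float(np.max(np.abs(mix)))
    if peak == 0.0:
        return mix.copy()
    target_linear = 10.0 ** (target_peak_dbfs / 20.0)
    return mix * (target_linear / peak)


# Two turns with the same raw loudness, mimicking flat TTS output.
rng = np.random.default_rng(0)
calm_raw = rng.standard_normal(16000) * 0.3
shout_raw = rng.standard_normal(16000) * 0.3

# Step 1: per-turn RMS gain (hypothetical targets for a calm and a shouted turn).
calm = apply_rms_gain(calm_raw, 0.05)
shout = apply_rms_gain(shout_raw, 0.20)
ratio_before = rms(shout) / rms(calm)

# Step 2: one global peak gain over the concatenated mix.
out = global_peak_gain(np.concatenate([calm, shout]), -2.0)
ratio_after = rms(out[16000:]) / rms(out[:16000])

print(f"RMS ratio before: {ratio_before:.2f}  after: {ratio_after:.2f}")
print(f"Output peak: {np.max(np.abs(out)):.4f}")
```

Because Step 2 multiplies every sample by the same factor, the shouted turn stays four times louder (in RMS) than the calm one, while the absolute peak lands at the configured target.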
+ +### Step 2 — Single global peak gain + +A single gain is applied to the whole mix so the clip's absolute peak lands at `loudness_target_peak_dbfs` (default: –2.0 dBFS). Because it's a single gain, all per-turn RMS *ratios* survive unchanged — the contrast from Step 1 is preserved. + +The configured target is recorded in `generation_metadata.loudness_target_peak_dbfs`. +The measured output peak is recorded in `preprocessing_applied.normalized_dbfs`. + +### Step 3 — Safety limiter + +A hard ceiling at –1.0 dBFS. For in-spec targets (range: [–12.0, –1.5] dBFS), this is a guaranteed no-op. It exists as a safety rail against misconfiguration. + +--- + +## Silence padding + +Every clip has at least 0.5 s of ambient silence at the head and tail. This is applied by `preprocess()` and logged in `preprocessing_applied.silence_padded: true`. + +Onset/offset timestamps in the `.txt` transcript and `.jsonl` events are already shifted to account for the leading pad — they refer to positions in the final processed WAV, not the raw TTS output. + +--- + +## Dirty files + +`preprocessing_applied` records the processing that was applied. The **pre-preprocessing WAV** is retained as `{clip_id}_dirty.wav` under `assets/speech/dirty/`. These are the raw TTS-mixer outputs before normalization, padding, or denoising. + +The `dirty_file_path` field in ClipMetadata gives the repo-relative path: +``` +"dirty_file_path": "assets/speech/dirty/sp_sv_a_0001_00_dirty.wav" +``` + +Dirty files are useful for: +- Diagnosing normalization issues (compare dirty peak vs. `normalized_dbfs`) +- Checking raw TTS prosody before processing +- Re-running preprocessing with different parameters + +!!! warning "Do not modify dirty files" + The `assets/` directory is managed by SynthBanshee. Manual edits to `.wav` files under `assets/speech/` will break SHA-256 cache lookups. 
+ +--- + +## TTS backends + +| Backend | Voices | Clips in delivery-003 | +|---------|--------|----------------------| +| Azure Cognitive Services | `he-IL-AvriNeural` (M), `he-IL-HilaNeural` (F) | 18 | +| Google Cloud TTS Chirp 3 HD | `he-IL-Chirp3-HD-Achird` (M), `he-IL-Chirp3-HD-Achernar` (F) | 2 | + +The backend per speaker is recorded in `generation_metadata.tts_backend`: +```json +"tts_backend": { + "AGG_M_30-45_002": "google", + "VIC_F_25-40_003": "google" +} +``` + +??? info "Azure SSML cache" + SynthBanshee caches per-utterance WAVs under `assets/speech/` keyed by SHA-256 of the full rendered SSML string. Re-running generation with the same SSML is **free** for Azure clips — the file is returned directly from cache without an API call. Google Chirp HD does not use the same cache: it produces slightly different audio on each synthesis (minor bit-level variation at the same parameters). + +--- + +## Known audio quirks + +### `vic_f0_high` — Google Chirp HD female F0 baseline + +The two Google Chirp 3 HD clips (`sp_sv_a_0003_00`, `sp_it_a_0003_00`) use the female voice `he-IL-Chirp3-HD-Achernar`. This voice's F0 baseline runs measurably higher than `he-IL-HilaNeural` (Azure), against which the corpus QA M10a thresholds were calibrated. + +Both clips are flagged `vic_f0_high` in the QA report. This is expected and tracked — it reflects a real backend difference, not a synthesis failure. **Do not exclude these clips** on the basis of this flag; calibrate your model's F0 features against the correct baseline per backend. + +### `quality_flags: ["emotion_downgrade"]` + +Several clips carry an `emotion_downgrade` quality flag. This means the TTS engine produced a less emotionally intense output than requested by the SSML prosody hints — the pipeline detected the downgrade and flagged it. Audio quality is still acceptable; the prosody is slightly less extreme than the scene specification intended. 
+ +In delivery-003: 15 clips carry at least one quality flag, mostly from prosody cap activations at I3+. diff --git a/docs/deliveries.md b/docs/deliveries.md new file mode 100644 index 0000000..71dac3b --- /dev/null +++ b/docs/deliveries.md @@ -0,0 +1,86 @@ +# Deliveries + +All data deliveries are logged here. Each entry links to per-delivery notes with clip counts, QA findings, known limitations, and the SynthBanshee commit that produced the batch. + +--- + +## Delivery 003 — multi-project, multi-voice + +**Date:** 2026-05-12 · **Status:** provisional · **PR:** [#5](https://github.com/DataHackIL/avdp-synth-corpus/pull/5) + +This is the current working delivery. It replaces delivery-002. + +### At a glance + +| | | +|---|---| +| Clips | 20 | +| Total duration | ~41.6 min | +| Projects | `she_proves` (12) + `elephant_in_the_room` (8) | +| Tiers | A (12 clean) + B (8 room-augmented) | +| TTS backends | Azure (18) + Google Chirp 3 HD (2) | +| Validation failures | 0 / 20 | +| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) | + +[Full notes](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) · [QA report](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/qa-report.json) + +### QA findings — closed (vs. 
delivery-002) + +| Finding | Delivery-002 | Delivery-003 | +|---------|:---:|:---:| +| `agg_no_escalation` | 3 clips | **0** — AGG RMS now escalates with intensity | +| `warn_no_overlap` | 4 clips | **0** — overlap_ratio 100% on I4+ clips | +| `warn_emotion_downgrade` | 4 clips | **0** — emotion_downgrade_ratio 0% | +| `generation_metadata` absent | 0 of 8 clips | **20 of 20** carry the full block | +| `dirty_file_path` null | 7 of 8 clips | **20 of 20** retain dirty files | +| `normalized_dbfs` hardcoded `-1.0` | all 8 clips | **fixed** — now the measured peak | + +Additional findings closed by the 2026-05-12 schema-shift regen (PRs [#110](https://github.com/DataHackIL/SynthBanshee/pull/110)/[#111](https://github.com/DataHackIL/SynthBanshee/pull/111)/[#112](https://github.com/DataHackIL/SynthBanshee/pull/112)): + +| Finding | Resolution | +|---------|-----------| +| `single_backend` false positive | `qa.py` now derives backend diversity from `generation_metadata.tts_backend.values()`; reports `clips_by_tts_backend: {azure: 18, google: 2}` | +| Absolute paths in clip JSON | `dirty_file_path` and `transcript_path` are now repo-relative POSIX strings | +| Leaked pytest tmp_path on `sp_neu_a_0001_00` | Regen overwrote with canonical path; autouse env-var strip fixture prevents future leaks | + +### QA findings — open + +| Finding | Detail | +|---------|--------| +| `low_voice_diversity_male` | 2 voice families per gender; threshold ≥ 3 | +| `low_voice_diversity_female` | 2 voice families per gender; threshold ≥ 3 | +| `vic_f0_high` (2 clips) | `sp_sv_a_0003_00` and `sp_it_a_0003_00` — Google Chirp HD female F0 runs higher than Azure Hila reference | +| `quality_flagged_clips: 15` | Mostly from prosody cap activations at I3+; expected behaviour | + +### Known limitations + +- **Speaker-disjoint splits not feasible.** 4 unique speaker personas across 20 clips; all clips are `split: train`. 
+- **Two speaker directories only.** `agg_m_30-45_002/` and `ben_m_40-55_003/` are first appearances — code hardcoding `agg_m_30-45_001/` will miss them. +- **One room type.** All 8 Elephant Tier B clips use `clinic_office`. Future deliveries will add `welfare_office` and `open_office`. +- **Toy corpus only.** 20 clips is not sufficient for training production models. + +### What this delivery exercises + +1. Full `ClipMetadata` schema including `generation_metadata`, `voice_family`, and (for Tier B) the populated `acoustic_scene` block +2. Per-surface casing rules: UPPERCASE `speaker_id`, lowercase paths and clip IDs +3. `has_violence` derivation from events: NEG clips are correctly `false` even at `max_intensity ≥ 3` +4. Multi-project layout under a single `data/he/` root +5. Multi-backend provenance: `generation_metadata.tts_backend` per speaker + +--- + +## Delivery log + +| # | Date | Slug | Project | Tier | Clips | Duration | Status | +|---|------|------|---------|------|------:|------:|--------| +| [003](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/003-multi-project-multi-voice/notes.md) | 2026-05-12 | multi-project-multi-voice | she_proves + elephant | A + B | 20 | ~42m | provisional | +| [002](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/002-m2a-wettest/notes.md) | 2026-04-15 | m2a-wettest | she_proves | A | 8 | ~17m | superseded | +| [001](https://github.com/DataHackIL/avdp-synth-corpus/blob/main/deliveries/001-debug-run-1/notes.md) | 2026-04-15 | debug-run-1 | she_proves | A | 1 | 2m 36s | superseded | + +## Status definitions + +| Status | Meaning | +|--------|---------| +| `provisional` | Wet-test batch; not yet approved for model training | +| `approved` | QA passed; cleared for training use | +| `superseded` | Replaced by a later delivery with the same scenes at higher quality | diff --git a/docs/elephant.md b/docs/elephant.md new file mode 100644 index 0000000..fa62e35 --- /dev/null +++ 
b/docs/elephant.md @@ -0,0 +1,178 @@ +# Elephant in the Room Guide + +**Elephant in the Room (הפיל שבחדר)** is a Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. + +**Optimization target: high precision.** False alarms erode trust with security staff and social workers alike. + +--- + +## Scene structure + +| Property | Value | +|----------|-------| +| Duration | 1–4 minutes | +| Tier | B (room IR + device + noise augmentation) | +| Alert window | Final 40% of the clip | +| Device profile | `pi_budget_mic` | +| Room types | `clinic_office`, `welfare_office`, `open_office` | +| Language | Hebrew (`he`) | + +The alert-in-final-40% constraint reflects real-world deployment: the device picks up normal consultation audio before a client becomes threatening. The model must recognize genuine escalation from a baseline of professional interaction. + +??? info "Tier B acoustic augmentation pipeline" + Tier B clips go through three augmentation steps after TTS rendering and preprocessing: + + 1. **Room impulse response (IR)** — the clean speech is convolved with a synthetic room IR (generated by `pyroomacoustics` image-source method) to simulate the acoustic of the target room type. + 2. **Device frequency response** — the `pi_budget_mic` profile applies the frequency response of a budget Raspberry Pi microphone capsule. + 3. **Background noise injection** — ambient noise events (HVAC hum, equipment sounds) are mixed in at specified SNR levels. + + After augmentation, the clip is renormalized to the same peak target (–2.0 dBFS) via the shared `peak_normalize_to_target` helper — so all tiers exit at the same absolute peak level. + +--- + +## Speaker pair + +Delivery-003 has one Elephant speaker pair. 
+ +| Speaker dir | Male speaker | Female speaker | Backend | +|-------------|--------------|----------------|---------| +| `ben_m_40-55_003/` | `BEN_M_40-55_003` → `he-IL-AvriNeural` | `SW_F_30-45_001` → `he-IL-HilaNeural` | Azure | + +The roles are **BEN (beneficiary/client, male) + SW (social worker, female)** — matching the most common demographic in Israeli welfare/clinic settings. + +!!! note "`ben_m_40-55_003/` is a new speaker directory in delivery-003" + Downstream code that hardcoded `agg_m_30-45_001/` for She-Proves will not find these clips. Use `manifest.csv` or filter by `meta["project"] == "elephant_in_the_room"`. + +--- + +## The `acoustic_scene` block + +This is the key difference between Tier A and Tier B metadata. For Elephant clips, `acoustic_scene` is fully populated: + +```json +"acoustic_scene": { + "room_type": "clinic_office", + "device": "pi_budget_mic", + "ir_source": "pyroomacoustics_ism", + "snr_db_actual": 11.2, + "speaker_distance_meters": 1.2, + "background_events": [ + {"type": "hvac_hum", "onset": 0.0, "offset": 147.0, "level_db": -37.4}, + {"type": "ACOU_SLAM", "onset": 72.164, "offset": 72.476, "level_db": 9.9}, + {"type": "ACOU_FALL", "onset": 97.57, "offset": 98.473, "level_db": 9.6} + ] +} +``` + +| Field | Meaning | +|-------|---------| +| `room_type` | Simulated room environment | +| `device` | Microphone/device profile applied | +| `ir_source` | Method used to generate room IR | +| `snr_db_actual` | Measured speech-to-noise ratio after mixing | +| `speaker_distance_meters` | Simulated speaker-to-mic distance | +| `background_events` | Non-speech acoustic events: type, timestamps, level | + +??? info "What is `pyroomacoustics_ism`?" + The image-source method (ISM) is an algorithm for computing room impulse responses by reflecting a virtual point source off the room's walls. `pyroomacoustics` is a Python library that implements it. 
+ + The resulting IR simulates how sound travels from a speaker to a microphone in a room of specified dimensions and surface absorption coefficients — giving the audio the characteristic reverb of the target room type without recording in a real room. + +??? info "Background event types" + | Type | Description | + |------|-------------| + | `hvac_hum` | Constant HVAC/ventilation hum (low level, full duration) | + | `ACOU_SLAM` | Door slam or hard object impact (brief, high level) | + | `ACOU_FALL` | Object falling or being thrown (brief, high level) | + + `ACOU_*` events are also tagged as `EventLabel` entries in the `.jsonl` strong labels with `tier1_category: "ACOU"`. This means they contribute to `weak_label.violence_categories` even in SV/IT clips where the primary violence is verbal or physical. + +--- + +## Clips in delivery-003 + +`data/he/ben_m_40-55_003/` + +| Clip ID | Typology | `has_violence` | Duration | SNR (dB) | +|---------|----------|:---:|------:|:---:| +| `el_sv_b_0001_00` | SV | ✓ | 2m 27.0s | ~11 | +| `el_sv_b_0002_00` | SV | ✓ | 2m 18.5s | ~11 | +| `el_it_b_0001_00` | IT | ✓ | 2m 30.0s | ~11 | +| `el_it_b_0002_00` | IT | ✓ | 2m 31.6s | ~11 | +| `el_neg_b_0001_00` | NEG | — | 1m 53.8s | ~11 | +| `el_neg_b_0002_00` | NEG | — | 2m 54.6s | ~11 | +| `el_neu_b_0001_00` | NEU | — | 1m 56.9s | ~11 | +| `el_neu_b_0002_00` | NEU | — | 1m 19.7s | ~11 | + +All 8 clips are Tier B with `device: pi_budget_mic` and `room_type: clinic_office`. 
+ +--- + +## Loading Elephant clips + +```python +import json +import soundfile as sf +import numpy as np +import pandas as pd +from pathlib import Path + +root = Path(".") +df = pd.read_csv("data/he/manifest.csv") +el_clips = df[df["project"] == "elephant_in_the_room"] + +# Load audio + metadata for a Tier B clip +clip_id = "el_sv_b_0001_00" +wav, sr = sf.read(root / f"data/he/ben_m_40-55_003/{clip_id}.wav") +meta = json.loads((root / f"data/he/ben_m_40-55_003/{clip_id}.json").read_text()) + +# Inspect acoustic scene +scene = meta["acoustic_scene"] +print(f"Room: {scene['room_type']} Device: {scene['device']} SNR: {scene['snr_db_actual']} dB") +# Room: clinic_office Device: pi_budget_mic SNR: 11.2 dB + +# Find background acoustic events +for evt in scene["background_events"]: + print(f"{evt['type']}: {evt['onset']:.1f}s – {evt['offset']:.1f}s @ {evt['level_db']} dB") +# hvac_hum: 0.0s – 147.0s @ -37.4 dB +# ACOU_SLAM: 72.2s – 72.5s @ 9.9 dB +# ACOU_FALL: 97.6s – 98.5s @ 9.6 dB + +# Get alert window (final 40%) +duration = meta["duration_seconds"] +alert_start = duration * 0.60 +print(f"Alert window: {alert_start:.1f}s – {duration:.1f}s") + +# Filter strong labels to alert window only +events = [json.loads(l) for l in + (root / f"data/he/ben_m_40-55_003/{clip_id}.jsonl").read_text().splitlines()] +alert_events = [e for e in events if e["onset"] >= alert_start] +``` + +--- + +## Guidance for model training + +!!! warning "This is a toy corpus — not for production training" + 8 Elephant clips from 1 speaker pair in 1 room type is insufficient for training. This delivery exists to bootstrap your data pipeline and acoustic-scene parsing code. + +**High-precision orientation:** + +- **NEG clips are essential.** Your precision target means you must not fire on `el_neg_b_*` clips — intense speech in a clinic room with background noise, but no violence. Train hard against these. +- **The alert-in-final-40% window** is where violence events concentrate. 
Consider a sliding-window detector that scores the final portion of each clip more aggressively than the opening. +- **SNR is ~11 dB.** This is a realistic but challenging condition for acoustic feature extraction. Verify that your features (MFCCs, log-mel, etc.) are robust at this SNR before comparing with She-Proves Tier A results. + +**Tier B–specific features:** + +- `acoustic_scene.snr_db_actual` gives you the ground-truth SNR per clip — useful for SNR-conditioned training or evaluation stratification. +- `background_events` timestamps let you train event detectors separately from the speech violence detector. +- `acoustic_scene.room_type` will diversify across room types at scale (`clinic_office`, `welfare_office`, `open_office`). Future deliveries will include all three. + +**What delivery-003 doesn't cover:** + +- Only `clinic_office` room type (all 8 clips) +- Only one speaker pair (BEN_M_40-55_003 + SW_F_30-45_001) +- No test/val split (4 unique speakers total; all are `split: train`) +- SNR variation (all ~11 dB) + +Plan for room-type diversity, SNR stratification, and speaker disjoint splits at scale. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..0310028 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,183 @@ +# Getting Started + +This guide walks through loading and using clips from the corpus in Python. All paths are relative to the repository root. + +## Prerequisites + +```bash +pip install soundfile numpy pandas pydantic +``` + +??? note "Optional: full SynthBanshee schema" + If you want strict Pydantic validation against the full `ClipMetadata` schema: + ```bash + git clone https://github.com/DataHackIL/SynthBanshee + cd SynthBanshee && pip install -e . + ``` + This gives you `from synthbanshee.labels.schema import ClipMetadata` and `validate_clip()`. + For most DS workflows, plain `json.loads()` is sufficient. 
+ +## Clone the corpus + +```bash +git clone https://github.com/DataHackIL/avdp-synth-corpus.git +cd avdp-synth-corpus +``` + +The repository contains the audio files directly (no LFS). Total size is moderate — `data/he/` is roughly a few hundred MB for delivery-003. + +--- + +## Load a single clip + +```python +import json +from pathlib import Path +import soundfile as sf +import numpy as np + +root = Path(".") # run from repo root + +clip_id = "sp_sv_a_0001_00" +speaker_dir = root / "data/he/agg_m_30-45_001" + +# --- Audio --- +wav, sr = sf.read(speaker_dir / f"{clip_id}.wav") +# wav: float64 array, shape (N,). sr: always 16000. + +print(f"Duration: {len(wav)/sr:.1f}s Sample rate: {sr} Peak: {np.abs(wav).max():.4f}") +# Duration: 110.5s Sample rate: 16000 Peak: 0.7943 + +# --- Weak labels (ClipMetadata) --- +meta = json.loads((speaker_dir / f"{clip_id}.json").read_text()) +wl = meta["weak_label"] +print(f"Typology: {meta['violence_typology']} has_violence: {wl['has_violence']} " + f"max_intensity: {wl['max_intensity']}") +# Typology: SV has_violence: True max_intensity: 5 + +# --- Transcript --- +transcript = (speaker_dir / f"{clip_id}.txt").read_text(encoding="utf-8") +print(transcript[:200]) # Hebrew turns with timestamps +``` + +??? info "Why is the peak ~0.79 (–2.0 dBFS) not 1.0?" + All clips are peak-normalized to a **–2.0 dBFS target** (≈ 0.794 linear), not to full scale (0 dBFS = 1.0 linear). + This leaves 1 dB of headroom below the safety limiter ceiling (–1.0 dBFS ≈ 0.891 linear). + `preprocessing_applied.normalized_dbfs` in the JSON records the measured peak. + See [Audio Format](audio-format.md) for the full normalization pipeline. 
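To relate the dBFS figures to the linear peak printed by the snippet above, a small conversion helper is enough. The function names below are illustrative and are not part of SynthBanshee.

```python
import math


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a level in dBFS to linear amplitude (0 dBFS = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def linear_to_dbfs(amplitude: float) -> float:
    """Convert a positive linear amplitude to dBFS."""
    return 20.0 * math.log10(amplitude)


print(round(dbfs_to_linear(-2.0), 4))   # 0.7943 (default peak target)
print(round(dbfs_to_linear(-1.0), 4))   # 0.8913 (safety limiter ceiling)
print(round(linear_to_dbfs(0.7943), 1)) # -2.0
```

Applying `linear_to_dbfs(np.abs(wav).max())` to a loaded clip should land close to the value stored in `preprocessing_applied.normalized_dbfs`, up to 16-bit quantization.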
+ +--- + +## Load strong-label events + +```python +import jsonlines # pip install jsonlines + +events = [] +with jsonlines.open(speaker_dir / f"{clip_id}.jsonl") as reader: + for evt in reader: + events.append(evt) + +# Or without jsonlines: +events = [ + json.loads(line) + for line in (speaker_dir / f"{clip_id}.jsonl").read_text().splitlines() + if line.strip() +] + +for evt in events[:3]: + print(f"[{evt['onset']:.1f}s – {evt['offset']:.1f}s] " + f"{evt['tier1_category']}/{evt['tier2_subtype']} I{evt['intensity']}") +# [0.8s – 10.1s] VERB/VERB_SHOUT I2 +# [10.5s – 18.7s] VERB/VERB_SHOUT I2 +# [18.3s – 29.7s] VERB/VERB_THREAT I3 +``` + +??? info "What are tier1_category and tier2_subtype?" + Strong labels follow a three-level taxonomy: + + **Typology** (clip-level): `SV` · `IT` · `NEG` · `NEU` + + **Tier 1 category** (event-level): `VERB` · `DIST` · `PHYS` · `EMOT` · `ACOU` · `NONE` + + **Tier 2 subtype** (event-level): e.g. `VERB_SHOUT`, `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD`, `ACOU_SLAM` + + See [Label Taxonomy](taxonomy.md) for the full table and has_violence derivation rule. + +--- + +## Work with the manifest + +`data/he/manifest.csv` is a flat summary of all clips. It's the fastest entry point for filtering and dataset construction. + +```python +import pandas as pd + +df = pd.read_csv("data/he/manifest.csv") +print(df.columns.tolist()) +# ['clip_id', 'project', 'violence_typology', 'tier', 'duration_seconds', +# 'speaker_ids', 'voice_families', 'has_violence', 'max_intensity', +# 'quality_flags', 'split', 'wav_path', 'strong_labels_path'] + +# Filter by project +she_proves_clips = df[df["project"] == "she_proves"] + +# Filter by typology +sv_clips = df[df["violence_typology"] == "SV"] + +# High-intensity violent clips only +high_intensity = df[(df["has_violence"]) & (df["max_intensity"] >= 4)] + +# Load audio for a manifest row +row = df.iloc[0] +wav, sr = sf.read(row["wav_path"]) # paths are repo-relative POSIX strings +``` + +!!! 
warning "`speaker_ids` and `voice_families` are pipe-delimited" + These columns contain multiple values joined by `|`: + ```python + speakers = row["speaker_ids"].split("|") + # ['AGG_M_30-45_001', 'VIC_F_25-40_002'] + ``` + +!!! note "All clips are `split: train` in delivery-003" + The corpus has only 4 unique speaker personas across 20 clips — speaker-disjoint splits are not feasible at this scale. When the corpus scales, speaker-disjoint train/val/test splits will be assigned by SynthBanshee. Until then, treat this as an unpartitioned pool. + +--- + +## Find a clip's speaker directory + +Clip IDs follow the pattern `{project_prefix}_{typology}_{tier}_{scene_num}_{take}`. The on-disk directory is the **lowercase** form of the first speaker ID listed in `speakers[]`: + +```python +def clip_dir(root: Path, clip_id: str, meta: dict) -> Path: + first_speaker = meta["speakers"][0]["speaker_id"] + return root / "data" / meta["language"] / first_speaker.lower() +``` + +| clip_id | speaker_dir | +|---------|-------------| +| `sp_sv_a_0001_00` | `data/he/agg_m_30-45_001/` | +| `sp_sv_a_0003_00` | `data/he/agg_m_30-45_002/` | +| `el_sv_b_0001_00` | `data/he/ben_m_40-55_003/` | + +Or use `manifest.csv` directly — `wav_path` already contains the full repo-relative path. + +--- + +## Validate a clip + +If you have SynthBanshee installed: + +```bash +synthbanshee validate data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav +``` + +This checks: all four files present, WAV format (16 kHz mono), peak ≤ –1.0 dBFS, duration ≥ 3 s, JSON parses as `ClipMetadata`. 
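If SynthBanshee is not installed, some of the checks listed above can be approximated with the standard library. The sketch below is an assumption-laden stand-in, not the `synthbanshee validate` implementation: the helper names are hypothetical, and it covers only file presence, sample rate, channel count, bit depth, and duration (no peak check and no `ClipMetadata` validation).

```python
import wave
from pathlib import Path

# The four files expected per clip, per the repository layout.
REQUIRED_SUFFIXES = (".wav", ".txt", ".json", ".jsonl")


def missing_files(wav_path: Path) -> list[str]:
    """Return the suffixes of the per-clip files that are absent."""
    return [s for s in REQUIRED_SUFFIXES if not wav_path.with_suffix(s).exists()]


def wav_format_problems(wav_path: Path) -> list[str]:
    """Return human-readable problems with the WAV header; empty if none."""
    problems = []
    with wave.open(str(wav_path), "rb") as w:
        if w.getframerate() != 16000:
            problems.append(f"sample rate {w.getframerate()} != 16000")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected mono")
        if w.getsampwidth() != 2:
            problems.append(f"{8 * w.getsampwidth()}-bit samples, expected 16-bit")
        duration = w.getnframes() / w.getframerate()
        if duration < 3.0:
            problems.append(f"duration {duration:.2f}s < 3.0s")
    return problems
```

Calling both helpers on `Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav")` gives a quick sanity check before running the full validator.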
+ +To run QA over the entire language directory: + +```bash +synthbanshee qa-report data/he/ +synthbanshee qa-report data/he/ --run-summary # adds corpus-level aggregates +``` diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..7b5b4d2 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,127 @@ +# avdp-synth-corpus + +**Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline (AVDP)** + +Generated by [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) · Hebrew (he-IL) · 16 kHz mono 16-bit PCM + +--- + +!!! warning "Toy corpus — not approved for model training" + All current deliveries are provisional wet-test batches for spec validation and pipeline bootstrapping. + The `split` field in `manifest.csv` is informational only. **Do not train production models on this data.** + See [Deliveries](deliveries.md) for the full status of each batch. + +--- + +## What is this? + +This repository contains **synthetic Hebrew audio clips** representing domestic-violence and threat scenarios, produced by a text-to-speech pipeline with automatic prosody modelling and acoustic augmentation. + +Two downstream products consume this data: + +=== "She-Proves" + + A smartphone app that passively monitors audio for domestic violence incidents and preserves evidence for legal use. High-recall orientation — better to flag and review than to miss. + + → [She-Proves team guide](she-proves.md) + +=== "Elephant in the Room" + + A Raspberry Pi–class device placed in clinic and welfare offices that alerts security when a social worker is under threat. High-precision orientation — false alarms erode trust. 
+ + → [Elephant in the Room team guide](elephant.md) + +--- + +## Current delivery at a glance + +**Delivery 003 — multi-project, multi-voice** · 2026-05-12 · provisional + +| Dimension | Value | +|-----------|-------| +| Clips | 20 | +| Total duration | ~41.6 min | +| Projects | `she_proves` (12 clips) + `elephant_in_the_room` (8 clips) | +| Tiers | A — clean (12) + B — room-augmented (8) | +| TTS backends | Azure (18 clips) + Google Chirp 3 HD (2 clips) | +| Validation failures | 0 / 20 | +| Pipeline | SynthBanshee `0.1.0` @ [`1ea48f3`](https://github.com/DataHackIL/SynthBanshee/commit/1ea48f3) | + +Full breakdown: [Deliveries](deliveries.md) · [She-Proves clips](she-proves.md#clips-in-delivery-003) · [Elephant clips](elephant.md#clips-in-delivery-003) + +--- + +## Repository layout + +``` +data/ + he/ # ISO 639-1 language code + {speaker_dir}/ # e.g. agg_m_30-45_001/ (lowercase of first speaker ID) + {clip_id}.wav # 16 kHz mono 16-bit PCM + {clip_id}.txt # per-turn transcript with onset/offset markers + {clip_id}.json # ClipMetadata (weak labels, provenance, speaker info) + {clip_id}.jsonl # EventLabel records — one JSON object per line + manifest.csv # flat summary of all clips under data/he/ + +assets/ + speech/ # SHA-256-keyed per-utterance WAV cache (do not modify) + dirty/ # pre-preprocessing WAVs, retained per spec + scripts/ # SHA-256-keyed LLM script cache (do not modify) + +deliveries/ + {slug}/ + metadata.yaml # structured delivery record + notes.md # narrative QA notes and known limitations + qa-report.json # synthbanshee qa-report output +``` + +??? info "Why are there four files per clip?" 
+ - **`.wav`** — the audio, spec-compliant (normalized, padded, validated) + - **`.txt`** — the transcript with turn-level onset/offset markers, used as ASR reference + - **`.json`** — `ClipMetadata`: weak labels (`has_violence`, `max_intensity`), speaker list, acoustic scene, provenance (`generation_metadata`) + - **`.jsonl`** — `EventLabel` records: one line per strong-label event with category, subtype, onset, offset, intensity, emotional state + + You only need `.wav` + `.json` for most training pipelines. Add `.jsonl` when you need per-event strong labels or onset/offset supervision. + +--- + +## Where to start + +| I want to… | Go to | +|------------|-------| +| Load my first clip in Python | [Getting Started → Load a clip](getting-started.md#load-a-single-clip) | +| Understand what the labels mean | [Label Taxonomy](taxonomy.md) | +| Parse `ClipMetadata` with Pydantic | [Schema Reference](schema.md) | +| Work with She-Proves scenes | [She-Proves guide](she-proves.md) | +| Work with Elephant Tier B audio | [Elephant in the Room guide](elephant.md) | +| Understand the audio normalization | [Audio Format](audio-format.md) | +| Check current quality status | [Deliveries](deliveries.md) | + +--- + +## Quick snippet + +```python +import json +from pathlib import Path +import soundfile as sf + +root = Path(".") # repo root + +# Load a clip +wav, sr = sf.read(root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.wav") +meta = json.loads((root / "data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text()) + +print(f"Duration: {len(wav)/sr:.1f}s has_violence: {meta['weak_label']['has_violence']}") +# Duration: 110.5s has_violence: True +``` + +For manifest-level operations: + +```python +import pandas as pd + +df = pd.read_csv("data/he/manifest.csv") +violent = df[df["has_violence"] == True] +print(violent[["clip_id", "project", "violence_typology", "duration_seconds"]].to_string()) +``` diff --git a/docs/schema.md b/docs/schema.md new file mode 100644 index 
0000000..98a10fd --- /dev/null +++ b/docs/schema.md @@ -0,0 +1,219 @@ +# Schema Reference + +Every clip's `.json` file contains a `ClipMetadata` object. The authoritative Pydantic model is in [SynthBanshee `synthbanshee/labels/schema.py`](https://github.com/DataHackIL/SynthBanshee/blob/main/synthbanshee/labels/schema.py). + +--- + +## Loading with Pydantic + +```python +from synthbanshee.labels.schema import ClipMetadata # requires SynthBanshee installed +from pathlib import Path + +meta = ClipMetadata.model_validate_json( + Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text() +) +print(meta.clip_id, meta.violence_typology, meta.weak_label.has_violence) +# sp_sv_a_0001_00 SV True +``` + +Plain JSON (no SynthBanshee required): + +```python +import json +from pathlib import Path + +meta = json.loads(Path("data/he/agg_m_30-45_001/sp_sv_a_0001_00.json").read_text()) +``` + +--- + +## Top-level `ClipMetadata` fields + +| Field | Type | Description | +|-------|------|-------------| +| `clip_id` | `str` | Lowercase ASCII clip identifier, e.g. `sp_sv_a_0001_00` | +| `project` | `str` | `she_proves` or `elephant_in_the_room` | +| `language` | `str` | ISO 639-1, always `"he"` | +| `violence_typology` | `str` | `SV` / `IT` / `NEG` / `NEU` — see [taxonomy](taxonomy.md) | +| `tier` | `str` | `"A"` (clean) or `"B"` (room-augmented) | +| `duration_seconds` | `float` | Duration of the processed WAV | +| `sample_rate` | `int` | Always `16000` | +| `channels` | `int` | Always `1` | +| `is_synthetic` | `bool` | Always `true` in this corpus | +| `generator_version` | `str` | SynthBanshee semver, e.g. 
`"0.1.0"` | +| `generation_date` | `str` | ISO 8601 date of generation | +| `random_seed` | `int` | Scene-level RNG seed for reproducibility | +| `scene_config` | `str` | Relative path to the scene YAML in SynthBanshee | +| `transcript_path` | `str` | Repo-relative POSIX path to the `.txt` transcript | +| `dirty_file_path` | `str` | Repo-relative POSIX path to the pre-preprocessing WAV | +| `speakers` | `list[SpeakerInfo]` | Speaker metadata — see below | +| `weak_label` | `WeakLabel` | Clip-level summary labels | +| `generation_metadata` | `GenerationMetadata \| null` | Pipeline provenance — see below | +| `preprocessing_applied` | `PreprocessingApplied` | What preprocessing steps ran | +| `acoustic_scene` | `AcousticScene` | Room/device augmentation (Tier B) | +| `quality_flags` | `list[str]` | QA flags, e.g. `["emotion_downgrade"]` | +| `snr_db_estimated` | `float \| null` | Estimated SNR (not always populated) | +| `annotator_confidence` | `float` | Auto-label confidence, 0–1 (auto-generated: always `1.0`) | +| `iaa_reviewed` | `bool` | Whether inter-annotator agreement review was done | +| `she_proves_meta` | `null` | Reserved for She-Proves–specific metadata (future) | +| `elephant_meta` | `null` | Reserved for Elephant–specific metadata (future) | + +--- + +## `SpeakerInfo` + +One entry per speaker in `speakers[]`. + +| Field | Type | Description | +|-------|------|-------------| +| `speaker_id` | `str` | UPPERCASE persona ID, e.g. `AGG_M_30-45_001` | +| `role` | `str` | `AGG` (aggressor), `VIC` (victim), `SW` (social worker), `BEN` (beneficiary/client) | +| `gender` | `str` | `"male"` or `"female"` | +| `age_range` | `str` | e.g. `"30-45"` | +| `tts_voice_id` | `str` | TTS voice identifier, e.g. `"he-IL-AvriNeural"` | +| `voice_family` | `str` | Same as `tts_voice_id` (may diverge in future) | + +??? info "Speaker ID casing convention" + The `speaker_id` field in JSON is always **UPPERCASE**: `AGG_M_30-45_001`. 
+ The on-disk directory is **lowercase**: `agg_m_30-45_001/`. + This is a deliberate per-surface casing rule — see [SynthBanshee spec §2.5](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#25-filename-constraints). + +--- + +## `WeakLabel` + +| Field | Type | Description | +|-------|------|-------------| +| `has_violence` | `bool` | `any(e.tier1_category != "NONE" for e in events)` — see [taxonomy](taxonomy.md#has_violence-the-correct-derivation) | +| `violence_typology` | `str` | Mirrors top-level `violence_typology` | +| `max_intensity` | `int` | Highest per-turn intensity across the clip (1–5) | +| `violence_categories` | `list[str]` | Distinct `tier1_category` values observed in events | + +--- + +## `GenerationMetadata` + +Present on all delivery-003 clips; may be `null` on older clips. + +| Field | Type | Description | +|-------|------|-------------| +| `pipeline_version` | `str` | SynthBanshee semver | +| `tts_backend` | `dict[str, str]` | Speaker ID → `"azure"` or `"google"` | +| `voice_family` | `dict[str, str]` | Speaker ID → voice family string | +| `mix_mode_used` | `str` | `"sequential"` (turns in order) or `"overlapping"` | +| `normalization_strategy` | `str` | `"per_turn_rms_v2_target_peak"` | +| `loudness_target_peak_dbfs` | `float` | Configured peak target, e.g. 
`-2.0` | +| `breathiness_applied` | `bool` | Whether breathiness augmentation was applied | +| `effective_prosody_caps` | `list[ProsodyCap]` | Per-turn cap activations at I3–I5 | +| `speaker_state_serialized` | `dict[str, SpeakerState]` | Final prosody state per speaker | +| `prosody_controller_version` | `str \| null` | Version of the prosody controller | +| `text_normalization_version` | `str \| null` | Version of text normalization | +| `timing_controller_version` | `str \| null` | Version of timing controller | + +### `ProsodyCap` (entry in `effective_prosody_caps`) + +| Field | Description | +|-------|-------------| +| `turn_index` | Zero-based turn index | +| `intensity` | Intensity score for that turn | +| `dim` | `"pitch"` or `"rate"` | +| `pre_cap` | Prosody value before capping (semitones for pitch, ratio for rate) | +| `post_cap` | Prosody value after capping | + +### `SpeakerState` (entry in `speaker_state_serialized`) + +| Field | Description | +|-------|-------------| +| `pitch_offset_st` | Final pitch offset in semitones | +| `rate_offset` | Final speaking rate multiplier | +| `volume_offset_db` | Final volume offset in dB | +| `breathiness_level` | Breathiness level 0–1 | + +--- + +## `PreprocessingApplied` + +| Field | Type | Description | +|-------|------|-------------| +| `resampled_to_16k` | `bool` | Whether sample rate conversion ran | +| `downmixed_to_mono` | `bool` | Whether channel downmix ran | +| `normalized_dbfs` | `float` | **Measured** peak dBFS of the output WAV (not the target) | +| `silence_padded` | `bool` | Whether silence padding was applied | +| `denoised` | `bool` | Whether denoising ran | +| `spectral_filtered` | `bool` | Whether spectral filtering ran | + +!!! note "`normalized_dbfs` is the measured peak, not the target" + Use `generation_metadata.loudness_target_peak_dbfs` for the configured target. + Use `preprocessing_applied.normalized_dbfs` to verify the actual output peak. 
+ On delivery-003, both should be very close to `-2.0` (within floating-point precision). + +--- + +## `AcousticScene` + +Populated for Tier B clips. Null fields indicate Tier A (no augmentation). + +| Field | Type | Description | +|-------|------|-------------| +| `room_type` | `str \| null` | e.g. `"clinic_office"`, `"welfare_office"`, `"open_office"` | +| `device` | `str \| null` | e.g. `"pi_budget_mic"` | +| `ir_source` | `str \| null` | Room impulse response source, e.g. `"pyroomacoustics_ism"` | +| `snr_db_actual` | `float \| null` | Actual SNR after augmentation (dB) | +| `speaker_distance_meters` | `float \| null` | Simulated speaker distance from microphone | +| `background_events` | `list[BackgroundEvent]` | Non-speech acoustic events added | + +### `BackgroundEvent` + +| Field | Description | +|-------|-------------| +| `type` | `"hvac_hum"`, `"ACOU_SLAM"`, `"ACOU_FALL"`, etc. | +| `onset` | Start time in seconds | +| `offset` | End time in seconds | +| `level_db` | Relative level of the event (dB) | + +--- + +## `EventLabel` (`.jsonl` rows) + +One JSON object per line. Each represents a single labelled event within the clip. + +| Field | Type | Description | +|-------|------|-------------| +| `event_id` | `str` | `{clip_id}_EVT_{index:03d}` | +| `clip_id` | `str` | Parent clip ID | +| `onset` | `float` | Event start time in seconds (in the processed WAV) | +| `offset` | `float` | Event end time in seconds | +| `tier1_category` | `str` | `VERB` / `DIST` / `PHYS` / `EMOT` / `ACOU` / `NONE` | +| `tier2_subtype` | `str` | e.g. `VERB_SHOUT`, `PHYS_HARD` | +| `intensity` | `int` | Turn intensity 1–5 | +| `speaker_id` | `str` | UPPERCASE speaker persona ID | +| `speaker_role` | `str` | `AGG`, `VIC`, `SW`, `BEN` | +| `emotional_state` | `str` | e.g.
`"anger"`, `"fear"`, `"desperation"`, `"neutral"` | +| `confidence` | `float` | Auto-label confidence (always `1.0` for auto-generated) | +| `label_source` | `str` | `"auto"` for all current clips | +| `iaa_reviewed` | `bool` | Always `false` in current deliveries | +| `truncated` | `bool` | Whether the event was cut short by a turn boundary | +| `notes` | `str \| null` | Annotator notes | + +--- + +## Manifest CSV columns + +`data/he/manifest.csv` — one row per clip. + +| Column | Type | Notes | +|--------|------|-------| +| `clip_id` | str | Matches JSON `clip_id` | +| `project` | str | `she_proves` / `elephant_in_the_room` | +| `violence_typology` | str | `SV` / `IT` / `NEG` / `NEU` | +| `tier` | str | `A` / `B` | +| `duration_seconds` | float | | +| `speaker_ids` | str | Pipe-delimited, e.g. `AGG_M_30-45_001\|VIC_F_25-40_002` | +| `voice_families` | str | Pipe-delimited, matches `speaker_ids` order | +| `has_violence` | bool | See [taxonomy](taxonomy.md#has_violence-the-correct-derivation) | +| `max_intensity` | int | 1–5 | +| `quality_flags` | str | Comma-delimited flag list | +| `split` | str | `train` / `val` / `test` — all `train` in delivery-003 | +| `wav_path` | str | Repo-relative POSIX path | +| `strong_labels_path` | str | Repo-relative POSIX path to `.jsonl` | diff --git a/docs/she-proves.md b/docs/she-proves.md new file mode 100644 index 0000000..fef7983 --- /dev/null +++ b/docs/she-proves.md @@ -0,0 +1,133 @@ +# She-Proves Team Guide + +She-Proves is a smartphone app that **passively monitors audio for domestic violence incidents** and preserves evidence for legal use. + +**Optimization target: high recall.** It is better to flag an incident for review than to miss one. 
+ +--- + +## Scene structure + +| Property | Value | +|----------|-------| +| Duration | 3–6 minutes | +| Tier | A (clean — no room processing) | +| Pre-incident window | ≥ 60% of clip duration before the first violence event | +| Device profile | `phone_in_pocket`, `phone_on_table`, `phone_in_hand` | +| Room types | apartment rooms (living room, bedroom, kitchen) | +| Language | Hebrew (`he`) | + +The long pre-incident window reflects real-world deployment: the app is always listening, and incidents are rare. Models trained on this data should handle extended periods of mundane speech before a rapid escalation. + +??? info "Tier A — what does 'clean' mean?" + Tier A clips have **no acoustic augmentation** — no room impulse response convolution, no device frequency response, no background noise injection. The audio is the direct TTS-mixer output after preprocessing: peak-normalized, silence-padded, 16 kHz mono 16-bit PCM. + + For Tier A, `acoustic_scene.room_type`, `device`, `ir_source`, and `snr_db_actual` are all `null`. + + Tier B (used by Elephant) adds all of the above. See [Elephant in the Room](elephant.md) for details. + +--- + +## Speaker pairs + +Delivery-003 has two She-Proves speaker pairs — one per TTS backend. + +| Pair | Speaker dir | Male speaker | Female speaker | Backend | +|------|-------------|--------------|----------------|---------| +| Azure | `agg_m_30-45_001/` | `AGG_M_30-45_001` → `he-IL-AvriNeural` | `VIC_F_25-40_002` → `he-IL-HilaNeural` | Azure | +| Google Chirp HD | `agg_m_30-45_002/` | `AGG_M_30-45_002` → `he-IL-Chirp3-HD-Achird` | `VIC_F_25-40_003` → `he-IL-Chirp3-HD-Achernar` | Google | + +Both pairs play **AGG (aggressor, male) + VIC (victim, female)** roles. The Google pair was added in delivery-003 specifically to introduce backend diversity. + +!!! note "Two speaker directories" + Clips from the Azure pair live under `data/he/agg_m_30-45_001/`. + Clips from the Google pair live under `data/he/agg_m_30-45_002/`. 
+ Downstream code that hardcodes `agg_m_30-45_001/` will miss the Google clips. + Use `manifest.csv` or filter `meta["generation_metadata"]["tts_backend"]` to find both. + +--- + +## Clips in delivery-003 + +### Azure pair — 10 clips + +`data/he/agg_m_30-45_001/` + +| Clip ID | Typology | `has_violence` | Duration | +|---------|----------|:---:|------:| +| `sp_sv_a_0001_00` | SV | ✓ | 1m 50.5s | +| `sp_sv_a_0002_00` | SV | ✓ | 1m 32.1s | +| `sp_it_a_0001_00` | IT | ✓ | 2m 23.8s | +| `sp_it_a_0002_00` | IT | ✓ | 2m 19.7s | +| `sp_neg_a_0001_00` | NEG | — | 1m 58.8s | +| `sp_neg_a_0002_00` | NEG | — | 1m 47.8s | +| `sp_neg_a_0003_00` | NEG | — | 2m 26.3s | +| `sp_neu_a_0001_00` | NEU | — | 1m 59.2s | +| `sp_neu_a_0002_00` | NEU | — | 2m 09.0s | +| `sp_neu_a_0003_00` | NEU | — | 1m 45.1s | + +### Google Chirp HD pair — 2 clips + +`data/he/agg_m_30-45_002/` + +| Clip ID | Typology | `has_violence` | Duration | Note | +|---------|----------|:---:|------:|------| +| `sp_sv_a_0003_00` | SV | ✓ | 1m 42.8s | `vic_f0_high` flag | +| `sp_it_a_0003_00` | IT | ✓ | 1m 53.9s | `vic_f0_high` flag | + +The `vic_f0_high` flag on the Google clips indicates the female voice (`he-IL-Chirp3-HD-Achernar`) has a higher F0 baseline than the Azure Hila reference. See [Audio Format → vic_f0_high](audio-format.md#vic_f0_high-google-chirp-hd-female-f0-baseline). 
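Flags such as `vic_f0_high` can be used to include or exclude clips before training. The following is a minimal sketch, assuming the manifest's `quality_flags` column is a comma-delimited string (as the Schema Reference describes) and that an empty cell means "no flags". The helper names `parse_flags` and `clips_with_flag` and the clip IDs in the inline sample are hypothetical; they are not part of the corpus or of SynthBanshee.

```python
import csv
import io


def parse_flags(cell):
    """Split a comma-delimited quality_flags cell into a set of flag names."""
    return {flag.strip() for flag in (cell or "").split(",") if flag.strip()}


def clips_with_flag(rows, flag):
    """Return the clip IDs whose quality_flags cell contains `flag`."""
    return [
        row["clip_id"]
        for row in rows
        if flag in parse_flags(row.get("quality_flags"))
    ]


# Inline stand-in for data/he/manifest.csv (hypothetical rows, only two columns).
sample_manifest = io.StringIO(
    "clip_id,quality_flags\n"
    "clip_a,\n"
    "clip_b,vic_f0_high\n"
    'clip_c,"vic_f0_high,emotion_downgrade"\n'
)

flagged = clips_with_flag(csv.DictReader(sample_manifest), "vic_f0_high")
print(flagged)  # ['clip_b', 'clip_c']
```

To run this against the real manifest, pass `csv.DictReader(open("data/he/manifest.csv", newline=""))` instead of the inline sample; whether the real file quotes multi-flag cells this way is an assumption to verify.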
+ +--- + +## Loading She-Proves clips + +```python +import soundfile as sf +import pandas as pd +from pathlib import Path + +root = Path(".") + +# Via manifest — easiest +df = pd.read_csv("data/he/manifest.csv") +# .copy() so that adding the "backend" column later does not write into a slice of df +sp_clips = df[df["project"] == "she_proves"].copy() + +# Load all She-Proves audio +wavs = {} +for _, row in sp_clips.iterrows(): + wav, sr = sf.read(root / row["wav_path"]) + wavs[row["clip_id"]] = wav + +# Filter to violent She-Proves clips only +sp_violent = sp_clips[sp_clips["has_violence"] == True] + +# Get per-backend split +sp_clips["backend"] = sp_clips["voice_families"].apply( + lambda v: "google" if "Chirp" in v else "azure" +) +print(sp_clips.groupby("backend")["clip_id"].count()) +# azure 10 +# google 2 +``` + +--- + +## Guidance for model training + +!!! warning "This is a toy corpus — not for production training" + 12 She-Proves clips (10 Azure + 2 Google) are not enough for training a production model. Use this delivery to validate your data pipeline and schema parsing. Full-scale data follows. + +**High-recall orientation:** + +- **NEG clips are your hardest negatives.** They contain intense speech (raised voices, arguments, crying) with `has_violence: false`. Your recall model must not fire on them. +- **The pre-incident window** (at least the first 60% of the clip) will look like NEU/low-intensity speech. Include it in your training windows — models that only see escalated segments will miss early warning signals. +- **Per-turn intensity** in the `.jsonl` events gives you fine-grained supervision beyond binary `has_violence`. Consider training an intensity regressor as an auxiliary objective. + +**Backend diversity:** + +The 2 Google Chirp HD clips expose your feature extractor to a different F0 baseline and spectral profile. At small scale, they're useful for checking that your features don't overfit to Azure voice characteristics.
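The pre-incident window mentioned in the guidance above can be measured directly from a clip's strong labels. The sketch below is one possible way to do it, not the pipeline's own check: it assumes each `.jsonl` row carries `onset` and `tier1_category` as listed in the Schema Reference, and it treats the first event whose category is not `NONE` as the start of the incident. The function names and the inline events are hypothetical.

```python
import json


def first_violence_onset(events):
    """Onset in seconds of the earliest non-NONE event, or None if there is none."""
    onsets = [e["onset"] for e in events if e["tier1_category"] != "NONE"]
    return min(onsets) if onsets else None


def pre_incident_fraction(events, duration_seconds):
    """Fraction of the clip that elapses before the first non-NONE event."""
    onset = first_violence_onset(events)
    if onset is None:
        return None  # NEG/NEU-style clip: no incident to measure against
    return onset / duration_seconds


# Inline stand-in for a clip's .jsonl file (hypothetical events, reduced fields).
sample_jsonl = """\
{"onset": 0.0, "offset": 4.2, "tier1_category": "NONE", "intensity": 1}
{"onset": 70.0, "offset": 73.5, "tier1_category": "VERB", "intensity": 3}
{"onset": 80.0, "offset": 82.0, "tier1_category": "PHYS", "intensity": 5}
"""
events = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

fraction = pre_incident_fraction(events, duration_seconds=100.0)
print(f"pre-incident fraction: {fraction:.2f}")  # pre-incident fraction: 0.70
```

For a real clip, read the events from the path in the manifest's `strong_labels_path` column and take `duration_seconds` from the clip's `.json` metadata.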
+ +**Speaker splits:** + +All 12 clips share just two speaker roles, AGG and VIC, voiced by four speaker IDs in total (one AGG/VIC pair per TTS backend). There are not enough speakers for a speaker-disjoint split in this delivery. Re-evaluate when the corpus scales to 100+ speakers. diff --git a/docs/taxonomy.md b/docs/taxonomy.md new file mode 100644 index 0000000..8154a06 --- /dev/null +++ b/docs/taxonomy.md @@ -0,0 +1,126 @@ +# Label Taxonomy + +Labels follow a three-level hierarchy: clip-level violence typology, event-level Tier 1 category, and event-level Tier 2 subtype. The **source of truth** is `taxonomy.yaml` in the [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) repo. Never derive labels from field names alone — always read from the actual data. + +--- + +## Violence typologies (clip-level) + +The `violence_typology` field classifies the overall scenario of the clip. + +| Typology | Full name | Description | +|----------|-----------|-------------| +| `SV` | Severe Violence | Physical violence, life-threatening escalation | +| `IT` | Intimate Terrorism | Systematic coercive control, repeated verbal/emotional abuse | +| `NEG` | Negative / Confusor | Acoustically intense but non-violent — anger, argument, distress, crying | +| `NEU` | Neutral | Calm or mundane conversation with no violence markers | + +??? info "Why NEG is not the same as non-violent IT/SV" + NEG clips are designed as **hard negatives** — they sound intense and may have raised voices, crying, or confrontational tone, but no actual violence occurs. Their purpose is to train models to distinguish acoustic distress from violence. + + Models that rely only on loudness or emotional tone will misclassify NEG clips. This is by design.
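Because NEG clips are non-violent by design, a quick consistency check on the manifest can catch labelling regressions early. This is a hedged sketch only: it assumes manifest rows are read as dictionaries of strings (for example via `csv.DictReader`) and that `has_violence` is serialized as `True` / `False` text. The helper names and the sample rows are hypothetical.

```python
def is_true(cell):
    """Interpret a CSV cell as a boolean (assumes 'True'/'False'-style text)."""
    return str(cell).strip().lower() in {"true", "1"}


def neg_label_conflicts(rows):
    """Return clip IDs that are typed NEG but flagged has_violence."""
    return [
        row["clip_id"]
        for row in rows
        if row["violence_typology"] == "NEG" and is_true(row["has_violence"])
    ]


# Hypothetical manifest rows; clip_c is deliberately inconsistent.
sample_rows = [
    {"clip_id": "clip_a", "violence_typology": "NEG", "has_violence": "False"},
    {"clip_id": "clip_b", "violence_typology": "SV", "has_violence": "True"},
    {"clip_id": "clip_c", "violence_typology": "NEG", "has_violence": "True"},
]

conflicts = neg_label_conflicts(sample_rows)
print(conflicts)  # ['clip_c']
```

An empty result is the expected outcome on a well-labelled delivery; a non-empty one points at clips whose events and typology disagree.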
+ +--- + +## `has_violence` — the correct derivation + +`has_violence` is a **derived convenience field** computed from the strong-label events, not from typology: + +```python +has_violence = any(e["tier1_category"] != "NONE" for e in events) +``` + +This means: + +- `NEG` clips are **always** `has_violence: false`, regardless of `max_intensity` — by definition, every event in a NEG clip is labelled `tier1_category: "NONE"`. +- A `NEU` clip with even one stray non-NONE event would be `has_violence: true` (shouldn't happen in a well-labelled corpus, but the rule is defensive). + +!!! danger "Do not re-derive `has_violence` from typology + intensity" + ```python + # WRONG — reads typology instead of the event labels, so it cannot detect a clip whose events disagree with its typology + has_violence = typology in ("SV", "IT") + + # CORRECT + has_violence = any(e["tier1_category"] != "NONE" for e in events) + ``` + The taxonomy columns are the ground truth. `has_violence` exists only for fast filtering and baseline modelling — never use it as the sole training label. + +--- + +## Tier 1 categories (event-level) + +Each `EventLabel` in the `.jsonl` file has a `tier1_category`: + +| Category | Description | Example contexts | +|----------|-------------|-----------------| +| `VERB` | Verbal violence — threats, shouting, demeaning language | Arguments, intimidation | +| `DIST` | Distress vocalisations — screaming, crying under duress | Peak escalation turns | +| `PHYS` | Physical violence cues — impact sounds, struggle | Severe violence scenes | +| `EMOT` | Emotional manipulation — guilt-tripping, gaslighting | IT/coercive control | +| `ACOU` | Acoustic events — object impacts, slams, falls | Background events in Tier B | +| `NONE` | No violence — ambient speech, neutral turns | All NEU/NEG events | + +??? info "ACOU vs DIST" + `ACOU` captures **non-vocal acoustic cues** — a door slam, an object falling, an impact sound. These appear in Tier B clips as `background_events` in the `acoustic_scene` block.
+ + `DIST` captures **vocal distress** — screams, panic vocalisations, crying under coercion. + +--- + +## Tier 2 subtypes (event-level) + +| Tier 1 | Tier 2 subtype | Description | +|--------|----------------|-------------| +| VERB | `VERB_SHOUT` | Raised or shouted speech | +| VERB | `VERB_THREAT` | Direct verbal threats | +| VERB | `VERB_INSULT` | Demeaning or insulting language | +| DIST | `DIST_SCREAM` | Distress scream or panic vocalisation | +| DIST | `DIST_CRY` | Crying or sobbing under duress | +| PHYS | `PHYS_HARD` | Hard physical impact cue | +| PHYS | `PHYS_SOFT` | Softer physical contact cue | +| EMOT | `EMOT_GASLIGHT` | Gaslighting or reality-denial | +| EMOT | `EMOT_GUILT` | Guilt-tripping or emotional coercion | +| ACOU | `ACOU_SLAM` | Object slam or door slam | +| ACOU | `ACOU_FALL` | Object falling or thrown | +| NONE | `NONE_AMBIENT` | Regular ambient speech or neutral turn | + +--- + +## Intensity scale (turn-level) + +Intensity is scored 1–5 per dialogue turn. It controls prosody generation (pitch, rate, volume) and determines which tier1/tier2 labels are applied. + +| Score | Label | Description | Prosody profile | +|-------|-------|-------------|----------------| +| 1 | Low tension | Calm conversation, mild undercurrent | Near-neutral | +| 2 | Moderate tension | Noticeable friction, raised voices | Slightly raised pitch/rate | +| 3 | Active conflict | Clear verbal aggression or intimidation | Elevated pitch, faster rate | +| 4 | Escalated violence | Physical or high-intensity verbal violence | High pitch, fast rate, volume up | +| 5 | Extreme / life-threatening | Severe physical violence, panic | Maximally expressive (capped) | + +??? info "The prosody cap at I4–I5" + At intensity 4–5, the LLM-generated prosody values are capped before SSML rendering to prevent Whisper transcription failures and maintain naturalness. 
The cap values are: + + - **Pitch:** max +2.0 semitones (post-cap) + - **Rate:** range [0.85, 1.20] (post-cap) + + Any cap activation is recorded in `generation_metadata.effective_prosody_caps` per turn. You'll see many activations at I4–I5 in delivery-003 — this is expected. The cap was calibrated in a listening test in May 2026 (SynthBanshee PR #87). + +--- + +## Distribution in delivery-003 + +| Typology | Clips | Projects | Tiers | +|----------|------:|---------|-------| +| SV | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | +| IT | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | +| NEG | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | +| NEU | 5 | she_proves (3) + elephant (2) | A (3) + B (2) | + +Intensity distribution across all 20 clips: + +| Max intensity | Clips | +|:---:|:---:| +| 5 | 10 | +| 3 | 4 | +| 2 | 6 | diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..11dbc73 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,91 @@ +site_name: avdp-synth-corpus +site_description: Synthetic Hebrew audio corpus for the Audio Violence Detection Pipeline — consumer guide for She-Proves and Elephant in the Room teams +site_url: https://datahackil.github.io/avdp-synth-corpus/ +repo_url: https://github.com/DataHackIL/avdp-synth-corpus +repo_name: DataHackIL/avdp-synth-corpus +edit_uri: edit/main/docs/ + +theme: + name: material + logo: assets/logo.svg + favicon: assets/logo.svg + palette: + - scheme: default + primary: teal + accent: cyan + toggle: + icon: material/brightness-7 + name: Switch to dark mode + - scheme: slate + primary: teal + accent: cyan + toggle: + icon: material/brightness-4 + name: Switch to light mode + features: + - navigation.tabs + - navigation.tabs.sticky + - navigation.sections + - navigation.expand + - navigation.indexes + - navigation.top + - toc.follow + - search.suggest + - search.highlight + - search.share + - content.code.copy + - content.code.annotate + - content.tabs.link + - announce.dismiss + 
+markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format + - pymdownx.highlight: + anchor_linenums: true + line_spans: __span + pygments_lang_class: true + - pymdownx.inlinehilite + - pymdownx.snippets + - pymdownx.tabbed: + alternate_style: true + - pymdownx.emoji: + emoji_index: !!python/name:material.extensions.emoji.twemoji + emoji_generator: !!python/name:material.extensions.emoji.to_svg + - tables + - attr_list + - md_in_html + - toc: + permalink: true + - def_list + +plugins: + - search: + lang: en + +nav: + - Home: index.md + - Getting Started: getting-started.md + - Team Guides: + - She-Proves: she-proves.md + - Elephant in the Room: elephant.md + - Reference: + - Label Taxonomy: taxonomy.md + - Schema Reference: schema.md + - Audio Format: audio-format.md + - Deliveries: deliveries.md + +extra: + social: + - icon: fontawesome/brands/github + link: https://github.com/DataHackIL/avdp-synth-corpus + name: avdp-synth-corpus on GitHub + - icon: fontawesome/brands/github + link: https://github.com/DataHackIL/SynthBanshee + name: SynthBanshee pipeline on GitHub + generator: false