diff --git a/README.md b/README.md index d714125..c265476 100644 --- a/README.md +++ b/README.md @@ -1,162 +1,142 @@ # avdp-synth-corpus -Synthetic Hebrew audio dataset for the **Audio Violence Detection Pipeline (AVDP)**, generated by the [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) pipeline. +[![Docs](https://img.shields.io/badge/docs-live-0f766e.svg)](https://datahackil.github.io/avdp-synth-corpus/) +[![Deploy docs](https://github.com/DataHackIL/avdp-synth-corpus/actions/workflows/docs.yml/badge.svg)](https://github.com/DataHackIL/avdp-synth-corpus/actions/workflows/docs.yml) +[![CI](https://github.com/DataHackIL/avdp-synth-corpus/actions/workflows/ci.yml/badge.svg)](https://github.com/DataHackIL/avdp-synth-corpus/actions/workflows/ci.yml) +[![license: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) +![audio: 16 kHz mono PCM](https://img.shields.io/badge/audio-16kHz%20mono%20PCM-blue.svg) -This is a **data-only repository**. It contains no application code. All pipeline logic, configuration, documentation, and tests live in SynthBanshee. +Created by [Shay Palachy Affek](http://www.shaypalachy.com/). -> **If you are a Claude Code agent or AI assistant:** read [`CLAUDE.md`](CLAUDE.md) before -> making any changes. Key rules: never rename/modify/delete files in `assets/`; never edit -> `.wav` files by hand; always update `DELIVERIES.md` when adding clips; never drop -> `has_violence` from metadata or manifests. +**avdp-synth-corpus** is the public synthetic Hebrew audio corpus for the **Audio Violence Detection +Pipeline (AVDP)**. It contains generated clips, transcripts, metadata, strong labels, delivery +notes, and cache assets produced by the +[SynthBanshee](https://github.com/DataHackIL/SynthBanshee) pipeline. ---- +This is a **data-only repository**. It contains no generation code; all pipeline logic, +configuration, tests, and implementation docs live in SynthBanshee. -## What is this data for? 
+**Start with the consumer guide:** [datahackil.github.io/avdp-synth-corpus](https://datahackil.github.io/avdp-synth-corpus/) -AVDP is an AI safety initiative run by [DataHack](https://datahack.org.il) with two downstream products: +## Data Preview -- **She-Proves** — passively monitors a smartphone for domestic violence incidents and preserves audio evidence for legal use -- **Elephant in the Room (הפיל שבחדר)** — a Raspberry Pi–class device in clinic/welfare offices that alerts security when a social worker is under threat +The repository currently contains a small provisional test batch: **20 synthetic Hebrew clips**, +about **41.6 minutes** total, with one `.wav`, `.txt`, `.json`, and `.jsonl` file per clip. -The clips in this repository are **synthetic** (`is_synthetic: true` in all metadata). They are generated by a text-to-speech pipeline using Microsoft Azure Cognitive Services Hebrew neural voices. A real-data pipeline (actor recordings) is planned for a later phase; those recordings will live in a separate repository. +![Waveform of sp_sv_a_0001_00 with strong-label event boundaries](docs/assets/sp_sv_a_0001_00_waveform.png) ---- +The preview above shows `sp_sv_a_0001_00`, a Severe Violence scene with strong-label event +boundaries overlaid on the waveform. It lets a new consumer see the basic shape of the corpus before +browsing the files: 16 kHz mono audio, Hebrew transcript, weak clip metadata, and time-aligned event +labels. 
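The audio format stated in this preview (16 kHz, mono, 16-bit PCM) can be checked from the WAV header alone. The following is a minimal sketch using Python's standard-library `wave` module; the function name `check_wav_format` and the shape of the returned dictionary are illustrative assumptions, not part of SynthBanshee or of this repository.

```python
import wave
from pathlib import Path


def check_wav_format(path: Path) -> dict:
    """Read a WAV header and report whether it is 16 kHz, mono, 16-bit PCM.

    Hypothetical helper for illustration; SynthBanshee's own validator
    is the authoritative check.
    """
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        n_frames = wav.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "bit_depth": sample_width * 8,
        "duration_s": n_frames / sample_rate if sample_rate else 0.0,
        "conforms": (sample_rate, channels, sample_width) == (16000, 1, 2),
    }
```

Calling it on a path of the form `data/he/{speaker_id}/{clip_id}.wav` should report `conforms: True` for a spec-compliant clip. The sketch inspects the header only and does not examine peak levels or silence padding.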
-## Repository layout +| Field | Current value | +|---|---| +| Language | Hebrew (`he-IL`) | +| Audio format | 16 kHz, mono, 16-bit PCM WAV | +| Current batch size | 20 clips, 20 transcripts, 20 metadata files, 20 label files | +| Product contexts | She-Proves and Elephant in the Room | +| Data source | Synthetic TTS only; no real human recordings | +| Generator | [DataHackIL/SynthBanshee](https://github.com/DataHackIL/SynthBanshee) | +| Delivery log | [`DELIVERIES.md`](DELIVERIES.md) and `deliveries/{slug}/` | -``` +## Where to Start + +| Need | Link | +|---|---| +| First-time consumer guide | [Docs site](https://datahackil.github.io/avdp-synth-corpus/) | +| Load one clip | [Start here](https://datahackil.github.io/avdp-synth-corpus/getting-started/) | +| Avoid common data mistakes | [Common mistakes](https://datahackil.github.io/avdp-synth-corpus/gotchas/) | +| Understand labels | [Label taxonomy](https://datahackil.github.io/avdp-synth-corpus/taxonomy/) | +| Decode metadata fields | [Schema reference](https://datahackil.github.io/avdp-synth-corpus/schema/) | +| Check current deliveries | [Delivery history](DELIVERIES.md) | +| Inspect generation code | [SynthBanshee](https://github.com/DataHackIL/SynthBanshee) | + +## Repository Layout + +```text assets/ - speech/ # Per-utterance WAV cache, named by SHA-256 of the full rendered - │ # SSML string. Never modify or rename these files — SynthBanshee - │ # uses the hash as the cache key. Deleting a file forces a paid - │ # re-synthesis; adding a file with a wrong name silently breaks - │ # cache lookups. - │ dirty/ # Pre-preprocessing ("dirty") WAV files, retained per spec. - │ # Named {clip_id}_dirty.wav — not by hash. - scripts/ # Per-scene script generation cache, named by SHA-256 of all - # generation inputs. Same rules as assets/speech/. + speech/ # Per-utterance WAV cache, named by SHA-256 of the rendered SSML. + scripts/ # Per-scene script generation cache, named by SHA-256 of inputs. 
data/ - he/ # Language code (ISO 639-1). All current clips are Hebrew. - {speaker_id}/ # Speaker persona ID, e.g. agg_m_30-45_001 + he/ + {speaker_id}/ {clip_id}.wav # 16 kHz, mono, 16-bit PCM WAV - {clip_id}.txt # Per-turn transcript with onset/offset markers - {clip_id}.json # ClipMetadata (weak labels, speaker info, is_synthetic, etc.) - {clip_id}.jsonl # Per-event EventLabel records (strong labels) -``` - -Every `.wav` must have a matching `.txt`, `.json`, and `.jsonl`. A clip without all four files is invalid and will be rejected by `synthbanshee validate`. - ---- + {clip_id}.txt # Per-turn Hebrew transcript with onset/offset markers + {clip_id}.json # ClipMetadata: weak labels, speaker info, provenance, etc. + {clip_id}.jsonl # Per-event EventLabel records: strong labels and timings -## Clip ID and filename conventions +deliveries/ + {slug}/ # Per-delivery notes and structured metadata +``` -- All filenames (and filesystem path components) are **ASCII only**, lowercase, no spaces. -- Format: `{scene_id_lower}_{take_number:02d}` — e.g. `sp_it_a_0001_00`. The same id appears uppercase in YAML `scene_id`. -- The on-disk speaker directory is `speaker_id.lower()` of the scene's first listed speaker. The `speakers[].speaker_id` *value* in the `.json` stays uppercase (`AGG_M_30-45_001`); only the directory name is lowercase (`agg_m_30-45_001/`). -- **Single source of truth for per-surface casing rules:** [SynthBanshee `docs/spec.md` §2.5 — Identifier casing (per surface)](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#25-filename-constraints). -- **No Hebrew text in filenames or JSON keys/values** — Hebrew belongs in `.txt` transcript files only. +Every `.wav` must have a matching `.txt`, `.json`, and `.jsonl`. A clip without all four files is +invalid and should be regenerated through SynthBanshee rather than edited by hand. ---- +## Clip and Label Contract -## Label taxonomy +Filenames are ASCII-only, lowercase, and space-free. 
Clip ids use the format +`{scene_id_lower}_{take_number:02d}`, for example `sp_it_a_0001_00`. -Labels follow a three-level hierarchy defined in `configs/taxonomy.yaml` in the SynthBanshee repo: +Labels follow the AVDP taxonomy: | Level | Field | Examples | -|-------|-------|---------| -| Violence typology (scene-level) | `violence_typology` | `SV`, `IT`, `NEG`, `NEU` | -| Tier 1 category (event-level) | `tier1_category` | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` | -| Tier 2 subtype (event-level) | `tier2_subtype` | `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD` | +|---|---|---| +| Scene typology | `violence_typology` | `SV`, `IT`, `NEG`, `NEU` | +| Event category | `tier1_category` | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` | +| Event subtype | `tier2_subtype` | `PHYS_HARD`, `VERB_THREAT`, `DIST_SCREAM` | -`has_violence` is a **derived convenience field** computed from the strong-label events, not from typology or intensity. The rule is pinned in [SynthBanshee `docs/spec.md` §5.1](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#51-per-clip-metadata-json) and lives in `synthbanshee/labels/generator.py`: +`has_violence` is a derived convenience field, computed from event labels: ```python has_violence = any(e.tier1_category != "NONE" for e in events) ``` -This means `NEG` (Negative / Confusor) clips are correctly `has_violence: false` even at `max_intensity ≥ 3` — by definition NEG is "acoustically intense but non-violent" so every event lands `tier1_category: "NONE"`. Do **not** re-derive `has_violence` from typology + intensity alone; you will disagree with the data on every NEG row. The taxonomy columns are the ground truth — `has_violence` is for fast filtering and baseline modelling only, never the sole training label. 
- -Intensity is scored 1–5 per turn: - -| Score | Label | Description | -|-------|-------|-------------| -| 1 | Low tension | Calm conversation, mild undercurrent | -| 2 | Moderate tension | Noticeable friction, raised voices | -| 3 | Active conflict | Clear verbal aggression or intimidation | -| 4 | Escalated violence | Physical or high-intensity verbal violence | -| 5 | Extreme / life-threatening | Severe physical violence, panic, imminent danger | - ---- - -## Audio format - -All clips must conform to: - -- **Sample rate:** 16 kHz -- **Channels:** Mono -- **Bit depth:** 16-bit PCM -- **Peak normalization:** target `−2.0 dBFS` (configurable, range `[−12.0, −1.5]`) via single global gain, then safety limiter at `≤ −1.0 dBFS`. The measured peak lands in `preprocessing_applied.normalized_dbfs`; the configured target lands in `generation_metadata.loudness_target_peak_dbfs`. See [spec §3](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#3-audio-format-requirements) and §5.1 field notes. -- **Silence padding:** ≥ 0.5 s ambient baseline before and after target speech +Do not re-derive `has_violence` from typology or intensity alone. Negative/confusor clips can be +acoustically intense while still having `has_violence: false`. ---- +## Product Contexts -## Pipeline versions and data quality +AVDP is an AI-safety initiative run by [DataHack](https://datahack.org.il). The current corpus +supports two downstream research contexts: -Clips carry a `generation_metadata` block in their `.json` file when the generator recorded pipeline provenance (`pipeline_version`, `tts_backend` per speaker, `voice_family` per speaker, `mix_mode_used`, `normalization_strategy`, `loudness_target_peak_dbfs`, `breathiness_applied`, `effective_prosody_caps`). Older clips may have it as `null` — treat absence as "unknown", not as failure. See [spec §5.1](https://github.com/DataHackIL/SynthBanshee/blob/main/docs/spec.md#51-per-clip-metadata-json) field notes. 
+- **She-Proves**: smartphone-oriented domestic-violence incident detection research. +- **Elephant in the Room** (`הפיל שבחדר`): clinic and welfare-office threat detection research. -**Per-delivery quality posture lives in `deliveries/{slug}/notes.md`.** Each delivery records the SynthBanshee commit, milestone state, prosody / acoustic QA findings, and any known limitations specific to that batch. Consumer teams reading the corpus should always start from the delivery notes for the clips they're working with rather than assuming a single global quality bar. +The clips in this repository are synthetic (`is_synthetic: true` in metadata). They are useful for +data-loading, label handling, QA, and early model-development workflows, but they are not legal +evidence, user recordings, or a substitute for real validation data. ---- +## Contributor and Agent Rules -## How clips get here +If you are a Claude Code agent or another AI assistant, read [`CLAUDE.md`](CLAUDE.md) before making +changes. -SynthBanshee writes directly to this repository when the following environment variables are set (configured in `.envrc` of the SynthBanshee repo): +Key rules: -| Variable | Points to | -|----------|-----------| -| `SYNTHBANSHEE_CACHE_DIR` | `assets/speech/` | -| `SYNTHBANSHEE_SCRIPT_CACHE_DIR` | `assets/scripts/` | -| `SYNTHBANSHEE_DATA_DIR` | `data/he/` | - -Do not write to this repository by hand. All files should be produced by `synthbanshee generate` or `synthbanshee generate-batch`. Manual edits to `.wav` files will invalidate the SHA-256 cache keys and break re-synthesis detection. - ---- +- Never rename, modify, or delete files in `assets/`. +- Never edit `.wav` files by hand. +- Always update [`DELIVERIES.md`](DELIVERIES.md) when adding clips. +- Never drop `has_violence` from metadata or manifests. +- Use SynthBanshee for generation and validation rather than hand-writing corpus files. 
## Validation -To verify that a clip is spec-compliant, use the SynthBanshee CLI from the SynthBanshee repo: +Run validation from the SynthBanshee repository: ```bash synthbanshee validate data/he/{speaker_id}/{clip_id}.wav -``` - -To run QA over an entire dataset directory: - -```bash synthbanshee qa-report data/he/ ``` ---- - -## Delivery history +## Related Repositories -All data deliveries are logged in **[DELIVERIES.md](DELIVERIES.md)** — one row per merged PR, -with clip counts, duration, prosody QA results, and known limitations. -Per-delivery notes and structured metadata live under `deliveries/{slug}/`. - ---- - -## Agent and contributor guidelines - -See [`CLAUDE.md`](CLAUDE.md) for the full rules governing this repository — cache integrity, -label policy, delivery log conventions, and what not to do. - ---- +| Repo | Purpose | +|---|---| +| [DataHackIL/SynthBanshee](https://github.com/DataHackIL/SynthBanshee) | Pipeline code, configs, templates, tests, and implementation docs | +| [DataHackIL/avdp-synth-corpus](https://github.com/DataHackIL/avdp-synth-corpus) | Generated synthetic corpus, transcripts, labels, delivery records, and cache assets | -## Related repositories +## Credits -| Repo | Purpose | -|------|---------| -| [DataHackIL/SynthBanshee](https://github.com/DataHackIL/SynthBanshee) | Pipeline code, configs, templates, tests, documentation | -| This repo | Generated data and asset cache only | +Created by [Shay Palachy Affek ](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
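For consumers who want a quick sanity check before running the SynthBanshee CLI, the four-file contract and the `has_violence` derivation described in this README can be sketched in plain Python. This is a hedged illustration only: the helper names are hypothetical, and the sketch assumes that `has_violence` is a top-level key in `{clip_id}.json` and that each line of `{clip_id}.jsonl` is a JSON object with a `tier1_category` field. It is not a replacement for `synthbanshee validate`.

```python
import json
from pathlib import Path

# The four files every clip is expected to have, per the repository layout.
REQUIRED_SUFFIXES = (".wav", ".txt", ".json", ".jsonl")


def missing_siblings(wav_path: Path) -> list[str]:
    """Return the suffixes of any required companion files that are absent."""
    return [s for s in REQUIRED_SUFFIXES if not wav_path.with_suffix(s).exists()]


def derive_has_violence(jsonl_path: Path) -> bool:
    """Apply the README rule: violent if any event's tier1_category is not NONE."""
    with jsonl_path.open(encoding="utf-8") as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    return any(e["tier1_category"] != "NONE" for e in events)


def check_clip(wav_path: Path) -> dict:
    """Check file completeness, then compare stored and derived has_violence."""
    missing = missing_siblings(wav_path)
    if missing:
        return {"clip_id": wav_path.stem, "missing": missing, "consistent": None}
    # Assumption: has_violence is a top-level key of the metadata JSON.
    metadata = json.loads(wav_path.with_suffix(".json").read_text(encoding="utf-8"))
    derived = derive_has_violence(wav_path.with_suffix(".jsonl"))
    return {
        "clip_id": wav_path.stem,
        "missing": [],
        "consistent": metadata.get("has_violence") == derived,
    }
```

A typical use would be to loop over `Path("data/he").glob("*/*.wav")` and flag any clip whose result has a non-empty `missing` list or `consistent` set to `False`. Such clips should be regenerated through SynthBanshee rather than patched by hand.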