AVDP Synthetic Dataset — Full Specification

Project: Audio Violence Dataset Project (AVDP)
Initiatives: She-Proves · Elephant in the Room
Organization: DataHack / DataForBetter (datahack.org.il)
Status: Draft v0.1, for review by AI team leads (Bar, Livnat, Asya)
Date: 2026-04-06
Companion documents: design_approaches.md, implementation_plan.md

1. Overview

This document defines the complete schema, taxonomy, metadata architecture, preprocessing requirements, and split strategy for all synthetic audio data generated by the AVDP framework. It is the authoritative reference for both the generation pipeline and the downstream model training pipelines.

All requirements in §2–§7 are binding on both the generator (the synthetic framework) and the consumer (AI team training pipelines). Requirements in §8 are advisory recommendations for the AI teams.

2. File Structure & Naming Conventions

2.1 Directory Layout

data/
  {language_code}/
    {speaker_id}/
      {clip_id}.wav
      {clip_id}.txt         ← transcript (UTF-8, Hebrew in he-IL clips)
      {clip_id}.json        ← per-clip metadata (see §5)
      {clip_id}.jsonl       ← per-clip strong labels, one event per line (see §5.2)

metadata/
  {split}_manifest.csv      ← clip-level manifest for train/val/test splits (see §5.3)
  {split}_labels_weak.jsonl ← weak (clip-level) labels

configs/
  scenes/                   ← scene YAML configs used to generate each clip
  speakers/                 ← speaker persona definitions
  acoustic_scenes/          ← acoustic environment configs

assets/                     ← source assets (TTS outputs, SFX, IR files)
  speech/
  sfx/
  ambient/
  noise/

The {speaker_id} path component is derived (not literal); see §2.5 for the per-surface casing rules.

2.2 Language Codes

Code	Language	Notes
`he`	Hebrew (he-IL)	Primary language for all project audio
`he_noisy`	Hebrew with significant acoustic degradation	Tier B/C clips only

2.3 Speaker ID Format

{role_code}_{gender}_{age_band}_{instance:03d}

Examples:

AGG_M_30-45_001 — Male aggressor, age 30–45, instance 1
VIC_F_25-40_002 — Female victim, age 25–40, instance 2
BYS_F_6-10_001 — Female bystander/child, age 6–10, instance 1
NEU_M_40-55_003 — Neutral/control speaker, instance 3

Role codes: AGG (aggressor) · VIC (victim) · BYS (bystander) · NEU (neutral/non-violent speaker)

speaker_id is uppercase as a value (configs, JSON, manifest, runtime). The matching directory name is speaker_id.lower() — see the casing table in §2.5.
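
The format above can be sketched as a small build/parse helper. This is an illustration, not the project's implementation: the function names are hypothetical, and the restriction to `M`/`F` gender codes is an assumption inferred from the examples.

```python
import re

# Pattern inferred from the format string and the examples in this section.
# Role codes come from the "Role codes" line; M/F is an assumption.
SPEAKER_ID_RE = re.compile(
    r"^(?P<role>AGG|VIC|BYS|NEU)_(?P<gender>[MF])_"
    r"(?P<age_lo>\d+)-(?P<age_hi>\d+)_(?P<instance>\d{3,})$"
)


def build_speaker_id(role: str, gender: str, age_band: str, instance: int) -> str:
    """Format a speaker_id value (uppercase form used in configs and JSON)."""
    return f"{role}_{gender}_{age_band}_{instance:03d}"


def parse_speaker_id(speaker_id: str) -> dict:
    """Split a speaker_id into its components, or raise ValueError."""
    match = SPEAKER_ID_RE.match(speaker_id)
    if match is None:
        raise ValueError(f"not a valid speaker_id: {speaker_id!r}")
    parts = match.groupdict()
    parts["instance"] = int(parts["instance"])
    return parts


def speaker_dir_name(speaker_id: str) -> str:
    """On-disk directory name is the lowercased value."""
    return speaker_id.lower()
```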

2.4 Clip ID Format

{project_code}_{violence_type_code}_{tier}_{scene_id:04d}_{segment:02d}

clip_id is derived at write time from the scene's uppercase scene_id YAML field via scene_id.lower() (hyphens → underscores) plus the _NN segment suffix; it does not appear verbatim in any YAML. Examples (on-disk form):

sp_it_b_0023_00 — She-Proves, Intimate Terrorism, Tier B, scene 23, segment 0
el_sv_a_0105_03 — Elephant in the Room, Situational Violence, Tier A, scene 105, segment 3
sp_neg_c_0412_00 — She-Proves, Negative / Confusor, Tier C, scene 412, segment 0

Project codes: SP (She-Proves) · EL (Elephant in the Room)
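
A minimal sketch of the derivation described above, assuming the scene_id has already been read from the scene YAML. The helper names are hypothetical.

```python
def build_scene_id(project_code: str, violence_type_code: str, tier: str, scene: int) -> str:
    """Uppercase scene_id as it would appear in the scene YAML."""
    return f"{project_code}_{violence_type_code}_{tier}_{scene:04d}".upper()


def derive_clip_id(scene_id: str, segment: int) -> str:
    """Lowercase the scene_id, map hyphens to underscores, append the _NN segment suffix."""
    stem = scene_id.lower().replace("-", "_")
    return f"{stem}_{segment:02d}"
```

For example, `derive_clip_id("SP_IT_B_0023", 0)` gives the on-disk stem `sp_it_b_0023_00`.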

2.5 Filename Constraints

ASCII characters only. No spaces, no UTF-8 characters above U+00A1.
Maximum filename length: 128 characters.
All filenames (and filesystem path components) are lowercase.
Every .wav file must have a corresponding .txt, .json, and .jsonl file with the identical stem in the same directory.

Identifier casing (per surface)

The same logical id appears in multiple places with different casing rules. This is the contract:

Surface	Case	Example
YAML `scene_id` / `speaker_id` / `speakers[].speaker_id`	UPPERCASE	`SP_IT_B_0023`, `AGG_M_30-45_001`
On-disk filename stem (`clip_id`)	lowercase	`sp_it_b_0023_00`
On-disk speaker directory (`{speaker_id}/`)	lowercase	`agg_m_30-45_001/`
JSON `clip_id`	lowercase	`sp_it_b_0023_00`
JSON `speakers[].speaker_id`	UPPERCASE	`AGG_M_30-45_001`
TXT `[CLIP_ID: …]`	lowercase	`sp_it_b_0023_00`
TXT `[SPEAKER: …]`	UPPERCASE	`AGG_M_30-45_001`
JSONL `event_id` / `clip_id`	mixed (see §5.2)	`sp_it_b_0023_00_EVT_007`
Manifest `speaker_ids` column (pipe-separated)	UPPERCASE	`AGG_M_30-45_001|VIC_F_25-40_002`

Consumers reconstructing paths from metadata must apply .lower() to speakers[0].speaker_id (or read wav_path from the manifest, which is already correct).
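
As an illustration of the path-reconstruction rule, the sketch below rebuilds the .wav path from a per-clip JSON record. It assumes that the JSON `language` field equals the `{language_code}` directory name and that the dataset root is `data/`. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def clip_wav_path(meta: dict, data_root: str = "data") -> PurePosixPath:
    """Rebuild data/{language_code}/{speaker_id}/{clip_id}.wav from clip metadata."""
    # speaker_id is uppercase as a value; the directory is its lowercase form.
    speaker_dir = meta["speakers"][0]["speaker_id"].lower()
    return PurePosixPath(data_root) / meta["language"] / speaker_dir / f"{meta['clip_id']}.wav"
```

Reading `wav_path` from the manifest remains the simpler option when a manifest is available.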

3. Audio Format Requirements

Parameter	Requirement	Notes
Container	WAV (PCM)	No lossy formats in the dataset
Sample rate	16,000 Hz	Resample before delivery; retain originals at native rate
Bit depth	16-bit PCM
Channels	Mono	Downmix to mono before delivery
Amplitude normalization	Peak-normalize to target (−2.0 dBFS default, range `[−12.0, −1.5]`) via single global gain, then peak-limit at ≤ −1.0 dBFS	Single-gain normalization preserves per-turn RMS contrast (M3a); the 0.5 dB margin between target upper bound and limiter ceiling guarantees the limiter is a no-op in normal flow (#78)
Silence padding	≥ 0.5 s of ambient baseline before and after target speech
SNR at acquisition	≥ 15 dB (Tier A)	Tier B/C may degrade controllably below 15 dB; log actual SNR in metadata
Max clip duration	300 s (5 min)	Longer source scenes must be segmented
Min clip duration (labeled)	3.0 s	Clips shorter than 3 s are excluded from the label set

3.1 Preprocessing Pipeline (ordered)

All clips must pass through this pipeline before delivery. The "dirty" pre-pipeline file must be retained in assets/ for robustness testing.

Implementation: synthbanshee/augment/preprocessing.py:preprocess(). PRs that change either the implementation or this section MUST update the other in the same change.

1. Resample: convert to 16,000 Hz using a polyphase filter. Skipped when the input is already at 16,000 Hz.
2. Downmix: stereo → mono via channel averaging.
3. High-pass filter at 80 Hz (Butterworth order 2, in second-order-sections (SOS) form) to remove DC and sub-bass rumble.
4. Conditional Wiener denoising: off by default; toggle via a boolean flag on PreprocessingConfig. Used for Tier B/C clips with real added noise after acoustic augmentation.
5. Loudness normalization (#78), in two stages:
   - 5a. Peak-normalize to target. Apply a single global gain so the absolute peak lands at PreprocessingConfig.target_peak_dbfs (default −2.0 dBFS). A single gain preserves per-turn RMS ratios exactly, so the within-scene loudness trajectory established by per-turn RMS gain (M3a) survives; only the absolute level shifts. This step replaces the M3b "limiter only, never scale up" behaviour: pre-#78 the spec had only an upper bound on peak, leaving the absolute level unspecified; two clips could legitimately sit 6 dB apart and both be in-spec.
   - 5b. Safety limiter. Attenuate any sample exceeding −1.0 dBFS. For in-spec target values (target_peak_dbfs ∈ [−12.0, −1.5]) this is a guaranteed no-op (0.5 dB margin); it remains as defence-in-depth against upstream over-range samples.
6. Silence pad: verify ≥ 0.5 s ambient baseline at head and tail; add if absent.
7. Validate: assert sample rate == 16000, channels == 1, no NaN/Inf samples, no UTF-8 above U+00A1 in metadata strings.
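
The two-stage loudness normalization can be sketched in pure Python as follows. This is not the code in synthbanshee/augment/preprocessing.py. It assumes float samples with full scale at ±1.0, and it uses a hard clip as a stand-in for the safety limiter, whose actual attenuation curve is not specified here. Function names are hypothetical.

```python
import math


def db_to_linear(db: float) -> float:
    """Convert a dBFS value to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (db / 20.0)


def measure_peak_dbfs(samples) -> float:
    """Absolute peak in dBFS; -inf for an all-zero signal."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20.0 * math.log10(peak) if peak > 0.0 else float("-inf")


def normalize_peak(samples, target_peak_dbfs: float = -2.0,
                   limiter_ceiling_dbfs: float = -1.0) -> list:
    """Single global gain to the target peak, then a safety ceiling."""
    if not -12.0 <= target_peak_dbfs <= -1.5:
        raise ValueError("target_peak_dbfs outside the in-spec range [-12.0, -1.5]")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: no gain can be derived
    # One gain for the whole clip, so per-turn RMS ratios are unchanged.
    gain = db_to_linear(target_peak_dbfs) / peak
    scaled = [s * gain for s in samples]
    # Assumption: hard clip as the limiter; a no-op for in-spec targets.
    ceiling = db_to_linear(limiter_ceiling_dbfs)
    return [max(-ceiling, min(ceiling, s)) for s in scaled]
```

The value returned by `measure_peak_dbfs` on the output corresponds to what §5.1 records as `preprocessing_applied.normalized_dbfs`.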

4. Annotation Taxonomy

Binary Violence / Non-Violence labels are prohibited. Every labeled event must carry a hierarchical tag from the taxonomy below.

4.1 Violence Typology (Scene-Level)

Applied once per scene/clip, not per event.

Code	Label	Description
`SV`	Situational Violence	Episodic aggression arising from a specific conflict; not part of a chronic control pattern
`IT`	Intimate Terrorism	Chronic coercive control; pattern of domination, intimidation, and systematic suppression
`NEG`	Negative / Confusor	Acoustically intense but non-violent; used for hard-negative (Tier C) clips
`NEU`	Neutral	Everyday interaction with no aggression; control clips

4.2 Violence Category (Event-Level, Tier 1)

Code	Category	Notes
`PHYS`	Physical violence	Any physical contact or weapon use
`VERB`	Verbal aggression	Speech-based aggression, threats, humiliation
`DIST`	Distress signal	Victim or bystander distress vocalizations
`ACOU`	Acoustic event	Non-speech sounds associated with violence
`EMOT`	Emotional / coercive control	Controlling speech patterns; gaslighting; manipulation
`NONE`	No violence	Background, neutral speech, or confusor events

4.3 Event Subtype (Event-Level, Tier 2)

Each Tier 1 category has defined subtypes. Annotators must specify the most specific applicable subtype.

PHYS — Physical violence

Code	Subtype	Description
`PHYS_SOFT`	Soft physical contact	Slap, push, grab
`PHYS_HARD`	Hard physical contact	Kick, punch, strike
`PHYS_WEAP`	Weapon / object use	Object thrown, weapon wielded
`PHYS_MOVE`	Forced movement	Dragging, restraining

VERB — Verbal aggression

Code	Subtype	Description
`VERB_HUMIL`	Taunting / humiliation	Insults, degradation, mocking
`VERB_THREAT`	Explicit threat	Direct threats of harm
`VERB_SHOUT`	Shouting / rage	Elevated volume without specific threat content
`VERB_COER`	Coercive demand	Commanding, prohibiting with implicit power

DIST — Distress signal

Code	Subtype	Description
`DIST_SCREAM`	Panic scream	High-intensity fear vocalization
`DIST_PLEAD`	Pleading / submission	"Please stop", capitulation language
`DIST_CRY`	Crying / sobbing	Audible distress crying
`DIST_BREATH`	Stressed breathing	Hyperventilation, audible fear breathing
`DIST_CHILD`	Child distress	Child crying or screaming in scene

ACOU — Acoustic event

Code	Subtype	Description
`ACOU_BREAK`	Breaking / shattering	Glass, crockery, object breaking
`ACOU_SLAM`	Slam / impact	Door slam, fist on surface
`ACOU_THROW`	Object thrown	Distinct throw-and-land sound
`ACOU_FOOT`	Rapid footsteps	Running, stamping
`ACOU_FALL`	Body/object fall	Person or large object falling

EMOT — Emotional / coercive control

Code	Subtype	Description
`EMOT_GASLIT`	Gaslighting	Reality denial, "that never happened"
`EMOT_ISOL`	Isolation / control	Prohibiting contact, monitoring behavior
`EMOT_ECON`	Economic control	Financial threats, deprivation language
`EMOT_LEGAL`	Legal threat	Threatening custody, police, deportation

NONE — Non-violent / confusor

Code	Subtype	Description
`NONE_ARGU`	De-escalating argument	Heated but non-violent exchange that ends calmly
`NONE_SPORT`	Sports/entertainment yelling	TV, sports, excited vocal outburst
`NONE_CHILD`	Child play noise	Loud children, non-distress
`NONE_CRY_SAFE`	Non-violence crying	Crying from grief, frustration, joy
`NONE_LAUGH`	Laughter / excitement	Acoustically similar to distress
`NONE_CLINIC`	Animated clinic interaction	Social worker + client, agitated but not violent
`NONE_AMBIENT`	Background ambience	TV, traffic, appliances, conversation

4.4 Severity / Intensity Scale

Applied per event or per scene segment. Scale aligns with the scripting instructions.

Level	Label	Description
1	Low tension	Calm conversation, mild undercurrent of tension
2	Moderate tension	Noticeable friction, raised voices without aggression
3	Active conflict	Clear verbal aggression or intimidation
4	Escalated violence	Physical or high-intensity verbal violence
5	Extreme / life-threatening	Severe physical violence, panic, imminent danger

4.5 Speaker Role Tags

Code	Role	Notes
`AGG`	Aggressor	The party initiating or sustaining the violence
`VIC`	Victim	The party against whom violence is directed
`BYS`	Bystander	Third party present (child, neighbor, colleague)
`UNK`	Unknown	Speaker cannot be attributed with confidence

4.6 Emotional State Tags

Applied per speaker turn, not per clip.

anger · fear · panic · distress · neutral · contempt · submission · grief · confusion · defiance

5. Metadata Architecture

5.1 Per-Clip Metadata (JSON)

Every clip has a companion {clip_id}.json file with this schema:

{
  "clip_id": "sp_it_b_0023_00",
  "project": "she_proves",
  "language": "he",
  "violence_typology": "IT",
  "tier": "B",
  "duration_seconds": 247.3,
  "sample_rate": 16000,
  "channels": 1,
  "snr_db_estimated": 19.4,
  "scene_config": "configs/scenes/she_proves_tier_b/sp_it_b_0023.yaml",
  "random_seed": 42,
  "generation_date": "2026-04-10",
  "generator_version": "0.1.0",
  "is_synthetic": true,
  "acoustic_scene": {
    "room_type": "apartment_kitchen",
    "device": "phone_in_pocket",
    "ir_source": "pyroomacoustics",
    "background_events": [
      {"type": "tv_ambient", "onset": 0.0, "level_db": -30}
    ]
  },
  "speakers": [
    {
      "speaker_id": "AGG_M_30-45_001",
      "role": "AGG",
      "gender": "male",
      "age_range": "30-45",
      "tts_voice_id": "he-IL-AvriNeural",
      "voice_family": "Avri"
    },
    {
      "speaker_id": "VIC_F_25-40_002",
      "role": "VIC",
      "gender": "female",
      "age_range": "25-40",
      "tts_voice_id": "he-IL-HilaNeural"
    }
  ],
  "weak_label": {
    "has_violence": true,
    "violence_categories": ["VERB", "PHYS", "DIST"],
    "max_intensity": 5,
    "violence_typology": "IT"
  },
  "preprocessing_applied": {
    "resampled_to_16k": true,
    "downmixed_to_mono": true,
    "spectral_filtered": true,
    "denoised": true,
    "normalized_dbfs": -2.013,
    "silence_padded": true
  },
  "generation_metadata": {
    "pipeline_version": "0.1.0",
    "tts_backend": {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "azure"},
    "voice_family": {"AGG_M_30-45_001": "Avri", "VIC_F_25-40_002": "he-IL-HilaNeural"},
    "mix_mode_used": "sequential",
    "normalization_strategy": "per_turn_rms_v2_target_peak",
    "loudness_target_peak_dbfs": -2.0,
    "breathiness_applied": false,
    "effective_prosody_caps": []
  },
  "dirty_file_path": "assets/speech/dirty/sp_it_b_0023_00_dirty.wav",
  "transcript_path": "data/he/agg_m_30-45_001/sp_it_b_0023_00.txt",
  "quality_flags": [],
  "annotator_confidence": 1.0,
  "iaa_reviewed": false
}

Field notes

preprocessing_applied.normalized_dbfs is the measured post-preprocess peak (pair with generation_metadata.loudness_target_peak_dbfs to compute drift from target — see labels/schema.py for the docstring that pins this split).
tts_engine was removed in #109. The TTS provider is now recorded per-speaker in generation_metadata.tts_backend (e.g. {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "google"}); read backend diversity from the structured map. Pre-#109 corpus snapshots still carry the field — consumers should tolerate but ignore it.
generation_metadata is optional: a JSON object when the generator recorded pipeline provenance, null otherwise. Treat absence as "unknown", not as failure. generator_version alone is not a reliable presence signal.
speakers[].voice_family is optional: a stable family handle (e.g. "Avri") when the speaker YAML overrides it, omitted otherwise. Consumers should fall back to tts_voice_id.
weak_label.has_violence is derived, not asserted: any(e.tier1_category != "NONE" for e in events) — see synthbanshee/labels/generator.py. Corollaries: empty events → False; NEG typology clips are False (every event lands tier1_category: "NONE" by §4.1); violence_typology and has_violence may disagree (e.g. SV with False if no violent tier1 fired). The events are the ground truth; the flag is convenience. External docs and downstream code must mirror this rule — re-deriving from typology or intensity alone produces disagreement on every NEG row.

quality_flags valid values: low_snr · clipping · short_silence_pad · label_uncertainty · iaa_disagreement · synthetic_artifact
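
The has_violence rule from the field notes can be mirrored by consumers as shown below. This is a sketch, not the code in synthbanshee/labels/generator.py; events are assumed to be dicts shaped like the JSONL records in §5.2.

```python
def has_violence(events) -> bool:
    """True if any event carries a tier1_category other than NONE."""
    return any(e["tier1_category"] != "NONE" for e in events)
```

An empty event list yields False, and so does a clip whose events are all `NONE`, regardless of the clip's violence_typology.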

5.2 Per-Event Strong Labels (JSONL)

One record per labeled event, stored per-clip as {clip_id}.jsonl in the same directory as the corresponding .wav, .txt, and .json files. The pipeline writes this file automatically during Stage 4b (Strong Label Writer). Each line is a JSON object:

{
  "event_id": "sp_it_b_0023_00_EVT_007",
  "clip_id": "sp_it_b_0023_00",
  "onset": 143.82,
  "offset": 146.05,
  "tier1_category": "PHYS",
  "tier2_subtype": "PHYS_HARD",
  "intensity": 5,
  "speaker_id": "AGG_M_30-45_001",
  "speaker_role": "AGG",
  "emotional_state": "anger",
  "confidence": 0.95,
  "label_source": "auto",
  "iaa_reviewed": false,
  "notes": "punch impact followed by object fall"
}

event_id is generated as {clip_id}_EVT_{idx:03d} — the clip_id prefix is lowercase per §2.5 and EVT is a literal uppercase token (it is not a casing inconsistency). speaker_id and speaker_role are values, not filenames, so they remain uppercase per §2.5's casing table.

label_source values: auto (derived from scene config/script) · human (manual annotation) · auto_reviewed (auto + human validation pass)
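
A sketch of event_id generation plus a light consistency check on one strong-label record. The checks are inferred from this specification (the subtype codes in §4.3 are prefixed by their Tier 1 category, intensity follows the 1–5 scale in §4.4); the function names and the exact set of checks are assumptions, not the pipeline's validator.

```python
VALID_LABEL_SOURCES = {"auto", "human", "auto_reviewed"}


def make_event_id(clip_id: str, idx: int) -> str:
    """Lowercase clip_id prefix, literal uppercase EVT token, zero-padded index."""
    return f"{clip_id}_EVT_{idx:03d}"


def validate_event(event: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record looks consistent."""
    problems = []
    if not event["event_id"].startswith(event["clip_id"] + "_EVT_"):
        problems.append("event_id does not start with clip_id + '_EVT_'")
    if not event["onset"] < event["offset"]:
        problems.append("onset must be earlier than offset")
    if not event["tier2_subtype"].startswith(event["tier1_category"] + "_"):
        problems.append("tier2_subtype does not belong to tier1_category")
    if event["intensity"] not in (1, 2, 3, 4, 5):
        problems.append("intensity outside the 1-5 scale")
    if event["label_source"] not in VALID_LABEL_SOURCES:
        problems.append("unknown label_source")
    return problems
```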

5.3 Weak Label Manifest (CSV)

One manifest CSV per generation run, written to the output directory (e.g. data/he/manifest.csv). One row per clip, for fast dataset loading:

clip_id, project, violence_typology, tier, duration_seconds, speaker_ids, has_violence, max_intensity, quality_flags, split, wav_path, strong_labels_path

speaker_ids: pipe-separated list of speaker_id values (e.g. AGG_M_30-45_001|VIC_F_25-40_002)
quality_flags: comma-separated list of flag strings, empty string if none
split: train | val | test, or empty string if unassigned
wav_path: path to the .wav file; .txt, .json, and .jsonl share the same stem
strong_labels_path: path to the per-clip .jsonl strong-labels file, or empty string if absent
language is implicit in the data/{language_code}/ directory structure and omitted from the manifest
violence_categories and the redundant txt_path/json_path columns from the original spec are superseded by this schema
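
A minimal reader for this manifest layout, using only the standard library. It assumes the header row uses the column names above without surrounding spaces and that a multi-valued quality_flags cell is quoted so its commas survive CSV parsing. The function name is hypothetical.

```python
import csv
import io

MANIFEST_COLUMNS = [
    "clip_id", "project", "violence_typology", "tier", "duration_seconds",
    "speaker_ids", "has_violence", "max_intensity", "quality_flags",
    "split", "wav_path", "strong_labels_path",
]


def read_manifest(text: str) -> list[dict]:
    """Parse manifest CSV text into row dicts with list-valued columns split out."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(MANIFEST_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"manifest is missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Pipe-separated speaker ids, comma-separated flags; empty string means none.
        row["speaker_ids"] = row["speaker_ids"].split("|") if row["speaker_ids"] else []
        row["quality_flags"] = row["quality_flags"].split(",") if row["quality_flags"] else []
        row["split"] = row["split"] or None  # empty string means unassigned
        rows.append(row)
    return rows
```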

5.4 Analysis Windows

For model training, clips are analyzed using:

Window length: 3.0 seconds
Hop length: 1.0 second (2.0 s of overlap between consecutive windows, about 67%)
Minimum event duration for strong labeling: 0.1 seconds (below this, event is noted in clip metadata but not given a strong label)
Ambiguity culling: trim both ends of each manually labeled event inward to its 10 dB down-points (where the level falls 10 dB below the event peak), so that only high-energy samples are included in training
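
The window and hop parameters above translate into window boundaries as sketched below. Dropping a trailing partial window is an assumption; this section does not say how the end of a clip is handled.

```python
def analysis_windows(duration_s: float, window_s: float = 3.0, hop_s: float = 1.0) -> list:
    """Return (start, end) pairs for every full window that fits in the clip."""
    windows = []
    n = 0
    # Small epsilon guards against float rounding at the clip boundary.
    while n * hop_s + window_s <= duration_s + 1e-9:
        start = n * hop_s
        windows.append((start, start + window_s))
        n += 1
    return windows


def overlap_fraction(window_s: float = 3.0, hop_s: float = 1.0) -> float:
    """Fraction of each window shared with the next one."""
    return (window_s - hop_s) / window_s
```

A clip of exactly 3.0 s (the minimum labeled duration in §3) yields a single window.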

6. Inter-Annotator Agreement (IAA) Protocol

This section applies to any human annotation pass (Tier A auto-labels reviewed by human; all Tier B/C labels).

6.1 Coverage Requirement

A minimum of 20% of all labeled segments must undergo independent second-pass review by a different annotator.

6.2 Agreement Targets (Cohen's Kappa)

Event category	Target κ	Minimum acceptable κ
Physical events (`PHYS_*`)	κ ≥ 0.65	κ ≥ 0.55
Verbal aggression (`VERB_*`)	κ ≥ 0.60	κ ≥ 0.50
Distress signals (`DIST_*`)	κ ≥ 0.60	κ ≥ 0.50
Acoustic events (`ACOU_*`)	κ ≥ 0.70	κ ≥ 0.60
Emotional state	κ ≥ 0.55	κ ≥ 0.45
Intensity level (±1 tolerance)	κ ≥ 0.60	κ ≥ 0.50
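
For reference, Cohen's kappa for two annotators labeling the same segments can be computed as in this sketch. It covers exact-match agreement only; how the ±1 tolerance for intensity is applied is not defined in this section and is not implemented here. Returning 1.0 when chance agreement is already 1.0 is a convention chosen for the sketch.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) for paired labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

The result is then compared against the target and minimum columns of the table for the relevant event category.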

6.3 Conflict Resolution

Tier 2 label disagreements that cannot be resolved between annotators are escalated to the field expert reviewer (a qualified social worker or domestic violence specialist from the Rakman Institute team).
Clips with unresolved disagreement receive quality_flags: ["iaa_disagreement"] and are excluded from training splits pending resolution. They may be included in a held-out "uncertain" evaluation set.

6.4 Confidence Scoring

Annotators report a per-event confidence float (0.0–1.0). Events with confidence < 0.6 are flagged with quality_flags: ["label_uncertainty"] and excluded from the primary training split. They are retained in a separate uncertainty partition.

7. Project-Specific Label Variants

7.1 She-Proves

Use case: Long-form passive monitoring on a smartphone; detecting and segmenting rare incident windows within hours of background audio.

Primary modeling tasks:

Incident window detection (binary: violence window / non-violence window, over weak-label clips of 30–120 s)
Event segmentation within flagged windows (strong labels)
Distress/aggression cue detection (classification)
Escalation arc modeling (intensity sequence over time)

Additional metadata fields (She-Proves only):

"she_proves_meta": {
  "scene_phase": "escalation",
  "phase_options": ["baseline", "tension", "escalation", "peak", "aftermath", "de-escalation"],
  "incident_window": true,
  "incident_onset_in_clip": 98.4,
  "incident_offset_in_clip": 187.2,
  "recording_device_profile": "phone_in_pocket",
  "ambient_duration_before_incident": 98.4
}

Scene duration targets:

Tier A/B scenes: 3–6 minutes (reflecting real passive recording windows)
At least 60% of scene duration should be "baseline" or "tension" (pre-incident) to reflect realistic base-rate sparsity
Target ratio: ≥ 3 non-violence control clips per violence clip

Key acoustic conditions to cover:

Phone in pocket / bag / on table (each ≥ 20% of scenes)
Near-field (< 1 m) and far-field (3–5 m) speaker positions
Home environments: bedroom, kitchen, living room, hallway

7.2 Elephant in the Room

Use case: Fixed Raspberry Pi–class device in a clinic or welfare office; detecting imminent assault on a social worker in near-real-time.

Primary modeling tasks:

Imminent attack / aggression detection (binary alert)
Escalation forecasting over short time horizons (5–30 s)
Acoustic event fusion: shouting + impact + distress = alert composite
"Alert now" vs "monitor" policy output

Additional metadata fields (Elephant in the Room only):

"elephant_meta": {
  "scene_type": "office_encounter",
  "scene_type_options": ["intake_interview", "benefits_dispute", "crisis_visit", "routine_followup"],
  "alert_triggered": true,
  "alert_onset": 87.3,
  "pre_alert_duration": 87.3,
  "post_alert_duration": 45.1,
  "attack_type": "PHYS_HARD",
  "recording_device_profile": "pi_budget_mic",
  "room_type": "clinic_office"
}

Scene duration targets:

Tier A/B scenes: 1–4 minutes (reflecting typical encounter length)
Alert event should occur within the final 40% of scene duration
Target ratio: ≥ 2 non-alert confusor clips per alert clip (animated encounters without attack)
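
The alert-placement target can be checked from the elephant_meta fields, as in the sketch below. It assumes that scene duration equals pre_alert_duration + post_alert_duration, which the example values suggest but this section does not state. The function name is hypothetical.

```python
def alert_in_final_window(pre_alert_s: float, post_alert_s: float,
                          final_fraction: float = 0.40) -> bool:
    """True if the alert onset falls within the last `final_fraction` of the scene."""
    total = pre_alert_s + post_alert_s
    if total <= 0:
        raise ValueError("scene duration must be positive")
    return pre_alert_s / total >= 1.0 - final_fraction
```

With the example values above (87.3 s before, 45.1 s after), the alert sits at roughly 66% of the scene and satisfies the target.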

Key acoustic conditions to cover:

Fixed-position budget microphone (Raspberry Pi HAT mic or USB equivalent)
Office environments: small clinic room, open welfare office, corridor
Background: HVAC hum, distant phone ringing, door open/close
Multiple confusor types: agitated but non-violent beneficiary, crying client, raised voices in adjacent room

8. Dataset Split Strategy

8.1 Primary Splits

Split	Size	Purpose
`train`	70%	Model training
`val`	15%	Hyperparameter tuning, early stopping
`test_synth`	15%	Evaluation on synthetic data (measures in-distribution performance)

A separate held-out partition is maintained for actor recordings and real data (future phases) and is never mixed with synthetic splits.

8.2 Stratification Variables

Splits must be stratified on all of the following simultaneously:

Project (she_proves / elephant_in_the_room)
Violence typology (SV / IT / NEG / NEU)
Tier (A / B / C)
Max intensity band (1–2 / 3 / 4–5)
Room type (broad categories)
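
One way to stratify on all five variables at once is to derive a composite key per clip from its JSON metadata (§5.1) and split within each key. The sketch below assumes max intensity is read from weak_label.max_intensity and passes room_type through unchanged, since the mapping to "broad categories" is not defined here. Function names are hypothetical.

```python
def intensity_band(max_intensity: int) -> str:
    """Map a 1-5 intensity to the bands used for stratification."""
    if max_intensity in (1, 2):
        return "1-2"
    if max_intensity == 3:
        return "3"
    if max_intensity in (4, 5):
        return "4-5"
    raise ValueError(f"intensity outside the 1-5 scale: {max_intensity!r}")


def stratum_key(meta: dict) -> tuple:
    """Composite stratification key built from per-clip metadata."""
    return (
        meta["project"],
        meta["violence_typology"],
        meta["tier"],
        intensity_band(meta["weak_label"]["max_intensity"]),
        meta["acoustic_scene"]["room_type"],  # assumption: no broad-category mapping applied
    )
```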

8.3 Speaker-Disjoint Splits

No speaker persona (TTS voice ID) may appear in more than one split. This is critical: if the same voice appears in both train and test, the model can learn voice identity instead of acoustic event content, and test metrics will overstate how well it generalizes to unseen speakers. Assign each speaker persona to a split before generating scenes.
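
A leak check for this rule can be run over manifest rows, as in the sketch below. It uses the manifest's speaker_ids column as the persona key, which is an assumption: the rule is stated in terms of TTS voice ID, so a stricter check would key on tts_voice_id from the per-clip JSON. Rows are assumed to be dicts with `speaker_ids` as a pipe-separated string.

```python
from collections import defaultdict


def speakers_in_multiple_splits(rows) -> dict:
    """Map each speaker_id that appears in more than one split to its sorted splits."""
    seen = defaultdict(set)
    for row in rows:
        if not row["split"]:
            continue  # unassigned clips cannot leak
        for speaker_id in row["speaker_ids"].split("|"):
            seen[speaker_id].add(row["split"])
    return {sid: sorted(splits) for sid, splits in seen.items() if len(splits) > 1}
```

An empty result means the assignment is speaker-disjoint under this key.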

8.4 Scene-Disjoint Splits

No scene config (or its close variants) may appear in more than one split. Script templates may be shared across splits only if specific content is different (different slots filled, different intensity arcs).

8.5 Target Class Balance per Project

She-Proves:

Violence window clips: ≥ 30% of total, ≤ 50%
Negative/confusor (Tier C) clips: ≥ 20% of total
Neutral control clips: ≥ 15% of total

Elephant in the Room:

Alert clips (attack present): ≥ 25% of total, ≤ 45%
Animated non-alert clips: ≥ 25% of total
Neutral/routine clips: ≥ 15% of total

9. Negative Samples & Acoustic Confusors (Tier C)

The confusor set is critical for both projects. A model that cannot distinguish these from true violence will have unacceptable false alarm rates in deployment.

Required Confusor Types

Confusor type	Code	Relevant for
Heated argument that de-escalates before violence	`NONE_ARGU`	Both
Sports/TV yelling	`NONE_SPORT`	She-Proves
Loud children (play, tantrums)	`NONE_CHILD`	She-Proves
Crying from non-violence (grief, frustration)	`NONE_CRY_SAFE`	Both
Laughter that acoustically resembles screaming	`NONE_LAUGH`	Both
Animated clinic interaction (agitation without attack)	`NONE_CLINIC`	Elephant in the Room
Social worker client in distress (no aggression)	`NONE_CLINIC`	Elephant in the Room
Hebrew radio / TV drama	`NONE_AMBIENT`	She-Proves
Cooking sounds (chopping, pan, breaking crockery accidentally)	`ACOU_BREAK` + `NONE`	She-Proves

Minimum Confusor Coverage

Tier C clips must constitute ≥ 20% of each project's total dataset
Each confusor type in the table above must have ≥ 50 clips in the training split

10. Transcript Format

Each .txt transcript file follows this format (UTF-8):

[CLIP_ID: sp_it_b_0023_00]
[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.0 | OFFSET: 4.2]
אמרתי לך לא ללכת לשם!
[ACTION: VERB_SHOUT | INTENSITY: 4]

[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 4.5 | OFFSET: 7.1]
בבקשה, אל תתחיל שוב...
[ACTION: DIST_PLEAD | INTENSITY: 4]

[ACTION: ACOU_BREAK | ONSET: 7.8 | OFFSET: 8.1 | INTENSITY: 5]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 8.2 | OFFSET: 9.0]
די! תפסיק!
[ACTION: DIST_SCREAM | INTENSITY: 5]

Transcript redundancy constraint: no word or character sequence may repeat more than 3 consecutive times (prevents TTS/ASR service failures).
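
The bracketed tag lines and the redundancy constraint can be handled with a few lines of standard-library code. This sketch is illustrative: it parses one tag line at a time and checks only single-word repetition, not repeated multi-word or character sequences. Function names are hypothetical.

```python
import re

TAG_LINE_RE = re.compile(r"^\[(.+)\]$")


def parse_tag_line(line: str):
    """Parse '[KEY: value | KEY: value]' into a dict; return None for non-tag lines."""
    match = TAG_LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for part in match.group(1).split("|"):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def max_consecutive_word_repeats(text: str) -> int:
    """Length of the longest run of the same whitespace-delimited word."""
    longest = run = 0
    previous = None
    for word in text.split():
        run = run + 1 if word == previous else 1
        longest = max(longest, run)
        previous = word
    return longest
```

An utterance violates the constraint when `max_consecutive_word_repeats` returns more than 3.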

11. Liveness & Synthetic Origin Tracking

All synthetic clips must carry a machine-readable marker of their synthetic origin to prevent inadvertent contamination of real-data evaluation sets.

is_synthetic: true in JSON metadata (mandatory)
generator_version field must match the version tag of the generation code
The manifest CSV includes an is_synthetic column

When the framework later expands to include actor recordings, those clips carry is_synthetic: false, and an actor_session_id field is added to the metadata schema.

Document prepared for DataHack AVDP — not for distribution outside the project team.
