AVDP Synthetic Dataset — Full Specification

Project: Audio Violence Dataset Project (AVDP)
Initiatives: She-Proves · Elephant in the Room
Organization: DataHack / DataForBetter (datahack.org.il)
Status: Draft v0.1 — for review by AI team leads (Bar, Livnat, Asya)
Date: 2026-04-06
Companion documents: design_approaches.md, implementation_plan.md


1. Overview

This document defines the complete schema, taxonomy, metadata architecture, preprocessing requirements, and split strategy for all synthetic audio data generated by the AVDP framework. It is the authoritative reference for both the generation pipeline and the downstream model training pipelines.

All requirements in §2–§7 are binding on both the generator (the synthetic framework) and the consumer (AI team training pipelines). Requirements in §8 are advisory recommendations for the AI teams.


2. File Structure & Naming Conventions

2.1 Directory Layout

data/
  {language_code}/
    {speaker_id}/
      {clip_id}.wav
      {clip_id}.txt         ← transcript (UTF-8, Hebrew in he-IL clips)
      {clip_id}.json        ← per-clip metadata (see §5)
      {clip_id}.jsonl       ← per-clip strong labels, one event per line (see §5.2)

metadata/
  {split}_manifest.csv      ← clip-level manifest for train/val/test splits (see §5.3)
  {split}_labels_weak.jsonl ← weak (clip-level) labels

configs/
  scenes/                   ← scene YAML configs used to generate each clip
  speakers/                 ← speaker persona definitions
  acoustic_scenes/          ← acoustic environment configs

assets/                     ← source assets (TTS outputs, SFX, IR files)
  speech/
  sfx/
  ambient/
  noise/

The {speaker_id} path component is derived (not literal); see §2.5 for the per-surface casing rules.

2.2 Language Codes

| Code | Language | Notes |
|---|---|---|
| he | Hebrew (he-IL) | Primary language for all project audio |
| he_noisy | Hebrew with significant acoustic degradation | Tier B/C clips only |

2.3 Speaker ID Format

{role_code}_{gender}_{age_band}_{instance:03d}

Examples:

  • AGG_M_30-45_001 — Male aggressor, age 30–45, instance 1
  • VIC_F_25-40_002 — Female victim, age 25–40, instance 2
  • BYS_F_6-10_001 — Female bystander/child, age 6–10, instance 1
  • NEU_M_40-55_003 — Male neutral/control speaker, age 40–55, instance 3

Role codes: AGG (aggressor) · VIC (victim) · BYS (bystander) · NEU (neutral/non-violent speaker)

speaker_id is uppercase as a value (configs, JSON, manifest, runtime). The matching directory name is speaker_id.lower() — see the casing table in §2.5.
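The format above can be checked mechanically. The following is a minimal sketch, not the project's own parser: the function name `parse_speaker_id` and the returned dictionary keys are hypothetical, and the gender codes `M`/`F` are inferred from the examples only and may not be exhaustive.

```python
import re

# {role_code}_{gender}_{age_band}_{instance:03d}, uppercase as a value.
# Assumption: gender is M or F, as in the examples; extend if the spec grows.
SPEAKER_ID_RE = re.compile(
    r"^(?P<role>AGG|VIC|BYS|NEU)_"
    r"(?P<gender>[MF])_"
    r"(?P<age_band>\d{1,2}-\d{1,2})_"
    r"(?P<instance>\d{3})$"
)


def parse_speaker_id(speaker_id: str) -> dict:
    """Split an uppercase speaker_id into its components."""
    match = SPEAKER_ID_RE.match(speaker_id)
    if match is None:
        raise ValueError(f"not a valid speaker_id: {speaker_id!r}")
    fields = match.groupdict()
    fields["instance"] = int(fields["instance"])
    # The on-disk directory name is the lowercase form of the value.
    fields["dir_name"] = speaker_id.lower()
    return fields
```

For example, `parse_speaker_id("AGG_M_30-45_001")["dir_name"]` yields `agg_m_30-45_001`, while a lowercase input is rejected because the value form is uppercase.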

2.4 Clip ID Format

{project_code}_{violence_type_code}_{tier}_{scene_id:04d}_{segment:02d}

clip_id is derived at write time from the scene's uppercase scene_id YAML field via scene_id.lower() (hyphens → underscores) plus the _NN segment suffix; it does not appear verbatim in any YAML. Examples (on-disk form):

  • sp_it_b_0023_00 — She-Proves, Intimate Terrorism, Tier B, scene 23, segment 0
  • el_sv_a_0105_03 — Elephant in the Room, Situational Violence, Tier A, scene 105, segment 3
  • sp_neg_c_0412_00 — She-Proves, Negative / Confusor, Tier C, scene 412, segment 0

Project codes: SP (She-Proves) · EL (Elephant in the Room)
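The write-time derivation of clip_id described above might look like the sketch below. The helper name `derive_clip_id` is hypothetical, and the two-digit bound on the segment index is an assumption read off the `{segment:02d}` format.

```python
def derive_clip_id(scene_id: str, segment: int) -> str:
    """Build the on-disk clip_id from an uppercase YAML scene_id."""
    # Assumption: {segment:02d} implies segment indices 0..99.
    if not 0 <= segment <= 99:
        raise ValueError(f"segment out of range: {segment}")
    # Lowercase, hyphens to underscores, then the _NN segment suffix.
    stem = scene_id.lower().replace("-", "_")
    return f"{stem}_{segment:02d}"
```

With this sketch, `derive_clip_id("SP_IT_B_0023", 0)` gives `sp_it_b_0023_00`, matching the first example in the list.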

2.5 Filename Constraints

  • ASCII characters only. No spaces, no UTF-8 characters above U+00A1.
  • Maximum filename length: 128 characters.
  • All filenames (and filesystem path components) are lowercase.
  • Every .wav file must have a corresponding .txt, .json, and .jsonl file with the identical stem in the same directory.

Identifier casing (per surface)

The same logical id appears in multiple places with different casing rules. This is the contract:

| Surface | Case | Example |
|---|---|---|
| YAML scene_id / speaker_id / speakers[].speaker_id | UPPERCASE | SP_IT_B_0023, AGG_M_30-45_001 |
| On-disk filename stem (clip_id) | lowercase | sp_it_b_0023_00 |
| On-disk speaker directory ({speaker_id}/) | lowercase | agg_m_30-45_001/ |
| JSON clip_id | lowercase | sp_it_b_0023_00 |
| JSON speakers[].speaker_id | UPPERCASE | AGG_M_30-45_001 |
| TXT [CLIP_ID: …] | lowercase | sp_it_b_0023_00 |
| TXT [SPEAKER: …] | UPPERCASE | AGG_M_30-45_001 |
| JSONL event_id / clip_id | mixed (see §5.2) | sp_it_b_0023_00_EVT_007 |
| Manifest speaker_ids column (pipe-separated) | UPPERCASE | AGG_M_30-45_001\|VIC_F_25-40_002 |

Consumers reconstructing paths from metadata must apply .lower() to speakers[0].speaker_id (or read wav_path from the manifest, which is already correct).
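As an illustration of the rule above, a consumer could rebuild the audio path from the per-clip JSON roughly as follows. This is a sketch: `wav_path_from_metadata` and the `data_root` parameter are hypothetical names, and it assumes the directory is keyed by the first listed speaker, as the rule above implies.

```python
from pathlib import PurePosixPath


def wav_path_from_metadata(meta: dict, data_root: str = "data") -> str:
    """Rebuild data/{language_code}/{speaker_id}/{clip_id}.wav from clip JSON."""
    # speaker_id is UPPERCASE as a JSON value; the directory is its lowercase form.
    speaker_dir = meta["speakers"][0]["speaker_id"].lower()
    path = PurePosixPath(data_root) / meta["language"] / speaker_dir
    return str(path / f"{meta['clip_id']}.wav")


example_meta = {
    "clip_id": "sp_it_b_0023_00",
    "language": "he",
    "speakers": [{"speaker_id": "AGG_M_30-45_001"}],
}
example_path = wav_path_from_metadata(example_meta)
```

Reading wav_path from the manifest remains the simpler option when the manifest is available.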


3. Audio Format Requirements

| Parameter | Requirement | Notes |
|---|---|---|
| Container | WAV (PCM) | No lossy formats in the dataset |
| Sample rate | 16,000 Hz | Resample before delivery; retain originals at native rate |
| Bit depth | 16-bit PCM | |
| Channels | Mono | Downmix to mono before delivery |
| Amplitude normalization | Peak-normalize to target (−2.0 dBFS default, range [−12.0, −1.5]) via single global gain, then peak-limit at ≤ −1.0 dBFS | Single-gain normalization preserves per-turn RMS contrast (M3a); the 0.5 dB margin between target upper bound and limiter ceiling guarantees the limiter is a no-op in normal flow (#78) |
| Silence padding | ≥ 0.5 s of ambient baseline before and after target speech | |
| SNR at acquisition | ≥ 15 dB (Tier A) | Tier B/C may degrade controllably below 15 dB; log actual SNR in metadata |
| Max clip duration | 300 s (5 min) | Longer source scenes must be segmented |
| Min clip duration (labeled) | 3.0 s | Clips shorter than 3 s are excluded from the label set |

3.1 Preprocessing Pipeline (ordered)

All clips must pass through this pipeline before delivery. The "dirty" pre-pipeline file must be retained in assets/ for robustness testing.

Implementation: synthbanshee/augment/preprocessing.py:preprocess(). PRs that change either the implementation or this section MUST update the other in the same change.

  1. Resample — convert to 16,000 Hz using a polyphase filter. Skipped when the input is already at 16,000 Hz.
  2. Downmix — stereo → mono via channel averaging.
  3. High-pass filter at 80 Hz (Butterworth order 2, in second-order-sections (SOS) form) to remove DC and sub-bass rumble.
  4. Conditional Wiener denoising — off by default; toggle via a boolean flag on PreprocessingConfig. Used for Tier B/C clips with real added noise after acoustic augmentation.
  5. Loudness normalization (#78) — two stages:
    • 5a. Peak-normalize to target. Apply a single global gain so the absolute peak lands at PreprocessingConfig.target_peak_dbfs (default −2.0 dBFS). A single gain preserves per-turn RMS ratios exactly, so the within-scene loudness trajectory established by per-turn RMS gain (M3a) survives — only the absolute level shifts. This step replaces the M3b "limiter only, never scale up" behaviour: pre-#78 the spec had only an upper bound on peak, leaving the absolute level unspecified; two clips could legitimately sit 6 dB apart and both be in-spec.
    • 5b. Safety limiter. Attenuate any sample exceeding −1.0 dBFS. For in-spec target values (target_peak_dbfs ∈ [−12.0, −1.5]) this is a guaranteed no-op (0.5 dB margin); it remains as defence-in-depth against upstream over-range samples.
  6. Silence pad — verify ≥ 0.5 s ambient baseline at head and tail; add if absent
  7. Validate — assert: sample rate == 16000, channels == 1, no NaN/Inf samples, no UTF-8 above U+00A1 in metadata strings
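Step 5 is the least obvious part of the pipeline, so here is a minimal pure-Python sketch of it. It is not the code in synthbanshee/augment/preprocessing.py: the function names are hypothetical, samples are assumed to be floats in [−1.0, 1.0], and the safety limiter is approximated by a hard clip, which is one possible reading of "attenuate any sample exceeding −1.0 dBFS".

```python
import math


def dbfs_to_linear(dbfs: float) -> float:
    """Convert a dBFS level to a linear amplitude (0 dBFS == 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def normalize_peak(samples, target_peak_dbfs=-2.0, limiter_dbfs=-1.0):
    """Step 5a (single global gain) followed by step 5b (safety limiter)."""
    if not -12.0 <= target_peak_dbfs <= -1.5:
        raise ValueError("target_peak_dbfs outside the in-spec range [-12.0, -1.5]")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # digital silence: nothing to scale
    # 5a: one gain for the whole clip, so per-turn RMS ratios are unchanged.
    gain = dbfs_to_linear(target_peak_dbfs) / peak
    # 5b: hard clip at the ceiling (assumed limiter behaviour); for in-spec
    # targets no scaled sample can reach it, so this is a no-op.
    ceiling = dbfs_to_linear(limiter_dbfs)
    return [max(-ceiling, min(ceiling, s * gain)) for s in samples]
```

Because the gain is a single scalar, the ratio between any two samples (and therefore between any two turns' RMS values) is the same before and after normalization; only the absolute level changes.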

4. Annotation Taxonomy

Binary Violence / Non-Violence labels are prohibited. Every labeled event must carry a hierarchical tag from the taxonomy below.

4.1 Violence Typology (Scene-Level)

Applied once per scene/clip, not per event.

| Code | Label | Description |
|---|---|---|
| SV | Situational Violence | Episodic aggression arising from a specific conflict; not part of a chronic control pattern |
| IT | Intimate Terrorism | Chronic coercive control; pattern of domination, intimidation, and systematic suppression |
| NEG | Negative / Confusor | Acoustically intense but non-violent; used for hard-negative (Tier C) clips |
| NEU | Neutral | Everyday interaction with no aggression; control clips |

4.2 Violence Category (Event-Level, Tier 1)

| Code | Category | Notes |
|---|---|---|
| PHYS | Physical violence | Any physical contact or weapon use |
| VERB | Verbal aggression | Speech-based aggression, threats, humiliation |
| DIST | Distress signal | Victim or bystander distress vocalizations |
| ACOU | Acoustic event | Non-speech sounds associated with violence |
| EMOT | Emotional / coercive control | Controlling speech patterns; gaslighting; manipulation |
| NONE | No violence | Background, neutral speech, or confusor events |

4.3 Event Subtype (Event-Level, Tier 2)

Each Tier 1 category has defined subtypes. Annotators must specify the most specific applicable subtype.

PHYS — Physical violence

| Code | Subtype | Description |
|---|---|---|
| PHYS_SOFT | Soft physical contact | Slap, push, grab |
| PHYS_HARD | Hard physical contact | Kick, punch, strike |
| PHYS_WEAP | Weapon / object use | Object thrown, weapon wielded |
| PHYS_MOVE | Forced movement | Dragging, restraining |

VERB — Verbal aggression

| Code | Subtype | Description |
|---|---|---|
| VERB_HUMIL | Taunting / humiliation | Insults, degradation, mocking |
| VERB_THREAT | Explicit threat | Direct threats of harm |
| VERB_SHOUT | Shouting / rage | Elevated volume without specific threat content |
| VERB_COER | Coercive demand | Commanding, prohibiting with implicit power |

DIST — Distress signal

| Code | Subtype | Description |
|---|---|---|
| DIST_SCREAM | Panic scream | High-intensity fear vocalization |
| DIST_PLEAD | Pleading / submission | "Please stop", capitulation language |
| DIST_CRY | Crying / sobbing | Audible distress crying |
| DIST_BREATH | Stressed breathing | Hyperventilation, audible fear breathing |
| DIST_CHILD | Child distress | Child crying or screaming in scene |

ACOU — Acoustic event

| Code | Subtype | Description |
|---|---|---|
| ACOU_BREAK | Breaking / shattering | Glass, crockery, object breaking |
| ACOU_SLAM | Slam / impact | Door slam, fist on surface |
| ACOU_THROW | Object thrown | Distinct throw-and-land sound |
| ACOU_FOOT | Rapid footsteps | Running, stamping |
| ACOU_FALL | Body/object fall | Person or large object falling |

EMOT — Emotional / coercive control

| Code | Subtype | Description |
|---|---|---|
| EMOT_GASLIT | Gaslighting | Reality denial, "that never happened" |
| EMOT_ISOL | Isolation / control | Prohibiting contact, monitoring behavior |
| EMOT_ECON | Economic control | Financial threats, deprivation language |
| EMOT_LEGAL | Legal threat | Threatening custody, police, deportation |

NONE — Non-violent / confusor

| Code | Subtype | Description |
|---|---|---|
| NONE_ARGU | De-escalating argument | Heated but non-violent exchange that ends calmly |
| NONE_SPORT | Sports/entertainment yelling | TV, sports, excited vocal outburst |
| NONE_CHILD | Child play noise | Loud children, non-distress |
| NONE_CRY_SAFE | Non-violence crying | Crying from grief, frustration, joy |
| NONE_LAUGH | Laughter / excitement | Acoustically similar to distress |
| NONE_CLINIC | Animated clinic interaction | Social worker + client, agitated but not violent |
| NONE_AMBIENT | Background ambience | TV, traffic, appliances, conversation |

4.4 Severity / Intensity Scale

Applied per event or per scene segment. Scale aligns with the scripting instructions.

| Level | Label | Description |
|---|---|---|
| 1 | Low tension | Calm conversation, mild undercurrent of tension |
| 2 | Moderate tension | Noticeable friction, raised voices without aggression |
| 3 | Active conflict | Clear verbal aggression or intimidation |
| 4 | Escalated violence | Physical or high-intensity verbal violence |
| 5 | Extreme / life-threatening | Severe physical violence, panic, imminent danger |

4.5 Speaker Role Tags

| Code | Role | Notes |
|---|---|---|
| AGG | Aggressor | The party initiating or sustaining the violence |
| VIC | Victim | The party against whom violence is directed |
| BYS | Bystander | Third party present (child, neighbor, colleague) |
| UNK | Unknown | Speaker cannot be attributed with confidence |

4.6 Emotional State Tags

Applied per speaker turn, not per clip.

anger · fear · panic · distress · neutral · contempt · submission · grief · confusion · defiance


5. Metadata Architecture

5.1 Per-Clip Metadata (JSON)

Every clip has a companion {clip_id}.json file with this schema:

{
  "clip_id": "sp_it_b_0023_00",
  "project": "she_proves",
  "language": "he",
  "violence_typology": "IT",
  "tier": "B",
  "duration_seconds": 247.3,
  "sample_rate": 16000,
  "channels": 1,
  "snr_db_estimated": 19.4,
  "scene_config": "configs/scenes/she_proves_tier_b/sp_it_b_0023.yaml",
  "random_seed": 42,
  "generation_date": "2026-04-10",
  "generator_version": "0.1.0",
  "is_synthetic": true,
  "acoustic_scene": {
    "room_type": "apartment_kitchen",
    "device": "phone_in_pocket",
    "ir_source": "pyroomacoustics",
    "background_events": [
      {"type": "tv_ambient", "onset": 0.0, "level_db": -30}
    ]
  },
  "speakers": [
    {
      "speaker_id": "AGG_M_30-45_001",
      "role": "AGG",
      "gender": "male",
      "age_range": "30-45",
      "tts_voice_id": "he-IL-AvriNeural",
      "voice_family": "Avri"
    },
    {
      "speaker_id": "VIC_F_25-40_002",
      "role": "VIC",
      "gender": "female",
      "age_range": "25-40",
      "tts_voice_id": "he-IL-HilaNeural"
    }
  ],
  "weak_label": {
    "has_violence": true,
    "violence_categories": ["VERB", "PHYS", "DIST"],
    "max_intensity": 5,
    "violence_typology": "IT"
  },
  "preprocessing_applied": {
    "resampled_to_16k": true,
    "downmixed_to_mono": true,
    "spectral_filtered": true,
    "denoised": true,
    "normalized_dbfs": -2.013,
    "silence_padded": true
  },
  "generation_metadata": {
    "pipeline_version": "0.1.0",
    "tts_backend": {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "azure"},
    "voice_family": {"AGG_M_30-45_001": "Avri", "VIC_F_25-40_002": "he-IL-HilaNeural"},
    "mix_mode_used": "sequential",
    "normalization_strategy": "per_turn_rms_v2_target_peak",
    "loudness_target_peak_dbfs": -2.0,
    "breathiness_applied": false,
    "effective_prosody_caps": []
  },
  "dirty_file_path": "assets/speech/dirty/sp_it_b_0023_00_dirty.wav",
  "transcript_path": "data/he/agg_m_30-45_001/sp_it_b_0023_00.txt",
  "quality_flags": [],
  "annotator_confidence": 1.0,
  "iaa_reviewed": false
}

Field notes

  • preprocessing_applied.normalized_dbfs is the measured post-preprocess peak (pair with generation_metadata.loudness_target_peak_dbfs to compute drift from target — see labels/schema.py for the docstring that pins this split).
  • tts_engine was removed in #109. The TTS provider is now recorded per-speaker in generation_metadata.tts_backend (e.g. {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "google"}); read backend diversity from the structured map. Pre-#109 corpus snapshots still carry the field — consumers should tolerate but ignore it.
  • generation_metadata is optional: a JSON object when the generator recorded pipeline provenance, null otherwise. Treat absence as "unknown", not as failure. generator_version alone is not a reliable presence signal.
  • speakers[].voice_family is optional: a stable family handle (e.g. "Avri") when the speaker YAML overrides it, omitted otherwise. Consumers should fall back to tts_voice_id.
  • weak_label.has_violence is derived, not asserted: any(e.tier1_category != "NONE" for e in events) — see synthbanshee/labels/generator.py. Corollaries: empty events → False; NEG typology clips are False (every event lands tier1_category: "NONE" by §4.1); violence_typology and has_violence may disagree (e.g. SV with False if no violent tier1 fired). The events are the ground truth; the flag is a convenience. External docs and downstream code must mirror this rule — re-deriving from typology or intensity alone produces disagreement on every NEG row.
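Downstream code that needs to mirror the has_violence rule could do so along these lines. This is a sketch over plain dictionaries as read from the per-clip JSONL, not the implementation in synthbanshee/labels/generator.py; the function name is hypothetical.

```python
def derive_has_violence(events) -> bool:
    """True if any event has a tier1_category other than "NONE"."""
    return any(event["tier1_category"] != "NONE" for event in events)


# The corollaries from the field note, expressed as data:
no_events = []
confusor_events = [{"tier1_category": "NONE", "tier2_subtype": "NONE_ARGU"}]
violent_events = confusor_events + [{"tier1_category": "PHYS", "tier2_subtype": "PHYS_HARD"}]
```

Note that the function never looks at violence_typology or intensity, which is exactly why the two may disagree.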

quality_flags valid values: low_snr · clipping · short_silence_pad · label_uncertainty · iaa_disagreement · synthetic_artifact

5.2 Per-Event Strong Labels (JSONL)

One record per labeled event, stored per-clip as {clip_id}.jsonl in the same directory as the corresponding .wav, .txt, and .json files. The pipeline writes this file automatically during Stage 4b (Strong Label Writer). Each line is a JSON object:

{
  "event_id": "sp_it_b_0023_00_EVT_007",
  "clip_id": "sp_it_b_0023_00",
  "onset": 143.82,
  "offset": 146.05,
  "tier1_category": "PHYS",
  "tier2_subtype": "PHYS_HARD",
  "intensity": 5,
  "speaker_id": "AGG_M_30-45_001",
  "speaker_role": "AGG",
  "emotional_state": "anger",
  "confidence": 0.95,
  "label_source": "auto",
  "iaa_reviewed": false,
  "notes": "punch impact followed by object fall"
}

event_id is generated as {clip_id}_EVT_{idx:03d} — the clip_id prefix is lowercase per §2.5 and EVT is a literal uppercase token (it is not a casing inconsistency). speaker_id and speaker_role are values, not filenames, so they remain uppercase per §2.5's casing table.
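A consumer reading the JSONL may want a light sanity check on each record. The sketch below is illustrative only: `make_event_id` follows the stated `{clip_id}_EVT_{idx:03d}` rule, while `check_event` and the particular checks it performs are assumptions about what is worth verifying, not a schema defined by this document.

```python
import json


def make_event_id(clip_id: str, idx: int) -> str:
    """Lowercase clip_id prefix, literal uppercase EVT token, zero-padded index."""
    return f"{clip_id}_EVT_{idx:03d}"


def check_event(event: dict) -> list:
    """Return a list of human-readable problems; empty means the record looks sane."""
    problems = []
    if not event["event_id"].startswith(event["clip_id"] + "_EVT_"):
        problems.append("event_id does not extend clip_id")
    if not event["onset"] < event["offset"]:
        problems.append("onset must precede offset")
    if not event["tier2_subtype"].startswith(event["tier1_category"] + "_"):
        problems.append("tier2_subtype does not belong to tier1_category")
    if event["intensity"] not in range(1, 6):
        problems.append("intensity outside 1-5")
    if not 0.0 <= event["confidence"] <= 1.0:
        problems.append("confidence outside 0.0-1.0")
    return problems


# One JSONL line, abbreviated to the fields checked above.
line = ('{"event_id": "sp_it_b_0023_00_EVT_007", "clip_id": "sp_it_b_0023_00", '
        '"onset": 143.82, "offset": 146.05, "tier1_category": "PHYS", '
        '"tier2_subtype": "PHYS_HARD", "intensity": 5, "confidence": 0.95}')
record = json.loads(line)
```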

label_source values: auto (derived from scene config/script) · human (manual annotation) · auto_reviewed (auto + human validation pass)

5.3 Weak Label Manifest (CSV)

One manifest CSV per generation run, written to the output directory (e.g. data/he/manifest.csv). One row per clip, for fast dataset loading:

clip_id, project, violence_typology, tier, duration_seconds, speaker_ids, has_violence, max_intensity, quality_flags, split, wav_path, strong_labels_path
  • speaker_ids: pipe-separated list of speaker_id values (e.g. AGG_M_30-45_001|VIC_F_25-40_002)
  • quality_flags: comma-separated list of flag strings, empty string if none
  • split: train | val | test, or empty string if unassigned
  • wav_path: path to the .wav file; .txt, .json, and .jsonl share the same stem
  • strong_labels_path: path to the per-clip .jsonl strong-labels file, or empty string if absent
  • language is implicit in the data/{language_code}/ directory structure and omitted from the manifest
  • violence_categories and the redundant txt_path/json_path columns from the original spec are superseded by this schema
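The list-valued and optional columns need a little unpacking on load. The following sketch uses only the standard library; the sample row is invented for illustration, and two details are assumptions rather than part of the spec: that the header is written without spaces after the commas, and that has_violence is serialized as the text "True"/"False".

```python
import csv
import io

SAMPLE_MANIFEST = (
    "clip_id,project,violence_typology,tier,duration_seconds,speaker_ids,"
    "has_violence,max_intensity,quality_flags,split,wav_path,strong_labels_path\n"
    "sp_it_b_0023_00,she_proves,IT,B,247.3,AGG_M_30-45_001|VIC_F_25-40_002,"
    'True,5,"low_snr,clipping",train,'
    "data/he/agg_m_30-45_001/sp_it_b_0023_00.wav,"
    "data/he/agg_m_30-45_001/sp_it_b_0023_00.jsonl\n"
)


def parse_manifest_row(row: dict) -> dict:
    """Convert one csv.DictReader row into typed Python values."""
    parsed = dict(row)
    parsed["speaker_ids"] = row["speaker_ids"].split("|") if row["speaker_ids"] else []
    parsed["quality_flags"] = row["quality_flags"].split(",") if row["quality_flags"] else []
    parsed["split"] = row["split"] or None  # empty string means unassigned
    parsed["strong_labels_path"] = row["strong_labels_path"] or None
    parsed["duration_seconds"] = float(row["duration_seconds"])
    parsed["max_intensity"] = int(row["max_intensity"])
    # Assumption: booleans are written as "True"/"False" (case-insensitive here).
    parsed["has_violence"] = row["has_violence"].strip().lower() == "true"
    return parsed


def load_manifest(text: str) -> list:
    return [parse_manifest_row(r) for r in csv.DictReader(io.StringIO(text))]
```

Because quality_flags is itself comma-separated, a row with more than one flag has to quote that field, as the sample row does; the csv module handles the quoting transparently.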

5.4 Analysis Windows

For model training, clips are analyzed using:

  • Window length: 3.0 seconds
  • Hop length: 1.0 second (≈ 67% overlap between consecutive windows)
  • Minimum event duration for strong labeling: 0.1 seconds (below this, event is noted in clip metadata but not given a strong label)
  • Ambiguity culling: Remove data from the ends of manually-labeled events to the 10 dB downpoints, ensuring only high-energy samples are included in training
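The windowing parameters above translate into window boundaries as sketched below. The function name is hypothetical, and dropping a trailing partial window (rather than padding it) is an assumption, since the handling of clip tails is not specified here.

```python
def analysis_windows(duration_s: float, window_s: float = 3.0, hop_s: float = 1.0):
    """Return (start, end) pairs in seconds for every full window in the clip."""
    windows = []
    n = 0
    # Small epsilon so a window ending exactly at the clip end is kept.
    while n * hop_s + window_s <= duration_s + 1e-9:
        start = n * hop_s
        windows.append((start, start + window_s))
        n += 1
    return windows


# Overlap between consecutive windows is 1 - hop/window = 2/3.
overlap_fraction = 1.0 - 1.0 / 3.0
```

A 5-second clip therefore produces three windows starting at 0, 1, and 2 seconds, and a clip shorter than 3.0 seconds produces none, which is consistent with the minimum labeled clip duration in §3.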

6. Inter-Annotator Agreement (IAA) Protocol

This section applies to any human annotation pass (Tier A auto-labels reviewed by human; all Tier B/C labels).

6.1 Coverage Requirement

A minimum of 20% of all labeled segments must undergo independent second-pass review by a different annotator.

6.2 Agreement Targets (Cohen's Kappa)

| Event category | Target κ | Minimum acceptable κ |
|---|---|---|
| Physical events (PHYS_*) | κ ≥ 0.65 | κ ≥ 0.55 |
| Verbal aggression (VERB_*) | κ ≥ 0.60 | κ ≥ 0.50 |
| Distress signals (DIST_*) | κ ≥ 0.60 | κ ≥ 0.50 |
| Acoustic events (ACOU_*) | κ ≥ 0.70 | κ ≥ 0.60 |
| Emotional state | κ ≥ 0.55 | κ ≥ 0.45 |
| Intensity level (±1 tolerance) | κ ≥ 0.60 | κ ≥ 0.50 |
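For reference, Cohen's kappa for two annotators over the same segments is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. A minimal sketch follows; the function name is hypothetical, and the ±1 tolerance for intensity is not handled here (labels would need to be collapsed or matched with tolerance before calling it).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of segments with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


annotator_1 = ["PHYS_HARD", "PHYS_SOFT", "PHYS_HARD", "PHYS_SOFT"]
annotator_2 = ["PHYS_HARD", "PHYS_SOFT", "PHYS_SOFT", "PHYS_SOFT"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case p_o = 0.75 and p_e = 0.5, so κ = 0.5, which would fall below the minimum acceptable value for physical events.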

6.3 Conflict Resolution

  • Tier 2 label disagreements that cannot be resolved between annotators are escalated to the field expert reviewer (a qualified social worker or domestic violence specialist from the Rakman Institute team).
  • Clips with unresolved disagreement receive quality_flags: ["iaa_disagreement"] and are excluded from training splits pending resolution. They may be included in a held-out "uncertain" evaluation set.

6.4 Confidence Scoring

Annotators report a per-event confidence float (0.0–1.0). Events with confidence < 0.6 are flagged with quality_flags: ["label_uncertainty"] and excluded from the primary training split. They are retained in a separate uncertainty partition.


7. Project-Specific Label Variants

7.1 She-Proves

Use case: Long-form passive monitoring on a smartphone; detecting and segmenting rare incident windows within hours of background audio.

Primary modeling tasks:

  • Incident window detection (binary: violence window / non-violence window over 30–120 s weak label clips)
  • Event segmentation within flagged windows (strong labels)
  • Distress/aggression cue detection (classification)
  • Escalation arc modeling (intensity sequence over time)

Additional metadata fields (She-Proves only):

"she_proves_meta": {
  "scene_phase": "escalation",
  "phase_options": ["baseline", "tension", "escalation", "peak", "aftermath", "de-escalation"],
  "incident_window": true,
  "incident_onset_in_clip": 98.4,
  "incident_offset_in_clip": 187.2,
  "recording_device_profile": "phone_in_pocket",
  "ambient_duration_before_incident": 98.4
}

Scene duration targets:

  • Tier A/B scenes: 3–6 minutes (reflecting real passive recording windows)
  • At least 60% of scene duration should be "baseline" or "tension" (pre-incident) to reflect realistic base-rate sparsity
  • Target ratio: ≥ 3 non-violence control clips per violence clip

Key acoustic conditions to cover:

  • Phone in pocket / bag / on table (each ≥ 20% of scenes)
  • Near-field (< 1 m) and far-field (3–5 m) speaker positions
  • Home environments: bedroom, kitchen, living room, hallway

7.2 Elephant in the Room

Use case: Fixed Raspberry Pi–class device in a clinic or welfare office; detecting imminent assault on a social worker in near-real-time.

Primary modeling tasks:

  • Imminent attack / aggression detection (binary alert)
  • Escalation forecasting over short time horizons (5–30 s)
  • Acoustic event fusion: shouting + impact + distress = alert composite
  • "Alert now" vs "monitor" policy output

Additional metadata fields (Elephant in the Room only):

"elephant_meta": {
  "scene_type": "crisis_visit",
  "scene_type_options": ["intake_interview", "benefits_dispute", "crisis_visit", "routine_followup"],
  "alert_triggered": true,
  "alert_onset": 87.3,
  "pre_alert_duration": 87.3,
  "post_alert_duration": 45.1,
  "attack_type": "PHYS_HARD",
  "recording_device_profile": "pi_budget_mic",
  "room_type": "clinic_office"
}

Scene duration targets:

  • Tier A/B scenes: 1–4 minutes (reflecting typical encounter length)
  • Alert event should occur within the final 40% of scene duration
  • Target ratio: ≥ 2 non-alert confusor clips per alert clip (animated encounters without attack)
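The alert-placement target can be checked per scene. The sketch below is one possible reading: the function name is hypothetical, and taking the scene duration as pre_alert_duration + post_alert_duration is an assumption based on the example metadata above.

```python
def alert_in_final_fraction(alert_onset: float, scene_duration: float,
                            final_fraction: float = 0.40) -> bool:
    """True if the alert begins within the last `final_fraction` of the scene."""
    earliest_allowed = scene_duration * (1.0 - final_fraction)
    return alert_onset >= earliest_allowed


# Values from the elephant_meta example; duration is the assumed sum.
pre_alert, post_alert = 87.3, 45.1
example_ok = alert_in_final_fraction(pre_alert, pre_alert + post_alert)
```

For the example metadata the scene lasts 132.4 s, the earliest allowed onset is about 79.4 s, and the alert at 87.3 s satisfies the target.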

Key acoustic conditions to cover:

  • Fixed-position budget microphone (Raspberry Pi HAT mic or USB equivalent)
  • Office environments: small clinic room, open welfare office, corridor
  • Background: HVAC hum, distant phone ringing, door open/close
  • Multiple confusor types: agitated but non-violent beneficiary, crying client, raised voices in adjacent room

8. Dataset Split Strategy

8.1 Primary Splits

| Split | Size | Purpose |
|---|---|---|
| train | 70% | Model training |
| val | 15% | Hyperparameter tuning, early stopping |
| test_synth | 15% | Evaluation on synthetic data (measures in-distribution performance) |

A separate held-out partition is maintained for actor recordings and real data (future phases) and is never mixed with synthetic splits.

8.2 Stratification Variables

Splits must be stratified on all of the following simultaneously:

  • Project (she_proves / elephant_in_the_room)
  • Violence typology (SV / IT / NEG / NEU)
  • Tier (A / B / C)
  • Max intensity band (1–2 / 3 / 4–5)
  • Room type (broad categories)

8.3 Speaker-Disjoint Splits

No speaker persona (TTS voice ID) may appear in more than one split. This is critical: if the same voice appears in train and test, the model may overfit to voice identity rather than acoustic event content. Assign each speaker persona to a split before generating scenes.
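A leak check for this constraint can be run over the assigned clips before training. This is a sketch under assumptions: the function name and the input shape (a list of dictionaries with `split` and `tts_voice_ids` keys) are hypothetical, and the second voice ID in the sample data is invented for illustration.

```python
def find_speaker_leaks(clips) -> dict:
    """Map each TTS voice ID that appears in more than one split to those splits."""
    splits_by_voice = {}
    for clip in clips:
        for voice_id in clip["tts_voice_ids"]:
            splits_by_voice.setdefault(voice_id, set()).add(clip["split"])
    return {
        voice_id: sorted(splits)
        for voice_id, splits in splits_by_voice.items()
        if len(splits) > 1
    }


sample_clips = [
    {"split": "train", "tts_voice_ids": ["he-IL-AvriNeural", "he-IL-HilaNeural"]},
    {"split": "val", "tts_voice_ids": ["voice_c"]},          # hypothetical voice ID
    {"split": "test", "tts_voice_ids": ["he-IL-AvriNeural"]},  # leaks from train
]
leaks = find_speaker_leaks(sample_clips)
```

An empty result means the splits are speaker-disjoint; any entry names a voice that should be reassigned before scenes are generated.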

8.4 Scene-Disjoint Splits

No scene config (or its close variants) may appear in more than one split. Script templates may be shared across splits only if specific content is different (different slots filled, different intensity arcs).

8.5 Target Class Balance per Project

She-Proves:

  • Violence window clips: ≥ 30% of total, ≤ 50%
  • Negative/confusor (Tier C) clips: ≥ 20% of total
  • Neutral control clips: ≥ 15% of total

Elephant in the Room:

  • Alert clips (attack present): ≥ 25% of total, ≤ 45%
  • Animated non-alert clips: ≥ 25% of total
  • Neutral/routine clips: ≥ 15% of total

9. Negative Samples & Acoustic Confusors (Tier C)

The confusor set is critical for both projects. A model that cannot distinguish these from true violence will have unacceptable false alarm rates in deployment.

Required Confusor Types

| Confusor type | Code | Relevant for |
|---|---|---|
| Heated argument that de-escalates before violence | NONE_ARGU | Both |
| Sports/TV yelling | NONE_SPORT | She-Proves |
| Loud children (play, tantrums) | NONE_CHILD | She-Proves |
| Crying from non-violence (grief, frustration) | NONE_CRY_SAFE | Both |
| Laughter that acoustically resembles screaming | NONE_LAUGH | Both |
| Animated clinic interaction (agitation without attack) | NONE_CLINIC | Elephant in the Room |
| Social worker client in distress (no aggression) | NONE_CLINIC | Elephant in the Room |
| Hebrew radio / TV drama | NONE_AMBIENT | She-Proves |
| Cooking sounds (chopping, pan, breaking crockery accidentally) | ACOU_BREAK + NONE | She-Proves |

Minimum Confusor Coverage

  • Tier C clips must constitute ≥ 20% of each project's total dataset
  • Each confusor type in the table above must have ≥ 50 clips in the training split

10. Transcript Format

Each .txt transcript file follows this format (UTF-8):

[CLIP_ID: sp_it_b_0023_00]
[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.0 | OFFSET: 4.2]
אמרתי לך לא ללכת לשם!
[ACTION: VERB_SHOUT | INTENSITY: 4]

[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 4.5 | OFFSET: 7.1]
בבקשה, אל תתחיל שוב...
[ACTION: DIST_PLEAD | INTENSITY: 4]

[ACTION: ACOU_BREAK | ONSET: 7.8 | OFFSET: 8.1 | INTENSITY: 5]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 8.2 | OFFSET: 9.0]
די! תפסיק!
[ACTION: DIST_SCREAM | INTENSITY: 5]

Transcript redundancy constraint: no word or character sequence may repeat more than 3 consecutive times (prevents TTS/ASR service failures).
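The redundancy constraint can be screened automatically on each utterance line (not on the bracketed tag lines). The sketch below is one interpretation, with a hypothetical function name: it treats whitespace-separated tokens as words, and treats a "character sequence" as any run of non-space characters repeated back-to-back.

```python
import re


def violates_redundancy(text: str, max_repeats: int = 3) -> bool:
    """True if a word or character sequence repeats more than max_repeats times in a row."""
    # Word-level check: count runs of identical consecutive tokens.
    words = text.split()
    run = 1
    for previous, current in zip(words, words[1:]):
        run = run + 1 if current == previous else 1
        if run > max_repeats:
            return True
    # Character-level check: a group followed by at least max_repeats more copies.
    char_run = re.compile(r"(\S+?)\1{%d,}" % max_repeats)
    return char_run.search(text) is not None
```

Three repetitions pass and a fourth fails, so "stop stop stop" is accepted while "stop stop stop stop" and "aaaa" are rejected. The same holds for Hebrew text, since the check is Unicode-agnostic.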


11. Liveness & Synthetic Origin Tracking

All synthetic clips must carry a machine-readable marker of their synthetic origin to prevent inadvertent contamination of real-data evaluation sets.

  • is_synthetic: true in JSON metadata (mandatory)
  • generator_version field must match the version tag of the generation code
  • The manifest CSV includes an is_synthetic column

When the framework later transitions to include actor recordings, is_synthetic is set to false and actor_session_id is added to the metadata schema.


Document prepared for DataHack AVDP — not for distribution outside the project team.