Project: Audio Violence Dataset Project (AVDP)
Initiatives: She-Proves · Elephant in the Room
Organization: DataHack / DataForBetter (datahack.org.il)
Status: Draft v0.1 — for review by AI team leads (Bar, Livnat, Asya)
Date: 2026-04-06
Companion documents: design_approaches.md, implementation_plan.md
This document defines the complete schema, taxonomy, metadata architecture, preprocessing requirements, and split strategy for all synthetic audio data generated by the AVDP framework. It is the authoritative reference for both the generation pipeline and the downstream model training pipelines.
All requirements in §2–§7 are binding on both the generator (the synthetic framework) and the consumer (AI team training pipelines). Requirements in §8 are advisory recommendations for the AI teams.
data/
{language_code}/
{speaker_id}/
{clip_id}.wav
{clip_id}.txt ← transcript (UTF-8, Hebrew in he-IL clips)
{clip_id}.json ← per-clip metadata (see §5)
{clip_id}.jsonl ← per-clip strong labels, one event per line (see §5.2)
metadata/
{split}_manifest.csv ← clip-level manifest for train/val/test splits (see §5.3)
{split}_labels_weak.jsonl ← weak (clip-level) labels
configs/
scenes/ ← scene YAML configs used to generate each clip
speakers/ ← speaker persona definitions
acoustic_scenes/ ← acoustic environment configs
assets/ ← source assets (TTS outputs, SFX, IR files)
speech/
sfx/
ambient/
noise/
The {speaker_id} path component is derived (not literal); see §2.5 for the per-surface casing rules.
| Code | Language | Notes |
|---|---|---|
| `he` | Hebrew (he-IL) | Primary language for all project audio |
| `he_noisy` | Hebrew with significant acoustic degradation | Tier B/C clips only |
{role_code}_{gender}_{age_band}_{instance:03d}
Examples:
- `AGG_M_30-45_001`: Male aggressor, age 30–45, instance 1
- `VIC_F_25-40_002`: Female victim, age 25–40, instance 2
- `BYS_F_6-10_001`: Female bystander/child, age 6–10, instance 1
- `NEU_M_40-55_003`: Neutral/control speaker, instance 3
Role codes: AGG (aggressor) · VIC (victim) · BYS (bystander) · NEU (neutral/non-violent speaker)
speaker_id is uppercase as a value (configs, JSON, manifest, runtime). The matching directory name is speaker_id.lower() — see the casing table in §2.5.
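The speaker ID format above can be checked mechanically. The following is a minimal sketch, not the pipeline's own parser: the names `SPEAKER_ID_RE`, `SpeakerId`, and `parse_speaker_id` are hypothetical, and the pattern assumes that gender is a single `M` or `F` and that the age band is two integers joined by a hyphen, as in the examples.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for {role_code}_{gender}_{age_band}_{instance:03d}.
# Assumption: gender is M or F; age band is "<int>-<int>"; value form is uppercase.
SPEAKER_ID_RE = re.compile(
    r"^(?P<role>AGG|VIC|BYS|NEU)_(?P<gender>[MF])_"
    r"(?P<age_band>\d+-\d+)_(?P<instance>\d{3})$"
)


class SpeakerId(NamedTuple):
    role: str
    gender: str
    age_band: str
    instance: int


def parse_speaker_id(speaker_id: str) -> SpeakerId:
    """Split an uppercase speaker_id value into its four components."""
    match = SPEAKER_ID_RE.match(speaker_id)
    if match is None:
        raise ValueError(f"not a valid speaker_id value: {speaker_id!r}")
    return SpeakerId(
        role=match["role"],
        gender=match["gender"],
        age_band=match["age_band"],
        instance=int(match["instance"]),
    )
```

Because the pattern only accepts the uppercase value form, passing a lowercase directory name raises `ValueError`, which helps catch the two casings being mixed up.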
{project_code}_{violence_type_code}_{tier}_{scene_id:04d}_{segment:02d}
clip_id is derived at write time from the scene's uppercase scene_id YAML field via scene_id.lower() (hyphens → underscores) plus the _NN segment suffix; it does not appear verbatim in any YAML. Examples (on-disk form):
- `sp_it_b_0023_00`: She-Proves, Intimate Terrorism, Tier B, scene 23, segment 0
- `el_sv_a_0105_03`: Elephant in the Room, Situational Violence, Tier A, scene 105, segment 3
- `sp_neg_c_0412_00`: She-Proves, Negative / Confusor, Tier C, scene 412, segment 0
Project codes: SP (She-Proves) · EL (Elephant in the Room)
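The derivation rule for `clip_id` (lowercase the scene's `scene_id`, turn hyphens into underscores, append the two-digit segment suffix) might be sketched as follows. The function name `derive_clip_id` is hypothetical, and the 0–99 segment range is an assumption inferred from the `{segment:02d}` format.

```python
def derive_clip_id(scene_id: str, segment: int) -> str:
    """Build the on-disk clip_id from an uppercase YAML scene_id.

    Assumption: segment fits in two digits (0..99), per {segment:02d}.
    """
    if not 0 <= segment <= 99:
        raise ValueError(f"segment out of range for a 2-digit suffix: {segment}")
    stem = scene_id.lower().replace("-", "_")
    return f"{stem}_{segment:02d}"
```

For example, `derive_clip_id("SP_IT_B_0023", 0)` yields `sp_it_b_0023_00`, matching the first example above.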
- ASCII characters only. No spaces, no UTF-8 characters above U+00A1.
- Maximum filename length: 128 characters.
- All filenames (and filesystem path components) are lowercase.
- Every `.wav` file must have a corresponding `.txt`, `.json`, and `.jsonl` file with the identical stem in the same directory.
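A delivery check for these filename rules could look like the sketch below. It is illustrative only: `filename_violations` and `missing_companions` are hypothetical helper names, and the check treats the 128-character limit as applying to the full filename including its extension, which is an assumption.

```python
from pathlib import Path

COMPANION_SUFFIXES = (".txt", ".json", ".jsonl")
MAX_FILENAME_LENGTH = 128  # assumption: counted on the full name, extension included


def filename_violations(name: str) -> list[str]:
    """Return the naming rules that a single filename breaks (empty list if none)."""
    problems = []
    if not name.isascii():
        problems.append("non-ASCII character")
    if " " in name:
        problems.append("space")
    if name != name.lower():
        problems.append("uppercase character")
    if len(name) > MAX_FILENAME_LENGTH:
        problems.append("too long")
    return problems


def missing_companions(wav_path: Path) -> list[str]:
    """Return the companion suffixes that do not exist next to a .wav file."""
    return [s for s in COMPANION_SUFFIXES if not wav_path.with_suffix(s).exists()]
```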
The same logical id appears in multiple places with different casing rules. This is the contract:
| Surface | Case | Example |
|---|---|---|
| YAML `scene_id` / `speaker_id` / `speakers[].speaker_id` | UPPERCASE | `SP_IT_B_0023`, `AGG_M_30-45_001` |
| On-disk filename stem (`clip_id`) | lowercase | `sp_it_b_0023_00` |
| On-disk speaker directory (`{speaker_id}/`) | lowercase | `agg_m_30-45_001/` |
| JSON `clip_id` | lowercase | `sp_it_b_0023_00` |
| JSON `speakers[].speaker_id` | UPPERCASE | `AGG_M_30-45_001` |
| TXT `[CLIP_ID: …]` | lowercase | `sp_it_b_0023_00` |
| TXT `[SPEAKER: …]` | UPPERCASE | `AGG_M_30-45_001` |
| JSONL `event_id` / `clip_id` | mixed (see §5.2) | `sp_it_b_0023_00_EVT_007` |
| Manifest `speaker_ids` column (pipe-separated) | UPPERCASE | `AGG_M_30-45_001\|VIC_F_25-40_002` |
Consumers reconstructing paths from metadata must apply .lower() to speakers[0].speaker_id (or read wav_path from the manifest, which is already correct).
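A consumer-side sketch of that path reconstruction is shown below. The helper name `clip_wav_path` and the `data_root` default are assumptions; the layout `data/{language_code}/{speaker_id}/{clip_id}.wav` is taken from the directory structure in this document.

```python
from pathlib import PurePosixPath


def clip_wav_path(meta: dict, data_root: str = "data") -> PurePosixPath:
    """Rebuild the .wav path from per-clip JSON metadata.

    The directory uses the lowercase form of the first speaker's speaker_id;
    clip_id is already lowercase in the JSON.
    """
    speaker_dir = meta["speakers"][0]["speaker_id"].lower()
    return PurePosixPath(data_root, meta["language"], speaker_dir, f"{meta['clip_id']}.wav")
```

When a manifest is available, reading `wav_path` from it avoids this reconstruction entirely.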
| Parameter | Requirement | Notes |
|---|---|---|
| Container | WAV (PCM) | No lossy formats in the dataset |
| Sample rate | 16,000 Hz | Resample before delivery; retain originals at native rate |
| Bit depth | 16-bit PCM | |
| Channels | Mono | Downmix to mono before delivery |
| Amplitude normalization | Peak-normalize to target (−2.0 dBFS default, range [−12.0, −1.5]) via single global gain, then peak-limit at ≤ −1.0 dBFS | Single-gain normalization preserves per-turn RMS contrast (M3a); the 0.5 dB margin between target upper bound and limiter ceiling guarantees the limiter is a no-op in normal flow (#78) |
| Silence padding | ≥ 0.5 s of ambient baseline before and after target speech | |
| SNR at acquisition | ≥ 15 dB (Tier A) | Tier B/C may degrade controllably below 15 dB; log actual SNR in metadata |
| Max clip duration | 300 s (5 min) | Longer source scenes must be segmented |
| Min clip duration (labeled) | 3.0 s | Clips shorter than 3 s are excluded from the label set |
All clips must pass through this pipeline before delivery. The "dirty" pre-pipeline file must be retained in assets/ for robustness testing.
Implementation: synthbanshee/augment/preprocessing.py:preprocess(). PRs that change either the implementation or this section MUST update the other in the same change.
1. Resample: convert to 16,000 Hz using a polyphase filter. Skipped when the input is already at 16,000 Hz.
2. Downmix: stereo → mono via channel averaging.
3. High-pass filter at 80 Hz (Butterworth order 2, in second-order-sections (SOS) form) to remove DC and sub-bass rumble.
4. Conditional Wiener denoising: off by default; toggle via a boolean flag on `PreprocessingConfig`. Used for Tier B/C clips with real added noise after acoustic augmentation.
5. Loudness normalization (#78), in two stages:
   - 5a. Peak-normalize to target. Apply a single global gain so the absolute peak lands at `PreprocessingConfig.target_peak_dbfs` (default −2.0 dBFS). A single gain preserves per-turn RMS ratios exactly, so the within-scene loudness trajectory established by per-turn RMS gain (M3a) survives; only the absolute level shifts. This step replaces the M3b "limiter only, never scale up" behaviour: pre-#78 the spec had only an upper bound on peak, leaving the absolute level unspecified; two clips could legitimately sit 6 dB apart and both be in-spec.
   - 5b. Safety limiter. Attenuate any sample exceeding −1.0 dBFS. For in-spec target values (`target_peak_dbfs` ∈ [−12.0, −1.5]) this is a guaranteed no-op (0.5 dB margin); it remains as defence-in-depth against upstream over-range samples.
6. Silence pad: verify ≥ 0.5 s ambient baseline at head and tail; add if absent.
7. Validate: assert sample rate == 16000, channels == 1, no NaN/Inf samples, no UTF-8 above U+00A1 in metadata strings.
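The two-stage loudness normalization can be illustrated with a small pure-Python sketch. This is not the code in `synthbanshee/augment/preprocessing.py`: the function names are hypothetical, samples are assumed to be floats in [−1.0, 1.0], and the limiter is modelled as a hard clip at the ceiling, which is an assumption about how "attenuate" is realised.

```python
def dbfs_to_linear(dbfs: float) -> float:
    """Convert a dBFS level to a linear amplitude (0 dBFS == 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def normalize_peak(
    samples: list[float],
    target_peak_dbfs: float = -2.0,
    limiter_ceiling_dbfs: float = -1.0,
) -> list[float]:
    """Apply one global gain so the peak hits the target, then clip at the ceiling."""
    if not -12.0 <= target_peak_dbfs <= -1.5:
        raise ValueError("target_peak_dbfs outside the in-spec range [-12.0, -1.5]")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # digital silence: no gain is defined, return unchanged
    # Stage 5a: a single gain for the whole clip keeps per-turn RMS ratios intact.
    gain = dbfs_to_linear(target_peak_dbfs) / peak
    scaled = [s * gain for s in samples]
    # Stage 5b: safety limiter (hard clip here); a no-op for in-spec targets.
    ceiling = dbfs_to_linear(limiter_ceiling_dbfs)
    return [max(-ceiling, min(ceiling, s)) for s in scaled]
```

Because the same gain multiplies every sample, the ratio between any two samples (and therefore between any two turns' RMS levels) is unchanged after stage 5a.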
Binary Violence / Non-Violence labels are prohibited. Every labeled event must carry a hierarchical tag from the taxonomy below.
Applied once per scene/clip, not per event.
| Code | Label | Description |
|---|---|---|
| `SV` | Situational Violence | Episodic aggression arising from a specific conflict; not part of a chronic control pattern |
| `IT` | Intimate Terrorism | Chronic coercive control; pattern of domination, intimidation, and systematic suppression |
| `NEG` | Negative / Confusor | Acoustically intense but non-violent; used for hard-negative (Tier C) clips |
| `NEU` | Neutral | Everyday interaction with no aggression; control clips |
| Code | Category | Notes |
|---|---|---|
| `PHYS` | Physical violence | Any physical contact or weapon use |
| `VERB` | Verbal aggression | Speech-based aggression, threats, humiliation |
| `DIST` | Distress signal | Victim or bystander distress vocalizations |
| `ACOU` | Acoustic event | Non-speech sounds associated with violence |
| `EMOT` | Emotional / coercive control | Controlling speech patterns; gaslighting; manipulation |
| `NONE` | No violence | Background, neutral speech, or confusor events |
Each Tier 1 category has defined subtypes. Annotators must specify the most specific applicable subtype.
PHYS — Physical violence
| Code | Subtype | Description |
|---|---|---|
| `PHYS_SOFT` | Soft physical contact | Slap, push, grab |
| `PHYS_HARD` | Hard physical contact | Kick, punch, strike |
| `PHYS_WEAP` | Weapon / object use | Object thrown, weapon wielded |
| `PHYS_MOVE` | Forced movement | Dragging, restraining |
VERB — Verbal aggression
| Code | Subtype | Description |
|---|---|---|
| `VERB_HUMIL` | Taunting / humiliation | Insults, degradation, mocking |
| `VERB_THREAT` | Explicit threat | Direct threats of harm |
| `VERB_SHOUT` | Shouting / rage | Elevated volume without specific threat content |
| `VERB_COER` | Coercive demand | Commanding, prohibiting with implicit power |
DIST — Distress signal
| Code | Subtype | Description |
|---|---|---|
| `DIST_SCREAM` | Panic scream | High-intensity fear vocalization |
| `DIST_PLEAD` | Pleading / submission | "Please stop", capitulation language |
| `DIST_CRY` | Crying / sobbing | Audible distress crying |
| `DIST_BREATH` | Stressed breathing | Hyperventilation, audible fear breathing |
| `DIST_CHILD` | Child distress | Child crying or screaming in scene |
ACOU — Acoustic event
| Code | Subtype | Description |
|---|---|---|
| `ACOU_BREAK` | Breaking / shattering | Glass, crockery, object breaking |
| `ACOU_SLAM` | Slam / impact | Door slam, fist on surface |
| `ACOU_THROW` | Object thrown | Distinct throw-and-land sound |
| `ACOU_FOOT` | Rapid footsteps | Running, stamping |
| `ACOU_FALL` | Body/object fall | Person or large object falling |
EMOT — Emotional / coercive control
| Code | Subtype | Description |
|---|---|---|
| `EMOT_GASLIT` | Gaslighting | Reality denial, "that never happened" |
| `EMOT_ISOL` | Isolation / control | Prohibiting contact, monitoring behavior |
| `EMOT_ECON` | Economic control | Financial threats, deprivation language |
| `EMOT_LEGAL` | Legal threat | Threatening custody, police, deportation |
NONE — Non-violent / confusor
| Code | Subtype | Description |
|---|---|---|
| `NONE_ARGU` | De-escalating argument | Heated but non-violent exchange that ends calmly |
| `NONE_SPORT` | Sports/entertainment yelling | TV, sports, excited vocal outburst |
| `NONE_CHILD` | Child play noise | Loud children, non-distress |
| `NONE_CRY_SAFE` | Non-violence crying | Crying from grief, frustration, joy |
| `NONE_LAUGH` | Laughter / excitement | Acoustically similar to distress |
| `NONE_CLINIC` | Animated clinic interaction | Social worker + client, agitated but not violent |
| `NONE_AMBIENT` | Background ambience | TV, traffic, appliances, conversation |
Applied per event or per scene segment. Scale aligns with the scripting instructions.
| Level | Label | Description |
|---|---|---|
| 1 | Low tension | Calm conversation, mild undercurrent of tension |
| 2 | Moderate tension | Noticeable friction, raised voices without aggression |
| 3 | Active conflict | Clear verbal aggression or intimidation |
| 4 | Escalated violence | Physical or high-intensity verbal violence |
| 5 | Extreme / life-threatening | Severe physical violence, panic, imminent danger |
| Code | Role | Notes |
|---|---|---|
| `AGG` | Aggressor | The party initiating or sustaining the violence |
| `VIC` | Victim | The party against whom violence is directed |
| `BYS` | Bystander | Third party present (child, neighbor, colleague) |
| `UNK` | Unknown | Speaker cannot be attributed with confidence |
Applied per speaker turn, not per clip.
anger · fear · panic · distress · neutral · contempt · submission · grief · confusion · defiance
Every clip has a companion {clip_id}.json file with this schema:
{
"clip_id": "sp_it_b_0023_00",
"project": "she_proves",
"language": "he",
"violence_typology": "IT",
"tier": "B",
"duration_seconds": 247.3,
"sample_rate": 16000,
"channels": 1,
"snr_db_estimated": 19.4,
"scene_config": "configs/scenes/she_proves_tier_b/sp_it_b_0023.yaml",
"random_seed": 42,
"generation_date": "2026-04-10",
"generator_version": "0.1.0",
"is_synthetic": true,
"acoustic_scene": {
"room_type": "apartment_kitchen",
"device": "phone_in_pocket",
"ir_source": "pyroomacoustics",
"background_events": [
{"type": "tv_ambient", "onset": 0.0, "level_db": -30}
]
},
"speakers": [
{
"speaker_id": "AGG_M_30-45_001",
"role": "AGG",
"gender": "male",
"age_range": "30-45",
"tts_voice_id": "he-IL-AvriNeural",
"voice_family": "Avri"
},
{
"speaker_id": "VIC_F_25-40_002",
"role": "VIC",
"gender": "female",
"age_range": "25-40",
"tts_voice_id": "he-IL-HilaNeural"
}
],
"weak_label": {
"has_violence": true,
"violence_categories": ["VERB", "PHYS", "DIST"],
"max_intensity": 5,
"violence_typology": "IT"
},
"preprocessing_applied": {
"resampled_to_16k": true,
"downmixed_to_mono": true,
"spectral_filtered": true,
"denoised": true,
"normalized_dbfs": -2.013,
"silence_padded": true
},
"generation_metadata": {
"pipeline_version": "0.1.0",
"tts_backend": {"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "azure"},
"voice_family": {"AGG_M_30-45_001": "Avri", "VIC_F_25-40_002": "he-IL-HilaNeural"},
"mix_mode_used": "sequential",
"normalization_strategy": "per_turn_rms_v2_target_peak",
"loudness_target_peak_dbfs": -2.0,
"breathiness_applied": false,
"effective_prosody_caps": []
},
"dirty_file_path": "assets/speech/dirty/sp_it_b_0023_00_dirty.wav",
"transcript_path": "data/he/agg_m_30-45_001/sp_it_b_0023_00.txt",
"quality_flags": [],
"annotator_confidence": 1.0,
"iaa_reviewed": false
}
Field notes
- `preprocessing_applied.normalized_dbfs` is the measured post-preprocess peak (pair with `generation_metadata.loudness_target_peak_dbfs` to compute drift from target; see `labels/schema.py` for the docstring that pins this split).
- `tts_engine` was removed in #109. The TTS provider is now recorded per-speaker in `generation_metadata.tts_backend` (e.g. `{"AGG_M_30-45_001": "azure", "VIC_F_25-40_002": "google"}`); read backend diversity from the structured map. Pre-#109 corpus snapshots still carry the field; consumers should tolerate but ignore it.
- `generation_metadata` is optional: a JSON object when the generator recorded pipeline provenance, `null` otherwise. Treat absence as "unknown", not as failure. `generator_version` alone is not a reliable presence signal.
- `speakers[].voice_family` is optional: a stable family handle (e.g. `"Avri"`) when the speaker YAML overrides it, omitted otherwise. Consumers should fall back to `tts_voice_id`.
- `weak_label.has_violence` is derived, not asserted: `any(e.tier1_category != "NONE" for e in events)` (see `synthbanshee/labels/generator.py`). Corollaries: empty `events` → `False`; `NEG` typology clips are `False` (every event lands `tier1_category: "NONE"` by §4.1); `violence_typology` and `has_violence` may disagree (e.g. `SV` with `False` if no violent tier1 fired). The events are the ground truth; the flag is convenience. External docs and downstream code must mirror this rule: re-deriving from typology or intensity alone produces disagreement on every NEG row.
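For downstream code that needs to mirror the `has_violence` rule, a minimal sketch over plain event dicts (as read from the per-clip `.jsonl`) might look like this. The function name `derive_has_violence` is hypothetical; the real implementation lives in `synthbanshee/labels/generator.py` and may differ in form.

```python
def derive_has_violence(events: list[dict]) -> bool:
    """True if at least one event has a tier1_category other than "NONE".

    An empty event list yields False, matching the stated corollary.
    """
    return any(event["tier1_category"] != "NONE" for event in events)
```

Note that the clip's `violence_typology` is deliberately not an input: the flag depends on the events alone.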
quality_flags valid values: low_snr · clipping · short_silence_pad · label_uncertainty · iaa_disagreement · synthetic_artifact
One record per labeled event, stored per-clip as {clip_id}.jsonl in the same directory as the corresponding .wav, .txt, and .json files. The pipeline writes this file automatically during Stage 4b (Strong Label Writer). Each line is a JSON object:
{
"event_id": "sp_it_b_0023_00_EVT_007",
"clip_id": "sp_it_b_0023_00",
"onset": 143.82,
"offset": 146.05,
"tier1_category": "PHYS",
"tier2_subtype": "PHYS_HARD",
"intensity": 5,
"speaker_id": "AGG_M_30-45_001",
"speaker_role": "AGG",
"emotional_state": "anger",
"confidence": 0.95,
"label_source": "auto",
"iaa_reviewed": false,
"notes": "punch impact followed by object fall"
}
`event_id` is generated as `{clip_id}_EVT_{idx:03d}`: the `clip_id` prefix is lowercase per §2.5 and `EVT` is a literal uppercase token (it is not a casing inconsistency). `speaker_id` and `speaker_role` are values, not filenames, so they remain uppercase per §2.5's casing table.
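A consumer might sanity-check each strong-label line against the taxonomy before training. The sketch below is illustrative: `make_event_id` and `check_event_line` are hypothetical names, and the `onset < offset` check is an assumption that this document does not state explicitly. The subtype-prefix check relies on every Tier 2 code in §4 starting with its Tier 1 code.

```python
import json

TIER1_CATEGORIES = {"PHYS", "VERB", "DIST", "ACOU", "EMOT", "NONE"}


def make_event_id(clip_id: str, idx: int) -> str:
    """Format an event_id as {clip_id}_EVT_{idx:03d}."""
    return f"{clip_id}_EVT_{idx:03d}"


def check_event_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL strong-label record."""
    event = json.loads(line)
    problems = []
    if not event["event_id"].startswith(event["clip_id"] + "_EVT_"):
        problems.append("event_id does not start with clip_id + '_EVT_'")
    if event["tier1_category"] not in TIER1_CATEGORIES:
        problems.append("unknown tier1_category")
    if not event["tier2_subtype"].startswith(event["tier1_category"] + "_"):
        problems.append("tier2_subtype does not belong to tier1_category")
    if not 1 <= event["intensity"] <= 5:
        problems.append("intensity outside 1..5")
    if not event["onset"] < event["offset"]:  # assumption: events have positive length
        problems.append("onset is not before offset")
    return problems
```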
label_source values: auto (derived from scene config/script) · human (manual annotation) · auto_reviewed (auto + human validation pass)
One manifest CSV per generation run, written to the output directory (e.g. data/he/manifest.csv).
One row per clip, for fast dataset loading:
clip_id, project, violence_typology, tier, duration_seconds, speaker_ids, has_violence, max_intensity, quality_flags, split, wav_path, strong_labels_path
- `speaker_ids`: pipe-separated list of `speaker_id` values (e.g. `AGG_M_30-45_001|VIC_F_25-40_002`)
- `quality_flags`: comma-separated list of flag strings, empty string if none
- `split`: `train` | `val` | `test`, or empty string if unassigned
- `wav_path`: path to the `.wav` file; `.txt`, `.json`, and `.jsonl` share the same stem
- `strong_labels_path`: path to the per-clip `.jsonl` strong-labels file, or empty string if absent
- `language` is implicit in the `data/{language_code}/` directory structure and omitted from the manifest
- `violence_categories` and the redundant `txt_path` / `json_path` columns from the original spec are superseded by this schema
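Loading the manifest with the standard library could be sketched as below. The function name `read_manifest` is hypothetical. The sketch assumes the header row uses the column names exactly as listed, without padding spaces, and it leaves `has_violence` as a string because this document does not specify how booleans are serialized in the CSV.

```python
import csv
import io


def read_manifest(text: str) -> list[dict]:
    """Parse manifest CSV text into dicts with list-valued and optional fields decoded."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            **row,
            "duration_seconds": float(row["duration_seconds"]),
            # Pipe-separated speaker IDs; empty string means no speakers listed.
            "speaker_ids": row["speaker_ids"].split("|") if row["speaker_ids"] else [],
            # Comma-separated flags; the csv module handles the quoting this requires.
            "quality_flags": row["quality_flags"].split(",") if row["quality_flags"] else [],
            # Empty string encodes "unassigned" / "absent".
            "split": row["split"] or None,
            "strong_labels_path": row["strong_labels_path"] or None,
        })
    return rows
```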
For model training, clips are analyzed using:
- Window length: 3.0 seconds
- Hop length: 1.0 second (≈ 67% overlap)
- Minimum event duration for strong labeling: 0.1 seconds (below this, event is noted in clip metadata but not given a strong label)
- Ambiguity culling: Remove data from the ends of manually-labeled events to the 10 dB downpoints, ensuring only high-energy samples are included in training
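The windowing parameters above translate into analysis windows roughly as follows. This is a sketch with hypothetical function names; dropping the trailing partial window (rather than padding it) is an assumption, since the handling of clip tails is not specified here.

```python
import math


def window_bounds(
    duration_s: float, window_s: float = 3.0, hop_s: float = 1.0
) -> list[tuple[float, float]]:
    """Return (start, end) pairs of full windows that fit inside the clip.

    Assumption: a trailing partial window is dropped, not padded.
    """
    if duration_s < window_s:
        return []
    # Small epsilon guards against float error when the clip ends exactly on a hop.
    count = math.floor((duration_s - window_s) / hop_s + 1e-9) + 1
    return [(i * hop_s, i * hop_s + window_s) for i in range(count)]


def overlap_fraction(window_s: float = 3.0, hop_s: float = 1.0) -> float:
    """Fraction of each window shared with the next one."""
    return (window_s - hop_s) / window_s


def eligible_for_strong_label(onset: float, offset: float, min_s: float = 0.1) -> bool:
    """Events shorter than the minimum duration get no strong label."""
    return (offset - onset) >= min_s
```

With the default 3.0 s window and 1.0 s hop, `overlap_fraction()` evaluates to 2/3, i.e. about 67%.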
This section applies to any human annotation pass (Tier A auto-labels reviewed by human; all Tier B/C labels).
A minimum of 20% of all labeled segments must undergo independent second-pass review by a different annotator.
| Event category | Target κ | Minimum acceptable κ |
|---|---|---|
| Physical events (`PHYS_*`) | κ ≥ 0.65 | κ ≥ 0.55 |
| Verbal aggression (`VERB_*`) | κ ≥ 0.60 | κ ≥ 0.50 |
| Distress signals (`DIST_*`) | κ ≥ 0.60 | κ ≥ 0.50 |
| Acoustic events (`ACOU_*`) | κ ≥ 0.70 | κ ≥ 0.60 |
| Emotional state | κ ≥ 0.55 | κ ≥ 0.45 |
| Intensity level (±1 tolerance) | κ ≥ 0.60 | κ ≥ 0.50 |
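To check a review pass against these thresholds, agreement between two annotators can be computed as in the sketch below. It assumes Cohen's kappa for exactly two annotators over the same ordered list of segments; this document does not state which kappa variant is intended, and the ±1 tolerance for intensity is not handled here. The function name `cohens_kappa` is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same segments in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of segments where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For instance, with annotator A giving `PHYS, PHYS, NONE, NONE` and annotator B giving `PHYS, NONE, NONE, NONE`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5, which would fall below the minimum acceptable value for physical events.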
- Tier 2 label disagreements that cannot be resolved between annotators are escalated to the field expert reviewer (a qualified social worker or domestic violence specialist from the Rakman Institute team).
- Clips with unresolved disagreement receive `quality_flags: ["iaa_disagreement"]` and are excluded from training splits pending resolution. They may be included in a held-out "uncertain" evaluation set.
Annotators report a per-event confidence float (0.0–1.0). Events with confidence < 0.6 are flagged with quality_flags: ["label_uncertainty"] and excluded from the primary training split. They are retained in a separate uncertainty partition.
Use case: Long-form passive monitoring on a smartphone; detecting and segmenting rare incident windows within hours of background audio.
Primary modeling tasks:
- Incident window detection (binary: violence window / non-violence window over 30–120 s weak label clips)
- Event segmentation within flagged windows (strong labels)
- Distress/aggression cue detection (classification)
- Escalation arc modeling (intensity sequence over time)
Additional metadata fields (She-Proves only):
"she_proves_meta": {
"scene_phase": "escalation",
"phase_options": ["baseline", "tension", "escalation", "peak", "aftermath", "de-escalation"],
"incident_window": true,
"incident_onset_in_clip": 98.4,
"incident_offset_in_clip": 187.2,
"recording_device_profile": "phone_in_pocket",
"ambient_duration_before_incident": 98.4
}
Scene duration targets:
- Tier A/B scenes: 3–6 minutes (reflecting real passive recording windows)
- At least 60% of scene duration should be "baseline" or "tension" (pre-incident) to reflect realistic base-rate sparsity
- Target ratio: ≥ 3 non-violence control clips per violence clip
Key acoustic conditions to cover:
- Phone in pocket / bag / on table (each ≥ 20% of scenes)
- Near-field (< 1 m) and far-field (3–5 m) speaker positions
- Home environments: bedroom, kitchen, living room, hallway
Use case: Fixed Raspberry Pi–class device in a clinic or welfare office; detecting imminent assault on a social worker in near-real-time.
Primary modeling tasks:
- Imminent attack / aggression detection (binary alert)
- Escalation forecasting over short time horizons (5–30 s)
- Acoustic event fusion: shouting + impact + distress = alert composite
- "Alert now" vs "monitor" policy output
Additional metadata fields (Elephant in the Room only):
"elephant_meta": {
"scene_type": "office_encounter",
"scene_type_options": ["intake_interview", "benefits_dispute", "crisis_visit", "routine_followup"],
"alert_triggered": true,
"alert_onset": 87.3,
"pre_alert_duration": 87.3,
"post_alert_duration": 45.1,
"attack_type": "PHYS_HARD",
"recording_device_profile": "pi_budget_mic",
"room_type": "clinic_office"
}
Scene duration targets:
- Tier A/B scenes: 1–4 minutes (reflecting typical encounter length)
- Alert event should occur within the final 40% of scene duration
- Target ratio: ≥ 2 non-alert confusor clips per alert clip (animated encounters without attack)
Key acoustic conditions to cover:
- Fixed-position budget microphone (Raspberry Pi HAT mic or USB equivalent)
- Office environments: small clinic room, open welfare office, corridor
- Background: HVAC hum, distant phone ringing, door open/close
- Multiple confusor types: agitated but non-violent beneficiary, crying client, raised voices in adjacent room
| Split | Size | Purpose |
|---|---|---|
| `train` | 70% | Model training |
| `val` | 15% | Hyperparameter tuning, early stopping |
| `test_synth` | 15% | Evaluation on synthetic data (measures in-distribution performance) |
A separate held-out partition is maintained for actor recordings and real data (future phases) and is never mixed with synthetic splits.
Splits must be stratified on all of the following simultaneously:
- Project (`she_proves` / `elephant_in_the_room`)
- Violence typology (`SV` / `IT` / `NEG` / `NEU`)
- Tier (`A` / `B` / `C`)
- Max intensity band (1–2 / 3 / 4–5)
- Room type (broad categories)
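One way to stratify on all five dimensions at once is to group clips by a composite key. The sketch below builds such a key from the per-clip JSON fields shown in §5.1; `intensity_band` and `stratum_key` are hypothetical names, and passing `room_type` through unchanged is an assumption, because the mapping to "broad categories" is not defined in this document.

```python
def intensity_band(max_intensity: int) -> str:
    """Map a 1..5 max intensity onto the three stratification bands."""
    if max_intensity in (1, 2):
        return "1-2"
    if max_intensity == 3:
        return "3"
    if max_intensity in (4, 5):
        return "4-5"
    raise ValueError(f"intensity outside 1..5: {max_intensity}")


def stratum_key(meta: dict) -> tuple[str, str, str, str, str]:
    """Composite stratification key for one clip's JSON metadata."""
    return (
        meta["project"],
        meta["violence_typology"],
        meta["tier"],
        intensity_band(meta["weak_label"]["max_intensity"]),
        meta["acoustic_scene"]["room_type"],  # assumption: used as-is, not yet coarsened
    )
```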
No speaker persona (TTS voice ID) may appear in more than one split. This is critical: if the same voice appears in train and test, the model may overfit to voice identity rather than acoustic event content. Assign each speaker persona to a split before generating scenes.
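A leakage check over manifest rows could be sketched as follows. It uses `speaker_ids` as a stand-in for the speaker persona, which is an assumption: the rule is stated in terms of TTS voice ID, so a stricter check would compare `tts_voice_id` from the per-clip JSON. The function name is hypothetical, and rows are assumed to be dicts with raw (pipe-separated) `speaker_ids` and `split` strings.

```python
from collections import defaultdict


def speakers_in_multiple_splits(rows: list[dict]) -> dict[str, set[str]]:
    """Return each speaker_id that appears in more than one split, with those splits."""
    splits_by_speaker: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        if not row["split"]:
            continue  # unassigned clips cannot leak yet
        for speaker_id in row["speaker_ids"].split("|"):
            if speaker_id:
                splits_by_speaker[speaker_id].add(row["split"])
    return {s: splits for s, splits in splits_by_speaker.items() if len(splits) > 1}
```

An empty result means the speaker-disjointness rule holds for the assigned clips.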
No scene config (or its close variants) may appear in more than one split. Script templates may be shared across splits only if specific content is different (different slots filled, different intensity arcs).
She-Proves:
- Violence window clips: ≥ 30% of total, ≤ 50%
- Negative/confusor (Tier C) clips: ≥ 20% of total
- Neutral control clips: ≥ 15% of total
Elephant in the Room:
- Alert clips (attack present): ≥ 25% of total, ≤ 45%
- Animated non-alert clips: ≥ 25% of total
- Neutral/routine clips: ≥ 15% of total
The confusor set is critical for both projects. A model that cannot distinguish these from true violence will have unacceptable false alarm rates in deployment.
| Confusor type | Code | Relevant for |
|---|---|---|
| Heated argument that de-escalates before violence | `NONE_ARGU` | Both |
| Sports/TV yelling | `NONE_SPORT` | She-Proves |
| Loud children (play, tantrums) | `NONE_CHILD` | She-Proves |
| Crying from non-violence (grief, frustration) | `NONE_CRY_SAFE` | Both |
| Laughter that acoustically resembles screaming | `NONE_LAUGH` | Both |
| Animated clinic interaction (agitation without attack) | `NONE_CLINIC` | Elephant in the Room |
| Social worker client in distress (no aggression) | `NONE_CLINIC` | Elephant in the Room |
| Hebrew radio / TV drama | `NONE_AMBIENT` | She-Proves |
| Cooking sounds (chopping, pan, breaking crockery accidentally) | `ACOU_BREAK` + `NONE` | She-Proves |
- Tier C clips must constitute ≥ 20% of each project's total dataset
- Each confusor type in the table above must have ≥ 50 clips in the training split
Each .txt transcript file follows this format (UTF-8):
[CLIP_ID: sp_it_b_0023_00]
[SPEAKER: AGG_M_30-45_001 | ROLE: AGG | ONSET: 0.0 | OFFSET: 4.2]
אמרתי לך לא ללכת לשם!
[ACTION: VERB_SHOUT | INTENSITY: 4]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 4.5 | OFFSET: 7.1]
בבקשה, אל תתחיל שוב...
[ACTION: DIST_PLEAD | INTENSITY: 4]
[ACTION: ACOU_BREAK | ONSET: 7.8 | OFFSET: 8.1 | INTENSITY: 5]
[SPEAKER: VIC_F_25-40_002 | ROLE: VIC | ONSET: 8.2 | OFFSET: 9.0]
די! תפסיק!
[ACTION: DIST_SCREAM | INTENSITY: 5]
Transcript redundancy constraint: no word or character sequence may repeat more than 3 consecutive times (prevents TTS/ASR service failures).
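The redundancy constraint can be approximated with the check below. It reads "repeat more than 3 consecutive times" as "occur four or more times in a row", which is an interpretation rather than something this document pins down. The function name `has_excess_repetition` is hypothetical; word repeats are detected on whitespace-separated tokens, and character-sequence repeats are detected within a token using a backreference.

```python
import re


def has_excess_repetition(text: str, max_repeats: int = 3) -> bool:
    """True if a word or a character sequence occurs more than max_repeats times in a row."""
    # Word level: count runs of identical whitespace-separated tokens.
    words = text.split()
    run = 1
    for previous, current in zip(words, words[1:]):
        run = run + 1 if current == previous else 1
        if run > max_repeats:
            return True
    # Character level: a non-space sequence followed by max_repeats or more copies of itself.
    pattern = rf"(\S+?)\1{{{max_repeats},}}"
    return re.search(pattern, text) is not None
```

Under this reading, the three-dot ellipsis in the example transcript passes, while four identical words or characters in a row fail.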
All synthetic clips must carry a machine-readable marker of their synthetic origin to prevent inadvertent contamination of real-data evaluation sets.
- `is_synthetic: true` in JSON metadata (mandatory)
- `generator_version` field must match the version tag of the generation code
- The manifest CSV includes an `is_synthetic` column
When the framework later transitions to include actor recordings, is_synthetic is set to false and actor_session_id is added to the metadata schema.
Document prepared for DataHack AVDP — not for distribution outside the project team.