17 changes: 17 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,17 @@
name: CI

on:
  pull_request:
    types: [opened, synchronize, reopened]

permissions:
  contents: read
  actions: read
  pull-requests: write

jobs:
  pr-agent-context:
    uses: shaypal5/pr-agent-context/.github/workflows/pr-agent-context.yml@v4
    with:
      include_review_comments: true
      include_outdated_review_threads: true
31 changes: 31 additions & 0 deletions .github/workflows/pr-agent-context-refresh.yml
@@ -0,0 +1,31 @@
name: PR Agent Context Refresh

on:
  pull_request:
    types: [synchronize]
  pull_request_review:
    types: [submitted, dismissed]
  pull_request_review_comment:
    types: [created, edited]
  check_run:
    types: [completed]

concurrency:
  group: pr-agent-context-${{ github.event.pull_request.number || github.event.check_run.pull_requests[0].number || github.run_id }}
  cancel-in-progress: true

permissions:
  contents: read
  actions: read
  pull-requests: write

jobs:
  pr-agent-context-refresh:
    uses: shaypal5/pr-agent-context/.github/workflows/pr-agent-context.yml@v4
    with:
      execution_mode: refresh
      publish_mode: append
      include_review_comments: true
      include_outdated_review_threads: true
      wait_for_reviews_to_settle: true
      hide_previous_managed_comments_on_append: true
134 changes: 134 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,134 @@
# avdp-synth-corpus — Claude Code Context

## What this repo is

**Data-only repository.** No application code lives here. All pipeline logic, configuration,
templates, and tests are in [DataHackIL/SynthBanshee](https://github.com/DataHackIL/SynthBanshee).

Contents:
- `data/he/` — spec-compliant generated clips (`.wav` + `.txt` + `.json` + `.jsonl` per clip)
- `assets/speech/` — per-utterance TTS WAV cache (SHA-256 keyed)
- `assets/scripts/` — per-scene LLM script cache (SHA-256 keyed)
- `deliveries/` — per-delivery metadata and notes
- `DELIVERIES.md` — master delivery log

---

## Hard rules — read before touching any file

### Cache integrity (assets/)

`assets/speech/` and `assets/scripts/` files are named by **SHA-256 of their generation inputs**.
SynthBanshee uses the filename as the cache key.

- **Never rename, modify, or reformat** a file in `assets/`. Renaming breaks cache lookups silently.
- **Never delete** a cache file. Deletion forces a paid Azure TTS re-synthesis call.
- **Never add** a file with an incorrect name. A wrong hash name silently poisons the cache.
- The only safe operations are: add a correctly-named file (produced by SynthBanshee), or leave it alone.

### Clip file integrity (data/)

- Every `.wav` must have a matching `.txt`, `.json`, and `.jsonl` with the same stem.
- **Never edit `.wav` files by hand.** All audio is produced by `synthbanshee generate` or
`synthbanshee generate-batch`. Manual edits silently break the SHA-256 cache linkage in
`assets/speech/`.
- **ASCII-only filenames**, lowercase, no spaces. No Hebrew text in filenames or JSON
keys/values. Hebrew belongs in `.txt` transcript files only.
- `is_synthetic: true` must be present in all `.json` clip metadata.
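
The file-set and filename rules above can be checked mechanically. The following is a minimal read-only sketch, not part of SynthBanshee; the function names (`stem_is_valid`, `missing_sidecars`, `check_clip_tree`) are hypothetical and exist only for illustration. It does not inspect audio content or JSON fields.

```python
from pathlib import Path

# The four files every clip must have, per the rules above.
REQUIRED_SUFFIXES = (".wav", ".txt", ".json", ".jsonl")


def stem_is_valid(stem: str) -> bool:
    """True if the stem is ASCII, lowercase, and contains no whitespace."""
    return (
        stem.isascii()
        and stem == stem.lower()
        and not any(ch.isspace() for ch in stem)
    )


def missing_sidecars(wav_path: Path) -> list[str]:
    """Return the required suffixes that are absent next to wav_path."""
    return [s for s in REQUIRED_SUFFIXES if not wav_path.with_suffix(s).exists()]


def check_clip_tree(root: Path) -> dict[str, list[str]]:
    """Map each problematic .wav path to a list of human-readable problems."""
    problems: dict[str, list[str]] = {}
    for wav in sorted(root.rglob("*.wav")):
        issues = [f"missing {s}" for s in missing_sidecars(wav)]
        if not stem_is_valid(wav.stem):
            issues.append("stem is not ASCII lowercase")
        if issues:
            problems[str(wav)] = issues
    return problems
```

Calling `check_clip_tree(Path("data/he"))` from the repository root would return an empty dict when every clip has its full file set and a conforming stem. The script only reads the tree, so it is safe with respect to the cache rules.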

### Delivery log

Every merged PR that adds clips must have:
- A row in `DELIVERIES.md`
- A `deliveries/{NNN-slug}/metadata.yaml` and `deliveries/{NNN-slug}/notes.md`

Do not merge data without updating the delivery log.
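
A pre-merge check for these requirements could look like the sketch below. It is illustrative only: `delivery_problems` is a hypothetical helper, and it assumes that a delivery's row in `DELIVERIES.md` links to `deliveries/{NNN-slug}/notes.md`, as the existing rows do.

```python
from pathlib import Path

# Files every delivery directory must contain, per the rule above.
REQUIRED_DELIVERY_FILES = ("metadata.yaml", "notes.md")


def delivery_problems(repo_root: Path, delivery_dir_name: str) -> list[str]:
    """List what is missing for one delivery, e.g. '001-debug-run-1'."""
    problems: list[str] = []
    delivery_dir = repo_root / "deliveries" / delivery_dir_name
    for name in REQUIRED_DELIVERY_FILES:
        if not (delivery_dir / name).is_file():
            problems.append(f"missing deliveries/{delivery_dir_name}/{name}")

    # Assumption: the log row links to the delivery's notes.md file.
    log_path = repo_root / "DELIVERIES.md"
    log_text = log_path.read_text(encoding="utf-8") if log_path.is_file() else ""
    notes_link = f"deliveries/{delivery_dir_name}/notes.md"
    if notes_link not in log_text:
        problems.append(f"no DELIVERIES.md row links to {notes_link}")
    return problems
```

An empty list means the delivery directory and the log row are both in place; it says nothing about whether the row's clip counts or durations are accurate.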

---

## Label policy

### `has_violence` is a derived convenience field — keep it

`has_violence` appears in `weak_label.has_violence` (clip `.json`) and as a column in
`manifest.csv`. It is **derived** from the hierarchical taxonomy, not assigned independently:

```
has_violence = (violence_typology in {"SV", "IT", "NEG"}) and (max_intensity >= 3)
```

**Do not remove it.** AI teams use it for binary baseline models and stratified train/val/test
splits. Removing it forces every downstream user to re-derive it from `violence_typology` and
`max_intensity` — which introduces inconsistency risk.
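
For reference, the derivation rule can be written as a small function. This is a sketch of the formula above, not SynthBanshee's actual implementation. The helper names are hypothetical, and `label_is_consistent` assumes that `violence_typology` and `max_intensity` sit in the same `weak_label` mapping as `has_violence`, which this document does not confirm.

```python
# Typologies that count as violent in the formula above; "NEU" is excluded.
VIOLENT_TYPOLOGIES = frozenset({"SV", "IT", "NEG"})
INTENSITY_THRESHOLD = 3


def derive_has_violence(violence_typology: str, max_intensity: int) -> bool:
    """Apply the documented rule: violent typology AND max_intensity >= 3."""
    return (
        violence_typology in VIOLENT_TYPOLOGIES
        and max_intensity >= INTENSITY_THRESHOLD
    )


def label_is_consistent(weak_label: dict) -> bool:
    """Check a stored has_violence flag against the derived value.

    Assumes (hypothetically) that all three fields live in one mapping.
    """
    expected = derive_has_violence(
        weak_label["violence_typology"], weak_label["max_intensity"]
    )
    return weak_label["has_violence"] == expected
```

Under this rule a clip with typology `IT` and `max_intensity` 3 is flagged, while `SV` at intensity 2 and `NEU` at any intensity are not.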

### The taxonomy is the ground truth

The label hierarchy (in `configs/taxonomy.yaml` in SynthBanshee) is the authoritative source:

| Level | Field | Examples |
|-------|-------|---------|
| Violence typology | `violence_typology` | `SV`, `IT`, `NEG`, `NEU` |
| Tier 1 category | `tier1_category` | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` |
| Tier 2 subtype | `tier2_subtype` | `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD` |

**What is prohibited:** replacing the full taxonomy with only a binary flag. `has_violence` must
always appear alongside `violence_typology`, `tier1_category`, `tier2_subtype`, and
`max_intensity` — never as a substitute for them.

---

## Audio format (all clips must conform)

| Property | Value |
|----------|-------|
| Sample rate | 16 kHz |
| Channels | Mono |
| Bit depth | 16-bit PCM WAV |
| Peak level | −1.0 dBFS (peak-normalized) |
| Silence padding | ≥ 0.5 s before and after target speech |
| Language | Hebrew (he-IL) |
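
The container-level properties in the table (sample rate, channel count, bit depth) can be read with Python's standard `wave` module. The sketch below is an illustration only; `wav_format_problems` is a hypothetical helper, and it does not check peak level, silence padding, or language. `synthbanshee validate` remains the authoritative check.

```python
import wave

EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM


def wav_format_problems(path: str) -> list[str]:
    """Return a list of header-level deviations from the audio spec."""
    problems: list[str] = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != EXPECTED_SAMPLE_RATE_HZ:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected 16000")
        if wf.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wf.getnchannels()} channels, expected mono")
        if wf.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"{8 * wf.getsampwidth()}-bit samples, expected 16-bit")
    return problems
```

The function opens the file read-only, so running it over `data/he/` does not modify any clip.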

---

## How clips get here

SynthBanshee routes output here via three environment variables:

| Variable | Points to |
|----------|-----------|
| `SYNTHBANSHEE_DATA_DIR` | `data/he/` |
| `SYNTHBANSHEE_CACHE_DIR` | `assets/speech/` |
| `SYNTHBANSHEE_SCRIPT_CACHE_DIR` | `assets/scripts/` |

Run `synthbanshee generate-batch -r <run_config.yaml>` from the SynthBanshee repo with these
variables set. Never write clip files to this repo by hand.
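
As an illustration of the routing, the sketch below builds the environment and command line for a batch run. The helper names are hypothetical, and passing absolute paths in these variables is an assumption; check the SynthBanshee documentation for the exact form it expects.

```python
import os
from pathlib import Path


def corpus_env(corpus_root: Path) -> dict[str, str]:
    """Copy the current environment and point SynthBanshee at this repo."""
    env = dict(os.environ)
    env.update(
        {
            "SYNTHBANSHEE_DATA_DIR": str(corpus_root / "data" / "he"),
            "SYNTHBANSHEE_CACHE_DIR": str(corpus_root / "assets" / "speech"),
            "SYNTHBANSHEE_SCRIPT_CACHE_DIR": str(corpus_root / "assets" / "scripts"),
        }
    )
    return env


def generate_batch_command(run_config: Path) -> list[str]:
    """Argument list for the documented generate-batch invocation."""
    return ["synthbanshee", "generate-batch", "-r", str(run_config)]
```

A caller would pass both results to `subprocess.run(cmd, env=env, cwd=<SynthBanshee checkout>, check=True)`; the sketch itself launches nothing and writes no files.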

---

## Validation

From the SynthBanshee repo:

```bash
# Single clip
synthbanshee validate data/he/{speaker_id}/{clip_id}.wav

# Whole directory
synthbanshee qa-report data/he/
```

A clip is valid iff all four files are present, the WAV is spec-compliant, the `.json` parses
as `ClipMetadata` with `is_synthetic=true`, and the filename stem is ASCII lowercase.

---

## What NOT to do

- Don't modify or rename files in `assets/` — cache integrity depends on exact filenames
- Don't edit `.wav` files manually
- Don't add clips without updating `DELIVERIES.md` and `deliveries/{NNN-slug}/`
- Don't use Hebrew text in filenames or JSON string fields
- Don't drop `has_violence` from metadata or manifests
- Don't treat `has_violence` as the sole label — always preserve the full taxonomy alongside it
- Don't use lossy audio formats (MP3, AAC) anywhere
- Don't commit `.env` files or credentials
17 changes: 17 additions & 0 deletions DELIVERIES.md
@@ -0,0 +1,17 @@
# Delivery Log

One row per data delivery (merged PR). Each entry links to a per-delivery notes file with
clip counts, prosody QA results, known limitations, and the SynthBanshee commit that produced it.

| # | Date | Slug | Project | Tier | Clips | Duration | Typologies | Pipeline milestone | Status | PR |
|---|------|------|---------|------|------:|------:|------------|-------------------|--------|-----|
| 001 | 2026-04-15 | [debug-run-1](deliveries/001-debug-run-1/notes.md) | she_proves | A | 1 | 2m 36s | IT | v1 baseline (pre-V3) | superseded | [#1](https://github.com/DataHackIL/avdp-synth-corpus/pull/1) |
| 002 | 2026-04-15 | [m2a-wettest](deliveries/002-m2a-wettest/notes.md) | she_proves | A | 8 | ~17m | SV, IT, NEG, NEU | M2a SSML prosody | provisional | [#2](https://github.com/DataHackIL/avdp-synth-corpus/pull/2) |

## Status definitions

| Status | Meaning |
|--------|---------|
| `provisional` | Wet-test batch; not yet approved for model training |
| `approved` | QA passed; cleared for training use |
| `superseded` | Replaced by a later delivery with the same scenes at higher quality |
22 changes: 21 additions & 1 deletion README.md
@@ -4,6 +4,11 @@ Synthetic Hebrew audio dataset for the **Audio Violence Detection Pipeline (AVDP

This is a **data-only repository**. It contains no application code. All pipeline logic, configuration, documentation, and tests live in SynthBanshee.

> **If you are a Claude Code agent or AI assistant:** read [`CLAUDE.md`](CLAUDE.md) before
> making any changes. Key rules: never rename/modify/delete files in `assets/`; never edit
> `.wav` files by hand; always update `DELIVERIES.md` when adding clips; never drop
> `has_violence` from metadata or manifests.

---

## What is this data for?
@@ -63,7 +68,7 @@ Labels follow a three-level hierarchy defined in `configs/taxonomy.yaml` in the
| Tier 1 category (event-level) | `tier1_category` | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` |
| Tier 2 subtype (event-level) | `tier2_subtype` | `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD` |

- There are **no binary Violence/Non-Violence labels** anywhere in this dataset. The spec explicitly prohibits them.
+ `has_violence` is a **derived convenience field** computed from the hierarchical taxonomy (`violence_typology`, `violence_categories`, `max_intensity`). It is provided for fast filtering and baseline modelling. The taxonomy columns are the ground truth — `has_violence` must never be the only label used for training.

Intensity is scored 1–5 per turn:

@@ -134,6 +139,21 @@ synthbanshee qa-report data/he/

---

## Delivery history

All data deliveries are logged in **[DELIVERIES.md](DELIVERIES.md)** — one row per merged PR,
with clip counts, duration, prosody QA results, and known limitations.
Per-delivery notes and structured metadata live under `deliveries/{NNN-slug}/`.

---

## Agent and contributor guidelines

See [`CLAUDE.md`](CLAUDE.md) for the full rules governing this repository — cache integrity,
label policy, delivery log conventions, and what not to do.

---

## Related repositories

| Repo | Purpose |
@@ -0,0 +1,102 @@
{
"turns": [
{
"emotional_state": "neutral",
"intensity": 1,
"pause_before_s": 0.0,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05ea\u05d2\u05d9\u05d3, \u05e8\u05e6\u05d9\u05ea\u05d9 \u05dc\u05d3\u05d1\u05e8 \u05d0\u05d9\u05ea\u05da \u05e2\u05dc \u05dc\u05d5\u05d7 \u05d4\u05d6\u05de\u05e0\u05d9\u05dd \u05e9\u05dc \u05d4\u05e9\u05d1\u05d5\u05e2 \u05d4\u05d1\u05d0. \u05d9\u05e9 \u05db\u05de\u05d4 \u05e9\u05d9\u05e0\u05d5\u05d9\u05d9\u05dd \u05d1\u05d0\u05d9\u05e1\u05d5\u05e4\u05d9\u05dd \u05de\u05d1\u05d9\u05ea \u05d4\u05e1\u05e4\u05e8."
},
{
"emotional_state": "neutral",
"intensity": 1,
"pause_before_s": 0.8,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05db\u05df, \u05d1\u05d8\u05d7. \u05de\u05d4 \u05d4\u05e9\u05ea\u05e0\u05d4?"
},
{
"emotional_state": "neutral",
"intensity": 1,
"pause_before_s": 0.3,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05d1\u05d9\u05d5\u05dd \u05e9\u05dc\u05d9\u05e9\u05d9 \u05d9\u05e9 \u05dc\u05d9 \u05d9\u05e9\u05d9\u05d1\u05d4 \u05d1\u05e2\u05d1\u05d5\u05d3\u05d4 \u05e2\u05d3 \u05d7\u05de\u05e9, \u05d0\u05d6 \u05d0\u05e0\u05d9 \u05dc\u05d0 \u05d0\u05e1\u05e4\u05d9\u05e7 \u05dc\u05d0\u05e1\u05d5\u05e3 \u05d0\u05ea \u05e0\u05d5\u05e2\u05d4 \u05de\u05d4\u05d2\u05df \u05d1\u05d0\u05e8\u05d1\u05e2. \u05d0\u05ea\u05d4 \u05d9\u05db\u05d5\u05dc \u05dc\u05e7\u05d7\u05ea \u05d0\u05ea \u05d6\u05d4?"
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 1.0,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05e9\u05dc\u05d9\u05e9\u05d9? \u05e8\u05d2\u05e2, \u05ea\u05e0\u05d9 \u05dc\u05d9 \u05dc\u05d1\u05d3\u05d5\u05e7. \u05db\u05df, \u05d1\u05d9\u05d5\u05dd \u05e9\u05dc\u05d9\u05e9\u05d9 \u05d0\u05e0\u05d9 \u05d0\u05de\u05d5\u05e8 \u05dc\u05e1\u05d9\u05d9\u05dd \u05d1\u05e9\u05dc\u05d5\u05e9 \u05d5\u05d7\u05e6\u05d9, \u05d0\u05d6 \u05d6\u05d4 \u05d1\u05e1\u05d3\u05e8 \u05d2\u05de\u05d5\u05e8. \u05d0\u05e0\u05d9 \u05d0\u05d3\u05d0\u05d2 \u05dc\u05d0\u05e1\u05d5\u05e3 \u05d0\u05d5\u05ea\u05d4."
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 0.4,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05de\u05e2\u05d5\u05dc\u05d4. \u05d5\u05de\u05d4 \u05e2\u05dd \u05d9\u05d5\u05dd \u05e8\u05d1\u05d9\u05e2\u05d9? \u05db\u05d9 \u05d2\u05dd \u05d1\u05e8\u05d1\u05d9\u05e2\u05d9 \u05d9\u05e9 \u05dc\u05d9 \u05de\u05e9\u05d4\u05d5, \u05d0\u05d1\u05dc \u05d6\u05d4 \u05e8\u05e7 \u05e2\u05d3 \u05e9\u05dc\u05d5\u05e9 \u05d5\u05d7\u05e6\u05d9 \u05d0\u05d6 \u05d0\u05e0\u05d9 \u05d7\u05d5\u05e9\u05d1\u05ea \u05e9\u05d0\u05e0\u05d9 \u05d0\u05e1\u05ea\u05d3\u05e8."
},
{
"emotional_state": "neutral",
"intensity": 1,
"pause_before_s": 0.5,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05d1\u05e8\u05d1\u05d9\u05e2\u05d9 \u05d3\u05d5\u05d5\u05e7\u05d0 \u05d9\u05e9 \u05dc\u05d9 \u05e4\u05d2\u05d9\u05e9\u05d4 \u05de\u05d7\u05d5\u05e5 \u05dc\u05de\u05e9\u05e8\u05d3, \u05d0\u05d6 \u05d0\u05dd \u05d0\u05ea \u05d9\u05db\u05d5\u05dc\u05d4, \u05e2\u05d3\u05d9\u05e3 \u05e9\u05d0\u05ea \u05ea\u05d0\u05e1\u05e4\u05d9."
},
{
"emotional_state": "neutral",
"intensity": 2,
"pause_before_s": 0.3,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05d1\u05e1\u05d3\u05e8, \u05d0\u05e0\u05d9 \u05d0\u05e1\u05ea\u05d3\u05e8. \u05e8\u05d2\u05e2, \u05d0\u05d1\u05dc \u05d7\u05db\u05d4 \u2014 \u05d1\u05e8\u05d1\u05d9\u05e2\u05d9 \u05e0\u05d5\u05e2\u05d4 \u05d9\u05d5\u05e6\u05d0\u05ea \u05d1\u05e9\u05ea\u05d9\u05d9\u05dd \u05db\u05d9 \u05d9\u05e9 \u05dc\u05d4\u05dd \u05e7\u05d9\u05e6\u05d5\u05e8 \u05d9\u05d5\u05dd, \u05e0\u05db\u05d5\u05df? \u05d6\u05d4 \u05e7\u05e6\u05ea \u05d1\u05e2\u05d9\u05d9\u05ea\u05d9 \u05d1\u05e9\u05d1\u05d9\u05dc\u05d9."
},
{
"emotional_state": "calm",
"intensity": 2,
"pause_before_s": 0.6,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05d0\u05d4, \u05e0\u05db\u05d5\u05df, \u05e9\u05db\u05d7\u05ea\u05d9 \u05de\u05d6\u05d4. \u05d0\u05d5\u05dc\u05d9 \u05d0\u05de\u05d0 \u05e9\u05dc\u05d9 \u05d9\u05db\u05d5\u05dc\u05d4 \u05dc\u05d0\u05e1\u05d5\u05e3 \u05d0\u05d5\u05ea\u05d4 \u05d1\u05d9\u05d5\u05dd \u05d4\u05d6\u05d4? \u05d4\u05d9\u05d0 \u05de\u05de\u05d9\u05dc\u05d0 \u05d2\u05e8\u05d4 \u05e7\u05e8\u05d5\u05d1 \u05dc\u05d2\u05df."
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 0.4,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05d6\u05d4 \u05e8\u05e2\u05d9\u05d5\u05df \u05d8\u05d5\u05d1. \u05ea\u05de\u05e1\u05d5\u05e8 \u05dc\u05d4 \u05e9\u05e0\u05e9\u05de\u05d7 \u05d0\u05dd \u05d4\u05d9\u05d0 \u05ea\u05d5\u05db\u05dc? \u05e8\u05e7 \u05ea\u05d5\u05d5\u05d3\u05d0 \u05e9\u05d4\u05d9\u05d0 \u05d9\u05d5\u05d3\u05e2\u05ea \u05e9\u05d4\u05d9\u05e6\u05d9\u05d0\u05d4 \u05d1\u05e9\u05ea\u05d9\u05d9\u05dd \u05d5\u05dc\u05d0 \u05d1\u05d0\u05e8\u05d1\u05e2."
},
{
"emotional_state": "neutral",
"intensity": 1,
"pause_before_s": 0.7,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05d0\u05e0\u05d9 \u05d0\u05ea\u05e7\u05e9\u05e8 \u05d0\u05dc\u05d9\u05d4 \u05d4\u05e2\u05e8\u05d1. \u05d0\u05d2\u05d1, \u05d0\u05d9\u05da \u05d4\u05d5\u05dc\u05da \u05dc\u05e0\u05d5\u05e2\u05d4 \u05d1\u05d2\u05df? \u05d4\u05de\u05d5\u05e8\u05d4 \u05d0\u05de\u05e8\u05d4 \u05de\u05e9\u05d4\u05d5 \u05e2\u05dc \u05d4\u05d4\u05e1\u05ea\u05d2\u05dc\u05d5\u05ea \u05e9\u05dc\u05d4?"
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 0.5,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05d3\u05d5\u05d5\u05e7\u05d0 \u05db\u05df, \u05d3\u05d9\u05d1\u05e8\u05ea\u05d9 \u05e2\u05dd \u05d4\u05d2\u05e0\u05e0\u05ea \u05d1\u05d9\u05d5\u05dd \u05d7\u05de\u05d9\u05e9\u05d9. \u05d4\u05d9\u05d0 \u05d0\u05de\u05e8\u05d4 \u05e9\u05e0\u05d5\u05e2\u05d4 \u05de\u05d0\u05d5\u05d3 \u05d7\u05d1\u05e8\u05d5\u05ea\u05d9\u05ea \u05d5\u05de\u05e9\u05ea\u05dc\u05d1\u05ea \u05d9\u05e4\u05d4. \u05d0\u05e0\u05d9 \u05de\u05de\u05e9 \u05e9\u05de\u05d7\u05d4."
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 1.2,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05db\u05d9\u05e3 \u05dc\u05e9\u05de\u05d5\u05e2, \u05d1\u05d0\u05de\u05ea. \u05d0\u05d6 \u05d1\u05d5\u05d0\u05d9 \u05e0\u05e1\u05db\u05dd: \u05e9\u05dc\u05d9\u05e9\u05d9 \u05d0\u05e0\u05d9 \u05d0\u05d5\u05e1\u05e3, \u05e8\u05d1\u05d9\u05e2\u05d9 \u05d0\u05de\u05d0 \u05e9\u05dc\u05d9, \u05d5\u05d4\u05e9\u05d0\u05e8 \u05db\u05e8\u05d2\u05d9\u05dc?"
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 0.3,
"speaker_id": "VIC_F_25-40_002",
"text": "\u05d1\u05d3\u05d9\u05d5\u05e7. \u05d5\u05d4\u05e9\u05d0\u05e8 \u05db\u05e8\u05d2\u05d9\u05dc \u2014 \u05d0\u05e0\u05d9 \u05d1\u05d9\u05d5\u05dd \u05e8\u05d0\u05e9\u05d5\u05df \u05d5\u05e8\u05d1\u05d9\u05e2\u05d9, \u05d5\u05d0\u05ea\u05d4 \u05d1\u05d9\u05d5\u05dd \u05d7\u05de\u05d9\u05e9\u05d9. \u05ea\u05d5\u05d3\u05d4 \u05e9\u05e1\u05d9\u05d3\u05e8\u05e0\u05d5 \u05d0\u05ea \u05d6\u05d4."
},
{
"emotional_state": "calm",
"intensity": 1,
"pause_before_s": 0.5,
"speaker_id": "AGG_M_30-45_001",
"text": "\u05d1\u05db\u05d9\u05e3. \u05d0\u05e0\u05d9 \u05e9\u05dd \u05ea\u05d6\u05db\u05d5\u05e8\u05ea \u05d1\u05d8\u05dc\u05e4\u05d5\u05df \u05db\u05db\u05d4 \u05dc\u05d0 \u05d0\u05e9\u05db\u05d7."
}
]
}