feat(delivery-003): 20-clip multi-project, multi-voice toy corpus by shaypal5 · Pull Request #4 · DataHackIL/avdp-synth-corpus

shaypal5 · 2026-05-11T22:00:14Z

Summary

Replaces delivery 002. First handoff target for the She-Proves and Elephant consumer teams: a 20-clip toy corpus designed to bootstrap consumer-side schema parsers, manifest loaders, and taxonomy validators — not for model training itself.

| Project | Tier | Speaker dir | Clips | TTS Backend |
|---|---|---|---|---|
| she_proves | A | `agg_m_30-45_001/` | 10 (2 IT, 2 SV, 3 NEG, 3 NEU) | Azure (Avri/Hila) |
| she_proves | A | `agg_m_30-45_002/` | 2 (1 IT, 1 SV — Google-pair shadow scenes) | Google Chirp HD (Achird/Achernar) |
| elephant | B | `ben_m_40-55_003/` | 8 (2 IT, 2 SV, 2 NEG, 2 NEU) | Azure (Avri/Hila) + room IR + Pi mic |

Total: 20 clips, ~41.7 min. 4 unique voice families across 2 TTS backends. All clips pass synthbanshee validate (0 failures) and synthbanshee qa-report (0% failure rate).

Full delivery record: deliveries/003-multi-project-multi-voice/ — includes metadata.yaml, notes.md with the per-clip table, and the raw qa-report.json.

Synthbanshee changes shipped with this delivery

This delivery is the first to carry four synthbanshee corrections landed in the past day:

DataHackIL/SynthBanshee#102 — preprocessing_applied.normalized_dbfs now records the measured post-preprocess peak (was hardcoded -1.0). Pair with generation_metadata.loudness_target_peak_dbfs to diagnose loudness drift.
DataHackIL/SynthBanshee#103 — docs/spec.md pins the has_violence derivation rule, adds the §2.5 identifier-casing table, and rewrites §5.1 field notes.
DataHackIL/SynthBanshee#105 — adds sp_sv_a_0003 + sp_it_a_0003 Google-pair sister scenes (this delivery's voice-diversity vehicle).
DataHackIL/SynthBanshee#106 — root cause for the long-running [#72](https://github.com/DataHackIL/SynthBanshee/issues/72) ("SSML parsing error 0x80045003, unable to reproduce"): _HINT_DEFAULTS["stress"] was emitting nested <prosody volume="+NdB"> inside outer <prosody volume="+N%">, which Azure rejects. Required to unblock this delivery — without it, 6 of 8 elephant Tier B scenes (every one whose LLM script carries a stress phrase hint at intensity ≥3) failed reliably with Azure SSML parse error.

What changed in this PR (file-by-file summary)

| Path | Change |
|---|---|
| `README.md` | Tightened "Clip ID and filename conventions" to point at the new spec §2.5 casing table; rewrote `has_violence` paragraph with the events-based rule; updated audio-format section for the post-#78 measured-vs-target split; replaced v1-limitations block with a pointer to per-delivery notes |
| `CLAUDE.md` | Replaced the wrong `has_violence` formula (`typology in {SV,IT,NEG} and max_intensity ≥ 3`) with the events-based rule (`any(e.tier1_category != "NONE")`); audio-format table now distinguishes target (-2.0 dBFS), limiter (-1.0 dBFS), and where each is recorded in metadata |
| `DELIVERIES.md` | Delivery 002 marked `superseded`; new row for 003 |
| `deliveries/003-multi-project-multi-voice/metadata.yaml` | Structured delivery record |
| `deliveries/003-multi-project-multi-voice/notes.md` | Per-clip table, voice/backend matrix, closed-vs-open qa-report findings |
| `deliveries/003-multi-project-multi-voice/qa-report.json` | Raw `synthbanshee qa-report --run-summary` output, committed for audit |
| `data/he/agg_m_30-45_001/` | 10 clips, regenerated; all have `generation_metadata`, `voice_family`, measured `normalized_dbfs`, dirty files |
| `data/he/agg_m_30-45_002/` | new dir — 2 clips with Google Chirp HD voices |
| `data/he/ben_m_40-55_003/` | new dir — 8 elephant Tier B clips |
| `data/he/manifest.csv` | Regenerated; now includes `voice_families` column |
| `assets/speech/`, `assets/scripts/` | Cache files (committed per the corpus's "never delete a cache file" rule) |
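
For consumer-side validators, the difference between the two `has_violence` formulas quoted in the `CLAUDE.md` row can be sketched as follows. This is an illustration under assumptions, not the SynthBanshee implementation: the `Event` dataclass is a hypothetical stand-in for the real event model, and `"PHYS"` is a made-up category used only to have a non-`"NONE"` value.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Only the field named in the rule is modelled; "NONE" marks a
    # non-violent event. Other category names here are hypothetical.
    tier1_category: str


def has_violence(events: list[Event]) -> bool:
    """Events-based rule quoted in this PR."""
    return any(e.tier1_category != "NONE" for e in events)


def has_violence_old(typology: str, max_intensity: int) -> bool:
    """The formula this PR removes from CLAUDE.md, kept for comparison."""
    return typology in {"SV", "IT", "NEG"} and max_intensity >= 3


# A clip where the two rules disagree: high intensity, but no labelled
# violent event. Values are illustrative.
calm_events = [Event("NONE"), Event("NONE")]
print(has_violence_old("NEG", 3))  # True under the old formula
print(has_violence(calm_events))   # False under the events-based rule
```

The events-based rule depends only on the labelled events, so a parser can derive it from the strong labels without knowing the scene typology.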

QA snapshot — closed since delivery 002

| Finding | 002 | 003 |
|---|---|---|
| `agg_no_escalation` | 3 clips | 0 (AGG RMS now escalates post-M3) |
| `warn_no_overlap` | 4 clips | 0 — `overlap_ratio: 100%` on I4+ (post-M8a) |
| `warn_emotion_downgrade` | 4 clips | 0 — `emotion_downgrade_ratio: 0%` |
| `generation_metadata` absent | 0 of 8 had it | 20 of 20 carry the block |
| `dirty_file_path` null | 7 of 8 | 0 of 20 — all retained |
| `normalized_dbfs` hardcoded `-1.0` | 8 of 8 | fixed (#102) |
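
With `normalized_dbfs` now measured rather than hardcoded, a consumer can compare it against the recorded target to spot loudness drift. A minimal sketch, assuming the clip JSON nests the two fields exactly as their dotted names suggest; the 0.5 dB tolerance and the sample values are illustrative assumptions, not corpus policy.

```python
def loudness_drift_db(clip: dict) -> float:
    """Measured post-preprocess peak minus the target peak, in dB."""
    measured = clip["preprocessing_applied"]["normalized_dbfs"]
    target = clip["generation_metadata"]["loudness_target_peak_dbfs"]
    return measured - target


def has_drifted(clip: dict, tolerance_db: float = 0.5) -> bool:
    """Flag clips whose measured peak is further from target than the
    tolerance. The default tolerance is a hypothetical choice."""
    return abs(loudness_drift_db(clip)) > tolerance_db


# Illustrative clip records, not taken from the corpus.
quiet_clip = {
    "preprocessing_applied": {"normalized_dbfs": -2.6},
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
}
on_target_clip = {
    "preprocessing_applied": {"normalized_dbfs": -2.1},
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
}

print(has_drifted(quiet_clip), has_drifted(on_target_clip))
```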

Still open (documented honestly)

low_voice_diversity_male / low_voice_diversity_female — corpus now has 2 voice families per gender (up from 1); the run-level threshold is ≥3, so the warnings continue to fire. Partial progress, not a cleared warning.
single_backend — misleading: the corpus actually uses Azure + Google. The qa-report counts clip.tts_engine, which is currently hardcoded to "azure_he_IL" in cli.py:_run_generate_pipeline. This is a follow-up synthbanshee labeling bug, not a real diversity finding. The real backend distribution is correctly recorded in generation_metadata.tts_backend per clip and in speakers[].voice_family.
vic_f0_high — 2 clips: sp_it_a_0003_00 and sp_sv_a_0003_00. Google Chirp HD female (Achernar) baselines F0 higher than the Azure Hila reference the M10a thresholds were calibrated against.
Hebrew TTS naturalness items in DataHackIL/SynthBanshee#92 — out of scope for this delivery.
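
The voice-diversity warning above reduces to counting distinct voice families per gender against the ≥3 threshold. The sketch below is not the qa-report implementation: the flat `gender` / `voice_family` record shape is an assumption, and the male/female assignment of Avri and Achird is inferred from the "2 voice families per gender" statement rather than stated outright.

```python
from collections import defaultdict

MIN_FAMILIES_PER_GENDER = 3  # run-level threshold stated in this PR


def low_diversity_genders(speakers: list[dict]) -> list[str]:
    """Return genders with fewer distinct voice families than the threshold."""
    families: dict[str, set[str]] = defaultdict(set)
    for speaker in speakers:
        families[speaker["gender"]].add(speaker["voice_family"])
    return sorted(
        gender
        for gender, seen in families.items()
        if len(seen) < MIN_FAMILIES_PER_GENDER
    )


# Hypothetical record shape; the four voice names come from the table above.
speakers = [
    {"gender": "male", "voice_family": "Avri"},
    {"gender": "female", "voice_family": "Hila"},
    {"gender": "male", "voice_family": "Achird"},
    {"gender": "female", "voice_family": "Achernar"},
]

print(low_diversity_genders(speakers))  # both genders still below threshold
```

Adding one more voice family per gender would clear both warnings under this reading of the threshold.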

Test plan

synthbanshee validate on each of the 20 clips — all VALID
synthbanshee qa-report --run-summary — failure rate 0.0%, 0 quality-flagged via the failure path (15 carry warning flags, all expected per quality_flagged_clips notes above)
All clip JSONs parse via current ClipMetadata (Pydantic-validated)
All .wav/.txt/.json/.jsonl quartets present (80 of 80 files)
All clip JSONs use relative paths for dirty_file_path and transcript_path (matches the delivery-002 convention)
Manifest wav_path and strong_labels_path are repo-relative
Speaker IDs are uppercase as values (matching the new spec §2.5 casing rule); directory names are lowercase
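
Two of the checks in this test plan, quartet completeness and relative paths, are easy to reproduce on the consumer side. A minimal sketch using only the standard library; both helper names are hypothetical and the directory layout is assumed to be one flat folder of clip files per speaker.

```python
from pathlib import Path, PurePosixPath

QUARTET = (".wav", ".txt", ".json", ".jsonl")


def missing_quartet_files(speaker_dir: Path) -> list[str]:
    """List the files needed to complete every clip quartet in one directory."""
    # Any file with a quartet extension defines a clip stem to check.
    stems = {p.stem for p in speaker_dir.iterdir() if p.suffix in QUARTET}
    return sorted(
        f"{stem}{ext}"
        for stem in stems
        for ext in QUARTET
        if not (speaker_dir / f"{stem}{ext}").exists()
    )


def is_repo_relative(path: str) -> bool:
    """True for POSIX-style paths that are not anchored at the filesystem root."""
    return not PurePosixPath(path).is_absolute()
```

Running `missing_quartet_files` over each speaker directory and expecting an empty list everywhere corresponds to the "80 of 80 files" line above.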

Tier-3 ASR sanity (local)

The synthbanshee changes (#106 in particular) altered SSML output for any scene that emits a stress phrase hint. Will run synthbanshee qa-report --asr and append the result here as a comment before merge. The Tier-3 Whisper sanity check is local-only per the CLAUDE.md policy.

🤖 Generated with Claude Code

…main

Replaces delivery 002. First handoff target for the She-Proves and Elephant consumer teams.

## Contents

- **She-Proves Tier A — Azure pair (10 clips)** in `agg_m_30-45_001/`: 2 IT, 2 SV, 3 NEG, 3 NEU (Avri + Hila).
- **She-Proves Tier A — Google Chirp HD pair (2 clips)** in `agg_m_30-45_002/`: 1 IT, 1 SV (sister scenes to sp_*_a_0001, authored as PR DataHackIL/SynthBanshee#105). Provides the voice + backend diversity vehicle for this delivery.
- **Elephant Tier B (8 clips)** in `ben_m_40-55_003/`: 2 each of IT/SV/NEG/NEU with `acoustic_scene` (clinic_office room IR + pi_budget_mic device + HVAC ambient).

Total: 20 clips, ~41.7 min. All pass `synthbanshee validate` and `synthbanshee qa-report` (failure rate 0.0%). Full QA snapshot at [`deliveries/003-multi-project-multi-voice/qa-report.json`](deliveries/003-multi-project-multi-voice/qa-report.json).

## Pipeline corrections delivered

This delivery is the first to surface 4 synthbanshee fixes landed in the past day:

- DataHackIL/SynthBanshee#102 — `preprocessing_applied.normalized_dbfs` now records the *measured* post-preprocess peak (was hardcoded `-1.0`). Pair with `generation_metadata.loudness_target_peak_dbfs` to diagnose loudness drift; the schema docstring at `labels/schema.py:175` pins the measured-vs-target split.
- DataHackIL/SynthBanshee#103 — `docs/spec.md` pins the `has_violence` derivation rule (`any(e.tier1_category != "NONE")`), adds the §2.5 identifier-casing table, rewrites §5.1 field notes.
- DataHackIL/SynthBanshee#105 — adds `sp_sv_a_0003` + `sp_it_a_0003` Google-pair shadow scenes.
- DataHackIL/SynthBanshee#106 — root cause for #72: `_HINT_DEFAULTS` was emitting nested `<prosody volume="+NdB">` inside outer `<prosody volume="+N%">`, which Azure rejects with SSML parse error 0x80045003. Required to unblock 6 of 8 elephant Tier B scenes; without the fix, every scene whose LLM script carries a `stress` phrase hint at intensity ≥ 3 failed reliably.

## Doc updates in this PR

- `README.md`: tightened "Clip ID and filename conventions" to point at SynthBanshee `docs/spec.md` §2.5; rewrote the `has_violence` paragraph to the events-based rule; updated the audio-format section to the measured-vs-target split; replaced the v1-limitations block with a pointer to per-delivery notes.
- `CLAUDE.md`: replaced the wrong `has_violence` formula with the events-based rule; expanded the audio-format table to match the spec's measured-vs-target distinction.
- `DELIVERIES.md`: delivery 002 marked `superseded`; new row for 003.
- `deliveries/003-multi-project-multi-voice/`:
  - `metadata.yaml` — structured delivery record.
  - `notes.md` — full per-clip table, voice/backend matrix, closed-vs-open qa-report findings.
  - `qa-report.json` — raw qa-report output (committed for audit).

## QA snapshot

Closed since delivery 002:

| Finding | 002 | 003 |
|---|---|---|
| `agg_no_escalation` | 3 clips | 0 |
| `warn_no_overlap` | 4 clips | 0 (overlap_ratio 100% on I4+) |
| `warn_emotion_downgrade` | 4 clips | 0 |
| `generation_metadata` absent | 0 of 8 had it | 20 of 20 have it |
| `dirty_file_path` null | 7 of 8 | 0 of 20 |
| `normalized_dbfs` hardcoded `-1.0` | 8 of 8 | fixed (#102) |

Still open: `low_voice_diversity_*` (now 2 voices per gender, threshold is ≥3 — partial progress 1 → 2); `single_backend` (misleading; see notes for explanation of the hardcoded `tts_engine` labeling bug); `vic_f0_high` on the 2 Google Chirp HD female-voice clips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

for more information, see https://pre-commit.ci

github-actions · 2026-05-11T22:02:29Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #4 in repository https://github.com/DataHackIL/avdp-synth-corpus. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25699894592 attempt 1
Comment timestamp: 2026-05-11T22:01:38.875983+00:00
PR head commit: 200c779aea58e3e46c30cdb33d938ee6b50e6df2

…positive resolved (#5)

* data(delivery-003): regen post-#107/#108/#109 — single_backend false positive resolved

Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3` (tip of main after PRs #110, #111, #112 merged earlier today). All 20 clips regenerated via `synthbanshee generate-batch` with corpus paths anchored at `--data-root` so clip JSON / manifest paths are repo-relative.

## Why a regen rather than in-place post-processing

Two motivations made the regen worthwhile here:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16s for both batches.
2. **Highest fidelity guarantee.** The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.

## Schema changes in this delivery vs original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent** — Pydantic drops it; backend is `generation_metadata.tts_backend` per speaker (PR #112) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (PR #111) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (PRs #110, #111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (PR #111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (PR #112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive — corpus actually has 2 backends) | `single_backend` **resolved**; only `low_voice_diversity_*` remain (legitimate; threshold is ≥3, corpus has 2) |

## Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original 2026-05-12 delivery (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Duration shifted by 2.2s (corpus-wide total: 2500.79 → 2498.63 s).

## What changed in this commit

- `data/he/**/*.json` (20 files): schema migration. No `tts_engine`; relative paths.
- `data/he/agg_m_30-45_002/sp_sv_a_0003_00.{wav,txt,jsonl}`: Google re-render.
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train` (4 speakers across 20 clips, no speaker-disjoint partition possible).
- `assets/speech/*.wav` (6 new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; new field shape; `single_backend` false positive gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: pinned new SynthBanshee commit (`1ea48f3`); added PRs #110/#111/#112 to `related_prs`; new `qa_findings_closed_post_regen_2026_05_12` section; new `regen_2026_05_12` block documenting reason/cost/changes.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections; "still open" QA findings list pared down.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.

## Test plan

- [x] `synthbanshee qa-report data/he --output deliveries/.../qa-report.json --run-summary` — failure rate 0.0%, 20 clips, no `single_backend` warning
- [x] `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit) — both VALID
- [x] `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- [x] Manifest CSV `wav_path` column verified repo-relative for all 20 rows

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`, `synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs — none touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
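
The "SHA-256 SSML cache" behaviour described in this commit message (identical SSML yields a cache hit and byte-identical audio) can be illustrated with a content-addressed key. This is a sketch of the general technique only: the commit message does not say what goes into the hash, so the key composition (voice name plus SSML text) and the `ssml_cache_key` helper are assumptions.

```python
import hashlib


def ssml_cache_key(ssml: str, voice: str) -> str:
    """Derive a deterministic cache key from the synthesis inputs.

    Assumption: the key covers the voice name and the SSML text. The real
    cache may hash different or additional inputs.
    """
    payload = f"{voice}\n{ssml}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same inputs give the same key (cache hit); any change to the SSML or
# the voice gives a new key (re-render). Inputs are illustrative.
key_a = ssml_cache_key("<speak>shalom</speak>", "voice-a")
key_b = ssml_cache_key("<speak>shalom</speak>", "voice-a")
key_c = ssml_cache_key("<speak>shalom</speak>", "voice-b")
print(key_a == key_b, key_a == key_c)
```

Under this scheme a metadata-only change leaves every key untouched, which is consistent with the report that most WAVs came back byte-identical.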

shaypal5 added the enhancement and data labels May 11, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

200c779

for more information, see https://pre-commit.ci

shaypal5 merged commit 08a95ec into main May 12, 2026
3 checks passed

shaypal5 deleted the feat/delivery-003-20clip-multi-project branch May 12, 2026 04:32

shaypal5 mentioned this pull request May 12, 2026

data(delivery-003): regen post-#107/#108/#109 — single_backend false positive resolved #5

Merged

5 tasks

shaypal5 merged 2 commits into main from feat/delivery-003-20clip-multi-project
