
feat(delivery-003): 20-clip multi-project, multi-voice toy corpus#4

Merged
shaypal5 merged 2 commits into main from feat/delivery-003-20clip-multi-project
May 12, 2026

Conversation

@shaypal5

Member

Summary

Replaces delivery 002. First handoff target for the She-Proves and Elephant consumer teams: a 20-clip toy corpus designed to bootstrap consumer-side schema parsers, manifest loaders, and taxonomy validators — not for model training itself.

| Project | Tier | Speaker dir | Clips | TTS backend |
|---|---|---|---|---|
| she_proves | A | agg_m_30-45_001/ | 10 (2 IT, 2 SV, 3 NEG, 3 NEU) | Azure (Avri/Hila) |
| she_proves | A | agg_m_30-45_002/ | 2 (1 IT, 1 SV; Google-pair shadow scenes) | Google Chirp HD (Achird/Achernar) |
| elephant | B | ben_m_40-55_003/ | 8 (2 IT, 2 SV, 2 NEG, 2 NEU) | Azure (Avri/Hila) + room IR + Pi mic |

Total: 20 clips, ~41.7 min. 4 unique voice families across 2 TTS backends. All clips pass synthbanshee validate (0 failures) and synthbanshee qa-report (0% failure rate).

Full delivery record: deliveries/003-multi-project-multi-voice/ — includes metadata.yaml, notes.md with the per-clip table, and the raw qa-report.json.

Synthbanshee changes shipped with this delivery

This delivery is the first to carry four synthbanshee corrections landed in the past day:

  • DataHackIL/SynthBanshee#102: preprocessing_applied.normalized_dbfs now records the measured post-preprocess peak (was hardcoded -1.0). Pair with generation_metadata.loudness_target_peak_dbfs to diagnose loudness drift.
  • DataHackIL/SynthBanshee#103: docs/spec.md pins the has_violence derivation rule, adds the §2.5 identifier-casing table, and rewrites §5.1 field notes.
  • DataHackIL/SynthBanshee#105: adds sp_sv_a_0003 + sp_it_a_0003 Google-pair sister scenes (this delivery's voice-diversity vehicle).
  • DataHackIL/SynthBanshee#106: fixes the root cause of the long-running [#72](https://github.com/DataHackIL/SynthBanshee/issues/72) ("SSML parsing error 0x80045003, unable to reproduce"). _HINT_DEFAULTS["stress"] was emitting nested <prosody volume="+NdB"> inside outer <prosody volume="+N%">, which Azure rejects. Required to unblock this delivery: without it, 6 of 8 elephant Tier B scenes (every one whose LLM script carries a stress phrase hint at intensity ≥3) failed reliably with an Azure SSML parse error.

What changed in this PR (file-by-file summary)

| Path | Change |
|---|---|
| README.md | Tightened "Clip ID and filename conventions" to point at the new spec §2.5 casing table; rewrote has_violence paragraph with the events-based rule; updated audio-format section for the post-#78 measured-vs-target split; replaced v1-limitations block with a pointer to per-delivery notes |
| CLAUDE.md | Replaced the wrong has_violence formula (typology in {SV,IT,NEG} and max_intensity ≥ 3) with the events-based rule (any(e.tier1_category != "NONE")); audio-format table now distinguishes target (-2.0 dBFS), limiter (-1.0 dBFS), and where each is recorded in metadata |
| DELIVERIES.md | Delivery 002 marked superseded; new row for 003 |
| deliveries/003-multi-project-multi-voice/metadata.yaml | Structured delivery record |
| deliveries/003-multi-project-multi-voice/notes.md | Per-clip table, voice/backend matrix, closed-vs-open qa-report findings |
| deliveries/003-multi-project-multi-voice/qa-report.json | Raw synthbanshee qa-report --run-summary output, committed for audit |
| data/he/agg_m_30-45_001/ | 10 clips, regenerated; all have generation_metadata, voice_family, measured normalized_dbfs, dirty files |
| data/he/agg_m_30-45_002/ | New dir: 2 clips with Google Chirp HD voices |
| data/he/ben_m_40-55_003/ | New dir: 8 elephant Tier B clips |
| data/he/manifest.csv | Regenerated; now includes voice_families column |
| assets/speech/, assets/scripts/ | Cache files (committed per the corpus's "never delete a cache file" rule) |
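
The difference between the two has_violence rules mentioned for CLAUDE.md is easiest to see side by side. This is a minimal sketch, assuming a simplified `Event` type with only a `tier1_category` field; the real ClipMetadata model is richer, and the `"PHYS"` category label is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass
class Event:
    tier1_category: str  # "NONE" marks an event with no violence category


def has_violence(events: list[Event]) -> bool:
    """Events-based rule: true if any event has a non-NONE tier-1 category."""
    return any(e.tier1_category != "NONE" for e in events)


def has_violence_old(typology: str, max_intensity: int) -> bool:
    """The formula that was removed from CLAUDE.md, kept here for comparison."""
    return typology in {"SV", "IT", "NEG"} and max_intensity >= 3


# A NEG clip at intensity 3 whose events are all NONE: the rules disagree.
calm_events = [Event("NONE"), Event("NONE")]
print(has_violence_old("NEG", 3))   # True
print(has_violence(calm_events))    # False

# "PHYS" is a made-up category name used only for this example.
print(has_violence([Event("NONE"), Event("PHYS")]))  # True
```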

QA snapshot — closed since delivery 002

| Finding | 002 | 003 |
|---|---|---|
| agg_no_escalation | 3 clips | 0 (AGG RMS now escalates post-M3) |
| warn_no_overlap | 4 clips | 0 (overlap_ratio: 100% on I4+, post-M8a) |
| warn_emotion_downgrade | 4 clips | 0 (emotion_downgrade_ratio: 0%) |
| generation_metadata absent | 0 of 8 had it | 20 of 20 carry the block |
| dirty_file_path null | 7 of 8 | 0 of 20 (all retained) |
| normalized_dbfs hardcoded -1.0 | 8 of 8 | fixed (#102) |

Still open (documented honestly)

  • low_voice_diversity_male / low_voice_diversity_female — corpus now has 2 voice families per gender (up from 1); the run-level threshold is ≥3, so the warnings continue to fire. Partial progress, not a clear.
  • single_backend is misleading: the corpus actually uses Azure + Google. The qa-report counts clip.tts_engine, which is currently hardcoded to "azure_he_IL" in cli.py:_run_generate_pipeline. This is a follow-up synthbanshee labeling bug, not a real diversity finding. The real backend distribution is correctly recorded in generation_metadata.tts_backend per clip and in speakers[].voice_family.
  • vic_f0_high — 2 clips: sp_it_a_0003_00 and sp_sv_a_0003_00. Google Chirp HD female (Achernar) baselines F0 higher than the Azure Hila reference the M10a thresholds were calibrated against.
  • Hebrew TTS naturalness items in DataHackIL/SynthBanshee#92 — out of scope for this delivery.
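
The voice-diversity warnings above follow from a simple count. The sketch below assumes the run-level check counts distinct voice families per gender against the ≥3 threshold stated above; the function name and the flat `(gender, voice_family)` input shape are hypothetical, and the gender assignment for Avri and Achird is inferred from the pairings in this description rather than stated outright.

```python
from collections import defaultdict

MIN_VOICE_FAMILIES_PER_GENDER = 3  # run-level threshold quoted in this PR


def voice_diversity_warnings(speakers: list[tuple[str, str]]) -> list[str]:
    """Return low_voice_diversity_<gender> for each gender under the threshold."""
    families: dict[str, set[str]] = defaultdict(set)
    for gender, voice_family in speakers:
        families[gender].add(voice_family)
    return sorted(
        f"low_voice_diversity_{gender}"
        for gender, seen in families.items()
        if len(seen) < MIN_VOICE_FAMILIES_PER_GENDER
    )


# Two families per gender, as in this delivery (genders for Avri/Achird assumed).
corpus = [
    ("male", "Avri"), ("female", "Hila"),
    ("male", "Achird"), ("female", "Achernar"),
]
print(voice_diversity_warnings(corpus))
# ['low_voice_diversity_female', 'low_voice_diversity_male']
```

Adding one more family per gender would clear both warnings under this reading of the threshold.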

Test plan

  • synthbanshee validate on each of the 20 clips — all VALID
  • synthbanshee qa-report --run-summary — failure rate 0.0%, 0 quality-flagged via the failure path (15 carry warning flags, all expected per quality_flagged_clips notes above)
  • All clip JSONs parse via current ClipMetadata (Pydantic-validated)
  • All .wav/.txt/.json/.jsonl quartets present (80 of 80 files)
  • All clip JSONs use relative paths for dirty_file_path and transcript_path (matches the delivery-002 convention)
  • Manifest wav_path and strong_labels_path are repo-relative
  • Speaker IDs are uppercase as values (matching the new spec §2.5 casing rule); directory names are lowercase
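
The quartet check in the test plan can be approximated with a short script. This is a sketch under the assumption that the four files of a clip share one stem and sit in the same directory; the helper name is hypothetical and this is not the synthbanshee implementation.

```python
from pathlib import Path

QUARTET_SUFFIXES = (".wav", ".txt", ".json", ".jsonl")


def missing_quartet_files(data_root: Path) -> list[Path]:
    """List every sibling file that a .wav under data_root is missing."""
    missing: list[Path] = []
    for wav in sorted(data_root.rglob("*.wav")):
        for suffix in QUARTET_SUFFIXES:
            sibling = wav.with_suffix(suffix)
            if not sibling.exists():
                missing.append(sibling)
    return missing


if __name__ == "__main__":
    # data/he is the corpus root used in this PR; an empty list means
    # every clip found there has its full quartet.
    for path in missing_quartet_files(Path("data/he")):
        print(f"missing: {path}")
```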

Tier-3 ASR sanity (local)

The synthbanshee changes (#106 in particular) altered SSML output for any scene that emits a stress phrase hint. I will run synthbanshee qa-report --asr and append the result here as a comment before merge. The Tier-3 Whisper sanity check is local-only per the CLAUDE.md policy.

🤖 Generated with Claude Code

…main

Replaces delivery 002.  First handoff target for the She-Proves and
Elephant consumer teams.

## Contents

- **She-Proves Tier A — Azure pair (10 clips)** in `agg_m_30-45_001/`:
  2 IT, 2 SV, 3 NEG, 3 NEU (Avri + Hila).
- **She-Proves Tier A — Google Chirp HD pair (2 clips)** in
  `agg_m_30-45_002/`: 1 IT, 1 SV (sister scenes to sp_*_a_0001,
  authored as PR DataHackIL/SynthBanshee#105).  Provides the
  voice + backend diversity vehicle for this delivery.
- **Elephant Tier B (8 clips)** in `ben_m_40-55_003/`: 2 each of
  IT/SV/NEG/NEU with `acoustic_scene` (clinic_office room IR +
  pi_budget_mic device + HVAC ambient).

Total: 20 clips, ~41.7 min.  All pass `synthbanshee validate` and
`synthbanshee qa-report` (failure rate 0.0%).  Full QA snapshot at
[`deliveries/003-multi-project-multi-voice/qa-report.json`](deliveries/003-multi-project-multi-voice/qa-report.json).

## Pipeline corrections delivered

This delivery is the first to surface 4 synthbanshee fixes landed in
the past day:

- DataHackIL/SynthBanshee#102 — `preprocessing_applied.normalized_dbfs`
  now records the *measured* post-preprocess peak (was hardcoded
  `-1.0`).  Pair with `generation_metadata.loudness_target_peak_dbfs`
  to diagnose loudness drift; the schema docstring at
  `labels/schema.py:175` pins the measured-vs-target split.
- DataHackIL/SynthBanshee#103 — `docs/spec.md` pins the
  `has_violence` derivation rule (`any(e.tier1_category != "NONE")`),
  adds the §2.5 identifier-casing table, rewrites §5.1 field notes.
- DataHackIL/SynthBanshee#105 — adds `sp_sv_a_0003` + `sp_it_a_0003`
  Google-pair shadow scenes.
- DataHackIL/SynthBanshee#106 — root cause for #72: `_HINT_DEFAULTS`
  was emitting nested `<prosody volume="+NdB">` inside outer
  `<prosody volume="+N%">`, which Azure rejects with SSML parse
  error 0x80045003.  Required to unblock 6 of 8 elephant Tier B
  scenes; without the fix, every scene whose LLM script carries a
  `stress` phrase hint at intensity ≥ 3 failed reliably.
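
With #102 in place, loudness drift can be read directly off a clip's JSON. The sketch below assumes the two fields sit at the dotted paths quoted above; the helper name and the sample measured value of -2.3 are illustrative, while the -2.0 dBFS target is the one documented in CLAUDE.md.

```python
def loudness_drift_db(clip: dict) -> float:
    """Measured post-preprocess peak minus the target peak, in dB."""
    measured = clip["preprocessing_applied"]["normalized_dbfs"]
    target = clip["generation_metadata"]["loudness_target_peak_dbfs"]
    return measured - target


# Illustrative clip metadata; only the two relevant fields are shown.
clip = {
    "preprocessing_applied": {"normalized_dbfs": -2.3},
    "generation_metadata": {"loudness_target_peak_dbfs": -2.0},
}
print(f"{loudness_drift_db(clip):+.1f} dB")  # -0.3 dB
```

A negative value means the clip peaks below its target; before #102 the measured side was always -1.0, so this difference carried no information.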

## Doc updates in this PR

- `README.md`: tightened "Clip ID and filename conventions" to
  point at SynthBanshee `docs/spec.md` §2.5; rewrote the
  `has_violence` paragraph to the events-based rule; updated the
  audio-format section to the measured-vs-target split; replaced
  the v1-limitations block with a pointer to per-delivery notes.
- `CLAUDE.md`: replaced the wrong `has_violence` formula with the
  events-based rule; expanded the audio-format table to match the
  spec's measured-vs-target distinction.
- `DELIVERIES.md`: delivery 002 marked `superseded`; new row for 003.
- `deliveries/003-multi-project-multi-voice/`:
  - `metadata.yaml` — structured delivery record.
  - `notes.md` — full per-clip table, voice/backend matrix,
    closed-vs-open qa-report findings.
  - `qa-report.json` — raw qa-report output (committed for audit).

## QA snapshot

Closed since delivery 002:

| Finding | 002 | 003 |
|---|---|---|
| `agg_no_escalation` | 3 clips | 0 |
| `warn_no_overlap` | 4 clips | 0 (overlap_ratio 100% on I4+) |
| `warn_emotion_downgrade` | 4 clips | 0 |
| `generation_metadata` absent | 0 of 8 had it | 20 of 20 have it |
| `dirty_file_path` null | 7 of 8 | 0 of 20 |
| `normalized_dbfs` hardcoded `-1.0` | 8 of 8 | fixed (#102) |

Still open: `low_voice_diversity_*` (now 2 voices per gender, threshold
is ≥3 — partial progress 1 → 2); `single_backend` (misleading; see
notes for explanation of the hardcoded `tts_engine` labeling bug);
`vic_f0_high` on the 2 Google Chirp HD female-voice clips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added the enhancement and data labels on May 11, 2026
@github-actions


pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #4 in repository https://github.com/DataHackIL/avdp-synth-corpus. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25699894592 attempt 1
Comment timestamp: 2026-05-11T22:01:38.875983+00:00
PR head commit: 200c779aea58e3e46c30cdb33d938ee6b50e6df2

shaypal5 merged commit 08a95ec into main on May 12, 2026
3 checks passed
shaypal5 deleted the feat/delivery-003-20clip-multi-project branch on May 12, 2026 04:32
shaypal5 added a commit that referenced this pull request May 12, 2026
…positive resolved (#5)

* data(delivery-003): regen post-#107/#108/#109 — single_backend false positive resolved

Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3` (tip of
main after PRs #110, #111, #112 merged earlier today). All 20 clips
regenerated via `synthbanshee generate-batch` with corpus paths anchored
at `--data-root` so clip JSON / manifest paths are repo-relative.

## Why a regen rather than in-place post-processing

Two motivations made the regen worthwhile here:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the
   SHA-256 SSML cache and re-rendered byte-identical WAVs. The two
   Google Chirp HD clips re-rendered, costing fractions of a cent.
   Total wall time: ~16s for both batches.

2. **Highest fidelity guarantee.** The regen produces canonical artifacts
   straight from the pipeline; no manual JSON editing or path-string
   sed-ing. The generator is now the source of truth for what
   delivery-003 should look like under the post-#109 schema.
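
The near-zero regen cost follows from content-addressed caching. The following is a minimal sketch of that idea, assuming the cache key is the SHA-256 of the SSML string alone and that cached audio is stored as `<digest>.wav`; the real synthbanshee key and file layout may include more inputs, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def ssml_cache_path(ssml: str, cache_dir: Path) -> Path:
    """Map an SSML string to a deterministic cache file path."""
    digest = hashlib.sha256(ssml.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.wav"


def render_with_cache(
    ssml: str, cache_dir: Path, synthesize: Callable[[str], bytes]
) -> bytes:
    """Return cached audio when present; otherwise synthesize and store it."""
    path = ssml_cache_path(ssml, cache_dir)
    if path.exists():
        return path.read_bytes()  # cache hit: no backend call, identical bytes
    audio = synthesize(ssml)      # cache miss: one paid backend call
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Under this scheme an unchanged SSML string always resolves to the same file, so a regen re-emits identical WAV bytes for it, and committing the cache files keeps that property across machines.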

## Schema changes in this delivery vs original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent** — Pydantic drops it; backend is `generation_metadata.tts_backend` per speaker (PR #112) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (PR #111) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (PRs #110, #111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (PR #111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (PR #112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive — corpus actually has 2 backends) | `single_backend` **resolved**; only `low_voice_diversity_*` remain (legitimate; threshold is ≥3, corpus has 2) |
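
The last two rows can be illustrated with a small aggregation. This sketch assumes each clip exposes a single backend string at `generation_metadata.tts_backend` (a simplification, since the table describes the field as recorded per speaker) and that `single_backend` fires when fewer than two backends are seen; both function names are hypothetical.

```python
from collections import Counter


def clips_by_tts_backend(clips: list[dict]) -> dict[str, int]:
    """Count clips per backend from per-clip generation metadata."""
    return dict(Counter(c["generation_metadata"]["tts_backend"] for c in clips))


def backend_warnings(counts: dict[str, int]) -> list[str]:
    """Assumed rule: warn when the run uses fewer than two backends."""
    return ["single_backend"] if len(counts) < 2 else []


# Delivery-003 shape: 18 Azure clips and 2 Google clips.
clips = (
    [{"generation_metadata": {"tts_backend": "azure"}}] * 18
    + [{"generation_metadata": {"tts_backend": "google"}}] * 2
)
counts = clips_by_tts_backend(clips)
print(counts)                    # {'azure': 18, 'google': 2}
print(backend_warnings(counts))  # []
```

Counting from a label that was hardcoded to one value yields a single bucket regardless of the real backends, which is how the earlier false positive arose.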

## Audio integrity

- 19 of 20 clips: WAV bytes are identical to the original 2026-05-12 delivery (18 of these are Azure SSML cache hits). Only the metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Duration shifted by 2.2s (corpus-wide total: 2500.79 → 2498.63 s).

## What changed in this commit

- `data/he/**/*.json` (20 files): schema migration. No `tts_engine`; relative paths.
- `data/he/agg_m_30-45_002/sp_sv_a_0003_00.{wav,txt,jsonl}`: Google re-render.
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train` (4 speakers across 20 clips, no speaker-disjoint partition possible).
- `assets/speech/*.wav` (6 new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; new field shape; `single_backend` false positive gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: pinned new SynthBanshee commit (`1ea48f3`); added PRs #110/#111/#112 to `related_prs`; new `qa_findings_closed_post_regen_2026_05_12` section; new `regen_2026_05_12` block documenting reason/cost/changes.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections; "still open" QA findings list pared down.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.

## Test plan

- [x] `synthbanshee qa-report data/he --output deliveries/.../qa-report.json --run-summary` — failure rate 0.0%, 20 clips, no `single_backend` warning
- [x] `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit) — both VALID
- [x] `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- [x] Manifest CSV `wav_path` column verified repo-relative for all 20 rows
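
The `jq` spot-checks above could also be expressed as one reusable function. This is a sketch, assuming the clip JSON has top-level `transcript_path` and `dirty_file_path` fields and a `generation_metadata.tts_backend` entry; the helper name and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def post_regen_schema_problems(clip: dict) -> list[str]:
    """Return human-readable problems; an empty list means the clip passes."""
    problems: list[str] = []
    if "tts_engine" in clip:
        problems.append("tts_engine still present")
    for field in ("transcript_path", "dirty_file_path"):
        value = clip.get(field)
        if not value:
            problems.append(f"{field} missing")
        elif PurePosixPath(value).is_absolute():
            problems.append(f"{field} is absolute")
    if not clip.get("generation_metadata", {}).get("tts_backend"):
        problems.append("generation_metadata.tts_backend not populated")
    return problems


# Illustrative clip record with made-up file names.
clip = {
    "transcript_path": "data/he/example_speaker/example_clip.txt",
    "dirty_file_path": "assets/speech/dirty/example_clip.wav",
    "generation_metadata": {"tts_backend": "azure"},
}
print(post_regen_schema_problems(clip))  # []
```

Run over every JSON under `data/he`, this would extend the three-file spot-check to all 20 clips.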

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen
touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`,
`synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs — none
touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`,
nor any speaker / scene / acoustic / project YAML config. The audio
pipeline itself is unchanged, and the Azure cache hits prove it (19/20
bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check
policy", the regen is exempt from the local `qa-report --asr` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Labels

data, enhancement
