feat(delivery-003): 20-clip multi-project, multi-voice toy corpus #4
Merged
…main

Replaces delivery 002. First handoff target for the She-Proves and Elephant consumer teams.

## Contents

- **She-Proves Tier A, Azure pair (10 clips)** in `agg_m_30-45_001/`: 2 IT, 2 SV, 3 NEG, 3 NEU (Avri + Hila).
- **She-Proves Tier A, Google Chirp HD pair (2 clips)** in `agg_m_30-45_002/`: 1 IT, 1 SV (sister scenes to `sp_*_a_0001`, authored as PR DataHackIL/SynthBanshee#105). Provides the voice + backend diversity vehicle for this delivery.
- **Elephant Tier B (8 clips)** in `ben_m_40-55_003/`: 2 each of IT/SV/NEG/NEU with `acoustic_scene` (clinic_office room IR + pi_budget_mic device + HVAC ambient).

Total: 20 clips, ~41.7 min. All pass `synthbanshee validate` and `synthbanshee qa-report` (failure rate 0.0%). Full QA snapshot at [`deliveries/003-multi-project-multi-voice/qa-report.json`](deliveries/003-multi-project-multi-voice/qa-report.json).

## Pipeline corrections delivered

This delivery is the first to surface 4 synthbanshee fixes landed in the past day:

- DataHackIL/SynthBanshee#102: `preprocessing_applied.normalized_dbfs` now records the *measured* post-preprocess peak (was hardcoded `-1.0`). Pair with `generation_metadata.loudness_target_peak_dbfs` to diagnose loudness drift; the schema docstring at `labels/schema.py:175` pins the measured-vs-target split.
- DataHackIL/SynthBanshee#103: `docs/spec.md` pins the `has_violence` derivation rule (`any(e.tier1_category != "NONE")`), adds the §2.5 identifier-casing table, rewrites §5.1 field notes.
- DataHackIL/SynthBanshee#105: adds `sp_sv_a_0003` + `sp_it_a_0003` Google-pair shadow scenes.
- DataHackIL/SynthBanshee#106: root cause for #72. `_HINT_DEFAULTS` was emitting nested `<prosody volume="+NdB">` inside outer `<prosody volume="+N%">`, which Azure rejects with SSML parse error 0x80045003. Required to unblock 6 of 8 elephant Tier B scenes; without the fix, every scene whose LLM script carries a `stress` phrase hint at intensity ≥ 3 failed reliably.
## Doc updates in this PR

- `README.md`: tightened "Clip ID and filename conventions" to point at SynthBanshee `docs/spec.md` §2.5; rewrote the `has_violence` paragraph to the events-based rule; updated the audio-format section to the measured-vs-target split; replaced the v1-limitations block with a pointer to per-delivery notes.
- `CLAUDE.md`: replaced the wrong `has_violence` formula with the events-based rule; expanded the audio-format table to match the spec's measured-vs-target distinction.
- `DELIVERIES.md`: delivery 002 marked `superseded`; new row for 003.
- `deliveries/003-multi-project-multi-voice/`:
  - `metadata.yaml`: structured delivery record.
  - `notes.md`: full per-clip table, voice/backend matrix, closed-vs-open qa-report findings.
  - `qa-report.json`: raw qa-report output (committed for audit).

## QA snapshot

Closed since delivery 002:

| Finding | 002 | 003 |
|---|---|---|
| `agg_no_escalation` | 3 clips | 0 |
| `warn_no_overlap` | 4 clips | 0 (overlap_ratio 100% on I4+) |
| `warn_emotion_downgrade` | 4 clips | 0 |
| `generation_metadata` absent | 0 of 8 had it | 20 of 20 have it |
| `dirty_file_path` null | 7 of 8 | 0 of 20 |
| `normalized_dbfs` hardcoded `-1.0` | 8 of 8 | fixed (#102) |

Still open:

- `low_voice_diversity_*` (now 2 voices per gender, threshold is ≥3; partial progress 1 → 2);
- `single_backend` (misleading; see notes for explanation of the hardcoded `tts_engine` labeling bug);
- `vic_f0_high` on the 2 Google Chirp HD female-voice clips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
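The `low_voice_diversity_*` finding listed under "Still open" above comes down to counting distinct voices per gender against a threshold of 3. The sketch below shows that arithmetic under stated assumptions: the `speakers`, `gender`, and `voice_family` keys and the voice names are illustrative, and the real qa-report implementation may compute this differently.

```python
from collections import defaultdict

MIN_VOICE_FAMILIES = 3  # threshold quoted in the QA notes (>= 3)


def low_voice_diversity_warnings(clips, threshold=MIN_VOICE_FAMILIES):
    """Return warning names for genders with fewer distinct voices than threshold.

    Assumed clip shape (hypothetical): {"speakers": [{"gender": ..., "voice_family": ...}]}.
    A gender that never appears in the corpus produces no warning in this sketch.
    """
    families = defaultdict(set)
    for clip in clips:
        for speaker in clip.get("speakers", []):
            families[speaker["gender"]].add(speaker["voice_family"])
    return sorted(
        f"low_voice_diversity_{gender}"
        for gender, seen in families.items()
        if len(seen) < threshold
    )


# Made-up corpus with 2 voices per gender, mirroring the counts in the notes.
clips = [
    {"speakers": [{"gender": "male", "voice_family": "m1"},
                  {"gender": "female", "voice_family": "f1"}]},
    {"speakers": [{"gender": "male", "voice_family": "m2"},
                  {"gender": "female", "voice_family": "f2"}]},
]
print(low_voice_diversity_warnings(clips))
# ['low_voice_diversity_female', 'low_voice_diversity_male']
```

With two voices per gender both warnings still fire, which matches the "partial progress 1 → 2" wording: the count improved but remains under the threshold.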
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #4 in repository https://github.com/DataHackIL/avdp-synth-corpus. Treat this PR as all clear unless new signals appear.
shaypal5 added a commit that referenced this pull request on May 12, 2026
…positive resolved (#5)

* data(delivery-003): regen post-#107/#108/#109: single_backend false positive resolved

Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3` (tip of main after PRs #110, #111, #112 merged earlier today). All 20 clips regenerated via `synthbanshee generate-batch` with corpus paths anchored at `--data-root` so clip JSON / manifest paths are repo-relative.

## Why a regen rather than in-place post-processing

Two motivations made the regen worthwhile here:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16s for both batches.
2. **Highest fidelity guarantee.** The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.

## Schema changes in this delivery vs original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent**: Pydantic drops it; backend is `generation_metadata.tts_backend` per speaker (PR #112) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (PR #111) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (PRs #110, #111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (PR #111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (PR #112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive; corpus actually has 2 backends) | `single_backend` **resolved**; only `low_voice_diversity_*` remain (legitimate; threshold is ≥3, corpus has 2) |

## Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original 2026-05-12 delivery (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Duration shifted by 2.2s (corpus-wide total: 2500.79 → 2498.63 s).

## What changed in this commit

- `data/he/**/*.json` (20 files): schema migration. No `tts_engine`; relative paths.
- `data/he/agg_m_30-45_002/sp_sv_a_0003_00.{wav,txt,jsonl}`: Google re-render.
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train` (4 speakers across 20 clips, no speaker-disjoint partition possible).
- `assets/speech/*.wav` (6 new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; new field shape; `single_backend` false positive gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: pinned new SynthBanshee commit (`1ea48f3`); added PRs #110/#111/#112 to `related_prs`; new `qa_findings_closed_post_regen_2026_05_12` section; new `regen_2026_05_12` block documenting reason/cost/changes.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections; "still open" QA findings list pared down.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.
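The byte-identity claim in the audio-integrity section above is the kind of thing a hash comparison between two checkouts can confirm. The following is a rough sketch of such a comparison, assuming the pre-regen and post-regen trees are available side by side; `changed_wavs` and the directory layout are hypothetical and not part of the SynthBanshee CLI.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large WAVs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_wavs(before_root: Path, after_root: Path) -> list[str]:
    """List WAVs under after_root that are new or differ byte-wise from before_root.

    Hypothetical helper: returns POSIX-style paths relative to after_root.
    """
    changed = []
    for after in sorted(after_root.rglob("*.wav")):
        rel = after.relative_to(after_root)
        before = before_root / rel
        if not before.is_file() or sha256_of(before) != sha256_of(after):
            changed.append(rel.as_posix())
    return changed
```

Run against the original and regenerated `data/he` trees, a result containing only the re-rendered Google clip would be consistent with the "19 of 20 byte-identical" statement.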
## Test plan

- [x] `synthbanshee qa-report data/he --output deliveries/.../qa-report.json --run-summary`: failure rate 0.0%, 20 clips, no `single_backend` warning
- [x] `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit): both VALID
- [x] `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- [x] Manifest CSV `wav_path` column verified repo-relative for all 20 rows

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`, `synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs; none touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
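The `jq` spot-checks in the test plan above could also be expressed as a small Python check that runs over every clip JSON rather than a sample. This is a sketch under assumptions: `check_clip_schema` is a hypothetical helper, and `generation_metadata` is treated as a plain mapping with a `tts_backend` key, which may not match the real (possibly per-speaker) layout.

```python
from pathlib import PurePosixPath


def check_clip_schema(clip: dict) -> list[str]:
    """Return a list of problems found in one clip-metadata dict (empty if clean).

    Hypothetical helper; field semantics are taken from the PR text, the
    exact nesting of generation_metadata is an assumption.
    """
    problems = []
    if "tts_engine" in clip:
        problems.append("tts_engine still present")
    for field in ("transcript_path", "dirty_file_path"):
        value = clip.get(field)
        if not value:
            problems.append(f"{field} missing")
        elif PurePosixPath(value).is_absolute() or "\\" in value:
            problems.append(f"{field} is not a repo-relative POSIX path")
    if not (clip.get("generation_metadata") or {}).get("tts_backend"):
        problems.append("generation_metadata.tts_backend not populated")
    return problems


# Made-up records for illustration; the paths are not real corpus files.
good = {
    "transcript_path": "data/he/example/clip.txt",
    "dirty_file_path": "assets/speech/dirty/clip.wav",
    "generation_metadata": {"tts_backend": "azure"},
}
bad = {
    "tts_engine": "azure_he_IL",
    "transcript_path": "data/he/example/clip.txt",
    "dirty_file_path": "/tmp/pytest-of-user/clip.wav",
    "generation_metadata": {},
}
print(check_clip_schema(good))  # []
print(check_clip_schema(bad))
```

Loading each `data/he/**/*.json` file with `json.load` and passing the result through a function like this would extend the three-file sample to all 20 clips.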
## Summary
Replaces delivery 002. First handoff target for the She-Proves and Elephant consumer teams: a 20-clip toy corpus designed to bootstrap consumer-side schema parsers, manifest loaders, and taxonomy validators — not for model training itself.
- `agg_m_30-45_001/`: She-Proves Tier A, Azure pair (10 clips)
- `agg_m_30-45_002/`: She-Proves Tier A, Google Chirp HD pair (2 clips)
- `ben_m_40-55_003/`: Elephant Tier B (8 clips)

Total: 20 clips, ~41.7 min. 4 unique voice families across 2 TTS backends. All clips pass `synthbanshee validate` (0 failures) and `synthbanshee qa-report` (0% failure rate).

Full delivery record: `deliveries/003-multi-project-multi-voice/`, which includes `metadata.yaml`, `notes.md` with the per-clip table, and the raw `qa-report.json`.

## Synthbanshee changes shipped with this delivery
This delivery is the first to carry four synthbanshee corrections landed in the past day:
- DataHackIL/SynthBanshee#102: `preprocessing_applied.normalized_dbfs` now records the measured post-preprocess peak (was hardcoded `-1.0`). Pair with `generation_metadata.loudness_target_peak_dbfs` to diagnose loudness drift.
- DataHackIL/SynthBanshee#103: `docs/spec.md` pins the `has_violence` derivation rule, adds the §2.5 identifier-casing table, and rewrites §5.1 field notes.
- DataHackIL/SynthBanshee#105: `sp_sv_a_0003` + `sp_it_a_0003` Google-pair sister scenes (this delivery's voice-diversity vehicle).
- DataHackIL/SynthBanshee#106: root cause for [#72](https://github.com/DataHackIL/SynthBanshee/issues/72) ("SSML parsing error 0x80045003, unable to reproduce"). `_HINT_DEFAULTS["stress"]` was emitting nested `<prosody volume="+NdB">` inside outer `<prosody volume="+N%">`, which Azure rejects. Required to unblock this delivery: without it, 6 of 8 elephant Tier B scenes (every one whose LLM script carries a `stress` phrase hint at intensity ≥3) failed reliably with an Azure SSML parse error.

## What changed in this PR (file-by-file summary)
| File | Change |
|---|---|
| `README.md` | Replaced the `has_violence` paragraph with the events-based rule; updated audio-format section for the post-#78 measured-vs-target split; replaced v1-limitations block with a pointer to per-delivery notes |
| `CLAUDE.md` | Replaced the wrong `has_violence` formula (typology in {SV,IT,NEG} and max_intensity ≥ 3) with the events-based rule (`any(e.tier1_category != "NONE")`); audio-format table now distinguishes target (-2.0 dBFS), limiter (-1.0 dBFS), and where each is recorded in metadata |
| `DELIVERIES.md` | Delivery 002 marked `superseded`; new row for 003 |
| `deliveries/003-multi-project-multi-voice/metadata.yaml` | Structured delivery record |
| `deliveries/003-multi-project-multi-voice/notes.md` | Full per-clip table, voice/backend matrix, closed-vs-open qa-report findings |
| `deliveries/003-multi-project-multi-voice/qa-report.json` | Raw `synthbanshee qa-report --run-summary` output, committed for audit |
| `data/he/agg_m_30-45_001/` | … `generation_metadata`, `voice_family`, measured `normalized_dbfs`, dirty files |
| `data/he/agg_m_30-45_002/` | |
| `data/he/ben_m_40-55_003/` | |
| `data/he/manifest.csv` | … `voice_families` column |
| `assets/speech/`, `assets/scripts/` | |

## QA snapshot: closed since delivery 002
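| Finding | 002 | 003 |
|---|---|---|
| `agg_no_escalation` | 3 clips | 0 |
| `warn_no_overlap` | 4 clips | 0; `overlap_ratio: 100%` on I4+ (post-M8a) |
| `warn_emotion_downgrade` | 4 clips | 0; `emotion_downgrade_ratio: 0%` |
| `generation_metadata` absent | 0 of 8 had it | 20 of 20 have it |
| `dirty_file_path` null | 7 of 8 | 0 of 20 |
| `normalized_dbfs` hardcoded `-1.0` | 8 of 8 | fixed (#102) |

## Still open (documented honestly)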
- `low_voice_diversity_male` / `low_voice_diversity_female`: corpus now has 2 voice families per gender (up from 1); the run-level threshold is ≥3, so the warnings continue to fire. Partial progress, not a clear.
- `single_backend`: misleading, because the corpus actually uses Azure + Google. The qa-report counts `clip.tts_engine`, which is currently hardcoded to `"azure_he_IL"` in `cli.py:_run_generate_pipeline`. This is a follow-up synthbanshee labeling bug, not a real diversity finding. The real backend distribution is correctly recorded in `generation_metadata.tts_backend` per clip and in `speakers[].voice_family`.
- `vic_f0_high`: 2 clips, `sp_it_a_0003_00` and `sp_sv_a_0003_00`. Google Chirp HD female (Achernar) baselines F0 higher than the Azure Hila reference the M10a thresholds were calibrated against.

## Test plan
- `synthbanshee validate` on each of the 20 clips: all VALID
- `synthbanshee qa-report --run-summary`: failure rate 0.0%, 0 quality-flagged via the failure path (15 carry warning flags, all expected per `quality_flagged_clips` notes above)
- … `ClipMetadata` (Pydantic-validated)
- `.wav` / `.txt` / `.json` / `.jsonl` quartets present (80 of 80 files)
- … `dirty_file_path` and `transcript_path` (matches the delivery-002 convention)
- `wav_path` and `strong_labels_path` are repo-relative

## Tier-3 ASR sanity (local)
The synthbanshee changes (#106 in particular) altered SSML output for any scene that emits a `stress` phrase hint. Will run `synthbanshee qa-report --asr` and append the result here as a comment before merge. The Tier-3 Whisper sanity check is local-only per the `CLAUDE.md` policy.

🤖 Generated with Claude Code
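The two `has_violence` formulas quoted in the `CLAUDE.md` row of the file-by-file summary can disagree on the same clip, which is why the docs change matters to consumer-side parsers. The sketch below contrasts them; the `Event` dataclass, the function names, and the `"PHYS"` category value are hypothetical stand-ins, and only the two rule expressions themselves come from the PR text.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Only the field the rule reads; real event records carry more.
    tier1_category: str


def has_violence_old(typology: str, max_intensity: int) -> bool:
    """The formula the PR calls wrong: derived from clip-level typology and intensity."""
    return typology in {"SV", "IT", "NEG"} and max_intensity >= 3


def has_violence(events: list[Event]) -> bool:
    """Events-based rule: true if any event has a non-NONE tier-1 category."""
    return any(e.tier1_category != "NONE" for e in events)


# Hypothetical clip A: high-intensity NEG clip whose events are all NONE.
clip_a_events = [Event("NONE"), Event("NONE")]
print(has_violence_old("NEG", 3), has_violence(clip_a_events))  # True False

# Hypothetical clip B: NEU clip that nonetheless carries one labeled event.
clip_b_events = [Event("NONE"), Event("PHYS")]
print(has_violence_old("NEU", 1), has_violence(clip_b_events))  # False True
```

A taxonomy validator that derives the flag from events, as in the second function, stays consistent with the strong labels regardless of how the clip-level typology is set.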