data(delivery-003): regen post-#107/#108/#109; `single_backend` false positive resolved #5
Merged
Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3` (tip of main after PRs #110, #111, #112 merged earlier today). All 20 clips regenerated via `synthbanshee generate-batch` with corpus paths anchored at `--data-root` so clip JSON / manifest paths are repo-relative.

## Why a regen rather than in-place post-processing

Two motivations made the regen worthwhile here:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16 s for both batches.
2. **Highest fidelity guarantee.** The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.

## Schema changes in this delivery vs original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent**: Pydantic drops it; backend is `generation_metadata.tts_backend` per speaker (PR #112) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (PR #111) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (PRs #110, #111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (PR #111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (PR #112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive; corpus actually has 2 backends) | `single_backend` **resolved**; only `low_voice_diversity_*` remain (legitimate; threshold is ≥3, corpus has 2) |

## Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original 2026-05-12 delivery (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Duration shifted by 2.2 s (corpus-wide total: 2500.79 → 2498.63 s).

## What changed in this commit

- `data/he/**/*.json` (20 files): schema migration. No `tts_engine`; relative paths.
- `data/he/agg_m_30-45_002/sp_sv_a_0003_00.{wav,txt,jsonl}`: Google re-render.
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train` (4 speakers across 20 clips, no speaker-disjoint partition possible).
- `assets/speech/*.wav` (6 new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; new field shape; `single_backend` false positive gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: pinned new SynthBanshee commit (`1ea48f3`); added PRs #110/#111/#112 to `related_prs`; new `qa_findings_closed_post_regen_2026_05_12` section; new `regen_2026_05_12` block documenting reason/cost/changes.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections; "still open" QA findings list pared down.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.
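The byte-identical claim under "Audio integrity" can be re-checked independently of the pipeline. The following is a minimal sketch, not part of SynthBanshee: it assumes two local checkouts of the corpus (the original delivery and the regen), and the function names `sha256_of` and `changed_wavs` are hypothetical helpers introduced only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large WAVs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_wavs(old_root: Path, new_root: Path) -> list[str]:
    """Return POSIX-style relative paths of WAVs under old_root whose
    bytes differ in new_root, or that are missing from new_root."""
    changed = []
    for old_wav in sorted(old_root.rglob("*.wav")):
        rel = old_wav.relative_to(old_root)
        new_wav = new_root / rel
        if not new_wav.is_file() or sha256_of(old_wav) != sha256_of(new_wav):
            changed.append(rel.as_posix())
    return changed
```

Run against the two checkouts' `data/he` directories, a result containing only the `sp_sv_a_0003_00.wav` entry would match the integrity statement above.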
## Test plan

- [x] `synthbanshee qa-report data/he --output deliveries/.../qa-report.json --run-summary`: failure rate 0.0%, 20 clips, no `single_backend` warning
- [x] `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit): both VALID
- [x] `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- [x] Manifest CSV `wav_path` column verified repo-relative for all 20 rows

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`, `synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs; none touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
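The `jq` and manifest spot-checks in the test plan could also be run over every file rather than a sample. The sketch below is one way to do that under stated assumptions: `transcript_path` and `dirty_file_path` are taken to be top-level string keys of the clip JSON, `generation_metadata` is taken to be an object whose `tts_backend` value only needs to be non-empty (its exact shape is not assumed), and all function names are hypothetical.

```python
import csv
import json
from pathlib import Path, PurePosixPath

PATH_FIELDS = ("transcript_path", "dirty_file_path")
MANIFEST_PATH_COLUMNS = ("wav_path", "strong_labels_path")


def is_repo_relative(value) -> bool:
    """True for a non-empty, backslash-free string that is not an absolute POSIX path."""
    if not isinstance(value, str) or not value or "\\" in value:
        return False
    return not PurePosixPath(value).is_absolute()


def clip_json_problems(clip: dict) -> list[str]:
    """List deviations of one clip JSON object from the post-regen schema."""
    problems = []
    if "tts_engine" in clip:
        problems.append("tts_engine still present")
    for field in PATH_FIELDS:
        if not is_repo_relative(clip.get(field)):
            problems.append(f"{field} is not repo-relative")
    if not (clip.get("generation_metadata") or {}).get("tts_backend"):
        problems.append("generation_metadata.tts_backend missing or empty")
    return problems


def clip_file_problems(path: Path) -> list[str]:
    """Load a clip JSON file and check it."""
    return clip_json_problems(json.loads(path.read_text(encoding="utf-8")))


def manifest_problems(manifest_path: Path) -> list[str]:
    """Check that the path columns of manifest.csv are repo-relative in every row."""
    problems = []
    with manifest_path.open(newline="", encoding="utf-8") as fh:
        for row_number, row in enumerate(csv.DictReader(fh), start=1):
            for column in MANIFEST_PATH_COLUMNS:
                if not is_repo_relative(row.get(column)):
                    problems.append(f"row {row_number}: {column} is not repo-relative")
    return problems
```

An empty list from both checks across `data/he/**/*.json` and `data/he/manifest.csv` would correspond to the third and fourth test-plan items.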
pr-agent-context report: No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #5 in repository https://github.com/DataHackIL/avdp-synth-corpus. Treat this PR as all clear unless new signals appear.
## Summary

Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3`, the tip of `main` after PRs #110, #111, #112 merged earlier today. All 20 clips regenerated; `qa-report.json` no longer fires the misleading `single_backend` warning that was filed as the open QA finding in the original delivery.

## Why a regen (option 1) rather than in-place post-processing (option 2)

Two motivations made the regen worthwhile:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16 s for both batches.
2. **Highest fidelity guarantee.** The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.

## Schema changes vs the original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent**; backend is `generation_metadata.tts_backend` per speaker (#112) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (#110, #111) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (#111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (#111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (#112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive) | `single_backend` resolved; only legitimate `low_voice_diversity_*` warnings remain |

## Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original 2026-05-12 delivery (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render; Google doesn't share the content-hash cache the same way Azure does in this codebase. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Total corpus duration shifted by 2.2 s (2500.79 → 2498.63 s).

## Files changed per area

- Clip JSONs (`data/he/**/*.json`): schema migration. No `tts_engine`; relative paths.
- Google re-render (`sp_sv_a_0003_00.{wav,txt,jsonl}`).
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train`.
- `assets/speech/*.wav` (new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; `single_backend` gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: new `regen_2026_05_12` + `qa_findings_closed_post_regen_2026_05_12` sections.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.

## Test plan

- `synthbanshee qa-report data/he --output deliveries/003-multi-project-multi-voice/qa-report.json --run-summary`: failure rate 0.0%, 20/20 clips passed, no `single_backend` warning
- `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit): both VALID
- `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- Manifest CSV `wav_path` column verified repo-relative for all 20 rows
- … `sp_sv_a_0003_00` Google re-render

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`, `synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs; none touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.

🤖 Generated with Claude Code
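The validation criteria quoted for the Google re-render (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s) are simple enough to illustrate with the standard library. This is a rough sketch and not the implementation behind `synthbanshee validate`: it assumes 16-bit PCM WAV input, takes 32768 as full scale for the dBFS computation, and `wav_problems` is a hypothetical name.

```python
import array
import math
import sys
import wave
from pathlib import Path


def wav_problems(
    path: Path,
    *,
    sample_rate: int = 16_000,
    channels: int = 1,
    max_peak_dbfs: float = -1.0,
    min_duration_s: float = 3.0,
) -> list[str]:
    """Return human-readable deviations from the stated audio criteria."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        n_channels = wav.getnchannels()
        width = wav.getsampwidth()
        n_frames = wav.getnframes()
        frames = wav.readframes(n_frames)

    problems = []
    if rate != sample_rate:
        problems.append(f"sample rate {rate} Hz, expected {sample_rate} Hz")
    if n_channels != channels:
        problems.append(f"{n_channels} channels, expected {channels}")
    duration = n_frames / rate if rate else 0.0
    if duration < min_duration_s:
        problems.append(f"duration {duration:.2f} s is below {min_duration_s} s")

    if width != 2:
        # Assumption: only 16-bit PCM is handled in this sketch.
        problems.append("peak check skipped: sample width is not 16-bit")
        return problems

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")
    if peak_dbfs > max_peak_dbfs:
        problems.append(f"peak {peak_dbfs:.2f} dBFS exceeds {max_peak_dbfs} dBFS")
    return problems
```

An empty list means the file meets all four criteria as interpreted here; the real validator may apply additional or differently defined checks.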