data(delivery-003): regen post-#107/#108/#109 — single_backend false positive resolved by shaypal5 · Pull Request #5 · DataHackIL/avdp-synth-corpus

shaypal5 · 2026-05-12T19:52:04Z

Summary

Schema-shift regen of delivery-003 using SynthBanshee 1ea48f3 — tip of main after PRs #110, #111, #112 merged earlier today. All 20 clips regenerated; qa-report.json no longer fires the misleading single_backend warning that was filed as the open QA finding in the original delivery.

Why a regen (option 1) rather than in-place post-processing (option 2)

Two motivations made the regen worthwhile:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16 s for both batches.
2. **Highest fidelity guarantee.** The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.
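
A content-hash cache of the kind described above can be sketched as follows. This is a minimal illustration, not SynthBanshee's actual code: it assumes the cache key is the SHA-256 digest of the SSML text and that each cached render is stored as one file per key. The names `ssml_cache_key`, `render_cached`, and the `synthesize` callback are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def ssml_cache_key(ssml: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded SSML text."""
    return hashlib.sha256(ssml.encode("utf-8")).hexdigest()


def render_cached(ssml: str, cache_dir: Path, synthesize: Callable[[str], bytes]) -> bytes:
    """Return WAV bytes for `ssml`, calling `synthesize` only on a cache miss.

    `synthesize` stands in for a paid TTS call (hypothetical signature).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{ssml_cache_key(ssml)}.wav"
    if cached.exists():
        # Cache hit: no backend call, and the bytes are identical to the first render.
        return cached.read_bytes()
    audio = synthesize(ssml)
    cached.write_bytes(audio)
    return audio
```

Under these assumptions, an unchanged SSML string always maps to the same file, which is why a cache hit costs nothing and yields byte-identical audio.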

Schema changes vs the original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | field absent; Pydantic drops it, and the backend is `generation_metadata.tts_backend` per speaker (#112) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (#110, #111) |
| `transcript_path` (clip JSON) | corpus PR #4 had post-processed to relative | now produced relative by the generator itself (#111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed | now produced relative (#111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | renamed `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (#112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive) | `single_backend` resolved; only legitimate `low_voice_diversity_*` warnings remain |
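
The clip-JSON rows of the table can be turned into a small check. The sketch below is an assumption-laden illustration: the field names `tts_engine`, `transcript_path`, `dirty_file_path`, and `generation_metadata.tts_backend` come from the table, but the exact nesting of `tts_backend` under `generation_metadata` is not shown here, so the code searches for that key at any depth. `find_key` and `schema_problems` are hypothetical helpers.

```python
from pathlib import PurePosixPath
from typing import Any, Iterator


def find_key(obj: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a nested JSON object."""
    if isinstance(obj, dict):
        for name, value in obj.items():
            if name == key:
                yield value
            else:
                yield from find_key(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def schema_problems(clip: dict) -> list:
    """Return a list of ways a clip JSON deviates from the migrated schema."""
    problems = []
    if "tts_engine" in clip:
        problems.append("tts_engine still present")
    for field in ("transcript_path", "dirty_file_path"):
        value = clip.get(field)
        if not value:
            problems.append(f"{field} missing or empty")
        elif "\\" in value or PurePosixPath(value).is_absolute():
            problems.append(f"{field} is not a relative POSIX path")
    # Assumption: tts_backend lives somewhere under generation_metadata (layout unknown).
    backends = list(find_key(clip.get("generation_metadata", {}), "tts_backend"))
    if not backends or not all(backends):
        problems.append("generation_metadata.tts_backend not populated")
    return problems
```

An empty result means the clip matches the "After" column as far as this sketch can tell; it does not replace the pipeline's own Pydantic validation.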

Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original delivery-003 (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render, since Google doesn't share the content-hash cache the same way Azure does in this codebase. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Total corpus duration shifted by about 2.2 s (2500.79 → 2498.63 s).
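
The two checks above (byte identity and the validation thresholds) can be approximated with the standard library. This is a rough sketch, not the `synthbanshee validate` implementation: it assumes 16-bit PCM WAV files, and `sha256_of` and `check_wav` are hypothetical helper names. The thresholds are the ones quoted above.

```python
import array
import hashlib
import math
import sys
import wave
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the whole file; equal digests mean byte-identical WAVs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_wav(path: Path, max_peak_dbfs: float = -1.0, min_duration_s: float = 3.0) -> dict:
    """Check sample rate, channel count, peak level and duration of a 16-bit WAV."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        n_frames = wav.getnframes()
        frames = wav.readframes(n_frames)
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")
    duration_s = n_frames / rate
    valid = (
        rate == 16000
        and channels == 1
        and peak_dbfs <= max_peak_dbfs
        and duration_s >= min_duration_s
    )
    return {"valid": valid, "peak_dbfs": peak_dbfs, "duration_s": duration_s}
```

Comparing `sha256_of` for the old and new copy of each clip is one way to arrive at a "19 of 20 identical" style count; the duration delta follows directly from the two totals (2500.79 − 2498.63 = 2.16 s).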

Files changed per area

| Area | Count | Notes |
|---|---|---|
| Per-clip JSON (`data/he/**/*.json`) | 20 | Schema migration |
| Google clip audio + transcript + JSONL (`sp_sv_a_0003_00.{wav,txt,jsonl}`) | 3 | Re-render |
| `data/he/manifest.csv` | 1 | Relative paths; 20 rows; all `split: train` |
| `assets/speech/*.wav` (new SSML caches) | 6 | From the Google re-render |
| `deliveries/003-multi-project-multi-voice/qa-report.json` | 1 | New schema; `single_backend` gone |
| `deliveries/003-multi-project-multi-voice/metadata.yaml` | 1 | Pinned new commit; added `regen_2026_05_12` + `qa_findings_closed_post_regen_2026_05_12` sections |
| `deliveries/003-multi-project-multi-voice/notes.md` | 1 | Pipeline section split into "initial" + "schema-shift regen"; closed-findings list refreshed |
| `DELIVERIES.md` | 1 | Added #110/#111/#112 to pipeline-milestone column |

Test plan

- `synthbanshee qa-report data/he --output deliveries/003-multi-project-multi-voice/qa-report.json --run-summary`: failure rate 0.0%, 20/20 clips passed, no `single_backend` warning
- `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit): both VALID
- `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- Manifest CSV `wav_path` column verified repo-relative for all 20 rows
- Audio total duration delta vs original (2498.63 vs 2500.79 s): accounted for by the `sp_sv_a_0003_00` Google re-render
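
The manifest check in the list above could be scripted along these lines. This is a hedged sketch: the column names `wav_path` and `strong_labels_path` come from this PR, but the rest of the manifest layout is not shown, and `non_relative_rows` is a hypothetical helper rather than part of the SynthBanshee CLI.

```python
import csv
import io
from pathlib import PurePosixPath, PureWindowsPath


def non_relative_rows(manifest_text: str, columns=("wav_path", "strong_labels_path")) -> list:
    """Return (line_number, column, value) for every path that is not a relative POSIX path."""
    bad = []
    reader = csv.DictReader(io.StringIO(manifest_text))
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        for column in columns:
            value = row.get(column) or ""
            if (
                not value
                or "\\" in value
                or PurePosixPath(value).is_absolute()
                or PureWindowsPath(value).is_absolute()
            ):
                bad.append((line_number, column, value))
    return bad
```

Running it as `non_relative_rows(Path("data/he/manifest.csv").read_text(encoding="utf-8"))` (with `pathlib.Path` imported) and expecting an empty list would cover both path columns rather than `wav_path` alone.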

Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch tests/, synthbanshee/cli.py path-shape, synthbanshee/package/manifest.py, synthbanshee/package/qa.py, synthbanshee/labels/, and docs — none touch synthbanshee/tts/, synthbanshee/script/, synthbanshee/augment/, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.
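
The exemption argument above amounts to checking whether any changed path falls under an audio-affecting directory. The sketch below encodes only the three package directories named in this section; the exact wording of the CLAUDE.md policy and the locations of the speaker / scene / acoustic / project YAML configs are not given here, so they are deliberately left out. `touches_audio_pipeline` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Directories named in this PR as audio-affecting (YAML config paths are not listed here).
AUDIO_PIPELINE_DIRS = (
    PurePosixPath("synthbanshee/tts"),
    PurePosixPath("synthbanshee/script"),
    PurePosixPath("synthbanshee/augment"),
)


def touches_audio_pipeline(changed_paths) -> bool:
    """Return True if any changed file lives under an audio-affecting directory."""
    for raw in changed_paths:
        parents = PurePosixPath(raw).parents
        if any(directory in parents for directory in AUDIO_PIPELINE_DIRS):
            return True
    return False
```

Fed the file list of the three upstream PRs (for example from `git diff --name-only`), a `False` result would support skipping the local ASR run under the stated assumption.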

🤖 Generated with Claude Code

…positive resolved

Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3` (tip of main after PRs #110, #111, #112 merged earlier today). All 20 clips regenerated via `synthbanshee generate-batch` with corpus paths anchored at `--data-root` so clip JSON / manifest paths are repo-relative.

## Why a regen rather than in-place post-processing

Two motivations made the regen worthwhile here:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16s for both batches.
2. **Highest fidelity guarantee.** The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.

## Schema changes in this delivery vs original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent**; Pydantic drops it, and the backend is `generation_metadata.tts_backend` per speaker (PR #112) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (PR #111) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (PRs #110, #111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (PR #111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (PR #112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive: corpus actually has 2 backends) | `single_backend` **resolved**; only `low_voice_diversity_*` remain (legitimate; threshold is ≥3, corpus has 2) |

## Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original 2026-05-12 delivery (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Duration shifted by 2.2s (corpus-wide total: 2500.79 → 2498.63 s).

## What changed in this commit

- `data/he/**/*.json` (20 files): schema migration. No `tts_engine`; relative paths.
- `data/he/agg_m_30-45_002/sp_sv_a_0003_00.{wav,txt,jsonl}`: Google re-render.
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train` (4 speakers across 20 clips, no speaker-disjoint partition possible).
- `assets/speech/*.wav` (6 new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; new field shape; `single_backend` false positive gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: pinned new SynthBanshee commit (`1ea48f3`); added PRs #110/#111/#112 to `related_prs`; new `qa_findings_closed_post_regen_2026_05_12` section; new `regen_2026_05_12` block documenting reason/cost/changes.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections; "still open" QA findings list pared down.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.

## Test plan

- [x] `synthbanshee qa-report data/he --output deliveries/.../qa-report.json --run-summary`: failure rate 0.0%, 20 clips, no `single_backend` warning
- [x] `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit): both VALID
- [x] `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- [x] Manifest CSV `wav_path` column verified repo-relative for all 20 rows

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`, `synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs; none touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


github-actions · 2026-05-12T19:55:05Z

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #5 in repository https://github.com/DataHackIL/avdp-synth-corpus. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25758545236 attempt 1
Comment timestamp: 2026-05-12T19:54:16.196657+00:00
PR head commit: 3e6335ea75109a41594fb9fcf590f109b27d10c6

shaypal5 added the data label May 12, 2026


[pre-commit.ci] auto fixes from pre-commit.com hooks

3e6335e

for more information, see https://pre-commit.ci

shaypal5 merged commit 8f589e0 into main May 12, 2026
3 checks passed

shaypal5 deleted the regen/delivery-003-post-107-108-109 branch May 12, 2026 20:17

shaypal5 merged 2 commits into main from regen/delivery-003-post-107-108-109
