data(delivery-003): regen post-#107/#108/#109 — single_backend false positive resolved#5

Merged
shaypal5 merged 2 commits into main from regen/delivery-003-post-107-108-109 on May 12, 2026
@shaypal5

Member

Summary

Schema-shift regen of delivery-003 using SynthBanshee 1ea48f3 — tip of main after PRs #110, #111, #112 merged earlier today. All 20 clips regenerated; qa-report.json no longer fires the misleading single_backend warning that was filed as the open QA finding in the original delivery.

Why a regen (option 1) rather than in-place post-processing (option 2)

Two motivations made the regen worthwhile:

  1. Audio cost was effectively $0. Azure clips (18 of 20) hit the SHA-256 SSML cache and re-rendered byte-identical WAVs. The two Google Chirp HD clips re-rendered, costing fractions of a cent. Total wall time: ~16s for both batches.
  2. Highest fidelity guarantee. The regen produces canonical artifacts straight from the pipeline; no manual JSON editing or path-string sed-ing. The generator is now the source of truth for what delivery-003 should look like under the post-#109 schema.
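
The cache behaviour behind the near-zero audio cost can be sketched as follows. This is a minimal illustration, assuming the cache is keyed on the SHA-256 of the SSML string and stores one WAV per key; `ssml_cache_key`, `render_with_cache`, and the `synthesize` callback are hypothetical names, not SynthBanshee's actual API.

```python
import hashlib
from pathlib import Path
from typing import Callable


def ssml_cache_key(ssml: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded SSML string."""
    return hashlib.sha256(ssml.encode("utf-8")).hexdigest()


def render_with_cache(
    ssml: str,
    cache_dir: Path,
    synthesize: Callable[[str], bytes],  # hypothetical paid TTS call
) -> tuple[bytes, bool]:
    """Return (wav_bytes, cache_hit); `synthesize` runs only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{ssml_cache_key(ssml)}.wav"
    if cached.exists():
        # Identical SSML yields the identical key, so the stored bytes are reused as-is.
        return cached.read_bytes(), True
    wav_bytes = synthesize(ssml)
    cached.write_bytes(wav_bytes)
    return wav_bytes, False
```

Under this assumption, a regen whose SSML is unchanged never reaches the TTS backend, which is consistent with the byte-identical Azure WAVs reported here.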

Schema changes vs the original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | field absent (Pydantic drops it); backend is `generation_metadata.tts_backend` per speaker (#112) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (#110, #111) |
| `transcript_path` (clip JSON) | corpus PR #4 had post-processed to relative | now produced relative by the generator itself (#111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed | now produced relative (#111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | renamed `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (#112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive) | `single_backend` resolved; only legitimate `low_voice_diversity_*` warnings remain |
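
As a rough illustration of the clip-JSON side of these schema changes, the check below flags a clip that still looks pre-regen. It is a sketch only: `check_clip_metadata` is a hypothetical helper, the field names are taken from this PR, and the real Pydantic model may enforce more than this.

```python
from pathlib import PurePosixPath


def check_clip_metadata(clip: dict) -> list[str]:
    """Return a list of problems; an empty list means the clip matches the new shape."""
    problems = []
    if "tts_engine" in clip:
        problems.append("legacy tts_engine field still present")
    for field in ("transcript_path", "dirty_file_path"):
        value = clip.get(field)
        if not value:
            problems.append(f"{field} is missing or empty")
        elif "\\" in value or PurePosixPath(value).is_absolute():
            problems.append(f"{field} is not a relative POSIX path: {value}")
    # Only checks that the backend is populated; its exact shape is not assumed here.
    if not clip.get("generation_metadata", {}).get("tts_backend"):
        problems.append("generation_metadata.tts_backend is not populated")
    return problems
```

Running it over each file matched by `data/he/**/*.json` would give the same signal as the manual `jq` spot-checks, but for all 20 clips instead of a sample.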

Audio integrity

  • 19 of 20 clips: WAV bytes byte-identical with the original delivery-003 (Azure SSML cache hit). Only metadata JSON changed.
  • sp_sv_a_0003_00.wav (Google Chirp HD): bit-level difference from the original render, because Google doesn't share the content-hash cache the same way Azure does in this codebase. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Total corpus duration shifted by 2.16 s (2500.79 → 2498.63 s).
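
Both integrity claims can be reproduced with the standard library alone. The sketch below is not `synthbanshee validate`; it assumes 16-bit PCM WAVs and only covers the four checks named above, and `sha256_of_file` and `check_wav` are hypothetical helper names.

```python
import hashlib
import math
import struct
import wave
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hex SHA-256 of the file's bytes; equal digests mean byte-identical files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_wav(path: Path) -> dict:
    """Measure a 16-bit PCM WAV against: 16 kHz, mono, peak <= -1.0 dBFS, duration >= 3 s."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_frames = wav.getnframes()
        frames = wav.readframes(n_frames)
    # WAV PCM samples are little-endian signed 16-bit integers.
    peak = max((abs(s) for (s,) in struct.iter_unpack("<h", frames)), default=0)
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")
    duration_s = n_frames / rate
    valid = rate == 16000 and channels == 1 and peak_dbfs <= -1.0 and duration_s >= 3.0
    return {
        "sample_rate": rate,
        "channels": channels,
        "peak_dbfs": peak_dbfs,
        "duration_s": duration_s,
        "valid": valid,
    }
```

Comparing `sha256_of_file` on the old and new copy of each clip separates cache hits (equal digests) from re-renders, and `check_wav` confirms a re-rendered clip still meets the format constraints.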

Files changed per area

| Area | Count | Notes |
|---|---|---|
| Per-clip JSON (`data/he/**/*.json`) | 20 | Schema migration |
| Google clip audio + transcript + JSONL (`sp_sv_a_0003_00.{wav,txt,jsonl}`) | 3 | Re-render |
| `data/he/manifest.csv` | 1 | Relative paths; 20 rows; all `split: train` |
| `assets/speech/*.wav` (new SSML caches) | 6 | From the Google re-render |
| `deliveries/003-multi-project-multi-voice/qa-report.json` | 1 | New schema; `single_backend` gone |
| `deliveries/003-multi-project-multi-voice/metadata.yaml` | 1 | Pinned new commit; added `regen_2026_05_12` + `qa_findings_closed_post_regen_2026_05_12` sections |
| `deliveries/003-multi-project-multi-voice/notes.md` | 1 | Pipeline section split into "initial" + "schema-shift regen"; closed-findings list refreshed |
| `DELIVERIES.md` | 1 | Added #110/#111/#112 to pipeline-milestone column |

Test plan

  • synthbanshee qa-report data/he --output deliveries/003-multi-project-multi-voice/qa-report.json --run-summary — failure rate 0.0%, 20/20 clips passed, no single_backend warning
  • synthbanshee validate spot-checked on sp_sv_a_0003_00.wav (Google re-render) and el_sv_b_0001_00.wav (Azure cache hit) — both VALID
  • jq spot-checks on 3 sampled JSONs confirm has_tts_engine: false, repo-relative transcript_path and dirty_file_path, populated generation_metadata.tts_backend
  • Manifest CSV wav_path column verified repo-relative for all 20 rows
  • Audio total duration delta vs original (2498.63 vs 2500.79) — accounted for by sp_sv_a_0003_00 Google re-render
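
The manifest check in this plan is easy to automate. The sketch below assumes the manifest is a plain CSV with the `wav_path` and `strong_labels_path` columns named in this PR; `is_repo_relative` and `offending_manifest_rows` are hypothetical helpers, and treating `..` segments as non-relative is an extra assumption of this sketch.

```python
import csv
from pathlib import PurePosixPath
from typing import Iterable

PATH_COLUMNS = ("wav_path", "strong_labels_path")


def is_repo_relative(value: str) -> bool:
    """True for a non-empty POSIX path that is relative and never climbs out via '..'."""
    if not value or "\\" in value:
        return False
    path = PurePosixPath(value)
    return not path.is_absolute() and ".." not in path.parts


def offending_manifest_rows(lines: Iterable[str]) -> list[tuple[int, str, str]]:
    """Return (row_number, column, value) for every path cell that is not repo-relative."""
    offenders = []
    for row_number, row in enumerate(csv.DictReader(lines), start=1):
        for column in PATH_COLUMNS:
            value = row.get(column) or ""
            if not is_repo_relative(value):
                offenders.append((row_number, column, value))
    return offenders
```

Passing an open file handle for `data/he/manifest.csv` should return an empty list after the regen; any absolute pytest `tmp_path` that leaked into the manifest would show up with its row number.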

Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen touch tests/, synthbanshee/cli.py path-shape, synthbanshee/package/manifest.py, synthbanshee/package/qa.py, synthbanshee/labels/, and docs — none touch synthbanshee/tts/, synthbanshee/script/, synthbanshee/augment/, nor any speaker / scene / acoustic / project YAML config. The audio pipeline itself is unchanged, and the Azure cache hits prove it (19/20 bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check policy", the regen is exempt from the local `qa-report --asr` run.
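
The exemption reasoning amounts to a path-prefix test on the files the upstream PRs changed. The sketch below only encodes the three directories named above; it does not cover the speaker / scene / acoustic / project YAML configs (their locations are not stated here), and `audio_affecting_paths` is a hypothetical helper rather than part of the policy tooling.

```python
# Directories named in this PR as belonging to the audio pipeline.
AUDIO_AFFECTING_PREFIXES = (
    "synthbanshee/tts/",
    "synthbanshee/script/",
    "synthbanshee/augment/",
)


def audio_affecting_paths(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall under an audio-pipeline directory."""
    return [p for p in changed_paths if p.startswith(AUDIO_AFFECTING_PREFIXES)]
```

An empty result for the combined file list of the three PRs supports skipping the local ASR run; any non-empty result would mean the exemption does not apply.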

🤖 Generated with Claude Code

…positive resolved

Schema-shift regen of delivery-003 using SynthBanshee `1ea48f3` (tip of
main after PRs #110, #111, #112 merged earlier today). All 20 clips
regenerated via `synthbanshee generate-batch` with corpus paths anchored
at `--data-root` so clip JSON / manifest paths are repo-relative.

## Why a regen rather than in-place post-processing

Two motivations made the regen worthwhile here:

1. **Audio cost was effectively $0.** Azure clips (18 of 20) hit the
   SHA-256 SSML cache and re-rendered byte-identical WAVs. The two
   Google Chirp HD clips re-rendered, costing fractions of a cent.
   Total wall time: ~16s for both batches.

2. **Highest fidelity guarantee.** The regen produces canonical artifacts
   straight from the pipeline; no manual JSON editing or path-string
   sed-ing. The generator is now the source of truth for what
   delivery-003 should look like under the post-#109 schema.

## Schema changes in this delivery vs original 2026-05-12 commit

| Field | Before | After |
|---|---|---|
| `tts_engine` (clip JSON) | always `"azure_he_IL"` (wrong for the 2 Google clips) | **field absent** — Pydantic drops it; backend is `generation_metadata.tts_backend` per speaker (PR #112) |
| `transcript_path` (clip JSON) | already relative (corpus PR #4 had normalized) | unchanged contract; now enforced by the generator itself (PR #111) |
| `dirty_file_path` (clip JSON) | absolute pytest tmp_path on `sp_neu_a_0001_00` (#107 fingerprint); empty/absent on others | repo-relative POSIX `assets/speech/dirty/...` everywhere (PRs #110, #111) |
| `manifest.csv` `wav_path` / `strong_labels_path` | corpus PR #4 had post-processed to relative | now produced relative by the generator (PR #111) |
| `qa-report.json` `run_summary.clips_by_tts_engine` | `{"azure_he_IL": 20}` | **renamed** `clips_by_tts_backend` with `{"azure": 18, "google": 2}` (PR #112) |
| `qa-report.json` `run_summary.run_warnings` | included `single_backend` (false positive — corpus actually has 2 backends) | `single_backend` **resolved**; only `low_voice_diversity_*` remain (legitimate; threshold is ≥3, corpus has 2) |
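
To illustrate why the old report fired `single_backend`, here is a minimal sketch of a per-backend tally. It assumes one backend label per clip and that the warning fires when exactly one distinct label is seen; `summarize_backends` is a hypothetical function and not the code in `synthbanshee/package/qa.py`.

```python
from collections import Counter


def summarize_backends(clip_backends: list[str]) -> dict:
    """Count clips per backend label and warn when only one distinct label appears."""
    counts = Counter(clip_backends)
    warnings = ["single_backend"] if len(counts) == 1 else []
    return {"clips_by_tts_backend": dict(counts), "run_warnings": warnings}


# Old data: every clip carried the same hard-coded label, so the warning fired.
print(summarize_backends(["azure_he_IL"] * 20)["run_warnings"])
# New data: labels reflect the real backends, so the warning no longer fires.
print(summarize_backends(["azure"] * 18 + ["google"] * 2))
```

With the per-clip label always set to `"azure_he_IL"`, any tally of this kind sees a single backend regardless of what actually rendered the audio, which matches the false positive described in the table.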

## Audio integrity

- 19 of 20 clips: WAV bytes byte-identical with the original 2026-05-12 delivery (Azure SSML cache hit). Only metadata JSON changed.
- `sp_sv_a_0003_00.wav` (Google Chirp HD): bit-level difference from the original render. Validation still PASSED (16 kHz, mono, peak ≤ −1.0 dBFS, duration ≥ 3 s). Duration shifted by 2.16 s (corpus-wide total: 2500.79 → 2498.63 s).

## What changed in this commit

- `data/he/**/*.json` (20 files): schema migration. No `tts_engine`; relative paths.
- `data/he/agg_m_30-45_002/sp_sv_a_0003_00.{wav,txt,jsonl}`: Google re-render.
- `data/he/manifest.csv`: relative paths, 20 rows, all `split: train` (4 speakers across 20 clips, no speaker-disjoint partition possible).
- `assets/speech/*.wav` (6 new SSML caches): from the Google clip re-render.
- `deliveries/003-multi-project-multi-voice/qa-report.json`: regenerated; new field shape; `single_backend` false positive gone.
- `deliveries/003-multi-project-multi-voice/metadata.yaml`: pinned new SynthBanshee commit (`1ea48f3`); added PRs #110/#111/#112 to `related_prs`; new `qa_findings_closed_post_regen_2026_05_12` section; new `regen_2026_05_12` block documenting reason/cost/changes.
- `deliveries/003-multi-project-multi-voice/notes.md`: pipeline-version section split into "initial delivery" + "schema-shift regen" subsections; "still open" QA findings list pared down.
- `DELIVERIES.md`: pipeline-milestone column updated to include #110/#111/#112.

## Test plan

- [x] `synthbanshee qa-report data/he --output deliveries/.../qa-report.json --run-summary` — failure rate 0.0%, 20 clips, no `single_backend` warning
- [x] `synthbanshee validate` spot-checked on `sp_sv_a_0003_00.wav` (Google re-render) and `el_sv_b_0001_00.wav` (Azure cache hit) — both VALID
- [x] `jq` spot-checks on 3 sampled JSONs confirm `has_tts_engine: false`, repo-relative `transcript_path` and `dirty_file_path`, populated `generation_metadata.tts_backend`
- [x] Manifest CSV `wav_path` column verified repo-relative for all 20 rows

## Tier-3 ASR sanity (local)

Not applicable. The three SynthBanshee PRs that prompted this regen
touch `tests/`, `synthbanshee/cli.py` path-shape, `synthbanshee/package/manifest.py`,
`synthbanshee/package/qa.py`, `synthbanshee/labels/`, and docs — none
touch `synthbanshee/tts/`, `synthbanshee/script/`, `synthbanshee/augment/`,
nor any speaker / scene / acoustic / project YAML config. The audio
pipeline itself is unchanged, and the Azure cache hits prove it (19/20
bit-identical WAVs). Per SynthBanshee CLAUDE.md "ASR sanity check
policy", the regen is exempt from the local `qa-report --asr` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 shaypal5 added the data label May 12, 2026
@github-actions

pr-agent-context report:

No unresolved review comments, failing checks, or actionable patch coverage gaps were found on PR #5 in repository https://github.com/DataHackIL/avdp-synth-corpus. Treat this PR as all clear unless new signals appear.

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25758545236 attempt 1
Comment timestamp: 2026-05-12T19:54:16.196657+00:00
PR head commit: 3e6335ea75109a41594fb9fcf590f109b27d10c6

@shaypal5 shaypal5 merged commit 8f589e0 into main May 12, 2026
3 checks passed
@shaypal5 shaypal5 deleted the regen/delivery-003-post-107-108-109 branch May 12, 2026 20:17