Use left-padding rather than right-padding for prefix audio #198

Open: coezbek wants to merge 2 commits into Zyphra:main from coezbek:audio_prefix_fix

coezbek commented Mar 20, 2025

The existing code right-pads the audio prefix with zeros when its length isn't a multiple of 512 samples.

This might cause an unintended gap between prefix audio and generated audio.

This patch switches to left padding of the audio prefix.
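To illustrate the difference, here is a minimal pure-Python sketch, not the actual patch. The function names and the 1000-sample prefix are hypothetical; only the 512-sample block size comes from the description above. In a PyTorch code base the left-padded variant would typically be written with `torch.nn.functional.pad(wav, (pad, 0))` instead of `(0, pad)`.

```python
HOP = 512  # samples per audio token, as stated in the PR description


def pad_amount(num_samples: int, hop: int = HOP) -> int:
    """Number of zero samples needed to reach the next multiple of `hop`."""
    return (-num_samples) % hop


def pad_right(samples: list, hop: int = HOP) -> list:
    """Old behaviour: zeros go after the prefix, next to the generated audio."""
    return samples + [0.0] * pad_amount(len(samples), hop)


def pad_left(samples: list, hop: int = HOP) -> list:
    """Proposed behaviour: zeros go before the prefix, away from the seam."""
    return [0.0] * pad_amount(len(samples), hop) + samples


prefix = [1.0] * 1000      # hypothetical 1000-sample prefix (not a multiple of 512)
right = pad_right(prefix)  # ends in 24 zeros, directly before the generated audio
left = pad_left(prefix)    # starts with 24 zeros, prefix audio ends at the seam
```

Both variants have the same length (1024 samples here); they differ only in whether the zeros sit at the boundary with the generated audio or at the very start.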

In this patch I also added a silence prefix of roughly 350 ms, which is exactly 30 audio tokens long (30 * 512 = 15360 samples), to ensure that clipping does not happen and also because the model seems to use roughly 20-25 tokens of look-ahead to warm up.
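The arithmetic behind the silence prefix can be checked with a short sketch. The 44100 Hz sample rate is an assumption taken from the `512/44100` figure used later in this thread, and `with_silence` is a hypothetical helper, not a function from the patch.

```python
SAMPLE_RATE = 44_100   # Hz; assumed from the 512/44100 figure in the discussion
HOP = 512              # samples per audio token
SILENCE_TOKENS = 30    # length of the silence prefix in tokens

silence_samples = SILENCE_TOKENS * HOP               # 15360 samples
silence_ms = 1000 * silence_samples / SAMPLE_RATE    # about 348.3 ms


def with_silence(prefix: list) -> list:
    """Prepend the silence block to a prefix given as a list of samples."""
    return [0.0] * silence_samples + prefix


print(f"{silence_samples} samples = {silence_ms:.1f} ms")
```

Because the silence block is a whole number of tokens, prepending it does not change how much padding the prefix itself needs.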

@mrdrprofuroboros

@coezbek

> This might cause an unintended gap between prefix audio and generated audio.

The gap is at most ~11.6 ms (512/44100), right? Have you experienced a click there?


coezbek commented Apr 3, 2025

Correct, max is 511 samples. And no, I haven't noticed any clicks, because I am only using silence as the prefix anyway (by the way, 350 ms of silence improved generation quality a bit for me).

But conceptually it just seems wrong to put zeros in the gap.
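For reference, the worst-case gap discussed here can be worked out with a small sketch, assuming a 44100 Hz sample rate and 512-sample tokens as in the figures above; `pad_amount` is a hypothetical helper.

```python
SAMPLE_RATE = 44_100  # Hz, assumed from the 512/44100 figure above
HOP = 512             # samples per audio token


def pad_amount(num_samples: int, hop: int = HOP) -> int:
    """Zero samples needed to reach the next multiple of `hop`."""
    return (-num_samples) % hop


# Worst case over all prefix lengths within two tokens.
max_pad = max(pad_amount(n) for n in range(1, 2 * HOP + 1))
max_gap_ms = 1000 * max_pad / SAMPLE_RATE

print(f"max padding: {max_pad} samples = {max_gap_ms:.2f} ms")
```

The worst case occurs when the prefix is one sample longer than a multiple of 512, which leaves 511 samples of zeros.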
