Use left-padding rather than right-padding for prefix audio #198
Open
coezbek wants to merge 2 commits into
The gap is ~11ms (512/44100) at most, right? Have you experienced a click there?
Correct, the max is 511 samples. And no, I haven't noticed any clicks, because I am just using silence as the prefix anyway (btw, 350ms of silence improved generation quality for me a bit). But conceptually it just seems wrong to put zeros in the gap.
The existing code uses right-padding of the audio prefix data in case the audio prefix data isn't a multiple of 512 samples.
This might cause an unintended gap between prefix audio and generated audio.
This patch switches to left padding of the audio prefix.
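The difference between the two padding sides can be sketched as follows. This is a minimal illustration, not the code from the patch: the helper name `pad_to_multiple` and the 1000-sample prefix are hypothetical, and plain Python lists stand in for whatever tensor type the project actually uses. Only the 512-sample frame size comes from the description above.

```python
HOP = 512  # samples per audio token, as stated in this PR


def pad_to_multiple(samples, hop=HOP, side="left"):
    """Zero-pad `samples` so its length is a multiple of `hop`.

    Hypothetical helper for illustration only.
    """
    pad = (-len(samples)) % hop  # 0..hop-1 zeros are needed
    zeros = [0.0] * pad
    if side == "left":
        return zeros + list(samples)  # zeros go before the prefix audio
    return list(samples) + zeros      # zeros go between prefix and generated audio


# Hypothetical prefix of 1000 non-zero samples (not a multiple of 512).
prefix = [1.0] * 1000

left = pad_to_multiple(prefix, side="left")
right = pad_to_multiple(prefix, side="right")

# Left padding keeps real audio at the boundary to the generated audio;
# right padding ends in zeros, which is the gap discussed above.
print(len(left), left[-1], right[-1])
```

With left padding the last sample before the generated audio is real prefix audio, so any padding zeros end up at the very start instead of at the seam.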
In this patch I also added a silence prefix of roughly 350ms, which is exactly 30 audio tokens long (30 * 512 = 15360 samples, about 348ms at 44.1kHz), to ensure that clipping does not happen and also because the model seems to use roughly 20-25 tokens of look-ahead to warm up.
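The arithmetic behind the silence prefix can be checked with a short sketch. The names `silence_prefix`, `N_TOKENS` and `SAMPLE_RATE` are assumptions for illustration and are not taken from the patch; the 44100 Hz sample rate is the one used in the conversation above.

```python
SAMPLE_RATE = 44100  # Hz, assumed from the 512/44100 figure in the discussion
HOP = 512            # samples per audio token
N_TOKENS = 30        # length of the silence prefix in tokens


def silence_prefix(n_tokens=N_TOKENS, hop=HOP):
    """Return a block of zeros that is exactly `n_tokens` tokens long.

    Hypothetical helper; a real implementation would likely return a tensor.
    """
    return [0.0] * (n_tokens * hop)


silence = silence_prefix()
duration_ms = 1000 * len(silence) / SAMPLE_RATE

# The length is an exact multiple of 512, so this prefix needs no padding at all.
print(len(silence), round(duration_ms, 1))
```

Because the silence is generated as a whole number of tokens, it sidesteps the padding question entirely; left padding only matters for user-supplied prefix audio of arbitrary length.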