❓ Questions
I'm interested in using the encoder to encode an audio fragment of a few seconds into just one codebook vector. However, the model returns a sequence of several audio_codes (which, of course, is the only way to successfully decode the audio afterwards).
How would you recommend using the encoder, and/or pre- or post-processing the audio input or the audio_codes, to obtain just one audio code "at utterance level"?
Thanks in advance.