Synthesizing own text without style transfer gives poor audio results #120

@silkeplessers

Description

When I try to synthesize my own text using the pretrained Mellotron and WaveGlow models, I get poor audio quality (a very croaky voice).
I use the inference method so that no style transfer is performed; however, I am also not sure what to pass in for input_style and f0s.
The following code simply synthesizes with speaker ID 0 of the pretrained model. Is it normal for the audio quality to be relatively poor? My end goal is to fine-tune this model on a speech dataset in another language with two speakers.

text = "This is an example sentence."
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()

f0 = torch.zeros([1, 1, 32]).cuda()
speaker_id = torch.LongTensor([0]).cuda()

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, 0, speaker_id, f0))

with torch.no_grad():
    audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.7), 0.01)[:, 0]
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
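
For context, below is a rough sketch of how I was planning to build an f0s tensor from a reference clip once I move past the all-zeros baseline above. It uses librosa.pyin as a generic pitch tracker rather than whatever extraction the repo itself uses, and the file name, frame_length and hop_length are placeholder assumptions on my part.

import librosa
import numpy as np
import torch

# Hypothetical reference clip; resample to the model's sampling rate.
audio_ref, sr = librosa.load("reference.wav", sr=hparams.sampling_rate)

# Generic F0 extraction with pyin; frame_length/hop_length are assumed values
# and may not match the pitch extraction used during Mellotron training.
f0_ref, voiced_flag, voiced_probs = librosa.pyin(
    audio_ref,
    fmin=librosa.note_to_hz('C2'),
    fmax=librosa.note_to_hz('C7'),
    sr=sr,
    frame_length=1024,
    hop_length=256)

f0_ref = np.nan_to_num(f0_ref)  # pyin marks unvoiced frames as NaN; zero them out
f0s = torch.from_numpy(f0_ref).float()[None, None, :].cuda()  # shape (1, 1, T)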
