There's a considerable mismatch in dataset characteristics between Constituicao and LJSpeech. The former's audios are longer (20s-40s), while the latter's rarely go beyond 10s, and I'm not sure whether this plays nicely with the FastSpeech 2 recipe. As a matter of fact, ESPnet's TTS recipe ignores audios longer than 20s by default.

A possible way forward would be to re-segment Constituicao to make individual utterances shorter. MFA has been finding SILs in the middle of sentences quite often; in fact, the speaker pauses between titles and at the end of sentences. A VAD and a forced aligner (FA) would be of great help with that.
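To make the re-segmentation idea concrete, here's a rough sketch of how the cut points could be picked once the alignment is available. It is not what MFA or ESPnet do internally. It assumes the alignment has already been read into a list of `(start, end, label)` tuples in seconds, and that silences carry one of the labels in `sil_labels`; the function name, the label set, and the `max_dur` / `min_sil` defaults are all hypothetical and would need tuning against the real data.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float, str]  # (start_sec, end_sec, label)


def split_at_silences(
    intervals: Sequence[Interval],
    max_dur: float = 10.0,   # assumed target, roughly LJSpeech-like
    min_sil: float = 0.2,    # assumed: ignore pauses shorter than this
    sil_labels: Sequence[str] = ("", "sil", "sp"),  # assumed silence labels
) -> List[Tuple[float, float]]:
    """Greedily split one utterance into (start, end) segments at silences.

    Cuts are placed at the midpoint of each qualifying silence. For each
    segment, the latest cut that keeps it within max_dur is chosen; if no
    such cut exists, the next available cut is used and the segment ends
    up longer than max_dur.
    """
    if not intervals:
        return []
    utt_start, utt_end = intervals[0][0], intervals[-1][1]

    # Candidate cut points: midpoints of long-enough, utterance-internal silences.
    cuts = [
        (s + e) / 2
        for s, e, label in intervals
        if label in sil_labels and (e - s) >= min_sil
        and s > utt_start and e < utt_end
    ]

    segments = []
    seg_start = utt_start
    while utt_end - seg_start > max_dur:
        later = [c for c in cuts if c > seg_start]
        if not later:
            break  # no silence left to cut at; keep the remainder whole
        within = [c for c in later if c - seg_start <= max_dur]
        cut = within[-1] if within else later[0]
        segments.append((seg_start, cut))
        seg_start = cut
    segments.append((seg_start, utt_end))
    return segments


# Made-up alignment for a 25 s utterance (not real Constituicao data).
example = [
    (0.0, 4.0, "w1"), (4.0, 4.5, "sil"),
    (4.5, 9.0, "w2"), (9.0, 9.5, "sil"),
    (9.5, 15.0, "w3"), (15.0, 15.125, "sil"),  # too short, not a cut point
    (15.125, 21.0, "w4"), (21.0, 21.5, "sil"),
    (21.5, 25.0, "w5"),
]
segments = split_at_silences(example)
```

On this toy input the middle segment still comes out longer than `max_dur`, because the only pause inside that stretch is shorter than `min_sil`. Stretches like that are where a VAD pass, or a lower `min_sil`, would have to help.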
plot_scripts.zip