Source code for the paper "Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss"
Regotron is a regularized version of Tacotron2. Specifically, it penalizes the attention weights so that the learned alignment is monotonic. The essential modification is an additional loss term that acts as a regularizer.
Our results on the LJSpeech dataset show that Regotron:
- builds a monotonic alignment more quickly than Tacotron2
- is more stable during training (no spiky behavior)
- is more robust (fewer common TTS mistakes)
- improves MOS compared to Tacotron2
- adds minimal training overhead (one extra loss term) and has the same inference cost/time
This repo is built upon Nvidia's DeepLearningExamples Tacotron2 implementation. We use a pretrained English WaveGlow vocoder.
The following components are required:
The LJ Speech dataset is also required (or any other speech dataset in the LJSpeech filelist format).
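For readers preparing their own dataset, the filelist format can be illustrated with a short sketch. It assumes the pipe-separated `audio_path|transcript` layout used by the filelists in Nvidia's Tacotron2 implementation; the function name and the example entries are hypothetical and not part of this repo.

```python
from typing import List, Tuple


def parse_filelist(lines: List[str], sep: str = "|") -> List[Tuple[str, str]]:
    """Split each 'audio_path|transcript' line into a (path, text) pair."""
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first separator so the transcript may contain '|'.
        path, text = line.split(sep, 1)
        pairs.append((path, text))
    return pairs


# Hypothetical entries, for illustration only.
example = [
    "LJSpeech-1.1/wavs/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned",
    "",
    "LJSpeech-1.1/wavs/LJ001-0002.wav|in being comparatively modern.",
]

for path, text in parse_filelist(example):
    print(path, "->", text)
```

A custom dataset would then only need one such line per utterance, with paths that resolve from the directory where training is launched.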
To train Regotron, use the following steps:
-
Clone the repository.
git clone https://github.com/efthymisgeo/regotron
-
Download and preprocess the dataset. Use the ./scripts/prepare_dataset.sh download script to automatically download and preprocess the training, validation and test datasets. To run this script, issue:
bash scripts/prepare_dataset.sh
Data is downloaded to the ./LJSpeech-1.1 directory (on the host). The ./LJSpeech-1.1 directory is mounted to the /workspace/tacotron2/LJSpeech-1.1 location in the NGC container.
-
Build the Regotron container.
bash scripts/docker/build.sh
-
Start an interactive session in the NGC container to run training/inference. After you build the container image, you can start an interactive CLI session with:
bash scripts/docker/interactive_mount_paper.sh
The interactive.sh script requires that the location of the dataset is specified. For example, LJSpeech-1.1.
-
To preprocess raw speech data and produce mels for Regotron training, use the ./scripts/prepare_mels.sh script:
bash scripts/prepare_mels.sh
The preprocessed mel-spectrograms are stored in the ./LJSpeech-1.1/mels directory.
-
Train Regotron
bash scripts/multi_regotron.sh
To train Tacotron2 with the setup in the paper:
bash scripts/multi_taco2.sh
-
Inference (Generate Speech)
You will need to have already trained Regotron/Tacotron2 by this step, or to have downloaded a pretrained version from the Nvidia hub or the link in this repo. For the vocoder, we use a pretrained WaveGlow. Store the Regotron checkpoint under the pretrained_rego folder, the Tacotron2 checkpoint under pretrained_tacotron2, and WaveGlow under the vocoder folder.
This script generates speech based on the Regotron model:
bash generate_wav.sh \
    en_phrases \
    rego_output_folder \
    pretrained_regotron/checkpoint_Tacotron2_1500.pt \
    vocoder/nvidia_waveglowpyt_fp32_20190427.pt
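When comparing several checkpoints, it can be convenient to assemble this call programmatically. The sketch below assumes the four positional arguments are, in order, the phrases input, the output folder, the acoustic model checkpoint and the vocoder checkpoint, as the example invocation suggests; the helper name is hypothetical and the script itself is not executed here.

```python
import shlex
from typing import List


def build_generate_command(phrases: str, out_dir: str, acoustic_ckpt: str,
                           vocoder_ckpt: str,
                           script: str = "generate_wav.sh") -> List[str]:
    """Return the argv list for one generate_wav.sh call (assumed argument order)."""
    return ["bash", script, phrases, out_dir, acoustic_ckpt, vocoder_ckpt]


cmd = build_generate_command(
    "en_phrases",
    "rego_output_folder",
    "pretrained_regotron/checkpoint_Tacotron2_1500.pt",
    "vocoder/nvidia_waveglowpyt_fp32_20190427.pt",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems if a path contains spaces.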
- tacotron2: contains the source code for the Tacotron2/Regotron architecture
- tacotron2/loss_function.py: contains the Regotron loss
- --epochs - number of epochs (default: 1501)
- --learning-rate - learning rate (default: 1e-3)
- --batch-size - batch size (default FP16: 104)
- --amp - use mixed precision training
- --cpu - use CPU with TorchScript for inference
- --sampling-rate - sampling rate in Hz of input and output audio (22050)
- --filter-length - (1024)
- --hop-length - hop length for FFT, i.e., sample stride between consecutive FFTs (256)
- --win-length - window size for FFT (1024)
- --mel-fmin - lowest frequency in Hz (0.0)
- --mel-fmax - highest frequency in Hz (8000.0)
- --anneal-steps - epochs at which to anneal the learning rate (500 1000 1500)
- --anneal-factor - factor by which to anneal the learning rate (FP16/FP32: 0.3/0.1)
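The audio settings above imply a fixed mel-spectrogram frame rate. As a small worked example using only the listed defaults (22050 Hz sampling rate, hop length 256, window length 1024), the arithmetic can be sketched as follows; the function names are illustrative and not part of the repo.

```python
SAMPLING_RATE = 22050  # Hz, from --sampling-rate
HOP_LENGTH = 256       # samples, from --hop-length
WIN_LENGTH = 1024      # samples, from --win-length


def frames_per_second(sampling_rate: int, hop_length: int) -> float:
    """Number of mel frames produced per second of audio."""
    return sampling_rate / hop_length


def samples_to_ms(num_samples: int, sampling_rate: int) -> float:
    """Duration in milliseconds of a span of audio samples."""
    return 1000.0 * num_samples / sampling_rate


print(f"frame rate:  {frames_per_second(SAMPLING_RATE, HOP_LENGTH):.2f} frames/s")
print(f"hop size:    {samples_to_ms(HOP_LENGTH, SAMPLING_RATE):.2f} ms")
print(f"window size: {samples_to_ms(WIN_LENGTH, SAMPLING_RATE):.2f} ms")
```

With these defaults each decoder frame covers roughly 11.6 ms of audio, so one second of speech corresponds to about 86 mel frames.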
- --enable-align-loss - use this argument to enable the Regotron loss
- --delta-align - $\delta$, relaxation hyperparameter, default=0.01
- --weight-align - $\lambda$, monotonic loss weight, default=1e-5
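To give some intuition for how $\delta$ and $\lambda$ enter training, here is a minimal pure-Python sketch of a monotonic alignment penalty. It assumes a hinge penalty on the attention centroids (mean attended input position) of consecutive decoder steps, with $\delta$ as the margin and $\lambda$ as the weight on the extra term. The function names, the use of an unscaled margin, and the absence of normalization are all assumptions made for illustration; the formulation actually used is the one in tacotron2/loss_function.py and in the paper.

```python
from typing import List

Attention = List[List[float]]  # rows: decoder steps, columns: encoder positions


def attention_centroids(attn: Attention) -> List[float]:
    """Mean attended encoder position for each decoder step."""
    return [sum(j * w for j, w in enumerate(row)) for row in attn]


def monotonic_alignment_penalty(attn: Attention, delta: float = 0.01) -> float:
    """Hinge penalty whenever a centroid fails to advance by at least delta (assumed form)."""
    c = attention_centroids(attn)
    return sum(max(0.0, prev - cur + delta) for prev, cur in zip(c, c[1:]))


def total_loss(base_loss: float, attn: Attention,
               weight: float = 1e-5, delta: float = 0.01) -> float:
    """Base Tacotron2 loss plus the weighted regularizer (assumed combination)."""
    return base_loss + weight * monotonic_alignment_penalty(attn, delta)


# Toy alignments: one moves forward through the input, one moves backward.
forward = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
backward = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

print("forward penalty: ", monotonic_alignment_penalty(forward))
print("backward penalty:", monotonic_alignment_penalty(backward))
```

Under these assumptions, an alignment that advances steadily incurs no penalty, while one that moves backward is penalized in proportion to how far each centroid retreats.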