
Development has ended indefinitely.

This was my high school graduation project and my very first implementation of an academic paper (written without any reference source code).
I was only able to develop the first model (the lyric alignment model). The data I collected wasn't clean enough for the next-stage model.
The project's purpose was to synthesize Korean songs given users' voices.
It is based on the paper DeepSinger: Singing Voice Synthesis with Data Mined From the Web.

After studying the previous research, I developed everything from scratch, from data collection to model training.
Data collection and preprocessing
- Using Scrapy with Python, I collected audio from YouTube and lyrics from Melon.
- Using Spleeter, I removed the MR (the instrumental backing track) from the audio. (The paper's authors used recording-studio data, but I wasn't able to get my hands on such data.)
- Normalized the audio volume.
- Removed the parts that have no vocals, using some algorithms.
- Converted the lyrics to IPA (International Phonetic Alphabet).
- Filtered out songs that don't qualify (such as long songs or songs with multiple vocalists).
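
The volume normalization and vocal-free segment removal steps can be illustrated with a minimal sketch. The README does not say which algorithms were actually used, so everything here is an assumption: peak normalization stands in for the volume step, a fixed-frame RMS threshold stands in for the silence removal, and the function names, `frame_len`, and `rms_threshold` are hypothetical.

```python
import math

def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def drop_silent_frames(samples, frame_len=4, rms_threshold=0.05):
    """Keep only fixed-size frames whose RMS energy reaches rms_threshold.

    Assumption: low energy means "no vocals" once the backing track is removed.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= rms_threshold:
            kept.extend(frame)
    return kept

# Toy signal: a near-silent stretch followed by a louder "vocal" stretch.
signal = [0.0, 0.001, -0.001, 0.0, 0.2, -0.3, 0.25, -0.1]
normalized = peak_normalize(signal)
voiced = drop_silent_frames(normalized)
```

With real audio the frame length would be in the range of tens of milliseconds, and the threshold would need tuning against the separated vocal tracks.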
The paper is composed of two models: one for finding out which audio frames match the lyrics, and the other for voice synthesis.
Lyric alignment model
- Seq2Seq
  - A Seq2Seq model with an encoder and a decoder
  - Input: a song → the model recognizes the lyrics from it
  - Using the model's attention, we can find the relation between audio frames and IPA symbols.
  - It is important to train a model with a clear attention graph. (See the images below.)
- Location-sensitive guided attention
  - The attention is calculated from the previous attention values and the outputs of the encoder and decoder.
  - The further the attention deviates from a diagonal line, the bigger the loss given to the model.
- A variation of truncated back-propagation through time (training algorithm)
  - During training, the model has to exceed an attention score of 0.6 before it can move on to the next part of the song.
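
The idea of penalizing attention that strays from the diagonal can be sketched as follows. This is one common form of a guided attention penalty, using a Gaussian-shaped weight around the diagonal; it is an assumption for illustration and may differ from the formulation in the paper or in this repository. The function name and the width parameter `g` are hypothetical.

```python
import math

def guided_attention_penalty(attention, g=0.2):
    """Mean attention mass weighted by its distance from the diagonal.

    attention[n][t] is the weight decoder step n places on encoder frame t.
    """
    n_steps = len(attention)
    n_frames = len(attention[0])
    total = 0.0
    for n, row in enumerate(attention):
        for t, a in enumerate(row):
            # Relative positions in [0, 1); equal positions lie on the diagonal.
            offset = n / n_steps - t / n_frames
            # Weight is 0 on the diagonal and approaches 1 far away from it.
            weight = 1.0 - math.exp(-(offset ** 2) / (2 * g ** 2))
            total += a * weight
    return total / (n_steps * n_frames)

# A perfectly diagonal alignment versus a reversed one.
diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
anti_diagonal = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
```

Adding a term like this to the training loss pushes the model toward monotonic, near-diagonal alignments: the diagonal matrix above incurs no penalty, while the reversed one does.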
