Currently, the upper limit on the duration of audio chunks taken as input by Persephone is 10 seconds. This is an issue for the real-world deployment of Persephone, because many documents in archives such as the Pangloss Collection are divided into longer chunks.
For example, the document “Romanmangan, the fairy from the other world” has a duration of 1,890 seconds and is divided into 212 sentences. Seventy sentences, amounting to more than half of the total duration of this substantial story, are above the 10-second limit and are thus not used in training.
A reviewer of a paper at SLTU suggested performing Voice Activity Detection (VAD) to distinguish silence from non-silence, and then cutting the long waveform at the silent parts into smaller pieces. This way, all the data could still be used for training.
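
To make the suggestion concrete, here is a minimal sketch of how silence-based splitting might look. It is not part of Persephone: the function names (`frame_rms`, `silence_cut_points`, `split_waveform`) and all parameter defaults are hypothetical, and a simple RMS energy threshold stands in for a real VAD. It assumes the waveform is a list of floats in the range [-1, 1], and it does not address how the transcription would be divided to match the resulting audio pieces.

```python
import math

def frame_rms(samples, frame_len):
    """Root-mean-square energy of each non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def silence_cut_points(samples, frame_len, threshold, min_silent_frames):
    """Sample indices at the midpoint of each sufficiently long silent run."""
    cuts, run_start = [], None
    energies = frame_rms(samples, frame_len)
    # The infinite sentinel closes a silent run that reaches the end of the audio.
    for idx, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            if idx - run_start >= min_silent_frames:
                cuts.append((run_start + idx) * frame_len // 2)
            run_start = None
    return cuts

def split_waveform(samples, sample_rate, max_seconds=10.0, frame_ms=20,
                   threshold=0.01, min_silence_ms=200):
    """Greedily split into (start, end) sample ranges of at most max_seconds.

    Cuts are placed at the latest silence that fits; if no silence is
    available within the limit, the chunk is hard-cut at max_seconds.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    max_len = int(sample_rate * max_seconds)
    cuts = silence_cut_points(samples, frame_len, threshold,
                              max(1, min_silence_ms // frame_ms))
    chunks, start = [], 0
    while len(samples) - start > max_len:
        candidates = [c for c in cuts if start < c <= start + max_len]
        end = candidates[-1] if candidates else start + max_len
        chunks.append((start, end))
        start = end
    chunks.append((start, len(samples)))
    return chunks
```

Cutting at the midpoint of a pause leaves a little silence on both sides of each piece. The hard-cut fallback is only one possible choice; such segments could instead be skipped, as over-long sentences are at present.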