Making use of audio chunks of more than 10 seconds

Currently, the upper limit on the duration of audio chunks taken as input by `Persephone` is 10 seconds. This is an issue for the real-world deployment of `Persephone`, because many documents in archives such as the [Pangloss Collection](https://pangloss.cnrs.fr/) are divided into longer chunks.

For example, the document “[Romanmangan, the fairy from the other world](https://doi.org/10.24397/pangloss-0002300)” has a duration of 1,890 seconds and is divided into 212 sentences. Seventy of those sentences, amounting to more than half of the total duration of this substantial story, exceed the 10-second limit and are therefore not used in training.

A reviewer of a paper at SLTU suggested performing Voice Activity Detection (VAD) to distinguish silence from non-silence, and then cutting each long waveform at its silent parts into smaller pieces. This way, all of the data could still be used for training.
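To make the suggestion concrete, here is a minimal sketch of silence-based splitting. It is not part of `Persephone`: the function names, the 20 ms frame size, and the RMS threshold of 0.01 are all assumptions chosen for illustration, and a simple energy threshold stands in for a real VAD. Samples are assumed to be floats in the range [-1, 1].

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each consecutive, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies


def split_at_silence(samples, sample_rate, max_len_s=10.0,
                     frame_ms=20, threshold=0.01):
    """Split a waveform into (start, end) sample ranges of at most max_len_s.

    Hypothetical helper: cuts are placed in the middle of the latest
    low-energy frame before the limit. If no such frame exists, the
    waveform is cut at the limit itself (a hard cut).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_len = int(sample_rate * max_len_s)
    # A frame counts as "silent" when its RMS falls below the assumed threshold.
    silent = [e < threshold for e in frame_rms(samples, frame_len)]

    segments = []
    start = 0
    while len(samples) - start > max_len:
        limit = start + max_len
        cut = limit  # fallback: hard cut when no silence is found
        first = start // frame_len
        last = min(limit // frame_len, len(silent) - 1)
        # Scan backwards so each piece is as long as the limit allows.
        for i in range(last, first - 1, -1):
            midpoint = i * frame_len + frame_len // 2
            if silent[i] and start < midpoint <= limit:
                cut = midpoint
                break
        segments.append((start, cut))
        start = cut
    segments.append((start, len(samples)))
    return segments
```

Note that this only handles the audio side. For training, the transcription of each long sentence would also have to be divided at the corresponding points, which the sketch does not address.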

Issue #230
