Do you have a suggestion for how to process mini-batches of multiple utterances at a time?
We have refactored our data so that every feature file has a fixed number of frames. We can have dataLoader load the .bin features as a matrix of utterances (rows) x frame features (columns), but it seems we would need to modify ctc_fast.pyx to loop over the utterances and somehow combine the gradients. The loop over the utterances seems easy enough, but we are not sure how to combine the gradients.
Have you already tested multi-utterance mini-batches and decided that they are not appropriate for the task?
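
To make the question concrete, here is a minimal sketch of one way the gradients might be combined. It assumes the mini-batch loss is defined as the sum (or mean) of the per-utterance losses, in which case the mini-batch gradient is simply the sum (or mean) of the per-utterance gradients. `utt_loss_and_grad` is a hypothetical stand-in that uses a toy quadratic loss so the sketch runs on its own; it is not the actual routine in ctc_fast.pyx, and `batch_loss_and_grad` is likewise an illustrative name.

```python
# Hypothetical stand-in for a per-utterance CTC routine: returns
# (loss, gradient) for ONE utterance. A toy quadratic loss on a linear
# model is used here; this is NOT the code in ctc_fast.pyx.
def utt_loss_and_grad(params, feats, target):
    pred = sum(p * x for p, x in zip(params, feats))
    err = pred - target
    loss = 0.5 * err * err
    grad = [err * x for x in feats]  # d(loss)/d(params)
    return loss, grad


def batch_loss_and_grad(params, batch, average=True):
    """Loop over utterances, accumulating the loss and the gradient."""
    total_loss = 0.0
    total_grad = [0.0] * len(params)
    for feats, target in batch:
        loss, grad = utt_loss_and_grad(params, feats, target)
        total_loss += loss
        # Element-wise accumulation of the per-utterance gradient.
        total_grad = [acc + g for acc, g in zip(total_grad, grad)]
    if average:
        # Dividing by the batch size keeps the step size independent
        # of how many utterances are in the mini-batch.
        n = len(batch)
        total_loss /= n
        total_grad = [g / n for g in total_grad]
    return total_loss, total_grad


# Three toy "utterances", each a (features, target) pair.
params = [0.5, -1.0]
batch = [([1.0, 2.0], 1.0), ([0.0, 1.0], -2.0), ([3.0, 1.0], 0.5)]
loss, grad = batch_loss_and_grad(params, batch)
```

Whether to sum or average is a design choice: averaging makes the learning rate less sensitive to the batch size, while summing matches the gradient of the total loss directly.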