Save spectrograms with np.savez_compressed #1

@NickleDave

Description

  • can we reduce file size
  • without affecting training
  • and without requiring a ton of re-engineering of the dataset prep / datapipe class

Currently we use a "just a bunch of files" approach, which lets us pair the same npz file (the spectrogram, the input to a model) with multiple npy files (the labels, the target of the model).
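A minimal sketch of what the title proposes: swap `np.savez` for `np.savez_compressed` and compare file sizes. The array shape, the key name `s`, and the file names here are made up for illustration, and the synthetic array is mostly zeros, so it compresses far better than real spectrograms might. The actual savings would need to be measured on the real dataset.

```python
import os
import tempfile

import numpy as np

# Hypothetical spectrogram: mostly "silence" with one active region.
# Real spectrograms will likely compress less well than this.
rng = np.random.default_rng(0)
spect = np.zeros((257, 1000), dtype=np.float32)
spect[50:80, 200:400] = rng.random((30, 200), dtype=np.float32)

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "spect_plain.npz")
compressed_path = os.path.join(tmp_dir, "spect_compressed.npz")

# Same array, same (assumed) key "s", two ways of writing the npz file.
np.savez(plain_path, s=spect)
np.savez_compressed(compressed_path, s=spect)

plain_size = os.path.getsize(plain_path)
compressed_size = os.path.getsize(compressed_path)
print(f"np.savez:            {plain_size} bytes")
print(f"np.savez_compressed: {compressed_size} bytes")

# Loading is identical for both files, so code that reads the npz
# files should not need to change.
with np.load(compressed_path) as npz:
    loaded = npz["s"]
```

Since `np.load` reads both kinds of file the same way, this keeps the one-file-per-spectrogram layout intact. The cost is extra decompression time on every load, which is worth timing in the training loop.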

A sort of worst case would be that we get a big benefit from jamming all the spectrograms into a single zarr archive, because then we would have to re-engineer all the code that assumes the spectrograms exist as separate files: the prep step, the dataset class, etc. The reason to prefer separate files is mainly for tracking metadata and for readability, but maybe I am overvaluing this.

This doesn't need to be the highest priority, but it could make it easier to upload the dataset.

edit: if we were to cram all the spectrograms into a single zarr archive, then we might want to access them with a memory-mapping approach. The DAS docs suggest it's not easy to squeeze good performance out of this:

While zarr, h5py, and xarray provide mechanisms for out-of-memory access, they tend to be slower in our experience or require fine tuning to reach the performance reached with memmapped npy files.

I previously found examples of PyTorch + zarr in other domains, but similarly got the impression that there's no simple, clear process to follow and that it's not easy to troubleshoot. That said, the point about just mem-mapping npy files makes me wonder if I should try that.
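For reference, a rough sketch of what mem-mapping an npy file could look like with plain NumPy. The array shape, file name, and window bounds are placeholders, not anything from the current codebase. One caveat, as far as I know: `mmap_mode` only applies to uncompressed `.npy` files, not to `.npz` archives, so this would be a trade-off against the compression idea above.

```python
import os
import tempfile

import numpy as np

# Hypothetical spectrogram saved as a plain, uncompressed .npy file.
spect = np.arange(257 * 1000, dtype=np.float32).reshape(257, 1000)
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "spect.npy")
np.save(npy_path, spect)

# mmap_mode="r" opens the file read-only without loading it into memory.
mmapped = np.load(npy_path, mmap_mode="r")

# Slicing touches only the needed part of the file; np.array() copies
# just that window into memory (e.g., one training window).
window = np.array(mmapped[:, 100:200])
print(type(mmapped).__name__, window.shape)
```

A dataset class could hold the memory-mapped array and slice out one window per `__getitem__` call, but whether that beats loading whole compressed npz files would need benchmarking.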
