It seems like every available method of downloading the dataset comes up about 6k samples short. Hugging Face's load_dataset() option, downloading the tar.xz files manually, git lfs, etc. all run into the same issue as of this date.
The downloaded dataset ends up containing ~94k samples (94,164 per my most recent attempt), which makes it quite challenging to reproduce the work or leverage the excellent dataset/dataloader work that has already been done.
If I'm eyeballing it, the data_10.tar.xz file looks like the most likely culprit: the other .tar.xz files are all around ~7.8 GB in size, while data_10.tar.xz is only 3.25 GB.
It's certainly possible I'm missing something, but I haven't been able to figure out an effective way around this issue. Any assistance in the matter would be appreciated!