Do not scan the same files multiple times in ScalableShardDataset #137
It seems that ScalableShardDataset can run os.walk() over the same folders, and then get the length of each file, potentially hundreds of times depending on the number of logical shards in one rank, because each StreamingDocDataset runs a full os.walk() in its setup:
fms-fsdp/fms_fsdp/utils/dataset_utils.py, line 1226 at 23a9e39:

`[d.setup() for d in self.data]`

fms-fsdp/fms_fsdp/utils/dataset_utils.py, line 925 at 23a9e39:

`for root, dirs, files in os.walk(datapath, topdown=False)`
This can be especially inefficient when there are many files. Should os.walk() run only once, with its result reused?
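One way to implement the suggested reuse is to memoize the walk per datapath, so that however many StreamingDocDataset instances call setup() on the same rank, the filesystem is traversed only once. This is a minimal sketch, not the actual fms-fsdp code; `walk_once` is a hypothetical helper name, and the cached-tuple representation is an assumption about what setup() needs:

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def walk_once(datapath: str):
    """Walk datapath once and cache the result for reuse.

    Hypothetical helper, not part of fms-fsdp: each logical shard's
    setup() would call this instead of os.walk() directly. The
    generator is materialized into tuples so the cached result can
    be iterated any number of times.
    """
    return tuple(
        (root, tuple(dirs), tuple(files))
        for root, dirs, files in os.walk(datapath, topdown=False)
    )


# Demo: two calls return the same cached object, so the directory
# tree is scanned only once no matter how many shards ask for it.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "shard0.arrow"), "w").close()

first = walk_once(demo_dir)
second = walk_once(demo_dir)
assert first is second  # cache hit: no second os.walk()
```

The same idea would also cover the per-file length lookups mentioned above: computing each file's size once while building the cached listing, rather than once per logical shard.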