Do not scan the same files multiple times in ScalableShardDataset #137
It seems that ScalableShardDataset can run os.walk() over the same folders, and then get the length of each file, potentially hundreds of times depending on the number of logical shards in one rank, because each StreamingDocDataset runs a full os.walk() in its setup:
fms-fsdp/fms_fsdp/utils/dataset_utils.py, line 1226 at 23a9e39:

`[d.setup() for d in self.data]`

fms-fsdp/fms_fsdp/utils/dataset_utils.py, line 925 at 23a9e39:

`for root, dirs, files in os.walk(datapath, topdown=False)`
This can be especially inefficient when there are many files. Should os.walk() run only once, with its result reused?
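One way to implement the suggested reuse is to memoize the walk per datapath, so that however many StreamingDocDataset instances call setup() on the same rank, the filesystem is traversed only once. This is a minimal sketch, not the actual fms-fsdp code; `walk_once` is a hypothetical helper name, and the cached-tuple representation is an assumption about what setup() needs:

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def walk_once(datapath: str):
    """Walk datapath once and cache the result for reuse.

    Hypothetical helper, not part of fms-fsdp: each logical shard's
    setup() would call this instead of os.walk() directly. The
    generator is materialized into tuples so the cached result can
    be iterated any number of times.
    """
    return tuple(
        (root, tuple(dirs), tuple(files))
        for root, dirs, files in os.walk(datapath, topdown=False)
    )


# Demo: two calls return the same cached object, so the directory
# tree is scanned only once no matter how many shards ask for it.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "shard0.arrow"), "w").close()

first = walk_once(demo_dir)
second = walk_once(demo_dir)
assert first is second  # cache hit: no second os.walk()
```

The same idea would also cover the per-file length lookups mentioned above: computing each file's size once while building the cached listing, rather than once per logical shard.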