Hi Team and @philipp-fischer,
I'm testing a workload that uses fully random access, with shuffle_buffer_size==1 and max_samples_per_sequence==1. Storage throughput is quite poor because of repeated OpenFile calls: the file handles are released and recreated for every micro-batch.
Digging into the code a bit, I noticed a few places that seem to allow customizing this behavior. Namely,
itar_cache_size (code ref) which controls the size of the LRU cache that stores the file descriptors.
parallel_shard_iters (code ref) which constructs the itar_cache_size parameter.
Ideally, with fully random access, we'd want all file handles to stay open until the end of training.
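To illustrate the effect I believe I'm seeing, here is a minimal sketch of an LRU cache of file handles. It is not Energon's implementation: FileHandleLRU, count_opens, and the temporary shard files are hypothetical names made up for this example. It only shows that when the cache holds fewer handles than the number of shards being visited, a cyclic or random access pattern can force a reopen on nearly every access.

```python
from collections import OrderedDict
import os
import tempfile


class FileHandleLRU:
    """Toy LRU cache of open file handles (hypothetical, not Energon's class)."""

    def __init__(self, max_open):
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> open file object
        self.open_calls = 0  # number of times a file actually had to be opened

    def get(self, path):
        handle = self._handles.get(path)
        if handle is not None:
            self._handles.move_to_end(path)  # mark as most recently used
            return handle
        handle = open(path, "rb")
        self.open_calls += 1
        self._handles[path] = handle
        if len(self._handles) > self.max_open:
            # Evict and close the least recently used handle.
            _, evicted = self._handles.popitem(last=False)
            evicted.close()
        return handle

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


def count_opens(num_shards, cache_size, accesses):
    """Create num_shards temp files, visit them in the given order, return open count."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(num_shards):
            path = os.path.join(tmp, f"shard_{i}.bin")
            with open(path, "wb") as f:
                f.write(b"x")
            paths.append(path)
        cache = FileHandleLRU(cache_size)
        try:
            for idx in accesses:
                cache.get(paths[idx]).seek(0)
        finally:
            cache.close()
        return cache.open_calls


if __name__ == "__main__":
    pattern = [0, 1, 2, 3] * 3  # three passes over four shards
    print("cache_size=2:", count_opens(4, 2, pattern), "opens")
    print("cache_size=4:", count_opens(4, 4, pattern), "opens")
```

In this toy setup, a cache that can hold every shard opens each file only once, which is the behavior I'd like to get for the fully random access case.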
I'm using the basic Megatron-Energon code:
train_dataset = get_train_dataset(
    cfg.model.energon.path,
    batch_size=cfg.model.micro_batch_size,
    task_encoder=task_encoder,
    worker_config=worker_config,
    max_samples_per_sequence=max_samples_per_sequence,
    packing_buffer_size=None,
    shuffle_buffer_size=cfg.model.energon.shuffle_buffer_size,
    split_part='train',
)
data_loader = get_savable_loader(train_dataset, worker_config=worker_config)
While I figure out how to pass these parameters through, I wanted to check whether I'm on the right track, and whether you can help me configure them. Thanks!