File Handles released frequently during fully random access workloads

Hi Team and @philipp-fischer,

I'm testing a workload that uses fully random access (`shuffle_buffer_size==1` and `max_samples_per_sequence==1`), and I noticed that storage throughput is poor because of repeated `OpenFile` calls: the file handles are released and recreated for every micro-batch.

Digging into the code a bit, I noticed a few places that seem to allow customizing this behavior, namely:
1. `itar_cache_size` ([code ref](https://github.com/NVIDIA/Megatron-Energon/blob/develop/src/megatron/energon/flavors/webdataset/itar_reader.py#L55C9-L55C24)), which controls the size of the LRU cache that stores the file descriptors.
2. `parallel_shard_iters` ([code ref](https://github.com/NVIDIA/Megatron-Energon/blob/develop/src/megatron/energon/flavors/webdataset/base_webdataset.py#L116)), which is used to construct the `itar_cache_size` parameter.

Ideally, in the fully random access case, we'd want all file handles to stay open until the end of training.
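To illustrate why I think the cache size matters here, below is a minimal toy sketch of an LRU cache of file handles under uniformly random shard access. It is not Energon's implementation: `HandleCache`, `count_opens`, and the shard names are hypothetical, and "opening" a file is only simulated by incrementing a counter.

```python
import random
from collections import OrderedDict


class HandleCache:
    """Toy LRU cache of simulated file handles, keyed by shard name (hypothetical)."""

    def __init__(self, max_open):
        self.max_open = max_open
        self.handles = OrderedDict()
        self.opens = 0  # number of simulated OpenFile calls

    def get(self, shard):
        if shard in self.handles:
            # Cache hit: mark the handle as most recently used.
            self.handles.move_to_end(shard)
            return self.handles[shard]
        # Cache miss: simulate opening the file.
        self.opens += 1
        handle = object()
        self.handles[shard] = handle
        if len(self.handles) > self.max_open:
            # Evict (i.e. "close") the least recently used handle.
            self.handles.popitem(last=False)
        return handle


def count_opens(num_shards, cache_size, num_reads, seed=0):
    """Count simulated opens for uniformly random reads across shards."""
    rng = random.Random(seed)
    cache = HandleCache(cache_size)
    for _ in range(num_reads):
        cache.get(f"shard_{rng.randrange(num_shards)}")
    return cache.opens


if __name__ == "__main__":
    for size in (2, 16, 64):
        print(f"cache_size={size}: opens={count_opens(64, size, 10_000)}")
```

In this toy model, a cache that can hold every shard opens each file at most once, while a cache much smaller than the number of shards reopens a file on almost every read, which matches the behavior I'm seeing.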

I'm using the basic Megatron-Energon code:
```python
from megatron.energon import get_savable_loader, get_train_dataset

# task_encoder, worker_config, and cfg are defined elsewhere in my training setup.
train_dataset = get_train_dataset(
    cfg.model.energon.path,
    batch_size=cfg.model.micro_batch_size,
    task_encoder=task_encoder,
    worker_config=worker_config,
    max_samples_per_sequence=max_samples_per_sequence,
    packing_buffer_size=None,
    shuffle_buffer_size=cfg.model.energon.shuffle_buffer_size,
    split_part='train',
)
data_loader = get_savable_loader(train_dataset, worker_config=worker_config)
```

While I figure out how to pass these parameters through, I wanted to check whether I'm on the right track and whether you could help me configure them quickly. Thanks!

File Handles released frequently during fully random access workloads #81
