Implement energon prepare with a remote dataset (using local temp)#221

Open
voegtlel wants to merge 5 commits into develop from feature/remote_dataprep

@voegtlel
Collaborator

@voegtlel voegtlel commented Apr 8, 2026

Fixes #91

Implements energon prepare for remote (including media metadata preparation)

@voegtlel voegtlel requested a review from philipp-fischer April 8, 2026 16:01
@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 8c093f4 to 70caf01 Compare April 8, 2026 16:03
@radulescupetru

I get the following error: ValueError: Path msc://default/path_to_s3_dataset is not local

Preparing filesystem dataset and computing media metadata...
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/conda_env/bin/energon", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/megatron/energon/tools/prepare_media.py", line 110, in command
    stored = prepare_filesystem_dataset(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/megatron/energon/media/filesystem_prepare.py", line 44, in prepare_filesystem_dataset
    root = root_path.local_path()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/conda_env/lib/python3.12/site-packages/megatron/energon/epathlib/epath.py", line 258, in local_path
    raise ValueError(f"Path {self} is not local")

There's also a TODO in that method:

Only supporting local file system, because sqlite does not support remote file systems.

TODO: Implement remote file systems. Maybe create locally in tmp then upload?
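The "create locally in tmp, then upload" idea from that TODO could look roughly like the sketch below. This is not the code from this PR: `build_sqlite_then_upload`, the `upload` callback, the `media` table, and the `index.sqlite` file name are all hypothetical and only illustrate the pattern of letting sqlite work on a local file and copying the result to the remote store afterwards.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable


def build_sqlite_then_upload(
    upload: Callable[[Path], None],
    db_name: str = "index.sqlite",  # hypothetical file name
) -> None:
    """Create a SQLite file in a local temp dir, then hand it to `upload`."""
    with tempfile.TemporaryDirectory() as tmp:
        local_db = Path(tmp) / db_name
        conn = sqlite3.connect(local_db)
        try:
            # Hypothetical schema, standing in for the real metadata tables.
            conn.execute("CREATE TABLE media (path TEXT PRIMARY KEY, size INTEGER)")
            conn.execute("INSERT INTO media VALUES (?, ?)", ("a.jpg", 123))
            conn.commit()
        finally:
            # Close before uploading so the file on disk is complete.
            conn.close()
        # `upload` is an assumption: any callable that copies a local file
        # to the remote location (object store client, EPath helper, ...).
        upload(local_db)


if __name__ == "__main__":
    # A local directory stands in for the remote store in this example.
    fake_remote = Path(tempfile.mkdtemp())
    build_sqlite_then_upload(lambda p: shutil.copy(p, fake_remote / p.name))
    print(sorted(f.name for f in fake_remote.iterdir()))
```

The temp directory is removed when the `with` block exits, so the upload has to happen inside it.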

@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 9905571 to 9cca0d0 Compare April 30, 2026 11:31
@voegtlel
Collaborator Author

@radulescupetru Sorry for the delay, I had other priorities. This is now also implemented for filesystem:// links. Can you try again?

@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 9cca0d0 to 6a3b40c Compare April 30, 2026 11:33
Comment thread src/megatron/energon/local_copy.py
@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 25a9d35 to 598db74 Compare May 6, 2026 12:09
Comment thread src/megatron/energon/epathlib/epath.py Outdated
Comment thread src/megatron/energon/epathlib/epath.py
Comment thread src/megatron/energon/media/filesystem_prepare.py Outdated
Comment thread src/megatron/energon/media/filesystem_prepare.py
Comment thread src/megatron/energon/media/filesystem_prepare.py Outdated
Comment thread src/megatron/energon/media/filesystem_prepare.py Outdated
Comment thread src/megatron/energon/tools/prepare_media.py
Comment thread src/megatron/energon/flavors/webdataset/prepare.py
Comment thread src/megatron/energon/flavors/webdataset/prepare.py Outdated
Comment thread src/megatron/energon/local_copy.py
…path handling. Fix S3 emulator timestamp handling
@voegtlel voegtlel force-pushed the feature/remote_dataprep branch from 582a007 to ac7dc96 Compare May 21, 2026 16:24
# Prefix to be removed from found paths to remap to relative paths
root_prefix = self._internal_str_path.lstrip("/")

for obj in self.fs.list_recursive(self._internal_str_path):
Collaborator

@philipp-fischer philipp-fischer May 22, 2026

Either we or the MSC team need to fix this for local paths. This is way slower than os.walk. Maybe for now we should use os.walk here explicitly for local paths. In MSC they do os.listdir + sorting + isdir/isfile plus object construction for each file, i.e. lots of overhead.
But let's make sure we still preserve DSS URLs when using os.walk.
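A rough sketch of what an explicit os.walk fast path for local paths could look like. `walk_relative_files` is a hypothetical helper, not part of energon or MSC, and it only yields paths relative to the root. How the caller re-attaches the original prefix (e.g. to preserve DSS URLs) is left open here.

```python
import os
from typing import Iterator


def walk_relative_files(root: str) -> Iterator[str]:
    """Yield file paths under `root`, relative to `root`, with '/' separators."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place makes os.walk descend in a deterministic order.
        dirnames.sort()
        rel_dir = os.path.relpath(dirpath, root)
        for name in sorted(filenames):
            rel = name if rel_dir == "." else os.path.join(rel_dir, name)
            # Normalize separators so results look the same on all platforms.
            yield rel.replace(os.sep, "/")
```

os.walk is built on os.scandir, so it usually avoids a separate stat call per entry, which is where most of the per-file overhead mentioned above would be saved.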


owns_remote_sqlite_tmp = False
remote_sqlite_tmp_dir: Optional[Path] = None
if not parent_path.is_local():
Collaborator

@philipp-fischer philipp-fischer May 22, 2026

parent_path.is_local() will evaluate to True for DSS caches, because they are local. Can we catch the read-only case somewhere early to get a useful error? (Also for the other prepare entry points.)
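One possible shape for such an early check, as a sketch only: `ensure_writable_dir` is a hypothetical helper, and whether os.access is a reliable signal for a DSS cache mount is an assumption that would need to be verified.

```python
import os
from pathlib import Path


def ensure_writable_dir(path: Path) -> None:
    """Fail early with a readable message if `path` cannot be written to."""
    if not path.is_dir():
        raise ValueError(f"{path} is not an existing directory")
    # W_OK | X_OK: creating files in a directory needs write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        raise PermissionError(
            f"Cannot prepare dataset: {path} is read-only for the current user"
        )
```

Calling this at the top of each prepare entry point would surface the problem before any work is done. Note that os.access only checks permission bits for the current user, so it may still report success when running as root or on a filesystem that is mounted read-only in ways the permission bits do not reflect.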

Development

Successfully merging this pull request may close these issues.

Support dataprep in object store

3 participants