Skip to content

Removed unneeded sqlite index.#224

Merged
voegtlel merged 4 commits into
NVIDIA:developfrom
bbuschkaemper:remove-duplicate-sqlite-key
May 4, 2026
Merged

Removed unneeded sqlite index.#224
voegtlel merged 4 commits into
NVIDIA:developfrom
bbuschkaemper:remove-duplicate-sqlite-key

Conversation

@bbuschkaemper
Copy link
Copy Markdown
Contributor

SQLite implicitly creates an index due to sample_key being unique. The added index is an unnecessary duplicate resulting in a slowdown, especially for very large (i.e. >1B row) datasets.

Signed-off-by: Björn Buschkämper <bjoern.buschkaemper@gmail.com>
Copilot AI review requested due to automatic review settings April 15, 2026 13:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Removes creation of a redundant explicit SQLite index on samples(sample_key) in the WebDataset SQLite index writer, relying on the implicit index created by the UNIQUE constraint to avoid performance regressions on very large datasets.

Changes:

  • Stop creating idx_samples_sample_key on close; rely on the implicit unique index from sample_key TEXT ... UNIQUE.
  • Add unit tests ensuring no duplicate sample_key index is created and key-based lookups/duplicate detection still work.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/megatron/energon/flavors/webdataset/indexing.py Removes explicit creation of the redundant idx_samples_sample_key index in SqliteIndexWriter.close().
tests/test_webdataset_indexing.py Adds tests validating the implicit unique index behavior and ensuring the removed explicit index is not created.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_webdataset_indexing.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Björn Buschkämper <bjoern.buschkaemper@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@philipp-fischer
Copy link
Copy Markdown
Collaborator

@lvoegtle What do you think, do we really need a unit test for sqlite indices?

@bbuschkaemper
Copy link
Copy Markdown
Contributor Author

@philipp-fischer I could remove the unit test if you want, was prob an overkill

@voegtlel
Copy link
Copy Markdown
Collaborator

voegtlel commented May 4, 2026

@lvoegtle What do you think, do we really need a unit test for sqlite indices?

Yes, I think we can omit the new additional unittest. This is tested quite thoroughly through the e2e tests.

Copy link
Copy Markdown
Collaborator

@voegtlel voegtlel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is actually needed

Comment thread src/megatron/energon/flavors/webdataset/indexing.py
@bbuschkaemper bbuschkaemper requested a review from voegtlel May 4, 2026 13:16
@voegtlel
Copy link
Copy Markdown
Collaborator

voegtlel commented May 4, 2026

Thank you for this fix! Merging when Tests succeed.

@voegtlel voegtlel merged commit c560494 into NVIDIA:develop May 4, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants