Removed unneeded sqlite index. #224
Signed-off-by: Björn Buschkämper <bjoern.buschkaemper@gmail.com>
Pull request overview
Removes creation of a redundant explicit SQLite index on samples(sample_key) in the WebDataset SQLite index writer, relying on the implicit index created by the UNIQUE constraint to avoid performance regressions on very large datasets.
Changes:
- Stop creating `idx_samples_sample_key` on close; rely on the implicit unique index from `sample_key TEXT ... UNIQUE`.
- Add unit tests ensuring no duplicate `sample_key` index is created and key-based lookups/duplicate detection still work.
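A minimal sketch of what such a check could look like, using only Python's standard `sqlite3` module. The schema here is an assumption for illustration (only the `samples` table name and the `UNIQUE` constraint on `sample_key` come from this PR; the `id` column, the sample keys, and the helper names are hypothetical), and this is not the test code added in the PR.

```python
import sqlite3


def build_samples_db() -> sqlite3.Connection:
    """Create an in-memory table with no explicit index on sample_key."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the UNIQUE constraint on sample_key mirrors the PR.
    conn.execute(
        "CREATE TABLE samples ("
        "id INTEGER PRIMARY KEY, "
        "sample_key TEXT NOT NULL UNIQUE)"
    )
    conn.executemany(
        "INSERT INTO samples (sample_key) VALUES (?)",
        [("shard0/0001",), ("shard0/0002",)],
    )
    return conn


def lookup_plan(conn: sqlite3.Connection) -> str:
    """Return the query plan text for a key-based lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM samples WHERE sample_key = ?",
        ("shard0/0001",),
    ).fetchall()
    # The human-readable plan detail is the fourth column of each row.
    return " ".join(row[3] for row in rows)


def insert_duplicate_fails(conn: sqlite3.Connection) -> bool:
    """Return True if inserting an existing key violates the UNIQUE constraint."""
    try:
        conn.execute(
            "INSERT INTO samples (sample_key) VALUES (?)", ("shard0/0001",)
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

With this setup, the lookup plan should reference the automatically created index (SQLite names these `sqlite_autoindex_<table>_<n>`), and inserting a duplicate key should be rejected even though no explicit index exists.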
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/megatron/energon/flavors/webdataset/indexing.py` | Removes explicit creation of the redundant `idx_samples_sample_key` index in `SqliteIndexWriter.close()`. |
| `tests/test_webdataset_indexing.py` | Adds tests validating the implicit unique index behavior and ensuring the removed explicit index is not created. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Björn Buschkämper <bjoern.buschkaemper@gmail.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
@lvoegtle What do you think, do we really need a unit test for SQLite indices?
@philipp-fischer I could remove the unit test if you want, it was probably overkill.
Yes, I think we can omit the new additional unit test. This is tested quite thoroughly through the e2e tests.
voegtlel left a comment:
I believe this is actually needed
Signed-off-by: Björn Buschkämper <bjoern.buschkaemper@gmail.com>
Thank you for this fix! Merging when tests succeed.
SQLite implicitly creates an index due to `sample_key` being unique. The added index is an unnecessary duplicate resulting in a slowdown, especially for very large (i.e. >1B row) datasets.
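The duplication can be observed directly with `PRAGMA index_list`. The following is a small sketch under an assumed schema (the `id` column and the helper name `index_names` are hypothetical; the table, column, and index names are the ones mentioned in this PR). It is not taken from the energon code base.

```python
import sqlite3
from typing import List


def index_names(create_explicit: bool) -> List[str]:
    """List the indices SQLite maintains on the samples table."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the UNIQUE constraint alone makes SQLite build an index.
    conn.execute(
        "CREATE TABLE samples ("
        "id INTEGER PRIMARY KEY, "
        "sample_key TEXT NOT NULL UNIQUE)"
    )
    if create_explicit:
        # The explicit index that this PR stops creating.
        conn.execute(
            "CREATE INDEX idx_samples_sample_key ON samples(sample_key)"
        )
    # Each row of index_list is (seq, name, unique, origin, partial).
    rows = conn.execute("PRAGMA index_list('samples')").fetchall()
    return sorted(row[1] for row in rows)


print(index_names(create_explicit=False))
print(index_names(create_explicit=True))
```

Without the explicit index, only the automatic index backing the `UNIQUE` constraint is listed. With it, two indices cover the same column, and both have to be updated on every insert, which is the overhead the description refers to.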