Problem
On very large WebDataset materialisations, the SQLite indexing phase of energon prepare dominates total runtime. On a 244M-row / 4000-shard text-only dataset, the indexing step has been the bottleneck — we hit a 13-hour K8s pod timeout in the SQLite phase (the tar-write phase finished in ~8h).
Profile-by-inspection: with ~488M rows across samples + sample_parts and UNIQUE(sample_key) enforcement, the dominant costs are (1) single-writer INSERT throughput, (2) UNIQUE-constraint checks on a multi-GB btree, and (3) the post-load CREATE INDEX over samples(sample_key) and the two sample_parts indexes.
For datasets that are consumed purely sequentially via ShardInfosITarReader (e.g. large-scale text pretraining), the samples / sample_parts tables are not queried at training time — checkpoint/resume uses integer SliceState offsets resolved via .info.json + .tar.idx. So the multi-hour SQLite build is producing tables that are never read.
SqliteIndexWriter already accepts enable_sample_tables=False, but that knob is not plumbed through BaseWebdatasetFactory.prepare_dataset() or the energon prepare CLI, so end-users can't opt out.
What the existing --tar-index-only flag is not
--tar-index-only does not help here. Reading prepare_dataset end-to-end: the SqliteIndexWriterAggregator is constructed unconditionally, the worker pool runs and feeds all IndexSample / IndexSamplePart items into the SQLite INSERTs, and only after that does the function check tar_index_only and return early — skipping just the trivial split.yaml / dataset.yaml write. The CLI further requires .info.json to already exist (it reads get_dataset_info(path) first), so --tar-index-only is for re-indexing an already-prepared dataset, not for skipping indexing.
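The ordering described above can be paraphrased as a toy control-flow sketch. This is not the energon source: the function name and step labels are hypothetical, and the only thing it models is where the tar_index_only check sits relative to the SQLite work.

```python
def prepare_dataset_sketch(tar_index_only: bool) -> list[str]:
    """Record the order of steps as the text above describes them (hypothetical names)."""
    steps = []
    steps.append("construct_sqlite_aggregator")  # constructed unconditionally
    steps.append("run_worker_pool")              # emits IndexSample / IndexSamplePart items
    steps.append("sqlite_inserts")               # every item is INSERTed
    if tar_index_only:
        return steps                             # the early return happens only here
    steps.append("write_split_and_dataset_yaml")
    return steps


with_flag = prepare_dataset_sketch(tar_index_only=True)
without_flag = prepare_dataset_sketch(tar_index_only=False)

# The flag only removes the final, cheap step.
skipped = set(without_flag) - set(with_flag)
print(skipped)
```

In this model the only step the flag removes is the YAML write; the SQLite inserts run either way.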
Proposal
Expose enable_sample_tables through BaseWebdatasetFactory.prepare_dataset() (default True — backwards compatible) and add a corresponding --no-sample-tables CLI flag. When False:
- samples / sample_parts tables are not created.
- INSERTs become no-ops.
- The post-load CREATE INDEX calls are skipped (already gated in close()).
- .tar.idx, .info.json, and split config are still produced; the integer-indexed loader continues to work.
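A minimal sketch of the gating behaviour listed above, using only the standard sqlite3 module. SketchIndexWriter, its methods, and the two-column schema are hypothetical simplifications for illustration; they are not energon's SqliteIndexWriter or its real schema.

```python
import sqlite3


class SketchIndexWriter:
    """Hypothetical stand-in for an index writer with an enable_sample_tables gate."""

    def __init__(self, path: str, enable_sample_tables: bool = True):
        self.enable_sample_tables = enable_sample_tables
        self.db = sqlite3.connect(path)
        if enable_sample_tables:
            # Assumed, simplified schema; the real tables have more columns.
            self.db.execute("CREATE TABLE samples (sample_key TEXT, tar_file_id INTEGER)")

    def append_sample(self, sample_key: str, tar_file_id: int) -> None:
        if not self.enable_sample_tables:
            return  # INSERT becomes a no-op
        self.db.execute("INSERT INTO samples VALUES (?, ?)", (sample_key, tar_file_id))

    def table_names(self) -> list[str]:
        rows = self.db.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in rows]

    def close(self) -> None:
        if self.enable_sample_tables:
            # Post-load index build, skipped entirely when the tables are disabled.
            self.db.execute("CREATE INDEX idx_samples_key ON samples(sample_key)")
        self.db.commit()
        self.db.close()


writer = SketchIndexWriter(":memory:", enable_sample_tables=False)
writer.append_sample("000001", 0)
tables_when_disabled = writer.table_names()
writer.close()
```

With the gate off, the database ends up with no sample tables and close() does no index work, which is the part of the runtime the proposal aims to remove.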
Known limitations of --no-sample-tables (which the PR documents):
- Polylithic dataset joins (join_dataset_loader.py SQL JOIN on samples.sample_key) will fail.
- WebdatasetFileStore / SqliteITarEntryReader (used by as_file_store(), energon mount, and aux-data access for CrudeSample datasets) will fail at first key lookup.
- Any user code that queries the SQLite sample tables.
All failures are loud (OperationalError: no such table: samples), not silent.
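The failure mode can be reproduced with plain sqlite3. In this sketch, lookup_sample is a hypothetical stand-in for any reader that queries the samples table, and the empty in-memory database stands in for an index built without sample tables.

```python
import sqlite3


def lookup_sample(db: sqlite3.Connection, sample_key: str):
    """Hypothetical key lookup against the samples table."""
    return db.execute(
        "SELECT * FROM samples WHERE sample_key = ?", (sample_key,)
    ).fetchone()


db = sqlite3.connect(":memory:")  # no samples table was ever created
try:
    lookup_sample(db, "000001")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)  # no such table: samples
```

The query raises on first use rather than returning an empty result, so a dataset prepared without sample tables cannot be silently misread by key-based consumers.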
PR
Implemented in #230.