Add option to skip SQLite samples/sample_parts tables during prepare (multi-hour cost on 100M+ sample datasets) #231

@pei-li-hedgehog

Problem

On very large WebDataset materialisations, the SQLite indexing phase of energon prepare dominates total runtime. On a 244M-row / 4000-shard text-only dataset, the indexing step was the bottleneck: the job hit a 13-hour K8s pod timeout while still in the SQLite phase, whereas the tar-write phase finished in ~8h.

Profile-by-inspection: with ~488M rows across samples + sample_parts and UNIQUE(sample_key) enforcement, the dominant costs are (1) single-writer INSERT throughput, (2) UNIQUE-constraint checks on a multi-GB btree, and (3) the post-load CREATE INDEX over samples(sample_key) and the two sample_parts indexes.
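To get a rough feel for costs (2) and (3), the following is a minimal sketch using only the standard-library sqlite3 module. The table layout, column names, and key format are toy assumptions, not energon's actual schema; the only thing it shares with the real index is the UNIQUE(sample_key) shape. It loads the same rows once with the constraint enforced during INSERT and once with a plain table followed by a post-load CREATE INDEX, and reports the wall time of each.

```python
import sqlite3
import time


def toy_rows(n_rows: int):
    """Yield (tar_file_id, sample_key, byte_offset) rows with unique, scrambled keys."""
    for i in range(n_rows):
        # Multiplying by an odd constant modulo 2**32 is a bijection, so keys stay
        # unique for n_rows < 2**32 while arriving in non-sorted order.
        scrambled = (i * 2654435761) % 2**32
        yield (i % 4000, f"sample_{scrambled:010d}", i * 512)


def load_rows(n_rows: int, unique_during_load: bool) -> tuple[int, float]:
    """Load rows into an in-memory toy `samples` table; return (row count, seconds)."""
    conn = sqlite3.connect(":memory:")
    constraint = " UNIQUE" if unique_during_load else ""
    # Toy schema (assumption): not the schema energon writes.
    conn.execute(
        "CREATE TABLE samples ("
        f"tar_file_id INTEGER, sample_key TEXT{constraint}, byte_offset INTEGER)"
    )
    start = time.perf_counter()
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", toy_rows(n_rows))
        if not unique_during_load:
            # Post-load index build instead of per-row constraint checks.
            conn.execute("CREATE INDEX idx_samples_key ON samples(sample_key)")
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
    conn.close()
    return count, elapsed


if __name__ == "__main__":
    for mode in (True, False):
        count, seconds = load_rows(200_000, unique_during_load=mode)
        print(f"unique_during_load={mode}: {count} rows in {seconds:.2f}s")
```

An in-memory toy at this scale will not reproduce the behaviour of a multi-GB on-disk btree, so the numbers are only useful for comparing the two load strategies against each other.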

For datasets that are consumed purely sequentially via ShardInfosITarReader (e.g. large-scale text pretraining), the samples / sample_parts tables are not queried at training time — checkpoint/resume uses integer SliceState offsets resolved via .info.json + .tar.idx. So the multi-hour SQLite build is producing tables that are never read.
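As an illustration of why an integer offset needs no key lookup, here is a minimal sketch that maps a global sample index to a (shard, local index) pair using only per-shard sample counts. The function name and the idea that the counts come from .info.json are assumptions for illustration; the real SliceState and reader internals are not shown in this issue.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index: int, shard_counts: list[int]) -> tuple[int, int]:
    """Map a global sample index to (shard number, index within that shard).

    shard_counts is a hypothetical list of samples per shard, in shard order.
    """
    ends = list(accumulate(shard_counts))  # cumulative end offset of each shard
    total = ends[-1] if ends else 0
    if not 0 <= global_index < total:
        raise IndexError(f"sample index {global_index} out of range [0, {total})")
    shard = bisect_right(ends, global_index)  # first shard whose end exceeds the index
    shard_start = ends[shard - 1] if shard > 0 else 0
    return shard, global_index - shard_start


if __name__ == "__main__":
    # Three shards holding 3, 2 and 4 samples.
    print(locate(3, [3, 2, 4]))
```

Resolving the local index to a byte position would then be a lookup in that shard's .tar.idx, which again involves no sample_key and no SQLite query.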

SqliteIndexWriter already accepts enable_sample_tables=False, but that knob is not plumbed through BaseWebdatasetFactory.prepare_dataset() or the energon prepare CLI, so end-users can't opt out.

What the existing --tar-index-only flag is not

--tar-index-only does not help here. Reading prepare_dataset end-to-end: the SqliteIndexWriterAggregator is constructed unconditionally, the worker pool runs and feeds all IndexSample / IndexSamplePart items into the SQLite INSERTs, and only after that does the function check tar_index_only and return early — skipping just the trivial split.yaml / dataset.yaml write. The CLI further requires .info.json to already exist (it reads get_dataset_info(path) first), so --tar-index-only is for re-indexing an already-prepared dataset, not for skipping indexing.
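The ordering described above can be summarised with a small control-flow sketch. This is not energon's code: the function name and step labels are hypothetical, and it only models the order of operations as described in this issue.

```python
def prepare_dataset_flow(tar_index_only: bool) -> list[str]:
    """Return the ordered steps a prepare run performs, per the description above."""
    steps = [
        # Constructed unconditionally, regardless of tar_index_only.
        "construct SQLite index aggregator",
        # The worker pool feeds every sample and sample part into SQLite.
        "run worker pool and INSERT into samples / sample_parts",
    ]
    if tar_index_only:
        # The early return happens only after the expensive work is done.
        return steps
    steps.append("write split.yaml / dataset.yaml")
    return steps


if __name__ == "__main__":
    for flag in (False, True):
        print(flag, prepare_dataset_flow(flag))
```

In both cases the INSERT step is present; the flag only removes the final config write.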

Proposal

Expose enable_sample_tables through BaseWebdatasetFactory.prepare_dataset() (default True — backwards compatible) and add a corresponding --no-sample-tables CLI flag. When False:

  • samples / sample_parts tables are not created.
  • INSERTs become no-ops.
  • The post-load CREATE INDEX calls are skipped (already gated in close()).
  • .tar.idx, .info.json, and split config are still produced — the integer-indexed loader continues to work.
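The gating described in the bullets above can be sketched with a toy writer. The class name, method names, and schema here are hypothetical stand-ins, not SqliteIndexWriter's real interface; the sketch only shows one boolean controlling table creation, inserts, and the post-load index build.

```python
import sqlite3


class ToyIndexWriter:
    """Toy stand-in for an index writer with an enable_sample_tables switch."""

    def __init__(self, conn: sqlite3.Connection, enable_sample_tables: bool = True):
        self.conn = conn
        self.enable_sample_tables = enable_sample_tables
        if enable_sample_tables:
            # Toy schema (assumption), created only when the tables are wanted.
            conn.execute(
                "CREATE TABLE samples ("
                "tar_file_id INTEGER, sample_key TEXT, byte_offset INTEGER)"
            )

    def append_sample(self, tar_file_id: int, sample_key: str, byte_offset: int) -> None:
        if not self.enable_sample_tables:
            return  # INSERT becomes a no-op
        self.conn.execute(
            "INSERT INTO samples VALUES (?, ?, ?)",
            (tar_file_id, sample_key, byte_offset),
        )

    def finish(self) -> None:
        if self.enable_sample_tables:
            # Post-load index build, skipped together with the tables.
            self.conn.execute("CREATE INDEX idx_samples_key ON samples(sample_key)")
        self.conn.commit()


def table_names(conn: sqlite3.Connection) -> list[str]:
    """List the user tables present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    for enabled in (True, False):
        conn = sqlite3.connect(":memory:")
        writer = ToyIndexWriter(conn, enable_sample_tables=enabled)
        writer.append_sample(0, "000001", 0)
        writer.finish()
        print(enabled, table_names(conn))
```

Keeping the default at True means existing callers see no change, which matches the backwards-compatibility goal of the proposal.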

Known limitations of --no-sample-tables (which the PR documents):

  • Polylithic dataset joins (join_dataset_loader.py SQL JOIN on samples.sample_key) — will fail.
  • WebdatasetFileStore / SqliteITarEntryReader — used by as_file_store(), energon mount, and aux-data access for CrudeSample datasets — will fail at first key lookup.
  • Any user code that queries the SQLite sample tables.

All failures are loud (OperationalError: no such table: samples), not silent.
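The failure mode is easy to reproduce with the standard-library sqlite3 module. In this minimal sketch the lookup function and its column names are hypothetical; only the behaviour of querying a missing table is the point.

```python
import sqlite3


def lookup_sample(conn: sqlite3.Connection, sample_key: str):
    """Hypothetical key lookup against a `samples` table."""
    return conn.execute(
        "SELECT tar_file_id, byte_offset FROM samples WHERE sample_key = ?",
        (sample_key,),
    ).fetchone()


def lookup_error_message(sample_key: str) -> str:
    """Run the lookup against a database that has no sample tables."""
    conn = sqlite3.connect(":memory:")  # empty database: no tables at all
    try:
        lookup_sample(conn, sample_key)
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        conn.close()
    return ""


if __name__ == "__main__":
    print(lookup_error_message("000001"))  # prints: no such table: samples
```

Because SQLite raises while preparing the statement, the error surfaces on the first lookup rather than returning an empty result.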

PR

Implemented in #230.
