Add option to skip SQLite samples/sample_parts tables during prepare (multi-hour cost on 100M+ sample datasets)

## Problem

On very large WebDataset materialisations, the SQLite indexing phase of `energon prepare` dominates total runtime. On a 244M-row / 4000-shard text-only dataset, the indexing step has been the bottleneck — we hit a 13-hour K8s pod timeout in the SQLite phase (the tar-write phase finished in ~8h).

By inspection rather than a measured profile: with ~488M rows across `samples` + `sample_parts` and `UNIQUE(sample_key)` enforcement, the dominant costs appear to be (1) single-writer `INSERT` throughput, (2) UNIQUE-constraint checks on a multi-GB btree, and (3) the post-load `CREATE INDEX` over `samples(sample_key)` and the two `sample_parts` indexes.
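To make cost (2) concrete, here is a minimal sketch using only Python's stdlib `sqlite3`. The two-column schema is an assumption for illustration, not energon's actual table definition. It shows that a `UNIQUE` column makes SQLite maintain an implicit btree index, which every `INSERT` has to probe and update:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Assumed toy schema; the real `samples` table has more columns.
con.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, sample_key TEXT UNIQUE)")

# SQLite backs a UNIQUE constraint with an automatically created index.
# PRAGMA index_list returns (seq, name, unique, origin, partial) per index.
index_names = [row[1] for row in con.execute("PRAGMA index_list('samples')")]

# Every insert is checked against that index; a duplicate key is rejected.
con.execute("INSERT INTO samples (sample_key) VALUES (?)", ("000001",))
try:
    con.execute("INSERT INTO samples (sample_key) VALUES (?)", ("000001",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

con.close()
```

At hundreds of millions of rows that index no longer fits in the page cache, so each uniqueness probe can turn into random I/O.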

For datasets that are consumed purely sequentially via `ShardInfosITarReader` (e.g. large-scale text pretraining), the `samples` / `sample_parts` tables are not queried at training time — checkpoint/resume uses integer `SliceState` offsets resolved via `.info.json` + `.tar.idx`. So the multi-hour SQLite build is producing tables that are never read.

`SqliteIndexWriter` already accepts `enable_sample_tables=False`, but that knob is not plumbed through `BaseWebdatasetFactory.prepare_dataset()` or the `energon prepare` CLI, so end-users can't opt out.

## What the existing `--tar-index-only` flag is *not*

`--tar-index-only` does **not** help here. Reading `prepare_dataset` end-to-end: the `SqliteIndexWriterAggregator` is constructed unconditionally, the worker pool runs and feeds all `IndexSample` / `IndexSamplePart` items into the SQLite `INSERT`s, and only *after* that does the function check `tar_index_only` and return early — skipping just the trivial `split.yaml` / `dataset.yaml` write. The CLI further requires `.info.json` to already exist (it reads `get_dataset_info(path)` first), so `--tar-index-only` is for *re-indexing* an already-prepared dataset, not for skipping indexing.

## Proposal

Expose `enable_sample_tables` through `BaseWebdatasetFactory.prepare_dataset()` (default `True` — backwards compatible) and add a corresponding `--no-sample-tables` CLI flag. When `False`:

- `samples` / `sample_parts` tables are not created.
- `INSERT`s become no-ops.
- The post-load `CREATE INDEX` calls are skipped (already gated in `close()`).
- `.tar.idx`, `.info.json`, and split config are still produced — the integer-indexed loader continues to work.
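
The gating described above can be sketched as follows. `ToyIndexWriter`, its column names, and the `build` helper are hypothetical stand-ins written against stdlib `sqlite3`; they are not energon's `SqliteIndexWriter` and only mirror the three behaviours listed (no table creation, no-op inserts, skipped index build):

```python
import os
import sqlite3
import tempfile


class ToyIndexWriter:
    """Hypothetical stand-in for an index writer with an opt-out flag."""

    def __init__(self, db_path: str, enable_sample_tables: bool = True) -> None:
        self.enable_sample_tables = enable_sample_tables
        self.db = sqlite3.connect(db_path)
        if enable_sample_tables:
            # Assumed columns, for illustration only.
            self.db.execute(
                "CREATE TABLE samples "
                "(tar_file_id INTEGER, sample_key TEXT, sample_index INTEGER)"
            )

    def append_sample(self, tar_file_id: int, sample_key: str, sample_index: int) -> None:
        if not self.enable_sample_tables:
            return  # insert becomes a no-op
        self.db.execute(
            "INSERT INTO samples VALUES (?, ?, ?)",
            (tar_file_id, sample_key, sample_index),
        )

    def close(self) -> None:
        if self.enable_sample_tables:
            # Post-load index build, skipped entirely when tables are disabled.
            self.db.execute("CREATE INDEX idx_samples_key ON samples(sample_key)")
        self.db.commit()
        self.db.close()


def build(enable_sample_tables: bool) -> list:
    """Write two samples and return the table names found in the database."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.sqlite")
        writer = ToyIndexWriter(path, enable_sample_tables)
        writer.append_sample(0, "000001", 0)
        writer.append_sample(0, "000002", 1)
        writer.close()
        con = sqlite3.connect(path)
        try:
            rows = con.execute(
                "SELECT name FROM sqlite_master WHERE type='table'"
            ).fetchall()
        finally:
            con.close()
        return sorted(r[0] for r in rows)
```

With the flag off, `build(False)` yields a database with no tables, while `build(True)` contains `samples`. The caller's code path is identical in both cases, which is what keeps the default backwards compatible.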

Known limitations of `--no-sample-tables` (which the PR documents):

- Polylithic dataset joins (`join_dataset_loader.py`, SQL `JOIN` on `samples.sample_key`) will fail.
- `WebdatasetFileStore` / `SqliteITarEntryReader`, used by `as_file_store()`, `energon mount`, and aux-data access for `CrudeSample` datasets, will fail at the first key lookup.
- Any user code that queries the SQLite sample tables.

All failures are loud (`OperationalError: no such table: samples`), not silent.
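
As a minimal illustration of that failure mode, using stdlib `sqlite3` only: `lookup_sample` is a hypothetical helper rather than an energon function, and the empty in-memory database stands in for an index prepared with `--no-sample-tables`:

```python
import sqlite3


def lookup_sample(con: sqlite3.Connection, sample_key: str):
    """Hypothetical key lookup of the kind a file store would issue."""
    return con.execute(
        "SELECT sample_key FROM samples WHERE sample_key = ?", (sample_key,)
    ).fetchone()


# An empty database stands in for an index built without the sample tables.
con = sqlite3.connect(":memory:")
try:
    lookup_sample(con, "000001")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)  # "no such table: samples"
finally:
    con.close()
```

The query raises immediately instead of returning an empty result, so a misconfigured consumer cannot silently read "no samples".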

## PR

Implemented in #230.
