[Proposal] Fixing SQLite bottleneck

`prepare_dataset` still has a major scalability bottleneck for very large WebDataset preparations. #225 improved the worker -> SQLite write path, but for datasets with >1B samples the final `index.sqlite` construction still dominates runtime and can take multiple days.

The core problem is that every sample is written into a single global SQLite database, including secondary indexes and especially the `sample_parts` table/indexes. On the standard training path, this SQLite index is not even used: training reads shard metadata and per-shard `.tar.idx` files. The SQLite database is mainly needed for by-key access and file-store-style tooling, so training-only datasets currently pay a very large preparation cost for metadata they do not use.

Possible solutions:
1. Training-only fast path: add an option to `prepare_dataset` to skip the heavy global SQLite index for datasets that only need standard training. In that mode, Energon would generate only the metadata actually required for training (`split.yaml`, `.info.json`, shard metadata, `.tar.idx`, etc.) and omit `index.sqlite` / `sample_parts`.
2. Scalable long-term fix: replace the single global SQLite index with a backend that is designed to scale to billions of rows for this workload, e.g. an LSM-based key-value store such as RocksDB/Speedb, or another sharded/indexed backend with efficient bulk ingest and point lookups. This would preserve by-key/file-store functionality without making billion-sample dataset preparation dominated by SQLite finalization.
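To make the "sharded/indexed backend" idea in option 2 a bit more concrete, here is a minimal sketch, not a proposed implementation. It uses only Python's standard library and hash-partitions a by-key index across many small SQLite files instead of one global `index.sqlite`. Everything in it is an assumption for illustration: the class name `ShardedKeyIndex`, the `samples` table layout, and the `index-NNNNN.sqlite` file naming are hypothetical and do not reflect Energon's actual schema. SQLite stands in for whatever per-partition store would be chosen (an LSM store such as RocksDB would slot into the same shape).

```python
import os
import sqlite3
import zlib


class ShardedKeyIndex:
    """Hypothetical by-key index split across N small SQLite files."""

    def __init__(self, root, num_partitions=16):
        self.root = root
        self.num_partitions = num_partitions
        os.makedirs(root, exist_ok=True)

    def _partition(self, key):
        # crc32 is stable across processes, unlike the built-in hash().
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def _path(self, part):
        return os.path.join(self.root, f"index-{part:05d}.sqlite")

    def bulk_build(self, rows):
        """Build all partitions from (sample_key, shard_name, byte_offset) rows."""
        # Buckets are held in memory only to keep the sketch short.
        buckets = [[] for _ in range(self.num_partitions)]
        for row in rows:
            buckets[self._partition(row[0])].append(row)
        for part, bucket in enumerate(buckets):
            conn = sqlite3.connect(self._path(part))
            try:
                with conn:  # one transaction per partition
                    conn.execute(
                        "CREATE TABLE IF NOT EXISTS samples "
                        "(sample_key TEXT, shard_name TEXT, byte_offset INTEGER)"
                    )
                    conn.executemany(
                        "INSERT INTO samples VALUES (?, ?, ?)", bucket
                    )
                    # Create the index after the bulk insert, not before it.
                    conn.execute(
                        "CREATE INDEX IF NOT EXISTS idx_key "
                        "ON samples (sample_key)"
                    )
            finally:
                conn.close()

    def lookup(self, key):
        """Return (shard_name, byte_offset) for key, or None if absent."""
        conn = sqlite3.connect(self._path(self._partition(key)))
        try:
            return conn.execute(
                "SELECT shard_name, byte_offset FROM samples "
                "WHERE sample_key = ?",
                (key,),
            ).fetchone()
        finally:
            conn.close()
```

In this shape each partition can be built independently (for example one per worker), and a point lookup touches exactly one small file. Whether that actually removes the finalization cost at >1B samples would need to be measured.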

I’d like to discuss which direction would be preferred upstream before working on a PR.

[Proposal] Fixing SQLite bottleneck #232
