[Proposal] Fixing SQLite bottleneck #232

@bbuschkaemper

prepare_dataset still has a major scalability bottleneck when preparing very large WebDatasets. #225 improved the worker -> SQLite write path, but for datasets with more than 1B samples, building the final index.sqlite still dominates runtime and can take multiple days.

The core problem is that every sample is written into a single global SQLite database, together with its secondary indexes and, most expensively, the sample_parts table and its indexes. The standard training path does not use this SQLite index at all: training reads shard metadata and the per-shard .tar.idx files. The SQLite database is mainly needed for by-key access and file-store-style tooling, so training-only datasets currently pay a very large preparation cost for metadata they never read.

Possible solutions:

  1. Training-only fast path: add an option to prepare_dataset to skip the heavy global SQLite index for datasets that only need standard training. In that mode, Energon would generate only the metadata actually required for training (split.yaml, .info.json, shard metadata, .tar.idx, etc.) and omit index.sqlite / sample_parts.
  2. Scalable long-term fix: replace the single global SQLite index with a backend that is designed to scale to billions of rows for this workload, e.g. an LSM-based key-value store such as RocksDB/Speedb, or another sharded/indexed backend with efficient bulk ingest and point lookups. This would preserve by-key/file-store functionality without making billion-sample dataset preparation dominated by SQLite finalization.
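
To make the "sharded/indexed backend" variant of option 2 more concrete, here is a minimal sketch of a by-key index that hashes each sample key into one of several small SQLite files instead of one global database. This is not Energon code: the class name `ShardedKeyIndex`, the file naming, and the `samples` table schema are all hypothetical and only illustrate the idea. Each bucket could in principle be built independently, and a point lookup only touches one bucket.

```python
import os
import sqlite3
import tempfile
import zlib


class ShardedKeyIndex:
    """Hypothetical by-key index split across several small SQLite files."""

    def __init__(self, directory, num_buckets=8):
        self.num_buckets = num_buckets
        self.conns = []
        for i in range(num_buckets):
            path = os.path.join(directory, f"index-{i:04d}.sqlite")
            conn = sqlite3.connect(path)
            # WITHOUT ROWID stores rows directly in the primary-key B-tree,
            # so no separate secondary index has to be built afterwards.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS samples ("
                "sample_key TEXT PRIMARY KEY, "
                "tar_file TEXT NOT NULL, "
                "byte_offset INTEGER NOT NULL"
                ") WITHOUT ROWID"
            )
            self.conns.append(conn)

    def _bucket(self, sample_key):
        # crc32 is stable across processes, unlike the built-in hash().
        return zlib.crc32(sample_key.encode("utf-8")) % self.num_buckets

    def bulk_insert(self, rows):
        """Insert (sample_key, tar_file, byte_offset) tuples, grouped by bucket."""
        grouped = [[] for _ in range(self.num_buckets)]
        for key, tar_file, offset in rows:
            grouped[self._bucket(key)].append((key, tar_file, offset))
        for conn, batch in zip(self.conns, grouped):
            if batch:
                with conn:  # one transaction per bucket
                    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", batch)

    def lookup(self, sample_key):
        """Return (tar_file, byte_offset) for a key, or None if it is absent."""
        conn = self.conns[self._bucket(sample_key)]
        return conn.execute(
            "SELECT tar_file, byte_offset FROM samples WHERE sample_key = ?",
            (sample_key,),
        ).fetchone()

    def close(self):
        for conn in self.conns:
            conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        index = ShardedKeyIndex(tmp, num_buckets=4)
        index.bulk_insert(
            [
                ("sample_000", "shard_0000.tar", 0),
                ("sample_001", "shard_0000.tar", 10240),
            ]
        )
        print(index.lookup("sample_001"))  # ('shard_0000.tar', 10240)
        index.close()
```

Whether bucketed SQLite files would be fast enough at >1B samples is untested here; an LSM-based store such as RocksDB/Speedb would replace the per-bucket SQLite file but could keep the same key-to-location interface.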

I’d like to discuss which direction would be preferred upstream before working on a PR.
