Importer: stream large PGN imports instead of loading the whole file in memory #95

Description

@jozef2svrcek

importer::extract_pgn returns the whole decompressed PGN as a Vec<u8>, and process_pgn_bytes takes &[u8]. For the small feeds (TWIC, Lichess) that's fine, but the Ajedrez OTB base (~611 MB .7z) expands to a ~3 GB PGN that is held in memory for the whole import (#40 B3). DuckDB spills to disk, so the import usually won't OOM, but that is a needless ~3 GB resident on top of DuckDB's buffers.

Fix options:

  • Refactor process_pgn_bytes to take a Read (stream the PGN through the parser) instead of &[u8]. The only snag is the byte-offset→line-number error reporting, which assumes the full slice is available; the offset and line count could be tracked incrementally instead.
  • Or mmap the decompressed temp file (memmap2) and pass the mapping as &[u8]. The OS pages it in and out on demand, so the pages stay reclaimable under memory pressure, and the parser needs no change.
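
To make the first option concrete, here is a minimal sketch of incremental offset and line tracking over a Read, using only the standard library. Everything in it is hypothetical and not taken from the importer: the names process_pgn_reader, Position, ImportError and check_tag_line are made up for illustration, and check_tag_line is only a stand-in for the real parser step. It assumes errors can be reported per line, which may not match how the current byte-offset reporting works.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Where a line starts in the stream (hypothetical type, not from the importer).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Position {
    line: u64,        // 1-based line number
    byte_offset: u64, // offset of the first byte of that line
}

/// Hypothetical error type carrying the position tracked while streaming.
#[derive(Debug)]
enum ImportError {
    Io(io::Error),
    Parse { at: Position, message: String },
}

/// Streams `reader` line by line and hands each line to `on_line`.
/// Only the current line is buffered, and the position is updated as bytes
/// are consumed, so error reporting never needs the full slice.
/// Returns the number of lines read.
fn process_pgn_reader<R, F>(reader: R, mut on_line: F) -> Result<u64, ImportError>
where
    R: Read,
    F: FnMut(&[u8]) -> Result<(), String>,
{
    let mut reader = BufReader::new(reader);
    let mut buf = Vec::new();
    let mut at = Position { line: 1, byte_offset: 0 };

    loop {
        buf.clear();
        let n = reader.read_until(b'\n', &mut buf).map_err(ImportError::Io)?;
        if n == 0 {
            return Ok(at.line - 1); // end of input
        }
        on_line(&buf).map_err(|message| ImportError::Parse { at, message })?;
        at.line += 1;
        at.byte_offset += n as u64;
    }
}

/// Stand-in for the real parser step: rejects tag lines that lack a closing `]`.
fn check_tag_line(line: &[u8]) -> Result<(), String> {
    // Trim trailing whitespace, including the newline kept by read_until.
    let end = line
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(0, |i| i + 1);
    let trimmed = &line[..end];
    if trimmed.first() == Some(&b'[') && trimmed.last() != Some(&b']') {
        return Err("unterminated tag pair".to_string());
    }
    Ok(())
}
```

Because &[u8] and File both implement Read, the same function would accept the existing in-memory buffers for the small feeds and a decompressed temp file (or a decompressor's reader) for the large base. A real parser that needs lookahead across lines would have to keep its own state between calls.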

Low urgency (works today on a typical machine), but worth doing before the historical base becomes a common import. Relates to #40.

Metadata

    Labels

    enhancement (New feature or request)
