importer::extract_pgn returns the whole decompressed PGN as a Vec<u8>, and process_pgn_bytes takes &[u8]. For the small feeds (TWIC, Lichess) that's fine, but the Ajedrez OTB base (~611 MB .7z) expands to a ~3 GB PGN held in memory during import (#40 B3). DuckDB spills to disk so it usually won't OOM, but it's a needless ~3 GB resident on top of DuckDB's buffers.
Fix options:
- Refactor process_pgn_bytes to take a Read (stream the PGN through the parser) instead of &[u8]. The only snag is the byte-offset→line-number error reporting, which assumes the full slice; the offset could be tracked incrementally.
- Or mmap the decompressed temp file (memmap2) and pass the mapping as &[u8]. The OS pages it in and out, keeping RSS low with no parser change.
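To make the first option concrete, here is a minimal sketch of incremental line and offset tracking over a Read, using only std. The names scan_pgn_stream and ScanError are hypothetical and not part of the importer, and the bracket check on tag lines is only a stand-in for whatever the real parser validates. The point is that the line number and byte offset are carried along as the stream is consumed, so no full slice is needed to report an error position.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical error type: either an I/O failure or a malformed tag line,
/// reported with a position that was tracked while streaming.
#[derive(Debug)]
pub enum ScanError {
    Io(io::Error),
    Malformed { line: u64, byte_offset: u64 },
}

impl From<io::Error> for ScanError {
    fn from(e: io::Error) -> Self {
        ScanError::Io(e)
    }
}

/// Strip trailing '\n' and '\r' bytes from a line.
fn trim_eol(mut s: &[u8]) -> &[u8] {
    while let [rest @ .., b'\n' | b'\r'] = s {
        s = rest;
    }
    s
}

/// Hypothetical streaming stand-in for process_pgn_bytes: reads one line at a
/// time, counts games by their [Event ...] tag, and returns the 1-based line
/// number and starting byte offset of the first unterminated tag line.
pub fn scan_pgn_stream<R: Read>(src: R) -> Result<u64, ScanError> {
    let mut reader = BufReader::new(src);
    let mut line = Vec::new(); // reused buffer; only one line is resident
    let mut line_no = 0u64;
    let mut offset = 0u64; // byte offset of the start of the current line
    let mut games = 0u64;

    loop {
        line.clear();
        let n = reader.read_until(b'\n', &mut line)?;
        if n == 0 {
            break; // end of stream
        }
        line_no += 1;

        let text = trim_eol(&line);
        if text.starts_with(b"[") {
            if !text.ends_with(b"]") {
                return Err(ScanError::Malformed {
                    line: line_no,
                    byte_offset: offset,
                });
            }
            if text.starts_with(b"[Event ") {
                games += 1;
            }
        }
        offset += n as u64;
    }
    Ok(games)
}
```

Anything implementing Read can be passed in, for example a File opened on the decompressed temp file, or a decompressor's reader if it exposes one. Memory use is then bounded by the longest line rather than by the size of the PGN.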
Low urgency (works today on a typical machine), but worth doing before the historical base becomes a common import. Relates to #40.