feat: streaming transcript/checkpoint parsing to reduce memory amplification (#956)
Draft
Krishnachaitanyakc wants to merge 1 commit into git-ai-project:main
Convert JSONL transcript parsers (Claude, Codex, Windsurf, Droid) from `read_to_string` to `BufReader` + line-by-line streaming. Use `from_reader` for JSON parsers (Gemini, Continue). Stream checkpoint JSONL reads via `BufReader`. Eliminate the double `read_all_checkpoints()` calls in `get_all_tracked_files()` by threading checkpoints as a parameter. Add configurable size caps (`max_checkpoint_jsonl_bytes=64MB`, `max_transcript_bytes=32MB`) via env vars or file config. The checkpoint cap is advisory-only (warns but still parses); the transcript cap returns `Err` to preserve existing data rather than silently replacing it with an empty result. Perform hash migration in place instead of allocating a second Vec. Handle `--reset` with an eager reset on read failure, so corrupt checkpoint files can be recovered without swallowing non-corruption I/O errors. Harden generated CI workflows with version pinning and integrity checks.

Targets: transcript RSS down >=40%, checkpoint RSS down >=30%.
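The central conversion, replacing `read_to_string` with `BufReader` line-by-line streaming for JSONL files, can be sketched as follows. This is a minimal illustration, not git-ai's actual code: the function name is hypothetical, and the real parsers would deserialize each line (e.g. with `serde_json::from_str`) rather than collect raw strings.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Stream a JSONL file line by line instead of reading it whole:
/// peak memory is bounded by the longest single line, not the file size.
/// (Hypothetical sketch; git-ai's real parsers would feed each line to a
/// JSON deserializer instead of collecting the raw strings.)
fn stream_jsonl_lines(path: &str) -> std::io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    let mut lines = Vec::new();
    for line in reader.lines() {
        let line = line?;
        // Skip blank lines; a real parser would also decide how to
        // handle lines that fail to deserialize.
        if !line.trim().is_empty() {
            lines.push(line);
        }
    }
    Ok(lines)
}
```

Because `lines()` yields one line at a time, the intermediate working set is proportional to a single line rather than the whole file, which is what drives the RSS reductions targeted here.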
Summary
- Convert JSONL transcript parsers from `read_to_string` to streaming `BufReader` + `.lines()`, eliminating full-file memory materialization
- Switch JSON parsers from `read_to_string` + `from_str` to `from_reader`, avoiding the intermediate `String` allocation
- Read checkpoint JSONL via `BufReader` instead of `read_to_string`
- Eliminate the double `read_all_checkpoints()` calls in `get_all_tracked_files()` by reading once upfront and threading as `&[Checkpoint]`
- Perform hash migration in place (`&mut` iteration) instead of allocating a second Vec
- Add configurable size caps (`max_checkpoint_jsonl_bytes=64MB`, `max_transcript_bytes=32MB`) via env vars or file config; the checkpoint cap is advisory-only, the transcript cap returns `Err` to preserve existing data
- Handle `--reset` with an eager reset on read failure, propagating write errors independently

Motivation
git-ai's transcript/checkpoint parsers cause ~5-8x memory amplification: a 187 MB transcript produces ~1.2 GB peak RSS, and a 307 MB checkpoint file produces ~1.78 GB peak RSS with ~33s runtime. This causes OOM kills and poor UX on long AI coding sessions.
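The configurable size caps could be resolved roughly as below. This is a sketch only: the env var name `GIT_AI_MAX_TRANSCRIPT_BYTES` and both helper functions are hypothetical; the PR names the config keys (`max_transcript_bytes=32MB`) but not the actual env var spelling or API.

```rust
use std::env;

// Default from the PR description (32 MB transcript cap).
const DEFAULT_MAX_TRANSCRIPT_BYTES: u64 = 32 * 1024 * 1024;

/// Resolve the byte cap from an env var, falling back to the default.
/// (Env var name is illustrative, not git-ai's actual name.)
fn max_transcript_bytes() -> u64 {
    env::var("GIT_AI_MAX_TRANSCRIPT_BYTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_MAX_TRANSCRIPT_BYTES)
}

/// The transcript cap is hard: return Err so existing data is preserved
/// rather than silently replaced with an empty parse. (The checkpoint
/// cap, by contrast, only warns and still parses.)
fn check_transcript_size(len: u64) -> Result<(), String> {
    let cap = max_transcript_bytes();
    if len > cap {
        Err(format!("transcript is {len} bytes, exceeds cap of {cap} bytes"))
    } else {
        Ok(())
    }
}
```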
Targets

- Transcript parsing peak RSS down >=40%
- Checkpoint parsing peak RSS down >=30%
Test plan
- `cargo fmt -- --check` — clean
- `cargo clippy --all-targets -- -D warnings` — clean
- `cargo test --lib` — 1205 passed, 0 failed
- `scripts/repro_runaway_memory.py` for before/after RSS measurements
- Verified `--reset` recovers from a corrupt `checkpoints.jsonl`
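The `--reset` recovery behavior exercised in the last test item could look roughly like this. It is a sketch under stated assumptions: `read_checkpoints_raw` and its corruption check stand in for git-ai's real checkpoint parser, which this PR text does not show.

```rust
use std::fs;
use std::io::{self, ErrorKind};

/// --reset semantics: a corrupt or missing checkpoints file is safe to
/// reset eagerly, but other I/O errors (permissions, hardware) still
/// propagate instead of being swallowed.
fn reset_checkpoints(path: &str) -> io::Result<()> {
    match read_checkpoints_raw(path) {
        // Readable file, corrupt contents, or no file at all: reset anyway.
        Ok(_) => {}
        Err(e) if e.kind() == ErrorKind::InvalidData || e.kind() == ErrorKind::NotFound => {}
        // Any other read failure is a real error the user must see.
        Err(e) => return Err(e),
    }
    // The write error is propagated independently of the read outcome.
    fs::write(path, b"")
}

/// Stand-in for git-ai's checkpoint parser: treats any line that is not
/// shaped like a JSON object as corruption (InvalidData).
fn read_checkpoints_raw(path: &str) -> io::Result<Vec<String>> {
    let data = fs::read_to_string(path)?;
    data.lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| {
            if l.starts_with('{') && l.ends_with('}') {
                Ok(l.to_string())
            } else {
                Err(io::Error::new(ErrorKind::InvalidData, "corrupt checkpoint line"))
            }
        })
        .collect()
}
```

Distinguishing `InvalidData` from other error kinds is what lets `--reset` recover corrupt files without masking genuine I/O failures.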