
feat: streaming transcript/checkpoint parsing to reduce memory amplification#956

Draft
Krishnachaitanyakc wants to merge 1 commit into git-ai-project:main from Krishnachaitanyakc:feat/streaming-transcript-parsing-v2

Conversation

@Krishnachaitanyakc
Contributor

Summary

  • Convert 4 JSONL transcript parsers (Claude, Codex, Windsurf, Droid) from read_to_string to streaming BufReader + .lines(), eliminating full-file memory materialization
  • Convert 2 JSON parsers (Gemini, Continue) from read_to_string + from_str to from_reader, avoiding intermediate String allocation
  • Stream checkpoint JSONL reads via BufReader instead of read_to_string
  • Eliminate 2 redundant read_all_checkpoints() calls in get_all_tracked_files() by reading once upfront and threading as &[Checkpoint]
  • Perform hash migration in-place (&mut iteration) instead of allocating a second Vec
  • Add configurable size caps (max_checkpoint_jsonl_bytes=64MB, max_transcript_bytes=32MB) via env vars or file config; the checkpoint cap is advisory-only (warn but parse), while the transcript cap returns Err so existing data is preserved rather than silently replaced
  • Handle --reset with eager reset on read failure, propagating write errors independently
  • Harden generated CI workflows with action version pinning and integrity checks
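The streaming conversion in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: `Entry` stands in for the real transcript record type, and real parsers would deserialize each line (e.g. with serde) rather than keep it raw.

```rust
use std::io::{BufRead, BufReader, Read};

// Placeholder for the real transcript record type.
#[derive(Debug, PartialEq)]
struct Entry {
    raw: String,
}

// Parse JSONL from any reader one line at a time. Peak memory is one
// line plus the accumulated entries, never the whole file held as a
// single String (which is what read_to_string would require).
fn parse_jsonl<R: Read>(reader: R) -> std::io::Result<Vec<Entry>> {
    let mut entries = Vec::new();
    for line in BufReader::new(reader).lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue; // JSONL tolerates blank lines; skip them
        }
        entries.push(Entry { raw: line });
    }
    Ok(entries)
}

fn main() {
    let data = "{\"role\":\"user\"}\n\n{\"role\":\"assistant\"}\n";
    let entries = parse_jsonl(data.as_bytes()).unwrap();
    assert_eq!(entries.len(), 2);
}
```

Because `parse_jsonl` is generic over `Read`, the same function works on a `File`, a network stream, or an in-memory byte slice in tests. The whole-document JSON parsers (Gemini, Continue) get the analogous treatment via `serde_json::from_reader`, which also skips the intermediate `String`.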

Motivation

git-ai's transcript/checkpoint parsers cause ~5-8x memory amplification: a 187 MB transcript produces ~1.2 GB peak RSS, and a 307 MB checkpoint file produces ~1.78 GB peak RSS with ~33s runtime. This causes OOM kills and poor UX on long AI coding sessions.

Targets

  • Transcript peak RSS down ≥40%
  • Checkpoint peak RSS down ≥30%, wall-clock down ≥25%

Test plan

  • cargo fmt -- --check — clean
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo test --lib — 1205 passed, 0 failed
  • Run scripts/repro_runaway_memory.py for before/after RSS measurements
  • Manual testing with large transcript files (>100MB)
  • Verify --reset recovers from corrupt checkpoints.jsonl

feat: streaming transcript/checkpoint parsing to reduce memory amplification

Convert JSONL transcript parsers (Claude, Codex, Windsurf, Droid) from
read_to_string to BufReader + line-by-line streaming. Use from_reader
for JSON parsers (Gemini, Continue). Stream checkpoint JSONL reads via
BufReader. Eliminate double read_all_checkpoints() calls in
get_all_tracked_files() by threading checkpoints as a parameter.
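The "read once, thread through" refactor might look like the sketch below. The function names come from the PR description, but the `Checkpoint` fields and the dedup logic are illustrative guesses.

```rust
// Illustrative stand-in for the real checkpoint record.
struct Checkpoint {
    file_path: String,
}

// Takes the already-read checkpoints as a slice instead of calling
// read_all_checkpoints() itself, so the JSONL file is read exactly
// once by the caller no matter how many helpers need the data.
fn get_all_tracked_files(checkpoints: &[Checkpoint]) -> Vec<String> {
    let mut files: Vec<String> = checkpoints
        .iter()
        .map(|c| c.file_path.clone())
        .collect();
    files.sort();
    files.dedup();
    files
}

fn main() {
    let cps = vec![
        Checkpoint { file_path: "src/main.rs".into() },
        Checkpoint { file_path: "src/lib.rs".into() },
        Checkpoint { file_path: "src/main.rs".into() },
    ];
    assert_eq!(get_all_tracked_files(&cps).len(), 2);
}
```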

Add configurable size caps (max_checkpoint_jsonl_bytes=64MB,
max_transcript_bytes=32MB) via env vars or file config. Checkpoint cap
is advisory-only (warns but still parses). Transcript cap returns Err
to preserve existing data rather than silently replacing with empty.
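The two cap behaviors described above (hard Err for transcripts, advisory warning for checkpoints) can be sketched like this. The env-var name and function signatures are illustrative, not the ones the PR actually introduces.

```rust
use std::env;

const DEFAULT_MAX_TRANSCRIPT_BYTES: u64 = 32 * 1024 * 1024; // 32 MB

// Resolve a cap from an env var (the variable name is hypothetical),
// falling back to the default when unset or unparsable.
fn cap_from_env(var: &str, default: u64) -> u64 {
    env::var(var).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

// Hard cap: an oversized transcript is an Err, so the caller keeps its
// existing parsed data instead of silently ending up with nothing.
fn check_transcript_size(len: u64, cap: u64) -> Result<(), String> {
    if len > cap {
        return Err(format!("transcript is {len} bytes, cap is {cap}"));
    }
    Ok(())
}

// Advisory cap: warn and keep going, mirroring the checkpoint behavior.
fn check_checkpoint_size(len: u64, cap: u64) -> bool {
    if len > cap {
        eprintln!("warning: checkpoint file is {len} bytes (cap {cap}); parsing anyway");
        return false;
    }
    true
}

fn main() {
    let cap = cap_from_env("GIT_AI_MAX_TRANSCRIPT_BYTES", DEFAULT_MAX_TRANSCRIPT_BYTES);
    assert!(check_transcript_size(1024, cap).is_ok());
    assert!(check_transcript_size(cap + 1, cap).is_err());
}
```

The asymmetry is deliberate: a transcript over the cap fails loudly so no data is overwritten, while an oversized checkpoint file only logs a warning, since refusing to parse it would lose the user's existing attribution history.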

Perform hash migration in-place instead of allocating a second Vec.
Handle --reset with eager reset on read failure so corrupt checkpoint
files can be recovered without swallowing non-corruption I/O errors.
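The in-place migration pattern is roughly the following. The `"v2:"` prefix rule is a placeholder for the real hash migration; only the `iter_mut` shape (mutate where it sits, no second `Vec`) is the point.

```rust
// Illustrative checkpoint record; the real type has more fields.
struct Checkpoint {
    hash: String,
}

// In-place migration via iter_mut: each record is rewritten where it
// sits, so no second Vec<Checkpoint> is allocated alongside the first.
fn migrate_hashes(checkpoints: &mut [Checkpoint]) {
    for cp in checkpoints.iter_mut() {
        if !cp.hash.starts_with("v2:") {
            cp.hash = format!("v2:{}", cp.hash);
        }
    }
}

fn main() {
    let mut cps = vec![
        Checkpoint { hash: "abc123".into() },
        Checkpoint { hash: "v2:def456".into() },
    ];
    migrate_hashes(&mut cps);
    assert_eq!(cps[0].hash, "v2:abc123");
    assert_eq!(cps[1].hash, "v2:def456"); // already migrated, untouched
}
```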

Harden generated CI workflows with version pinning and integrity checks.

Targets: transcript RSS down >=40%, checkpoint RSS down >=30%.