feat: streaming transcript/checkpoint parsing to reduce memory amplification (#956)
Draft
Krishnachaitanyakc wants to merge 1 commit into git-ai-project:main
Convert JSONL transcript parsers (Claude, Codex, Windsurf, Droid) from `read_to_string` to `BufReader` + line-by-line streaming. Use `from_reader` for JSON parsers (Gemini, Continue). Stream checkpoint JSONL reads via `BufReader`. Eliminate the double `read_all_checkpoints()` calls in `get_all_tracked_files()` by threading checkpoints as a parameter. Add configurable size caps (`max_checkpoint_jsonl_bytes=64MB`, `max_transcript_bytes=32MB`) via env vars or file config. The checkpoint cap is advisory-only (warns but still parses); the transcript cap returns `Err` to preserve existing data rather than silently replacing it with an empty result. Perform hash migration in place instead of allocating a second Vec. Handle `--reset` with an eager reset on read failure, so corrupt checkpoint files can be recovered without swallowing non-corruption I/O errors. Harden generated CI workflows with version pinning and integrity checks.

Targets: transcript RSS down >=40%, checkpoint RSS down >=30%.
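The central conversion, replacing `read_to_string` with `BufReader` line-by-line streaming for JSONL files, can be sketched as follows. This is a minimal illustration, not git-ai's actual code: the function name is hypothetical, and the real parsers would deserialize each line (e.g. with `serde_json::from_str`) rather than collect raw strings.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Stream a JSONL file line by line instead of reading it whole:
/// peak memory is bounded by the longest single line, not the file size.
/// (Hypothetical sketch; git-ai's real parsers would feed each line to a
/// JSON deserializer instead of collecting the raw strings.)
fn stream_jsonl_lines(path: &str) -> std::io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    let mut lines = Vec::new();
    for line in reader.lines() {
        let line = line?;
        // Skip blank lines; a real parser would also decide how to
        // handle lines that fail to deserialize.
        if !line.trim().is_empty() {
            lines.push(line);
        }
    }
    Ok(lines)
}
```

Because `lines()` yields one line at a time, the intermediate working set is proportional to a single line rather than the whole file, which is what drives the RSS reductions targeted here.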
Summary
- Convert JSONL transcript parsers from `read_to_string` to streaming `BufReader` + `.lines()`, eliminating full-file memory materialization
- Switch JSON parsers from `read_to_string` + `from_str` to `from_reader`, avoiding the intermediate `String` allocation
- Read checkpoint JSONL via `BufReader` instead of `read_to_string`
- Eliminate the double `read_all_checkpoints()` calls in `get_all_tracked_files()` by reading once upfront and threading as `&[Checkpoint]`
- Perform hash migration in place (`&mut` iteration) instead of allocating a second Vec
- Add configurable size caps (`max_checkpoint_jsonl_bytes=64MB`, `max_transcript_bytes=32MB`) via env vars or file config; the checkpoint cap is advisory-only, the transcript cap returns `Err` to preserve existing data
- Handle `--reset` with an eager reset on read failure, propagating write errors independently

Motivation
git-ai's transcript/checkpoint parsers cause ~5-8x memory amplification: a 187 MB transcript produces ~1.2 GB peak RSS, and a 307 MB checkpoint file produces ~1.78 GB peak RSS with ~33s runtime. This causes OOM kills and poor UX on long AI coding sessions.
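The configurable size caps could be resolved roughly as below. This is a sketch only: the env var name `GIT_AI_MAX_TRANSCRIPT_BYTES` and both helper functions are hypothetical; the PR names the config keys (`max_transcript_bytes=32MB`) but not the actual env var spelling or API.

```rust
use std::env;

// Default from the PR description (32 MB transcript cap).
const DEFAULT_MAX_TRANSCRIPT_BYTES: u64 = 32 * 1024 * 1024;

/// Resolve the byte cap from an env var, falling back to the default.
/// (Env var name is illustrative, not git-ai's actual name.)
fn max_transcript_bytes() -> u64 {
    env::var("GIT_AI_MAX_TRANSCRIPT_BYTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_MAX_TRANSCRIPT_BYTES)
}

/// The transcript cap is hard: return Err so existing data is preserved
/// rather than silently replaced with an empty parse. (The checkpoint
/// cap, by contrast, only warns and still parses.)
fn check_transcript_size(len: u64) -> Result<(), String> {
    let cap = max_transcript_bytes();
    if len > cap {
        Err(format!("transcript is {len} bytes, exceeds cap of {cap} bytes"))
    } else {
        Ok(())
    }
}
```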
Targets

- Transcript parsing peak RSS down >=40%
- Checkpoint parsing peak RSS down >=30%
Test plan
- `cargo fmt -- --check` — clean
- `cargo clippy --all-targets -- -D warnings` — clean
- `cargo test --lib` — 1205 passed, 0 failed
- `scripts/repro_runaway_memory.py` for before/after RSS measurements
- Verified `--reset` recovers from a corrupt `checkpoints.jsonl`
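The `--reset` recovery behavior exercised in the last test item could look roughly like this. It is a sketch under stated assumptions: `read_checkpoints_raw` and its corruption check stand in for git-ai's real checkpoint parser, which this PR text does not show.

```rust
use std::fs;
use std::io::{self, ErrorKind};

/// --reset semantics: a corrupt or missing checkpoints file is safe to
/// reset eagerly, but other I/O errors (permissions, hardware) still
/// propagate instead of being swallowed.
fn reset_checkpoints(path: &str) -> io::Result<()> {
    match read_checkpoints_raw(path) {
        // Readable file, corrupt contents, or no file at all: reset anyway.
        Ok(_) => {}
        Err(e) if e.kind() == ErrorKind::InvalidData || e.kind() == ErrorKind::NotFound => {}
        // Any other read failure is a real error the user must see.
        Err(e) => return Err(e),
    }
    // The write error is propagated independently of the read outcome.
    fs::write(path, b"")
}

/// Stand-in for git-ai's checkpoint parser: treats any line that is not
/// shaped like a JSON object as corruption (InvalidData).
fn read_checkpoints_raw(path: &str) -> io::Result<Vec<String>> {
    let data = fs::read_to_string(path)?;
    data.lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| {
            if l.starts_with('{') && l.ends_with('}') {
                Ok(l.to_string())
            } else {
                Err(io::Error::new(ErrorKind::InvalidData, "corrupt checkpoint line"))
            }
        })
        .collect()
}
```

Distinguishing `InvalidData` from other error kinds is what lets `--reset` recover corrupt files without masking genuine I/O failures.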