fix: skip binary workspace diff contents#498
Conversation
Greptile SummaryThis PR prevents Kanban from crashing or hanging when a workspace contains large binary files (e.g., mp4s, pngs) that Git reports as modified. It uses the
Confidence Score: 4/5Safe to merge for the tracked-binary crash case; untracked binary files remain unguarded but this is a pre-existing gap, not a regression. The binary guard works correctly for all three tracked-diff paths and is well-tested with a real git repo. The one gap is untracked files: diffStat is forced to null for them (since numstat has nothing to compare against), so a large untracked binary would still be read as UTF-8 text via readWorkingTreeFile, leaving the original OOM vector open for that narrower case. The PR description calls this out as a known follow-up, and the immediate crash scenario (tracked mp4s in a worktree) is fully addressed. src/workspace/get-workspace-changes.ts — the untracked-file branch in buildFileChange still calls readWorkingTreeFile unconditionally for non-deleted files.
|
| Filename | Overview |
|---|---|
| src/workspace/get-workspace-changes.ts | Adds isBinary to DiffStat, extracts parseDiffStatOutput to detect binary via numstat's "-\t-" sentinel, and gates all file-content reads on that flag. Untracked binary files are not covered by the guard. |
| test/runtime/get-workspace-changes.test.ts | New integration tests use a real git repo with a 50 MiB sparse file to verify binary content is skipped. Covers getWorkspaceChanges and getWorkspaceChangesBetweenRefs; untracked binary scenario is not tested. |
| web-ui/src/components/detail-panels/diff-viewer-panel.tsx | Replaces null render for binary entries with an explicit "Binary file not shown." message, consistent with surrounding inline-style usage. |
| web-ui/src/components/git-history/git-commit-diff-panel.tsx | Refactors the binary/rows/no-diff conditional chain to make the binary branch explicit and removes an unreachable null branch; adds "Binary file not shown." message. |
| web-ui/src/components/detail-panels/diff-viewer-panel.test.tsx | Adds assertion for "Binary file not shown." text in the binary file test case; straightforward update. |
| web-ui/src/components/git-history/git-commit-diff-panel.test.tsx | Renames the binary test and adds assertion for "Binary file not shown." message; also verifies the old "No textual diff available." message is absent. |
Sequence Diagram
sequenceDiagram
participant Caller
participant buildFileChange
participant readDiffStat
participant parseDiffStatOutput
participant readHeadFile
participant readWorkingTreeFile
Caller->>buildFileChange: entry (status, path)
alt "entry.status == untracked"
buildFileChange->>buildFileChange: "diffStat = null (skip numstat)"
else tracked file
buildFileChange->>readDiffStat: (repoRoot, path)
readDiffStat->>parseDiffStatOutput: git diff --numstat output
alt binary file
parseDiffStatOutput-->>readDiffStat: isBinary:true
else text file
parseDiffStatOutput-->>readDiffStat: isBinary:false
end
readDiffStat-->>buildFileChange: DiffStat or null
end
alt "diffStat?.isBinary == true"
buildFileChange->>buildFileChange: "oldText=null, newText=null"
else not binary
buildFileChange->>readHeadFile: (repoRoot, basePath)
readHeadFile-->>buildFileChange: oldText
buildFileChange->>readWorkingTreeFile: (repoRoot, path)
readWorkingTreeFile-->>buildFileChange: newText
end
buildFileChange-->>Caller: RuntimeWorkspaceFileChange
Comments Outside Diff (1)
-
src/workspace/get-workspace-changes.ts, line 271-294 (link)Untracked binary files bypass the binary guard
For files with
entry.status === "untracked",diffStatis hard-coded tonullbecausegit diff --numstat HEADproduces no output for untracked paths. As a result,diffStat?.isBinaryis always falsy for these entries, andreadWorkingTreeFileis called unconditionally when the entry is neither deleted nor binary-flagged. A large untracked binary (e.g., a staged-but-not-committed mp4) would still be read into memory as UTF-8 text, reproducing the same OOM risk the PR is fixing for modified tracked files.Prompt To Fix With AI
This is a comment left during a code review. Path: src/workspace/get-workspace-changes.ts Line: 271-294 Comment: **Untracked binary files bypass the binary guard** For files with `entry.status === "untracked"`, `diffStat` is hard-coded to `null` because `git diff --numstat HEAD` produces no output for untracked paths. As a result, `diffStat?.isBinary` is always falsy for these entries, and `readWorkingTreeFile` is called unconditionally when the entry is neither deleted nor binary-flagged. A large untracked binary (e.g., a staged-but-not-committed mp4) would still be read into memory as UTF-8 text, reproducing the same OOM risk the PR is fixing for modified tracked files. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
src/workspace/get-workspace-changes.ts:271-294
**Untracked binary files bypass the binary guard**
For files with `entry.status === "untracked"`, `diffStat` is hard-coded to `null` because `git diff --numstat HEAD` produces no output for untracked paths. As a result, `diffStat?.isBinary` is always falsy for these entries, and `readWorkingTreeFile` is called unconditionally when the entry is neither deleted nor binary-flagged. A large untracked binary (e.g., a staged-but-not-committed mp4) would still be read into memory as UTF-8 text, reproducing the same OOM risk the PR is fixing for modified tracked files.
Reviews (1): Last reviewed commit: "fix: skip binary workspace diff contents" | Re-trigger Greptile
Problem
Kanban could crash or hang when the workspace changes view included large binary files. The concrete reproduction came from a Cline web worktree where Git reported several tracked mp4 files as modified. The previous workspace changes loader treated every changed file as text and read both sides of the diff into memory before returning the response.
That is unsafe for repositories with large committed assets. A changed mp4, webm, png, or other binary can be tens of megabytes, and reading it through UTF-8 text paths can allocate large strings, churn the workspace changes cache, and eventually drive the Node process into an out of memory crash.
Technical approach
The fix keeps the API shape unchanged and uses information Kanban was already asking Git for.
git diff --numstatreports binary files with-for both additions and deletions. The workspace changes loader now parses that output into a sharedDiffStatshape with anisBinaryflag.Each workspace file-change builder now reads numstat before loading file contents. If Git marks the file as binary, Kanban skips both the old-side read and the working-tree read, returning
oldText: nullandnewText: null. This prevents binary bytes from entering the response payload or the workspace changes cache.This applies to all three workspace changes paths:
The UI already detected binary-looking paths and avoided rendering diff rows for them. I added a small expanded-body message so users can tell the file was intentionally omitted instead of seeing an empty diff panel:
Debugging notes
The original suspicion was that
cline-webwas simply large, but plain Git commands were fast. The expensive part was Kanban reading changed binary file contents as text after Git reported them in the diff.A specific task worktree showed these files as modified:
Git reported them as binary in numstat:
That is the exact signal this PR now uses to avoid content reads.
There was also a repo-level reason those mp4s showed as modified in the task worktree:
cline-webhas LFS attributes for mp4 files, but those committed blobs appear to still be full mp4 blobs rather than LFS pointer blobs. Git therefore reports a binary diff even when the working-tree files are not something Kanban changed. This PR does not try to fix that repository state. It makes Kanban robust when Git reports binary changes.Decisions
I intentionally kept this narrower than a broader file-size cap or API change. A size cap would also protect against very large text files, but it requires new response semantics and more UI states. The immediate failure mode was binary assets, and Git already provides a reliable signal for those through numstat.
The response contract remains unchanged. Binary files continue to appear in the file list, but their text fields stay null and the UI explains why no diff rows are shown.
Testing
Ran these focused checks before pushing:
The precommit hook also passed:
Follow-up
A separate improvement could add a file-size cap for very large text files. I would keep that separate because it changes behavior beyond the binary OOM case and likely deserves explicit UI wording for oversized text previews.