Skip to content

fix: skip binary workspace diff contents#498

Open
saoudrizwan wants to merge 1 commit into
mainfrom
saoudrizwan/skip-binary-workspace-diffs
Open

fix: skip binary workspace diff contents#498
saoudrizwan wants to merge 1 commit into
mainfrom
saoudrizwan/skip-binary-workspace-diffs

Conversation

@saoudrizwan

@saoudrizwan saoudrizwan commented May 18, 2026

Copy link
Copy Markdown
Contributor

Problem

Kanban could crash or hang when the workspace changes view included large binary files. The concrete reproduction came from a Cline web worktree where Git reported several tracked mp4 files as modified. The previous workspace changes loader treated every changed file as text and read both sides of the diff into memory before returning the response.

That is unsafe for repositories with large committed assets. A changed mp4, webm, png, or other binary can be tens of megabytes, and reading it through UTF-8 text paths can allocate large strings, churn the workspace changes cache, and eventually drive the Node process into an out of memory crash.

Technical approach

The fix keeps the API shape unchanged and uses information Kanban was already asking Git for. git diff --numstat reports binary files with - for both additions and deletions. The workspace changes loader now parses that output into a shared DiffStat shape with an isBinary flag.

Each workspace file-change builder now reads numstat before loading file contents. If Git marks the file as binary, Kanban skips both the old-side read and the working-tree read, returning oldText: null and newText: null. This prevents binary bytes from entering the response payload or the workspace changes cache.

This applies to all three workspace changes paths:

getWorkspaceChanges
getWorkspaceChangesBetweenRefs
getWorkspaceChangesFromRef

The UI already detected binary-looking paths and avoided rendering diff rows for them. I added a small expanded-body message so users can tell the file was intentionally omitted instead of seeing an empty diff panel:

Binary file not shown.
image

Debugging notes

The original suspicion was that cline-web was simply large, but plain Git commands were fast. The expensive part was Kanban reading changed binary file contents as text after Git reported them in the diff.

A specific task worktree showed these files as modified:

public/assets/videos/plan-act-4k.mp4
public/assets/videos/refactor-web-1080p.mp4
public/assets/videos/understand-codebase-4k.mp4

Git reported them as binary in numstat:

-	-	public/assets/videos/plan-act-4k.mp4

That is the exact signal this PR now uses to avoid content reads.

There was also a repo-level reason those mp4s showed as modified in the task worktree: cline-web has LFS attributes for mp4 files, but those committed blobs appear to still be full mp4 blobs rather than LFS pointer blobs. Git therefore reports a binary diff even when the working-tree files are not something Kanban changed. This PR does not try to fix that repository state. It makes Kanban robust when Git reports binary changes.

Decisions

I intentionally kept this narrower than a broader file-size cap or API change. A size cap would also protect against very large text files, but it requires new response semantics and more UI states. The immediate failure mode was binary assets, and Git already provides a reliable signal for those through numstat.

The response contract remains unchanged. Binary files continue to appear in the file list, but their text fields stay null and the UI explains why no diff rows are shown.

Testing

Ran these focused checks before pushing:

npm test -- test/runtime/get-workspace-changes.test.ts test/runtime/git-utils.test.ts
npm --prefix web-ui run test -- src/components/git-history/git-commit-diff-panel.test.tsx src/components/detail-panels/diff-viewer-panel.test.tsx
npm run typecheck
npm --prefix web-ui run typecheck

The precommit hook also passed:

54 test files passed
548 tests passed

Follow-up

A separate improvement could add a file-size cap for very large text files. I would keep that separate because it changes behavior beyond the binary OOM case and likely deserves explicit UI wording for oversized text previews.

@greptile-apps

greptile-apps Bot commented May 18, 2026

Copy link
Copy Markdown

Greptile Summary

This PR prevents Kanban from crashing or hanging when a workspace contains large binary files (e.g., mp4s, pngs) that Git reports as modified. It uses the - - sentinel that git diff --numstat emits for binary files to skip content reads entirely, keeping the response contract unchanged.

  • Adds isBinary to the internal DiffStat shape and extracts parseDiffStatOutput to handle binary detection, then guards all three workspace-changes builder paths (buildFileChange, buildFileChangeBetweenRefs, buildFileChangeFromRef) so that binary files return oldText: null / newText: null without touching the file contents.
  • Replaces the silent empty-body render for binary diff entries in both DiffViewerPanel and GitCommitDiffPanel with an explicit "Binary file not shown." message.
  • Adds integration tests that commit a 50 MiB sparse file and assert no text is loaded.

Confidence Score: 4/5

Safe to merge for the tracked-binary crash case; untracked binary files remain unguarded but this is a pre-existing gap, not a regression.

The binary guard works correctly for all three tracked-diff paths and is well-tested with a real git repo. The one gap is untracked files: diffStat is forced to null for them (since numstat has nothing to compare against), so a large untracked binary would still be read as UTF-8 text via readWorkingTreeFile, leaving the original OOM vector open for that narrower case. The PR description calls this out as a known follow-up, and the immediate crash scenario (tracked mp4s in a worktree) is fully addressed.

src/workspace/get-workspace-changes.ts — the untracked-file branch in buildFileChange still calls readWorkingTreeFile unconditionally for non-deleted files.

Important Files Changed

Filename Overview
src/workspace/get-workspace-changes.ts Adds isBinary to DiffStat, extracts parseDiffStatOutput to detect binary via numstat's "-\t-" sentinel, and gates all file-content reads on that flag. Untracked binary files are not covered by the guard.
test/runtime/get-workspace-changes.test.ts New integration tests use a real git repo with a 50 MiB sparse file to verify binary content is skipped. Covers getWorkspaceChanges and getWorkspaceChangesBetweenRefs; untracked binary scenario is not tested.
web-ui/src/components/detail-panels/diff-viewer-panel.tsx Replaces null render for binary entries with an explicit "Binary file not shown." message, consistent with surrounding inline-style usage.
web-ui/src/components/git-history/git-commit-diff-panel.tsx Refactors the binary/rows/no-diff conditional chain to make the binary branch explicit and removes an unreachable null branch; adds "Binary file not shown." message.
web-ui/src/components/detail-panels/diff-viewer-panel.test.tsx Adds assertion for "Binary file not shown." text in the binary file test case; straightforward update.
web-ui/src/components/git-history/git-commit-diff-panel.test.tsx Renames the binary test and adds assertion for "Binary file not shown." message; also verifies the old "No textual diff available." message is absent.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant buildFileChange
    participant readDiffStat
    participant parseDiffStatOutput
    participant readHeadFile
    participant readWorkingTreeFile

    Caller->>buildFileChange: entry (status, path)
    alt "entry.status == untracked"
        buildFileChange->>buildFileChange: "diffStat = null (skip numstat)"
    else tracked file
        buildFileChange->>readDiffStat: (repoRoot, path)
        readDiffStat->>parseDiffStatOutput: git diff --numstat output
        alt binary file
            parseDiffStatOutput-->>readDiffStat: isBinary:true
        else text file
            parseDiffStatOutput-->>readDiffStat: isBinary:false
        end
        readDiffStat-->>buildFileChange: DiffStat or null
    end
    alt "diffStat?.isBinary == true"
        buildFileChange->>buildFileChange: "oldText=null, newText=null"
    else not binary
        buildFileChange->>readHeadFile: (repoRoot, basePath)
        readHeadFile-->>buildFileChange: oldText
        buildFileChange->>readWorkingTreeFile: (repoRoot, path)
        readWorkingTreeFile-->>buildFileChange: newText
    end
    buildFileChange-->>Caller: RuntimeWorkspaceFileChange
Loading

Comments Outside Diff (1)

  1. src/workspace/get-workspace-changes.ts, line 271-294 (link)

    P2 Untracked binary files bypass the binary guard

    For files with entry.status === "untracked", diffStat is hard-coded to null because git diff --numstat HEAD produces no output for untracked paths. As a result, diffStat?.isBinary is always falsy for these entries, and readWorkingTreeFile is called unconditionally when the entry is neither deleted nor binary-flagged. A large untracked binary (e.g., a staged-but-not-committed mp4) would still be read into memory as UTF-8 text, reproducing the same OOM risk the PR is fixing for modified tracked files.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/workspace/get-workspace-changes.ts
    Line: 271-294
    
    Comment:
    **Untracked binary files bypass the binary guard**
    
    For files with `entry.status === "untracked"`, `diffStat` is hard-coded to `null` because `git diff --numstat HEAD` produces no output for untracked paths. As a result, `diffStat?.isBinary` is always falsy for these entries, and `readWorkingTreeFile` is called unconditionally when the entry is neither deleted nor binary-flagged. A large untracked binary (e.g., a staged-but-not-committed mp4) would still be read into memory as UTF-8 text, reproducing the same OOM risk the PR is fixing for modified tracked files.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
src/workspace/get-workspace-changes.ts:271-294
**Untracked binary files bypass the binary guard**

For files with `entry.status === "untracked"`, `diffStat` is hard-coded to `null` because `git diff --numstat HEAD` produces no output for untracked paths. As a result, `diffStat?.isBinary` is always falsy for these entries, and `readWorkingTreeFile` is called unconditionally when the entry is neither deleted nor binary-flagged. A large untracked binary (e.g., a staged-but-not-committed mp4) would still be read into memory as UTF-8 text, reproducing the same OOM risk the PR is fixing for modified tracked files.

Reviews (1): Last reviewed commit: "fix: skip binary workspace diff contents" | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant