Skip to content

fix(staleness): read changed files with -z so non-ASCII/quoted paths are detected#452

Open
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/staleness-nonascii-paths
Open

fix(staleness): read changed files with -z so non-ASCII/quoted paths are detected#452
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/staleness-nonascii-paths

Conversation

@tirth8205

Copy link
Copy Markdown
Contributor

Problem

  • git diff <hash>..HEAD --name-only C-quotes any path containing non-ASCII bytes (or quotes/backslashes/control chars): it wraps the path in double quotes and octal-escapes the bytes. For a file uni-café.txt, git emits the literal line "uni-caf\303\251.txt". getChangedFiles splits on \n, trims, and returns the raw lines verbatim with NO unquoting, so it returns the string "uni-caf\303\251.txt" instead of uni-café.txt. Verified against a real git repo: git diff BASE..HEAD --name-only on a repo containing uni-café.txt prints "uni-caf\303\251.txt". Downstream, mergeGraphUpdate does changedSet.has(node.filePath) against the real stored path uni-café.txt, which never equals the quoted/escaped string — so the changed file's stale nodes are silently kept and never refreshed (and isStale's changedFiles list contains a bogus path). Repos with CJK/accented/emoji filenames are…

Fix

  • Disable git path quoting and split on NUL instead of newline. Add the -z flag (NUL-terminated, no quoting) and parse accordingly, or pass -c core.quotePath=false. With -z: execFileSync('git', ['diff', ${lastCommitHash}..HEAD, '--name-only', '-z'], {cwd: projectDir, encoding: 'utf-8'}) then output.split('\0').map(l => l.trim()).filter(l => l.length > 0). The -z form also tolerates paths that contain literal newlines. (~3 LOC change.)

Testing

Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the incremental staleness detection.

🤖 Generated with Claude Code

…are detected

git diff --name-only C-quotes paths with non-ASCII/control bytes (e.g.
uni-café.txt becomes "uni-caf\303\251.txt"), so getChangedFiles returned
the escaped string and the file's stale nodes were never matched/refreshed.
Pass -z (NUL-terminated, unquoted) and split on NUL instead of newline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@thejesh23 thejesh23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. line.trim() defeats part of the -z fix.
The whole point of -z is to preserve raw bytes (including whitespace) in path tokens; git diff --name-only -z will never emit surrounding whitespace itself, so the only thing trim() can do here is silently corrupt legitimate filenames that contain leading/trailing spaces or tabs. Drop the .trim() and rely on the length > 0 filter to drop the trailing empty token after the final NUL.

2. Paths with embedded newlines aren't exercised in tests.
The PR description calls out that -z "tolerates paths that contain literal newlines," but neither new test case actually contains a \n inside a path — the only failure mode currently covered is non-ASCII and a space. Add a "weird\nname.txt\0" fixture so the regression guard matches the stated motivation, otherwise a future refactor reintroducing split("\n") would still pass.

3. Rename/copy handling silently regresses on the first commit hash with a rename.
git diff --name-only emits a single post-rename path, but if anything downstream ever swaps in --name-status or -M (common when extending staleness to track rename edges), each rename entry under -z is three NUL-separated tokens (R100\0old\0new\0), and the current split("\0").filter(length>0) would treat R100 and old as changed file paths. Worth a comment on getChangedFiles pinning the --name-only assumption, or a parser that consumes status prefixes properly.

Nit: the inline comment example uni-café.txt uses the post-decoding form; consider showing the actual bytes (\"uni-caf\\303\\251.txt\") git emits, which is what made this hard to spot.

git diff --name-only -z emits raw NUL-terminated path bytes and never
adds surrounding whitespace, so trim() could only corrupt legitimate
filenames with leading/trailing spaces or tabs. Drop it and rely on the
length filter to discard the trailing empty token.

Add regression tests for embedded-newline paths (the stated -z
motivation) and for whitespace-preserving tokens. Document the
--name-only single-token invariant so a future switch to
--name-status/-M does not silently break the parser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tirth8205

Copy link
Copy Markdown
Contributor Author

1. line.trim() defeats the -z byte-preservation contract — Fixed. Dropped the .map(line => line.trim()) step; parse is now output.split("\0").filter(line => line.length > 0). Since git diff --name-only -z emits raw, NUL-terminated path bytes and never adds surrounding whitespace, trim() could only corrupt legitimate filenames with leading/trailing spaces or tabs. The trailing empty token after the final NUL is still removed by the length filter. Added a test asserting " leading.txt\0trailing.txt \0\ttabbed.txt\0" round-trips with whitespace intact.

2. No regression test for embedded newlines — Added. New case mocks execFileSync returning "weird\nname.txt\0other.txt\0" and asserts ["weird\nname.txt", "other.txt"]. This is precisely what a reintroduced split("\n") would break, so the stated -z motivation is now locked.

3. Rename/copy --name-status / -M token-triples — There is no current bug (the code passes only --name-only with no -M/-C, so git emits a single post-rename path per entry), so I took the documented-invariant route you suggested rather than adding a status-prefix parser that no current code path exercises. Added a comment above the execFileSync call pinning the --name-only single-token assumption and warning that --name-status/-M/-C would require consuming status-prefix tokens (R100\0old\0new\0) before this parser is reused.

Nit — The branch comment already shows the actual git-emitted bytes ("uni-caf\303\251.txt") alongside the decoded form, so no change needed there.

Full core suite green (695 passed); tsc --noEmit clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants