Skip to content

fix(base-extractor): preserve full string value across escape sequences in getStringValue#450

Open
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/base-extractor-string-escapes
Open

fix(base-extractor): preserve full string value across escape sequences in getStringValue#450
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/base-extractor-string-escapes

Conversation

@tirth8205

Copy link
Copy Markdown
Contributor

Problem

  • tree-sitter splits a string literal's contents into MULTIPLE string_fragment nodes whenever an escape_sequence (e.g. \t, ", \n) appears between them. getStringValue returns only the FIRST string_fragment child, so everything from the first escape onward is silently dropped. I verified this against the real tree-sitter-typescript WASM grammar shipped in the repo: for the source import x from './a\tb' the string node has children [" , string_fragment 'a', escape_sequence '\t', string_fragment 'b', '], and getStringValue returns "./a" instead of the full import source. Likewise "a\"b" returns "a" (should be a"b) and "he\tllo" returns "he". The only consumer, typescript-extractor.extractImport (line 364), therefore records a truncated import source for any module path containing an escape, and getStringValue is also re-exported publicly from extractors/index.ts.

Fix

  • Concatenate the text of every content child (string_fragment and escape_sequence) instead of returning only the first fragment: export function getStringValue(node: TreeSitterNode): string { let value = ""; let found = false; for (let i = 0; i < node.childCount; i++) { const child = node.child(i); if (child && (child.type === "string_fragment" || child.type === "escape_sequence")) { value += child.text; found = true; } } if (found) return value; return node.text.replace(/^['"]|['"]$/g, ""); }…

Testing

Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the shared tree-sitter base extractor.

🤖 Generated with Claude Code

…es in getStringValue

tree-sitter splits a string literal's contents into multiple string_fragment
nodes whenever an escape_sequence appears between them. getStringValue returned
only the first string_fragment, silently dropping everything from the first
escape onward (e.g. './a\tb' became './a'). This truncated import sources for
any module path containing an escape.

Fix concatenates the text of every content child (string_fragment and
escape_sequence) instead of returning only the first fragment, preserving the
full raw value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@thejesh23 thejesh23 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Function returns raw source text, not the decoded string value.
escape_sequence.text is the literal source (e.g. the two characters \ + t), so getStringValue on './a\tb' now returns ./a\tb verbatim — a backslash plus t, not a tab. The name and JSDoc ("unquoted string value") imply a decoded value; consumers comparing import sources to filesystem paths or resolved module specifiers will still be wrong, just wrong differently. Either decode the escapes here or rename/document this as "raw inner text".

2. Behavior is JS/TS-grammar-specific but the helper is re-exported as generic.
Only string_fragment/escape_sequence are recognized; Python/Go/Rust/Ruby/etc. use different child node types (string_content, interpreted_string_literal children, raw_string_literal, etc.), so for those grammars the function silently falls through to the quote-stripping path. Worth either restricting the helper to JS-family extractors or adding the other content node types — relevant to #435 (Dart extractor already calls this for string-literal imports).

3. Test coverage misses the fallback and template-string paths.
All three new tests hit the new concatenation branch on TS grammar; the node.text.replace(/^['"]|['"]$/g, "") fallback and template_string (with template_chars / ${...}) are still unexercised, and the second assertion uses toContain rather than toBe so it would also pass on the pre-fix truncated output a. Tighten that assertion and add a fallback-path case.

…contract

Address review on getStringValue:
- Clarify JSDoc that the helper returns raw inner text with escape
  sequences preserved verbatim (not decoded), so consumers don't assume a
  decoded value.
- Document that recognized content nodes (string_fragment/escape_sequence)
  are JS/TS-family; other grammars fall through to the quote-stripping
  fallback. Annotate the fallback branch.
- Tighten the escaped-quote assertion from toContain to toBe('a\\"b') so a
  regression to the truncated first fragment would fail.
- Add a fallback-path test covering nodes without JS-family content
  children (single-quote, double-quote, and backtick stripping).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tirth8205

Copy link
Copy Markdown
Contributor Author

1. Decode escapes vs. raw inner text — Keeping the raw-concatenation contract; not adding escape decoding. This PR's scope is fixing the truncation where getStringValue returned only the first string_fragment and dropped everything after the first escape ('./a\tb''./a'). The fix concatenates string_fragment + escape_sequence text, yielding the full inner text ./a\tb (backslash + t). The only non-test consumer is typescript-extractor.extractImport, whose result feeds resolve(dir, source); real JS/TS import specifiers use forward slashes and effectively never contain C-style escapes, so decoding wouldn't improve resolution and could actively corrupt a legitimate path containing a backslash, while also changing behavior for other potential callers. That said, the JSDoc was genuinely misleading — I've rewritten it to state the function returns raw inner text with escape sequences preserved verbatim (not decoded), so consumers don't assume a decoded value.

2. JS/TS-grammar specificity of the helper — Documented rather than speculatively extended. Added a comment/JSDoc noting the recognized content nodes (string_fragment / escape_sequence) are JS-family and that other grammars (Python string_content, Go/Rust string-literal children) have no matching children and fall through to the quote-stripping fallback, and annotated the fallback branch. The only caller on this branch is typescript-extractor (no Dart extractor exists here yet — the #435 reference is forward-looking). Adding other grammars' node types unverified would be fragile in a bugfix PR, so extending the recognized set is left to the extractor PRs that need it.

3. Test coverage + assertion tightness — Fixed. Changed the escaped-quote assertion from toContain('a\"b') to toBe('a\"b') (verified against the real parse tree that this is the exact raw value), so a regression to the truncated first fragment "a" fails. Added a fallback-path test covering nodes without JS-family content children (single-quote, double-quote, and backtick stripping). Note on the template_string suggestion: I confirmed via the parse tree that TS template_string children are also string_fragment, so templates hit the concatenation branch, not the fallback — the fallback only triggers when there are no string_fragment/escape_sequence children, which is the case the new test exercises (and which is exactly what non-JS grammars produce). One factual note on the original rationale: the pre-fix output "a" does not contain a\"b, so the old toContain would already have failed before the fix — it wasn't a false-pass — but tightening to toBe is still the right call.

Full core suite: 697 passed. tsc --noEmit: clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants