fix(base-extractor): preserve full string value across escape sequences in getStringValue by tirth8205 · Pull Request #450 · Egonex-AI/Understand-Anything

tirth8205 · 2026-06-14T18:24:20Z

Problem

tree-sitter splits a string literal's contents into MULTIPLE string_fragment nodes whenever an escape_sequence (e.g. \t, ", \n) appears between them. getStringValue returns only the FIRST string_fragment child, so everything from the first escape onward is silently dropped. I verified this against the real tree-sitter-typescript WASM grammar shipped in the repo: for the source import x from './a\tb' the string node has children [" , string_fragment 'a', escape_sequence '\t', string_fragment 'b', '], and getStringValue returns "./a" instead of the full import source. Likewise "a\"b" returns "a" (should be a"b) and "he\tllo" returns "he". The only consumer, typescript-extractor.extractImport (line 364), therefore records a truncated import source for any module path containing an escape, and getStringValue is also re-exported publicly from extractors/index.ts.

Fix

Concatenate the text of every content child (string_fragment and escape_sequence) instead of returning only the first fragment: export function getStringValue(node: TreeSitterNode): string { let value = ""; let found = false; for (let i = 0; i < node.childCount; i++) { const child = node.child(i); if (child && (child.type === "string_fragment" || child.type === "escape_sequence")) { value += child.text; found = true; } } if (found) return value; return node.text.replace(/^['"]|['"]$/g, ""); }…

Testing

Adds unit test(s) that fail before the change and pass after. The full core test suite, eslint, and tsc --noEmit all pass locally on this branch.

Found via a static correctness audit of the shared tree-sitter base extractor.

🤖 Generated with Claude Code

…es in getStringValue tree-sitter splits a string literal's contents into multiple string_fragment nodes whenever an escape_sequence appears between them. getStringValue returned only the first string_fragment, silently dropping everything from the first escape onward (e.g. './a\tb' became './a'). This truncated import sources for any module path containing an escape. Fix concatenates the text of every content child (string_fragment and escape_sequence) instead of returning only the first fragment, preserving the full raw value. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thejesh23

1. Function returns raw source text, not the decoded string value.
escape_sequence.text is the literal source (e.g. the two characters \ + t), so getStringValue on './a\tb' now returns ./a\tb verbatim — a backslash plus t, not a tab. The name and JSDoc ("unquoted string value") imply a decoded value; consumers comparing import sources to filesystem paths or resolved module specifiers will still be wrong, just wrong differently. Either decode the escapes here or rename/document this as "raw inner text".

2. Behavior is JS/TS-grammar-specific but the helper is re-exported as generic.
Only string_fragment/escape_sequence are recognized; Python/Go/Rust/Ruby/etc. use different child node types (string_content, interpreted_string_literal children, raw_string_literal, etc.), so for those grammars the function silently falls through to the quote-stripping path. Worth either restricting the helper to JS-family extractors or adding the other content node types — relevant to #435 (Dart extractor already calls this for string-literal imports).

3. Test coverage misses the fallback and template-string paths.
All three new tests hit the new concatenation branch on TS grammar; the node.text.replace(/^['"]|['"]$/g, "") fallback and template_string (with template_chars / ${...}) are still unexercised, and the second assertion uses toContain rather than toBe so it would also pass on the pre-fix truncated output a. Tighten that assertion and add a fallback-path case.

…contract Address review on getStringValue: - Clarify JSDoc that the helper returns raw inner text with escape sequences preserved verbatim (not decoded), so consumers don't assume a decoded value. - Document that recognized content nodes (string_fragment/escape_sequence) are JS/TS-family; other grammars fall through to the quote-stripping fallback. Annotate the fallback branch. - Tighten the escaped-quote assertion from toContain to toBe('a\\"b') so a regression to the truncated first fragment would fail. - Add a fallback-path test covering nodes without JS-family content children (single-quote, double-quote, and backtick stripping). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

tirth8205 · 2026-06-16T22:36:39Z

1. Decode escapes vs. raw inner text — Keeping the raw-concatenation contract; not adding escape decoding. This PR's scope is fixing the truncation where getStringValue returned only the first string_fragment and dropped everything after the first escape ('./a\tb' → './a'). The fix concatenates string_fragment + escape_sequence text, yielding the full inner text ./a\tb (backslash + t). The only non-test consumer is typescript-extractor.extractImport, whose result feeds resolve(dir, source); real JS/TS import specifiers use forward slashes and effectively never contain C-style escapes, so decoding wouldn't improve resolution and could actively corrupt a legitimate path containing a backslash, while also changing behavior for other potential callers. That said, the JSDoc was genuinely misleading — I've rewritten it to state the function returns raw inner text with escape sequences preserved verbatim (not decoded), so consumers don't assume a decoded value.

2. JS/TS-grammar specificity of the helper — Documented rather than speculatively extended. Added a comment/JSDoc noting the recognized content nodes (string_fragment / escape_sequence) are JS-family and that other grammars (Python string_content, Go/Rust string-literal children) have no matching children and fall through to the quote-stripping fallback, and annotated the fallback branch. The only caller on this branch is typescript-extractor (no Dart extractor exists here yet — the #435 reference is forward-looking). Adding other grammars' node types unverified would be fragile in a bugfix PR, so extending the recognized set is left to the extractor PRs that need it.

3. Test coverage + assertion tightness — Fixed. Changed the escaped-quote assertion from toContain('a\"b') to toBe('a\"b') (verified against the real parse tree that this is the exact raw value), so a regression to the truncated first fragment "a" fails. Added a fallback-path test covering nodes without JS-family content children (single-quote, double-quote, and backtick stripping). Note on the template_string suggestion: I confirmed via the parse tree that TS template_string children are also string_fragment, so templates hit the concatenation branch, not the fallback — the fallback only triggers when there are no string_fragment/escape_sequence children, which is the case the new test exercises (and which is exactly what non-JS grammars produce). One factual note on the original rationale: the pre-fix output "a" does not contain a\"b, so the old toContain would already have failed before the fix — it wasn't a false-pass — but tightening to toBe is still the right call.

Full core suite: 697 passed. tsc --noEmit: clean.

thejesh23 reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(base-extractor): preserve full string value across escape sequences in getStringValue#450

fix(base-extractor): preserve full string value across escape sequences in getStringValue#450
tirth8205 wants to merge 2 commits into
Egonex-AI:mainfrom
tirth8205:fix/base-extractor-string-escapes

tirth8205 commented Jun 14, 2026

Uh oh!

thejesh23 left a comment

Uh oh!

tirth8205 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tirth8205 commented Jun 14, 2026

Problem

Fix

Testing

Uh oh!

thejesh23 left a comment

Choose a reason for hiding this comment

Uh oh!

tirth8205 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants