Replace TRegExpr with purpose-built backtracking bytecode VM regex engine#585
Conversation
…gine (#515) TRegExpr used native call recursion for backtracking, causing SIGSEGV on inputs ~42K+ chars when combined with the evaluator's stack depth. Three preprocessing passes papered over feature gaps (modifier scope leak, no named groups, ASCII-approximate Unicode properties). This replaces the entire backend with a custom regex engine while keeping the public API (ExecuteRegExp, TGocciaRegExpMatchResult) unchanged. New units: - Goccia.RegExp.Compiler: recursive-descent parser over ES2026 regex grammar, single-pass bytecode emitter with pre-scanned named groups for forward \k<name> resolution, inline modifier group scoping, and duplicate named group validation via disjunction path tracking. - Goccia.RegExp.VM: iterative dispatch loop with heap-allocated backtrack stack and always-on failure memoization (64K-entry hash table). No native call recursion. 10M step limit throws Error instead of crashing. Key design decisions: - Modifier state (ignoreCase, multiline, dotAll) is encoded per-instruction at compile time, not read from global flags at runtime. This gives correct scoping for (?ims-ims:...) modifier groups. - Duplicate named backreferences emit a SPLIT chain with strict-mode backrefs (fail if group uncaptured) + terminal FAIL, so only the participating group's captured text is matched. - Reuses TextSemantics.pas UTF-8 functions (TryReadUTF8CodePoint, AdvanceUTF8StringIndex, CodePointToUTF8) rather than reimplementing. - Removes FPC regexpr package from the cross-compilation toolchain. - Removes staging/sm/RegExp/test-trailing.js from KNOWN_ENGINE_CRASHES. Closes #515 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
📝 WalkthroughWalkthroughThis PR replaces the TRegExpr-based regex engine with a custom regex compiler and bytecode virtual machine (VM). It introduces a new regex-to-bytecode compiler supporting ES2025 modifier groups, Unicode property escapes, named captures, and backreferences, paired with a VM executor using backtracking, memoization, and step limiting. The legacy ChangesRegex Engine Replacement
Sequence DiagramsequenceDiagram
participant App as Application
participant Engine as RegExp Engine
participant Compiler as Regex Compiler
participant VM as Regex VM
participant Input as Input String
App->>Engine: ExecuteRegExp(pattern, flags, input)
Engine->>Compiler: CompileRegExp(pattern, flags)
Compiler->>Compiler: Parse pattern tree<br/>(atoms, groups, quantifiers)
Compiler->>Compiler: Emit bytecode<br/>(opcodes + char-class table)
Compiler-->>Engine: Return TRegExpProgram
Engine->>VM: ExecuteRegExpVM(program, input, startIndex)
VM->>Input: Read UTF-8 at current position
VM->>VM: Execute opcodes<br/>(match, split, jump, assert)
VM->>VM: Memoize (PC, inputPos)<br/>to avoid re-exploration
VM->>VM: Backtrack on failure<br/>(pop stack, retry alternatives)
alt Match found
VM-->>Engine: CaptureSlots[], Matched=true
else Backtracking exceeded step limit
VM-->>Engine: ERegExpRuntimeError
else No match
VM-->>Engine: Matched=false
end
Engine->>Engine: Map VM slots to Groups[]<br/>Set NextIndex
Engine-->>App: TGocciaRegExpMatchResult
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes This PR introduces a complete regex engine replacement with dense, multi-component logic: a 1533-line recursive-descent compiler emitting complex bytecode, a 684-line VM interpreter with memoization and backtracking state management, and coordinated changes across runtime, builtins, and CI/CD. The compiler's opcode design, quantifier/group handling, escape/property parsing, and the VM's backtrack stack and memoization strategy all require careful verification against edge cases and the test suite. Possibly related issues
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
test262 Conformance
Areas closest to 100%
Per-test deltas (+138 / -0)Newly passing (138):
Steady-state failures are non-blocking; regressions vs the cached main baseline (lower total pass count, or any PASS → non-PASS transition) fail the conformance gate. Measured on ubuntu-latest x64, bytecode mode. Areas grouped by the first two test262 path components; minimum 25 attempted tests, areas already at 100% excluded. Δ vs main compares against the most recent cached |
Suite TimingTest Runner (interpreted: 9,021 passed; bytecode: 9,021 passed)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Test runner worker shutdown frees thread-local heaps in bulk; that shutdown reclamation is not counted as GC collections or collected objects.
Benchmarks (interpreted: 407; bytecode: 407)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Benchmark runner performs explicit between-file collections, so collection and collected-object counts can be much higher than the test runner.
Measured on ubuntu-latest x64. |
Benchmark Results407 benchmarks Interpreted: 🟢 27 improved · 🔴 205 regressed · 175 unchanged · avg -4.5% arraybuffer.js — Interp: 🟢 2, 🔴 4, 8 unch. · avg -0.2% · Bytecode: 🟢 14 · avg +25.6%
arrays.js — Interp: 🔴 15, 4 unch. · avg -3.2% · Bytecode: 🟢 19 · avg +32.8%
async-await.js — Interp: 6 unch. · avg -1.0% · Bytecode: 🟢 6 · avg +24.9%
async-generators.js — Interp: 2 unch. · avg -5.8% · Bytecode: 🟢 2 · avg +28.2%
base64.js — Interp: 🔴 8, 2 unch. · avg -48.4% · Bytecode: 🟢 3, 🔴 7 · avg -35.2%
classes.js — Interp: 🔴 7, 24 unch. · avg -1.2% · Bytecode: 🟢 31 · avg +27.1%
closures.js — Interp: 🔴 6, 5 unch. · avg -3.2% · Bytecode: 🟢 11 · avg +29.5%
collections.js — Interp: 🔴 6, 6 unch. · avg -2.0% · Bytecode: 🟢 12 · avg +20.6%
csv.js — Interp: 🔴 13 · avg -8.7% · Bytecode: 🟢 13 · avg +29.7%
destructuring.js — Interp: 🔴 5, 17 unch. · avg -1.5% · Bytecode: 🟢 22 · avg +26.7%
fibonacci.js — Interp: 🔴 6, 2 unch. · avg -4.2% · Bytecode: 🟢 8 · avg +30.3%
float16array.js — Interp: 🟢 7, 🔴 11, 14 unch. · avg +1.1% · Bytecode: 🟢 32 · avg +41.1%
for-of.js — Interp: 7 unch. · avg -1.5% · Bytecode: 🟢 7 · avg +29.7%
generators.js — Interp: 🔴 1, 3 unch. · avg -1.2% · Bytecode: 🟢 4 · avg +18.3%
iterators.js — Interp: 🔴 22, 20 unch. · avg -3.2% · Bytecode: 🟢 42 · avg +33.3%
json.js — Interp: 🔴 20 · avg -10.4% · Bytecode: 🟢 20 · avg +21.3%
jsx.jsx — Interp: 🟢 2, 🔴 2, 17 unch. · avg +0.6% · Bytecode: 🟢 21 · avg +26.8%
modules.js — Interp: 🔴 8, 1 unch. · avg -6.2% · Bytecode: 🟢 9 · avg +34.5%
numbers.js — Interp: 🔴 8, 3 unch. · avg -4.4% · Bytecode: 🟢 11 · avg +24.4%
objects.js — Interp: 🔴 6, 1 unch. · avg -4.8% · Bytecode: 🟢 7 · avg +25.9%
promises.js — Interp: 🔴 6, 6 unch. · avg -3.3% · Bytecode: 🟢 12 · avg +28.6%
regexp.js — Interp: 🟢 2, 🔴 9 · avg -43.4% · Bytecode: 🟢 4, 🔴 7 · avg -24.9%
strings.js — Interp: 🔴 9, 10 unch. · avg -2.9% · Bytecode: 🟢 19 · avg +24.5%
tsv.js — Interp: 🔴 9 · avg -12.9% · Bytecode: 🟢 9 · avg +28.5%
typed-arrays.js — Interp: 🟢 1, 🔴 17, 4 unch. · avg -14.0% · Bytecode: 🟢 17, 🔴 3, 2 unch. · avg +20.6%
uint8array-encoding.js — Interp: 🟢 3, 🔴 7, 8 unch. · avg -12.6% · Bytecode: 🟢 9, 🔴 8, 1 unch. · avg +8.3%
weak-collections.js — Interp: 🟢 10, 5 unch. · avg +46.1% · Bytecode: 🟢 15 · avg +59.5%
Deterministic profile diffDeterministic profile diff: no significant changes. Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context. |
The memoization hash used Knuth multiplicative constants that overflow
Cardinal, requiring {$Q-}{$R-} suppression. Replace with a shift-xor
hash that stays within Cardinal range — no compiler flags needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs found via test262: 1. RX_ASSERT_START/END read raw bytes via Ord(AInput[Pos]) instead of decoding full UTF-8 code points. Multi-byte line terminators (U+2028, U+2029) were never recognized, and accessing continuation bytes triggered range check errors. Fix: use GetCodePointBefore and ReadInputCodePoint for proper UTF-8 decoding. 2. ReadInputCodePoint only decoded UTF-8 when the unicode flag was set. Without /u, multi-byte BMP characters (U+0085, U+2028, etc.) were read as single bytes, causing . to match one byte instead of one code point. Fix: always decode UTF-8 regardless of flag — the unicode flag only affects supplementary plane advancement in the scanner loop. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Convert RegExp flag properties (source, flags, global, ignoreCase, multiline, dotAll, unicode, sticky, unicodeSets, hasIndices) from per-instance data properties to spec-correct accessor getters on RegExp.prototype (ES2026 §22.2.6). Accessing them on the prototype itself returns undefined (or '(?:)' for source, '' for flags) per spec. - Introduce ERegExpRuntimeError exception class so the regex VM's backtrack limit error is distinguishable from GocciaScript VM errors. Runtime.pas catches it and re-throws as a proper JS Error via ThrowError. - Add JS test coverage: dotAll with multi-byte BMP characters, dot rejecting multi-byte line terminators, multiline anchors with multi-byte context, catastrophic backtracking throws Error, and large input regression test for #515. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The scanner loop and GetCodePointBefore only decoded UTF-8 when the unicode flag was set. Without /u, multi-byte BMP characters (U+1680, U+2000-200A, U+2028, U+2029, etc.) were read as individual bytes, causing \S, \s, \b, ^, $ and . to misclassify them. This broke test262's character-class-escape-non-whitespace test on CI. Fix: always decode UTF-8 via TryReadUTF8CodePointAllowSurrogates (which also handles lone surrogates correctly). Remove the now-unused AUnicode parameter from ReadInputCodePoint and GetCodePointBefore. The scanner loop also always advances by code point now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lookbehind handler called RunVM starting from PC=0, which re-executed the entire regex pattern (including .*, quantifiers, etc.) instead of just the lookbehind body. This caused exponential blowup even on tiny inputs like "xabcd".match(/.*(?<=(..|...|....))(.*)/), hanging the test262 CI runner. Fix: RunVM now accepts AStartPC and AEndPos parameters. Lookahead and lookbehind pass PC+1 as the start (skipping the assertion instruction to execute only the body up to RX_MATCH). Lookbehind uses AEndPos to check where the sub-match ended rather than checking capture slots. Also bounds the lookbehind scan distance to MAX_LOOKBEHIND_DISTANCE (256 positions) to prevent O(n) RunVM calls on large inputs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The negation flag for (?!...) and (?<!...) was encoded as bit 7 of the instruction word, which overlaps with the opcode byte — corrupting the opcode from 12 (RX_LOOKAHEAD) to 140 (invalid). The VM's case-else branch silently skipped the instruction, making all negative assertions pass unconditionally. Fix: encode the negation flag as bit 23 of the Bx field (LOOK_NEGATED_FLAG = $800000), matching the BACKREF_STRICT_FLAG convention. InsertSplitAt preserves the flag when adjusting PC targets. Also adds JS tests for positive/negative lookahead, positive/negative lookbehind, lookbehind with alternation and quantifiers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Syntax validation:
- Reject dangling quantifiers (a**, ??, +) — nothing to repeat
- Reject invalid char class ranges ([z-a], [b-ac-e]) where start > end
- Reject {min,max} where min > max ({2,1})
- Reject trailing backslash (\)
- Reject \c without letter in unicode mode
- Reject invalid identity escapes in unicode mode
- Reject quantified assertions ((?=.)*) in unicode mode
Quantifier body relocation:
- EmitBodyAt adjusts absolute PC targets (SPLIT, JUMP, LOOKAHEAD,
LOOKBEHIND) by the offset between original and destination positions.
Without this, alternation inside * quantifiers had stale SPLIT targets,
causing /(aa|aabaac|ba|b|c)*/ to return ["",null] instead of
["aaba","ba"].
Zero-width loop detection:
- RX_SPLIT records (PC, InputPos) in the memoization table on each
visit. When revisited at the same position (zero-width iteration),
takes the exit branch instead of looping. Prevents infinite loops on
patterns like /(a*)b\1+/ where the backreference matches empty.
Also makes catastrophic patterns like /^(a+)+$/ terminate with null
instead of hitting the backtrack limit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Encode ignoreCase flag per-backref instruction (BACKREF_ICASE_FLAG =
$400000) so (?i:\1) case-folds the backreference comparison while
\1 outside a modifier group does not. The flag is set at compile time
from FModifier.IgnoreCase, giving correct scoping to modifier groups.
- Cap ParseDecimalEscape at 1M to prevent integer overflow on huge
quantifiers like {2147483648} — avoids range check error on the
staging/sm/RegExp/regress-yarr-regexp.js test262 test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover every regression fix with explicit tests:
- Dangling quantifiers (a**, ??, +, *) throw SyntaxError
- Invalid char class ranges ([z-a], [b-ac-e]) throw SyntaxError
- Quantifier min > max ({2,1}) throws SyntaxError
- Trailing backslash throws SyntaxError
- Huge quantifier ({2147483648}) does not crash
- Greedy * with alternation picks correct match path
- Greedy * with char class quantifier backtracks correctly
- Zero-length backref with + quantifier does not hang
- (?i:\1) case-folds backreference, (?-i:\1) disables it
- \c without letter in /u throws SyntaxError
- Quantified assertions in /u throw SyntaxError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Backref backtracking:
- Revert the zero-width memo-on-SPLIT-entry that prevented legitimate
backtracking through (a+) in /^(a+)\1*,\1+$/. Instead detect
zero-width loops via JUMP: when jumping back to a SPLIT and the top
backtrack entry has the same exit target and input position, the
iteration consumed nothing — take the exit directly.
Char class \c in unicode mode:
- CompileEscape (character class variant) now handles \c with the same
validation as CompileEscapeAtom: \c without a-zA-Z throws SyntaxError
in unicode mode. Also reject invalid identity escapes inside character
classes in unicode mode.
Dynamic step limit:
- Step limit is now max(10M, inputLength * 100) instead of a fixed 10M.
This prevents false positives on legitimate large inputs (e.g. test262
property-escapes/generated/ASCII.js tests \P{ASCII} against 1M+ chars)
while still catching catastrophic backtracking on small inputs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every fix should ship with its test. These were missing from the
previous commit:
- Backref backtracking through (a+) in /^(a+)\1*,\1+$/
- String.replace with backreference capture
- \c inside character class in unicode mode throws SyntaxError
- \p{ASCII} on large input does not hit step limit
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The property-escapes/generated/ASCII.js test262 test builds a 1.1M
code point non-ASCII string and tests \P{ASCII}+ against it. The
greedy + pushes one backtrack entry per code point, exceeding the 1M
backtrack stack cap.
Fix: raise DEFAULT_BACKTRACK_CAP from 1M to 10M. Also inline
ReadInputCodePoint with a fast path for ASCII bytes (< 0x80) to avoid
the function call overhead of TryReadUTF8CodePointAllowSurrogates on
every character of large inputs. CharClassContainsLinear is also
marked inline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dead code removal:
- Remove FlagIgnoreCase/FlagMultiline/FlagDotAll from TRegExpProgram
(modifier state is encoded per-instruction, these fields are unread)
- Remove EncodeInstr (duplicate of EncodeOpBx), AddCharClassFromDynamic
(duplicate of AddCharClass), CharClassContains (unused binary search),
CaseFold (unused method)
SIGSEGV fix:
- IsRegExpValue now checks HasOwnProperty('source') and
HasOwnProperty('flags') instead of Symbol.toStringTag. An object
created via Object.create(RegExp.prototype) inherits the tag but has
no internal regex state — the prototype getters would recurse
infinitely trying to read source/flags, causing stack overflow.
Memo correctness:
- Backref match failure now restores InputPos before calling MemoAdd,
so the memo records the correct (PC, pos) pair instead of the
partially-advanced position.
- Invalid opcodes now raise ERegExpRuntimeError instead of silently
skipping via Inc(PC).
Lazy memo allocation:
- MemoInit no longer allocates the 65K-entry table. MemoAdd allocates
on first use, MemoContains returns false if unallocated. This avoids
~1MB allocation per lookbehind sub-call (up to 256 calls per
assertion).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Simplify pass from three-agent review:
Dead code:
- Remove FlagUnicode from TRegExpProgram (unread since per-instruction
encoding handles unicode behavior)
- Remove FFlags field from TRegExpCompiler (parsed into FModifier/FUnicode
in constructor, never read after)
- Remove AInCharClass parameter from CompileEscape (never referenced)
- Remove dead ClassIdx variable from EmitUnicodePropertyClass
- Remove duplicate AddCharClassFromDynamic (identical to AddCharClass)
Shared constants:
- Move BACKREF_STRICT_FLAG, BACKREF_ICASE_FLAG, LOOK_NEGATED_FLAG and
their mask companions to the Compiler interface section so the VM
uses named constants instead of raw hex at 8 decode sites.
Efficiency:
- PushBacktrack reuses existing Slots array when length matches
(avoids heap allocation per push on hot path)
- FillChar($FF) replaces per-element slot init loop
Correctness:
- \P{...} inside character classes now throws SyntaxError instead of
silently treating it as \p{...} (misleading comment removed)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
source/units/Goccia.RegExp.VM.pas (1)
408-411: 💤 Low valueMinor: Redundant assignment to
RefEnd.Line 410 re-reads
RefEndfrom slots, but it was already assigned at line 396. This is harmless but unnecessary.♻️ Remove redundant assignment
RefPos := RefStart; LookMatched := True; - RefEnd := ASlots[BackrefGroup * 2 + 1]; I := InputPos;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/units/Goccia.RegExp.VM.pas` around lines 408 - 411, Remove the redundant re-assignment of RefEnd in the block that sets RefPos, LookMatched and I: eliminate the line "RefEnd := ASlots[BackrefGroup * 2 + 1];" since RefEnd is already set earlier (same slot index); keep the other assignments (RefPos := RefStart, LookMatched := True, I := InputPos) unchanged and ensure no other logic relies on a second read of ASlots in this scope.source/units/Goccia.RegExp.Engine.pas (1)
143-146: 💤 Low valueMinor: Redundant
EMPTY_REGEXcheck.Lines 143-145 check if
PatternToCompile = EMPTY_REGEXand set it to'', but line 132 already returns early whenAPattern = EMPTY_REGEX. This code path is unreachable.♻️ Simplify by removing unreachable check
- PatternToCompile := APattern; - if PatternToCompile = EMPTY_REGEX then - PatternToCompile := ''; - Prog := CompileRegExp(PatternToCompile, AFlags); + Prog := CompileRegExp(APattern, AFlags);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/units/Goccia.RegExp.Engine.pas` around lines 143 - 146, The check that sets PatternToCompile to '' when PatternToCompile = EMPTY_REGEX is unreachable because the function already returns when APattern = EMPTY_REGEX; remove the redundant if block (the comparison against EMPTY_REGEX and the assignment to '') so the code simply sets PatternToCompile := APattern and then calls CompileRegExp(PatternToCompile, AFlags) without the extra conditional; reference symbols: PatternToCompile, EMPTY_REGEX, APattern, CompileRegExp.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/decision-log.md`:
- Line 20: The decision-log entry introduces a new area tag "engine" which isn't
in the documented valid tag list; either add "engine" to the contributor guide's
allowed tags list or change the entry's tag from `engine` to one of the existing
tags (e.g., `bytecode`, `interpreter`, or `runtime`) so the tag set remains
consistent; update the contributor guide's tag enumeration accordingly and
ensure the decision-log entry's tag matches that list (look for the tag list in
the contributor guide and the new tag in the decision-log entry line containing
`engine`).
In `@source/units/Goccia.RegExp.Compiler.pas`:
- Around line 519-535: ParseGroupName currently accepts any text until '>' but
must enforce ECMAScript RegExpIdentifierName rules: the first character must be
RegExpIdentifierStart (UnicodeIDStart or '$' or '_') and subsequent characters
must be RegExpIdentifierPart (UnicodeIDContinue or '$' or '_'), and empty names
are invalid; update TRegExpCompiler.ParseGroupName to validate the captured name
after reading (or validate per-char while reading) and raise a SyntaxError for
empty or invalid names (reject digit-first, hyphens, etc.). Also ensure the same
validation is invoked for backreference parsing that consumes names (e.g. the
method handling "\k<...>") so both named-capture and \k<> use the same
identifier check and produce SyntaxError on invalid input.
- Around line 356-453: GetUnicodePropertyRanges currently uses incomplete
hard-coded ranges; replace this brittle table with a dynamic, standards-correct
implementation inside GetUnicodePropertyRanges that iterates Unicode code points
(0..$10FFFF), uses the platform Unicode API (e.g.
System.Character.TCharacter.GetUnicodeCategory or a UnicodeData.txt lookup) to
test each code point against the requested property name (APropertyName like
'L','Lu','Ll','Nd','White_Space', etc.), accumulate contiguous matching code
points into ranges via AddRange(ARanges, ARangeCount, ...), and if the caller
requested the negated form (regex \P{...} — detect an uppercase leading 'P' or a
provided negation flag per caller convention) invert the resulting ranges to
return the complement; ensure Nd and other categories are derived from Unicode
general categories rather than ASCII-only lists and preserve performance by
batching contiguous code points into ranges.
- Around line 601-620: ParseDecimalEscape currently clamps parsed quantifiers to
MAX_QUANTIFIER but CompileQuantifier still unrolls the atom body per repetition
causing O(bound) code emission; change CompileQuantifier to detect large/clamped
counts (the value produced by ParseDecimalEscape / MAX_QUANTIFIER) and avoid
per-iteration unrolling by emitting a compact loop construct instead (e.g., emit
a single copy of the atom body plus a counter-driven jump/loop or a new
REPEAT_N/LOOP_N bytecode) rather than emitting SPLIT and the full atom for each
repetition; update code paths that generate SPLIT/unrolled bodies (the loop
around lines referenced in CompileQuantifier) to use the counter-loop emission
so large quantified counts no longer expand into many emitted instructions.
In `@tests/built-ins/RegExp/unicode.js`:
- Around line 218-230: The test cases use string literals containing raw
newlines inside calls to /^abc/m.test(...) and /abc$/m.test(...), which causes a
syntax error; replace those raw multi-line string literals with either escaped
newline sequences (\"\\n\") or template literals (backticks) so the strings are
valid (e.g., "xyz\\nabc" or `xyz\nabc`) for all four expect(...) calls
referencing /^abc/m.test and /abc$/m.test, preserving the same content and
assertions.
---
Nitpick comments:
In `@source/units/Goccia.RegExp.Engine.pas`:
- Around line 143-146: The check that sets PatternToCompile to '' when
PatternToCompile = EMPTY_REGEX is unreachable because the function already
returns when APattern = EMPTY_REGEX; remove the redundant if block (the
comparison against EMPTY_REGEX and the assignment to '') so the code simply sets
PatternToCompile := APattern and then calls CompileRegExp(PatternToCompile,
AFlags) without the extra conditional; reference symbols: PatternToCompile,
EMPTY_REGEX, APattern, CompileRegExp.
In `@source/units/Goccia.RegExp.VM.pas`:
- Around line 408-411: Remove the redundant re-assignment of RefEnd in the block
that sets RefPos, LookMatched and I: eliminate the line "RefEnd :=
ASlots[BackrefGroup * 2 + 1];" since RefEnd is already set earlier (same slot
index); keep the other assignments (RefPos := RefStart, LookMatched := True, I
:= InputPos) unchanged and ensure no other logic relies on a second read of
ASlots in this scope.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8f20a955-72f5-4f75-9f94-a0ef198213fc
📒 Files selected for processing (16)
.github/workflows/ci.yml.github/workflows/toolchain.ymldocs/build-system.mddocs/built-ins.mddocs/decision-log.mdscripts/run_test262_suite.tssource/units/Goccia.Builtins.GlobalRegExp.passource/units/Goccia.RegExp.Compiler.passource/units/Goccia.RegExp.Engine.passource/units/Goccia.RegExp.Runtime.passource/units/Goccia.RegExp.Unicode.passource/units/Goccia.RegExp.VM.pastests/built-ins/RegExp/constructor.jstests/built-ins/RegExp/modifiers.jstests/built-ins/RegExp/prototype/exec.jstests/built-ins/RegExp/unicode.js
💤 Files with no reviewable changes (2)
- scripts/run_test262_suite.ts
- source/units/Goccia.RegExp.Unicode.pas
Summary
Goccia.RegExp.Compiler.pas(recursive-descent parser + bytecode emitter, ~1,400 lines) andGoccia.RegExp.VM.pas(iterative executor with heap backtrack stack + always-on memoization, ~650 lines). Public API (ExecuteRegExp,TGocciaRegExpMatchResult) unchanged;Goccia.RegExp.Runtime.pasandGoccia.Builtins.GlobalRegExp.pasunmodified.regexprpackage from the cross-compilation toolchain, simplifying CI.(?ims-ims:...)) encoded per-instruction at compile time for correct scoping; duplicate named backreferences emit strict-mode SPLIT chains; memoization cache (64K entries) prunes exponential backtracking.Closes #515
Testing