Implement v flag (UnicodeSets) for RegExp#621
Conversation
Add full support for the ES2024 RegExp v flag (unicodeSets), including
set intersection (&&), subtraction (--), nested character classes,
\q{} string disjunction, and properties-of-strings (\p{Basic_Emoji},
\p{Emoji_Keycap_Sequence}, \p{RGI_Emoji_Flag_Sequence}, etc.).
Closes #608
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR implements the ECMAScript ChangesUnicode Sets v Flag Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly Related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Suite TimingTest Runner (interpreted: 9,095 passed; bytecode: 9,095 passed)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Test runner worker shutdown frees thread-local heaps in bulk; that shutdown reclamation is not counted as GC collections or collected objects.
Benchmarks (interpreted: 407; bytecode: 407)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Benchmark runner performs explicit between-file collections, so collection and collected-object counts can be much higher than the test runner.
Measured on ubuntu-latest x64. |
Benchmark Results407 benchmarks Interpreted: 🟢 240 improved · 🔴 39 regressed · 128 unchanged · avg +5.3% arraybuffer.js — Interp: 🟢 5, 🔴 3, 6 unch. · avg +2.6% · Bytecode: 🟢 7, 7 unch. · avg +2.3%
arrays.js — Interp: 🟢 18, 1 unch. · avg +7.1% · Bytecode: 🟢 9, 10 unch. · avg +2.1%
async-await.js — Interp: 🟢 4, 2 unch. · avg +5.8% · Bytecode: 🟢 2, 4 unch. · avg +0.1%
async-generators.js — Interp: 🟢 1, 1 unch. · avg +8.5% · Bytecode: 2 unch. · avg +1.0%
base64.js — Interp: 🟢 9, 1 unch. · avg +4.5% · Bytecode: 10 unch. · avg +0.8%
classes.js — Interp: 🟢 27, 4 unch. · avg +5.5% · Bytecode: 🟢 11, 20 unch. · avg +2.7%
closures.js — Interp: 🟢 8, 🔴 1, 2 unch. · avg +4.2% · Bytecode: 🟢 6, 5 unch. · avg +1.2%
collections.js — Interp: 🟢 12 · avg +9.3% · Bytecode: 🟢 5, 7 unch. · avg +1.6%
csv.js — Interp: 🟢 1, 🔴 3, 9 unch. · avg -1.4% · Bytecode: 13 unch. · avg +0.9%
destructuring.js — Interp: 🟢 16, 6 unch. · avg +5.6% · Bytecode: 🟢 10, 12 unch. · avg +2.3%
fibonacci.js — Interp: 🟢 8 · avg +6.9% · Bytecode: 🟢 5, 3 unch. · avg +2.8%
float16array.js — Interp: 🟢 19, 🔴 2, 11 unch. · avg +3.3% · Bytecode: 🟢 5, 🔴 2, 25 unch. · avg +0.1%
for-of.js — Interp: 🟢 4, 3 unch. · avg +3.5% · Bytecode: 🟢 4, 3 unch. · avg +3.4%
generators.js — Interp: 🟢 4 · avg +6.4% · Bytecode: 4 unch. · avg +0.9%
iterators.js — Interp: 🟢 34, 8 unch. · avg +6.2% · Bytecode: 🟢 15, 🔴 1, 26 unch. · avg +2.2%
json.js — Interp: 🔴 4, 16 unch. · avg -1.0% · Bytecode: 🟢 12, 8 unch. · avg +4.3%
jsx.jsx — Interp: 🟢 19, 2 unch. · avg +5.9% · Bytecode: 🟢 14, 7 unch. · avg +3.6%
modules.js — Interp: 🔴 1, 8 unch. · avg -0.8% · Bytecode: 🟢 6, 3 unch. · avg +3.6%
numbers.js — Interp: 🔴 2, 9 unch. · avg -1.5% · Bytecode: 🟢 2, 🔴 1, 8 unch. · avg +1.2%
objects.js — Interp: 🟢 6, 1 unch. · avg +3.8% · Bytecode: 🟢 7 · avg +5.1%
promises.js — Interp: 🟢 5, 7 unch. · avg +3.8% · Bytecode: 🟢 4, 🔴 2, 6 unch. · avg +1.4%
regexp.js — Interp: 🟢 6, 5 unch. · avg +2.5% · Bytecode: 🟢 3, 8 unch. · avg +1.3%
strings.js — Interp: 🟢 2, 🔴 6, 11 unch. · avg -1.8% · Bytecode: 🟢 12, 7 unch. · avg +5.6%
tsv.js — Interp: 🟢 2, 🔴 2, 5 unch. · avg +0.3% · Bytecode: 🟢 1, 8 unch. · avg +1.1%
typed-arrays.js — Interp: 🟢 14, 🔴 4, 4 unch. · avg +23.0% · Bytecode: 🟢 7, 🔴 5, 10 unch. · avg +1.3%
uint8array-encoding.js — Interp: 🟢 12, 🔴 2, 4 unch. · avg +30.5% · Bytecode: 🟢 11, 🔴 2, 5 unch. · avg +10.5%
weak-collections.js — Interp: 🟢 4, 🔴 9, 2 unch. · avg -14.8% · Bytecode: 🟢 7, 8 unch. · avg -0.4%
Deterministic profile diffDeterministic profile diff: no significant changes. Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context. |
test262 Conformance
Areas closest to 100%
Per-test deltas (+121 / -0)Newly passing (121):
Steady-state failures are non-blocking; regressions vs the cached main baseline (lower total pass count, or any PASS → non-PASS transition) fail the conformance gate. Measured on ubuntu-latest x64, bytecode mode. Areas grouped by the first two test262 path components; minimum 25 attempted tests, areas already at 100% excluded. Δ vs main compares against the most recent cached |
- Compute complement for nested negated classes ([^...] inside a class)
so patterns like [[^0-9]&&[a-z]] produce correct results
- Support chained same-kind operators ([a-z&&[a-m]&&[a-f]]) with a
syntax error when mixing && and -- at the same nesting level
- Remove dead ParseClassOperand method
- Guard \q{} behind FUnicodeSets so it only works in v-mode
- Add 11 new tests: chained ops, nested negation, \q{} in set
operations, bytecode mode smoke tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract FilterClassStrings + ApplyClassSetOp to eliminate near-identical
IntersectClassContents/SubtractClassContents bodies
- Use ComplementContents for \P{} complement instead of inline logic,
removing ~8KB of unused stack arrays from ParseClassEscape
- Delegate EmitClassContents range emission to EmitCharClassRanges
instead of duplicating the case-folding expansion
- Remove ~35 duplicate Basic_Emoji ranges already covered by earlier entries
- Replace O(n^2) bubble sort in NormalizeRanges with insertion sort
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/units/Goccia.RegExp.Compiler.pas`:
- Around line 1298-1306: The loop that emits raw RX_CHAR ops for multi-codepoint
strings bypasses case-folding and should use the existing EmitCharMatch helper;
in the block inside for I := AContents.StringCount - 1 downto 0 do where you set
SplitHoles/SplitCount and call Emit(EncodeOpBx(RX_SPLIT, 0)), replace the inner
for J := 0 to High(AContents.Strings[I].CodePoints) do Emit(EncodeOpBx(RX_CHAR,
Integer(...))) with calls to EmitCharMatch(AContents.Strings[I].CodePoints[J])
(or the appropriate signature of EmitCharMatch) so FModifier.IgnoreCase handling
is applied consistently to these string branches instead of emitting raw RX_CHAR
opcodes.
- Around line 1202-1239: When parsing a potential range endpoint the code
currently allows a prior call to ParseClassEscape to append ranges/strings to
AContents and then reuses those as the range endpoint; instead, detect this
situation in v-mode and throw a SyntaxError: after calling
ParseClassEscape(AContents) (and before using
AContents.Ranges[AContents.RangeCount - 1] as Lo) check if the parser is in
v-mode and if AContents.RangeCount > SavePos or AContents.StringCount > SavePos,
and if so raise an EConvertError/SyntaxError (same message as other regex syntax
errors) instead of treating the escape as a numeric endpoint; apply the same
check for the EscContents code path (after ParseClassEscape(EscContents)) so
character class escapes cannot be used as either Lo or Hi when building a range
with AddClassRange or when using ReadCodePoint.
- Around line 1177-1198: The nested-class parser currently calls
ParseClassUnion() but does not let it consume set operators, causing pending
'&&' or '--' to break out and leave Match(']') to fail; fix by updating
ParseClassUnion (and any helper like ParseClassOperand/ParseClassRange if
needed) to accept an allowSetOperators boolean (or equivalent flag) and
implement handling of '&&' and '--' when that flag is true; then call
ParseClassUnion(EscContents, True) (or pass True) from the nested class branch
inside CompileUnicodeSetsClass/this routine so nested character classes like
[[a-z&&[a-f]]] are parsed with set operators, while preserving current top-level
behavior where necessary, and ensure
MergeClassContents/ComplementContents/NormalizeRanges continue to operate on the
fully-parsed EscContents.
- Around line 761-1005: The GetStringPropertySequences implementation
(TRegExpCompiler.GetStringPropertySequences) currently hardcodes only a small
subset of sequences for properties like 'RGI_Emoji_Flag_Sequence' and
'RGI_Emoji_ZWJ_Sequence', which breaks full Unicode v-flag semantics; replace
the hardcoded lists with a data-driven solution that loads the complete sequence
sets from the authoritative Unicode data (e.g., emoji-sequences.txt / emoji-test
or the UTS `#51` derived list) at build or init time and populates AContents via
AddClassString/AddClassRange (or a helper like LoadEmojiSequences) so all ~334
flag and ~1000+ ZWJ sequences are included; keep the existing property name
checks and error path (EConvertError) but ensure the loading occurs once and is
referenced by GetStringPropertySequences for the named properties.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c6f6d353-8ada-42ad-b0f3-f4f15b223e9f
📒 Files selected for processing (4)
source/units/Goccia.Error.Suggestions.passource/units/Goccia.Lexer.passource/units/Goccia.RegExp.Compiler.pastests/built-ins/RegExp/unicode-sets.js
- Allow set operators (&&, --) inside nested character classes by
extracting ApplyNestedSetOps and calling it after ParseClassUnion
in both nested-class and top-level paths
- Reject character class escapes (\d, \w, \s, \p{}, \q{}) as range
endpoints in v-mode with SyntaxError per ES2024 spec
- Use EmitCharMatch instead of raw RX_CHAR for string sequence
branches so \q{} and string properties respect /i case-folding
- Add 7 new tests covering nested set ops, class-escape rejection,
and case-insensitive string matching
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
vflag (unicodeSets) with full set operations and properties-of-strings support.vanddflags in regex literals (/pattern/v); update error suggestions.FUnicodeSetsfield, range set operation helpers (normalize, intersect, subtract), and a dedicated v-mode character class parser supporting&&(intersection),--(subtraction), nested[...]classes,\q{...}string disjunction,\P{...}(complement) inside classes, and v-mode syntax restrictions on unescaped special characters.\p{Basic_Emoji},\p{Emoji_Keycap_Sequence},\p{RGI_Emoji_Flag_Sequence},\p{RGI_Emoji_Modifier_Sequence},\p{RGI_Emoji_Tag_Sequence},\p{RGI_Emoji_ZWJ_Sequence}— multi-codepoint sequences compiled as split/char alternations using existing bytecode opcodes.RX_SPLIT/RX_CHAR/RX_JUMPsequences, requiring no new VM opcodes.Testing