Implement Unicode RegExp simple case folding #640
Add generated Unicode CaseFolding.txt simple/common mappings to the RegExp Unicode resource and use them for Unicode-aware ignoreCase matching. This expands compiled character sets for /iu and /iv, canonicalizes Unicode backreference comparisons, and keeps non-Unicode /i behavior separate. Covers the ECMAScript simple-folding distinction that maps ſ to s, K to k, and ẞ to ß without applying full expansions such as ß to ss.
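The distinction described above can be checked from script code. The following is a minimal sketch, not taken from the PR's test file: `testFold` is a hypothetical helper, and the expectations follow the ECMAScript rules for `/iu` (simple case folding) versus legacy `/i` (uppercase-based canonicalization).

```javascript
// Hypothetical helper: match `input` against `source` anchored at both ends,
// once with Unicode-aware ignoreCase (/iu) and once with legacy ignoreCase (/i).
function testFold(source, input) {
  return {
    unicode: new RegExp(`^${source}$`, "iu").test(input),
    legacy: new RegExp(`^${source}$`, "i").test(input),
  };
}

// U+017F LATIN SMALL LETTER LONG S simple-folds to "s".
const longS = testFold("\\u017F", "s");

// U+212A KELVIN SIGN simple-folds to "k".
const kelvin = testFold("\\u212A", "k");

// U+1E9E LATIN CAPITAL LETTER SHARP S simple-folds to U+00DF (ß).
const capitalSharpS = testFold("\\u1E9E", "\u00DF");

// Full folding would expand ß to "ss"; simple folding never changes length.
const sharpSExpansion = testFold("\\u00DF", "ss");

// Backreferences compare canonicalized code points under /iu.
const backref = new RegExp("^(\\u017F)\\1$", "iu").test("\u017Fs");

console.log({ longS, kelvin, capitalSharpS, sharpSExpansion, backref });
```

In each of the first three cases the `unicode` result is expected to be `true` and the `legacy` result `false`, which is the separation between `/iu` and plain `/i` that the description refers to.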
📝 Walkthrough
Implements Unicode simple case folding for RegExp case-insensitive matching: adds UCD parsing and embedded pair caches, exposes folding/canonicalization APIs, refactors compiler char-class emission to apply folding, updates VM backreference comparisons to use Unicode-aware canonicalization, adds tests, and updates docs.
Changes
Unicode Simple Case Folding Implementation
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Benchmark Results
407 benchmarks
Interpreted: 🟢 55 improved · 🔴 69 regressed · 283 unchanged · avg +0.4%
arraybuffer.js — Interp: 🔴 6, 8 unch. · avg -0.5% · Bytecode: 🔴 12, 2 unch. · avg -7.5%
arrays.js — Interp: 🔴 1, 18 unch. · avg -0.5% · Bytecode: 🔴 18, 1 unch. · avg -9.8%
async-await.js — Interp: 6 unch. · avg -1.5% · Bytecode: 🔴 4, 2 unch. · avg -11.6%
async-generators.js — Interp: 2 unch. · avg -0.9% · Bytecode: 🔴 1, 1 unch. · avg -12.0%
base64.js — Interp: 10 unch. · avg -0.4% · Bytecode: 🟢 7, 🔴 3 · avg +1.9%
classes.js — Interp: 🟢 1, 🔴 3, 27 unch. · avg -0.4% · Bytecode: 🔴 20, 11 unch. · avg -7.2%
closures.js — Interp: 🔴 3, 8 unch. · avg -1.1% · Bytecode: 🔴 11 · avg -12.7%
collections.js — Interp: 12 unch. · avg -1.2% · Bytecode: 🔴 12 · avg -10.6%
csv.js — Interp: 🟢 4, 9 unch. · avg +1.1% · Bytecode: 🔴 12, 1 unch. · avg -9.3%
destructuring.js — Interp: 🟢 1, 🔴 5, 16 unch. · avg -1.3% · Bytecode: 🔴 21, 1 unch. · avg -10.3%
fibonacci.js — Interp: 🔴 3, 5 unch. · avg -1.5% · Bytecode: 🔴 8 · avg -11.6%
float16array.js — Interp: 🟢 1, 🔴 4, 27 unch. · avg -0.4% · Bytecode: 🟢 4, 🔴 27, 1 unch. · avg -5.9%
for-of.js — Interp: 🟢 3, 4 unch. · avg +1.1% · Bytecode: 🔴 7 · avg -11.4%
generators.js — Interp: 4 unch. · avg -2.1% · Bytecode: 🔴 4 · avg -8.0%
iterators.js — Interp: 🟢 24, 18 unch. · avg +3.2% · Bytecode: 🔴 40, 2 unch. · avg -8.3%
json.js — Interp: 🟢 1, 19 unch. · avg +1.1% · Bytecode: 🔴 20 · avg -9.9%
jsx.jsx — Interp: 🔴 18, 3 unch. · avg -5.3% · Bytecode: 🔴 19, 2 unch. · avg -8.9%
modules.js — Interp: 9 unch. · avg -0.2% · Bytecode: 🔴 9 · avg -14.2%
numbers.js — Interp: 11 unch. · avg -1.2% · Bytecode: 🔴 11 · avg -8.7%
objects.js — Interp: 7 unch. · avg -0.9% · Bytecode: 🔴 7 · avg -9.6%
promises.js — Interp: 🟢 2, 🔴 1, 9 unch. · avg -0.5% · Bytecode: 🔴 11, 1 unch. · avg -8.5%
regexp.js — Interp: 🟢 1, 🔴 2, 8 unch. · avg -17.2% · Bytecode: 🟢 6, 🔴 5 · avg -13.7%
strings.js — Interp: 🟢 3, 🔴 2, 14 unch. · avg -0.4% · Bytecode: 🔴 19 · avg -10.9%
tsv.js — Interp: 🔴 3, 6 unch. · avg -2.6% · Bytecode: 🔴 8, 1 unch. · avg -6.6%
typed-arrays.js — Interp: 🟢 10, 12 unch. · avg +26.8% · Bytecode: 🔴 18, 4 unch. · avg -9.5%
uint8array-encoding.js — Interp: 🟢 2, 🔴 8, 8 unch. · avg -4.3% · Bytecode: 🟢 5, 🔴 9, 4 unch. · avg +7.8%
weak-collections.js — Interp: 🟢 2, 🔴 10, 3 unch. · avg -1.6% · Bytecode: 🟢 1, 🔴 13, 1 unch. · avg -21.7%
Deterministic profile diff: no significant changes.
Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context.
Suite Timing
Test Runner (interpreted: 9,307 passed; bytecode: 9,307 passed)
Memory
GC rows aggregate the main thread plus all worker thread-local GCs. Test runner worker shutdown frees thread-local heaps in bulk; that shutdown reclamation is not counted as GC collections or collected objects.
Benchmarks (interpreted: 407; bytecode: 407)
Memory
GC rows aggregate the main thread plus all worker thread-local GCs. Benchmark runner performs explicit between-file collections, so collection and collected-object counts can be much higher than the test runner.
Measured on ubuntu-latest x64.
test262 Conformance
Areas closest to 100%
Per-test deltas (+8 / -0)
Newly passing (8):
Steady-state failures are non-blocking; regressions vs the cached main baseline (lower total pass count, or any PASS → non-PASS transition) fail the conformance gate. Measured on ubuntu-latest x64, bytecode mode. Areas grouped by the first two test262 path components; minimum 25 attempted tests, areas already at 100% excluded. Δ vs main compares against the most recent cached |
Handle non-Unicode RegExp ignoreCase with generated simple uppercase mappings, filtered by the ECMAScript non-ASCII-to-ASCII guard. Also handle /iu property escapes with complement-before-canonicalization semantics for \P{...}, and decode UTF-8 pattern literals consistently for non-u RegExp source text.
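The non-ASCII-to-ASCII guard mentioned above comes from the ECMAScript Canonicalize operation for non-Unicode patterns. As a rough sketch (not the PR's Pascal implementation): `canonicalizeNonUnicode` is a hypothetical name, and `String.prototype.toUpperCase` stands in for the generated uppercase mappings, which is an assumption made purely for illustration.

```javascript
// Hypothetical sketch of legacy (/i without u or v) canonicalization.
// `ch` is assumed to be a single UTF-16 code unit held in a string.
function canonicalizeNonUnicode(ch) {
  const upper = ch.toUpperCase();

  // A mapping that expands to several code units (ß -> "SS") is ignored.
  if (upper.length !== 1) return ch;

  // Guard: a non-ASCII character must not canonicalize to an ASCII one,
  // so U+017F (long s) does not become "S".
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;

  return upper;
}

// Two characters match under /i when their canonical forms are equal.
function legacyIgnoreCaseEqual(a, b) {
  return canonicalizeNonUnicode(a) === canonicalizeNonUnicode(b);
}

console.log(legacyIgnoreCaseEqual("k", "K"));      // plain ASCII pair
console.log(legacyIgnoreCaseEqual("\u017F", "s")); // blocked by the guard
console.log(legacyIgnoreCaseEqual("\u00E9", "\u00C9")); // é and É
```

The guard is what keeps legacy `/i` from treating `ſ` and `s` as equal, even though the `/iu` path folds them together.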
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/units/Goccia.RegExp.Compiler.pas`:
- Around line 454-459: The negated-path currently only runs the
complement-before-folding branch when ANegated and FModifier.IgnoreCase and
FUnicode and not FUnicodeSets, which leaves top-level \P{...} in Unicode-sets
mode using the old negate-after-folding logic; change the condition in the block
that creates FoldRanges/ReduceUnicodeSimpleCaseFoldClosed/EmitRawCharClassRanges
so it triggers whenever ANegated and FModifier.IgnoreCase and FUnicode (i.e.,
remove the not FUnicodeSets guard) so the complement-before-folding path is
applied in `/iv` as well; update the relevant code around FoldRanges,
CharRangesToUnicodeRanges, ReduceUnicodeSimpleCaseFoldClosed,
EmitRawCharClassRanges and UnicodeRangesToCharRanges and add a small `/iv`
regression test mirroring the `/iu` cases.
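To see why the order of complement and folding matters, here is a toy model over ASCII letters only. It is a sketch under stated assumptions: the universe, the `fold` function, and the helper names are all hypothetical stand-ins, not the compiler's `FoldRanges` or `ReduceUnicodeSimpleCaseFoldClosed` routines.

```javascript
// Toy universe: ASCII letters A-Z and a-z as code points.
const UNIVERSE = [];
for (let cp = 0x41; cp <= 0x5a; cp++) UNIVERSE.push(cp);
for (let cp = 0x61; cp <= 0x7a; cp++) UNIVERSE.push(cp);

// Stand-in for simple case folding: uppercase ASCII folds to lowercase.
const fold = (cp) => (cp >= 0x41 && cp <= 0x5a ? cp + 32 : cp);

// Every code point whose folded form equals the folded form of a member.
function foldClose(set) {
  const folded = new Set([...set].map(fold));
  return new Set(UNIVERSE.filter((cp) => folded.has(fold(cp))));
}

function complement(set) {
  return new Set(UNIVERSE.filter((cp) => !set.has(cp)));
}

// Toy analogue of \p{Lu}: the uppercase letters.
const upper = new Set(UNIVERSE.filter((cp) => cp <= 0x5a));

// Complement first, then close under folding: all 52 letters match.
const complementFirst = foldClose(complement(upper));

// Fold first, then negate: the folded class covers everything,
// so its complement is empty and nothing matches.
const negateAfter = complement(foldClose(upper));

console.log(complementFirst.size, negateAfter.size);
```

In this model the two orderings give opposite answers for the same negated class, which is the discrepancy the comment asks to remove for `/iv`.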
In `@source/units/Goccia.RegExp.UnicodeData.pas`:
- Around line 767-770: The empty stub ExpandRegExpNonUnicodeCaseFolding must
implement the non-embedded ASCII case-fold expansion so ignore-case matching
works when GOCCIA_REGEXP_EMBEDDED_UCD is off: iterate the incoming ARanges
(TUnicodePropertyRangeArray) and for any ASCII letters add their opposite-case
codepoint ranges (e.g., map 'a'-'z' to 'A'-'Z' and vice versa) or expand
single-letter ranges to include the ASCII counterpart, merging/normalizing
overlaps so classes remain sorted/unique; update the routine used by
EmitCharMatch to call this so /i and character-class forms ([a] vs a) behave
identically—note RegExpCanonicalizeCodePoint handles backreferences only, so
keep this ASCII-only expansion here and preserve array normalization semantics
used elsewhere.
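The ASCII-only expansion requested in the second comment could look roughly like the sketch below. It is written in JavaScript for illustration rather than in the unit's Pascal, and the `{ start, end }` range shape and the function name `expandAsciiCaseFold` are assumptions, not the actual `TUnicodePropertyRangeArray` layout.

```javascript
// Hypothetical sketch: add the opposite-case ASCII letters for every range,
// then sort and merge so the result stays normalized.
function expandAsciiCaseFold(ranges) {
  const out = ranges.map((r) => ({ start: r.start, end: r.end }));

  for (const { start, end } of ranges) {
    // Part of the range that overlaps a-z: add the matching A-Z span.
    const lowerStart = Math.max(start, 0x61);
    const lowerEnd = Math.min(end, 0x7a);
    if (lowerStart <= lowerEnd) {
      out.push({ start: lowerStart - 32, end: lowerEnd - 32 });
    }

    // Part of the range that overlaps A-Z: add the matching a-z span.
    const upperStart = Math.max(start, 0x41);
    const upperEnd = Math.min(end, 0x5a);
    if (upperStart <= upperEnd) {
      out.push({ start: upperStart + 32, end: upperEnd + 32 });
    }
  }

  // Normalize: sort by start, then merge overlapping or adjacent ranges.
  out.sort((a, b) => a.start - b.start || a.end - b.end);
  const merged = [];
  for (const r of out) {
    const last = merged[merged.length - 1];
    if (last && r.start <= last.end + 1) {
      last.end = Math.max(last.end, r.end);
    } else {
      merged.push({ start: r.start, end: r.end });
    }
  }
  return merged;
}

// [a] expands to cover both "A" and "a".
console.log(expandAsciiCaseFold([{ start: 0x61, end: 0x61 }]));
```

Clipping each range to the letter blocks before shifting keeps punctuation between `Z` and `a` from being mapped, and the final merge preserves the sorted, non-overlapping form the comment asks for.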
⛔ Files ignored due to path filters (2)
- source/generated/Generated.UnicodeData.pas is excluded by !**/generated/**
- source/generated/Generated.UnicodeData.res is excluded by !**/generated/**
📒 Files selected for processing (6)
- docs/built-ins.md
- scripts/generate-unicode-data.js
- source/units/Goccia.RegExp.Compiler.pas
- source/units/Goccia.RegExp.UnicodeData.pas
- source/units/Goccia.RegExp.VM.pas
- tests/built-ins/RegExp/unicode.js
Summary
/iu and /iv.
CaseFolding/Simple, expand compiled character sets through simple folds, and canonicalize Unicode backreference comparisons.
ß does not match SS.
Testing
Verification run:
- ./build.pas testrunner && ./build/GocciaTestRunner tests/built-ins/RegExp --asi
- ./build/GocciaTestRunner tests/built-ins/RegExp --asi --mode=bytecode
- ./build.pas loader && ./build/GocciaScriptLoader /tmp/goccia-regexp-615-fixed.js --asi
- ./format.pas --check
- ./fixtures/ffi/build.sh && ./build/GocciaTestRunner tests --asi --unsafe-ffi