Generate full Unicode property tables for regex \p{} escapes#630
Conversation
Replace the 12 hardcoded ASCII-approximate property ranges with ICU-first
UnicodeSet queries backed by a generated UCD binary resource fallback.
Adds \p{Script=Greek}, \p{General_Category=...}, \p{Emoji}, and all ES2026
§22.2.2.9 properties with full alias resolution.
Closes #607
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds ICU-backed and embedded Unicode property range resolution, a Node.js generator that creates a compact UCD resource and Pascal unit, integrates dynamic property lookup into the RegExp compiler (replacing hardcoded maps), and documents the generation/build fallback. ChangesUnicode Property Data Integration for RegExp
Sequence Diagram(s)sequenceDiagram
participant Compiler as RegExp Compiler
participant UnicodeICU as UnicodeICU Unit
participant ICU as ICU Library
Compiler->>UnicodeICU: TryICUGetUnicodePropertyRanges(property, value)
UnicodeICU->>UnicodeICU: TryLoadFunctions (one-time, critical section)
UnicodeICU->>ICU: uset_openEmpty(), uset_applyPropertyAlias(...)
UnicodeICU->>ICU: uset_getItemCount(), uset_getItem(...)
ICU-->>UnicodeICU: count and range pairs
UnicodeICU-->>Compiler: TUnicodePropertyRangeArray
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Suite TimingTest Runner (interpreted: 9,271 passed; bytecode: 9,271 passed)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Test runner worker shutdown frees thread-local heaps in bulk; that shutdown reclamation is not counted as GC collections or collected objects.
Benchmarks (interpreted: 407; bytecode: 407)
MemoryGC rows aggregate the main thread plus all worker thread-local GCs. Benchmark runner performs explicit between-file collections, so collection and collected-object counts can be much higher than the test runner.
Measured on ubuntu-latest x64. |
Benchmark Results407 benchmarks Interpreted: 🟢 40 improved · 🔴 253 regressed · 114 unchanged · avg -2.9% arraybuffer.js — Interp: 🟢 4, 🔴 5, 5 unch. · avg +0.5% · Bytecode: 🟢 13, 1 unch. · avg +31.6%
arrays.js — Interp: 🔴 14, 5 unch. · avg -6.2% · Bytecode: 🟢 19 · avg +23.1%
async-await.js — Interp: 🔴 2, 4 unch. · avg -5.3% · Bytecode: 🟢 6 · avg +26.8%
async-generators.js — Interp: 🔴 1, 1 unch. · avg -3.4% · Bytecode: 🟢 2 · avg +30.6%
base64.js — Interp: 🟢 7, 3 unch. · avg +6.1% · Bytecode: 🟢 9, 1 unch. · avg +24.6%
classes.js — Interp: 🔴 20, 11 unch. · avg -3.2% · Bytecode: 🟢 31 · avg +24.6%
closures.js — Interp: 🔴 8, 3 unch. · avg -3.9% · Bytecode: 🟢 11 · avg +28.7%
collections.js — Interp: 🔴 5, 7 unch. · avg -1.0% · Bytecode: 🟢 12 · avg +29.3%
csv.js — Interp: 🔴 13 · avg -9.3% · Bytecode: 🟢 13 · avg +26.0%
destructuring.js — Interp: 🔴 16, 6 unch. · avg -5.0% · Bytecode: 🟢 22 · avg +27.7%
fibonacci.js — Interp: 🔴 7, 1 unch. · avg -4.7% · Bytecode: 🟢 8 · avg +29.6%
float16array.js — Interp: 🟢 4, 🔴 23, 5 unch. · avg -0.6% · Bytecode: 🟢 32 · avg +43.3%
for-of.js — Interp: 🔴 3, 4 unch. · avg -1.9% · Bytecode: 🟢 7 · avg +29.1%
generators.js — Interp: 🔴 3, 1 unch. · avg -3.7% · Bytecode: 🟢 4 · avg +25.3%
iterators.js — Interp: 🔴 41, 1 unch. · avg -7.7% · Bytecode: 🟢 42 · avg +24.1%
json.js — Interp: 🔴 10, 10 unch. · avg -3.0% · Bytecode: 🟢 20 · avg +26.1%
jsx.jsx — Interp: 🔴 9, 12 unch. · avg -2.3% · Bytecode: 🟢 21 · avg +23.6%
modules.js — Interp: 🔴 2, 7 unch. · avg -2.1% · Bytecode: 🟢 9 · avg +28.0%
numbers.js — Interp: 🔴 10, 1 unch. · avg -7.4% · Bytecode: 🟢 11 · avg +28.1%
objects.js — Interp: 🔴 5, 2 unch. · avg -5.9% · Bytecode: 🟢 7 · avg +28.4%
promises.js — Interp: 🟢 3, 9 unch. · avg +1.3% · Bytecode: 🟢 12 · avg +25.4%
regexp.js — Interp: 🟢 6, 🔴 5 · avg +3.6% · Bytecode: 🟢 11 · avg +28.0%
strings.js — Interp: 🟢 1, 🔴 17, 1 unch. · avg -7.7% · Bytecode: 🟢 19 · avg +27.7%
tsv.js — Interp: 🔴 7, 2 unch. · avg -7.1% · Bytecode: 🟢 9 · avg +25.1%
typed-arrays.js — Interp: 🟢 4, 🔴 12, 6 unch. · avg -2.6% · Bytecode: 🟢 14, 🔴 6, 2 unch. · avg +13.1%
uint8array-encoding.js — Interp: 🟢 4, 🔴 11, 3 unch. · avg +6.2% · Bytecode: 🟢 12, 🔴 3, 3 unch. · avg +44.5%
weak-collections.js — Interp: 🟢 7, 🔴 4, 4 unch. · avg +4.8% · Bytecode: 🟢 15 · avg +42.6%
Deterministic profile diffDeterministic profile diff: no significant changes. Measured on ubuntu-latest x64. Benchmark ranges compare cached main-branch min/max ops/sec with the PR run; overlapping ranges are treated as unchanged noise. Percentage deltas are secondary context. |
test262 Conformance
Areas closest to 100%
Per-test deltas (+293 / -0)Newly passing (293):
Steady-state failures are non-blocking; regressions vs the cached main baseline (lower total pass count, or any PASS → non-PASS transition) fail the conformance gate. Measured on ubuntu-latest x64, bytecode mode. Areas grouped by the first two test262 path components; minimum 25 attempted tests, areas already at 100% excluded. Δ vs main compares against the most recent cached |
Three bugs in generate-unicode-data.js caused fallback failures on Linux: 1. Script aliases not generated — Scripts.txt uses long names (Cyrillic) but the alias matcher compared against short names (Cyrl), so no aliases were created. Fixed with findAllAliases() that resolves via the canonical short name regardless of which form the data file uses. 2. ScriptExtensions base ranges not merged — ScriptExtensions.txt uses short names (Latn) but scriptData keys were long names (Latin), so the base range lookup returned undefined. Fixed by building a short-to-long name map from PropertyValueAliases.txt. 3. ASCII binary property missing — defined via Block property in the UCD, not in PropList.txt or DerivedCoreProperties.txt. Added as a hardcoded entry (U+0000..U+007F) alongside Any and Assigned. Entry count: 1033 → 1418 (385 new alias entries). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Content-based blob dedup: buildResourceFromEntries now deduplicates by base64 content instead of Buffer reference identity, preventing silent duplication if addEntry is called twice with identical ranges. 2. Robust short-to-long resolution: parseScriptExtensions resolves short script names by checking which alias form exists as a key in scriptData, instead of using a fragile length heuristic. 3. Extract CopyICURanges: the four identical range-copy loops in GetUnicodePropertyRanges are replaced with a nested procedure. 4. Thread-safe UCD cache: TryReadEmbeddedResource is now guarded by a TRTLCriticalSection, matching the ICU.pas initialization pattern. 5. Remove unused SysUtils import from UnicodeICU.pas. 6. Add DerivedBinaryProperties.txt to cover Bidi_Mirrored (ES2026 Table 67 binary property missing from the previous file set). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generator: - Extract parseRange() and forEachUCDLine() helpers, eliminating duplicated range-parsing and comment-stripping boilerplate across parseUCDRangeFile, parseScriptExtensions, and the alias parsers. - Unify three near-identical property-registration loops (gc, sc, scx) into registerProperty() and make property-name expansion data-driven. Pascal: - Extract LoadSymbol() helper in UnicodeICU.pas, replacing five repetitions of the ICUGetProcAddress + Assigned guard pattern. - Replace string-allocating TryGetEntryName + CompareStr in binary search with TryCompareEntryName that compares bytes directly against the resource buffer, eliminating heap allocation per search step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/generate-unicode-data.js (1)
125-127: ⚡ Quick winResolve redirect targets against the current request URL.
Locationis allowed to be relative to the request URL, so recursing with the raw header can turn a valid redirect into a download failure. Build the next URL withnew URL(response.headers.location, url)before callingdownloadFileagain. (developer.mozilla.org)Patch
- if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) { - resolve(downloadFile(response.headers.location, redirectCount + 1)); + if (response.statusCode >= 300 && response.statusCode < 400 && response.headers.location) { + const redirectUrl = new URL(response.headers.location, url).toString(); + resolve(downloadFile(redirectUrl, redirectCount + 1)); return; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/generate-unicode-data.js` around lines 125 - 127, The redirect handling in downloadFile is passing response.headers.location directly which can be a relative URL; update the redirect branch inside the https.get callback to resolve the location against the current request URL (use new URL(response.headers.location, url).toString() or .href) before recursing to downloadFile(url, redirectCount + 1) so downloadFile receives an absolute URL; update the call in the block that references response.headers.location and keep the redirectCount logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@source/units/Goccia.RegExp.Compiler.pas`:
- Around line 389-394: CopyICURanges currently relies on AddRange which silently
stops when the fixed ARanges buffer is full; before the for-loop in
CopyICURanges check that there is enough capacity to copy all ICURanges (e.g.
compute needed := ARangeCount + Length(ICURanges) and compare against the fixed
buffer max capacity used by ARanges or a constant like MAX_RANGES), and if not,
raise an explicit error/exception or call Abort/Assert to fail loudly; keep
references to ARanges, ARangeCount, ICURanges, AddRange and CopyICURanges when
implementing the capacity check and failing code path.
- Around line 410-429: The Unicode property handling is only expanding ASCII
case pairs and misses Unicode simple case folds when /i and /u are both set;
update EmitCharClassRanges so that after ranges are obtained via
TryICUGetUnicodePropertyRanges, TryGetUnicodePropertyRanges or CopyICURanges
(ICURanges), you perform Unicode simple case folding for each code point in
those ranges (not just ±32 ASCII) and add the folded code points/ranges to the
character class (and mirror behavior for negated \P). Use ICU or your existing
Unicode case-fold table to map each code point to its simple case-fold target(s)
and merge/normalize the resulting ranges before emitting; ensure this logic runs
only when both the Unicode flag and caseless flag are active and reference the
same functions: EmitCharClassRanges, TryICUGetUnicodePropertyRanges,
TryGetUnicodePropertyRanges, CopyICURanges.
---
Nitpick comments:
In `@scripts/generate-unicode-data.js`:
- Around line 125-127: The redirect handling in downloadFile is passing
response.headers.location directly which can be a relative URL; update the
redirect branch inside the https.get callback to resolve the location against
the current request URL (use new URL(response.headers.location, url).toString()
or .href) before recursing to downloadFile(url, redirectCount + 1) so
downloadFile receives an absolute URL; update the call in the block that
references response.headers.location and keep the redirectCount logic intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fe7feaa5-180f-4cc9-b89b-6c7f08b0380a
⛔ Files ignored due to path filters (2)
source/generated/Generated.UnicodeData.pasis excluded by!**/generated/**source/generated/Generated.UnicodeData.resis excluded by!**/generated/**
📒 Files selected for processing (7)
docs/build-system.mdscripts/generate-unicode-data.jssource/shared/UnicodeICU.passource/units/Goccia.RegExp.Compiler.passource/units/Goccia.RegExp.UnicodeData.passource/units/Goccia.inctests/built-ins/RegExp/unicode-properties.js
CopyICURanges now raises EConvertError if the ICU/fallback range count exceeds the fixed buffer capacity, instead of silently truncating via AddRange's early exit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
\p{}property ranges with ICU-first UnicodeSet queries backed by a generated UCD binary resource fallback, expanding support to the full ES2026 §22.2.2.9 property set (General_Category, Script, Script_Extensions, and all binary properties with full alias resolution).\p{PropertyName=PropertyValue}syntax parsing (\p{Script=Greek},\p{gc=Lu}, etc.) and increaseMAX_CHAR_RANGESfrom 512 to 2048 for full Unicode coverage.uset_applyPropertyAlias+ embedded UCD binary resource fallback +GOCCIA_REGEXP_NO_EMBEDDED_UCDcompile-time opt-out.Closes #607
Testing