fix: handle surrogate pairs in non-unicode regex patterns#4700
fix: handle surrogate pairs in non-unicode regex patterns#4700jedel1043 merged 6 commits intoboa-dev:mainfrom
Conversation
Test262 conformance changes
Fixed tests (8): |
|
@jedel1043 why cI not work ? |
|
@jedel1043 All 8 tests are now passing 💪 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4700 +/- ##
==========================================
+ Coverage 47.24% 57.08% +9.84%
==========================================
Files 476 549 +73
Lines 46892 60152 +13260
==========================================
+ Hits 22154 34338 +12184
- Misses 24738 25814 +1076 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Are there any issues you'd suggest I pick up ? @jedel1043 |
| let has_named_groups = p.code_points().collect::<Vec<_>>().windows(3).any(|w| { | ||
| matches!( | ||
| (w[0], w[1], w[2]), | ||
| ( | ||
| CodePoint::Unicode('('), | ||
| CodePoint::Unicode('?'), | ||
| CodePoint::Unicode('<') | ||
| ) | ||
| ) | ||
| }); |
There was a problem hiding this comment.
I kind of don't understand why we need this. IIRC the only things that influence if a regex is parsed as UTF16 or UCS2 are the u and v flags, right? And that's being taken care of by the full_unicode check above.
There was a problem hiding this comment.
@jedel1043 I want to explain the full picture. Sorry, I forgot to update the PR description after my last changes
In the first version of the PR, I found that many Prototype tests were failing. The reason was how we handle Unicode and Non-Unicode modes
In Non-Unicode mode, JavaScript treats 'heavy' characters (like emojis or special math symbols) as two 16-bit units, not one character. But our engine was treating them as one unit. This made the search index wrong
I used flat_map to split these characters into two units. This fixed the indexing and made the Prototype tests pass.
After that, the test test\built-ins\RegExp\named-groups\non-unicode-property-names-valid.js failed. I realized this is a special case Even in Non-Unicode mode, Named Groups (?) need the character to stay as one 'Identity' (Code Point). If we split the name into two units, the Regex engine cannot find the group name because the name becomes 'broken' So, I added a check:
let has_named_groups = !full_unicode && p.to_std_string_escaped().contains("(?<");
or
matches!(
(w[0], w[1], w[2]),
(
CodePoint::Unicode('('),
CodePoint::Unicode('?'),
CodePoint::Unicode('<')
)
)
});
This code checks for the (?< sequence If the pattern has named groups, we keep the characters as Code Points (like Unicode mode). .
There was a problem hiding this comment.
Then this edge case is not our responsibility. It should be the responsibility of regress to parse group names as unicode even if parsing without unicode support.
I would suggest opening a bug on their side reporting this, and removing the hack. The test should pass afterwards without having to hack around this.
There was a problem hiding this comment.
You are right,I will update the PR
There was a problem hiding this comment.
@jedel1043 I have updated the code and removed the hack. To keep the CI green while we wait for a fix in regress, should I add this specific test to the test262_config.toml ignore list?
There was a problem hiding this comment.
Yep, please do! And if you can, add a TODO on the config file pointing to regress' issue.
There was a problem hiding this comment.
Got it! I will add the test to test262_config.toml with a TODO note in this PR.
|
@jedel1043 Anything else? |
jedel1043
left a comment
There was a problem hiding this comment.
Nope, everything looks good!

Overview
This PR fixes
test262failures related to RegExp matching when surrogate pairs (like𠮷) are used without theuorvflags.The Problem
I found that in Non-Unicode mode, Boa was compiling the regex pattern using full 32-bit Code Points. However, at runtime, the engine uses
find_from_ucs2which looks at raw 16-bit Code Units.For example:
𠮷was compiled as one atom:0x20BB7.[0xD842, 0xDFB7].This mismatch caused the regex to fail even if the character was clearly there.
What I Changed
core/engine/src/builtins/regexp/mod.rs:compile_native_regexpto check if the Unicode flag is missing.uorvflag, I manually "flatten" the pattern. I iterate through each code point and decompose it into its individual UTF-16 units (surrogates) before passing them to theregressmatcher.How I Tested
I ran the
boa_testerwith the following command:Result: All tests passed

and exec test
