fix: handle surrogate pairs in non-unicode regex patterns by amrkhaled104 · Pull Request #4700 · boa-dev/boa

amrkhaled104 · 2026-02-23T21:27:19Z

Overview

This PR fixes test262 failures in RegExp matching when characters that need a surrogate pair (like 𠮷) are used without the u or v flags.

The Problem

I found that in non-Unicode mode, Boa was compiling the regex pattern using full code points (stored as 32-bit values). However, at runtime, the engine uses find_from_ucs2, which looks at raw 16-bit code units.

For example:

The character 𠮷 was compiled as one atom: 0x20BB7.
But the input string in memory is two units: [0xD842, 0xDFB7].
This mismatch caused the regex to fail even if the character was clearly there.
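To make the mismatch concrete, here is a minimal standalone sketch (not Boa code) that encodes 𠮷 with the standard library's `char::encode_utf16`. The helper name `utf16_units` is hypothetical and only exists for this illustration.

```rust
/// Hypothetical helper: encode one Unicode scalar value as UTF-16 code units.
fn utf16_units(c: char) -> Vec<u16> {
    // A char needs at most two UTF-16 code units.
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    let c = '𠮷';
    // The pattern side saw a single code point...
    println!("code point: {:#X}", c as u32);
    // ...while the input string holds two 16-bit units.
    for unit in utf16_units(c) {
        println!("code unit:  {:#06X}", unit);
    }
}
```

A matcher that compares the single value 0x20BB7 against the units 0xD842 and 0xDFB7 one at a time can never find a match, which is the failure described above.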

What I Changed

In core/engine/src/builtins/regexp/mod.rs:

I updated compile_native_regexp to check if the Unicode flag is missing.
If there is no u or v flag, I manually "flatten" the pattern. I iterate through each code point and decompose it into its individual UTF-16 units (surrogates) before passing them to the regress matcher.
This makes the compiled pattern match the 16-bit structure of the input string.
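The flattening step can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: it works on plain `u32` code points instead of Boa's own string types, and the function name `flatten_to_code_units` is hypothetical.

```rust
/// Hypothetical sketch: split every code point above 0xFFFF into its
/// UTF-16 surrogate pair, leaving all other values (including lone
/// surrogates already present in the pattern) untouched.
fn flatten_to_code_units(code_points: &[u32]) -> Vec<u32> {
    code_points
        .iter()
        .flat_map(|&cp| {
            if cp > 0xFFFF {
                let v = cp - 0x1_0000;
                // High surrogate carries the top 10 bits, low surrogate the bottom 10.
                vec![0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
            } else {
                vec![cp]
            }
        })
        .collect()
}

fn main() {
    // The pattern "a𠮷" as code points.
    let pattern = [0x61, 0x20BB7];
    let units = flatten_to_code_units(&pattern);
    for unit in &units {
        println!("{:#06X}", unit);
    }
}
```

After this step the pattern contains one element per 16-bit unit, so it lines up with the UCS-2 view of the input string.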

How I Tested

I ran the boa_tester with the following command:

cargo run --bin boa_tester -- run -s test/built-ins/String/prototype/match/regexp-prototype-match-v-u-flag.js

Result: all tests passed, including the exec test.

github-actions · 2026-02-23T21:37:08Z

Test262 conformance changes

Test result	main count	PR count	difference
Total	52,862	52,862	0
Passed	49,497	49,504	+7
Ignored	2,261	2,262	+1
Failed	1,104	1,096	-8
Panics	0	0	0
Conformance	93.63%	93.65%	+0.01%

Fixed tests (8):

test/staging/sm/RegExp/unicode-raw.js (previously Failed)
test/staging/sm/RegExp/unicode-class-raw.js (previously Failed)
test/built-ins/String/prototype/replace/regexp-prototype-replace-v-u-flag.js (previously Failed)
test/built-ins/String/prototype/matchAll/regexp-prototype-matchAll-v-u-flag.js (previously Failed)
test/built-ins/String/prototype/search/regexp-prototype-search-v-flag.js (previously Failed)
test/built-ins/String/prototype/search/regexp-prototype-search-v-u-flag.js (previously Failed)
test/built-ins/String/prototype/match/regexp-prototype-match-v-u-flag.js (previously Failed)
test/built-ins/RegExp/prototype/exec/regexp-builtin-exec-v-u-flag.js (previously Failed)

amrkhaled104 · 2026-02-23T23:25:34Z

@jedel1043 why is CI not working?


amrkhaled104 · 2026-02-23T23:56:40Z

@jedel1043 All 8 tests are now passing 💪

codecov · 2026-02-24T01:06:11Z

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.08%. Comparing base (6ddc2b4) to head (8f51053).
⚠️ Report is 674 commits behind head on main.

Files with missing lines	Patch %	Lines
core/engine/src/builtins/regexp/mod.rs	81.25%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4700      +/-   ##
==========================================
+ Coverage   47.24%   57.08%   +9.84%     
==========================================
  Files         476      549      +73     
  Lines       46892    60152   +13260     
==========================================
+ Hits        22154    34338   +12184     
- Misses      24738    25814    +1076


amrkhaled104 · 2026-02-24T17:08:44Z

Are there any issues you'd suggest I pick up? @jedel1043

jedel1043 · 2026-02-24T17:11:02Z

core/engine/src/builtins/regexp/mod.rs

+        let has_named_groups = p.code_points().collect::<Vec<_>>().windows(3).any(|w| {
+            matches!(
+                (w[0], w[1], w[2]),
+                (
+                    CodePoint::Unicode('('),
+                    CodePoint::Unicode('?'),
+                    CodePoint::Unicode('<')
+                )
+            )
+        });


I kind of don't understand why we need this. IIRC the only things that influence if a regex is parsed as UTF16 or UCS2 are the u and v flags, right? And that's being taken care of by the full_unicode check above.

@jedel1043 I want to explain the full picture. Sorry, I forgot to update the PR description after my last changes.
In the first version of the PR, I found that many prototype tests were failing. The reason was how we handle Unicode and non-Unicode modes.
In non-Unicode mode, JavaScript treats characters outside the Basic Multilingual Plane (like emojis or some math symbols) as two 16-bit code units, not one character. But our engine was treating them as one unit, which made the search index wrong.
I used flat_map to split these characters into two units. This fixed the indexing and made the prototype tests pass.

After that, the test test\built-ins\RegExp\named-groups\non-unicode-property-names-valid.js failed. I realized this is a special case: even in non-Unicode mode, named groups, (?<name>...), need each character of the name to stay as a single code point. If we split the name into two units, the regex engine cannot find the group name because the name becomes "broken". So I added a check:
let has_named_groups = !full_unicode && p.to_std_string_escaped().contains("(?<");
or the windows(3) version from the diff above:

matches!( (w[0], w[1], w[2]), ( CodePoint::Unicode('('), CodePoint::Unicode('?'), CodePoint::Unicode('<') ) )

This code checks for the (?< sequence. If the pattern has named groups, we keep the characters as code points (as in Unicode mode).
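For readers following the thread, the check described here can be reproduced in isolation. The sketch below is an assumption-laden stand-in: it scans a plain `&str` with `char` values instead of Boa's `CodePoint` type, and the name `contains_named_group_prefix` is hypothetical.

```rust
/// Hypothetical sketch: report whether the pattern contains the
/// three-character sequence `(?<` anywhere.
fn contains_named_group_prefix(pattern: &str) -> bool {
    let chars: Vec<char> = pattern.chars().collect();
    // Slide a window of three characters over the pattern.
    chars.windows(3).any(|w| matches!(w, ['(', '?', '<']))
}

fn main() {
    println!("{}", contains_named_group_prefix(r"(?<year>\d{4})"));
    println!("{}", contains_named_group_prefix(r"(\d{4})"));
}
```

The scan is purely textual and does not parse the pattern, which fits the reviewer's description of it as a workaround that belongs in the regex parser instead.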

Then this edge case is not our responsibility. It should be the responsibility of regress to parse group names as unicode even if parsing without unicode support.

I would suggest opening a bug on their side reporting this, and removing the hack. The test should pass afterwards without having to hack around this.

You are right, I will update the PR.

@jedel1043 I have updated the code and removed the hack. To keep the CI green while we wait for a fix in regress, should I add this specific test to the test262_config.toml ignore list?

Yep, please do! And if you can, add a TODO on the config file pointing to regress' issue.

Got it! I will add the test to test262_config.toml with a TODO note in this PR.

amrkhaled104 · 2026-02-24T21:21:22Z

@jedel1043 Anything else?

jedel1043

Nope, everything looks good!

fix: handle surrogate pairs in non-unicode regex patterns

e5e848d

amrkhaled104 added 2 commits February 24, 2026 01:44

fix: implement hybrid regex compilation for named groups and surrogat…

d4353d1

…e pairs

style: fix formatting issues

34da5a9

jedel1043 reviewed Feb 24, 2026

refactor: simplify regex compilation and remove named groups hack

0de6413

jedel1043 added the bug (Something isn't working) and builtins (PRs and Issues related to builtins/intrinsics) labels Feb 24, 2026

jedel1043 added this to the v1.0.0 milestone Feb 24, 2026

test: ignore failing named groups test and add TODO

909d95a

Merge branch 'boa-dev:main' into fix/regexp-non-unicode-surrogates

8f51053

jedel1043 approved these changes Feb 25, 2026

jedel1043 added this pull request to the merge queue Feb 25, 2026

Merged via the queue into boa-dev:main with commit ffe47c8 Feb 25, 2026
18 checks passed
