fix: handle surrogate pairs in non-unicode regex patterns #4700

Merged
jedel1043 merged 6 commits into boa-dev:main from
amrkhaled104:fix/regexp-non-unicode-surrogates
Feb 25, 2026

@amrkhaled104
Contributor

Overview

This PR fixes test262 failures related to RegExp matching when surrogate pairs (like 𠮷) are used without the u or v flags.

The Problem

I found that in non-Unicode mode, Boa was compiling the regex pattern from whole code points (stored as 32-bit values). At runtime, however, the engine calls find_from_ucs2, which operates on raw 16-bit code units.

For example:

  • The character 𠮷 was compiled as one atom: 0x20BB7.
  • But the input string in memory is two units: [0xD842, 0xDFB7].

This mismatch caused the match to fail even though the character was clearly present in the input.
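
The mismatch can be reproduced with the Rust standard library alone. This is a minimal sketch, not code from the PR; the helper name utf16_units is hypothetical.

```rust
/// Hypothetical helper (not part of Boa): returns the UTF-16 code units
/// that encode a single code point.
fn utf16_units(c: char) -> Vec<u16> {
    // A code point needs at most two UTF-16 code units.
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// The code point value, which is what the pattern compiler saw before the fix.
fn code_point(c: char) -> u32 {
    u32::from(c)
}
```

For '𠮷', code_point returns 0x20BB7 while utf16_units returns the surrogate pair 0xD842, 0xDFB7, so an atom compiled from the former never equals any single unit of the latter.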

What I Changed

  1. In core/engine/src/builtins/regexp/mod.rs:
  • I updated compile_native_regexp to check if the Unicode flag is missing.
  • If there is no u or v flag, I manually "flatten" the pattern. I iterate through each code point and decompose it into its individual UTF-16 units (surrogates) before passing them to the regress matcher.
  • This makes the compiled pattern match the 16-bit structure of the input string.
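
The flattening step can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the function name pattern_units is hypothetical, and it takes a Rust &str for simplicity, whereas Boa's pattern strings are UTF-16 based and can contain lone surrogates.

```rust
/// Hypothetical sketch of the flattening idea (not Boa's actual function).
/// Returns the sequence of units that would be handed to the regex compiler.
fn pattern_units(pattern: &str, full_unicode: bool) -> Vec<u32> {
    if full_unicode {
        // With the `u` or `v` flag, every code point stays a single atom.
        pattern.chars().map(u32::from).collect()
    } else {
        // Without those flags, split each code point into its UTF-16
        // code units so the pattern lines up with the 16-bit input.
        pattern
            .chars()
            .flat_map(|c| {
                let mut buf = [0u16; 2];
                c.encode_utf16(&mut buf)
                    .iter()
                    .map(|&unit| u32::from(unit))
                    .collect::<Vec<u32>>()
            })
            .collect()
    }
}
```

With this sketch, pattern_units("a𠮷", false) yields three units (0x61, 0xD842, 0xDFB7), while pattern_units("a𠮷", true) yields two (0x61, 0x20BB7).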

How I Tested

I ran the boa_tester with the following command:

cargo run --bin boa_tester -- run -s test/built-ins/String/prototype/match/regexp-prototype-match-v-u-flag.js

Result: all tests passed, including the exec test.


@github-actions

github-actions bot commented Feb 23, 2026

Test262 conformance changes

Test result    main count    PR count    difference
Total          52,862        52,862      0
Passed         49,497        49,504      +7
Ignored        2,261         2,262       +1
Failed         1,104         1,096       -8
Panics         0             0           0
Conformance    93.63%        93.65%      +0.01%
Fixed tests (8):
test/staging/sm/RegExp/unicode-raw.js (previously Failed)
test/staging/sm/RegExp/unicode-class-raw.js (previously Failed)
test/built-ins/String/prototype/replace/regexp-prototype-replace-v-u-flag.js (previously Failed)
test/built-ins/String/prototype/matchAll/regexp-prototype-matchAll-v-u-flag.js (previously Failed)
test/built-ins/String/prototype/search/regexp-prototype-search-v-flag.js (previously Failed)
test/built-ins/String/prototype/search/regexp-prototype-search-v-u-flag.js (previously Failed)
test/built-ins/String/prototype/match/regexp-prototype-match-v-u-flag.js (previously Failed)
test/built-ins/RegExp/prototype/exec/regexp-builtin-exec-v-u-flag.js (previously Failed)

@amrkhaled104
Contributor Author

@jedel1043 Why is CI not working?

@amrkhaled104
Contributor Author

amrkhaled104 commented Feb 23, 2026

@jedel1043 All 8 tests are now passing 💪

@codecov

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.08%. Comparing base (6ddc2b4) to head (8f51053).
⚠️ Report is 674 commits behind head on main.

Files with missing lines Patch % Lines
core/engine/src/builtins/regexp/mod.rs 81.25% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4700      +/-   ##
==========================================
+ Coverage   47.24%   57.08%   +9.84%     
==========================================
  Files         476      549      +73     
  Lines       46892    60152   +13260     
==========================================
+ Hits        22154    34338   +12184     
- Misses      24738    25814    +1076     

@amrkhaled104
Contributor Author

Are there any issues you'd suggest I pick up? @jedel1043

Comment on lines 349 to 358

    let has_named_groups = p.code_points().collect::<Vec<_>>().windows(3).any(|w| {
        matches!(
            (w[0], w[1], w[2]),
            (
                CodePoint::Unicode('('),
                CodePoint::Unicode('?'),
                CodePoint::Unicode('<')
            )
        )
    });

Member

I kind of don't understand why we need this. IIRC the only things that influence if a regex is parsed as UTF16 or UCS2 are the u and v flags, right? And that's being taken care of by the full_unicode check above.

Contributor Author

@amrkhaled104 amrkhaled104 Feb 24, 2026

@jedel1043 I want to explain the full picture. Sorry, I forgot to update the PR description after my last changes.
In the first version of the PR, I found that many Prototype tests were failing. The reason was how we handle Unicode and non-Unicode modes.
In non-Unicode mode, JavaScript treats characters outside the Basic Multilingual Plane (like emojis or some math symbols) as two 16-bit code units, not one character. But our engine was treating them as one unit, which made the search index wrong.
I used flat_map to split these characters into two units. This fixed the indexing and made the Prototype tests pass.
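
The index problem described here can be illustrated with a small sketch. The two function names are hypothetical and none of this is Boa code; it only shows how the same position is counted differently in code units and in code points.

```rust
/// Hypothetical helper: position of `needle`, counted in UTF-16 code units,
/// which is how JavaScript string indices are defined.
fn index_in_code_units(haystack: &str, needle: char) -> Option<usize> {
    let mut offset = 0;
    for c in haystack.chars() {
        if c == needle {
            return Some(offset);
        }
        // Characters outside the BMP advance the offset by two units.
        offset += c.len_utf16();
    }
    None
}

/// Hypothetical helper: position of `needle`, counted in whole code points.
fn index_in_code_points(haystack: &str, needle: char) -> Option<usize> {
    haystack.chars().position(|c| c == needle)
}
```

For the input "😀b", the letter b sits at index 2 in code units but at index 1 in code points, so a matcher that counts one way while the caller counts the other reports the wrong position.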

After that, the test test/built-ins/RegExp/named-groups/non-unicode-property-names-valid.js failed. I realized this is a special case: even in non-Unicode mode, named groups (?<name>...) need each character of the name to stay as one code point. If we split the name into two units, the regex engine cannot find the group name because the name becomes broken. So I added a check:

    let has_named_groups = !full_unicode && p.to_std_string_escaped().contains("(?<");

or

    let has_named_groups = p.code_points().collect::<Vec<_>>().windows(3).any(|w| {
        matches!(
            (w[0], w[1], w[2]),
            (
                CodePoint::Unicode('('),
                CodePoint::Unicode('?'),
                CodePoint::Unicode('<')
            )
        )
    });

This code checks for the (?< sequence. If the pattern has named groups, we keep the characters as code points (as in Unicode mode).
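
For readers without Boa's CodePoint type at hand, the same sliding-window check can be sketched over plain chars. The function name is hypothetical and this is an illustration of the idea only, not the code from the PR.

```rust
/// Hypothetical stand-alone version of the check: true if the pattern
/// contains the three-character sequence `(?<` anywhere.
fn has_named_group_prefix(pattern: &str) -> bool {
    let chars: Vec<char> = pattern.chars().collect();
    // Slide a window of three characters over the pattern.
    chars.windows(3).any(|w| matches!(w, ['(', '?', '<']))
}
```

One limitation of this kind of textual check is that lookbehind assertions such as (?<=a) also begin with (?<, so has_named_group_prefix("(?<=a)b") returns true even though the pattern has no named group.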

Member

Then this edge case is not our responsibility. It should be the responsibility of regress to parse group names as unicode even if parsing without unicode support.

I would suggest opening a bug on their side reporting this, and removing the hack. The test should pass afterwards without having to hack around this.

Contributor Author

You are right, I will update the PR.

Contributor Author

@jedel1043 I have updated the code and removed the hack. To keep the CI green while we wait for a fix in regress, should I add this specific test to the test262_config.toml ignore list?

Member

Yep, please do! And if you can, add a TODO on the config file pointing to regress' issue.

Contributor Author

Got it! I will add the test to test262_config.toml with a TODO note in this PR.

@jedel1043 added the bug and builtins labels Feb 24, 2026
@jedel1043 added this to the v1.0.0 milestone Feb 24, 2026
@amrkhaled104
Contributor Author

amrkhaled104 commented Feb 24, 2026

@jedel1043 Anything else?

Member

@jedel1043 jedel1043 left a comment

Nope, everything looks good!

@jedel1043 jedel1043 added this pull request to the merge queue Feb 25, 2026
Merged via the queue into boa-dev:main with commit ffe47c8 Feb 25, 2026
18 checks passed
Labels

bug (Something isn't working), builtins (PRs and Issues related to builtins/intrinsics)

2 participants