Fix PCRE2 UTF-8 validation errors for Qwen tokenizers #171

Merged: kirklandsign merged 6 commits into main from copilot/fix-issue-14432 on Feb 4, 2026

Copilot AI commented Feb 3, 2026

Qwen tokenizer patterns contain byte sequences that aren't valid UTF-8, causing PCRE2 compilation to fail with "UTF-8 error: isolated byte with 0x80 bit set". This prevents loading tokenizers for Qwen2.5 and OuteTTS models.
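To make the two error messages quoted in this PR concrete, here is a minimal, simplified UTF-8 well-formedness check. It is only an illustration and not PCRE2's validation code, which distinguishes many more cases; the function name `is_valid_utf8` is hypothetical. It rejects both an isolated byte with the 0x80 bit set and the overlong 2-byte sequence `\xC0\x80`.

```cpp
#include <cstddef>
#include <string>

// Simplified UTF-8 check (illustrative only; not PCRE2's implementation).
// Returns true only if every byte is part of a well-formed, shortest-form sequence.
bool is_valid_utf8(const std::string& s) {
  // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
  static const unsigned long min_cp[] = {0x0, 0x80, 0x800, 0x10000};
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    unsigned long cp = 0;   // decoded code point
    if (b < 0x80) {
      ++i;  // plain ASCII
      continue;
    } else if ((b & 0xE0) == 0xC0) {
      extra = 1; cp = b & 0x1F;
    } else if ((b & 0xF0) == 0xE0) {
      extra = 2; cp = b & 0x0F;
    } else if ((b & 0xF8) == 0xF0) {
      extra = 3; cp = b & 0x07;
    } else {
      return false;  // isolated continuation byte (0x80..0xBF) or invalid lead byte
    }
    if (i + extra >= s.size()) return false;  // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min_cp[extra]) return false;  // overlong encoding, e.g. "\xC0\x80"
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    i += extra + 1;
  }
  return true;
}
```

A pattern that fails a check of this kind cannot be compiled in a UTF mode, which is why the change described here retries in a non-UTF mode.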

Changes

  • pcre2_regex.cpp: Detect UTF-8 validation errors (codes -3 to -23, i.e. PCRE2_ERROR_UTF8_ERR1 through PCRE2_ERROR_UTF8_ERR21) and retry compilation without the PCRE2_UTF flag
  • std_regex.cpp: Catch std::regex_error in find_all() to prevent crashes from pattern complexity errors
  • test_regex.cpp: Add test for invalid UTF-8 pattern handling
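The std_regex.cpp change could look roughly like the sketch below. The real signature of `StdRegex::find_all()` is not shown in this PR, so `find_all_safe` is a hypothetical free function, and returning the matches collected before the exception is an assumption rather than the merged behavior.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for StdRegex::find_all(): returns [start, end) byte
// offsets of every match. Matching with std::regex can throw std::regex_error
// (for example error_complexity or error_stack), so the loop is guarded.
std::vector<std::pair<std::size_t, std::size_t>> find_all_safe(
    const std::regex& re, const std::string& text) {
  std::vector<std::pair<std::size_t, std::size_t>> spans;
  try {
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
      const std::size_t start = static_cast<std::size_t>(it->position());
      spans.emplace_back(start, start + static_cast<std::size_t>(it->length()));
    }
  } catch (const std::regex_error&) {
    // Assumption: keep the matches found so far instead of crashing the caller.
  }
  return spans;
}
```

The caller then sees a possibly shorter result instead of an uncaught exception; whether to log or report the failure is left open here.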

Fallback chain

Pattern compilation now follows: RE2 → PCRE2 (with UTF) → PCRE2 (without UTF) → std::regex
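As a rough sketch of this ordering, the chain can be modeled as a list of compile attempts tried in sequence. The names `CompileFn` and `first_accepting_backend` are hypothetical and do not come from this repository; each backend is reduced to a callable that reports whether it accepted the pattern.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of one backend: returns true if it compiled the pattern.
using CompileFn = std::function<bool(const std::string&)>;

// Try each (name, backend) pair in order and return the name of the first
// backend that accepts the pattern, or an empty string if all of them fail.
std::string first_accepting_backend(
    const std::vector<std::pair<std::string, CompileFn>>& chain,
    const std::string& pattern) {
  for (const auto& entry : chain) {
    if (entry.second(pattern)) {
      return entry.first;
    }
  }
  return std::string();
}
```

In this model the chain would hold four entries in the order RE2, PCRE2 with UTF, PCRE2 without UTF, and std::regex, so a pattern containing invalid UTF-8 falls through to the third entry instead of failing outright.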

Example

// Previously failed: "UTF-8 error: overlong 2-byte sequence"
const std::string pattern = "test\xC0\x80pattern";
Pcre2Regex regex;
regex.compile(pattern);  // Now succeeds via non-UTF retry
Original prompt

Look at pytorch/executorch#14432 and pytorch/executorch#16631

Can you fix on this repo?



@meta-cla meta-cla bot added the CLA Signed label Feb 3, 2026
Copilot AI and others added 2 commits February 3, 2026 02:17
- Retry PCRE2 compilation without PCRE2_UTF flag when UTF-8 validation fails
- Catch std::regex_error exceptions in StdRegex::find_all() to prevent crashes
- Add test for invalid UTF-8 pattern handling

Co-authored-by: kirklandsign <107070759+kirklandsign@users.noreply.github.com>
Co-authored-by: kirklandsign <107070759+kirklandsign@users.noreply.github.com>
@kirklandsign kirklandsign marked this pull request as ready for review February 3, 2026 02:20
Copilot AI and others added 2 commits February 3, 2026 02:21
Co-authored-by: kirklandsign <107070759+kirklandsign@users.noreply.github.com>
Co-authored-by: kirklandsign <107070759+kirklandsign@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix issue with memory management in executorch" to "Fix PCRE2 UTF-8 validation errors for Qwen tokenizers" Feb 3, 2026
Copilot AI requested a review from kirklandsign February 3, 2026 02:24
.gitignore Outdated
Comment on lines 16 to 17
_codeql_build_dir/
_codeql_detected_source_root

Remove?

Co-authored-by: kirklandsign <107070759+kirklandsign@users.noreply.github.com>
@kirklandsign kirklandsign merged commit d43c287 into main Feb 4, 2026
4 checks passed
@kirklandsign kirklandsign deleted the copilot/fix-issue-14432 branch February 4, 2026 18:25