A specialized Rust regex engine designed for LLM tokenization and complex pattern matching.
Originally created as the regex backend for splintr, an LLM tokenizer. Passes compliance tests for industry-standard tokenizer patterns (OpenAI's
cl100k_base, Meta's Llama 3). Please report issues on the Issue Tracker.
This is a specialized tool, not a general-purpose replacement.
The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.
Only use regexr if you specifically need:
- Lookarounds: You need `(?=...)`, `(?<=...)`, or `(?!\S)` without a C build dependency.
  - Why not `regex`? It intentionally omits lookarounds to guarantee linear time.
  - Why not `fancy-regex`? No JIT compilation.
- JIT Compilation: You want native code generation for hot regex patterns.
  - Why not `regex`/`fancy-regex`? Neither offers JIT compilation.
  - Why not `pcre2`? Requires installing a system C library (libpcre2).
- No System Dependencies: Just `cargo add regexr`; no C toolchain or system libraries required. All code (including JIT) compiles with `cargo build`.
- WASM Support: Compiles to `wasm32-unknown-unknown` with `--no-default-features`. The API and features are the same across native and WASM targets: JIT and SIMD are disabled automatically, but lookarounds, backreferences, and ReDoS protection all work. No need to swap crates per platform.
- Bounded Execution: ReDoS protection that memoizes states (guaranteeing completion) rather than just aborting after a timeout (like `pcre2`).
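To illustrate what "memoizing states" buys over a timeout, here is a minimal sketch of a memoized backtracking matcher for a toy pattern language (literal bytes, `.`, and `x*`). It is not regexr's implementation, and the function names `matches_memoized` and `go` are hypothetical. The idea it shows is general: once a (pattern position, input position) pair is known to fail, it is never explored again, so total work is bounded by roughly pattern length times input length.

```rust
use std::collections::HashSet;

/// Toy pattern language (an assumption for this sketch, not regexr syntax):
/// literal bytes, `.` for any byte, and `x*` for zero or more of `x`.
/// Returns true if the whole input matches the whole pattern.
fn matches_memoized(pattern: &[u8], input: &[u8]) -> bool {
    // (pattern index, input index) pairs already proven to fail.
    let mut failed: HashSet<(usize, usize)> = HashSet::new();
    go(pattern, input, 0, 0, &mut failed)
}

fn go(p: &[u8], s: &[u8], pi: usize, si: usize, failed: &mut HashSet<(usize, usize)>) -> bool {
    // A state that failed once will fail again: skip it.
    if failed.contains(&(pi, si)) {
        return false;
    }
    let ok = if pi == p.len() {
        si == s.len()
    } else {
        // Does the current pattern byte accept the current input byte?
        let first = si < s.len() && (p[pi] == b'.' || p[pi] == s[si]);
        if pi + 1 < p.len() && p[pi + 1] == b'*' {
            // Either skip `x*` entirely, or consume one byte and stay on `x*`.
            go(p, s, pi + 2, si, failed) || (first && go(p, s, pi, si + 1, failed))
        } else {
            first && go(p, s, pi + 1, si + 1, failed)
        }
    };
    if !ok {
        failed.insert((pi, si));
    }
    ok
}
```

Without the `failed` set, a pattern such as `a*a*a*a*a*a*a*a*a*a*b` against a long run of `a` explores a combinatorial number of paths before giving up; with it, each state is visited at most once and the match always completes.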
Developers building LLM tokenizers (like GPT-4 or Llama 3) currently face a dilemma in Rust:
- `regex` crate: Fast, safe, but lacks lookarounds and JIT compilation.
- `fancy-regex`: Supports lookarounds, but lacks JIT compilation.
- `pcre2`: Supports everything including JIT, but requires system C library installation and cross-compilation setup.
regexr bridges this gap. It provides Lookarounds + JIT compilation + Backreferences with no system dependencies — just cargo add and go.
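To make the lookaround requirement concrete: cl100k_base-style patterns contain the alternative `\s+(?!\S)`, which matches a run of whitespace only up to the point where the next character is not a non-whitespace character. The hand-written sketch below emulates that single alternative at a fixed offset, without any regex engine. The function name `match_trailing_ws` is hypothetical, and `char::is_whitespace` is used as an approximation of `\s`.

```rust
/// Emulates the regex `\s+(?!\S)` anchored at byte offset `start`
/// (`start` must lie on a char boundary).
/// Returns the end offset of the match, or None if there is no match.
fn match_trailing_ws(text: &str, start: usize) -> Option<usize> {
    let mut end = start; // end of the whitespace run
    let mut last_ws = start; // start offset of the last whitespace char in the run
    for (off, ch) in text[start..].char_indices() {
        if !ch.is_whitespace() {
            break;
        }
        last_ws = start + off;
        end = start + off + ch.len_utf8();
    }
    if end == start {
        return None; // `\s+` needs at least one whitespace char
    }
    if end == text.len() {
        return Some(end); // nothing follows, so the lookahead holds
    }
    // A non-whitespace char follows the run. Back off by one whitespace char
    // so that the char after the match is whitespace and `(?!\S)` succeeds.
    if last_ws > start {
        Some(last_ws)
    } else {
        None
    }
}
```

For `"a   b"` at offset 1, the match covers two of the three spaces and leaves the last one to attach to the following word, which is the behavior tokenizer patterns rely on. An engine without lookahead cannot express this in a single pattern.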
Add this to your Cargo.toml:
[dependencies]
regexr = "*"

For JIT compilation support:
[dependencies]
regexr = { version = "*", features = ["full"] }

use regexr::Regex;
let re = Regex::new(r"\w+").unwrap();
assert!(re.is_match("hello"));
// Find first match
if let Some(m) = re.find("hello world") {
println!("Found: {}", m.as_str()); // "hello"
}
// Find all matches
for m in re.find_iter("hello world") {
println!("{}", m.as_str());
}

use regexr::Regex;
let re = Regex::new(r"(\w+)@(\w+)\.(\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();
println!("{}", &caps[0]); // "user@example.com"
println!("{}", &caps[1]); // "user"
println!("{}", &caps[2]); // "example"
println!("{}", &caps[3]); // "com"

use regexr::Regex;
let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();
println!("{}", &caps["user"]); // "user"
println!("{}", &caps["domain"]); // "example.com"

Enable JIT for patterns that will be matched many times:
use regexr::RegexBuilder;
let re = RegexBuilder::new(r"\w+")
.jit(true)
.build()
.unwrap();
assert!(re.is_match("hello"));

For patterns with many literal alternatives (e.g., keyword matching in tokenizers):
use regexr::RegexBuilder;
let re = RegexBuilder::new(r"(function|for|while|if|else|return)")
.optimize_prefixes(true)
.build()
.unwrap();
assert!(re.is_match("function"));

use regexr::Regex;
let re = Regex::new(r"\d+").unwrap();
// Replace first match
let result = re.replace("abc 123 def", "NUM");
assert_eq!(result, "abc NUM def");
// Replace all matches
let result = re.replace_all("abc 123 def 456", "NUM");
assert_eq!(result, "abc NUM def NUM");

- `simd` (default): Enables SIMD-accelerated literal search
- `jit`: Enables JIT compilation (x86-64 and ARM64)
- `full`: Enables both JIT and SIMD
| Platform | JIT Support | SIMD Support |
|---|---|---|
| Linux x86-64 | ✓ | ✓ (AVX2) |
| Linux ARM64 | ✓ | ✗ |
| macOS x86-64 | ✓ | ✓ (AVX2) |
| macOS ARM64 (Apple Silicon) | ✓ | ✗ |
| Windows x86-64 | ✓ | ✓ (AVX2) |
| WASM (wasm32) | ✗ | ✗ |
| Other | ✗ | ✗ |
Build without default features for a minimal installation (also works for WASM):
cargo build --no-default-features # Minimal (PikeVM + LazyDFA only)
cargo build --no-default-features --target wasm32-unknown-unknown # WASM target

Build with all optimizations:

cargo build --features "full"

The library automatically selects the best execution engine based on pattern characteristics:
Non-JIT mode (default):
- ShiftOr: Small patterns (≤64 states) without anchors/word boundaries
- EagerDfa: Patterns with word boundaries or anchors
- LazyDfa: General patterns with on-demand state construction
- BacktrackingVm: Patterns with backreferences
- PikeVm: Patterns with lookaround or non-greedy quantifiers
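The 64-state limit for ShiftOr follows from packing one state per bit of a machine word. As a rough illustration, here is the classic Shift-Or (bitap) search for a plain literal of up to 64 bytes. This is the textbook algorithm, not regexr's engine, which handles general small patterns; the function name `shift_or_find` is hypothetical.

```rust
/// Classic Shift-Or (bitap) search for a literal pattern of 1 to 64 bytes.
/// Returns the byte offset of the first occurrence, if any.
/// Patterns outside that length range are rejected with None in this sketch.
fn shift_or_find(pattern: &[u8], text: &[u8]) -> Option<usize> {
    let m = pattern.len();
    if m == 0 || m > 64 {
        return None;
    }
    // masks[b] has bit i cleared exactly when pattern[i] == b.
    let mut masks = [!0u64; 256];
    for (i, &b) in pattern.iter().enumerate() {
        masks[b as usize] &= !(1u64 << i);
    }
    let accept = 1u64 << (m - 1);
    // Bit i of `state` is 0 when pattern[..=i] matches the text ending here.
    let mut state = !0u64;
    for (pos, &b) in text.iter().enumerate() {
        // Shifting in a 0 starts a new candidate match at every position.
        state = (state << 1) | masks[b as usize];
        if state & accept == 0 {
            return Some(pos + 1 - m);
        }
    }
    None
}
```

Each input byte costs one shift and one OR regardless of pattern length, which is why the approach is attractive for small patterns and stops applying once the state set no longer fits in a word.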
JIT mode (with jit feature):
- BacktrackingJit: Patterns with backreferences
- TaggedNfa: Patterns with lookaround or non-greedy quantifiers
- JitShiftOr: Small patterns with alternations
- DFA JIT: General patterns, benefits from SIMD prefiltering
See docs/architecture.md for details on the engine selection logic.
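The non-JIT list above can be read as a decision function. The sketch below is one possible encoding of it; the `PatternInfo` struct, its field names, and the order in which the checks are applied are assumptions made for illustration, not the library's actual types or selection logic.

```rust
/// Engine names taken from the non-JIT list; the enum itself is hypothetical.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Engine {
    ShiftOr,
    EagerDfa,
    LazyDfa,
    BacktrackingVm,
    PikeVm,
}

/// Hypothetical summary of a compiled pattern.
struct PatternInfo {
    states: usize,
    has_backrefs: bool,
    has_lookaround: bool,
    has_non_greedy: bool,
    has_anchors_or_word_boundaries: bool,
}

/// Picks an engine from pattern characteristics.
/// The priority order here is an assumption of this sketch.
fn select_engine(p: &PatternInfo) -> Engine {
    if p.has_backrefs {
        Engine::BacktrackingVm
    } else if p.has_lookaround || p.has_non_greedy {
        Engine::PikeVm
    } else if p.has_anchors_or_word_boundaries {
        Engine::EagerDfa
    } else if p.states <= 64 {
        Engine::ShiftOr
    } else {
        Engine::LazyDfa
    }
}
```

Features that a DFA cannot express (backreferences, lookaround) are checked first in this sketch, so the faster automaton-based engines are only chosen when the pattern allows them.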
Highlights (speedup relative to the regex crate; higher is better):
| Benchmark | regexr | regexr-jit | pcre2-jit |
|---|---|---|---|
| log_parsing | 0.80-0.84x | 3.91-4.09x | 3.57-3.71x |
| url_extraction | 0.81-0.83x | 1.95-1.99x | 2.10-2.13x |
| unicode_letters | 1.24x | 1.43-1.44x | 1.65-1.72x |
| html_tags | 0.82-0.87x | 1.33-1.43x | 0.80-0.85x |
| word_boundary | 1.19-1.24x | 1.15-1.19x | 0.72-0.74x |
| email_validation | 0.99-1.00x | 1.00-1.11x | 0.94-1.00x |
| alternation | 0.88-1.01x | 0.88-1.01x | 0.12-0.15x |
- `regexr-jit` excels at log parsing (about 4x faster than `regex`)
- `regexr` (non-JIT) stays close to `regex` performance on most patterns (0.80x to 1.24x in the table above)
- Both outperform `fancy-regex` and `pcre2` (non-JIT) consistently
- Architecture Overview - Engine architecture and selection logic
- Features - Detailed feature documentation
If you use regexr in your research, please cite:
@software{regexr2025,
author = {Syah, Farhan},
title = {regexr: A Rust Regex Engine with JIT Compilation for LLM Tokenization},
year = {2025},
url = {https://github.com/ml-rust/regexr},
note = {Rust regex engine with lookaround support and JIT compilation}
}
