feat: optional features, SentencePiece byte-fallback decoding #14

Merged
farhan-syah merged 5 commits into main from 0.9.0 on Mar 12, 2026

@farhan-syah (Collaborator)

Summary

  • Make rayon and regexr-jit optional features (both default-enabled) so WASM/embedded targets can opt out via --no-default-features
  • Fix SentencePiece decoding of <0xNN> byte-fallback tokens — properly reconstruct multi-byte UTF-8 sequences instead of passing through literal token strings
  • Gate all rayon call sites behind #[cfg(feature = "rayon")] with sequential fallbacks
  • Remove pcre2 from default features, making regexr the sole default backend
  • Context-aware leading-space stripping: only strip for multi-token sequences to preserve word boundaries in streaming decode

Test plan

  • All 65 tests pass across feature combinations (default, --no-default-features, explicit features)
  • Zero clippy warnings
  • New unit tests for parse_byte_fallback() edge cases (valid hex, invalid formats, boundary values)
  • Integration test for byte-fallback UTF-8 reconstruction (e.g. é via <0xC3><0xA9>)

Implement a SentencePiece unigram tokenizer alongside the existing BPE
tokenizer, enabling support for Mistral V1/V2 and other models that use
SentencePiece rather than tiktoken-style BPE.

Core implementation in src/core/sentencepiece.rs:
- Greedy longest-match encoding with score-based tie-breaking
- BOS prepending and ▁ word boundary handling
- Lossless and lossy decode paths with BOS/EOS skipping
- SentencePieceError type for out-of-range token IDs
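
The encoding steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in src/core/sentencepiece.rs: the names `normalize` and `encode_greedy`, the `HashMap` vocabulary, and the `unk_id` parameter are all assumptions, and score-based tie-breaking is left out because the commit message does not say how ties arise.

```rust
use std::collections::HashMap;

/// SentencePiece word-boundary marker (U+2581, rendered as `▁`).
const WORD_BOUNDARY: char = '\u{2581}';

/// Hypothetical helper: replace spaces with `▁` and prefix one `▁`
/// so that the first word also carries a boundary marker.
fn normalize(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 3);
    out.push(WORD_BOUNDARY);
    for ch in text.chars() {
        out.push(if ch == ' ' { WORD_BOUNDARY } else { ch });
    }
    out
}

/// Greedy longest-match encoding over a piece -> id vocabulary (assumed shape).
/// Characters with no matching piece map to `unk_id`; BOS is prepended if given.
fn encode_greedy(
    text: &str,
    vocab: &HashMap<String, u32>,
    bos_id: Option<u32>,
    unk_id: u32,
) -> Vec<u32> {
    let normalized = normalize(text);
    // Byte offsets of every char boundary, plus the end of the string.
    let bounds: Vec<usize> = normalized
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(normalized.len()))
        .collect();

    let mut ids: Vec<u32> = bos_id.into_iter().collect();
    let mut start = 0; // index into `bounds`
    while start + 1 < bounds.len() {
        // Try the longest candidate first, shrinking one char at a time.
        let mut matched = None;
        for end in (start + 1..bounds.len()).rev() {
            let piece = &normalized[bounds[start]..bounds[end]];
            if let Some(&id) = vocab.get(piece) {
                matched = Some((id, end));
                break;
            }
        }
        match matched {
            Some((id, end)) => {
                ids.push(id);
                start = end;
            }
            None => {
                // No piece covers this character: emit the unknown id.
                ids.push(unk_id);
                start += 1;
            }
        }
    }
    ids
}
```

The inner loop is quadratic in the worst case; a trie over the vocabulary would be the usual way to speed it up.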

Public API surface:
- SentencePieceTokenizer and SentencePieceError re-exported from crate root
- PySentencePieceTokenizer PyO3 wrapper with encode, decode, decode_lossy
- bos_token_id_by_name added to pretrained API for symmetry with eos_token_id_by_name
- SentencePieceTokenizer registered in the Python _core module and __init__.py

Documentation updated to cover both BPE and SentencePiece APIs in the
Rust and Python reference sections of docs/api_guide.md and README.md.

Update version to 0.9.0 across Cargo.toml, pyproject.toml, .version,
and uv.lock. Also update the crate description and keywords to reflect
the addition of SentencePiece alongside BPE.

Make regexr the sole default backend. The pcre2 feature remains
available as an explicit opt-in for benchmarking purposes.

Introduce `rayon` and `regexr-jit` as named optional features, both
enabled by default, so that WASM and embedded targets can opt out via
`--no-default-features`. Add a `wasm` feature as a no-op marker that
documents the intended build profile.

Move regexr's jit/simd flags under the new `regexr-jit` feature rather
than hardcoding them unconditionally.

…on usage

Decode `<0xNN>` byte-fallback tokens by accumulating raw bytes and
converting via `from_utf8_lossy`, which correctly reconstructs multi-byte
UTF-8 sequences (e.g. 'é' encoded as <0xC3><0xA9>). Previously these
tokens were passed through as literal strings, producing garbled output
for non-ASCII text.
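
A minimal sketch of the byte-accumulation approach described above. The name `parse_byte_fallback` comes from the test plan, but its body here is an assumption, and `decode_pieces` is a hypothetical helper; `▁` handling is omitted to keep the focus on byte reassembly.

```rust
/// Parse a token of the exact form `<0xNN>` (two hex digits) into a byte.
/// Anything else, including `<0x1>` or `<0xG1>`, yields `None`.
fn parse_byte_fallback(token: &str) -> Option<u8> {
    let hex = token.strip_prefix("<0x")?.strip_suffix('>')?;
    // Check the digits explicitly: `from_str_radix` would accept a leading `+`.
    if hex.len() != 2 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}

/// Decode piece strings into text, buffering raw bytes so that a
/// multi-byte UTF-8 sequence split across several tokens is reassembled.
fn decode_pieces(pieces: &[&str]) -> String {
    let mut bytes: Vec<u8> = Vec::new();
    for piece in pieces {
        match parse_byte_fallback(piece) {
            Some(b) => bytes.push(b),
            None => bytes.extend_from_slice(piece.as_bytes()),
        }
    }
    // Invalid or incomplete sequences become U+FFFD instead of failing.
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Converting once at the end, rather than per token, is what lets `<0xC3>` and `<0xA9>` combine into a single `é`.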

Preserve the leading-space stripping only for multi-token sequences so
that single-token streaming decode does not lose meaningful word
boundaries.
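
One way to read the rule above, as a rough sketch: the helper name `finish_decode` and the `pieces.len() > 1` condition are assumptions drawn from the phrase "multi-token sequences", not the crate's actual code.

```rust
/// Join pieces, map `▁` to a space, and strip the leading space only
/// when more than one piece was decoded together.
fn finish_decode(pieces: &[&str]) -> String {
    let text: String = pieces.concat().replace('\u{2581}', " ");
    if pieces.len() > 1 {
        // Full-sequence decode: the leading space is an artifact of the
        // boundary marker on the first word, so drop it.
        text.strip_prefix(' ').unwrap_or(text.as_str()).to_string()
    } else {
        // Single-token (streaming) decode: keep the space, since it is
        // the only separator between this word and the previous chunk.
        text
    }
}
```

With this rule, a streaming caller that decodes `▁Hello` and `▁world` one token at a time gets ` Hello` and ` world`, which concatenate with the word boundary intact.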

Gate all `rayon` call sites behind `#[cfg(feature = "rayon")]` with
sequential fallbacks, enabling the tokenizer to build for WASM and other
no-std targets without changes to call sites.
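
The gating pattern could look something like the sketch below. `encode_one` and `encode_batch` are hypothetical names used only for illustration; the point is that both variants share one signature, so callers compile unchanged whether or not the `rayon` feature is enabled.

```rust
#[cfg(feature = "rayon")]
use rayon::prelude::*;

/// Stand-in for a real per-text encoder (hypothetical): one id per byte.
fn encode_one(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Parallel path, compiled only when the `rayon` feature is enabled.
#[cfg(feature = "rayon")]
fn encode_batch(texts: &[&str]) -> Vec<Vec<u32>> {
    texts.par_iter().map(|t| encode_one(t)).collect()
}

/// Sequential fallback with the same signature, used when `rayon` is off
/// (for example with `--no-default-features`).
#[cfg(not(feature = "rayon"))]
fn encode_batch(texts: &[&str]) -> Vec<Vec<u32>> {
    texts.iter().map(|t| encode_one(t)).collect()
}
```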

farhan-syah merged commit 2b663f3 into main on Mar 12, 2026
13 checks passed