feat: optional features, SentencePiece byte-fallback decoding#14
Merged
farhan-syah merged 5 commits intomainfrom Mar 12, 2026
Merged
feat: optional features, SentencePiece byte-fallback decoding#14farhan-syah merged 5 commits intomainfrom
farhan-syah merged 5 commits intomainfrom
Conversation
Implement a SentencePiece unigram tokenizer alongside the existing BPE tokenizer, enabling support for Mistral V1/V2 and other models that use SentencePiece rather than tiktoken-style BPE. Core implementation in src/core/sentencepiece.rs: - Greedy longest-match encoding with score-based tie-breaking - BOS prepension and ▁ word boundary handling - Lossless and lossy decode paths with BOS/EOS skipping - SentencePieceError type for out-of-range token IDs Public API surface: - SentencePieceTokenizer and SentencePieceError re-exported from crate root - PySentencePieceTokenizer PyO3 wrapper with encode, decode, decode_lossy - bos_token_id_by_name added to pretrained API for symmetry with eos_token_id_by_name - SentencePieceTokenizer registered in the Python _core module and __init__.py Documentation updated to cover both BPE and SentencePiece APIs in the Rust and Python reference sections of docs/api_guide.md and README.md.
Update version to 0.9.0 across Cargo.toml, pyproject.toml, .version, and uv.lock. Also update the crate description and keywords to reflect the addition of SentencePiece alongside BPE.
Make regexr the sole default backend. The pcre2 feature remains available as an explicit opt-in for benchmarking purposes.
Introduce `rayon` and `regexr-jit` as named optional features, both enabled by default, so that WASM and embedded targets can opt out via `--no-default-features`. Add a `wasm` feature as a no-op marker that documents the intended build profile. Move regexr's jit/simd flags under the new `regexr-jit` feature rather than hardcoding them unconditionally.
…on usage Decode `<0xNN>` byte-fallback tokens by accumulating raw bytes and converting via `from_utf8_lossy`, which correctly reconstructs multi-byte UTF-8 sequences (e.g. 'é' encoded as <0xC3><0xA9>). Previously these tokens were passed through as literal strings, producing garbled output for non-ASCII text. Preserve the leading-space stripping only for multi-token sequences so that single-token streaming decode does not lose meaningful word boundaries. Gate all `rayon` call sites behind `#[cfg(feature = "rayon")]` with sequential fallbacks, enabling the tokenizer to build for WASM and other no-std targets without changes to call sites.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rayonandregexr-jitoptional features (both default-enabled) so WASM/embedded targets can opt out via--no-default-features<0xNN>byte-fallback tokens — properly reconstruct multi-byte UTF-8 sequences instead of passing through literal token strings#[cfg(feature = "rayon")]with sequential fallbacksTest plan
--no-default-features, explicit features)parse_byte_fallback()edge cases (valid hex, invalid formats, boundary values)évia<0xC3><0xA9>)