feat: add DeepSeek V3 tokenizer with ByteLevel BPE support #6
Merged
farhan-syah merged 4 commits into main on Nov 26, 2025
Add complete DeepSeek V3 tokenizer implementation with 128,000 token vocabulary and 71 special tokens (17 native + 54 agent). DeepSeek V3 uses ByteLevel BPE encoding, requiring new infrastructure for handling arbitrary byte sequences through printable Unicode characters.

DeepSeek V3 features:
- Native thinking tokens (<think>, </think>) for CoT reasoning
- Native role tokens (<|User|>, <|Assistant|>, <|EOT|>)
- FIM tokens for code completion (<|fim▁hole|>, etc.)
- Tool calling tokens for function execution
- Compatible with DeepSeek V3 and DeepSeek R1 models

ByteLevel BPE infrastructure:
- Implement encoder/decoder with bijective byte-to-char mapping
- Extend Tokenizer with use_byte_level flag and constructors
- Preserve printable ASCII/Latin-1, map control chars to U+0100+
- Add comprehensive documentation in docs/bytelevel_bpe.md
- Lazy-initialized lookup tables for optimal performance

Additional improvements:
- Add "Exact Token ID Tests" to all tokenizers (cl100k, o200k, llama3)
- Verify specific token IDs to prevent encoding/vocabulary regressions
- Test cases cover: basic text, Chinese, emojis with expected IDs
- Vocabulary conversion script: scripts/convert_deepseek_vocab.py
- Update README and docs/special_tokens.md with DeepSeek V3 info
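The bijective byte-to-char mapping described above can be sketched in Python. This is a minimal illustration, not the PR's Rust implementation: it assumes the mapping follows the GPT-2 convention (printable ASCII and Latin-1 bytes map to themselves, every other byte is shifted to U+0100 and upward in order), and the names `byte_tables`, `byte_level_encode`, and `byte_level_decode` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def byte_tables() -> tuple[dict[int, str], dict[str, int]]:
    """Build the byte<->char tables once, on first use (lazy initialization)."""
    # Bytes that are already printable keep their own code point:
    # '!'..'~', '¡'..'¬', and '®'..'ÿ'.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_to_char = {b: chr(b) for b in printable}
    # Everything else (control chars, space, 0x7F-0xA0, 0xAD) is assigned
    # the next free code point starting at U+0100.
    offset = 0
    for b in range(256):
        if b not in byte_to_char:
            byte_to_char[b] = chr(0x100 + offset)
            offset += 1
    char_to_byte = {c: b for b, c in byte_to_char.items()}
    return byte_to_char, char_to_byte


def byte_level_encode(text: str) -> str:
    """Map the UTF-8 bytes of `text` to printable Unicode characters."""
    byte_to_char, _ = byte_tables()
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))


def byte_level_decode(encoded: str) -> bytes:
    """Invert byte_level_encode, returning the raw bytes."""
    _, char_to_byte = byte_tables()
    return bytes(char_to_byte[c] for c in encoded)
```

Under this assumed convention a leading space becomes U+0120 ("Ġ"), and because the mapping covers all 256 byte values exactly once, any byte sequence round-trips without loss.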
Add ByteLevelStreamingDecoder to handle streaming decode for tokenizers using ByteLevel BPE encoding (DeepSeek V3, GPT-2). The decoder performs two-stage decoding: first converts ByteLevel-encoded token bytes back to raw bytes, then assembles them into valid UTF-8 strings.

Changes:
- Implement ByteLevelStreamingDecoder in Rust core with UTF-8 buffering
- Add Python bindings with full API (add_token, add_tokens, flush, reset)
- Export from all module levels (src/core, src/lib, src/python, python)
- Add comprehensive test suite covering ASCII, Unicode, emoji, special tokens
- Document streaming decoder usage in bytelevel_bpe.md with examples
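The two-stage decode can be illustrated with a short Python sketch. It is not the library's decoder: the class name `ByteLevelStreamingSketch` is hypothetical, it takes ByteLevel-encoded token strings rather than token IDs (no vocabulary is assumed), it does not handle special tokens, and it assumes the GPT-2 style byte mapping. Only the method names mirror the ones listed above.

```python
import codecs


def byte_to_char_table() -> dict[int, str]:
    """GPT-2 style byte-to-char table (assumed, not taken from the PR)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(0x100 + offset)
            offset += 1
    return table


class ByteLevelStreamingSketch:
    """Hypothetical stand-in that shows the two decoding stages."""

    def __init__(self) -> None:
        self._char_to_byte = {c: b for b, c in byte_to_char_table().items()}
        # Stage 2 state: buffers incomplete UTF-8 sequences between calls.
        self._utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def add_token(self, token_text: str) -> str:
        # Stage 1: ByteLevel characters back to raw bytes.
        raw = bytes(self._char_to_byte[c] for c in token_text)
        # Stage 2: emit only the text that is complete so far.
        return self._utf8.decode(raw)

    def add_tokens(self, tokens: list[str]) -> str:
        return "".join(self.add_token(t) for t in tokens)

    def flush(self) -> str:
        # Any dangling partial sequence becomes U+FFFD under errors="replace".
        return self._utf8.decode(b"", final=True)

    def reset(self) -> None:
        self._utf8.reset()
```

When a multi-byte character such as an emoji is split across two tokens, the first `add_token` call returns an empty string and the second returns the whole character, which is the behavior that UTF-8 buffering is meant to provide.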
Create comprehensive API guide and streamline README for better maintainability. Move detailed API documentation, usage examples, and performance tips to a dedicated guide.

Changes:
- Add docs/api_guide.md with complete Python and Rust API reference
- Include detailed examples for encoding, decoding, and streaming
- Document both StreamingDecoder and ByteLevelStreamingDecoder usage
- Add performance optimization tips and best practices
- Streamline README to focus on quick start and key features
- Replace verbose inline docs with links to API guide
Force-pushed from ad47a9c to b1761be
Summary

DeepSeek V3 Tokenizer
- <think>, </think>, <|User|>, <|Assistant|>, <|EOT|>, FIM tokens, tool calling tokens

ByteLevel Streaming Decoder
- ByteLevelStreamingDecoder for token-by-token LLM output decoding
- tokenizer.byte_level_streaming_decoder() in Python

Documentation
- /docs/api_guide.md with comprehensive API reference and examples
- /docs/bytelevel_bpe.md explaining ByteLevel encoding
- /docs/special_tokens.md with DeepSeek V3 tokens