Blueprint: `custom-tokenizer-blueprint-v4-final.md` (Final)
Hardware target: RTX 3060 12 GB VRAM · 16–32 GB RAM · NVMe SSD
Budget: ₹0 — all tools, datasets, frameworks are free
Primary language: Rust
Secondary: Python (Morfessor, corpus pipeline, BPE training, PyO3 bindings, OPUS proxy)
Single shared tokenizer for the Hierarchical Multi-Agent AI System (Orchestrator, Math, Code, Debug+QA, Finance agents). Supersedes blueprint v1–v3.
Key properties:
- 52,000 vocabulary (`u16` IDs, 2 bytes each)
- O(n) BPE via `bpe = "0.3"` (github/rust-gems, MIT)
- MorphBPE: Morfessor 2.0 boundary enforcement
- F-gram compression: top-1,488 phrases via Aho-Corasick (`daachorse`)
- 256 special token slots (IDs 10,256–10,511): D-CoT, SkillNet, PaCoRe, RAGEN-2 anchors
- `tokenizer.bin` cold-start < 10 ms via `memmap2`
- PyO3 wheel: `encode`, `decode`, `check_dcot_tags`, `encode_fast`, `save_pretrained`
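
Since the vocabulary holds 52,000 entries, every ID fits in a `u16` (maximum 65,535), so an encoded sequence costs 2 bytes per token. The following is a minimal sketch of that packing, not the crate's actual serialization: the names `VOCAB_SIZE`, `pack_ids`, and `unpack_ids` are illustrative, and little-endian byte order is an assumption.

```rust
/// Vocabulary size stated in the key properties; it must fit in the u16 ID space.
const VOCAB_SIZE: u32 = 52_000;

/// Pack token IDs into a byte buffer, 2 bytes per ID.
/// Little-endian order is an assumption made for this sketch.
fn pack_ids(ids: &[u16]) -> Vec<u8> {
    ids.iter().flat_map(|id| id.to_le_bytes()).collect()
}

/// Inverse of `pack_ids`; returns `None` if the buffer length is odd.
fn unpack_ids(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect(),
    )
}
```

A 1,000-token sequence packed this way occupies 2,000 bytes, half of what `u32` IDs would need.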
| Crate | Role |
|---|---|
| `tokenizer-core` | Runtime: deployed in AI system |
| `tokenizer-train` | Offline training binary |
| `tokenizer-python` | PyO3 wheel for training pipeline |
| `tokenizer-bench` | Criterion benchmarks |
| `tokenizer-cli` | Debug / inspection tool |
| ID Range | Category | Count |
|---|---|---|
| 0–255 | Byte fallback | 256 |
| 256–10,255 | Integers 0–9,999 | 10,000 |
| 10,256–10,511 | Special / control | 256 |
| 10,512–42,511 | BPE base (English) | 32,000 |
| 42,512–50,511 | Domain extensions | 8,000 |
| 50,512–51,999 | F-gram reserved | 1,488 |
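
The six ranges in the table cover IDs 0–51,999 with no gaps (256 + 10,000 + 256 + 32,000 + 8,000 + 1,488 = 52,000). The sketch below shows one way to classify an ID by range. `IdCategory`, `classify`, and `integer_token_id` are hypothetical names, and the `256 + n` mapping for integers is an assumption: the table gives only the range and count, not the ordering within it.

```rust
/// Category of a token ID, following the ID layout table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IdCategory {
    ByteFallback,
    Integer,
    Special,
    BpeBase,
    DomainExtension,
    FGram,
}

/// Map a raw ID to its category; IDs of 52,000 and above are outside the vocabulary.
fn classify(id: u16) -> Option<IdCategory> {
    match id {
        0..=255 => Some(IdCategory::ByteFallback),
        256..=10_255 => Some(IdCategory::Integer),
        10_256..=10_511 => Some(IdCategory::Special),
        10_512..=42_511 => Some(IdCategory::BpeBase),
        42_512..=50_511 => Some(IdCategory::DomainExtension),
        50_512..=51_999 => Some(IdCategory::FGram),
        _ => None,
    }
}

/// ASSUMPTION: integer n (0..=9,999) is stored at ID 256 + n.
/// The table fixes the range only; the real ordering may differ.
fn integer_token_id(n: u16) -> Option<u16> {
    if n <= 9_999 {
        Some(256 + n)
    } else {
        None
    }
}
```

Under that assumed mapping, `integer_token_id(9_999)` yields 10,255, the last ID in the integer range.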
```bash
# Prerequisites
rustup install nightly
cargo install cargo-fuzz maturin

# Check workspace compiles
cargo check --workspace

# Run Gate 0 tests (requires trained tokenizer_v1.bin)
cargo test --release --test gate0_validation

# Benchmarks
cargo bench --bench load_bench
cargo bench --bench encode_bench

# Build PyO3 wheel
cd tokenizer-python
maturin build --release
pip install target/wheels/your_tokenizer-*.whl

# CLI usage
echo "pub fn main() -> Result<(), Error>" | cargo run --bin tokenize -- --show-ids --show-domains
```

v1 corpus (10B) → Gate 0 → v2 corpus (50B) → OPUS proxy → CPT → [SFT → F-GRPO → Rollout-SFT] × 3 → quantize → Gate 1

See `scripts/` for each step. See blueprint §11 for the full step-by-step with time estimates.
- `tokenizer_v1.bin` produced, loads in Rust in < 50 ms
- All D-CoT tags → single tokens (RAGEN-2 requirement)
- All SkillNet tags → single tokens (PROPHET requirement)
- All PaCoRe chain tags → single tokens
- Integers 0–9,999 → single tokens (SymPy verifier)
- Rust operators atomic (`->`, `=>`, `::`, `..=`, `||`, `&&`)
- Roundtrip encode/decode passes all 10 domain cases
- `check_dcot_tags()` PyO3 function works from Python
- `cargo test --workspace` all pass
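
The single-token and roundtrip criteria above can be expressed as two small predicates. This is a sketch under stated assumptions, not the contents of `gate0_validation`: the `Tokenizer` trait is a hypothetical stand-in for the `tokenizer-core` API, and `ToyTokenizer` (with its arbitrary `TOY_ATOM_BASE`) exists only so the checks have something to run against.

```rust
/// Hypothetical minimal interface; the real tokenizer-core API may differ.
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u16>;
    fn decode(&self, ids: &[u16]) -> String;
}

/// True if `text` encodes to exactly one token.
fn is_single_token<T: Tokenizer>(tok: &T, text: &str) -> bool {
    tok.encode(text).len() == 1
}

/// True if decoding the encoding reproduces the input exactly.
fn roundtrips<T: Tokenizer>(tok: &T, text: &str) -> bool {
    tok.decode(&tok.encode(text)) == text
}

/// First ID handed to toy atoms; arbitrary, chosen only for this sketch.
const TOY_ATOM_BASE: u16 = 10_512;

/// Toy stand-in: listed atoms become single tokens, everything else
/// falls back to one token per byte (IDs 0-255).
struct ToyTokenizer {
    atoms: Vec<&'static str>,
}

impl Tokenizer for ToyTokenizer {
    fn encode(&self, text: &str) -> Vec<u16> {
        let bytes = text.as_bytes();
        let mut ids = Vec::new();
        let mut pos = 0;
        while pos < bytes.len() {
            // Longest atom matching at `pos`, if any.
            let hit = self
                .atoms
                .iter()
                .enumerate()
                .filter(|(_, atom)| {
                    !atom.is_empty() && bytes[pos..].starts_with(atom.as_bytes())
                })
                .max_by_key(|(_, atom)| atom.len());
            match hit {
                Some((index, atom)) => {
                    ids.push(TOY_ATOM_BASE + index as u16);
                    pos += atom.len();
                }
                None => {
                    // Byte fallback: emit the raw byte value as the ID.
                    ids.push(u16::from(bytes[pos]));
                    pos += 1;
                }
            }
        }
        ids
    }

    fn decode(&self, ids: &[u16]) -> String {
        let mut bytes = Vec::new();
        for &id in ids {
            if id < 256 {
                bytes.push(id as u8);
            } else if let Some(atom) = id
                .checked_sub(TOY_ATOM_BASE)
                .and_then(|i| self.atoms.get(usize::from(i)))
            {
                bytes.extend_from_slice(atom.as_bytes());
            }
            // Unknown IDs are skipped in this toy.
        }
        String::from_utf8_lossy(&bytes).into_owned()
    }
}
```

With the six Rust operators registered as atoms, `is_single_token` holds for each of them and `roundtrips` holds for the CLI example string `pub fn main() -> Result<(), Error>`. A real Gate 0 test would run the same predicates against the loaded `tokenizer_v1.bin`.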
| Paper | Relevance |
|---|---|
| github/rust-gems `bpe` (Dec 2024, MIT) | O(n) BPE engine |
| SCONE arXiv 2502.01637 | F-gram compression |
| MorphBPE arXiv 2502.00894 | Morpheme boundary training |
| KL3M arXiv 2503.17247 | Domain tokenizer baselines |
| RAGEN-2 (2025) | D-CoT tag reward signal |
| F-GRPO arXiv 2602.06717 | Rollout buffer token consistency |
| SkillNet (2025) | `<\|skill:X\|>` single-token routing |
| PaCoRe (2025) | `<\|chain:N\|>` parallel reasoning |
Blueprint v4.0 — April 2026. Tokenizer frozen at v2 before CPT begins.