SunayHegde2006/Arun

Arun — Custom Tokenizer

Blueprint: custom-tokenizer-blueprint-v4-final.md — Final
Hardware target: RTX 3060 12 GB VRAM · 16–32 GB RAM · NVMe SSD
Budget: ₹0 — all tools, datasets, frameworks are free
Primary language: Rust
Secondary: Python (Morfessor, corpus pipeline, BPE training, PyO3 bindings, OPUS proxy)


Purpose

Single shared tokenizer for the Hierarchical Multi-Agent AI System (Orchestrator, Math, Code, Debug+QA, Finance agents). Supersedes blueprint v1–v3.

Key properties:

  • 52,000-token vocabulary (IDs fit in a u16, 2 bytes each)
  • O(n) BPE via bpe = "0.3" (github/rust-gems, MIT)
  • MorphBPE: Morfessor 2.0 boundary enforcement
  • F-gram compression: top-1,488 phrases via Aho-Corasick (daachorse)
  • 256 special token slots (IDs 10,256–10,511): D-CoT, SkillNet, PaCoRe, RAGEN-2 anchors
  • tokenizer.bin cold-start < 10 ms via memmap2
  • PyO3 wheel: encode, decode, check_dcot_tags, encode_fast, save_pretrained
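Because every ID fits in a u16, a token sequence can be stored at two bytes per token. The following is a minimal sketch of that packing, not the tokenizer-core implementation: `pack_ids` and `unpack_ids` are hypothetical helper names, and little-endian byte order is an assumption, since the on-disk layout is not stated here.

```rust
/// Vocabulary size stated in the key properties; every valid ID is below this.
const VOCAB_SIZE: u16 = 52_000;

/// Pack token IDs into bytes, two bytes per ID.
/// Little-endian order is an assumption made for this sketch.
fn pack_ids(ids: &[u16]) -> Vec<u8> {
    ids.iter().flat_map(|id| id.to_le_bytes()).collect()
}

/// Unpack bytes produced by `pack_ids`.
/// Returns `None` on an odd byte count or an ID outside the vocabulary.
fn unpack_ids(bytes: &[u8]) -> Option<Vec<u16>> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    bytes
        .chunks_exact(2)
        .map(|pair| {
            let id = u16::from_le_bytes([pair[0], pair[1]]);
            // Reject IDs that the 52,000-entry vocabulary cannot contain.
            (id < VOCAB_SIZE).then_some(id)
        })
        .collect()
}
```

Rejecting IDs at or above 52,000 during unpacking catches corrupted buffers early, since a u16 can hold values up to 65,535 that the vocabulary never assigns.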

Workspace Members

| Crate | Role |
|---|---|
| tokenizer-core | Runtime: deployed in AI system |
| tokenizer-train | Offline training binary |
| tokenizer-python | PyO3 wheel for training pipeline |
| tokenizer-bench | Criterion benchmarks |
| tokenizer-cli | Debug / inspection tool |

Vocabulary Budget (52K)

| ID Range | Category | Count |
|---|---|---|
| 0–255 | Byte fallback | 256 |
| 256–10,255 | Integers 0–9,999 | 10,000 |
| 10,256–10,511 | Special / control | 256 |
| 10,512–42,511 | BPE base (English) | 32,000 |
| 42,512–50,511 | Domain extensions | 8,000 |
| 50,512–51,999 | F-gram reserved | 1,488 |
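The budget can be sanity-checked in code. The sketch below copies the ranges from the table, looks up the category for an ID, and confirms that the ranges cover exactly 52,000 IDs. All names (`VOCAB_RANGES`, `category_of`, `total_ids`, `integer_token_id`) are hypothetical, and `integer_token_id` assumes integers are laid out in ascending order starting at ID 256, which the table implies but does not state.

```rust
/// (first ID, last ID, category) rows copied from the vocabulary budget table.
const VOCAB_RANGES: [(u32, u32, &str); 6] = [
    (0, 255, "Byte fallback"),
    (256, 10_255, "Integers 0–9,999"),
    (10_256, 10_511, "Special / control"),
    (10_512, 42_511, "BPE base (English)"),
    (42_512, 50_511, "Domain extensions"),
    (50_512, 51_999, "F-gram reserved"),
];

/// Look up the category of a token ID, or `None` if it is outside the vocabulary.
fn category_of(id: u16) -> Option<&'static str> {
    let id = u32::from(id);
    VOCAB_RANGES
        .iter()
        .find(|(lo, hi, _)| (*lo..=*hi).contains(&id))
        .map(|(_, _, name)| *name)
}

/// Total number of IDs covered by the table (ranges are inclusive).
fn total_ids() -> u32 {
    VOCAB_RANGES.iter().map(|(lo, hi, _)| hi - lo + 1).sum()
}

/// ID for an integer token, ASSUMING ascending layout starting at ID 256.
/// Returns `None` for integers above 9,999, which have no dedicated token.
fn integer_token_id(n: u32) -> Option<u16> {
    (n <= 9_999).then(|| (256 + n) as u16)
}
```

Under these assumptions `total_ids()` returns 52,000, matching the sum of the Count column (256 + 10,000 + 256 + 32,000 + 8,000 + 1,488).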

Quick Start

# Prerequisites
rustup toolchain install nightly
cargo install cargo-fuzz maturin

# Check workspace compiles
cargo check --workspace

# Run Gate 0 tests (requires trained tokenizer_v1.bin)
cargo test --release --test gate0_validation

# Benchmarks
cargo bench --bench load_bench
cargo bench --bench encode_bench

# Build PyO3 wheel
cd tokenizer-python
maturin build --release
pip install target/wheels/your_tokenizer-*.whl

# CLI usage
echo "pub fn main() -> Result<(), Error>" | cargo run --bin tokenize -- --show-ids --show-domains

Training Pipeline Order

v1 corpus (10B) → Gate 0 → v2 corpus (50B) → OPUS proxy → CPT → [SFT → F-GRPO → Rollout-SFT] × 3 → quantize → Gate 1

See scripts/ for each step. See blueprint §11 for full step-by-step with time estimates.


Gate 0 Checklist

  • tokenizer_v1.bin produced, loads in Rust in < 50 ms
  • All D-CoT tags → single tokens (RAGEN-2 requirement)
  • All SkillNet tags → single tokens (PROPHET requirement)
  • All PaCoRe chain tags → single tokens
  • Integers 0–9,999 → single tokens (SymPy verifier)
  • Rust operators atomic (->, =>, ::, ..=, ||, &&)
  • Roundtrip encode/decode passes all 10 domain cases
  • check_dcot_tags() PyO3 function works from Python
  • cargo test --workspace passes with no failures
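Several checklist items share the same shape: a given string must encode to exactly one token. A minimal sketch of such a check follows, written against a generic `encode` closure because the real tokenizer-core API is not shown here. `non_atomic` and `toy_encode` are hypothetical names, and the toy encoder's IDs are made up purely to exercise the check.

```rust
/// Rust operators that the Gate 0 checklist requires to be single tokens.
const RUST_OPERATORS: [&str; 6] = ["->", "=>", "::", "..=", "||", "&&"];

/// Return every input string that does NOT encode to exactly one token.
/// `encode` stands in for the real tokenizer; its signature is an assumption.
fn non_atomic<'a, F>(inputs: &[&'a str], encode: F) -> Vec<&'a str>
where
    F: Fn(&str) -> Vec<u16>,
{
    inputs
        .iter()
        .copied()
        .filter(|&s| encode(s).len() != 1)
        .collect()
}

/// Toy encoder used only to exercise the check: known operators get one
/// made-up ID each, anything else falls back to one token per byte.
fn toy_encode(text: &str) -> Vec<u16> {
    match RUST_OPERATORS.iter().position(|op| *op == text) {
        Some(i) => vec![42_512 + i as u16],
        None => text.bytes().map(u16::from).collect(),
    }
}
```

With the toy encoder, `non_atomic(&RUST_OPERATORS, toy_encode)` returns an empty list, while an unlisted string such as `"=="` is reported because it falls back to two byte tokens. The same helper could be pointed at the D-CoT, SkillNet, and PaCoRe tag lists or at the integer strings "0" through "9999" once a real encoder is plugged in.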

References

| Paper | Relevance |
|---|---|
| github/rust-gems bpe (Dec 2024, MIT) | O(n) BPE engine |
| SCONE arXiv 2502.01637 | F-gram compression |
| MorphBPE arXiv 2502.00894 | Morpheme boundary training |
| KL3M arXiv 2503.17247 | Domain tokenizer baselines |
| RAGEN-2 (2025) | D-CoT tag reward signal |
| F-GRPO arXiv 2602.06717 | Rollout buffer token consistency |
| SkillNet (2025) | <\|skill:X\|> single-token routing |
| PaCoRe (2025) | <\|chain:N\|> parallel reasoning |

Blueprint v4.0 — April 2026. Tokenizer frozen at v2 before CPT begins.
