A fast, CLI tool to estimate token usage, control API costs, and compress prompts for LLM providers. Built in Rust for performance, portability, and safety.
- Prompt Compression — Reduce prompt tokens by 70-90% using the Hieratic format
- Quality Metrics — Measure compression fidelity (semantic similarity, critical instruction preservation, structural integrity)
- LLM-as-a-Judge — Compare outputs from original vs compressed prompts using an LLM judge
- Incremental Compression — Compress long conversations turn-by-turn without re-processing history
- Structured Mode — Compression-aware handling of JSON, code blocks, HTML tables, and technical docs
- Everything from v0.1.x: token counting, cost estimation, multi-model comparison, load testing
curl -fsSL https://raw.githubusercontent.com/nooscraft/tokuin/main/install.sh | bashDetects your platform, downloads the latest release, verifies its checksum, and installs tokuin to /usr/local/bin (or ~/.local/bin if root access is unavailable).
Release binaries are built with --features all. Embedding-based semantic scoring (compression-embeddings) requires the ONNX Runtime, which only has compatible prebuilt binaries for some platforms:
| Platform | Token counting | Compression | Embedding scoring | LLM judge |
|---|---|---|---|---|
| Linux x86_64 | ✅ | ✅ | ✅ | ✅ |
| macOS Apple Silicon (aarch64) | ✅ | ✅ | ✅ | ✅ |
| macOS Intel (x86_64) | ✅ | ✅ | ❌ heuristic only | ✅ |
| Linux aarch64 | ✅ | ✅ | ❌ heuristic only | ✅ |
| Windows x86_64 | ✅ | ✅ | ❌ heuristic only | ✅ |
Platforms without embedding scoring still have full compression — quality metrics use heuristic scoring instead of embedding-based semantic similarity. See Platform Support for Embedding Model Loading for the technical reasons and how to get full support on any platform.
irm https://raw.githubusercontent.com/nooscraft/tokuin/main/install.ps1 | iexBinary is placed in %LOCALAPPDATA%\Programs\tokuin. To customize: download the script first and invoke .\install.ps1 -InstallDir "C:\Tools".
Release archives for all platforms are at GitHub Releases. Download the archive for your OS/architecture, verify against checksums.txt, and place tokuin on your PATH.
Minimal build (token counting only):
git clone https://github.com/nooscraft/tokuin.git
cd tokuin
cargo build --releaseBuild with all features:
cargo build --release --features allBuild with embedding support (semantic scoring):
cargo build --release --features all,compression-embeddings
# Then set up models:
./target/release/tokuin setup modelsecho "Hello, world!" | tokuin --model gpt-4
tokuin prompt.txt --model gpt-4 --priceecho "Hello, world!" | tokuin --compare gpt-4 gpt-3.5-turbo --priceecho "Hello, world!" | tokuin --model gpt-4 --format json
echo '[{"role":"system","content":"You are helpful"},{"role":"user","content":"Hi!"}]' | \
tokuin --model gpt-4 --breakdown --pricetokuin prompt.txt --model gpt-4 --diff prompt-v2.txt --price
tokuin prompt.txt --model gpt-4 --watchTokuin includes a prompt compression system using the Hieratic format — a structured, LLM-parseable representation that reduces token usage by 70-90% while preserving semantic meaning.
Named after ancient Egypt's compressed cursive script (a practical simplification of hieroglyphics), Hieratic represents the same information in far fewer tokens. Any modern LLM reads it directly without decoding.
# Compress a prompt (medium level by default)
tokuin compress my-prompt.txt
# With quality metrics
tokuin compress my-prompt.txt --quality
# Aggressive compression
tokuin compress my-prompt.txt --level aggressive --quality
# Structured mode (JSON, code blocks, HTML tables)
tokuin compress api-spec.txt --structured --level medium| Level | Token Reduction | Quality Trade-off |
|---|---|---|
light |
30–50% | Minimal — safe for most prompts |
medium |
50–70% | Balanced — good default |
aggressive |
70–90% | Maximum — verify with --quality |
Original (850 tokens):
You are an expert programmer with 10 years of experience in building
distributed systems, microservices, and cloud-native applications...
[full verbose role description]
Example 1: Authentication Bug Fix
Given a service that was experiencing session token bypass attacks...
[detailed example spanning 200 tokens]
Hieratic (285 tokens — 66% reduction):
@HIERATIC v1.0
@ROLE[inline]
"Expert engineer: 10y distributed systems, microservices, cloud-native"
@EXAMPLES[inline]
1. Auth bug: session bypass → HMAC signing → 94% bot reduction
2. DB perf: 2.3s queries → pooling+cache → 0.1s, 10x capacity
@TASK
Analyze code and provide recommendations
@FOCUS: performance, security, maintainability
@STYLE: concise, actionable
LLMs read Hieratic natively — no expansion step needed when sending to an API.
Use --quality to measure compression fidelity:
tokuin compress prompt.txt --qualityQuality Metrics:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Score: 82.3% (Good)
├─ Semantic Similarity: 85.1%
├─ Critical Instructions: 3/3 preserved (100.0%)
├─ Information Retention: 78.2%
└─ Structural Integrity: 100.0%
✅ Quality is acceptable (>= 70%)
Scoring modes:
| Mode | Speed | Accuracy | Requires |
|---|---|---|---|
heuristic |
Fast | Keyword-based | Nothing (default) |
semantic |
Slower | Embedding-based | compression-embeddings |
hybrid |
Moderate | Best of both | compression-embeddings |
# Use semantic scoring (on supported platforms)
tokuin compress prompt.txt --scoring semantic --quality
# Heuristic scoring (always available)
tokuin compress prompt.txt --scoring heuristic --qualityFor prompts with JSON, HTML tables, BNF grammars, or code blocks:
tokuin compress technical-spec.txt --structured --level mediumStructured mode:
- Preserves JSON document structure
- Keeps HTML tables intact
- Detects and consolidates repetitive instruction patterns
- Segments by logical sections (definitions, examples, format specs)
- Applies structure-aware importance scoring
Use structured mode for: extraction prompts, API documentation, data processing specs.
Use default mode for: conversational prompts, natural language instructions, role descriptions.
For multi-turn conversations or continuously growing documents:
# First turn — creates conversation-turn1.txt.state.json
tokuin compress conversation-turn1.txt --incremental
# Subsequent turns — state file is auto-detected
tokuin compress conversation-turn2.txt --incremental
tokuin compress conversation-turn3.txt --incrementalOnly the delta since the last anchor is compressed — no re-processing of history.
Options:
--anchor-threshold <N>: Tokens before creating an anchor summary (default: 1000)--retention-threshold <N>: Recent tokens kept uncompressed (default: 500)--previous <PATH>: Override default state file path
Incremental Mode:
Anchors: 3
Anchor tokens: 7300
Retained tokens: 500
Test whether compressed prompts produce equivalent outputs to the original — the most reliable quality signal:
export OPENROUTER_API_KEY="sk-or-..."
tokuin compress prompt.txt --quality --llm-judgeHow it works:
- Sends the original prompt to the evaluation model → output A
- Sends the compressed prompt → output B
- Judge model scores A vs B on equivalence, instruction compliance, information completeness
LLM Judge Evaluation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Output Equivalence: 92/100
Instruction Compliance: 95/100
Information Completeness: 88/100
Quality Preservation: 90/100
Overall Fidelity: 91/100 (Excellent)
Evaluation Cost: $0.012
Options:
--llm-judge: Enable LLM judge (requiresload-testfeature + API key)--evaluation-model <MODEL>: Model for generating outputs (default:anthropic/claude-3-opus)--judge-model <MODEL>: Model for scoring (default:anthropic/claude-3-opus)--judge-api-key <KEY>: API key (or useOPENROUTER_API_KEYenv var)--judge-provider <PROVIDER>:openrouter(default),openai,anthropic
Cost per evaluation: ~$0.01–0.05 (2 output calls + 1 judge call). Use cheaper models like anthropic/claude-3-haiku for evaluation and keep the high-quality model for judging.
tokuin expand compressed.hieratic --output expanded.txt
# Pipe directly to an LLM tool
tokuin expand compressed.hieratic | your-llm-toolWhere compression works best:
- ✅ Prompts ≥ 100 tokens (70–90% reduction)
- ✅ Prompts with repetitive role descriptions, examples, or constraints
- ✅ Technical docs with clear sections (JSON extraction, API specs)
- ✅ Multi-turn conversations (use
--incremental)
Where it may not help or hurt:
- ❌ Very short prompts (< 50 tokens) — Hieratic header overhead exceeds savings
- ❌ Already dense text (URLs, code, minimal prose) — little redundancy to remove
- ❌ Single-sentence instructions — format overhead not worth it
- ❌ Prompts where exact wording is critical (legal, medical) — always verify with
--quality
General caveats:
- Compression quality varies by prompt type and content; results are not guaranteed
aggressivelevel may lose nuance; validate outputs on your specific use case- Hieratic is a Tokuin-specific format — LLMs understand it, but it is not a standard
- The compressed output is not human-readable without familiarity with the format
Embedding-based semantic scoring (--scoring semantic / --scoring hybrid) requires loading an ONNX Runtime model at runtime. The ONNX Runtime ships prebuilt native libraries, and not all of them are compatible with every CI build toolchain. As a result, release binaries only include embedding support on a subset of platforms.
| Platform | Binary includes embeddings | Scoring available | Model loading |
|---|---|---|---|
| Linux x86_64 | ✅ Yes | semantic, hybrid, heuristic | tokuin setup models |
| macOS Apple Silicon (aarch64) | ✅ Yes | semantic, hybrid, heuristic | tokuin setup models |
| macOS Intel (x86_64) | ❌ No | heuristic only | N/A — no ONNX prebuilt |
| Linux aarch64 | ❌ No | heuristic only | N/A — ABI mismatch in cross build |
| Windows x86_64 | ❌ No | heuristic only | N/A — CRT conflict |
macOS Intel (x86_64): The ONNX Runtime does not publish prebuilt binaries for x86_64-apple-darwin with the feature set used in CI builds. Intel Macs are increasingly rare and ort has shifted focus to Apple Silicon.
Linux aarch64: The CI builds aarch64-unknown-linux-gnu via cross-compilation using a Docker container. The container's C++ standard library is too old to link against the prebuilt ONNX Runtime binary (missing __cxa_call_terminate and std::filesystem symbols). Native aarch64 Linux hosts (e.g. AWS Graviton, Raspberry Pi 4) can build with full embedding support.
Windows x86_64: The ONNX Runtime DLL is linked against the dynamic C runtime (msvcprt.lib), but Rust on MSVC links against the static C runtime (libcpmt.lib). This causes LNK2005 multiply-defined symbol errors at link time. There is no simple workaround without switching the entire build to dynamic CRT, which has other tradeoffs.
If you run --scoring semantic or --scoring hybrid on a binary built without compression-embeddings, Tokuin automatically falls back to heuristic scoring and prints a notice:
ℹ️ Embedding scoring requested but not available in this build.
Falling back to heuristic scoring.
Build with --features compression-embeddings for semantic/hybrid scoring.
Quality metrics still work — they just use keyword matching and position-based weighting instead of embedding cosine similarity.
Build from source on a native host (not cross-compiled):
git clone https://github.com/nooscraft/tokuin.git
cd tokuin
# Build with embedding support
cargo build --release --features all,compression-embeddings
# Download embedding models
./target/release/tokuin setup models
# Verify
./target/release/tokuin compress prompt.txt --scoring semantic --qualityThe setup models command downloads all-MiniLM-L6-v2 from HuggingFace and caches it at ~/.cache/tokuin/models/. It requires internet access on first run only.
# Download tokenizer (required for semantic scoring)
tokuin setup models
# Also download ONNX model file (optional, higher quality inference)
tokuin setup models --onnx
# Force re-download
tokuin setup models --forceTroubleshooting model download:
- If
setup modelsfails, check your internet connection and try again - The models are fetched directly from HuggingFace over HTTPS using rustls (no system TLS dependency)
- If the download consistently fails, you can manually place
tokenizer.jsonat~/.cache/tokuin/models/all-MiniLM-L6-v2/tokenizer.json - Compression itself (
--scoring heuristic) never requires model download
Run concurrent load tests against LLM APIs:
# Basic load test
export OPENAI_API_KEY="sk-openai-..."
echo "What is 2+2?" | tokuin load-test \
--model gpt-4 \
--runs 100 \
--concurrency 10
# With OpenRouter (400+ models)
export OPENROUTER_API_KEY="sk-or-..."
echo "Hello!" | tokuin load-test \
--model openai/gpt-4 \
--runs 50 \
--concurrency 5 \
--provider openrouter
# Dry run — cost estimate only, no API calls
echo "Test prompt" | tokuin load-test \
--model gpt-4 \
--runs 1000 \
--concurrency 50 \
--dry-run \
--estimate-cost
# With think time and retries
tokuin load-test \
--model gpt-4 \
--runs 200 \
--concurrency 20 \
--think-time "250-750ms" \
--retry 3 \
--prompt-file prompts.txtOutput:
=== Load Test Results ===
Total Requests: 100
Successful: 98 (98.0%)
Failed: 2 (2.0%)
Latency (ms):
Average: 1234.56
p50: 1200
p95: 1850
Cost Estimation:
Input tokens: 5000
Output tokens: 12000
Total cost: $0.870000
Environment Variables:
OPENAI_API_KEYANTHROPIC_API_KEYOPENROUTER_API_KEYAPI_KEY— generic (provider auto-detected unless--provideris set)
tokuin [OPTIONS] [FILE|TEXT]
tokuin estimate [OPTIONS] [FILE|TEXT]
Options:
-m, --model <MODEL> Model for tokenization (e.g. gpt-4)
-c, --compare <MODELS>... Compare across multiple models
-b, --breakdown Show breakdown by role
-f, --format <FORMAT> text | json | markdown [default: text]
-p, --price Show cost estimate
--pricing-file <FILE> Custom pricing TOML (or TOKUIN_PRICING_FILE)
--minify Strip markdown (requires markdown feature)
--diff <FILE> Compare with another file
-w, --watch Re-run on file change (requires watch feature)
tokuin compress <FILE> [OPTIONS]
Options:
-o, --output <FILE> Output file [default: <input>.hieratic]
-l, --level <LEVEL> light | medium | aggressive [default: medium]
--structured Enable structured document mode
--inline Force inline mode
--incremental Incremental compression (delta only)
--previous <FILE> Custom state file path
--anchor-threshold <N> Anchor interval in tokens [default: 1000]
--retention-threshold <N> Recent tokens kept uncompressed [default: 500]
-m, --model <MODEL> Tokenizer model [default: gpt-4]
-f, --format <FORMAT> hieratic | expanded | json [default: hieratic]
--scoring <MODE> heuristic | semantic | hybrid [default: heuristic]
--quality Show quality metrics
--llm-judge LLM judge evaluation (requires load-test + API key)
--evaluation-model <MODEL> Model for LLM judge output generation
--judge-model <MODEL> Model for judging [default: anthropic/claude-3-opus]
--judge-api-key <KEY> API key for judge
--judge-provider <PROVIDER> openrouter | openai | anthropic [default: openrouter]
--pricing-file <FILE> Custom pricing TOML
tokuin load-test [OPTIONS] --model <MODEL> --runs <RUNS>
Options:
-m, --model <MODEL> Model name
--endpoint <URL> API endpoint (optional)
--api-key <KEY> Generic API key
--openai-api-key <KEY> OpenAI API key
--anthropic-api-key <KEY> Anthropic API key
--openrouter-api-key <KEY> OpenRouter API key
-c, --concurrency <N> Concurrent requests [default: 10]
-r, --runs <N> Total requests
-p, --prompt-file <FILE> Prompt file (or stdin)
--think-time <TIME> Think time e.g. "250-750ms"
--retry <N> Retries on failure [default: 3]
-f, --output-format <FORMAT> text | json | csv | prometheus | markdown
--dry-run Estimate cost, no API calls
--estimate-cost Show cost in results
--pricing-file <FILE> Custom pricing TOML
| Feature | Flag | What it adds |
|---|---|---|
| Token counting | (default) | Core estimation, cost, comparison |
| Compression | compression |
Hieratic compress/expand, heuristic quality metrics |
| Embedding scoring | compression-embeddings |
Semantic + hybrid scoring via ONNX |
| Load testing | load-test |
Concurrent API testing, LLM judge |
| Markdown | markdown |
Markdown output format, --minify |
| Watch mode | watch |
--watch file change detection |
| Gemini | gemini |
Google Gemini tokenization |
| All (release) | all |
Everything above except compression-embeddings* |
* compression-embeddings is excluded from all because ONNX Runtime prebuilt binaries are not available for all cross-compiled release targets. Add it explicitly with --features all,compression-embeddings on a native host.
# Standard release build
cargo build --release --features all
# With embedding support (native host only)
cargo build --release --features all,compression-embeddings
# Minimal compression (no embeddings)
cargo build --release --features compression
# Load testing only
cargo build --release --features load-testOnly needed when building with compression-embeddings:
./target/release/tokuin setup models # download tokenizer
./target/release/tokuin setup models --onnx # also download ONNX model
./target/release/tokuin setup models --force # re-downloadModels are stored in ~/.cache/tokuin/models/ and reused across sessions.
gpt-4,gpt-4-turbo,gpt-3.5-turbo,gpt-3.5-turbo-16k
gemini-pro,gemini-2.5-pro,gemini-2.5-flash
Unified access to 400+ models via provider/model format:
anthropic/claude-3-opus,openai/gpt-4,meta-llama/llama-2-70b-chat,google/gemini-pro, and hundreds more
claude-3-opus,claude-3-sonnet,claude-3-haiku
Use --provider generic with --endpoint to target any OpenAI-compatible API.
Mistral, Cohere, AI21 Labs, Meta LLaMA. See PROVIDERS_PLAN.md.
src/
├── cli.rs # Argument parsing and command dispatch
├── error.rs # Central error type (thiserror)
├── tokenizers/ # OpenAI, Gemini tokenizer implementations
├── models/ # Model registry and pricing tables
├── parsers/ # Input parsing (plain text, JSON chat)
├── output/ # Formatters (text, JSON, Markdown)
├── http/ # HTTP client layer (load-test feature)
│ └── providers/ # OpenAI, OpenRouter, Anthropic, stubs
├── simulator/ # Load test engine, concurrency, metrics
├── compression/ # Hieratic compression pipeline
│ ├── compressor.rs # Orchestration and compression levels
│ ├── parser.rs # Hieratic format parser
│ ├── hieratic_encoder.rs # Encode to Hieratic
│ ├── hieratic_decoder.rs # Expand Hieratic back to text
│ ├── quality.rs # Quality metrics
│ ├── embeddings.rs # ONNX embedding provider
│ ├── llm_judge.rs # LLM-as-a-judge evaluation
│ ├── pattern_extractor.rs
│ └── context_library.rs
└── utils/ # Shared helpers
cargo test
cargo test --features all
cargo test -- --nocapture # with output
cargo fmt
cargo clippy -- -D warningsTargets Rust 1.70+. CI runs on stable and beta. Due to dependency constraints, a specific MSRV is not enforced. See CONTRIBUTING.md.
We welcome contributions! See CONTRIBUTING.md for guidelines.
Vibe coding welcome — if you collaborate with an AI pair programmer, skim
AGENTS.mdfor a quick project brief and mention the session in your PR.
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
- Built with tiktoken-rs for OpenAI tokenization
- Inspired by the need for accurate token estimation and cost control in LLM development
- Hieratic format named after the ancient Egyptian script that compressed hieroglyphics into practical everyday writing