🧮 Tokuin

A fast, CLI tool to estimate token usage, control API costs, and compress prompts for LLM providers. Built in Rust for performance, portability, and safety.

✨ What's in v0.2.0

Prompt Compression — Reduce prompt tokens by 70-90% using the Hieratic format
Quality Metrics — Measure compression fidelity (semantic similarity, critical instruction preservation, structural integrity)
LLM-as-a-Judge — Compare outputs from original vs compressed prompts using an LLM judge
Incremental Compression — Compress long conversations turn-by-turn without re-processing history
Structured Mode — Compression-aware handling of JSON, code blocks, HTML tables, and technical docs
Everything from v0.1.x: token counting, cost estimation, multi-model comparison, load testing

🚀 Installation

Quick Install (macOS & Linux)

curl -fsSL https://raw.githubusercontent.com/nooscraft/tokuin/main/install.sh | bash

Detects your platform, downloads the latest release, verifies its checksum, and installs tokuin to /usr/local/bin (or ~/.local/bin if root access is unavailable).

What's included per platform

Release binaries are built with --features all. Embedding-based semantic scoring (compression-embeddings) requires the ONNX Runtime, which only has compatible prebuilt binaries for some platforms:

Platform	Token counting	Compression	Embedding scoring	LLM judge
Linux x86_64	✅	✅	✅	✅
macOS Apple Silicon (aarch64)	✅	✅	✅	✅
macOS Intel (x86_64)	✅	✅	❌ heuristic only	✅
Linux aarch64	✅	✅	❌ heuristic only	✅
Windows x86_64	✅	✅	❌ heuristic only	✅

Platforms without embedding scoring still have full compression — quality metrics use heuristic scoring instead of embedding-based semantic similarity. See Platform Support for Embedding Model Loading for the technical reasons and how to get full support on any platform.

Quick Install (Windows PowerShell)

irm https://raw.githubusercontent.com/nooscraft/tokuin/main/install.ps1 | iex

Binary is placed in %LOCALAPPDATA%\Programs\tokuin. To customize: download the script first and invoke .\install.ps1 -InstallDir "C:\Tools".

From Releases

Release archives for all platforms are at GitHub Releases. Download the archive for your OS/architecture, verify against checksums.txt, and place tokuin on your PATH.

From Source

Minimal build (token counting only):

git clone https://github.com/nooscraft/tokuin.git
cd tokuin
cargo build --release

Build with all features:

cargo build --release --features all

Build with embedding support (semantic scoring):

cargo build --release --features all,compression-embeddings
# Then set up models:
./target/release/tokuin setup models

📖 Usage

Token Counting

echo "Hello, world!" | tokuin --model gpt-4
tokuin prompt.txt --model gpt-4 --price

Multi-Model Comparison

echo "Hello, world!" | tokuin --compare gpt-4 gpt-3.5-turbo --price

JSON Output / Breakdown

echo "Hello, world!" | tokuin --model gpt-4 --format json
echo '[{"role":"system","content":"You are helpful"},{"role":"user","content":"Hi!"}]' | \
  tokuin --model gpt-4 --breakdown --price

Diff & Watch

tokuin prompt.txt --model gpt-4 --diff prompt-v2.txt --price
tokuin prompt.txt --model gpt-4 --watch

🗜️ Prompt Compression

Tokuin includes a prompt compression system using the Hieratic format — a structured, LLM-parseable representation that reduces token usage by 70-90% while preserving semantic meaning.

Why Hieratic?

Named after ancient Egypt's compressed cursive script (a practical simplification of hieroglyphics), Hieratic represents the same information in far fewer tokens. Any modern LLM reads it directly without decoding.

Quick Start

# Compress a prompt (medium level by default)
tokuin compress my-prompt.txt

# With quality metrics
tokuin compress my-prompt.txt --quality

# Aggressive compression
tokuin compress my-prompt.txt --level aggressive --quality

# Structured mode (JSON, code blocks, HTML tables)
tokuin compress api-spec.txt --structured --level medium

Compression Levels

Level	Token Reduction	Quality Trade-off
`light`	30–50%	Minimal — safe for most prompts
`medium`	50–70%	Balanced — good default
`aggressive`	70–90%	Maximum — verify with `--quality`

Hieratic Format Example

Original (850 tokens):

You are an expert programmer with 10 years of experience in building
distributed systems, microservices, and cloud-native applications...
[full verbose role description]

Example 1: Authentication Bug Fix
Given a service that was experiencing session token bypass attacks...
[detailed example spanning 200 tokens]

Hieratic (285 tokens — 66% reduction):

@HIERATIC v1.0

@ROLE[inline]
"Expert engineer: 10y distributed systems, microservices, cloud-native"

@EXAMPLES[inline]
1. Auth bug: session bypass → HMAC signing → 94% bot reduction
2. DB perf: 2.3s queries → pooling+cache → 0.1s, 10x capacity

@TASK
Analyze code and provide recommendations

@FOCUS: performance, security, maintainability
@STYLE: concise, actionable

LLMs read Hieratic natively — no expansion step needed when sending to an API.

Quality Metrics

Use --quality to measure compression fidelity:

tokuin compress prompt.txt --quality

Quality Metrics:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Score: 82.3% (Good)
  ├─ Semantic Similarity:      85.1%
  ├─ Critical Instructions:    3/3 preserved (100.0%)
  ├─ Information Retention:    78.2%
  └─ Structural Integrity:     100.0%

✅ Quality is acceptable (>= 70%)

Scoring modes:

Mode	Speed	Accuracy	Requires
`heuristic`	Fast	Keyword-based	Nothing (default)
`semantic`	Slower	Embedding-based	`compression-embeddings`
`hybrid`	Moderate	Best of both	`compression-embeddings`

# Use semantic scoring (on supported platforms)
tokuin compress prompt.txt --scoring semantic --quality

# Heuristic scoring (always available)
tokuin compress prompt.txt --scoring heuristic --quality

Structured Document Mode

For prompts with JSON, HTML tables, BNF grammars, or code blocks:

tokuin compress technical-spec.txt --structured --level medium

Structured mode:

Preserves JSON document structure
Keeps HTML tables intact
Detects and consolidates repetitive instruction patterns
Segments by logical sections (definitions, examples, format specs)
Applies structure-aware importance scoring

Use structured mode for: extraction prompts, API documentation, data processing specs.
Use default mode for: conversational prompts, natural language instructions, role descriptions.

Incremental Compression

For multi-turn conversations or continuously growing documents:

# First turn — creates conversation-turn1.txt.state.json
tokuin compress conversation-turn1.txt --incremental

# Subsequent turns — state file is auto-detected
tokuin compress conversation-turn2.txt --incremental
tokuin compress conversation-turn3.txt --incremental

Only the delta since the last anchor is compressed — no re-processing of history.

Options:

--anchor-threshold <N>: Tokens before creating an anchor summary (default: 1000)
--retention-threshold <N>: Recent tokens kept uncompressed (default: 500)
--previous <PATH>: Override default state file path

Incremental Mode:
  Anchors: 3
  Anchor tokens: 7300
  Retained tokens: 500

LLM-as-a-Judge Evaluation

Test whether compressed prompts produce equivalent outputs to the original — the most reliable quality signal:

export OPENROUTER_API_KEY="sk-or-..."

tokuin compress prompt.txt --quality --llm-judge

How it works:

Sends the original prompt to the evaluation model → output A
Sends the compressed prompt → output B
Judge model scores A vs B on equivalence, instruction compliance, information completeness

LLM Judge Evaluation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Output Equivalence:       92/100
Instruction Compliance:   95/100
Information Completeness: 88/100
Quality Preservation:     90/100
Overall Fidelity:         91/100 (Excellent)
Evaluation Cost: $0.012

Options:

--llm-judge: Enable LLM judge (requires load-test feature + API key)
--evaluation-model <MODEL>: Model for generating outputs (default: anthropic/claude-3-opus)
--judge-model <MODEL>: Model for scoring (default: anthropic/claude-3-opus)
--judge-api-key <KEY>: API key (or use OPENROUTER_API_KEY env var)
--judge-provider <PROVIDER>: openrouter (default), openai, anthropic

Cost per evaluation: ~$0.01–0.05 (2 output calls + 1 judge call). Use cheaper models like anthropic/claude-3-haiku for evaluation and keep the high-quality model for judging.

Expand Compressed Prompts

tokuin expand compressed.hieratic --output expanded.txt

# Pipe directly to an LLM tool
tokuin expand compressed.hieratic | your-llm-tool

Limitations and Known Drawbacks

Where compression works best:

✅ Prompts ≥ 100 tokens (70–90% reduction)
✅ Prompts with repetitive role descriptions, examples, or constraints
✅ Technical docs with clear sections (JSON extraction, API specs)
✅ Multi-turn conversations (use --incremental)

Where it may not help or hurt:

❌ Very short prompts (< 50 tokens) — Hieratic header overhead exceeds savings
❌ Already dense text (URLs, code, minimal prose) — little redundancy to remove
❌ Single-sentence instructions — format overhead not worth it
❌ Prompts where exact wording is critical (legal, medical) — always verify with --quality

General caveats:

Compression quality varies by prompt type and content; results are not guaranteed
aggressive level may lose nuance; validate outputs on your specific use case
Hieratic is a Tokuin-specific format — LLMs understand it, but it is not a standard
The compressed output is not human-readable without familiarity with the format

Platform Support for Embedding Model Loading

Embedding-based semantic scoring (--scoring semantic / --scoring hybrid) requires loading an ONNX Runtime model at runtime. The ONNX Runtime ships prebuilt native libraries, and not all of them are compatible with every CI build toolchain. As a result, release binaries only include embedding support on a subset of platforms.

What each platform gets

Platform	Binary includes embeddings	Scoring available	Model loading
Linux x86_64	✅ Yes	semantic, hybrid, heuristic	`tokuin setup models`
macOS Apple Silicon (aarch64)	✅ Yes	semantic, hybrid, heuristic	`tokuin setup models`
macOS Intel (x86_64)	❌ No	heuristic only	N/A — no ONNX prebuilt
Linux aarch64	❌ No	heuristic only	N/A — ABI mismatch in cross build
Windows x86_64	❌ No	heuristic only	N/A — CRT conflict

Why each platform is limited

macOS Intel (x86_64): The ONNX Runtime does not publish prebuilt binaries for x86_64-apple-darwin with the feature set used in CI builds. Intel Macs are increasingly rare and ort has shifted focus to Apple Silicon.

Linux aarch64: The CI builds aarch64-unknown-linux-gnu via cross-compilation using a Docker container. The container's C++ standard library is too old to link against the prebuilt ONNX Runtime binary (missing __cxa_call_terminate and std::filesystem symbols). Native aarch64 Linux hosts (e.g. AWS Graviton, Raspberry Pi 4) can build with full embedding support.

Windows x86_64: The ONNX Runtime DLL is linked against the dynamic C runtime (msvcprt.lib), but Rust on MSVC links against the static C runtime (libcpmt.lib). This causes LNK2005 multiply-defined symbol errors at link time. There is no simple workaround without switching the entire build to dynamic CRT, which has other tradeoffs.

What happens at runtime on unsupported platforms

If you run --scoring semantic or --scoring hybrid on a binary built without compression-embeddings, Tokuin automatically falls back to heuristic scoring and prints a notice:

ℹ️  Embedding scoring requested but not available in this build.
   Falling back to heuristic scoring.
   Build with --features compression-embeddings for semantic/hybrid scoring.

Quality metrics still work — they just use keyword matching and position-based weighting instead of embedding cosine similarity.

Getting full embedding support on any platform

Build from source on a native host (not cross-compiled):

git clone https://github.com/nooscraft/tokuin.git
cd tokuin

# Build with embedding support
cargo build --release --features all,compression-embeddings

# Download embedding models
./target/release/tokuin setup models

# Verify
./target/release/tokuin compress prompt.txt --scoring semantic --quality

The setup models command downloads all-MiniLM-L6-v2 from HuggingFace and caches it at ~/.cache/tokuin/models/. It requires internet access on first run only.

Model setup options

# Download tokenizer (required for semantic scoring)
tokuin setup models

# Also download ONNX model file (optional, higher quality inference)
tokuin setup models --onnx

# Force re-download
tokuin setup models --force

Troubleshooting model download:

If setup models fails, check your internet connection and try again
The models are fetched directly from HuggingFace over HTTPS using rustls (no system TLS dependency)
If the download consistently fails, you can manually place tokenizer.json at ~/.cache/tokuin/models/all-MiniLM-L6-v2/tokenizer.json
Compression itself (--scoring heuristic) never requires model download

🧪 Load Testing

Run concurrent load tests against LLM APIs:

# Basic load test
export OPENAI_API_KEY="sk-openai-..."
echo "What is 2+2?" | tokuin load-test \
  --model gpt-4 \
  --runs 100 \
  --concurrency 10

# With OpenRouter (400+ models)
export OPENROUTER_API_KEY="sk-or-..."
echo "Hello!" | tokuin load-test \
  --model openai/gpt-4 \
  --runs 50 \
  --concurrency 5 \
  --provider openrouter

# Dry run — cost estimate only, no API calls
echo "Test prompt" | tokuin load-test \
  --model gpt-4 \
  --runs 1000 \
  --concurrency 50 \
  --dry-run \
  --estimate-cost

# With think time and retries
tokuin load-test \
  --model gpt-4 \
  --runs 200 \
  --concurrency 20 \
  --think-time "250-750ms" \
  --retry 3 \
  --prompt-file prompts.txt

Output:

=== Load Test Results ===
Total Requests: 100
Successful: 98 (98.0%)
Failed: 2 (2.0%)

Latency (ms):
  Average: 1234.56
  p50:     1200
  p95:     1850

Cost Estimation:
  Input tokens:   5000
  Output tokens: 12000
  Total cost:    $0.870000

Environment Variables:

OPENAI_API_KEY
ANTHROPIC_API_KEY
OPENROUTER_API_KEY
API_KEY — generic (provider auto-detected unless --provider is set)

📋 Command Reference

Token Estimate (default)

tokuin [OPTIONS] [FILE|TEXT]
tokuin estimate [OPTIONS] [FILE|TEXT]

Options:
  -m, --model <MODEL>          Model for tokenization (e.g. gpt-4)
  -c, --compare <MODELS>...    Compare across multiple models
  -b, --breakdown              Show breakdown by role
  -f, --format <FORMAT>        text | json | markdown  [default: text]
  -p, --price                  Show cost estimate
      --pricing-file <FILE>    Custom pricing TOML (or TOKUIN_PRICING_FILE)
      --minify                 Strip markdown (requires markdown feature)
      --diff <FILE>            Compare with another file
  -w, --watch                  Re-run on file change (requires watch feature)

Compress

tokuin compress <FILE> [OPTIONS]

Options:
  -o, --output <FILE>                   Output file [default: <input>.hieratic]
  -l, --level <LEVEL>                   light | medium | aggressive  [default: medium]
      --structured                       Enable structured document mode
      --inline                           Force inline mode
      --incremental                      Incremental compression (delta only)
      --previous <FILE>                  Custom state file path
      --anchor-threshold <N>             Anchor interval in tokens [default: 1000]
      --retention-threshold <N>          Recent tokens kept uncompressed [default: 500]
  -m, --model <MODEL>                   Tokenizer model [default: gpt-4]
  -f, --format <FORMAT>                 hieratic | expanded | json  [default: hieratic]
      --scoring <MODE>                   heuristic | semantic | hybrid  [default: heuristic]
      --quality                          Show quality metrics
      --llm-judge                        LLM judge evaluation (requires load-test + API key)
      --evaluation-model <MODEL>         Model for LLM judge output generation
      --judge-model <MODEL>              Model for judging [default: anthropic/claude-3-opus]
      --judge-api-key <KEY>              API key for judge
      --judge-provider <PROVIDER>        openrouter | openai | anthropic  [default: openrouter]
      --pricing-file <FILE>              Custom pricing TOML

Load Test

tokuin load-test [OPTIONS] --model <MODEL> --runs <RUNS>

Options:
  -m, --model <MODEL>              Model name
      --endpoint <URL>             API endpoint (optional)
      --api-key <KEY>              Generic API key
      --openai-api-key <KEY>       OpenAI API key
      --anthropic-api-key <KEY>    Anthropic API key
      --openrouter-api-key <KEY>   OpenRouter API key
  -c, --concurrency <N>            Concurrent requests [default: 10]
  -r, --runs <N>                   Total requests
  -p, --prompt-file <FILE>         Prompt file (or stdin)
      --think-time <TIME>          Think time e.g. "250-750ms"
      --retry <N>                  Retries on failure [default: 3]
  -f, --output-format <FORMAT>     text | json | csv | prometheus | markdown
      --dry-run                    Estimate cost, no API calls
      --estimate-cost              Show cost in results
      --pricing-file <FILE>        Custom pricing TOML

🔧 Features & Build Flags

Feature	Flag	What it adds
Token counting	(default)	Core estimation, cost, comparison
Compression	`compression`	Hieratic compress/expand, heuristic quality metrics
Embedding scoring	`compression-embeddings`	Semantic + hybrid scoring via ONNX
Load testing	`load-test`	Concurrent API testing, LLM judge
Markdown	`markdown`	Markdown output format, `--minify`
Watch mode	`watch`	`--watch` file change detection
Gemini	`gemini`	Google Gemini tokenization
All (release)	`all`	Everything above except `compression-embeddings`*

* compression-embeddings is excluded from all because ONNX Runtime prebuilt binaries are not available for all cross-compiled release targets. Add it explicitly with --features all,compression-embeddings on a native host.

# Standard release build
cargo build --release --features all

# With embedding support (native host only)
cargo build --release --features all,compression-embeddings

# Minimal compression (no embeddings)
cargo build --release --features compression

# Load testing only
cargo build --release --features load-test

Post-Build: Set Up Embedding Models

Only needed when building with compression-embeddings:

./target/release/tokuin setup models        # download tokenizer
./target/release/tokuin setup models --onnx # also download ONNX model
./target/release/tokuin setup models --force # re-download

Models are stored in ~/.cache/tokuin/models/ and reused across sessions.

🎯 Supported Models

OpenAI

gpt-4, gpt-4-turbo, gpt-3.5-turbo, gpt-3.5-turbo-16k

Google Gemini (requires `gemini` feature)

gemini-pro, gemini-2.5-pro, gemini-2.5-flash

OpenRouter (requires `load-test` feature)

Unified access to 400+ models via provider/model format:

anthropic/claude-3-opus, openai/gpt-4, meta-llama/llama-2-70b-chat, google/gemini-pro, and hundreds more

Anthropic (requires `load-test` feature)

claude-3-opus, claude-3-sonnet, claude-3-haiku

Generic Endpoint

Use --provider generic with --endpoint to target any OpenAI-compatible API.

Planned

Mistral, Cohere, AI21 Labs, Meta LLaMA. See PROVIDERS_PLAN.md.

🏗️ Architecture

src/
├── cli.rs                  # Argument parsing and command dispatch
├── error.rs                # Central error type (thiserror)
├── tokenizers/             # OpenAI, Gemini tokenizer implementations
├── models/                 # Model registry and pricing tables
├── parsers/                # Input parsing (plain text, JSON chat)
├── output/                 # Formatters (text, JSON, Markdown)
├── http/                   # HTTP client layer (load-test feature)
│   └── providers/          # OpenAI, OpenRouter, Anthropic, stubs
├── simulator/              # Load test engine, concurrency, metrics
├── compression/            # Hieratic compression pipeline
│   ├── compressor.rs       # Orchestration and compression levels
│   ├── parser.rs           # Hieratic format parser
│   ├── hieratic_encoder.rs # Encode to Hieratic
│   ├── hieratic_decoder.rs # Expand Hieratic back to text
│   ├── quality.rs          # Quality metrics
│   ├── embeddings.rs       # ONNX embedding provider
│   ├── llm_judge.rs        # LLM-as-a-judge evaluation
│   ├── pattern_extractor.rs
│   └── context_library.rs
└── utils/                  # Shared helpers

🧪 Testing

cargo test
cargo test --features all
cargo test -- --nocapture  # with output
cargo fmt
cargo clippy -- -D warnings

MSRV

Targets Rust 1.70+. CI runs on stable and beta. Due to dependency constraints, a specific MSRV is not enforced. See CONTRIBUTING.md.

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Vibe coding welcome — if you collaborate with an AI pair programmer, skim AGENTS.md for a quick project brief and mention the session in your PR.

📄 License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT License (LICENSE-MIT)

at your option.

🙏 Acknowledgments

Built with tiktoken-rs for OpenAI tokenization
Inspired by the need for accurate token estimation and cost control in LLM development
Hieratic format named after the ancient Egyptian script that compressed hieroglyphics into practical everyday writing

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github		.github
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
ADDING_MODELS_GUIDE.md		ADDING_MODELS_GUIDE.md
AGENTS.md		AGENTS.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Cross.toml		Cross.toml
HIERATIC_FORMAT.md		HIERATIC_FORMAT.md
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
PRICING_TEMPLATE.toml		PRICING_TEMPLATE.toml
PROVIDERS_PLAN.md		PROVIDERS_PLAN.md
README.md		README.md
contexts.toml		contexts.toml
install.ps1		install.ps1
install.sh		install.sh

Folders and files

Latest commit

History

Repository files navigation

🧮 Tokuin

✨ What's in v0.2.0

🚀 Installation

Quick Install (macOS & Linux)

What's included per platform

Quick Install (Windows PowerShell)

From Releases

From Source

📖 Usage

Token Counting

Multi-Model Comparison

JSON Output / Breakdown

Diff & Watch

🗜️ Prompt Compression

Why Hieratic?

Quick Start

Compression Levels

Hieratic Format Example

Quality Metrics

Structured Document Mode

Incremental Compression

LLM-as-a-Judge Evaluation

Expand Compressed Prompts

Limitations and Known Drawbacks

Platform Support for Embedding Model Loading

What each platform gets

Why each platform is limited

What happens at runtime on unsupported platforms

Getting full embedding support on any platform

Model setup options

🧪 Load Testing

📋 Command Reference

Token Estimate (default)

Compress

Load Test

🔧 Features & Build Flags

Post-Build: Set Up Embedding Models

🎯 Supported Models

OpenAI

Google Gemini (requires gemini feature)

OpenRouter (requires load-test feature)

Anthropic (requires load-test feature)

Generic Endpoint

Planned

🏗️ Architecture

🧪 Testing

MSRV

🤝 Contributing

📄 License

🙏 Acknowledgments

About

Topics

Resources

License

Licenses found

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Google Gemini (requires `gemini` feature)

OpenRouter (requires `load-test` feature)

Anthropic (requires `load-test` feature)

Packages