mothertoken

Every model has a native tongue. The question is whether yours matches.

Toolkit for comparing tokenizer efficiency across languages, model families, and user-supplied Hugging Face refs.

Note

The bundled benchmark is curated and representative, not exhaustive. Use direct Hugging Face refs when you want to compare tokenizers outside the starter set.

Installation

# Published package
pip install mothertoken

For local development:

git clone https://github.com/inimaz/mothertoken
cd mothertoken
uv sync
uv pip install -e .

CLI Usage

The mothertoken command is available after installation.

Common questions

I speak a language that is not English. Which tokenizer is most efficient for it?

mothertoken rank spanish

I have this text and a chosen model. How many tokens does it use?

mothertoken tokenize "Hola Mundo" --model gpt-4o

I am choosing between a few models. Which one tokenizes my text best?

mothertoken compare "Travesura realizada" --model gpt-oss --model Qwen/Qwen3-0.6B

I have this model. Which languages does it tokenize best, and which worst?

mothertoken benchmark run --models gpt-oss,YOUR_MODEL1,YOUR_MODEL2

Rank tokenizers for a language

Rank supported tokenizers for a specific language using the precomputed benchmark data.

mothertoken rank spanish

# Raw FLORES+ codes still work
mothertoken rank spa_Latn
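
As a rough illustration of what ranking by language means, the sketch below computes a per-language multiplier (tokens in that language divided by tokens in English for the same sentence) and sorts tokenizers by it. Everything here is an assumption made for illustration: the function names, the table layout, and the two stand-in counters, which are not real model tokenizers and do not reflect mothertoken's internals or its benchmark file format.

```python
from typing import Callable, Dict, List, Tuple

# Tiny parallel sample: the same sentence keyed by FLORES+ code.
PARALLEL: Dict[str, str] = {
    "eng_Latn": "Hello world",
    "spa_Latn": "Hola Mundo",
    "arb_Arab": "مرحبا بالعالم",
}

# Stand-in token counters (hypothetical, for illustration only).
COUNTERS: Dict[str, Callable[[str], int]] = {
    "utf8-bytes": lambda text: len(text.encode("utf-8")),  # one token per byte
    "code-points": len,                                    # one token per character
}


def multipliers(
    counters: Dict[str, Callable[[str], int]],
    parallel: Dict[str, str],
    baseline: str = "eng_Latn",
) -> Dict[str, Dict[str, float]]:
    """Return {tokenizer: {language: tokens / baseline tokens}}."""
    table: Dict[str, Dict[str, float]] = {}
    for name, count in counters.items():
        base = count(parallel[baseline])
        table[name] = {lang: count(text) / base for lang, text in parallel.items()}
    return table


def rank(table: Dict[str, Dict[str, float]], language: str) -> List[Tuple[str, float]]:
    """Sort tokenizers for one language, lowest multiplier (most efficient) first."""
    return sorted(
        ((name, row[language]) for name, row in table.items()),
        key=lambda item: item[1],
    )


if __name__ == "__main__":
    for name, value in rank(multipliers(COUNTERS, PARALLEL), "arb_Arab"):
        print(f"{name}: {value:.2f}x English")
```

A multiplier of 1.0 means the language costs the same number of tokens as English under that counter; higher values mean the same content costs more.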

List tokenizers

See which tokenizer IDs can be used and which familiar models use them.

mothertoken list

Tokenize exact text

Count tokens for exact text using local tokenizers by default. Add --language to estimate the English-equivalent token count from the benchmark multiplier.

mothertoken tokenize "Hola Mundo" --language es

# Check one model
mothertoken tokenize "Hello" --model gpt-4o

# Check a Hugging Face model/tokenizer ref directly
mothertoken tokenize "Hello" --model Qwen/Qwen3-0.6B

# Estimate the English-equivalent count for a known language
mothertoken tokenize "مرحبا بالعالم" --language ar --model gpt-4o

# Compare against your own English translation
mothertoken tokenize "مرحبا بالعالم" --language ar --english-text "Hello world"

# Tokenize a file
mothertoken tokenize --file prompt.txt

# Compare translated files
mothertoken tokenize --file prompt.ar.txt --language ar --english-file prompt.en.txt
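
The arithmetic behind an English-equivalent estimate can be sketched as follows. This assumes the benchmark multiplier is defined as tokens in the language divided by tokens in English on parallel text, which this README does not spell out. The function names and the byte-counting stand-in tokenizer are hypothetical, not part of mothertoken.

```python
from typing import Callable


def english_equivalent(token_count: int, multiplier: float) -> float:
    """Estimate the token count of the same content written in English.

    Assumption: multiplier = tokens(language) / tokens(English).
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return token_count / multiplier


def observed_multiplier(
    count_tokens: Callable[[str], int], text: str, english_text: str
) -> float:
    """Measure the ratio directly from a text and your own English translation."""
    english_tokens = count_tokens(english_text)
    if english_tokens == 0:
        raise ValueError("english_text produced no tokens")
    return count_tokens(text) / english_tokens


def utf8_byte_count(text: str) -> int:
    """Stand-in tokenizer for illustration only: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    ratio = observed_multiplier(utf8_byte_count, "مرحبا بالعالم", "Hello world")
    print(f"observed multiplier: {ratio:.2f}")
    print(f"English-equivalent of 30 tokens at 1.5x: {english_equivalent(30, 1.5):.1f}")
```

The first function mirrors the estimate from a known language; the second mirrors the case where you supply your own English translation instead of relying on a benchmark average.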

Compare selected tokenizers

Compare any mix of aliases from mothertoken list and direct Hugging Face refs. This is the main workflow when you care about a specific set of models.

mothertoken compare "Travesura realizada" \
  --model gpt-4o \
  --model Qwen/Qwen3-0.6B \
  --model mistralai/Mistral-7B-v0.1

mothertoken compare --file prompt.txt \
  --model mistralai/Mistral-7B-v0.1 \
  --model deepseek-ai/DeepSeek-V4-Pro
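
Conceptually, a comparison runs the same text through several tokenizers and reports the counts side by side. The sketch below shows that shape with three stand-in tokenizers so it runs without downloads. The `compare` function and the `STAND_INS` mapping are hypothetical names, and the stand-ins are not real model tokenizers.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A tokenizer here is any callable that turns text into a sequence of tokens.
Tokenizer = Callable[[str], Sequence]

# Stand-ins for illustration only.
STAND_INS: Dict[str, Tokenizer] = {
    "whitespace": str.split,                                  # one token per word
    "code-points": list,                                      # one token per character
    "utf8-bytes": lambda text: list(text.encode("utf-8")),    # one token per byte
}


def compare(text: str, tokenizers: Dict[str, Tokenizer]) -> List[Tuple[str, int]]:
    """Return (name, token count) pairs, fewest tokens first."""
    counts = [(name, len(tokenize(text))) for name, tokenize in tokenizers.items()]
    return sorted(counts, key=lambda item: item[1])


if __name__ == "__main__":
    for name, count in compare("Travesura realizada", STAND_INS):
        print(f"{name:12} {count}")
```

Any callable with the same shape can be dropped into the mapping. For example, with the `transformers` package installed, `AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B").tokenize` takes a string and returns a list of tokens. Whether mothertoken loads tokenizers this way is not stated here.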

Benchmark data

Running mothertoken benchmark on its own shows the benchmark help. Use benchmark run to create benchmark data, benchmark use to choose the active benchmark, and benchmark status to inspect which benchmark the other commands will use.

When --output is omitted, benchmark run writes to the user-owned benchmark file and makes it active:

mothertoken benchmark run --languages eng_Latn,arb_Arab --models gpt-4o,Qwen/Qwen3-0.6B

Before it starts, the command prints the file it will write. The default user config locations are platform-specific:

| OS          | User config directory                                    |
|-------------|----------------------------------------------------------|
| Linux / XDG | $XDG_CONFIG_HOME/mothertoken or ~/.config/mothertoken    |
| macOS       | ~/Library/Application Support/mothertoken                |
| Windows     | %APPDATA%\mothertoken                                    |
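
The lookup described by the table above can be sketched in a few lines of Python. This is an illustration of the rule, not mothertoken's actual code: the function name `user_config_dir`, its parameters, and the Windows fallback used when APPDATA is unset are all assumptions.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def user_config_dir(
    app: str = "mothertoken",
    platform: str = sys.platform,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[str] = None,
) -> Path:
    """Resolve the per-user config directory following the table above."""
    env = os.environ if env is None else env
    home_dir = Path.home() if home is None else Path(home)

    if platform.startswith("win"):
        appdata = env.get("APPDATA")
        # Fallback when APPDATA is unset is an assumption, not from the README.
        base = Path(appdata) if appdata else home_dir / "AppData" / "Roaming"
    elif platform == "darwin":
        base = home_dir / "Library" / "Application Support"
    else:
        # Linux and other XDG platforms: honor XDG_CONFIG_HOME, else ~/.config.
        xdg = env.get("XDG_CONFIG_HOME")
        base = Path(xdg) if xdg else home_dir / ".config"
    return base / app


if __name__ == "__main__":
    print(user_config_dir())
```

Passing `platform`, `env`, and `home` explicitly makes the rule easy to check for an operating system other than the one you are running on.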

To write somewhere else:

mothertoken benchmark run \
  --languages eng_Latn,arb_Arab \
  --models gpt-4o,Qwen/Qwen3-0.6B \
  --output benchmark.json

Make an existing benchmark active:

mothertoken benchmark use benchmark.json
mothertoken benchmark status

Return to the bundled default benchmark:

mothertoken benchmark use --default

Researcher Workflow

Benchmark regeneration and model-extension docs live in docs/benchmarking.md.

You can also benchmark a direct Hugging Face ref without adding it to default_tokenizers.yaml:

uv run mothertoken benchmark run --languages eng_Latn,arb_Arab --models Qwen/Qwen3-0.6B

License

MIT
