Every model has a native tongue. The question is whether yours matches.
Toolkit for comparing tokenizer efficiency across languages, model families, and user-supplied Hugging Face refs.
Note: The bundled benchmark is curated and representative, not exhaustive. Use direct Hugging Face refs when you want to compare tokenizers outside the starter set.
# Published package
pip install mothertoken

For local development:
git clone https://github.com/inimaz/mothertoken
cd mothertoken
uv sync
uv pip install -e .

The mothertoken command is available after installation.

mothertoken rank spanish
mothertoken tokenize "Hola Mundo" --model gpt-4o
mothertoken compare "Travesura realizada" --model gpt-oss --model Qwen/Qwen3-0.6B
mothertoken benchmark run --models gpt-oss,YOUR_MODEL1,YOUR_MODEL2

Rank supported tokenizers for a specific language using the precomputed benchmark data.
mothertoken rank spanish
# Raw FLORES+ codes still work
mothertoken rank spa_Latn

See which tokenizer IDs can be used and which familiar models use them.

mothertoken list

Count tokens for exact text using local tokenizers by default. Add --language to estimate the English-equivalent token count from the benchmark multiplier.
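
The exact definition of the benchmark multiplier is not spelled out here. As a rough sketch, assume it is the ratio of a tokenizer's token count on a language's benchmark text to its count on the parallel English text, so the English-equivalent estimate is the observed count divided by that ratio. The function names and numbers below are hypothetical illustrations, not mothertoken's API or benchmark results.

```python
def language_multiplier(lang_tokens: int, english_tokens: int) -> float:
    """Tokens spent in a language per English token on the same parallel text.

    Assumption: this mirrors what the benchmark stores per language and
    tokenizer. The real definition may differ.
    """
    if english_tokens <= 0:
        raise ValueError("english_tokens must be positive")
    return lang_tokens / english_tokens


def english_equivalent(token_count: int, multiplier: float) -> float:
    """Estimate how many tokens the same content would cost in English."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return token_count / multiplier


# Illustrative numbers only: 150 tokens vs. 100 English tokens on parallel text.
multiplier = language_multiplier(150, 100)
estimate = english_equivalent(12, multiplier)
print(f"multiplier={multiplier}, english_equivalent={estimate}")
```

Under this reading, a multiplier above 1 means the tokenizer spends more tokens on that language than on equivalent English text.
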
mothertoken tokenize "Hola Mundo" --language es
# Check one model
mothertoken tokenize "Hello" --model gpt-4o
# Check a Hugging Face model/tokenizer ref directly
mothertoken tokenize "Hello" --model Qwen/Qwen3-0.6B
# Estimate the English-equivalent count for a known language
mothertoken tokenize "مرحبا بالعالم" --language ar --model gpt-4o
# Compare against your own English translation
mothertoken tokenize "مرحبا بالعالم" --language ar --english-text "Hello world"
# Tokenize a file
mothertoken tokenize --file prompt.txt
# Compare translated files
mothertoken tokenize --file prompt.ar.txt --language ar --english-file prompt.en.txt

Compare aliases from mothertoken list with direct Hugging Face refs. This is the main workflow when you care about a specific set of models.
mothertoken compare "Travesura realizada" \
--model gpt-4o \
--model Qwen/Qwen3-0.6B \
--model mistralai/Mistral-7B-v0.1
mothertoken compare --file prompt.txt \
--model mistralai/Mistral-7B-v0.1 \
--model deepseek-ai/DeepSeek-V4-Pro

mothertoken benchmark shows benchmark help. Use benchmark run to create benchmark data, benchmark use to choose the active benchmark, and benchmark status to inspect what commands will use.
When --output is omitted, benchmark run writes to the user-owned benchmark file and makes it active:
mothertoken benchmark run --languages eng_Latn,arb_Arab --models gpt-4o,Qwen/Qwen3-0.6B

Before it starts, the command prints the file it will write. The default user config locations are platform-specific:
| OS | User config directory |
|---|---|
| Linux / XDG | $XDG_CONFIG_HOME/mothertoken or ~/.config/mothertoken |
| macOS | ~/Library/Application Support/mothertoken |
| Windows | %APPDATA%\mothertoken |
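
The lookup in the table can be sketched in Python as follows. This is a minimal illustration, not mothertoken's actual implementation: the function name `user_config_dir` is hypothetical, and the fallback used when `APPDATA` is unset is an assumption that the table does not cover.

```python
import os
import sys
from pathlib import Path


def user_config_dir(platform=None, env=None, home=None):
    """Resolve the mothertoken user config directory per the table above."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if platform.startswith("win"):
        # Windows: %APPDATA%\mothertoken
        appdata = env.get("APPDATA")
        # Assumption: fall back to the usual roaming profile path if unset.
        base = Path(appdata) if appdata else home / "AppData" / "Roaming"
        return base / "mothertoken"
    if platform == "darwin":
        # macOS: ~/Library/Application Support/mothertoken
        return home / "Library" / "Application Support" / "mothertoken"
    # Linux / XDG: $XDG_CONFIG_HOME/mothertoken, else ~/.config/mothertoken
    xdg = env.get("XDG_CONFIG_HOME")
    base = Path(xdg) if xdg else home / ".config"
    return base / "mothertoken"


print(user_config_dir())
```

Passing `platform`, `env`, and `home` explicitly makes it easy to check each row of the table without switching machines.
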
To write somewhere else:
mothertoken benchmark run \
--languages eng_Latn,arb_Arab \
--models gpt-4o,Qwen/Qwen3-0.6B \
--output benchmark.json

Make an existing benchmark active:
mothertoken benchmark use benchmark.json
mothertoken benchmark status

Return to the bundled default benchmark:

mothertoken benchmark use --default

Benchmark regeneration and model-extension docs live in docs/benchmarking.md.
You can also benchmark a direct Hugging Face ref without adding it to default_tokenizers.yaml:
uv run mothertoken benchmark run --languages eng_Latn,arb_Arab --models Qwen/Qwen3-0.6B

License: MIT



