An artificial language designed for AI-to-AI communication, optimized for token efficiency.
ULP is not a protocol — it's a language. Like French or English, it has a grammar, a lexicon, and composition rules. But unlike natural languages, every word in ULP carries maximum semantic density with zero redundancy: no articles, no conjugations, no filler words.
An AI recognizes ULP the same way it recognizes French — by structure and vocabulary. No header, no flag. If it receives ULP, it responds in ULP. If it receives English, it responds in English.
Natural languages waste tokens. Consider this system prompt:
You are an assistant specialized in French law. Respond precisely and cite
your sources. If you're unsure, say so clearly. Never give definitive legal
advice. Respond in French. Use a professional tone.
60 tokens in tiktoken (cl100k_base).
The same instruction in ULP:
imag exp jur fra pro cert exp neg def lang fra
10 tokens. Same meaning. 6x compression, and the saving recurs on every API call that sends this prompt.
All numbers measured with tiktoken cl100k_base (GPT-4 tokenizer) on 110 real-world prompt pairs across 12 categories and 4 languages.
| Metric | Value |
|---|---|
| Global compression ratio | 1.50x |
| Token savings | 33.4% |
| ULP wins | 97/110 pairs (88%) |
| Best case (system prompt FR) | 6.33x |
| Worst case (short EN query) | 0.67x |
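The ratio and savings figures are linked by simple arithmetic: at a compression ratio r, ULP uses 1/r of the original tokens, so the saving is 1 - 1/r. A minimal sketch of that relationship follows; the function names are illustrative and are not taken from the project's scripts.

```python
def compression_ratio(nl_tokens: int, ulp_tokens: int) -> float:
    """Natural-language token count divided by ULP token count."""
    if ulp_tokens <= 0:
        raise ValueError("ulp_tokens must be positive")
    return nl_tokens / ulp_tokens


def token_savings(ratio: float) -> float:
    """Fraction of tokens saved at a given ratio (negative when ULP is longer)."""
    return 1.0 - 1.0 / ratio


# System-prompt example: 60 natural-language tokens vs 10 ULP tokens.
ratio = compression_ratio(60, 10)
print(f"{ratio:.2f}x, {token_savings(ratio):.1%} saved")  # 6.00x, 83.3% saved

# Global ratio and worst case from the table.
print(f"{token_savings(1.50):.1%}")  # 33.3%
print(f"{token_savings(0.67):.1%}")  # -49.3%
```

The small gap between 33.3% here and the reported 33.4% is consistent with the 1.50x ratio having been rounded to two decimals. A ratio below 1.0, as in the worst case, means the ULP form costs more tokens than the original.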
| Category | Ratio | Best use case |
|---|---|---|
| Summarization | 2.36x | "Summarize in 3 bullet points" |
| Translation | 2.17x | "Translate FR→EN, formal tone" |
| System prompts | 2.12x | Persona + constraints + tone |
| Format conversion | 2.00x | "Convert JSON to CSV" |
| Rewriting | 1.70x | "Simplify for general audience" |
| Extraction | 1.39x | "Extract dates and names" |
| Knowledge QA | 1.39x | "Explain X simply" |
| Coding | 1.38x | "Implement REST API" |
| Email writing | 1.34x | "Write professional follow-up" |
| Classification | 1.33x | "Classify sentiment pos/neg/neu" |
| Creative writing | 1.32x | "Write a sci-fi short story" |
| Multi-step | 1.30x | "Parse → extract → rank → format" |
Note: These are conservative results on short prompts (avg 14.5 tokens). Real-world system prompts (200-500 tokens) are expected to show higher ratios, because they carry more of the redundancy that ULP removes.
1. Order is fixed: ACTION [TARGET] [MODIFIERS] [CONSTRAINTS]
sum txt lon 3 nod   → Summarize the text in 3 points
│   │   │   │ └─ nodes/points
│   │   │   └─ quantity: 3
│   │   └─ length
│   └─ target: provided text
└─ action: summarize
2. Semicolons chain sequential instructions (like Unix pipes)
pars doc ; sum lon 5 nod ; trad eng
= Read the document → summarize in 5 points → translate to English
3. Brackets [ ] escape to natural language or define options
cls [spam|ham] txt → classify as spam or not
ext [dates|names|locations] txt → extract these types
gen [poem ocean sunset] → literal creative prompt
4. That's it. No articles, no conjugations, no prepositions. The AI infers the rest from morpheme types and position.
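The rules above are small enough to sketch as a parser. The following is an illustrative reading of them, not the project's reference implementation: `Step` and `parse_ulp` are hypothetical names, and the sketch only separates the action from its arguments without classifying targets, modifiers, or constraints.

```python
import re
from dataclasses import dataclass, field

# A bracket group, a ';' separator, or any other run of non-space characters.
_TOKEN = re.compile(r"\[[^\]]*\]|;|[^\s;]+")


@dataclass
class Step:
    action: str
    # Each argument is a morpheme (str) or a bracket group (list of options).
    args: list = field(default_factory=list)


def _make_step(tokens: list) -> Step:
    action, *rest = tokens  # rule 1: the action always comes first
    args = []
    for tok in rest:
        if tok.startswith("[") and tok.endswith("]"):
            # Rule 3: brackets hold '|'-separated options or literal text.
            args.append([opt.strip() for opt in tok[1:-1].split("|")])
        else:
            args.append(tok)
    return Step(action, args)


def parse_ulp(source: str) -> list:
    """Split a ULP string into sequential steps (rule 2)."""
    steps, current = [], []
    for tok in _TOKEN.findall(source):
        if tok == ";":
            if current:
                steps.append(_make_step(current))
            current = []
        else:
            current.append(tok)
    if current:
        steps.append(_make_step(current))
    return steps


for step in parse_ulp("pars doc ; sum lon 5 nod ; trad eng"):
    print(step.action, step.args)
```

Tokenizing before splitting on `;` keeps a semicolon inside brackets from being treated as a chain separator. Nested brackets are not handled, since the rules above do not define them.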
Every morpheme in ULP is exactly 1 BPE token in tiktoken cl100k_base. This was validated empirically — 81 candidates were tested, 19 failed and were replaced with alternatives that pass.
| Category | Count | Examples |
|---|---|---|
| Actions | 19 | gen ext impl sum trad eval cls fmt rank ... |
| Domains | 8 | exp jur med fin edu sci dev rol |
| Formats | 10 | json csv md html tbl bul par step rap nar |
| Languages | 10 | eng fra esp ger ita por rus ara hin kor |
| Mod types | 6 | ton lon lang urg cert pub |
| Mod values | 9 | pro inf form tech ami fun pol det short |
| Numerics | 3 | tri pent dek |
| Structures | 7 | ask cond neg rez prev akt lim |
| Targets | 6 | txt dat doc src usr tgt |
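The one-token rule can be checked mechanically. Below is a hedged sketch of such a check, not the contents of `ulp_validate_tokenizer.py`: `multi_token_morphemes` and `cl100k_encode` are hypothetical helpers, and the encoder is passed in as a function so the check does not depend on a particular tokenizer. The only tiktoken calls used are `tiktoken.get_encoding` and `Encoding.encode`.

```python
from typing import Callable, Dict, Iterable, Sequence


def multi_token_morphemes(
    candidates: Iterable[str],
    encode: Callable[[str], Sequence[int]],
) -> Dict[str, int]:
    """Return {morpheme: token_count} for every candidate that is not exactly 1 token."""
    failures = {}
    for morpheme in candidates:
        count = len(encode(morpheme))
        if count != 1:
            failures[morpheme] = count
    return failures


def cl100k_encode(text: str) -> list:
    """Encode with tiktoken's cl100k_base (requires `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the checker has no hard dependency

    return tiktoken.get_encoding("cl100k_base").encode(text)


# Usage, assuming tiktoken is installed:
#   failures = multi_token_morphemes(["sum", "trad", "zho", "jpn"], cl100k_encode)
#   print(failures or "all candidates are single tokens")
```

One caveat: BPE tokenizers generally encode a word differently with and without a leading space, and morphemes inside a ULP string are usually preceded by one. Whether the project validated the bare or the space-prefixed form is not stated here, so it may be worth checking both.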
| Natural language | ULP | NL tokens | ULP tokens | Ratio |
|---|---|---|---|---|
| Summarize this paper as bullet points | sum doc fmt bul | 12 | 4 | 3.0x |
| Translate FR→EN, formal tone | trad fra eng txt form | 12 | 5 | 2.4x |
| Classify sentiment: pos/neg/neu | eval emo txt cls [pos\|neg\|neu] | 20 | 12 | 1.7x |
| System prompt: legal expert FR | imag exp jur fra pro cert exp neg def lang fra | 60 | 10 | 6.0x |
| Read → summarize → translate | pars doc ; sum lon 5 nod ; trad eng | 22 | 10 | 2.2x |
| Write professional email | gen email pro [client payment reminder] | 15 | 8 | 1.9x |
| Compare TCP vs UDP | cmp [TCP\|UDP] | 8 | 6 | 1.3x |
| Python: sort dicts by key | impl src [Python sort dicts by key] | 14 | 9 | 1.6x |
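When several pairs are combined into one figure, the aggregation method matters. The sketch below uses only the token counts from the table above and compares a pooled ratio (total NL tokens over total ULP tokens) with the mean of the per-row ratios. Which of the two the benchmark's global ratio uses is not stated in this README, so treat the choice as an open assumption.

```python
# (NL tokens, ULP tokens) for the eight example rows, in table order.
ROWS = [
    (12, 4),   # summarize as bullet points
    (12, 5),   # translate FR->EN, formal
    (20, 12),  # classify sentiment
    (60, 10),  # legal expert system prompt
    (22, 10),  # read -> summarize -> translate
    (15, 8),   # professional email
    (8, 6),    # compare TCP vs UDP
    (14, 9),   # Python sort
]

nl_total = sum(nl for nl, _ in ROWS)
ulp_total = sum(ulp for _, ulp in ROWS)

pooled = nl_total / ulp_total
mean_of_ratios = sum(nl / ulp for nl, ulp in ROWS) / len(ROWS)

print(f"{nl_total} NL tokens vs {ulp_total} ULP tokens")  # 163 NL tokens vs 64 ULP tokens
print(f"pooled {pooled:.2f}x, mean of ratios {mean_of_ratios:.2f}x")  # pooled 2.55x, mean of ratios 2.50x
```

The pooled figure weights long prompts more heavily, which is closer to what an API bill reflects; the mean of ratios treats every prompt equally regardless of length.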
This project is built on empirical data, not intuition.
Analyzed 933,921 real prompts from the LMSYS-Chat-1M dataset (1M conversations with 25 LLMs). Key findings:
- 5 action morphemes cover 69% of all detected intentions (gen, ext, impl, imag, rank)
- System prompts average 214 tokens with 35% redundancy, which makes them the highest-impact use case
- French has the highest redundancy rate (38%), Chinese the lowest (0%)
- 54.6% of prompts contain enumerations → validates the | separator
- "the" appears 481,625 times in the sample: pure waste
- 81 candidate morphemes tested against tiktoken cl100k_base
- 62 passed (1 token), 19 failed (2 tokens) → replaced with alternatives
- Final lexicon: 78 morphemes, 100% at 1 token BPE, 9 categories
- Missing: Chinese (zho) and Japanese (jpn), for which no 1-token alternative was found (planned for v1.0 with dedicated tokenizer)
First benchmark (v0.1) showed 0.99x — ULP wasn't compressing at all. The · separator cost 1 extra token per composition. Diagnosed 4 problems, applied 4 optimizations, re-benchmarked:
| | v0.1 | v0.2 |
|---|---|---|
| Global ratio | 0.99x | 1.50x |
| ULP wins | 38/110 | 97/110 |
| ULP loses | 56/110 | 7/110 |
v0.1/v0.2 (current): Morphemes are English-like fragments optimized for existing BPE tokenizers. This proves the concept works.
v1.0 (planned): Morphemes will be language-neutral forms that don't exist in any natural language, paired with a dedicated tokenizer or an extension to existing open-source tokenizers. The goal is to remove the remaining overhead and raise compression ratios further.
├── README.md ← You are here
├── docs/
│ ├── ULP-Whitepaper-v2.0.pdf ← Whitepaper (vision, state of the art, benchmark)
│ ├── ULP-Phase1.1-Analyse-Empirique.md ← Corpus analysis (933K prompts)
│ ├── ULP-Grammaire-v0.2.md ← Grammar specification
│ ├── ULP-Lexique-v0.2.md ← Lexicon (78 morphemes)
│ ├── ULP-Benchmark-v0.1-Resultats.md ← First benchmark (honest 0.99x)
│ ├── ULP-Benchmark-v0.2-Resultats.md ← Optimized benchmark (1.50x)
│ └── ULP-100-Traductions.md ← 110 translation examples
├── data/
│ ├── ulp-lexicon-v0.1.json ← Lexicon with token IDs
│ ├── ulp-benchmark-100.json ← v0.1 benchmark raw data
│ ├── ulp-benchmark-v02.json ← v0.2 benchmark raw data
│ ├── ulp-analysis-full-report.json ← Corpus analysis results
│ ├── ulp-morpheme-candidates.json ← Initial candidates
│ └── ulp-tokenizer-validation.json ← Tokenizer test results
├── scripts/
│ ├── ulp_analyze_lmsys.py ← Corpus analysis script
│ ├── ulp_validate_tokenizer.py ← Tokenizer validation
│ ├── ulp_revalidate.py ← Replacement validation
│ ├── ulp_benchmark_100.py ← v0.1 benchmark
│ ├── ulp_benchmark_v02.py ← v0.2 benchmark
│ ├── ulp_optimize.py ← Optimization tests
│ └── ulp_diag.py ← Parquet diagnostic
└── LICENSE
- Phase 0 — Whitepaper & publication
- Phase 1.1 — Corpus analysis (933K prompts from LMSYS-Chat-1M)
- Phase 1.2 — Lexicon v0.1 (78 morphemes, 100% at 1 token BPE)
- Phase 1.2 — Grammar v0.2 + benchmark (1.50x compression)
- Phase 2 — Training dataset (NL→ULP pairs for fine-tuning)
- Phase 3 — PoC fine-tuning (prove an LLM can natively produce ULP)
- Phase 4 — Consortium (standardization proposal)
- Bijection — 1 morpheme = 1 concept. 1 concept = 1 morpheme. Zero homonymy, zero synonymy, zero polysemy.
- Native detection — No header, no flag. The AI detects ULP like it detects French.
- Token efficiency — Every morpheme = exactly 1 BPE token on current tokenizers.
- Empiricism first — Lexicon and grammar are derived from real corpus analysis, not theoretical intuition.
- Minimal punctuation — Only 3 symbols:
  | (alternation), ; (chaining), [ ] (groups). Space separates morphemes.
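The bijection principle can be checked automatically on any lexicon draft. The sketch below is one possible check, under the assumption that the lexicon is available as (morpheme, concept) pairs; `bijection_violations` is a hypothetical helper and the sample entries are illustrative, with one deliberately conflicting pair added to show a failure.

```python
from collections import defaultdict


def bijection_violations(pairs):
    """Return (polysemous, synonymous).

    polysemous: morphemes mapped to more than one concept.
    synonymous: concepts expressed by more than one morpheme.
    """
    by_morpheme, by_concept = defaultdict(set), defaultdict(set)
    for morpheme, concept in pairs:
        by_morpheme[morpheme].add(concept)
        by_concept[concept].add(morpheme)
    polysemous = {m: sorted(c) for m, c in by_morpheme.items() if len(c) > 1}
    synonymous = {c: sorted(m) for c, m in by_concept.items() if len(m) > 1}
    return polysemous, synonymous


# Illustrative entries; the last pair is a deliberate conflict, not a real lexicon entry.
sample = [
    ("sum", "summarize"),
    ("trad", "translate"),
    ("eng", "English"),
    ("sum", "addition"),
]
polysemous, synonymous = bijection_violations(sample)
print(polysemous)  # {'sum': ['addition', 'summarize']}
print(synonymous)  # {}
```

Pairs are used instead of a dictionary because a dictionary would silently overwrite a morpheme that appears twice, hiding exactly the conflict the check is meant to find.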
ULP is an open research project. Contributions welcome:
- New morphemes — Propose additions to the lexicon (must pass 1-token BPE validation)
- Benchmark extensions — Test on longer prompts, different languages, new categories
- Tokenizer research — Help design the language-neutral v1.0 morphemes
- Fine-tuning experiments — Train a model to natively recognize and produce ULP
- Translations — Add translation examples in your language
Open an issue or submit a PR.
@misc{ulp2026,
title={ULP: Universal Language Protocol — An Artificial Language for Token-Efficient AI Communication},
author={Yacine Ayari},
year={2026},
url={https://github.com/YacineAyari/Universal-Language-Protocol}
}

Corpus source:
@misc{zheng2023lmsyschat1m,
title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset},
author={Lianmin Zheng et al.},
year={2023},
eprint={2309.11998},
archivePrefix={arXiv}
}

MIT License: see LICENSE.