ULP — Universal Language Protocol

An artificial language designed for AI-to-AI communication, optimized for token efficiency.

ULP is not a protocol — it's a language. Like French or English, it has a grammar, a lexicon, and composition rules. But unlike natural languages, every word in ULP carries maximum semantic density with zero redundancy: no articles, no conjugations, no filler words.

An AI recognizes ULP the same way it recognizes French — by structure and vocabulary. No header, no flag. If it receives ULP, it responds in ULP. If it receives English, it responds in English.

Why ULP?

Natural languages waste tokens. Consider this system prompt:

You are an assistant specialized in French law. Respond precisely and cite
your sources. If you're unsure, say so clearly. Never give definitive legal
advice. Respond in French. Use a professional tone.

60 tokens in tiktoken (cl100k_base).

The same instruction in ULP:

imag exp jur fra pro cert exp neg def lang fra

10 tokens. Same meaning. 6x compression, and the saving recurs on every API call that resends the prompt.

Benchmark results (measured, not estimated)

All numbers measured with tiktoken cl100k_base (GPT-4 tokenizer) on 110 real-world prompt pairs across 12 categories and 4 languages.

Metric	Value
Global compression ratio	1.50x
Token savings	33.4%
ULP wins	97/110 pairs (88%)
Best case (system prompt FR)	6.33x
Worst case (short EN query)	0.67x
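To see how these headline metrics relate to each other, here is a minimal sketch of the aggregation. It assumes the global ratio is token-weighted (total NL tokens divided by total ULP tokens); the README does not state which aggregation the benchmark script uses. The function name `summarize` and the token counts are illustrative, not benchmark data.

```python
from typing import Dict, Iterable, Tuple

def summarize(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Aggregate (nl_tokens, ulp_tokens) pairs into benchmark-style metrics.

    Assumption: the global ratio is sum(NL) / sum(ULP),
    not the mean of the per-pair ratios.
    """
    pairs = list(pairs)
    nl_total = sum(nl for nl, _ in pairs)
    ulp_total = sum(ulp for _, ulp in pairs)
    return {
        "ratio": nl_total / ulp_total,
        "savings_pct": 100 * (1 - ulp_total / nl_total),
        "wins": sum(ulp < nl for nl, ulp in pairs),    # ULP is shorter
        "losses": sum(ulp > nl for nl, ulp in pairs),  # ULP is longer
        "ties": sum(ulp == nl for nl, ulp in pairs),
    }

# Toy counts for illustration only; (60, 10) mirrors the system prompt example.
stats = summarize([(60, 10), (12, 4), (8, 6), (4, 6), (5, 5)])
print(stats)
```

Under this assumption the savings follow directly from the ratio as 1 - 1/ratio, so a 1.50x ratio corresponds to roughly one third fewer tokens.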

By category

Category	Ratio	Best use case
Summarization	2.36x	"Summarize in 3 bullet points"
Translation	2.17x	"Translate FR→EN, formal tone"
System prompts	2.12x	Persona + constraints + tone
Format conversion	2.00x	"Convert JSON to CSV"
Rewriting	1.70x	"Simplify for general audience"
Extraction	1.39x	"Extract dates and names"
Knowledge QA	1.39x	"Explain X simply"
Coding	1.38x	"Implement REST API"
Email writing	1.34x	"Write professional follow-up"
Classification	1.33x	"Classify sentiment pos/neg/neu"
Creative writing	1.32x	"Write a sci-fi short story"
Multi-step	1.30x	"Parse → extract → rank → format"

Note: These results are conservative because the benchmark prompts are short (avg 14.5 tokens). Real-world system prompts (200-500 tokens) are expected to show higher ratios, but this still needs to be measured on longer prompts.

How it works

Grammar: 4 rules

1. Order is fixed: ACTION [TARGET] [MODIFIERS] [CONSTRAINTS] (in this template, square brackets mark optional slots)

sum txt lon 3 nod          → Summarize the text in 3 points
│   │   │   │ └─ nodes/points
│   │   │   └─ quantity: 3
│   │   └─ length
│   └─ target: provided text
└─ action: summarize

2. Semicolons chain sequential instructions (like Unix pipes)

pars doc ; sum lon 5 nod ; trad eng
= Read the document → summarize in 5 points → translate to English

3. Brackets [ ] escape to natural language or define options

cls [spam|ham] txt                    → classify as spam or not
ext [dates|names|locations] txt       → extract these types
gen [poem ocean sunset]               → literal creative prompt

4. That's it. No articles, no conjugations, no prepositions. The AI infers the rest from morpheme types and position.
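The rules above can be sketched as a small parser. This is an illustration under stated assumptions, not the project's implementation: the vocabularies are partial (taken from the lexicon table and from morphemes the README's examples use in action position), the names `split_chain`, `parse_step` and `parse` are hypothetical, a `;` inside brackets is assumed not to chain, and modifiers and constraints are returned together because the README does not say how to tell them apart.

```python
import re
from typing import Dict, List

# Partial vocabularies; the full lists live in the project's lexicon file.
# "pars", "imag" and "cmp" are included because the README's examples
# use them in action position.
ACTIONS = {"gen", "ext", "impl", "sum", "trad", "eval", "cls", "fmt", "rank",
           "pars", "imag", "cmp"}
TARGETS = {"txt", "dat", "doc", "src", "usr", "tgt"}
BRACKET = re.compile(r"\[([^\[\]]*)\]")  # non-nested [ ... ] groups only

def split_chain(program: str) -> List[str]:
    """Rule 2: split on ';' into ordered steps, ignoring ';' inside [ ]."""
    steps, current, depth = [], [], 0
    for ch in program:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(0, depth - 1)
        if ch == ";" and depth == 0:
            steps.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    steps.append("".join(current).strip())
    return [s for s in steps if s]

def parse_step(step: str) -> Dict[str, object]:
    """Rules 1 and 3: pull out bracket groups, then read slots by position."""
    groups = [[opt.strip() for opt in body.split("|")]
              for body in BRACKET.findall(step)]
    words = BRACKET.sub(" ", step).split()
    if not words or words[0] not in ACTIONS:
        raise ValueError(f"step must start with a known action: {step!r}")
    action, rest = words[0], words[1:]
    target = None
    if rest and rest[0] in TARGETS:
        target, rest = rest[0], rest[1:]
    # "rest" holds modifiers and constraints, left unclassified in this sketch.
    return {"action": action, "target": target, "rest": rest, "groups": groups}

def parse(program: str) -> List[Dict[str, object]]:
    """Parse a whole ULP program into a list of steps."""
    return [parse_step(step) for step in split_chain(program)]

for step in parse("pars doc ; sum lon 5 nod ; trad eng"):
    print(step)
print(parse("cls [spam|ham] txt"))
```

The sketch drops the position of each bracket group relative to the other words, which is enough to show the three mechanisms but would lose information in a real interpreter.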

Lexicon: 78 morphemes, all validated at 1 token BPE

Every morpheme in ULP is exactly 1 BPE token in tiktoken cl100k_base. This was validated empirically — 81 candidates were tested, 19 failed and were replaced with alternatives that pass.

Category	Count	Examples
Actions	19	`gen` `ext` `impl` `sum` `trad` `eval` `cls` `fmt` `rank` ...
Domains	8	`exp` `jur` `med` `fin` `edu` `sci` `dev` `rol`
Formats	10	`json` `csv` `md` `html` `tbl` `bul` `par` `step` `rap` `nar`
Languages	10	`eng` `fra` `esp` `ger` `ita` `por` `rus` `ara` `hin` `kor`
Mod types	6	`ton` `lon` `lang` `urg` `cert` `pub`
Mod values	9	`pro` `inf` `form` `tech` `ami` `fun` `pol` `det` `short`
Numerics	3	`tri` `pent` `dek`
Structures	7	`ask` `cond` `neg` `rez` `prev` `akt` `lim`
Targets	6	`txt` `dat` `doc` `src` `usr` `tgt`
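The 1-token check described above could look roughly like the following. This is a sketch, not the repository's `ulp_validate_tokenizer.py`: the function names are hypothetical, and checking both the bare and the space-prefixed form is an assumption, since the README does not say which form was validated. The encoder is passed in as a callable so the checker has no hard dependency; `tiktoken.get_encoding("cl100k_base").encode` is the real call it is meant to wrap.

```python
from typing import Callable, Dict, Iterable, List

def multi_token_forms(
    candidates: Iterable[str],
    encode: Callable[[str], List[int]],
) -> Dict[str, int]:
    """Return {form: token_count} for every checked form that is not 1 token.

    Both "morpheme" and " morpheme" are checked, because a BPE tokenizer
    usually merges a mid-sentence word with its leading space.
    """
    failures: Dict[str, int] = {}
    for morpheme in candidates:
        for form in (morpheme, " " + morpheme):
            count = len(encode(form))
            if count != 1:
                failures[form] = count
    return failures

def load_cl100k_encode() -> Callable[[str], List[int]]:
    """Build an encoder from tiktoken's cl100k_base (needs `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the checker itself stays dependency-free
    return tiktoken.get_encoding("cl100k_base").encode
```

To run it against the real tokenizer, call `multi_token_forms(candidates, load_cl100k_encode())`; an empty result means every candidate passed in both forms.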

Examples

Natural language	ULP	NL tokens	ULP tokens	Ratio
Summarize this paper as bullet points	`sum doc fmt bul`	12	4	3.0x
Translate FR→EN, formal tone	`trad fra eng txt form`	12	5	2.4x
Classify sentiment: pos/neg/neu	`eval emo txt cls [pos|neg|neu]`	20	12	1.7x
System prompt: legal expert FR	`imag exp jur fra pro cert exp neg def lang fra`	60	10	6.0x
Read → summarize → translate	`pars doc ; sum lon 5 nod ; trad eng`	22	10	2.2x
Write professional email	`gen email pro [client payment reminder]`	15	8	1.9x
Compare TCP vs UDP	`cmp [TCP|UDP]`	8	6	1.3x
Python: sort dicts by key	`impl src [Python sort dicts by key]`	14	9	1.6x

Research methodology

This project is built on empirical data, not intuition.

Phase 1.1 — Corpus analysis

Analyzed 933,921 real prompts from the LMSYS-Chat-1M dataset (1M conversations with 25 LLMs). Key findings:

5 action morphemes cover 69% of all detected intentions (gen, ext, impl, imag, rank)
System prompts average 214 tokens with 35% redundancy — the highest-impact use case
French has the highest redundancy rate (38%), Chinese the lowest (0%)
54.6% of prompts contain enumerations → validates the | separator
"the" appears 481,625 times in the sample — pure waste

Phase 1.2 — Lexicon design & tokenizer validation

81 candidate morphemes tested against tiktoken cl100k_base
62 passed (1 token), 19 failed (2 tokens) → replaced with alternatives
Final lexicon: 78 morphemes, 100% at 1 token BPE, 9 categories
Missing: Chinese (zho) and Japanese (jpn) — no 1-token alternative found (planned for v1.0 with dedicated tokenizer)

Benchmark — Honest results

The first benchmark (v0.1) measured 0.99x: ULP was not compressing at all. Among other issues, the · separator cost 1 extra token per composition. After diagnosing 4 problems and applying 4 optimizations, the re-run gave:

	v0.1	v0.2
Global ratio	0.99x	1.50x
ULP wins	38/110	97/110
ULP loses	56/110	7/110

Tokenizer strategy: two phases

v0.1/v0.2 (current): Morphemes are English-like fragments optimized for existing BPE tokenizers. This phase is the proof of concept: the v0.2 benchmark reaches 1.50x on cl100k_base.

v1.0 (planned): Morphemes will be language-neutral forms that don't exist in any natural language, paired with a dedicated tokenizer or an extension to existing open-source tokenizers. The aim is to eliminate the remaining overhead and push ratios higher.

Repository structure

├── README.md                              ← You are here
├── docs/
│   ├── ULP-Whitepaper-v2.0.pdf            ← Whitepaper (vision, state of the art, benchmark)
│   ├── ULP-Phase1.1-Analyse-Empirique.md  ← Corpus analysis (933K prompts)
│   ├── ULP-Grammaire-v0.2.md              ← Grammar specification
│   ├── ULP-Lexique-v0.2.md                ← Lexicon (78 morphemes)
│   ├── ULP-Benchmark-v0.1-Resultats.md    ← First benchmark (honest 0.99x)
│   ├── ULP-Benchmark-v0.2-Resultats.md    ← Optimized benchmark (1.50x)
│   └── ULP-100-Traductions.md             ← 110 translation examples
├── data/
│   ├── ulp-lexicon-v0.1.json              ← Lexicon with token IDs
│   ├── ulp-benchmark-100.json             ← v0.1 benchmark raw data
│   ├── ulp-benchmark-v02.json             ← v0.2 benchmark raw data
│   ├── ulp-analysis-full-report.json      ← Corpus analysis results
│   ├── ulp-morpheme-candidates.json       ← Initial candidates
│   └── ulp-tokenizer-validation.json      ← Tokenizer test results
├── scripts/
│   ├── ulp_analyze_lmsys.py               ← Corpus analysis script
│   ├── ulp_validate_tokenizer.py          ← Tokenizer validation
│   ├── ulp_revalidate.py                  ← Replacement validation
│   ├── ulp_benchmark_100.py               ← v0.1 benchmark
│   ├── ulp_benchmark_v02.py               ← v0.2 benchmark
│   ├── ulp_optimize.py                    ← Optimization tests
│   └── ulp_diag.py                        ← Parquet diagnostic
└── LICENSE

Roadmap

Phase 0 — Whitepaper & publication
Phase 1.1 — Corpus analysis (933K prompts from LMSYS-Chat-1M)
Phase 1.2 — Lexicon v0.1 (78 morphemes, 100% at 1 token BPE)
Phase 1.2 — Grammar v0.2 + benchmark (1.50x compression)
Phase 2 — Training dataset (NL→ULP pairs for fine-tuning)
Phase 3 — PoC fine-tuning (prove an LLM can natively produce ULP)
Phase 4 — Consortium (standardization proposal)

Design principles

Bijection — 1 morpheme = 1 concept. 1 concept = 1 morpheme. Zero homonymy, zero synonymy, zero polysemy.
Native detection — No header, no flag. The AI detects ULP like it detects French.
Token efficiency — Every morpheme = exactly 1 BPE token on current tokenizers.
Empiricism first — Lexicon and grammar are derived from real corpus analysis, not theoretical intuition.
Minimal punctuation — Only 3 symbols: | (alternation), ; (chaining), [ ] (groups). Space separates morphemes.
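The bijection principle is mechanically checkable. The sketch below assumes the lexicon can be loaded as a morpheme-to-concept mapping; the function name and the sample glosses are hypothetical, and `res` is an invented morpheme used only to trigger a violation.

```python
from collections import defaultdict
from typing import Dict, List

def bijection_violations(lexicon: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each concept that is expressed by more than one morpheme.

    A dict already enforces one concept per morpheme, so only the
    reverse direction (synonymy) needs an explicit check.
    """
    by_concept: Dict[str, List[str]] = defaultdict(list)
    for morpheme, concept in lexicon.items():
        by_concept[concept].append(morpheme)
    return {c: ms for c, ms in by_concept.items() if len(ms) > 1}

# Hypothetical glosses, for illustration only.
sample = {"sum": "summarize", "trad": "translate", "res": "summarize"}
print(bijection_violations(sample))  # {'summarize': ['sum', 'res']}
```

A check like this could run alongside the 1-token validation whenever a new morpheme is proposed.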

Contributing

ULP is an open research project. Contributions welcome:

New morphemes — Propose additions to the lexicon (must pass 1-token BPE validation)
Benchmark extensions — Test on longer prompts, different languages, new categories
Tokenizer research — Help design the language-neutral v1.0 morphemes
Fine-tuning experiments — Train a model to natively recognize and produce ULP
Translations — Add translation examples in your language

Open an issue or submit a PR.

Citation

@misc{ulp2026,
  title={ULP: Universal Language Protocol — An Artificial Language for Token-Efficient AI Communication},
  author={Yacine Ayari},
  year={2026},
  url={https://github.com/YacineAyari/Universal-Language-Protocol}
}

Corpus source:

@misc{zheng2023lmsyschat1m,
  title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset},
  author={Lianmin Zheng and others},
  year={2023},
  eprint={2309.11998},
  archivePrefix={arXiv}
}

License

MIT License — see LICENSE.
