ULP — Universal Language Protocol

An artificial language designed for AI-to-AI communication, optimized for token efficiency.

ULP is not a protocol — it's a language. Like French or English, it has a grammar, a lexicon, and composition rules. But unlike natural languages, every word in ULP carries maximum semantic density with zero redundancy: no articles, no conjugations, no filler words.

An AI is meant to recognize ULP the same way it recognizes French: by structure and vocabulary. No header, no flag. If it receives ULP, it responds in ULP. If it receives English, it responds in English.


Why ULP?

Natural languages waste tokens. Consider this system prompt:

You are an assistant specialized in French law. Respond precisely and cite
your sources. If you're unsure, say so clearly. Never give definitive legal
advice. Respond in French. Use a professional tone.

60 tokens in tiktoken (cl100k_base).

The same instruction in ULP:

imag exp jur fra pro cert exp neg def lang fra

10 tokens for the same instruction: a 6x compression, and the saving recurs on every API call that resends the system prompt.


Benchmark results (measured, not estimated)

All numbers measured with tiktoken cl100k_base (GPT-4 tokenizer) on 110 real-world prompt pairs across 12 categories and 4 languages.

| Metric | Value |
| --- | --- |
| Global compression ratio | 1.50x |
| Token savings | 33.4% |
| ULP wins | 97/110 pairs (88%) |
| Best case (system prompt FR) | 6.33x |
| Worst case (short EN query) | 0.67x |
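The ratio and savings figures in this table are two views of the same quantity: a compression ratio r (NL tokens divided by ULP tokens) corresponds to a token saving of 1 - 1/r. The sketch below shows that conversion in both directions. The function names are illustrative and are not taken from the repository's scripts.

```python
def token_savings(ratio: float) -> float:
    """Fraction of tokens saved for a compression ratio: 1 - 1/ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Inverse conversion: the compression ratio implied by a savings fraction."""
    if savings >= 1:
        raise ValueError("savings must be below 1 (100%)")
    return 1.0 / (1.0 - savings)


if __name__ == "__main__":
    # A ratio above 1 gives a positive saving.
    print(f"{token_savings(1.50):.1%}")
    # A ratio below 1 gives a negative saving: the ULP form is longer.
    print(f"{token_savings(0.67):.1%}")
    # The ratio implied by the reported 33.4% saving.
    print(f"{ratio_from_savings(0.334):.3f}x")
```

A saving of 33.4% implies a ratio slightly above 1.50, so the small gap between the two reported figures is presumably a rounding effect.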

By category

| Category | Ratio | Best use case |
| --- | --- | --- |
| Summarization | 2.36x | "Summarize in 3 bullet points" |
| Translation | 2.17x | "Translate FR→EN, formal tone" |
| System prompts | 2.12x | Persona + constraints + tone |
| Format conversion | 2.00x | "Convert JSON to CSV" |
| Rewriting | 1.70x | "Simplify for general audience" |
| Extraction | 1.39x | "Extract dates and names" |
| Knowledge QA | 1.39x | "Explain X simply" |
| Coding | 1.38x | "Implement REST API" |
| Email writing | 1.34x | "Write professional follow-up" |
| Classification | 1.33x | "Classify sentiment pos/neg/neu" |
| Creative writing | 1.32x | "Write a sci-fi short story" |
| Multi-step | 1.30x | "Parse → extract → rank → format" |

Note: These results are conservative because the benchmark prompts are short (14.5 tokens on average). Longer real-world system prompts (200-500 tokens) are expected to show higher ratios.


How it works

Grammar: 4 rules

1. Order is fixed: ACTION [TARGET] [MODIFIERS] [CONSTRAINTS]

sum txt lon 3 nod          → Summarize the text in 3 points
│   │   │   │ └─ nodes/points
│   │   │   └─ quantity: 3
│   │   └─ length
│   └─ target: provided text
└─ action: summarize

2. Semicolons chain sequential instructions (like Unix pipes)

pars doc ; sum lon 5 nod ; trad eng
= Read the document → summarize in 5 points → translate to English

3. Brackets [ ] escape to natural language or define options

cls [spam|ham] txt                    → classify as spam or not
ext [dates|names|locations] txt       → extract these types
gen [poem ocean sunset]               → literal creative prompt

4. That's it. No articles, no conjugations, no prepositions. The AI infers the rest from morpheme types and position.
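The grammar above is small enough to sketch a parser for. The example below splits a ULP string into `;`-separated steps, treats the first morpheme of each step as the action, and separates `[a|b]` option groups from `[free text]` literals. It is a minimal illustration under those assumptions, not the project's reference implementation: the `Instruction` class and `parse` function are hypothetical names, and the sketch does not classify morphemes as targets, modifiers, or constraints, since that requires the lexicon.

```python
import re
from dataclasses import dataclass, field

# One token is a [bracket group], a ';' separator, or a bare morpheme.
_TOKEN = re.compile(r"\[([^\[\]]*)\]|(;)|([^\s;\[\]]+)")


@dataclass
class Instruction:
    action: str
    morphemes: list = field(default_factory=list)  # bare morphemes after the action
    options: list = field(default_factory=list)    # each [a|b|c] group as a list
    literals: list = field(default_factory=list)   # each [free text] group as a string


def parse(program: str) -> list:
    """Split a ULP string into Instruction objects, one per ';' step."""
    steps, current = [], None
    for match in _TOKEN.finditer(program):
        group, semicolon, word = match.groups()
        if semicolon:
            current = None  # the next morpheme starts a new step
        elif word is not None:
            if current is None:
                current = Instruction(action=word)
                steps.append(current)
            else:
                current.morphemes.append(word)
        else:
            if current is None:
                raise ValueError("a bracket group must follow an action")
            if "|" in group:
                current.options.append([o.strip() for o in group.split("|")])
            else:
                current.literals.append(group.strip())
    return steps


if __name__ == "__main__":
    for step in parse("pars doc ; sum lon 5 nod ; trad eng"):
        print(step.action, step.morphemes)
    print(parse("cls [spam|ham] txt")[0])
```

Tokenizing the whole string before splitting on `;` means a semicolon inside a bracketed literal does not end the step. Unbalanced brackets are not reported by this sketch.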

Lexicon: 78 morphemes, all validated at 1 token BPE

Every morpheme in ULP is exactly 1 BPE token in tiktoken cl100k_base. This was validated empirically — 81 candidates were tested, 19 failed and were replaced with alternatives that pass.

| Category | Count | Examples |
| --- | --- | --- |
| Actions | 19 | gen ext impl sum trad eval cls fmt rank ... |
| Domains | 8 | exp jur med fin edu sci dev rol |
| Formats | 10 | json csv md html tbl bul par step rap nar |
| Languages | 10 | eng fra esp ger ita por rus ara hin kor |
| Mod types | 6 | ton lon lang urg cert pub |
| Mod values | 9 | pro inf form tech ami fun pol det short |
| Numerics | 3 | tri pent dek |
| Structures | 7 | ask cond neg rez prev akt lim |
| Targets | 6 | txt dat doc src usr tgt |
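The one-token rule can be checked mechanically: encode each candidate and keep only those that produce exactly one token. The sketch below shows the shape of such a check with the encoder passed in as a function. `split_by_token_count` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in so the example runs without dependencies. It is not a BPE tokenizer and its results say nothing about real morphemes.

```python
from typing import Callable, Iterable, Sequence


def split_by_token_count(
    candidates: Iterable[str],
    encode: Callable[[str], Sequence[int]],
) -> tuple:
    """Return (passed, failed): candidates encoding to exactly one token, and the rest."""
    passed, failed = [], []
    for morpheme in candidates:
        (passed if len(encode(morpheme)) == 1 else failed).append(morpheme)
    return passed, failed


def toy_encode(text: str) -> list:
    """Demo-only stand-in: one fake token per 4 characters. NOT a real tokenizer."""
    return list(range(0, len(text), 4))


if __name__ == "__main__":
    # With tiktoken installed, a real check would pass
    # tiktoken.get_encoding("cl100k_base").encode instead of toy_encode.
    passed, failed = split_by_token_count(["sum", "trad", "summarize"], toy_encode)
    print("passed:", passed)
    print("failed:", failed)
```

One detail a real check has to settle is whether to encode each morpheme bare or with a leading space, since morphemes inside a ULP string follow a space and BPE tokenizers can treat the two forms differently.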

Examples

| Natural language | ULP | NL tokens | ULP tokens | Ratio |
| --- | --- | --- | --- | --- |
| Summarize this paper as bullet points | sum doc fmt bul | 12 | 4 | 3.0x |
| Translate FR→EN, formal tone | trad fra eng txt form | 12 | 5 | 2.4x |
| Classify sentiment: pos/neg/neu | eval emo txt cls [pos\|neg\|neu] | 20 | 12 | 1.7x |
| System prompt: legal expert FR | imag exp jur fra pro cert exp neg def lang fra | 60 | 10 | 6.0x |
| Read → summarize → translate | pars doc ; sum lon 5 nod ; trad eng | 22 | 10 | 2.2x |
| Write professional email | gen email pro [client payment reminder] | 15 | 8 | 1.9x |
| Compare TCP vs UDP | cmp [TCP\|UDP] | 8 | 6 | 1.3x |
| Python: sort dicts by key | impl src [Python sort dicts by key] | 14 | 9 | 1.6x |
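The token counts in the examples table can be aggregated in more than one way. The sketch below computes the per-row ratios, a pooled ratio (total NL tokens over total ULP tokens), and the mean of the per-row ratios. Which definition the benchmark uses for its global ratio is not stated here, so the pooled form is an assumption. The row labels are shortened and the function names are illustrative.

```python
# (label, NL tokens, ULP tokens) copied from the examples table.
ROWS = [
    ("Summarize paper as bullets", 12, 4),
    ("Translate FR->EN, formal", 12, 5),
    ("Classify sentiment", 20, 12),
    ("System prompt: legal expert FR", 60, 10),
    ("Read -> summarize -> translate", 22, 10),
    ("Write professional email", 15, 8),
    ("Compare TCP vs UDP", 8, 6),
    ("Python: sort dicts by key", 14, 9),
]


def per_row_ratios(rows):
    """Compression ratio of each pair: NL tokens / ULP tokens."""
    return [nl / ulp for _, nl, ulp in rows]


def pooled_ratio(rows):
    """Total NL tokens over total ULP tokens; long prompts weigh more."""
    return sum(nl for _, nl, _ in rows) / sum(ulp for _, _, ulp in rows)


def mean_ratio(rows):
    """Unweighted mean of the per-row ratios; every pair weighs the same."""
    ratios = per_row_ratios(rows)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    for (label, _, _), ratio in zip(ROWS, per_row_ratios(ROWS)):
        print(f"{label:32s} {ratio:.1f}x")
    print(f"pooled: {pooled_ratio(ROWS):.2f}x  mean: {mean_ratio(ROWS):.2f}x")
```

The two aggregates diverge when a few long prompts compress much better than the rest, which is worth keeping in mind when comparing a global figure against per-category ones.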

Research methodology

This project is built on empirical data, not intuition.

Phase 1.1 — Corpus analysis

Analyzed 933,921 real prompts from the LMSYS-Chat-1M dataset (1M conversations with 25 LLMs). Key findings:

  • 5 action morphemes cover 69% of all detected intentions (gen, ext, impl, imag, rank)
  • System prompts average 214 tokens with 35% redundancy — the highest-impact use case
  • French has the highest redundancy rate (38%), Chinese the lowest (0%)
  • 54.6% of prompts contain enumerations → validates the | separator
  • "the" appears 481,625 times in the sample — pure waste

Phase 1.2 — Lexicon design & tokenizer validation

  • 81 candidate morphemes tested against tiktoken cl100k_base
  • 62 passed (1 token), 19 failed (2 tokens) → replaced with alternatives
  • Final lexicon: 78 morphemes, 100% at 1 token BPE, 9 categories
  • Missing: Chinese (zho) and Japanese (jpn) — no 1-token alternative found (planned for v1.0 with dedicated tokenizer)

Benchmark — Honest results

The first benchmark (v0.1) showed 0.99x: ULP wasn't compressing at all. The · separator cost 1 extra token per composition. After diagnosing 4 problems and applying 4 optimizations, the benchmark was rerun:

| Metric | v0.1 | v0.2 |
| --- | --- | --- |
| Global ratio | 0.99x | 1.50x |
| ULP wins | 38/110 | 97/110 |
| ULP loses | 56/110 | 7/110 |

Tokenizer strategy: two phases

v0.1/v0.2 (current): Morphemes are English-like fragments optimized for existing BPE tokenizers. This proves the concept works.

v1.0 (planned): Morphemes will be language-neutral forms that don't exist in any natural language, paired with a dedicated tokenizer or an extension to existing open-source tokenizers. This will eliminate the remaining overhead and push ratios significantly higher.


Repository structure

├── README.md                              ← You are here
├── docs/
│   ├── ULP-Whitepaper-v2.0.pdf            ← Whitepaper (vision, state of the art, benchmark)
│   ├── ULP-Phase1.1-Analyse-Empirique.md  ← Corpus analysis (933K prompts)
│   ├── ULP-Grammaire-v0.2.md              ← Grammar specification
│   ├── ULP-Lexique-v0.2.md                ← Lexicon (78 morphemes)
│   ├── ULP-Benchmark-v0.1-Resultats.md    ← First benchmark (honest 0.99x)
│   ├── ULP-Benchmark-v0.2-Resultats.md    ← Optimized benchmark (1.50x)
│   └── ULP-100-Traductions.md             ← 110 translation examples
├── data/
│   ├── ulp-lexicon-v0.1.json              ← Lexicon with token IDs
│   ├── ulp-benchmark-100.json             ← v0.1 benchmark raw data
│   ├── ulp-benchmark-v02.json             ← v0.2 benchmark raw data
│   ├── ulp-analysis-full-report.json      ← Corpus analysis results
│   ├── ulp-morpheme-candidates.json       ← Initial candidates
│   └── ulp-tokenizer-validation.json      ← Tokenizer test results
├── scripts/
│   ├── ulp_analyze_lmsys.py               ← Corpus analysis script
│   ├── ulp_validate_tokenizer.py          ← Tokenizer validation
│   ├── ulp_revalidate.py                  ← Replacement validation
│   ├── ulp_benchmark_100.py               ← v0.1 benchmark
│   ├── ulp_benchmark_v02.py               ← v0.2 benchmark
│   ├── ulp_optimize.py                    ← Optimization tests
│   └── ulp_diag.py                        ← Parquet diagnostic
└── LICENSE

Roadmap

  • Phase 0 — Whitepaper & publication
  • Phase 1.1 — Corpus analysis (933K prompts from LMSYS-Chat-1M)
  • Phase 1.2 — Lexicon v0.1 (78 morphemes, 100% at 1 token BPE)
  • Phase 1.2 — Grammar v0.2 + benchmark (1.50x compression)
  • Phase 2 — Training dataset (NL→ULP pairs for fine-tuning)
  • Phase 3 — PoC fine-tuning (prove an LLM can natively produce ULP)
  • Phase 4 — Consortium (standardization proposal)

Design principles

  1. Bijection — 1 morpheme = 1 concept. 1 concept = 1 morpheme. Zero homonymy, zero synonymy, zero polysemy.
  2. Native detection — No header, no flag. The AI detects ULP like it detects French.
  3. Token efficiency — Every morpheme = exactly 1 BPE token on current tokenizers.
  4. Empiricism first — Lexicon and grammar are derived from real corpus analysis, not theoretical intuition.
  5. Minimal punctuation — Only 3 symbols: | (alternation), ; (chaining), [ ] (groups). Space separates morphemes.
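The bijection principle lends itself to an automated check on any morpheme-to-concept table. The sketch below flags morphemes mapped to more than one concept (homonymy or polysemy) and concepts reached by more than one morpheme (synonymy). The function name is hypothetical, and the demo pairs include deliberately invalid entries (`res`, and a second gloss for `pro`) invented for illustration. They are not part of the ULP lexicon.

```python
from collections import defaultdict


def bijection_violations(pairs):
    """Check (morpheme, concept) pairs for one-to-one mapping.

    Returns a dict with two entries:
      "polysemy": morphemes mapped to several concepts
      "synonymy": concepts mapped to by several morphemes
    """
    concepts_of = defaultdict(set)
    morphemes_of = defaultdict(set)
    for morpheme, concept in pairs:
        concepts_of[morpheme].add(concept)
        morphemes_of[concept].add(morpheme)
    return {
        "polysemy": {m: sorted(c) for m, c in concepts_of.items() if len(c) > 1},
        "synonymy": {c: sorted(m) for c, m in morphemes_of.items() if len(m) > 1},
    }


if __name__ == "__main__":
    demo = [
        ("sum", "summarize"),
        ("trad", "translate"),
        ("pro", "professional"),
        ("pro", "programming"),  # invented clash, for illustration only
        ("res", "summarize"),    # invented synonym, for illustration only
    ]
    print(bijection_violations(demo))
```

Running a check like this whenever a morpheme is proposed would catch violations of the first principle before they reach the lexicon.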

Contributing

ULP is an open research project. Contributions welcome:

  • New morphemes — Propose additions to the lexicon (must pass 1-token BPE validation)
  • Benchmark extensions — Test on longer prompts, different languages, new categories
  • Tokenizer research — Help design the language-neutral v1.0 morphemes
  • Fine-tuning experiments — Train a model to natively recognize and produce ULP
  • Translations — Add translation examples in your language

Open an issue or submit a PR.


Citation

@misc{ulp2026,
  title={ULP: Universal Language Protocol — An Artificial Language for Token-Efficient AI Communication},
  author={Yacine Ayari},
  year={2026},
  url={https://github.com/YacineAyari/Universal-Language-Protocol}
}

Corpus source:

@misc{zheng2023lmsyschat1m,
  title={LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset},
  author={Zheng, Lianmin and others},
  year={2023},
  eprint={2309.11998},
  archivePrefix={arXiv}
}

License

MIT License — see LICENSE.

About

A native artificial language for AI-to-AI communication. Reducing token consumption through linguistic density.
