**`README.md`** (59 additions, 21 deletions)

Available via **Settings β†’ Languages β†’ Download Language Packs**.

### Creating Custom Language Packs

You can create dictionaries for any language using the included Python scripts.

#### Quick Start (Recommended)

The easiest way to build a language pack is the two-step `wordfreq` pipeline. This automatically generates a word list with frequency data and packages it into a ready-to-install ZIP:

```bash
# Navigate to scripts directory
cd scripts/

# Install prerequisite
pip install wordfreq

# Step 1: Generate a word list from wordfreq (supports 50+ languages)
python get_wordlist.py --lang hu --output hu_words.txt --count 25000

# Step 2: Build the language pack from the word list
python build_langpack.py --lang hu --name "Hungarian" --input hu_words.txt --use-wordfreq --output langpack-hu.zip
```

The `--use-wordfreq` flag enriches word frequencies using the [wordfreq](https://github.com/rspeer/wordfreq) library, which produces better prediction results.

#### Input File Format

The `--input` file for `build_langpack.py` and `build_dictionary.py` is a plain text word list. Supported formats:

| Format | Example | Notes |
|--------|---------|-------|
| One word per line | `hello` | Frequencies are looked up via `wordfreq` whenever it is installed; `--use-wordfreq` additionally makes the build fail early if `wordfreq` is missing |
| Word + TAB + frequency | `hello\t50000` | Uses the provided integer frequency |
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored during input parsing. In addition, `build_dictionary.py` applies `--min-length`/`--max-length` filtering after loading (the default `--max-length` is 30), so words of length 31–50 are also excluded unless you raise the limit, e.g. with `--max-length 50`.
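The parsing rules above can be sketched as a small helper (a hypothetical function, not the actual code in `build_dictionary.py`; the 50-character cap mirrors the note above):

```python
def parse_wordlist_line(line: str):
    """Parse one line of a word-list file.

    Returns (word, frequency) where frequency is None when the line
    carries no explicit count, or None for comment/blank/overlong lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # comment or blank line: skipped
    # Accepted shapes: "word", "word<TAB>freq", or "word freq".
    parts = line.replace("\t", " ").split()
    word = parts[0]
    if len(word) > 50:
        return None  # overlong tokens are ignored during parsing
    if len(parts) >= 2 and parts[1].isdigit():
        return (word, int(parts[1]))
    return (word, None)  # frequency left to a wordfreq lookup
```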
> **Tip:** Using `get_wordlist.py` to generate the input file is the recommended approach β€” it produces a clean one-word-per-line file from `wordfreq`'s curated data. If you provide your own file (e.g., scraped text), make sure it contains **one word per line** (not sentences or paragraphs), otherwise only a small fraction of entries will be recognized as valid dictionary words.

#### Alternative Build Methods

```bash
# From a pre-existing binary dictionary (.bin file)
python build_langpack.py --lang sv --name "Swedish" --dict ../src/main/assets/dictionaries/sv_enhanced.bin --output langpack-sv.zip

# From a custom word+frequency file (two-step: build dictionary, then package)
python build_dictionary.py --lang xx --input my_words.txt --output my_lang.bin
python build_langpack.py --lang xx --name "MyLang" --dict my_lang.bin --output langpack-xx.zip

# Batch build all supported languages (en, es, fr, de, it, pt, hu, nl, id, ms, tl, sw)
python build_all_languages.py
```

#### Script Reference

| Script | Purpose |
|--------|---------|
| `get_wordlist.py` | Extracts top N words from `wordfreq` for a given language code |
| `build_langpack.py` | Creates a complete `.zip` language pack (dictionary + unigrams + manifest) |
| `build_dictionary.py` | Builds a V2 binary dictionary (`.bin`) from a word list |
| `build_all_languages.py` | Batch builds all supported languages |
| `generate_unigrams.py` | Generates unigram frequency lists for language detection |

#### Language Pack Contents

Language packs are `.zip` files containing:

| File | Description |
|------|-------------|
| `manifest.json` | Metadata β€” language code, name, version, word count |
| `dictionary.bin` | V2 binary dictionary with accent normalization and frequency ranks |
| `unigrams.txt` | *(optional)* Top words ordered by frequency (used for language detection); omitted when building from `--dict` without `--unigrams` |
| `contractions.json` | *(optional)* Apostrophe word mappings for languages that use them |
| `prefix_boost.bin` | *(optional)* Aho-Corasick trie for prefix boosting (non-English) |
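For a concrete picture, a pack with this layout can be assembled by hand with the standard library. This is a sketch only: the manifest field names follow the table above and may differ from what `build_langpack.py` actually emits, and the dictionary payload is a placeholder:

```python
import json
import zipfile

# Assumed manifest field names, mirroring the contents table above.
manifest = {
    "code": "hu",
    "name": "Hungarian",
    "version": 1,
    "word_count": 25000,
}

with zipfile.ZipFile("langpack-hu.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    zf.writestr("dictionary.bin", b"")          # placeholder for the V2 binary dictionary
    zf.writestr("unigrams.txt", "a\naz\nes\n")  # top words, one per line
```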

#### Installing a Language Pack

Copy the `.zip` file to your device and import it in **CleverKeys Settings β†’ Multi-Language**.

**Pre-built Language Packs:**
Available in [`scripts/dictionaries/`](./scripts/dictionaries/) for immediate use, or download directly from within the app.

<div align="center">

**`scripts/build_all_languages.py`** (2 additions) β€” registers Hungarian:

```diff
 - Portuguese (pt)
 - Italian (it)
 - German (de)
+- Hungarian (hu)
 - Indonesian (id)
 - Swahili (sw)
 - Malay (ms)
```

```diff
 'pt': {'name': 'Portuguese', 'words': 25000, 'bundle': True},
 'it': {'name': 'Italian', 'words': 25000, 'bundle': True},
 'de': {'name': 'German', 'words': 25000, 'bundle': True},
+'hu': {'name': 'Hungarian', 'words': 25000, 'bundle': False},
 'nl': {'name': 'Dutch', 'words': 20000, 'bundle': False},
 'id': {'name': 'Indonesian', 'words': 20000, 'bundle': False},
 'ms': {'name': 'Malay', 'words': 20000, 'bundle': False},
```

**`scripts/dictionaries/langpack-hu.zip`** β€” binary file added (pre-built Hungarian language pack).