Add Hungarian language pack and improve langpack build documentation#126
Conversation
- Generate `langpack-hu.zip` with 25,000 Hungarian words from `wordfreq`
- Add Hungarian (`hu`) to `build_all_languages.py` supported languages
- Rewrite README "Creating Custom Language Packs" section with:
  - Clear quick-start guide using Hungarian as the example
  - Documented input file format (one word per line, word+tab+freq, word+space+freq)
  - Tip explaining why scraped text produces poor results
  - Structured tables for script reference and langpack contents
  - Installation instructions

Agent-Logs-Url: https://github.com/tribixbite/CleverKeys/sessions/5fe261e0-01ca-4df8-b1b2-f31f9cc7a512
Co-authored-by: tribixbite <381345+tribixbite@users.noreply.github.com>
Pull request overview
Adds Hungarian (hu) to the supported language build pipeline and updates the README instructions for building custom language packs so users can successfully generate and package dictionaries (with correct CLI usage and clearer input format expectations).
Changes:
- Added Hungarian (`hu`) to `SUPPORTED_LANGUAGES` in the batch language build script.
- Reworked the README "Creating Custom Language Packs" section with a correct Quick Start, explicit input formats, and updated pack contents/script reference.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scripts/build_all_languages.py` | Registers Hungarian as a supported language in the batch build configuration. |
| `README.md` | Updates custom language pack build documentation (Quick Start, input formats, script reference, pack contents). |
```markdown
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |

# Option 2: Build from pre-existing binary dictionary (.bin file)
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored.
```
This states "Words longer than 50 characters are ignored," but `build_dictionary.py` also applies `--max-length` (default 30) after loading, so words of length 31–50 are ignored by default as well. Consider documenting the default `--min-length`/`--max-length` behavior (and how to override it) to avoid confusing users who expect 40–50 char tokens to be kept.
```suggestion
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored during input parsing. In addition, `build_dictionary.py` applies `--min-length`/`--max-length` filtering after loading (default `--max-length` is 30), so words with lengths 31-50 are also excluded unless you override that limit, for example with `--max-length 50`.
```
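The two-stage behavior the comment describes can be sketched in isolation. This is illustrative only: the function names and structure are assumptions, not the actual `build_dictionary.py` code.

```python
# Illustrative sketch of two-stage word filtering, not the actual
# build_dictionary.py implementation.
PARSE_LIMIT = 50  # hard limit applied while reading the input file


def parse_wordlist(lines):
    """First stage: drop comments, blanks, and words longer than 50 chars."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word = line.split()[0]
        if len(word) <= PARSE_LIMIT:
            words.append(word)
    return words


def apply_length_filter(words, min_length=1, max_length=30):
    """Second stage: the --min-length/--max-length pass (default max 30)."""
    return [w for w in words if min_length <= len(w) <= max_length]


words = parse_wordlist(["hello", "# comment", "a" * 40, "a" * 60])
# "a" * 60 is dropped at parse time; "a" * 40 survives parsing but is
# removed by the default max_length=30 filter unless the limit is raised.
assert apply_length_filter(words) == ["hello"]
assert apply_length_filter(words, max_length=50) == ["hello", "a" * 40]
```

Under these assumptions, a 40-character token only reaches the dictionary when the caller overrides the second stage, which is exactly the behavior the suggested doc change spells out.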
```markdown
| `manifest.json` | Metadata — language code, name, version, word count |
| `dictionary.bin` | V2 binary dictionary with accent normalization and frequency ranks |
| `unigrams.txt` | Top words ordered by frequency (used for language detection) |
| `contractions.json` | *(optional)* Apostrophe word mappings for languages that use them |
```
The contents table implies `unigrams.txt` is always present, but `build_langpack.py` will omit it when `--unigrams` isn't provided and unigram generation fails (or when building from `--dict` only). Either mark `unigrams.txt` as optional in the docs, or make `build_langpack.py` require/provide unigrams for a "complete" pack.
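If the docs route is taken, one way to keep pack contents self-describing is a packer that treats `unigrams.txt` as optional and records its presence in the manifest. A minimal sketch with hypothetical helper and field names, not the real `build_langpack.py`:

```python
# Hypothetical packer sketch: unigrams.txt is optional, mirroring the
# review note. Field name "has_unigrams" is an assumption.
import json
import os
import zipfile


def pack_langpack(out_path, dict_path, unigrams_path=None, manifest=None):
    """Write a langpack ZIP; include unigrams.txt only when it exists."""
    manifest = dict(manifest or {})
    manifest["has_unigrams"] = bool(unigrams_path and os.path.exists(unigrams_path))
    with zipfile.ZipFile(out_path, "w") as zf:
        zf.writestr("manifest.json", json.dumps(manifest))
        zf.write(dict_path, "dictionary.bin")
        if manifest["has_unigrams"]:
            zf.write(unigrams_path, "unigrams.txt")
    return manifest
```

A consumer can then check `manifest["has_unigrams"]` instead of guessing whether the member was dropped or the pack is corrupt.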
```markdown
The `--use-wordfreq` flag enriches word frequencies using the [wordfreq](https://github.com/rspeer/wordfreq) library, which produces better prediction results.

#### Input File Format

The `--input` file for `build_langpack.py` and `build_dictionary.py` is a plain text word list. Supported formats:

| Format | Example | Notes |
|--------|---------|-------|
| One word per line | `hello` | Frequencies are looked up via `wordfreq` (use `--use-wordfreq`) |
| Word + TAB + frequency | `hello\t50000` | Uses the provided integer frequency |
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |
```
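The documented contract above can be expressed as a tiny parser. This is an illustration of the three formats, not the project's actual code; the `wordfreq` lookup is passed in as a stub:

```python
# Illustrative parser for the documented input formats, not the
# actual build_dictionary.py code.
def parse_line(line, lookup_freq=None):
    """Parse one input line into (word, frequency-or-None).

    Accepts: bare word / word<TAB>freq / word<SPACE>freq.
    Returns None for comments and blank lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.replace("\t", " ").split()
    word = parts[0]
    if len(parts) > 1 and parts[1].isdigit():
        return word, int(parts[1])
    # No inline frequency: fall back to a wordfreq-style lookup if given.
    return word, lookup_freq(word) if lookup_freq else None


assert parse_line("hello\t50000") == ("hello", 50000)
assert parse_line("hello 50000") == ("hello", 50000)
assert parse_line("# comment") is None
assert parse_line("hello") == ("hello", None)
```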
The README implies `--use-wordfreq` is required for one-word-per-line inputs to get real frequencies, but `build_dictionary.py` currently uses wordfreq whenever it's installed regardless of the flag (the flag only errors when wordfreq isn't available). Either update the docs to reflect that behavior, or change `build_dictionary.py` to honor `--use-wordfreq` as an actual switch so the docs and CLI semantics match.
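The second option in the comment, honoring `--use-wordfreq` as a true switch, might look like this sketch (hypothetical structure and scaling, not the current `build_dictionary.py`):

```python
# Sketch of --use-wordfreq as an explicit opt-in switch; the frequency
# scaling factor below is an arbitrary example, not the project's.
import argparse


def get_freq_source(argv):
    """Only consult wordfreq when --use-wordfreq is explicitly passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-wordfreq", action="store_true")
    args = parser.parse_args(argv)
    if not args.use_wordfreq:
        return None  # inline frequencies only, even if wordfreq is installed
    try:
        from wordfreq import word_frequency
    except ImportError:
        parser.error("--use-wordfreq requires the wordfreq package")
    return lambda word: int(word_frequency(word, "en") * 1e9)


assert get_freq_source([]) is None  # flag absent: never touch wordfreq
```

With this shape, the docs and the CLI agree: omitting the flag means inline frequencies only, and passing it either enables wordfreq lookups or fails loudly.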
Pull Request
Description
Users couldn't build custom language packs because the README had an incorrect example command (missing the required `--input` flag), didn't document the expected input file format, and didn't explain why scraped text produces near-empty dictionaries (the script expects one word per line, not prose).

Type of Change
Related Issues
Changes Made
- `scripts/dictionaries/langpack-hu.zip` — Pre-built Hungarian language pack (25,000 words, 296 KB) generated from `wordfreq` with accent normalization for á/é/í/ó/ö/ő/ú/ü/ű
- `scripts/build_all_languages.py` — Added `hu` to the `SUPPORTED_LANGUAGES` dict and docstring
- `README.md` — Rewrote "Creating Custom Language Packs" section (the old instructions referenced `{lang}_enhanced.bin`/`.json`, which don't exist in the ZIP)

Quick start now reads:
Testing Performed
Manual Testing
Test Scenarios:
- `get_wordlist.py` — 25,000 words, proper Hungarian vocabulary
- `langpack-hu.zip` via `build_langpack.py` — manifest shows 25,000 words, ZIP contains `manifest.json`, `dictionary.bin`, `unigrams.txt`
- `build_all_languages.py --list` includes Hungarian

Automated Testing
Screenshots/Videos
N/A — documentation and data-file changes only.
Performance Impact
Privacy & Security Checklist
Privacy Impact: None
Explanation: Dictionary data only; all processing remains on-device.
Code Quality Checklist
Documentation
Breaking Changes
N/A
Additional Context
The original reporter scraped a Hungarian Wikipedia page and fed the raw text to `build_langpack.py`, which expects one word per line. Since prose lines are treated as single entries (and most fail validation), only 73 "words" made it into the dictionary. The updated docs make this format requirement explicit and recommend using `get_wordlist.py` as the input source.

Checklist for Maintainers