Add Hungarian language pack and improve langpack build documentation#126
Conversation
- Generate `langpack-hu.zip` with 25,000 Hungarian words from `wordfreq`
- Add Hungarian (`hu`) to `build_all_languages.py` supported languages
- Rewrite README "Creating Custom Language Packs" section with:
  - Clear quick-start guide using Hungarian as the example
  - Documented input file format (one word per line, word+tab+freq, word+space+freq)
  - Tip explaining why scraped text produces poor results
  - Structured tables for script reference and langpack contents
  - Installation instructions

Agent-Logs-Url: https://github.com/tribixbite/CleverKeys/sessions/5fe261e0-01ca-4df8-b1b2-f31f9cc7a512
Co-authored-by: tribixbite <381345+tribixbite@users.noreply.github.com>
Pull request overview
Adds Hungarian (hu) to the supported language build pipeline and updates the README instructions for building custom language packs so users can successfully generate and package dictionaries (with correct CLI usage and clearer input format expectations).
Changes:
- Added Hungarian (`hu`) to `SUPPORTED_LANGUAGES` in the batch language build script.
- Reworked the README "Creating Custom Language Packs" section with a correct Quick Start, explicit input formats, and updated pack contents/script reference.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scripts/build_all_languages.py` | Registers Hungarian as a supported language in the batch build configuration. |
| `README.md` | Updates custom language pack build documentation (Quick Start, input formats, script reference, pack contents). |
```markdown
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |

# Option 2: Build from pre-existing binary dictionary (.bin file)
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored.
```
This states "Words longer than 50 characters are ignored," but `build_dictionary.py` also applies `--max-length` (default 30) after loading, so words of length 31–50 are ignored by default as well. Consider documenting the default `--min-length`/`--max-length` behavior (and how to override it) to avoid confusing users who expect 40–50 char tokens to be kept.
```suggestion
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored during input parsing. In addition, `build_dictionary.py` applies `--min-length`/`--max-length` filtering after loading (default `--max-length` is 30), so words with lengths 31-50 are also excluded unless you override that limit, for example with `--max-length 50`.
```
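The two-stage behavior the comment describes can be sketched in isolation. This is illustrative only: the function names and structure are assumptions, not the actual `build_dictionary.py` code.

```python
# Illustrative sketch of two-stage word filtering, not the actual
# build_dictionary.py implementation.
PARSE_LIMIT = 50  # hard limit applied while reading the input file


def parse_wordlist(lines):
    """First stage: drop comments, blanks, and words longer than 50 chars."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        word = line.split()[0]
        if len(word) <= PARSE_LIMIT:
            words.append(word)
    return words


def apply_length_filter(words, min_length=1, max_length=30):
    """Second stage: the --min-length/--max-length pass (default max 30)."""
    return [w for w in words if min_length <= len(w) <= max_length]


words = parse_wordlist(["hello", "# comment", "a" * 40, "a" * 60])
# "a" * 60 is dropped at parse time; "a" * 40 survives parsing but is
# removed by the default max_length=30 filter unless the limit is raised.
assert apply_length_filter(words) == ["hello"]
assert apply_length_filter(words, max_length=50) == ["hello", "a" * 40]
```

Under these assumptions, a 40-character token only reaches the dictionary when the caller overrides the second stage, which is exactly the behavior the suggested doc change spells out.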
```markdown
| `manifest.json` | Metadata — language code, name, version, word count |
| `dictionary.bin` | V2 binary dictionary with accent normalization and frequency ranks |
| `unigrams.txt` | Top words ordered by frequency (used for language detection) |
| `contractions.json` | *(optional)* Apostrophe word mappings for languages that use them |
```
The contents table implies `unigrams.txt` is always present, but `build_langpack.py` will omit it when `--unigrams` isn't provided and unigram generation fails (or when building from `--dict` only). Either mark `unigrams.txt` as optional in the docs, or make `build_langpack.py` require/provide unigrams for a "complete" pack.
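If the docs route is taken, one way to keep pack contents self-describing is a packer that treats `unigrams.txt` as optional and records its presence in the manifest. A minimal sketch with hypothetical helper and field names, not the real `build_langpack.py`:

```python
# Hypothetical packer sketch: unigrams.txt is optional, mirroring the
# review note. Field name "has_unigrams" is an assumption.
import json
import os
import zipfile


def pack_langpack(out_path, dict_path, unigrams_path=None, manifest=None):
    """Write a langpack ZIP; include unigrams.txt only when it exists."""
    manifest = dict(manifest or {})
    manifest["has_unigrams"] = bool(unigrams_path and os.path.exists(unigrams_path))
    with zipfile.ZipFile(out_path, "w") as zf:
        zf.writestr("manifest.json", json.dumps(manifest))
        zf.write(dict_path, "dictionary.bin")
        if manifest["has_unigrams"]:
            zf.write(unigrams_path, "unigrams.txt")
    return manifest
```

A consumer can then check `manifest["has_unigrams"]` instead of guessing whether the member was dropped or the pack is corrupt.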
```markdown
The `--use-wordfreq` flag enriches word frequencies using the [wordfreq](https://github.com/rspeer/wordfreq) library, which produces better prediction results.

#### Input File Format

The `--input` file for `build_langpack.py` and `build_dictionary.py` is a plain text word list. Supported formats:

| Format | Example | Notes |
|--------|---------|-------|
| One word per line | `hello` | Frequencies are looked up via `wordfreq` (use `--use-wordfreq`) |
| Word + TAB + frequency | `hello\t50000` | Uses the provided integer frequency |
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |
```
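The documented contract above can be expressed as a tiny parser. This is an illustration of the three formats, not the project's actual code; the `wordfreq` lookup is passed in as a stub:

```python
# Illustrative parser for the documented input formats, not the
# actual build_dictionary.py code.
def parse_line(line, lookup_freq=None):
    """Parse one input line into (word, frequency-or-None).

    Accepts: bare word / word<TAB>freq / word<SPACE>freq.
    Returns None for comments and blank lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.replace("\t", " ").split()
    word = parts[0]
    if len(parts) > 1 and parts[1].isdigit():
        return word, int(parts[1])
    # No inline frequency: fall back to a wordfreq-style lookup if given.
    return word, lookup_freq(word) if lookup_freq else None


assert parse_line("hello\t50000") == ("hello", 50000)
assert parse_line("hello 50000") == ("hello", 50000)
assert parse_line("# comment") is None
assert parse_line("hello") == ("hello", None)
```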
The README implies `--use-wordfreq` is required for one-word-per-line inputs to get real frequencies, but `build_dictionary.py` currently uses wordfreq whenever it's installed regardless of the flag (the flag only errors when wordfreq isn't available). Either update the docs to reflect that behavior, or change `build_dictionary.py` to honor `--use-wordfreq` as an actual switch so the docs and CLI semantics match.
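The second option in the comment, honoring `--use-wordfreq` as a true switch, might look like this sketch (hypothetical structure and scaling, not the current `build_dictionary.py`):

```python
# Sketch of --use-wordfreq as an explicit opt-in switch; the frequency
# scaling factor below is an arbitrary example, not the project's.
import argparse


def get_freq_source(argv):
    """Only consult wordfreq when --use-wordfreq is explicitly passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-wordfreq", action="store_true")
    args = parser.parse_args(argv)
    if not args.use_wordfreq:
        return None  # inline frequencies only, even if wordfreq is installed
    try:
        from wordfreq import word_frequency
    except ImportError:
        parser.error("--use-wordfreq requires the wordfreq package")
    return lambda word: int(word_frequency(word, "en") * 1e9)


assert get_freq_source([]) is None  # flag absent: never touch wordfreq
```

With this shape, the docs and the CLI agree: omitting the flag means inline frequencies only, and passing it either enables wordfreq lookups or fails loudly.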
Pull Request
Description
Users couldn't build custom language packs because the README had an incorrect example command (missing the required `--input` flag), didn't document the expected input file format, and didn't explain why scraped text produces near-empty dictionaries (the script expects one word per line, not prose).

Type of Change
Related Issues
Changes Made
- `scripts/dictionaries/langpack-hu.zip` — Pre-built Hungarian language pack (25,000 words, 296 KB) generated from `wordfreq` with accent normalization for á/é/í/ó/ö/ő/ú/ü/ű
- `scripts/build_all_languages.py` — Added `hu` to the `SUPPORTED_LANGUAGES` dict and docstring
- `README.md` — Rewrote "Creating Custom Language Packs" section (the old instructions referenced `{lang}_enhanced.bin`/`.json`, which don't exist in the ZIP)

Quick start now reads:
Testing Performed
Manual Testing
Test Scenarios:
- `get_wordlist.py` — 25,000 words, proper Hungarian vocabulary
- `langpack-hu.zip` via `build_langpack.py` — manifest shows 25,000 words, ZIP contains `manifest.json`, `dictionary.bin`, `unigrams.txt`
- `build_all_languages.py --list` includes Hungarian

Automated Testing
Screenshots/Videos
N/A — documentation and data-file changes only.
Performance Impact
Privacy & Security Checklist
Privacy Impact: None
Explanation: Dictionary data only; all processing remains on-device.
Code Quality Checklist
Documentation
Breaking Changes
N/A
Additional Context
The original reporter scraped a Hungarian Wikipedia page and fed the raw text to `build_langpack.py`, which expects one word per line. Since prose lines are treated as single entries (and most fail validation), only 73 "words" made it into the dictionary. The updated docs make this format requirement explicit and recommend using `get_wordlist.py` as the input source.

Checklist for Maintainers