
Add Hungarian language pack and improve langpack build documentation#126

Draft
Copilot wants to merge 2 commits into main from copilot/update-build-langpack-docs

Conversation


Copilot AI commented Apr 19, 2026

Pull Request

Description

Users couldn't build custom language packs because the README had an incorrect example command (it was missing the required --input flag), didn't document the expected input file format, and didn't explain why scraped text produces near-empty dictionaries (the script expects one word per line, not prose).

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🎨 Style/UI change (formatting, visual improvements)
  • ♻️ Code refactoring (no functional changes)
  • ⚡ Performance improvement
  • ✅ Test update/addition
  • 🔧 Build/configuration change

Related Issues

Changes Made

  • scripts/dictionaries/langpack-hu.zip — Pre-built Hungarian language pack (25,000 words, 296 KB) generated from wordfreq with accent normalization for á/é/í/ó/ö/ő/ú/ü/ű
  • scripts/build_all_languages.py — Added hu to SUPPORTED_LANGUAGES dict and docstring
  • README.md — Rewrote "Creating Custom Language Packs" section:
    • Quick Start using Hungarian as the worked example
    • Input file format table (one-word-per-line, word+TAB+freq, word+space+freq)
    • Tip explaining why scraped text fails (root cause of the issue)
    • Accurate language pack contents table (old docs listed {lang}_enhanced.bin/.json which don't exist in the ZIP)
    • Script reference table

Quick start now reads:

cd scripts/
pip install wordfreq
python get_wordlist.py --lang hu --output hu_words.txt --count 25000
python build_langpack.py --lang hu --name "Hungarian" --input hu_words.txt --use-wordfreq --output langpack-hu.zip
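The --input file consumed by the commands above is a plain text word list. A minimal line parser for the three documented formats (bare word, word + TAB + frequency, word + space + frequency) might look like the following sketch; `parse_wordlist_line` is an illustrative name, not a function from the actual scripts:

```python
def parse_wordlist_line(line: str):
    """Parse one word-list line into (word, frequency-or-None).

    Supported formats (as documented in the README):
      word          -> frequency looked up later (e.g. via wordfreq)
      word\tfreq    -> explicit integer frequency
      word freq     -> explicit integer frequency
    Comment lines (#) and blank lines return None.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Prefer TAB as the separator when present; otherwise split on whitespace.
    parts = line.split("\t") if "\t" in line else line.split()
    word = parts[0]
    freq = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
    return word, freq
```

Bare words come back with no frequency here; the real scripts fill that in separately (e.g. via wordfreq).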

Testing Performed

Manual Testing

Test Scenarios:

  • Generated Hungarian word list via get_wordlist.py — 25,000 words, proper Hungarian vocabulary
  • Built langpack-hu.zip via build_langpack.py — manifest shows 25,000 words, ZIP contains manifest.json, dictionary.bin, unigrams.txt
  • Verified build_all_languages.py --list includes Hungarian
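The ZIP-contents check above can be scripted. This is a sketch assuming only the member names listed in this PR (manifest.json, dictionary.bin, unigrams.txt); the manifest field names (e.g. "word_count") are assumptions, not confirmed from the scripts:

```python
import json
import zipfile

REQUIRED_MEMBERS = {"manifest.json", "dictionary.bin", "unigrams.txt"}

def check_langpack(path: str) -> dict:
    """Open a language pack ZIP and confirm the expected members exist.

    Returns the parsed manifest. Raises ValueError listing any missing
    members, so a truncated or partial pack fails loudly.
    """
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        missing = REQUIRED_MEMBERS - names
        if missing:
            raise ValueError(f"langpack missing members: {sorted(missing)}")
        manifest = json.loads(zf.read("manifest.json"))
    return manifest
```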

Automated Testing

  • CodeQL security scan — 0 alerts
  • Code review — no issues

Screenshots/Videos

N/A — documentation and data-file changes only.

Performance Impact

  • Memory: No change
  • CPU: No change
  • Battery: No impact
  • Latency: No change

Privacy & Security Checklist

  • ✅ No network code added
  • ✅ No telemetry or analytics added
  • ✅ No third-party SDKs added (except ONNX Runtime if needed)
  • ✅ All data processing remains local
  • ✅ No sensitive data logging
  • ✅ No new permissions required
  • ✅ User privacy maintained

Privacy Impact: None
Explanation: Dictionary data only; all processing remains on-device.

Code Quality Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Documentation

  • README.md updated (if needed)
  • User Manual updated (if user-facing changes)
  • FAQ updated (if common questions addressed)
  • Code comments added/updated
  • CHANGELOG.md updated

Breaking Changes

N/A

Additional Context

The original reporter scraped a Hungarian Wikipedia page and fed the raw text to build_langpack.py, which expects one word per line. Since prose lines are treated as single entries (and most fail validation), only 73 "words" made it into the dictionary. The updated docs make this format requirement explicit and recommend using get_wordlist.py as the input source.
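This failure mode is easy to reproduce with a toy validator. The rules below (alphabetic only, length-capped) are illustrative, not the actual validation in build_langpack.py, but they show how a prose line collapses to zero entries while a one-word-per-line file survives:

```python
def count_valid_entries(lines, max_len=50):
    """Count lines that survive a simple one-word-per-line validation.

    Illustrative rules only: the whole stripped line must be alphabetic
    and at most max_len characters. A prose sentence contains spaces and
    punctuation, so it fails as a single candidate "word".
    """
    return sum(
        1 for raw in lines
        if (w := raw.strip()) and w.isalpha() and len(w) <= max_len
    )

prose = ["A magyar nyelv az urali nyelvcsalad tagja."]  # one prose line
wordlist = ["magyar", "nyelv", "urali"]                 # one word per line
```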

Checklist for Maintainers

  • Code review completed
  • Tests verified on multiple devices
  • Documentation reviewed
  • No privacy violations
  • Performance acceptable
  • Breaking changes documented
  • CHANGELOG.md updated
  • Ready to merge

- Generate langpack-hu.zip with 25,000 Hungarian words from wordfreq
- Add Hungarian (hu) to build_all_languages.py supported languages
- Rewrite README "Creating Custom Language Packs" section with:
  - Clear quick-start guide using Hungarian as the example
  - Documented input file format (one word per line, word+tab+freq, word+space+freq)
  - Tip explaining why scraped text produces poor results
  - Structured tables for script reference and langpack contents
  - Installation instructions

Agent-Logs-Url: https://github.com/tribixbite/CleverKeys/sessions/5fe261e0-01ca-4df8-b1b2-f31f9cc7a512

Co-authored-by: tribixbite <381345+tribixbite@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update documentation for build_langpack.py usage" to "Add Hungarian language pack and improve langpack build documentation" on Apr 19, 2026
Copilot AI requested a review from tribixbite April 19, 2026 06:56
@tribixbite tribixbite requested a review from Copilot April 19, 2026 07:30

Copilot AI left a comment


Pull request overview

Adds Hungarian (hu) to the supported language build pipeline and updates the README instructions for building custom language packs so users can successfully generate and package dictionaries (with correct CLI usage and clearer input format expectations).

Changes:

  • Added Hungarian (hu) to SUPPORTED_LANGUAGES in the batch language build script.
  • Reworked the README “Creating Custom Language Packs” section with a correct Quick Start, explicit input formats, and updated pack contents/script reference.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File Description
scripts/build_all_languages.py Registers Hungarian as a supported language in the batch build configuration.
README.md Updates custom language pack build documentation (Quick Start, input formats, script reference, pack contents).


Comment thread: README.md
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |

# Option 2: Build from pre-existing binary dictionary (.bin file)
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored.

Copilot AI Apr 19, 2026


This states “Words longer than 50 characters are ignored,” but build_dictionary.py also applies --max-length (default 30) after loading, so words of length 31–50 are ignored by default as well. Consider documenting the default --min-length/--max-length behavior (and how to override it) to avoid confusing users who expect 40–50 char tokens to be kept.

Suggested change
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored.
Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored during input parsing. In addition, `build_dictionary.py` applies `--min-length`/`--max-length` filtering after loading (default `--max-length` is 30), so words with lengths 31-50 are also excluded unless you override that limit, for example with `--max-length 50`.
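The two-stage behavior this suggestion describes can be sketched as follows (a simplification: a 50-character cap at parse time, then a separate --min-length/--max-length pass whose default maximum is 30; `filter_words` is an illustrative helper, not the scripts' actual code):

```python
def filter_words(words, min_length=1, max_length=30, parse_cap=50):
    """Two-stage length filtering as described above (illustrative).

    Stage 1 mirrors input parsing: words over `parse_cap` chars are dropped.
    Stage 2 mirrors the --min-length/--max-length pass applied after
    loading, so 31-50 char words vanish unless max_length is raised.
    """
    loaded = [w for w in words if len(w) <= parse_cap]
    return [w for w in loaded if min_length <= len(w) <= max_length]
```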

Comment thread: README.md
Comment on lines +350 to +353
| `manifest.json` | Metadata — language code, name, version, word count |
| `dictionary.bin` | V2 binary dictionary with accent normalization and frequency ranks |
| `unigrams.txt` | Top words ordered by frequency (used for language detection) |
| `contractions.json` | *(optional)* Apostrophe word mappings for languages that use them |

Copilot AI Apr 19, 2026


The contents table implies unigrams.txt is always present, but build_langpack.py will omit it when --unigrams isn’t provided and unigram generation fails (or when building from --dict only). Either mark unigrams.txt as optional in the docs, or make build_langpack.py require/provide unigrams for a “complete” pack.

Comment thread: README.md
Comment on lines +304 to +314
The `--use-wordfreq` flag enriches word frequencies using the [wordfreq](https://github.com/rspeer/wordfreq) library, which produces better prediction results.

#### Input File Format

The `--input` file for `build_langpack.py` and `build_dictionary.py` is a plain text word list. Supported formats:

| Format | Example | Notes |
|--------|---------|-------|
| One word per line | `hello` | Frequencies are looked up via `wordfreq` (use `--use-wordfreq`) |
| Word + TAB + frequency | `hello\t50000` | Uses the provided integer frequency |
| Word + space + frequency | `hello 50000` | Uses the provided integer frequency |

Copilot AI Apr 19, 2026


The README implies --use-wordfreq is required for one-word-per-line inputs to get real frequencies, but build_dictionary.py currently uses wordfreq whenever it’s installed regardless of the flag (the flag only errors when wordfreq isn’t available). Either update the docs to reflect that behavior, or change build_dictionary.py to honor --use-wordfreq as an actual switch so the docs and CLI semantics match.
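One way to make --use-wordfreq a real switch, as this comment suggests, would be to consult wordfreq only when the flag is set. This is a sketch under that assumption; `resolve_frequency` and its defaults are illustrative, not the current build_dictionary.py code:

```python
def resolve_frequency(word, explicit_freq, use_wordfreq, lang, default=1):
    """Pick a frequency for `word`, honoring --use-wordfreq as a switch.

    Explicit frequencies from the input file always win. wordfreq is
    consulted only when the flag is set (an ImportError then surfaces
    as a clear CLI error); otherwise a fixed default is used.
    """
    if explicit_freq is not None:
        return explicit_freq
    if use_wordfreq:
        from wordfreq import word_frequency
        return max(default, int(word_frequency(word, lang) * 1_000_000))
    return default
```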



Development

Successfully merging this pull request may close these issues.

Unclear or outdated documentation of build_langpack.py

3 participants