Add Hungarian language pack and improve langpack build documentation #126
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -284,43 +284,81 @@ Available via **Settings → Languages → Download Language Packs**: | |||||
|
|
||||||
| ### Creating Custom Language Packs | ||||||
|
|
||||||
| You can create dictionaries for any language using the included Python scripts: | ||||||
| You can create dictionaries for any language using the included Python scripts. | ||||||
|
|
||||||
| #### Quick Start (Recommended) | ||||||
|
|
||||||
| The easiest way to build a language pack is the two-step `wordfreq` pipeline. This automatically generates a word list with frequency data and packages it into a ready-to-install ZIP: | ||||||
|
|
||||||
| ```bash | ||||||
| # Navigate to scripts directory | ||||||
| cd scripts/ | ||||||
|
|
||||||
| # Install prerequisite | ||||||
| pip install wordfreq | ||||||
|
|
||||||
| # Option 1: Two-step build from wordfreq (any language wordfreq supports) | ||||||
| python get_wordlist.py --lang fr --output fr_words.txt --count 50000 | ||||||
| python build_langpack.py --lang fr --name "French" --input fr_words.txt --use-wordfreq --output langpack-fr.zip | ||||||
| # Step 1: Generate a word list from wordfreq (supports 50+ languages) | ||||||
| python get_wordlist.py --lang hu --output hu_words.txt --count 25000 | ||||||
|
|
||||||
| # Step 2: Build the language pack from the word list | ||||||
| python build_langpack.py --lang hu --name "Hungarian" --input hu_words.txt --use-wordfreq --output langpack-hu.zip | ||||||
| ``` | ||||||
|
|
||||||
| The `--use-wordfreq` flag enriches word frequencies using the [wordfreq](https://github.com/rspeer/wordfreq) library, which produces better prediction results. | ||||||
|
|
||||||
| #### Input File Format | ||||||
|
|
||||||
| The `--input` file for `build_langpack.py` and `build_dictionary.py` is a plain text word list. Supported formats: | ||||||
|
|
||||||
| | Format | Example | Notes | | ||||||
| |--------|---------|-------| | ||||||
| | One word per line | `hello` | Frequencies are looked up via `wordfreq` (use `--use-wordfreq`) | | ||||||
| | Word + TAB + frequency | `hello\t50000` | Uses the provided integer frequency | | ||||||
| | Word + space + frequency | `hello 50000` | Uses the provided integer frequency | | ||||||
|
|
||||||
| # Option 2: Build from pre-existing binary dictionary (.bin file) | ||||||
| Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored. | ||||||
|
||||||
| Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored. | |
| Lines starting with `#` are treated as comments and skipped. Words longer than 50 characters are ignored during input parsing. In addition, `build_dictionary.py` applies `--min-length`/`--max-length` filtering after loading (default `--max-length` is 30), so words with lengths 31-50 are also excluded unless you override that limit, for example with `--max-length 50`. |
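The two-stage filtering described in this suggestion can be illustrated with a minimal sketch. It is not the code in `build_dictionary.py`: the names `parse_word_list` and `filter_by_length` are hypothetical, skipping blank lines is an assumption, and the `min_length=1` default is assumed (only the `--max-length` default of 30 comes from the text above).

```python
from typing import Iterable, Iterator, List, Optional, Tuple

MAX_PARSE_LENGTH = 50  # parse-time cutoff stated in the README text

Entry = Tuple[str, Optional[int]]


def parse_word_list(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield (word, frequency) pairs; frequency is None when the line has none."""
    for raw in lines:
        line = raw.strip()
        # Skip comments; skipping blank lines is an assumption of this sketch.
        if not line or line.startswith("#"):
            continue
        # Split once from the right on any whitespace, so TAB and space both work.
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isascii() and parts[1].isdigit():
            word, freq = parts[0], int(parts[1])
        else:
            word, freq = line, None
        # Stage 1: drop words longer than 50 characters while parsing.
        if len(word) > MAX_PARSE_LENGTH:
            continue
        yield word, freq


def filter_by_length(entries: Iterable[Entry],
                     min_length: int = 1,
                     max_length: int = 30) -> List[Entry]:
    """Stage 2: apply the --min-length/--max-length style filter after loading."""
    return [(w, f) for w, f in entries if min_length <= len(w) <= max_length]


sample = ["# comment", "hello", "hello\t50000", "hello 50000", "a" * 51, "b" * 40, ""]
parsed = list(parse_word_list(sample))
kept_default = filter_by_length(parsed)                 # 40-character word is dropped
kept_relaxed = filter_by_length(parsed, max_length=50)  # 40-character word is kept
```

With the default limit, a 40-character word survives parsing but is removed by the second stage, which is the behavior the suggested README wording spells out.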
Copilot
AI
Apr 19, 2026
The contents table implies `unigrams.txt` is always present, but `build_langpack.py` will omit it when `--unigrams` isn't provided and unigram generation fails (or when building from `--dict` only). Either mark `unigrams.txt` as optional in the docs, or make `build_langpack.py` require/provide unigrams for a "complete" pack.
The README implies `--use-wordfreq` is required for one-word-per-line inputs to get real frequencies, but `build_dictionary.py` currently uses `wordfreq` whenever it's installed regardless of the flag (the flag only errors when `wordfreq` isn't available). Either update the docs to reflect that behavior, or change `build_dictionary.py` to honor `--use-wordfreq` as an actual switch so the docs and CLI semantics match.
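One way the second option could look is sketched below. This is a hedged illustration, not the current `build_dictionary.py`: `build_parser` and `pick_frequency` are hypothetical names, the fallback frequency of `1` is an assumption, and the `lookup` callable stands in for whatever `wordfreq` call the script actually makes.

```python
import argparse
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    """CLI sketch where --use-wordfreq is an explicit opt-in switch."""
    parser = argparse.ArgumentParser(description="Sketch of a dictionary build CLI")
    parser.add_argument("--input", required=True, help="Path to the word list")
    parser.add_argument(
        "--use-wordfreq",
        action="store_true",
        help="Look up frequencies for words that have none in the input file",
    )
    return parser


def pick_frequency(word: str,
                   provided: Optional[int],
                   use_wordfreq: bool,
                   lookup: Callable[[str], int],
                   default: int = 1) -> int:
    """Choose a frequency: file value first, then lookup only if the flag is set."""
    if provided is not None:
        return provided          # a frequency given in the input always wins
    if use_wordfreq:
        return lookup(word)      # consulted only when the user opted in
    return default               # assumed fallback; the real default may differ


args_off = build_parser().parse_args(["--input", "hu_words.txt"])
args_on = build_parser().parse_args(["--input", "hu_words.txt", "--use-wordfreq"])

fake_lookup = lambda w: 99  # stand-in for a wordfreq-backed lookup
freq_off = pick_frequency("hello", None, args_off.use_wordfreq, fake_lookup)
freq_on = pick_frequency("hello", None, args_on.use_wordfreq, fake_lookup)
freq_given = pick_frequency("hello", 50000, args_on.use_wordfreq, fake_lookup)
```

Passing the lookup in as a callable keeps the flag the only thing that decides whether `wordfreq` is consulted, so an installed library alone no longer changes the output.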