Normalize all Hebrew text to Unicode NFC by jonathanrobie · Pull Request #139 · Clear-Bible/macula-hebrew

jonathanrobie · 2026-04-23T16:53:43Z

Closes #138.

Summary

Normalizes all Hebrew text in WLC/nodes/*.xml to Unicode NFC (929 files)
Affected fields: @lemma, @unicode, @transliteration, and element text content
NFC reorders combining marks to canonical order: vowel points before dagesh/shin-dot on the same consonant
Lowfat and TSV will be correct after regeneration from nodes
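The reordering described above can be seen on a single consonant. This is a minimal sketch using only the standard library; the sample string (bet with dagesh entered before patah) is an illustrative assumption, not data taken from the node files.

```python
import unicodedata

# Bet, then dagesh (CCC 21), then patah (CCC 17): dagesh entered before the vowel.
non_canonical = "\u05D1\u05BC\u05B7"

# NFC sorts combining marks on the same base by canonical combining class.
normalized = unicodedata.normalize("NFC", non_canonical)

# Show each code point of the result with its combining class.
for ch in normalized:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch):>2} {unicodedata.name(ch)}")
```

After normalization the patah (lower class) precedes the dagesh, and the two strings no longer compare equal even though they render the same.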

Approach

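python/normalize_nfc.py applies unicodedata.normalize("NFC", ...) to the raw file text. This is safe for XML because all structural characters are ASCII starters (combining class 0): NFC never reorders combining marks across a starter, and no Hebrew combining mark composes with an ASCII character, so the markup is left unchanged.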
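A rough sketch of that approach, for readers who want to see its shape. This is not the contents of python/normalize_nfc.py: the function names `normalize_text` and `normalize_tree` and the `dry_run` parameter are hypothetical, and the `w` element in the usage note is only an illustration.

```python
import unicodedata
from pathlib import Path


def normalize_text(text: str) -> str:
    """Return the NFC form of raw file text (markup and content together)."""
    return unicodedata.normalize("NFC", text)


def normalize_tree(root: Path, dry_run: bool = False) -> list[Path]:
    """Normalize every *.xml file directly under root.

    Returns the files whose text is not already NFC. With dry_run=True
    the files are only reported, not rewritten.
    """
    changed = []
    for path in sorted(root.glob("*.xml")):
        # Work on bytes so that line endings are preserved exactly.
        original = path.read_bytes().decode("utf-8")
        normalized = normalize_text(original)
        if normalized != original:
            changed.append(path)
            if not dry_run:
                path.write_bytes(normalized.encode("utf-8"))
    return changed
```

Under these assumptions, `normalize_text('<w lemma="\u05D1\u05BC\u05B7"/>')` changes only the order of the two marks inside the attribute value, and `normalize_tree(Path("WLC/nodes"), dry_run=True)` returns an empty list once every file is already in NFC.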

Test plan

test_file_is_nfc added to test_nodes.py — enforces NFC on all node files going forward
python/normalize_nfc.py --dry-run should report 0 files after merge
Note: this PR should be applied after #136 ("Strip CGJ (U+034F) from lemmas") and #137 ("Fix missing @after attributes on 832 morphemes"), or rebased on top of them, because all three touch WLC/nodes/*.xml.
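For illustration, a check in the spirit of test_file_is_nfc could look like the sketch below. It is an assumption about how such a test might be written, not the code added to test_nodes.py; the helper names `non_nfc_lines` and `check_file_is_nfc` are hypothetical. It relies on `unicodedata.is_normalized`, which requires Python 3.8 or later.

```python
import unicodedata
from pathlib import Path


def non_nfc_lines(text: str) -> list[int]:
    """Return the 1-based numbers of lines that are not in NFC."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]


def check_file_is_nfc(path: Path) -> None:
    """Fail with the offending line numbers if the file is not NFC."""
    bad = non_nfc_lines(path.read_text(encoding="utf-8"))
    assert not bad, f"{path}: lines not in NFC: {bad[:10]}"
```

Reporting line numbers rather than a bare pass/fail makes a regression easier to locate after a new intake. In a pytest suite the check could be run once per file under WLC/nodes.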

🤖 Generated with Claude Code


Reorders combining marks in @lemma, @unicode, @transliteration, and element text content to Unicode canonical order (lower CCC first). For Hebrew, this means vowel points precede dagesh/shin-dot on the same consonant, which is the form produced by browsers, keyboards, and standard libraries.

929 node files updated. Lowfat and TSV will reflect the fix after regeneration from nodes.

Adds python/normalize_nfc.py for future use if needed (e.g. after a new OSHB intake). Adds test_file_is_nfc to enforce NFC going forward.

NFC normalization of raw XML text is safe: all XML structural characters are ASCII (combining class 0) and are unaffected.

Closes #138

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jonathanrobie · 2026-04-23T17:11:24Z

Merged into PR #136 (fix/strip-cgj-from-lemmas) to reduce the number of PRs to merge. Both CGJ stripping and NFC normalization will ship together.

jonathanrobie closed this Apr 23, 2026

jonathanrobie wants to merge 1 commit into main from fix/nfc-normalize-hebrew
