Normalize all Hebrew text to Unicode NFC #139

Closed
jonathanrobie wants to merge 1 commit into main from fix/nfc-normalize-hebrew

@jonathanrobie
Contributor

Closes #138.

Summary

  • Normalizes all Hebrew text in WLC/nodes/*.xml to Unicode NFC (929 files)
  • Affected fields: @lemma, @unicode, @transliteration, and element text content
  • NFC reorders combining marks to canonical order: vowel points before dagesh/shin-dot on the same consonant
  • Lowfat and TSV will be correct after regeneration from nodes
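
As a small illustration of the reordering described above, the sketch below normalizes one pointed consonant. The sample string is invented for the example and is not taken from the WLC data; the combining classes come from Python's own `unicodedata` tables.

```python
import unicodedata

BET = "\u05D1"     # HEBREW LETTER BET (starter, combining class 0)
DAGESH = "\u05BC"  # HEBREW POINT DAGESH OR MAPIQ (combining class 21)
QAMATS = "\u05B8"  # HEBREW POINT QAMATS (combining class 18)

# Illustrative input: dagesh stored before the vowel point.
raw = BET + DAGESH + QAMATS
nfc = unicodedata.normalize("NFC", raw)

print([unicodedata.combining(c) for c in raw])  # [0, 21, 18]
print([unicodedata.combining(c) for c in nfc])  # [0, 18, 21]
print(nfc == BET + QAMATS + DAGESH)             # True
```

Both strings render identically; only the stored code point order differs, which is why byte-level comparisons and lookups can disagree before normalization.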

Approach

python/normalize_nfc.py applies unicodedata.normalize("NFC", ...) to the raw text of each file. This is safe for this XML because every structural character is ASCII with combining class 0: NFC never reorders marks across such a character, and no Hebrew point composes with an ASCII base character.
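
The script itself is not shown in this conversation, so the following is only a minimal sketch of the approach as described. The function names `normalize_file` and `normalize_tree` are hypothetical, and reading and writing bytes (rather than text mode) is an assumption made here so that line endings are left untouched.

```python
import unicodedata
from pathlib import Path


def normalize_file(path: Path) -> bool:
    """Rewrite one file in NFC. Return True if its content changed."""
    # Decode and encode explicitly so newline bytes are never translated.
    original = path.read_bytes().decode("utf-8")
    normalized = unicodedata.normalize("NFC", original)
    if normalized == original:
        return False
    path.write_bytes(normalized.encode("utf-8"))
    return True


def normalize_tree(root: Path, pattern: str = "*.xml") -> int:
    """Normalize every matching file directly under root; return how many changed."""
    return sum(normalize_file(p) for p in sorted(root.glob(pattern)))
```

Under these assumptions, `normalize_tree(Path("WLC/nodes"))` would cover the directory named in the summary, and a second run should report zero changed files because NFC is idempotent.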

Test plan

🤖 Generated with Claude Code

Reorders combining marks in @lemma, @unicode, @transliteration, and
element text content to Unicode canonical order (lower CCC first).
For Hebrew, this means vowel points precede dagesh/shin-dot on the
same consonant — the form produced by browsers, keyboards, and
standard libraries.

929 node files updated. Lowfat and TSV will reflect the fix after
regeneration from nodes.

Adds python/normalize_nfc.py for future use if needed (e.g. after
a new OSHB intake). Adds test_file_is_nfc to enforce NFC going forward.
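
The commit names test_file_is_nfc but its body is not shown here. A check of that kind could look roughly like the sketch below; the helper `first_non_nfc_line`, the `NODES_DIR` constant, and the loop over files are assumptions for illustration, not the repository's actual test.

```python
import unicodedata
from pathlib import Path

NODES_DIR = Path("WLC/nodes")  # directory named in the PR summary


def first_non_nfc_line(text: str):
    """Return the 1-based number of the first line that is not NFC, or None."""
    for number, line in enumerate(text.splitlines(), start=1):
        if not unicodedata.is_normalized("NFC", line):
            return number
    return None


def test_file_is_nfc():
    # Hypothetical body: fail on the first file containing a non-NFC line.
    for path in sorted(NODES_DIR.glob("*.xml")):
        bad = first_non_nfc_line(path.read_text(encoding="utf-8"))
        assert bad is None, f"{path}:{bad} is not NFC-normalized"
```

Reporting a line number keeps a failure actionable when a later intake reintroduces non-canonical mark order. `unicodedata.is_normalized` requires Python 3.8 or later.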

NFC normalization of raw XML text is safe: all XML structural
characters are ASCII (combining class 0) and are unaffected.

Closes #138

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jonathanrobie
Contributor Author

Merged into PR #136 (fix/strip-cgj-from-lemmas) to reduce the number of PRs to merge. Both CGJ stripping and NFC normalization will ship together.

Development

Successfully merging this pull request may close these issues.

Unicode normalization of Hebrew combining marks

1 participant