Fix missing @after attributes on 832 morphemes #137

Open: jonathanrobie wants to merge 2 commits into main from fix/missing-after-attributes

@jonathanrobie
Contributor

Summary

  • 832 <m> elements (last morpheme of orthographic words) had after="" where they should have after=" " (space) or after="־" (maqaf)
  • The bug originates from the OSHB upstream source where some words have empty @after
  • TEI files (WLC/tei/) are the authoritative source for the correct inter-word separator

Approach

Adds python/fix_missing_after.py which:

  • Uses lxml to parse TEI files and build a map of correct @after values
  • Uses lxml to parse nodes files and identify which morphemes need fixing
  • Uses targeted string replacement (not lxml serialization) to write changes — preserves original whitespace and multi-line attribute formatting
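The targeted-replacement step can be illustrated with a minimal sketch. This is not the code in python/fix_missing_after.py: the function name `set_after`, the sample markup, and the ids `o1` and `o2` are hypothetical, and the sketch uses only the standard `re` module so it stays self-contained. It assumes the id of the morpheme to fix and its correct separator have already been found by the parsing steps.

```python
import re

def set_after(xml_text: str, xml_id: str, new_after: str) -> str:
    """Return xml_text with after="" changed on the <m> tag carrying xml_id.

    Only the attribute value is rewritten; indentation, line breaks and
    attribute order are left exactly as they were in the input.
    """
    # Opening tag of the single <m> element with this xml:id. [^>] also
    # matches newlines, so attributes spread over several lines are covered.
    tag_re = re.compile(r'<m\b[^>]*?\sxml:id="' + re.escape(xml_id) + r'"[^>]*>')
    empty_after_re = re.compile(r'(\s)after=""')

    def fix_tag(tag_match: re.Match) -> str:
        # new_after is assumed to need no XML escaping (a space or a maqaf).
        return empty_after_re.sub(
            lambda m: f'{m.group(1)}after="{new_after}"',
            tag_match.group(0),
            count=1,
        )

    return tag_re.sub(fix_tag, xml_text, count=1)


# Hypothetical sample: two morphemes, the first with multi-line attributes.
sample = (
    '<w>\n'
    '  <m xml:id="o1"\n'
    '     lemma="a"\n'
    '     after="">x</m>\n'
    '  <m xml:id="o2" lemma="b" after="">y</m>\n'
    '</w>'
)
fixed = set_after(sample, "o2", " ")
```

Because the text is edited in place rather than re-serialized, a diff of the result shows only the changed attribute value, which is the property the third bullet relies on.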

Notes

Test plan

  • Verify python/fix_missing_after.py --dry-run reports 0 fixes after merge
  • Spot-check a few fixed morphemes (e.g. GEN 1:12!10, DEU 5:6) against TEI source
  • Confirm no whitespace/formatting changes beyond the after attribute value

🤖 Generated with Claude Code

jonathanrobie and others added 2 commits April 22, 2026 19:43
Words that were not sentence-final had after="" where they should
have after=" " (space) or after="־" (maqaf). The TEI files are the
authoritative source for the correct separator values.

Adds python/fix_missing_after.py which uses lxml for safe XML parsing
to identify fixes, then targeted string replacement to preserve the
original whitespace and attribute formatting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add tests that would have caught the bugs fixed in PRs #136 and #137:
- No CGJ (U+034F) anywhere in nodes, lowfat, or TSV
- Non-final orthographic words must have non-empty @after
- @after values must be from the known valid set
- @lemma and @morph must be non-empty on <m> elements
- xml:id must match expected format (o<digits>, with optional ה suffix)
- TSV ref format, lemma non-empty, after valid values

Currently failing tests reflect known pending issues:
- CGJ: fixed by PR #136 (pending merge + regeneration)
- Lowfat @after: fixed by PR #137 (pending merge + regeneration)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
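
As a rough illustration of the per-morpheme checks listed in the second commit message, the sketch below validates xml:id format, non-empty @lemma and @morph, and @after membership. It is not the test code from this PR. The function name `check_morphemes` and the sample markup are hypothetical, the set `ASSUMED_VALID_AFTER` is an assumption (the real "known valid set" is not spelled out here), and the sketch assumes `<m>` elements are not in an XML namespace. It uses the standard library's `xml.etree.ElementTree` instead of lxml to stay dependency-free.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# o<digits>, with optional ה suffix, as described in the commit message.
ID_RE = re.compile(r"o\d+ה?")

# Assumption: empty, space and maqaf only; the project's real set may be larger.
ASSUMED_VALID_AFTER = frozenset({"", " ", "\u05be"})

def check_morphemes(xml_text, valid_after=ASSUMED_VALID_AFTER):
    """Return a list of (xml_id, problem) pairs for every <m> element."""
    problems = []
    for m in ET.fromstring(xml_text).iter("m"):
        mid = m.get(XML_ID, "")
        if not ID_RE.fullmatch(mid):
            problems.append((mid, "bad xml:id"))
        for name in ("lemma", "morph"):
            if not m.get(name):
                problems.append((mid, f"empty @{name}"))
        if m.get("after", "") not in valid_after:
            problems.append((mid, "unexpected @after"))
    return problems


# Hypothetical sample: one well-formed morpheme and one with three problems.
sample = (
    '<s>'
    '<m xml:id="o1" lemma="a" morph="b" after=" ">x</m>'
    '<m xml:id="x2" lemma="" morph="b" after="_">y</m>'
    '</s>'
)
problems = check_morphemes(sample)
```

The check that non-final orthographic words have a non-empty @after is left out, since it depends on how words group their morphemes in the nodes files.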