Fix missing @after attributes on 832 morphemes #137
Open
jonathanrobie wants to merge 2 commits into
Words that were not sentence-final had after="" where they should have after=" " (space) or after="־" (maqaf). The TEI files are the authoritative source for the correct separator values.

Adds python/fix_missing_after.py, which uses lxml for safe XML parsing to identify fixes, then targeted string replacement to preserve the original whitespace and attribute formatting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
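The parse-then-patch approach described in this commit could look roughly like the sketch below. It is not the code in python/fix_missing_after.py: it swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without dependencies, the function names are hypothetical, and the `separators` mapping stands in for whatever lookup the real script derives from the TEI files. It also assumes un-namespaced `<m>` elements and no `>` inside attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree uses for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_empty_after(xml_text):
    """Parse the document and return the xml:ids of <m> elements with after=""."""
    root = ET.fromstring(xml_text)
    return [
        m.get(XML_ID)
        for m in root.iter("m")
        if m.get("after") == "" and m.get(XML_ID)
    ]


def apply_fixes(xml_text, separators):
    """Patch after="" in the raw text for ids listed in `separators`.

    `separators` is a hypothetical {xml:id: separator} mapping (assumed to be
    derived from the TEI source). Only the attribute value is rewritten, so
    whitespace and attribute order in the rest of the file are untouched.
    """
    fixed = 0
    for mid in find_empty_after(xml_text):
        sep = separators.get(mid)
        if not sep:
            continue  # no known separator for this morpheme; leave it alone
        # Locate the start tag of this specific <m> element in the raw text.
        pattern = re.compile(r'<m\b[^>]*\bxml:id="' + re.escape(mid) + r'"[^>]*>')
        match = pattern.search(xml_text)
        if match is None:
            continue
        tag = match.group(0)
        new_tag = tag.replace('after=""', 'after="' + sep + '"', 1)
        if new_tag == tag:
            continue  # attribute was written in a form this sketch does not handle
        xml_text = xml_text[:match.start()] + new_tag + xml_text[match.end():]
        fixed += 1
    return xml_text, fixed


# Illustrative input only; real ids and attributes in WLC/nodes differ.
sample = (
    "<w>\n"
    '  <m xml:id="o1"   lemma="a" after="">x</m>\n'
    '  <m xml:id="o2" lemma="b" after="">y</m>\n'
    '  <m xml:id="o3" lemma="c" after="">z</m>\n'
    "</w>"
)
fixed_text, count = apply_fixes(sample, {"o1": " ", "o2": "\u05be"})
```

Parsing is used only to decide which elements need a fix; the edit itself happens on the raw string, which is why the irregular spacing inside the first `<m>` tag survives unchanged.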
Add tests that would have caught the bugs fixed in PRs #136 and #137:
- No CGJ (U+034F) anywhere in nodes, lowfat, or TSV
- Non-final orthographic words must have non-empty @after
- @after values must be from the known valid set
- @lemma and @morph must be non-empty on <m> elements
- xml:id must match expected format (o<digits>, with optional ה suffix)
- TSV ref format, lemma non-empty, after valid values

Currently failing tests reflect known pending issues:
- CGJ: fixed by PR #136 (pending merge + regeneration)
- Lowfat @after: fixed by PR #137 (pending merge + regeneration)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
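A few of the checks listed in this commit can be sketched as a single validation function. This is an illustration, not the PR's test code: the function name is hypothetical, the contents of `VALID_AFTER` are an assumption (the commit does not spell out the known valid set), and the non-final-word and TSV checks are omitted because they depend on file layout that is not shown here.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER
# Assumption: empty, space, and maqaf; the real valid set may be larger.
VALID_AFTER = {"", " ", "\u05be"}
# o<digits>, with an optional Hebrew he (U+05D4) suffix.
XML_ID_RE = re.compile("o[0-9]+\u05d4?")


def check_nodes(xml_text):
    """Return a list of human-readable problems found in one XML document."""
    problems = []
    if CGJ in xml_text:
        problems.append("CGJ (U+034F) present")
    root = ET.fromstring(xml_text)
    for m in root.iter("m"):
        mid = m.get(XML_ID, "")
        if not XML_ID_RE.fullmatch(mid):
            problems.append(f"bad xml:id: {mid!r}")
        for attr in ("lemma", "morph"):
            if not m.get(attr):
                problems.append(f"{mid}: empty @{attr}")
        after = m.get("after")
        if after is not None and after not in VALID_AFTER:
            problems.append(f"{mid}: unexpected @after {after!r}")
    return problems


# Illustrative documents only; attribute values are made up.
good = (
    '<w><m xml:id="o010010010011" lemma="1254" morph="HVqp3ms" after=" ">x</m>'
    '<m xml:id="o010010010021\u05d4" lemma="d" morph="HTd" after="">y</m></w>'
)
bad = '<w><m xml:id="w1" lemma="a\u034fb" morph="" after="x">t</m></w>'
```

Returning a list of problems rather than raising on the first one makes it easy to wrap each check in a test that reports every offending morpheme at once.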
This was referenced Apr 23, 2026
Summary
- `<m>` elements (last morpheme of orthographic words) had `after=""` where they should have `after=" "` (space) or `after="־"` (maqaf)
- […] `@after` […]
- The TEI files (`WLC/tei/`) are the authoritative source for the correct inter-word separator

Approach
Adds `python/fix_missing_after.py` which:
- […] `@after` values

Notes
- […] `WLC/nodes/*.xml`; if "Strip CGJ (U+034F) from lemmas" #136 merges first, a rebase will pick it up cleanly

Test plan
- `python/fix_missing_after.py --dry-run` reports 0 fixes after merge
- […] (`GEN 1:12!10`, `DEU 5:6`) against TEI source
- […] `after` attribute value

🤖 Generated with Claude Code