Fix PO file escaping inconsistencies#1059
Merged
Merged
Conversation
67c289f to
66dba64
Compare
8bd5bf2 to
1acc15c
Compare
Author
|
@ehoogerbeets I've split up the large POFilter refactor commit into smaller logical chunks to make it easier to understand.
|
Text unit names generated from msgid should be fully unescaped, matching the source.
Cover escape/unescape test cases handled by Okapi PO implementation to ensure that Mojito's overridden class behaves consistently.
Source strings from PO files would be doubly unescaped in the extraction, while target strings would not be escaped at all when generating localized assets.
Drop the getEncoderManager() override that mapped PO output to Mojito's custom POEncoder. The SimpleEncoder-based POEncoder missed re-escaping of backslashes, tabs and formfeeds. All encoding now flows through Okapi's builtin PO encoder.
Drop UnescapeUtils.replaceEscapedQuotes which was applied on top of Okapi's already-unescaped content, causing double-unescaping bugs. Okapi's builtin PO parser already handles quote unescaping correctly, so the extra pass was both redundant and harmful.
Flatten the indirect dispatch chain (next -> readNextEvents -> getNextWithProcess -> processTextUnit) into a single next() that calls super.next() directly and handles each event type inline. Replace the recursive readPlurals() with processNextPluralGroup() which drains text units in a simple while-until-END_GROUP loop instead of tracking state across recursive calls. Remove intermediate methods: readNextEvents, getNextWithProcess, processTextUnit, isPluralGroupEnding, readPlurals, adaptPlurals, setUsagesOnTextUnits. Reflection-based field loading (msgID, msgIDPlural, poPluralForm) and renameTextUnitWithSourceAndContent are kept unchanged.
Split renameTextUnitWithSourceAndContent into two focused methods: - setTextUnitName(textUnit, baseName): sets the base name + msgctxt - appendPluralFormToName(textUnit, cldrForm): appends the plural suffix In processNextPluralGroup, use a local loop index to derive the CLDR form instead of the stateful poPluralForm field. This removes the poPluralForm field entirely and decouples plural suffix logic from the base naming method.
Remove the msgID/msgIDPlural fields and the reflection helpers (loadMsgIDFromParent, loadMsgIDPluralFromParent, makeAccessibleAndGetString) that loaded them from the parent Okapi POFilter on every event. For non-plural text units, derive the base name directly from the text unit's own source (already unescaped by Okapi). For plural groups, capture singularSource from the first text unit and pluralSource from the second text unit within processNextPluralGroup. Pass both to PoPluralsHolder via constructor parameters so that adaptTextUnitToCLDRForm uses locally-captured values instead of mutable outer-class state. Note: when nPlurals=1 (e.g. Japanese) there is no second text unit, so pluralSource falls back to singularSource. This is addressed in the next commit.
When a locale has only one plural form (nPlurals=1, e.g. Japanese), Okapi emits a single TEXT_UNIT with the singular source (msgid). But CLDR maps this to the "other" form, which should carry the plural source (msgid_plural). Since there is no second text unit to read it from, extract msgid_plural directly from the START_GROUP skeleton and unescape it via the parent filter's private unescape method (accessed through reflection because it has no public alternative). Also updates class-level and method-level Javadoc across the file to document the overridden behavior, the PO/CLDR plural model mismatch, and the CLDR category mapping rationale in adaptTextUnitToCLDRForm.
Generation of localized PO files is now handled by Okapi's builtin default POEncoder.
ehoogerbeets
approved these changes
Apr 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR refactors the
POFilteroverride to fix handling of escape sequences when parsing and serializing PO(T) files.Issue description
We've run into the following problems related to PO escape sequence handling:
1. Double unescaping on source
When parsing POT files, strings would be doubly unescaped, e.g. a source string
encoded in a POT file as
would display in Mojito UI as
2. Lack of unescaping on name
The same string would show up completely unescaped in the text unit name
3. Lack of escaping on target
Finally, after running
mojito-cli pull, a valid translationwould not be unescaped in the generated localized PO file:
resulting in a "dangling" backslash.
Root causes
Mojito's POFilter subclass layered some custom mechanisms on top of Okapi's base PO filter that introduced escaping issues:
unescpaeran a second unescaping pass on the Text Unit content after Okapi had already unescaped it. Introduced in PO file escaping/unescaping #436, likely by accident while fixing insufficient escaping on plurals (described below).adaptTextUnitToCLDRForm. That mechanism replaced the source content withreplaceEscapedQuotes(msgID)/replaceEscapedQuotes(msgIDPlural)which meant plurals don't suffer from double-unescaping (themsgIDandmsgIDPluralare private fields of the parent parser that statefully hold pieces of raw file data during parsing), but in turn they have incomplete unescaping (miss\\,\netc.) and are inconsistent with non-plurals. Reflective access to raw fieldsmsgIDandmsgIDPluralwas introduced as early as Pmaster #210 (i.e. the beginning of plurals support in PO files), while thereplaceEscapedQuotescall on them was added in Unescape plural form in PO files #500.msgID- without any unescaping this time around. Also introduced in Pmaster #210.POEncoder extends SimpleEncoderwas introduced in PO file escaping/unescaping #436 (not sure why, since the only method on that class,toNative, is a literal copy of Okapi'sPOEncoder#toNativeso I can't see any obvious reason for using that instead of Okapi's nativePOEncoder). Anyway,SimpleEncoderimplements its own escaping, which again is incomplete compared toPOEncoder, as it does not escape backslash characters.Solution
TLDR:
Detailed description
Fixes issue 3 (target not escaped). Drops
getEncoderManager()override so Okapi's builtin PO encoder handles all output serialization, correctly re-escaping\,",\t,\n,\f, etc.Fixes issue 1 (source double-unescaped for non-plurals). Removes the redundant
unescpaepass and any references toUnescapeUtils. This temporarily removes any unescaping in plurals (before it was incomplete, now there's none), but it's addressed in a different way in step 5.Flattens and simplifies the logic for intercepting events and buffering plural groups. Gets rid of unnecessary recursion. Refactor of the buffering logic is necessary to simplify name handling and plural index counting (step 4), and to allow capturing plural group sources from the text units accordingly (step 5).
Splits
renameTextUnitWithSourceAndContentinto two separate methods:setTextUnitNamechanges the unit name to the source value, whileappendPluralFormToNameadds the plural form suffix. With that, it's much easier to follow the logic:setTextUnitNameexecutes for every text unit, whileappendPluralFormToNameonly executes in a sub-loop while processing a plural group. Additionally, this replaces apoPluralFormcounter field with an index variable local to the plural group processing sub-loop.Fixes issue 2 (name not unescaped) and follows up on step 2 (lack of unescaping in plurals). Removes
msgID/msgIDPluralfields and all related reflection helpers. Names are now always derived fromtextUnit.getSource().toString(), which is already unescaped by Okapi and needs no further processing. For plural groups, singular/plural sources are captured locally withinprocessNextPluralGroupduring the initial plural group buffering phase. Values are passed toPoPluralsHoldervia constructor, so that it can use them when running its logic: adjusting the names (every plural group unit has name derived from the singular source), sources (every plural unit which is notonehas its source set to the plural source), and injecting/dropping units according to the CLDR plural rules.nPlurals=1edge case via skeleton extractionConsequence of step 5. With
msgIDPluralgone, parentPOFilteremits plural events according to the rules set in file header. When processing a file withnPlurals=1(e.g. importing translations from a translated Japanese PO file), the whole plural group consists of only one text unit - so there is nowhere to read the plural source from (actually, this Japanese case would technically be correct with either source - but Okapi implementation produces a unit with the singular source and we need to work around it). As described in step 5, we need to obtain a unit which has singular source in the name, and plural source in the source. Since there is no 2nd unit, the plural source is re-parsed in a fallback way from theSTART_GROUPevent skeleton (i.e. an excerpt from the raw file content that contains the whole plural group). This sucks, since we just went through all the effort of NOT re-implementing the base class features in our subclass, but I can't see a better way to do it. To at least make sure that we don't mess up the unescaping again, reflective access to the parent's private methodunescapeis used. In my opinion, this approach is slightly less fragile than the previous one, because themsgIDPluralfield is a stateful field of the parent parser - when using it, on top of the postprocessing (unescaping) we also need worry about timing the parent class correctly - plus this fallback is only used in the edge case of parsing a target file withnPlurals=1, so the impact of any potential bugs is limited.Followup to step 1. POEncoder.java is dead code after
getEncoderManager()override removalFixes an unexpected logical bug introduced in step 4 by skipping
appendPluralFormToNamefor text units whose PO form index has no CLDR mapping (e.g. form 1 for Japanese). This is a weird one.When generating a localized Japanese PO file, the parent
POFilterstill emits two text units (because on the input it has a source POT file withnPlurals=2). These two units get buffered in our subclass and processed byPoPluralsHolder. Japanese PO plural rules (nPlurals=1) state that there should be only one PO plural form0which maps to the CLDRothercategory - sopoFormToCldrForm(1)returnsnull.In the old code, unit renaming code would actually always append the mapping suffix, so these units would end up with literal
_nullin the name - butPluralsHolder#getCldrPluralFormOfEventwould backconvert it tonulland the unit would be silently dropped. This also dates back to Pmaster #210.Changes from step 4 skip
_nullcategory suffix in advance (since it doesn't really make sense) - but this means that these units wouldn't be dropped anymore. Rather than relying on the hidden, hacky_nullfiltering deep withinPluralsHolder, we now explicitly filter out events with no CLDR category mapping in advance. Note: we should probably get rid of this hidden logic inPluralsHolderaltogether, but it might impact handling of plurals in other file types, so I'll leave it for now.Also updated tests to verify the fixed behaviours:
• 0b14c77 — Updates existing tests to expect unescaped names
• 232bf01 — Ports escape/unescape tests from Okapi's POFilterTest to ensure override parity
• 426f4b9 — Adds the exact reproduction case from the issue (double-unescape on parse + missing re-escape on write)
• 84c022b — Removes tests for deleted reflection methods
• 88cee79 — Updates
\'test: not a standard PO escape, so Okapi correctly preserves the backslash as literal