Fix PO file escaping inconsistencies #1059

Merged: wadimw merged 13 commits into upstream-patched from fix-po-escaping on Apr 21, 2026

@wadimw wadimw commented Mar 17, 2026

This PR refactors the POFilter override to fix handling of escape sequences when parsing and serializing PO(T) files.

Issue description

We've run into the following problems related to PO escape sequence handling:

1. Double unescaping on source

When parsing POT files, strings would be doubly unescaped, e.g. a source string

Name cannot contain "/", "\", or characters outside the basic multilingual plane.

encoded in a POT file as

msgid "Name cannot contain \"/\", \"\\\", or characters outside the basic multilingual plane."

would display in Mojito UI as

Name cannot contain "/", "", or characters outside the basic multilingual plane.
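
The effect can be reproduced outside Mojito. The following is a minimal sketch, not Okapi's or Mojito's actual code: `unescape` is a hypothetical helper that handles only a few PO escapes, and the sample string is a shortened version of the one above. Running it twice over the raw msgid content swallows the backslash in the same way.

```java
final class DoubleUnescapeDemo {

    // Hypothetical, simplified PO unescaper (assumption: not Okapi's implementation).
    // Handles \" \\ \n \t; any other backslash is kept as a literal character.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(i + 1);
                switch (next) {
                    case '"':  out.append('"');  i++; continue;
                    case '\\': out.append('\\'); i++; continue;
                    case 'n':  out.append('\n'); i++; continue;
                    case 't':  out.append('\t'); i++; continue;
                    default:   break; // unknown escape: fall through and keep the backslash
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw content between the quotes of the msgid line (shortened sample).
        String raw = "Name cannot contain \\\"/\\\", \\\"\\\\\\\".";
        String once = unescape(raw);   // Name cannot contain "/", "\".
        String twice = unescape(once); // Name cannot contain "/", "".
        System.out.println(once);
        System.out.println(twice);
    }
}
```

The second pass sees the already-unescaped backslash followed by a quote and treats the pair as another escape sequence, which is why the backslash disappears.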

2. Lack of unescaping on name

The same string would show up completely unescaped in the text unit name

Name cannot contain \"/\", \"\\\", or characters outside the basic multilingual plane.

3. Lack of escaping on target

Finally, after running mojito-cli pull, a valid translation

Nazwa nie może zawierać „/”, „\”, ani znaków spoza podstawowej klawiatury wielojęzycznej.

would not be escaped in the generated localized PO file:

msgid "Name cannot contain \"/\", \"\\\", or characters outside the basic multilingual plane."
msgstr "Nazwa nie może zawierać „/”, „\”, ani znaków spoza podstawowej klawiatury wielojęzycznej."

resulting in a "dangling" backslash.
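
For reference, here is a minimal sketch of what re-escaping on write has to do. This is an illustration under assumptions, not Okapi's POEncoder: `escape` and `msgstrLine` are hypothetical helpers covering only a handful of characters. Note that the typographic quotes in the Polish translation are not ASCII quotes, so only the backslash needs escaping there.

```java
final class PoEscapeDemo {

    // Hypothetical minimal PO escaper (assumption: illustrative only).
    // The backslash must be escaped too, otherwise a reader of the file
    // would treat it as the start of an escape sequence.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break;
                case '\t': out.append("\\t");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Builds a single-line msgstr entry from an unescaped translation.
    static String msgstrLine(String translation) {
        return "msgstr \"" + escape(translation) + "\"";
    }

    public static void main(String[] args) {
        // Prints: msgstr "path\\to \"file\""
        System.out.println(msgstrLine("path\\to \"file\""));
    }
}
```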

Root causes

Mojito's POFilter subclass layered some custom mechanisms on top of Okapi's base PO filter that introduced escaping issues:

  • Parsing (content — non-plurals): unescpae ran a second unescaping pass on the Text Unit content after Okapi had already unescaped it. Introduced in PO file escaping/unescaping #436, likely by accident while fixing insufficient escaping on plurals (described below).
  • Parsing (content — plurals): The double-unescaping technically ran here too, but the content was subsequently overwritten by adaptTextUnitToCLDRForm. That mechanism replaced the source content with replaceEscapedQuotes(msgID) / replaceEscapedQuotes(msgIDPlural), where msgID and msgIDPlural are private fields of the parent parser that statefully hold pieces of raw file data during parsing. As a result, plurals did not suffer from double-unescaping, but their unescaping was incomplete (it missed \\, \n, etc.) and inconsistent with non-plurals. Reflective access to the raw msgID and msgIDPlural fields was introduced as early as Pmaster #210 (i.e. the beginning of plurals support in PO files), while the replaceEscapedQuotes call on them was added in Unescape plural form in PO files #500.
  • Naming: Text unit names generated by Okapi PO filter were overwritten with the source string, but its value was again sourced from raw msgID - without any unescaping this time around. Also introduced in Pmaster #210.
  • Serialization (target): A custom POEncoder extends SimpleEncoder was introduced in PO file escaping/unescaping #436 (not sure why, since the only method on that class, toNative, is a literal copy of Okapi's POEncoder#toNative so I can't see any obvious reason for using that instead of Okapi's native POEncoder). Anyway, SimpleEncoder implements its own escaping, which again is incomplete compared to POEncoder, as it does not escape backslash characters.
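
To illustrate the incomplete unescaping on plurals described in the list above, here is a small sketch. `quotesOnly` is a hypothetical stand-in written in the spirit of a quote-only helper; it is not the actual body of replaceEscapedQuotes, which is not shown in this PR description.

```java
final class QuoteOnlyUnescapeDemo {

    // Assumption: stand-in for a helper that only turns \" into ".
    // Every other escape sequence (\\, \n, \t, ...) is left untouched.
    static String quotesOnly(String raw) {
        return raw.replace("\\\"", "\"");
    }

    public static void main(String[] args) {
        // Raw msgid_plural content: \"a\\b\"\n
        String raw = "\\\"a\\\\b\\\"\\n";
        // Prints "a\\b"\n : the quotes are fixed, but the doubled backslash
        // and the two-character \n sequence survive as-is.
        System.out.println(quotesOnly(raw));
    }
}
```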

Solution

TLDR:

  • Removed Mojito's custom unescaping (UnescapeUtils.replaceEscapedQuotes) which caused double-unescaping bugs, and the getEncoderManager override whose SimpleEncoder-based POEncoder missed backslash/tab/formfeed re-escaping. All escape handling now flows through Okapi's builtin methods.
  • Removed stateful msgID/msgIDPlural/poPluralForm fields and the reflection that loaded them on every event. Text unit names and sources are now derived from the already-unescaped content on Okapi's emitted text units. Plural group processing collects all events locally instead of tracking form index across method calls.
  • Explicitly annotated the nPlurals=1 edge case (e.g. Japanese) where Okapi emits only one text unit with singular source. Implemented a workaround where the plural source is parsed from the START_GROUP skeleton's msgid_plural line instead of being loaded from parent class' stateful, private field. Didn't manage to fully get rid of reflection - parsing from skeleton utilizes parent's private unescape method. The new approach is arguably less fragile though, since it does not depend on parent class state at least.
  • Added multiple comments explaining overridden class functionalities.

Detailed description

  1. 40a951e Remove custom EncoderManager/POEncoder override
    Fixes issue 3 (target not escaped). Drops getEncoderManager() override so Okapi's builtin PO encoder handles all output serialization, correctly re-escaping \, ", \t, \n, \f, etc.
  2. 2acb0fa Remove custom unescaping
    Fixes issue 1 (source double-unescaped for non-plurals). Removes the redundant unescpae pass and any references to UnescapeUtils. This temporarily removes any unescaping in plurals (before it was incomplete, now there's none), but it's addressed in a different way in step 5.
  3. 5d025d9 Simplify event loop structure
    Flattens and simplifies the logic for intercepting events and buffering plural groups. Gets rid of unnecessary recursion. Refactor of the buffering logic is necessary to simplify name handling and plural index counting (step 4), and to allow capturing plural group sources from the text units accordingly (step 5).
  4. 813dea7 Simplify text unit name handling
    Splits renameTextUnitWithSourceAndContent into two separate methods: setTextUnitName changes the unit name to the source value, while appendPluralFormToName adds the plural form suffix. With that, it's much easier to follow the logic: setTextUnitName executes for every text unit, while appendPluralFormToName only executes in a sub-loop while processing a plural group. Additionally, this replaces a poPluralForm counter field with an index variable local to the plural group processing sub-loop.
  5. 33ada2a Remove reflection; capture sources locally
    Fixes issue 2 (name not unescaped) and follows up on step 2 (lack of unescaping in plurals). Removes msgID/msgIDPlural fields and all related reflection helpers. Names are now always derived from textUnit.getSource().toString(), which is already unescaped by Okapi and needs no further processing. For plural groups, singular/plural sources are captured locally within processNextPluralGroup during the initial plural group buffering phase. Values are passed to PoPluralsHolder via constructor, so that it can use them when running its logic: adjusting the names (every plural group unit has name derived from the singular source), sources (every plural unit which is not one has its source set to the plural source), and injecting/dropping units according to the CLDR plural rules.
  6. c25185d Handle parsing of nPlurals=1 edge case via skeleton extraction
    Consequence of step 5. With msgIDPlural gone, the parent POFilter emits plural events according to the rules set in the file header. When processing a file with nPlurals=1 (e.g. importing translations from a translated Japanese PO file), the whole plural group consists of only one text unit, so there is nowhere to read the plural source from (this Japanese case would technically be correct with either source, but the Okapi implementation produces a unit with the singular source and we need to work around it). As described in step 5, we need to obtain a unit which has the singular source in the name and the plural source in the source. Since there is no 2nd unit, the plural source is re-parsed in a fallback way from the START_GROUP event skeleton (i.e. an excerpt from the raw file content that contains the whole plural group). This sucks, since we just went through all the effort of NOT re-implementing the base class features in our subclass, but I can't see a better way to do it. To at least make sure that we don't mess up the unescaping again, reflective access to the parent's private method unescape is used. In my opinion, this approach is slightly less fragile than the previous one for two reasons. First, msgIDPlural is a stateful field of the parent parser: when using it, on top of the postprocessing (unescaping) we also need to worry about reading it at the right moment in the parent's parsing cycle. Second, this fallback is only used in the edge case of parsing a target file with nPlurals=1, so the impact of any potential bugs is limited.
  7. aeca119 Remove unused POEncoder
    Follow-up to step 1. POEncoder.java is dead code after the getEncoderManager() override removal.
  8. 6d03130 Filter out events with no CLDR category mapping
    Fixes an unexpected logical bug introduced in step 4 by skipping appendPluralFormToName for text units whose PO form index has no CLDR mapping (e.g. form 1 for Japanese). This is a weird one.
    When generating a localized Japanese PO file, the parent POFilter still emits two text units (because on the input it has a source POT file with nPlurals=2). These two units get buffered in our subclass and processed by PoPluralsHolder. Japanese PO plural rules (nPlurals=1) state that there should be only one PO plural form 0 which maps to the CLDR other category - so poFormToCldrForm(1) returns null.
    In the old code, unit renaming code would actually always append the mapping suffix, so these units would end up with literal _null in the name - but PluralsHolder#getCldrPluralFormOfEvent would backconvert it to null and the unit would be silently dropped. This also dates back to Pmaster #210.
    Changes from step 4 skip _null category suffix in advance (since it doesn't really make sense) - but this means that these units wouldn't be dropped anymore. Rather than relying on the hidden, hacky _null filtering deep within PluralsHolder, we now explicitly filter out events with no CLDR category mapping in advance. Note: we should probably get rid of this hidden logic in PluralsHolder altogether, but it might impact handling of plurals in other file types, so I'll leave it for now.
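
The explicit filtering from step 8 can be sketched roughly as follows. This is a simplified model, not the code from the commit: the locale rules inside `poFormToCldrForm`, the `nameBufferedUnits` helper, and the exact name format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CldrFilterDemo {

    // Hypothetical mapping from a PO plural form index to a CLDR category.
    // Assumption: "ja" (nPlurals=1) maps form 0 to "other" and nothing else;
    // every other locale uses a simplified two-form rule.
    static String poFormToCldrForm(String locale, int poForm) {
        if (locale.equals("ja")) {
            return poForm == 0 ? "other" : null;
        }
        switch (poForm) {
            case 0:  return "one";
            case 1:  return "other";
            default: return null;
        }
    }

    // Names the buffered units of one plural group, dropping up front any
    // unit whose PO form has no CLDR category instead of naming it "_null".
    static List<String> nameBufferedUnits(String locale, String baseName, int bufferedUnits) {
        List<String> names = new ArrayList<>();
        for (int poForm = 0; poForm < bufferedUnits; poForm++) {
            String cldrForm = poFormToCldrForm(locale, poForm);
            if (cldrForm == null) {
                continue; // explicit filter: no CLDR mapping, so no unit
            }
            names.add(baseName + "_" + cldrForm);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(nameBufferedUnits("ja", "file", 2)); // [file_other]
        System.out.println(nameBufferedUnits("en", "file", 2)); // [file_one, file_other]
    }
}
```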

Also updated tests to verify the fixed behaviours:

0b14c77 — Updates existing tests to expect unescaped names
232bf01 — Ports escape/unescape tests from Okapi's POFilterTest to ensure override parity
426f4b9 — Adds the exact reproduction case from the issue (double-unescape on parse + missing re-escape on write)
84c022b — Removes tests for deleted reflection methods
88cee79 — Updates \' test: not a standard PO escape, so Okapi correctly preserves the backslash as literal

@wadimw wadimw force-pushed the fix-po-escaping branch 7 times, most recently from 67c289f to 66dba64 on March 17, 2026 17:25
@wadimw wadimw requested a review from ehoogerbeets March 18, 2026 17:37
@wadimw wadimw changed the title Fix po escaping Fix PO file escaping inconsistencies Mar 18, 2026
@wadimw wadimw added the upstream-patched Experimental features ported from legacy branch label Mar 18, 2026
@wadimw wadimw marked this pull request as ready for review March 18, 2026 18:02
@wadimw wadimw requested review from jaknas and soygitana March 19, 2026 12:15
@wadimw wadimw force-pushed the fix-po-escaping branch 2 times, most recently from 8bd5bf2 to 1acc15c on March 24, 2026 22:34

wadimw commented Mar 24, 2026

@ehoogerbeets I've split up the large POFilter refactor commit into smaller logical chunks to make it easier to understand.

  1. dropped the POEncoder
  2. dropped the custom escaping/unescaping
  3. simplified the event loop buffering
  4. simplified text unit rename logic
  5. removed access to stateful private parent class fields via reflection
  6. added re-parsing of plural source from group start event skeleton

Base automatically changed from refactor-filetype-tests to upstream-patched March 25, 2026 11:55
wadimw added 11 commits March 25, 2026 13:00
Text unit names generated from msgid should be fully unescaped, matching the source.
Cover escape/unescape test cases handled by Okapi PO implementation to ensure that Mojito's overridden class behaves consistently.
Source strings from PO files would be doubly unescaped in the extraction,
while target strings would not be escaped at all when generating localized assets.
Drop the getEncoderManager() override that mapped PO output to Mojito's
custom POEncoder. The SimpleEncoder-based POEncoder missed re-escaping of
backslashes, tabs and formfeeds. All encoding now flows through Okapi's
builtin PO encoder.
Drop UnescapeUtils.replaceEscapedQuotes which was applied on top of
Okapi's already-unescaped content, causing double-unescaping bugs.
Okapi's builtin PO parser already handles quote unescaping correctly,
so the extra pass was both redundant and harmful.
Flatten the indirect dispatch chain (next -> readNextEvents ->
getNextWithProcess -> processTextUnit) into a single next() that calls
super.next() directly and handles each event type inline.

Replace the recursive readPlurals() with processNextPluralGroup() which
drains text units in a simple while-until-END_GROUP loop instead of
tracking state across recursive calls.

Remove intermediate methods: readNextEvents, getNextWithProcess,
processTextUnit, isPluralGroupEnding, readPlurals, adaptPlurals,
setUsagesOnTextUnits.

Reflection-based field loading (msgID, msgIDPlural, poPluralForm) and
renameTextUnitWithSourceAndContent are kept unchanged.
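
A rough sketch of the while-until-END_GROUP draining described in the commit message above. The `Event` and `Type` types are simplified stand-ins (assumptions) and not Okapi's event classes; `drainPluralGroup` is a hypothetical name for the loop inside processNextPluralGroup.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class PluralGroupLoopDemo {

    enum Type { START_GROUP, TEXT_UNIT, END_GROUP, OTHER }

    // Simplified stand-in for a filter event (assumption: not Okapi's Event).
    static final class Event {
        final Type type;
        final String payload;

        Event(Type type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // Called right after START_GROUP was seen: consumes events up to and
    // including END_GROUP and returns the text units found in between.
    // No recursion and no state kept across calls.
    static List<Event> drainPluralGroup(Iterator<Event> upstream) {
        List<Event> units = new ArrayList<>();
        while (upstream.hasNext()) {
            Event event = upstream.next();
            if (event.type == Type.END_GROUP) {
                break;
            }
            if (event.type == Type.TEXT_UNIT) {
                units.add(event);
            }
        }
        return units;
    }
}
```

Because the group is collected into a local list, the position of each unit within the group is simply its list index.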
Split renameTextUnitWithSourceAndContent into two focused methods:
- setTextUnitName(textUnit, baseName): sets the base name + msgctxt
- appendPluralFormToName(textUnit, cldrForm): appends the plural suffix

In processNextPluralGroup, use a local loop index to derive the CLDR
form instead of the stateful poPluralForm field. This removes the
poPluralForm field entirely and decouples plural suffix logic from the
base naming method.
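
The split could look roughly like the sketch below. The method names and parameters follow the commit message; everything else is an assumption for illustration, including the `Unit` class, the `" --- "` context separator, and the `"_"` suffix format.

```java
import java.util.List;

final class NamingDemo {

    // Simplified stand-in for a text unit (assumption: not Okapi's ITextUnit).
    static final class Unit {
        String name;
        final String source;
        final String msgctxt; // may be null

        Unit(String source, String msgctxt) {
            this.source = source;
            this.msgctxt = msgctxt;
        }
    }

    // Runs for every text unit: base name plus optional msgctxt.
    static void setTextUnitName(Unit unit, String baseName) {
        unit.name = unit.msgctxt == null ? baseName : baseName + " --- " + unit.msgctxt;
    }

    // Runs only inside the plural group loop: appends the plural suffix.
    static void appendPluralFormToName(Unit unit, String cldrForm) {
        unit.name = unit.name + "_" + cldrForm;
    }

    // The form comes from a loop-local index, not from a field on the filter.
    static void namePluralGroup(List<Unit> group, String baseName, String[] cldrForms) {
        for (int i = 0; i < group.size(); i++) {
            setTextUnitName(group.get(i), baseName);
            appendPluralFormToName(group.get(i), cldrForms[i]);
        }
    }
}
```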
Remove the msgID/msgIDPlural fields and the reflection helpers
(loadMsgIDFromParent, loadMsgIDPluralFromParent, makeAccessibleAndGetString)
that loaded them from the parent Okapi POFilter on every event.

For non-plural text units, derive the base name directly from the text
unit's own source (already unescaped by Okapi).

For plural groups, capture singularSource from the first text unit and
pluralSource from the second text unit within processNextPluralGroup.
Pass both to PoPluralsHolder via constructor parameters so that
adaptTextUnitToCLDRForm uses locally-captured values instead of mutable
outer-class state.

Note: when nPlurals=1 (e.g. Japanese) there is no second text unit, so
pluralSource falls back to singularSource. This is addressed in the
next commit.
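
A minimal sketch of the local capture described in the commit message above, including the nPlurals=1 fallback. Text units are reduced to their source strings here, and both helper names are hypothetical.

```java
import java.util.List;

final class SourceCaptureDemo {

    // Returns {singularSource, pluralSource} for one buffered plural group.
    // With a single unit (nPlurals=1) the plural source falls back to the
    // singular one, which is the gap the follow-up commit addresses.
    static String[] captureSources(List<String> unitSources) {
        String singular = unitSources.get(0);
        String plural = unitSources.size() > 1 ? unitSources.get(1) : singular;
        return new String[] { singular, plural };
    }

    // Only the CLDR "one" form keeps the singular source; every other form
    // gets the plural source.
    static String sourceForForm(String cldrForm, String singular, String plural) {
        return "one".equals(cldrForm) ? singular : plural;
    }
}
```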
When a locale has only one plural form (nPlurals=1, e.g. Japanese),
Okapi emits a single TEXT_UNIT with the singular source (msgid). But
CLDR maps this to the "other" form, which should carry the plural source
(msgid_plural). Since there is no second text unit to read it from,
extract msgid_plural directly from the START_GROUP skeleton and unescape
it via the parent filter's private unescape method (accessed through
reflection because it has no public alternative).

Also updates class-level and method-level Javadoc across the file to
document the overridden behavior, the PO/CLDR plural model mismatch,
and the CLDR category mapping rationale in adaptTextUnitToCLDRForm.
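
The skeleton fallback can be sketched as below, under several assumptions: `ParentFilter` is a toy stand-in for the parent filter, its `unescape` handles only backslash-prefixed quotes and backslashes, the regular expression only supports a single-line msgid_plural entry, and `pluralSourceFromSkeleton` is a hypothetical name. The part it is meant to show is the reflective call to a private method that has no public alternative.

```java
import java.lang.reflect.Method;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SkeletonPluralDemo {

    // Toy stand-in for the parent filter (assumption: not Okapi's POFilter).
    static class ParentFilter {
        // Single pass: drops the backslash in front of any escaped character.
        private String unescape(String raw) {
            return raw.replaceAll("\\\\(.)", "$1");
        }
    }

    // Matches a single-line entry such as: msgid_plural "%d files"
    static final Pattern MSGID_PLURAL =
            Pattern.compile("(?m)^msgid_plural[ \\t]+\"(.*)\"[ \\t]*$");

    // Extracts the raw msgid_plural content from the group skeleton and
    // unescapes it through the parent's private method via reflection.
    static String pluralSourceFromSkeleton(String skeleton, ParentFilter parent) {
        Matcher matcher = MSGID_PLURAL.matcher(skeleton);
        if (!matcher.find()) {
            return null;
        }
        try {
            Method unescape = ParentFilter.class.getDeclaredMethod("unescape", String.class);
            unescape.setAccessible(true);
            return (String) unescape.invoke(parent, matcher.group(1));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot access parent unescape method", e);
        }
    }
}
```

Unlike reading a stateful field, this only depends on the skeleton text passed in and on the private method's signature, so the result does not change with the moment at which it is called.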
wadimw added 2 commits March 25, 2026 13:00
Generation of localized PO files is now handled by Okapi's builtin default POEncoder.
@wadimw wadimw merged commit 7793add into upstream-patched Apr 21, 2026
4 checks passed
@wadimw wadimw deleted the fix-po-escaping branch April 21, 2026 12:40

2 participants