Fix PO file escaping inconsistencies by wadimw · Pull Request #1059 · box/mojito

wadimw · 2026-03-17T10:18:17Z

This PR refactors the POFilter override to fix handling of escape sequences when parsing and serializing PO(T) files.

Issue description

We've run into the following problems related to PO escape sequence handling:

1. Double unescaping on source

When parsing POT files, strings would be doubly unescaped, e.g. a source string

Name cannot contain "/", "\", or characters outside the basic multilingual plane.

encoded in a POT file as

msgid "Name cannot contain \"/\", \"\\\", or characters outside the basic multilingual plane."

would display in Mojito UI as

Name cannot contain "/", "", or characters outside the basic multilingual plane.
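The effect can be reproduced in isolation. The sketch below is a minimal illustration, not Okapi's or Mojito's code: `unescape` is a hypothetical helper that handles only a few PO escapes, and it is applied once and then twice to the raw msgid payload from the example above.

```java
final class DoubleUnescapeDemo {

    // Hypothetical minimal unescaper for illustration; not Okapi's implementation.
    // Handles \\, \", \n and \t; any other backslash sequence is kept literally.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                switch (next) {
                    case '\\': out.append('\\'); i++; break;
                    case '"':  out.append('"');  i++; break;
                    case 'n':  out.append('\n'); i++; break;
                    case 't':  out.append('\t'); i++; break;
                    default:   out.append(c); // unknown escape: keep the backslash
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw payload as stored in the POT file: \"/\", \"\\\"
        String raw = "\\\"/\\\", \\\"\\\\\\\"";
        String once = unescape(raw);   // "/", "\"
        String twice = unescape(once); // "/", ""
        System.out.println(once);
        System.out.println(twice);
    }
}
```

The second pass treats the already-literal backslash as the start of a new escape sequence, which is why it disappears from the displayed source.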

2. Lack of unescaping on name

The same string would show up completely unescaped in the text unit name

Name cannot contain \"/\", \"\\\", or characters outside the basic multilingual plane.

3. Lack of escaping on target

Finally, after running mojito-cli pull, a valid translation

Nazwa nie może zawierać „/”, „\”, ani znaków spoza podstawowej klawiatury wielojęzycznej.

would not be escaped in the generated localized PO file:

msgid "Name cannot contain \"/\", \"\\\", or characters outside the basic multilingual plane."
msgstr "Nazwa nie może zawierać „/”, „\”, ani znaków spoza podstawowej klawiatury wielojęzycznej."

resulting in a "dangling" backslash.
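To make the difference concrete, here is a small sketch under stated assumptions: `escapeQuotesOnly` and `escapePo` are hypothetical helpers written for this illustration (neither is SimpleEncoder nor Okapi's POEncoder). The first stands in for an encoder that ignores backslashes, the second escapes backslashes before anything else.

```java
final class TargetEscapeDemo {

    // Hypothetical quote-only escaper, standing in for an incomplete encoder.
    static String escapeQuotesOnly(String s) {
        return s.replace("\"", "\\\"");
    }

    // Hypothetical fuller escaper: backslashes first, then quotes and common control characters.
    // Escaping backslashes first avoids re-escaping the backslashes added by later steps.
    static String escapePo(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\t", "\\t");
    }

    public static void main(String[] args) {
        String target = "use \\ or \""; // the text: use \ or "
        System.out.println("msgstr \"" + escapeQuotesOnly(target) + "\""); // backslash left dangling
        System.out.println("msgstr \"" + escapePo(target) + "\"");         // backslash doubled
    }
}
```

With the quote-only variant, a PO parser reading the file back would interpret the lone backslash as the start of an escape sequence, so the translation would not round-trip.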

Root causes

Mojito's POFilter subclass layered some custom mechanisms on top of Okapi's base PO filter that introduced escaping issues:

Parsing (content - non-plurals): unescpae ran a second unescaping pass on the Text Unit content after Okapi had already unescaped it. Introduced in PO file escaping/unescaping #436, likely by accident while fixing insufficient escaping on plurals (described below).
Parsing (content - plurals): The double unescaping technically ran here too, but adaptTextUnitToCLDRForm subsequently overwrote the content with replaceEscapedQuotes(msgID) / replaceEscapedQuotes(msgIDPlural), where msgID and msgIDPlural are private fields of the parent parser that statefully hold pieces of raw file data during parsing. As a result, plurals did not suffer from double unescaping, but their unescaping was incomplete (it missed \\, \n, etc.) and inconsistent with non-plurals. Reflective access to the raw msgID and msgIDPlural fields was introduced as early as Pmaster #210 (i.e. the beginning of plurals support in PO files), while the replaceEscapedQuotes call on them was added in Unescape plural form in PO files #500.
Naming: Text unit names generated by Okapi PO filter were overwritten with the source string, but its value was again sourced from raw msgID - without any unescaping this time around. Also introduced in Pmaster #210.
Serialization (target): A custom POEncoder extends SimpleEncoder was introduced in PO file escaping/unescaping #436 (not sure why, since the only method on that class, toNative, is a literal copy of Okapi's POEncoder#toNative so I can't see any obvious reason for using that instead of Okapi's native POEncoder). Anyway, SimpleEncoder implements its own escaping, which again is incomplete compared to POEncoder, as it does not escape backslash characters.
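For readers unfamiliar with the reflection pattern mentioned above, the following sketch shows why reading a parent's private, stateful field is fragile. `ParentParser`, `readEntry` and `readPrivateString` are hypothetical names invented for this example; only the `java.lang.reflect` calls are real API.

```java
import java.lang.reflect.Field;

final class ReflectionStateDemo {

    // Hypothetical stand-in for a parent parser that keeps raw entry data in a private field.
    static class ParentParser {
        private String msgID = "";

        void readEntry(String rawMsgId) {
            this.msgID = rawMsgId; // overwritten on every parsed entry
        }
    }

    // Reads a private String field declared directly on the target's class.
    static String readPrivateString(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return (String) field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Field not accessible: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        ParentParser parser = new ParentParser();
        parser.readEntry("Say \\\"hi\\\"");
        System.out.println(readPrivateString(parser, "msgID")); // still escaped, as in the file
        parser.readEntry("next entry");
        System.out.println(readPrivateString(parser, "msgID")); // value depends on when it is read
    }
}
```

The value obtained this way is raw file data, so the caller has to unescape it itself, and it is only valid while the parent is positioned on the matching entry.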

Solution

TLDR:

Removed Mojito's custom unescaping (UnescapeUtils.replaceEscapedQuotes) which caused double-unescaping bugs, and the getEncoderManager override whose SimpleEncoder-based POEncoder missed backslash/tab/formfeed re-escaping. All escape handling now flows through Okapi's builtin methods.
Removed stateful msgID/msgIDPlural/poPluralForm fields and the reflection that loaded them on every event. Text unit names and sources are now derived from the already-unescaped content on Okapi's emitted text units. Plural group processing collects all events locally instead of tracking form index across method calls.
Explicitly annotated the nPlurals=1 edge case (e.g. Japanese) where Okapi emits only one text unit with singular source. Implemented a workaround where the plural source is parsed from the START_GROUP skeleton's msgid_plural line instead of being loaded from parent class' stateful, private field. Didn't manage to fully get rid of reflection - parsing from skeleton utilizes parent's private unescape method. The new approach is arguably less fragile though, since it does not depend on parent class state at least.
Added multiple comments explaining overridden class functionalities.

Detailed description

40a951e Remove custom EncoderManager/POEncoder override
Fixes issue 3 (target not escaped). Drops getEncoderManager() override so Okapi's builtin PO encoder handles all output serialization, correctly re-escaping \, ", \t, \n, \f, etc.
2acb0fa Remove custom unescaping
Fixes issue 1 (source double-unescaped for non-plurals). Removes the redundant unescpae pass and any references to UnescapeUtils. This temporarily removes any unescaping in plurals (before it was incomplete, now there's none), but it's addressed in a different way in step 5.
5d025d9 Simplify event loop structure
Flattens and simplifies the logic for intercepting events and buffering plural groups. Gets rid of unnecessary recursion. Refactor of the buffering logic is necessary to simplify name handling and plural index counting (step 4), and to allow capturing plural group sources from the text units accordingly (step 5).
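The buffering idea can be sketched roughly as follows. This is a simplified model under assumptions: `Event`, `Type` and `drainPluralGroup` are hypothetical stand-ins, not Okapi's event classes or the actual processNextPluralGroup implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PluralGroupBufferDemo {

    enum Type { START_GROUP, TEXT_UNIT, END_GROUP, OTHER }

    // Hypothetical, simplified event; Okapi's real Event type carries far more.
    static final class Event {
        final Type type;
        final String payload;

        Event(Type type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // Called right after a START_GROUP was read: collects text unit payloads
    // until END_GROUP, keeping all state in local variables.
    static List<String> drainPluralGroup(Iterator<Event> events) {
        List<String> units = new ArrayList<>();
        while (events.hasNext()) {
            Event event = events.next();
            if (event.type == Type.END_GROUP) {
                break;
            }
            if (event.type == Type.TEXT_UNIT) {
                units.add(event.payload);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        Iterator<Event> events = Arrays.asList(
                new Event(Type.TEXT_UNIT, "%d file"),
                new Event(Type.TEXT_UNIT, "%d files"),
                new Event(Type.END_GROUP, null),
                new Event(Type.TEXT_UNIT, "unrelated")).iterator();
        List<String> group = drainPluralGroup(events);
        System.out.println(group);                 // the two buffered plural units
        System.out.println(events.next().payload); // iteration resumes after END_GROUP
    }
}
```

Because the whole group is collected in one loop, the form index and the captured sources can live in local variables rather than in fields that survive across calls.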
813dea7 Simplify text unit name handling
Splits renameTextUnitWithSourceAndContent into two separate methods: setTextUnitName changes the unit name to the source value, while appendPluralFormToName adds the plural form suffix. With that, it's much easier to follow the logic: setTextUnitName executes for every text unit, while appendPluralFormToName only executes in a sub-loop while processing a plural group. Additionally, this replaces a poPluralForm counter field with an index variable local to the plural group processing sub-loop.
33ada2a Remove reflection; capture sources locally
Fixes issue 2 (name not unescaped) and follows up on step 2 (lack of unescaping in plurals). Removes msgID/msgIDPlural fields and all related reflection helpers. Names are now always derived from textUnit.getSource().toString(), which is already unescaped by Okapi and needs no further processing. For plural groups, singular/plural sources are captured locally within processNextPluralGroup during the initial plural group buffering phase. Values are passed to PoPluralsHolder via constructor, so that it can use them when running its logic: adjusting the names (every plural group unit has name derived from the singular source), sources (every plural unit whose CLDR category is not one has its source set to the plural source), and injecting/dropping units according to the CLDR plural rules.
c25185d Handle parsing of nPlurals=1 edge case via skeleton extraction
Consequence of step 5. With msgIDPlural gone, parent POFilter emits plural events according to the rules set in file header. When processing a file with nPlurals=1 (e.g. importing translations from a translated Japanese PO file), the whole plural group consists of only one text unit - so there is nowhere to read the plural source from (actually, this Japanese case would technically be correct with either source - but Okapi implementation produces a unit with the singular source and we need to work around it). As described in step 5, we need to obtain a unit which has singular source in the name, and plural source in the source. Since there is no 2nd unit, the plural source is re-parsed in a fallback way from the START_GROUP event skeleton (i.e. an excerpt from the raw file content that contains the whole plural group). This sucks, since we just went through all the effort of NOT re-implementing the base class features in our subclass, but I can't see a better way to do it. To at least make sure that we don't mess up the unescaping again, reflective access to the parent's private method unescape is used. In my opinion, this approach is slightly less fragile than the previous one, because the msgIDPlural field is a stateful field of the parent parser - when using it, on top of the postprocessing (unescaping) we also need to worry about timing the parent class correctly - plus this fallback is only used in the edge case of parsing a target file with nPlurals=1, so the impact of any potential bugs is limited.
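A rough sketch of the skeleton fallback is shown below. It assumes the skeleton text contains the plural entry lines verbatim, which may not match Okapi's actual skeleton representation; `extractRawMsgidPlural` is a hypothetical helper, and the result is still escaped, so it would need to go through an unescape step afterwards.

```java
import java.util.Optional;

final class SkeletonPluralDemo {

    // Returns the still-escaped msgid_plural payload from a raw plural entry, if present.
    // Assumes the skeleton text contains the entry lines verbatim.
    static Optional<String> extractRawMsgidPlural(String skeleton) {
        StringBuilder raw = new StringBuilder();
        boolean inPlural = false;
        for (String line : skeleton.split("\\R")) {
            String text = line.trim();
            if (text.startsWith("msgid_plural")) {
                inPlural = true;
                text = text.substring("msgid_plural".length()).trim();
            } else if (inPlural && !text.startsWith("\"")) {
                break; // the next keyword (e.g. msgstr[0]) ends the msgid_plural value
            }
            if (inPlural && text.length() >= 2 && text.startsWith("\"") && text.endsWith("\"")) {
                raw.append(text, 1, text.length() - 1); // strip the surrounding quotes
            }
        }
        return inPlural ? Optional.of(raw.toString()) : Optional.empty();
    }

    public static void main(String[] args) {
        String skeleton = "msgid \"%d file\"\n"
                + "msgid_plural \"\"\n"
                + "\"%d \\\"files\\\"\"\n"
                + "msgstr[0] \"\"\n";
        // Prints the raw, still-escaped value assembled from the continuation lines.
        System.out.println(extractRawMsgidPlural(skeleton).orElse("<none>"));
    }
}
```

Multi-line values are concatenated from their quoted continuation lines, which is the part a naive single-line regex would miss.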
aeca119 Remove unused POEncoder
Followup to step 1. POEncoder.java is dead code after getEncoderManager() override removal
6d03130 Filter out events with no CLDR category mapping
Fixes an unexpected logical bug introduced in step 4 by skipping appendPluralFormToName for text units whose PO form index has no CLDR mapping (e.g. form 1 for Japanese). This is a weird one.
When generating a localized Japanese PO file, the parent POFilter still emits two text units (because on the input it has a source POT file with nPlurals=2). These two units get buffered in our subclass and processed by PoPluralsHolder. Japanese PO plural rules (nPlurals=1) state that there should be only one PO plural form 0 which maps to the CLDR other category - so poFormToCldrForm(1) returns null.
In the old code, unit renaming code would actually always append the mapping suffix, so these units would end up with literal _null in the name - but PluralsHolder#getCldrPluralFormOfEvent would backconvert it to null and the unit would be silently dropped. This also dates back to Pmaster #210.
Changes from step 4 skip _null category suffix in advance (since it doesn't really make sense) - but this means that these units wouldn't be dropped anymore. Rather than relying on the hidden, hacky _null filtering deep within PluralsHolder, we now explicitly filter out events with no CLDR category mapping in advance. Note: we should probably get rid of this hidden logic in PluralsHolder altogether, but it might impact handling of plurals in other file types, so I'll leave it for now.
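The explicit filtering can be illustrated with a short sketch. Everything here is hypothetical: the list-based `poFormToCldrForm` lookup, the `namePluralUnits` helper and the " _" suffix format are assumptions for illustration and do not mirror PoPluralsHolder's real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CldrMappingFilterDemo {

    // Hypothetical lookup: CLDR category per PO form index, or null when the locale has no such form.
    static String poFormToCldrForm(List<String> cldrFormsByPoIndex, int poIndex) {
        return poIndex < cldrFormsByPoIndex.size() ? cldrFormsByPoIndex.get(poIndex) : null;
    }

    // Builds plural unit names, skipping emitted units without a CLDR mapping.
    static List<String> namePluralUnits(String baseName, int emittedUnits, List<String> cldrFormsByPoIndex) {
        List<String> names = new ArrayList<>();
        for (int poIndex = 0; poIndex < emittedUnits; poIndex++) {
            String cldrForm = poFormToCldrForm(cldrFormsByPoIndex, poIndex);
            if (cldrForm == null) {
                continue; // explicit drop instead of producing a "_null" suffix
            }
            names.add(baseName + " _" + cldrForm); // suffix format is an assumption
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> japanese = Arrays.asList("other");       // nPlurals=1
        List<String> english = Arrays.asList("one", "other"); // nPlurals=2
        // The parent still emits two units because the source POT has nPlurals=2.
        System.out.println(namePluralUnits("%d file", 2, japanese));
        System.out.println(namePluralUnits("%d file", 2, english));
    }
}
```

Dropping the unmapped unit at the point where the mapping is looked up keeps the decision visible, instead of relying on a "_null" name being recognized later.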

Also updated tests to verify the fixed behaviours:

• 0b14c77 — Updates existing tests to expect unescaped names
• 232bf01 — Ports escape/unescape tests from Okapi's POFilterTest to ensure override parity
• 426f4b9 — Adds the exact reproduction case from the issue (double-unescape on parse + missing re-escape on write)
• 84c022b — Removes tests for deleted reflection methods
• 88cee79 — Updates \' test: not a standard PO escape, so Okapi correctly preserves the backslash as literal

wadimw · 2026-03-24T22:42:06Z

@ehoogerbeets I've split up the large POFilter refactor commit into smaller logical chunks to make it easier to understand:

• dropped the POEncoder
• dropped the custom escaping/unescaping
• simplified the event loop buffering
• simplified text unit rename logic
• removed access to stateful private parent class fields via reflection
• added re-parsing of plural source from group start event skeleton

2acb0fa Remove custom unescaping
Drop UnescapeUtils.replaceEscapedQuotes which was applied on top of Okapi's already-unescaped content, causing double-unescaping bugs. Okapi's builtin PO parser already handles quote unescaping correctly, so the extra pass was both redundant and harmful.

5d025d9 Simplify event loop structure
Flatten the indirect dispatch chain (next -> readNextEvents -> getNextWithProcess -> processTextUnit) into a single next() that calls super.next() directly and handles each event type inline.
Replace the recursive readPlurals() with processNextPluralGroup() which drains text units in a simple while-until-END_GROUP loop instead of tracking state across recursive calls.
Remove intermediate methods: readNextEvents, getNextWithProcess, processTextUnit, isPluralGroupEnding, readPlurals, adaptPlurals, setUsagesOnTextUnits.
Reflection-based field loading (msgID, msgIDPlural, poPluralForm) and renameTextUnitWithSourceAndContent are kept unchanged.

813dea7 Simplify text unit name handling
Split renameTextUnitWithSourceAndContent into two focused methods:
- setTextUnitName(textUnit, baseName): sets the base name + msgctxt
- appendPluralFormToName(textUnit, cldrForm): appends the plural suffix
In processNextPluralGroup, use a local loop index to derive the CLDR form instead of the stateful poPluralForm field. This removes the poPluralForm field entirely and decouples plural suffix logic from the base naming method.

33ada2a Remove reflection; capture sources locally
Remove the msgID/msgIDPlural fields and the reflection helpers (loadMsgIDFromParent, loadMsgIDPluralFromParent, makeAccessibleAndGetString) that loaded them from the parent Okapi POFilter on every event.
For non-plural text units, derive the base name directly from the text unit's own source (already unescaped by Okapi).
For plural groups, capture singularSource from the first text unit and pluralSource from the second text unit within processNextPluralGroup. Pass both to PoPluralsHolder via constructor parameters so that adaptTextUnitToCLDRForm uses locally-captured values instead of mutable outer-class state.
Note: when nPlurals=1 (e.g. Japanese) there is no second text unit, so pluralSource falls back to singularSource. This is addressed in the next commit.

c25185d Handle parsing of nPlurals=1 edge case via skeleton extraction
When a locale has only one plural form (nPlurals=1, e.g. Japanese), Okapi emits a single TEXT_UNIT with the singular source (msgid). But CLDR maps this to the "other" form, which should carry the plural source (msgid_plural). Since there is no second text unit to read it from, extract msgid_plural directly from the START_GROUP skeleton and unescape it via the parent filter's private unescape method (accessed through reflection because it has no public alternative).
Also updates class-level and method-level Javadoc across the file to document the overridden behavior, the PO/CLDR plural model mismatch, and the CLDR category mapping rationale in adaptTextUnitToCLDRForm.

wadimw force-pushed the fix-po-escaping branch 7 times, most recently from 67c289f to 66dba64 Compare March 17, 2026 17:25

wadimw requested a review from ehoogerbeets March 18, 2026 17:37

wadimw changed the title from "Fix po escaping" to "Fix PO file escaping inconsistencies" Mar 18, 2026

wadimw added the upstream-patched label (Experimental features ported from legacy branch) Mar 18, 2026

wadimw marked this pull request as ready for review March 18, 2026 18:02

wadimw mentioned this pull request Mar 18, 2026

fix(po): escape backslash characters in SimpleEncoder and fix unescape roundtrip #1053

Closed

wadimw requested review from jaknas and soygitana March 19, 2026 12:15

wadimw force-pushed the fix-po-escaping branch 2 times, most recently from 8bd5bf2 to 1acc15c Compare March 24, 2026 22:34

wadimw force-pushed the fix-po-escaping branch from 1acc15c to 8b9bd5b Compare March 24, 2026 22:45

Base automatically changed from refactor-filetype-tests to upstream-patched March 25, 2026 11:55

wadimw added 11 commits March 25, 2026 13:00

Update PO escaping tests to expect unescaped names

0b14c77

Text unit names generated from msgid should be fully unescaped, matching the source.

Add PO escaping tests ported from Okapi

232bf01

Cover escape/unescape test cases handled by Okapi PO implementation to ensure that Mojito's overridden class behaves consistently.

Add tests for unescaping roundtrip asymmetry in PO

426f4b9

Source strings from PO files would be doubly unescaped in the extraction, while target strings would not be escaped at all when generating localized assets.

Remove reflection tests

84c022b

Update test for single quote escape to match Okapi behavior

88cee79

Remove custom EncoderManager/POEncoder override

40a951e

Drop the getEncoderManager() override that mapped PO output to Mojito's custom POEncoder. The SimpleEncoder-based POEncoder missed re-escaping of backslashes, tabs and formfeeds. All encoding now flows through Okapi's builtin PO encoder.

wadimw added 2 commits March 25, 2026 13:00

Remove unused POEncoder

aeca119

Generation of localized PO files is now handled by Okapi's builtin default POEncoder.

Filter out events with no CLDR category mapping

6d03130

wadimw force-pushed the fix-po-escaping branch from 8b9bd5b to 6d03130 Compare March 25, 2026 12:01

ehoogerbeets approved these changes Apr 17, 2026

wadimw merged commit 7793add into upstream-patched Apr 21, 2026
4 checks passed

wadimw deleted the fix-po-escaping branch April 21, 2026 12:40

wadimw merged 13 commits into upstream-patched from fix-po-escaping
