feat: serialize subscript and superscript in Markdown export #661
Open
LucasArray wants to merge 1 commit into
Signed-off-by: Lucas Araujo <29403436+LucasArray@users.noreply.github.com>
✅ DCO Check Passed: Thanks @LucasArray, all your commits are properly signed off. 🎉
Merge Protections: 🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewers for test updates: this rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Summary
The Markdown serializer drops subscript and superscript formatting. A text span marked `Formatting(script=Script.SUB)` or `Script.SUPER` is written out as plain text, so a subscript like the 2 in H2O or an exponent like E=mc2 becomes an ordinary digit with no way to recover it. The HTML serializer already preserves these as `<sub>`/`<sup>`; the Markdown serializer never implemented the hooks.

This adds them. #319 introduced the sub/superscript model and the HTML support but intentionally left Markdown out, since the Pandoc `~x~`/`^x^` syntax is uncommon and does not render in editors like VS Code. Emitting inline `<sub>`/`<sup>` avoids that: it renders on GitHub and in any CommonMark viewer, and it matches what the HTML serializer already produces, so Markdown and HTML stay in sync.

Addresses docling-project/docling#520.
Changes
- `docling_core/transforms/serializer/markdown.py`: implement `serialize_subscript` and `serialize_superscript`, returning `<sub>{text}</sub>` and `<sup>{text}</sup>`. They run in `post_process` after the content is HTML-escaped, like the existing bold/italic/strikethrough hooks, so only the wrapping tags are literal.
- `docling_core/transforms/serializer/plain_text.py`: `PlainTextDocSerializer` subclasses the Markdown serializer, so it would otherwise inherit the new tags. Override both hooks to return the text unchanged, consistent with how it already strips bold/italic/strikethrough.
- `test/test_serialization.py`: add `test_md_subscript_formatting` and `test_md_superscript_formatting`.
- `test/data/doc/`: the shared test fixture already contains sub/superscript spans, so their Markdown ground truth now renders `<sub>`/`<sup>` (one changed line per file). Note: this updates reference test data, which requires a double review per CONTRIBUTING.
Example
Before: `H2O`
After: `<sub>H2O</sub>`
I also checked this directly by exporting the same document with and without the change and rendering both outputs. The subscript and superscript text now displays as actual sub/superscript instead of flat text.
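For reviewers who want the hook behavior at a glance, here is a minimal sketch of the escape-then-wrap order described under Changes. It does not use the real docling-core classes or signatures: `SketchMarkdownSerializer`, `SketchPlainTextSerializer`, and the string-valued `script` argument are hypothetical stand-ins, and only the hook names `serialize_subscript`, `serialize_superscript`, and `post_process` come from this PR.

```python
import html
from typing import Optional


class SketchMarkdownSerializer:
    """Hypothetical stand-in for the Markdown serializer (not the docling-core class)."""

    def serialize_subscript(self, text: str) -> str:
        # `text` is assumed to be escaped already, so only the tags are literal HTML.
        return f"<sub>{text}</sub>"

    def serialize_superscript(self, text: str) -> str:
        return f"<sup>{text}</sup>"

    def post_process(self, text: str, script: Optional[str] = None) -> str:
        # Assumption: "sub"/"super" strings replace the real Script enum here.
        escaped = html.escape(text, quote=False)  # escape content first
        if script == "sub":
            return self.serialize_subscript(escaped)
        if script == "super":
            return self.serialize_superscript(escaped)
        return escaped


class SketchPlainTextSerializer(SketchMarkdownSerializer):
    """Hypothetical plain-text variant: overrides both hooks so no tags leak through."""

    def serialize_subscript(self, text: str) -> str:
        return text

    def serialize_superscript(self, text: str) -> str:
        return text


md = SketchMarkdownSerializer()
plain = SketchPlainTextSerializer()

water_md = "H" + md.post_process("2", "sub") + "O"
water_plain = "H" + plain.post_process("2", "sub") + "O"
escaped_super = md.post_process("a < b", "super")

print(water_md)       # H<sub>2</sub>O
print(water_plain)    # H2O
print(escaped_super)  # <sup>a &lt; b</sup>
```

The last line shows why the order matters in this sketch: the `<` inside the span is escaped, while the wrapping `<sup>` tag stays literal.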
Testing
All run locally and passing:
- `uv run pytest`: 544 passed, 6 skipped
- `uv run ruff check` and `uv run ruff format --check`: clean
- `uv run mypy docling_core test`: clean
- `<sub>`/`<sup>` in the Markdown output; the plain-text serializer test confirms the tags are stripped there