
feat: serialize subscript and superscript in Markdown export #661

Open
LucasArray wants to merge 1 commit into docling-project:main from LucasArray:feat/markdown-sub-superscript

Conversation

@LucasArray


Summary

The Markdown serializer drops subscript and superscript formatting. A text span marked Formatting(script=Script.SUB) or Formatting(script=Script.SUPER) is written out as plain text, so the subscript 2 in H2O or the exponent 2 in E=mc2 becomes an ordinary digit, and the formatting cannot be recovered from the output. The HTML serializer already preserves these spans as <sub>/<sup>; the Markdown serializer never implemented the corresponding hooks.

This PR adds them. #319 introduced the sub/superscript model and the HTML support but intentionally left Markdown out, since the Pandoc ~x~/^x^ syntax is uncommon and does not render in editors like VS Code. Emitting inline <sub>/<sup> avoids that problem: the tags render on GitHub and in CommonMark viewers that allow raw inline HTML, and they match what the HTML serializer already produces, so Markdown and HTML stay in sync.

Addresses docling-project/docling#520.

Changes

  • docling_core/transforms/serializer/markdown.py: implement serialize_subscript and serialize_superscript, returning <sub>{text}</sub> and <sup>{text}</sup>. They run in post_process after the content is HTML-escaped, like the existing bold/italic/strikethrough hooks, so only the wrapping tags are literal.
  • docling_core/transforms/serializer/plain_text.py: PlainTextDocSerializer subclasses the Markdown serializer, so it would otherwise inherit the new tags. Override both hooks to return the text unchanged, consistent with how it already strips bold/italic/strikethrough.
  • test/test_serialization.py: add test_md_subscript_formatting and test_md_superscript_formatting.
  • Regenerated the Markdown reference data under test/data/doc/. The shared test fixture already contains sub/superscript spans, so their Markdown ground truth now renders <sub>/<sup> (one changed line per file). Note: this updates reference test data, which requires a double review per CONTRIBUTING.
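A minimal, standalone sketch of the hook behaviour described in the list above. The class names and the simplified post_process signature are hypothetical and only illustrate the escape-then-wrap order and the plain-text override; this is not the actual docling-core code.

```python
import html
from typing import Optional


class SketchMarkdownSerializer:
    """Hypothetical stand-in for the Markdown serializer (not the real class)."""

    def serialize_subscript(self, text: str) -> str:
        # Wrap already-escaped text in a literal <sub> tag.
        return f"<sub>{text}</sub>"

    def serialize_superscript(self, text: str) -> str:
        # Wrap already-escaped text in a literal <sup> tag.
        return f"<sup>{text}</sup>"

    def post_process(self, text: str, script: Optional[str] = None) -> str:
        # Escape the content first, so only the wrapping tags stay literal.
        res = html.escape(text, quote=False)
        if script == "sub":
            res = self.serialize_subscript(res)
        elif script == "super":
            res = self.serialize_superscript(res)
        return res


class SketchPlainTextSerializer(SketchMarkdownSerializer):
    """Hypothetical subclass: overrides both hooks so no tags are inherited."""

    def serialize_subscript(self, text: str) -> str:
        return text

    def serialize_superscript(self, text: str) -> str:
        return text


md = SketchMarkdownSerializer()
plain = SketchPlainTextSerializer()
print(md.post_process("2", script="sub"))       # <sub>2</sub>
print(md.post_process("a<b", script="super"))   # <sup>a&lt;b</sup>
print(plain.post_process("2", script="sub"))    # 2
```

Because the hooks run on text that is already escaped, a literal "<" in the content cannot be confused with the emitted tags, and the subclass only needs to override the two hooks rather than the whole post-processing step.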

Example

doc.add_text(label=DocItemLabel.TEXT, text="H2O", formatting=Formatting(script=Script.SUB))
doc.export_to_markdown()

Before: H2O
After: <sub>H2O</sub>

I also checked this directly by exporting the same document with and without the change and rendering both outputs. The subscript and superscript text now displays as actual sub/superscript instead of flat text.

Testing

All of the following were run locally and pass:

  • uv run pytest: 544 passed, 6 skipped
  • uv run ruff check and uv run ruff format --check: clean
  • uv run mypy docling_core test: clean
  • New unit tests assert <sub>/<sup> in the Markdown output; the plain-text serializer test confirms the tags are stripped there

Signed-off-by: Lucas Araujo <29403436+LucasArray@users.noreply.github.com>
@github-actions

Contributor

DCO Check Passed

Thanks @LucasArray, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 26, 2026

Contributor

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

Protection Waiting on
🔴 Require two reviewer for test updates 👀 reviews
🟢 Enforce conventional commit

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2


🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
