feat(markdown): add footnote serialization support #569

Open
ShrillHarrier wants to merge 32 commits into docling-project:main from ShrillHarrier:dev/md-footnote-serializer

@ShrillHarrier

@ShrillHarrier ShrillHarrier commented Mar 26, 2026

This PR implements the feature request "Improved Footnote Serialization in MarkdownDocSerializer", submitted by simonschoe in docling-project.

It serializes footnotes in the form [^{Identifier}]: {Description}. This is done for Table and Picture items, since footnotes are linked to those.

In general, footnotes in .md files should look like:

[^5]: https://github.com/tesseract-ocr/tesseract
[^6]: https://github.com/VikParuchuri/surya
[^7]: https://github.com/lukas-blecher/LaTeX-OCR
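
As a minimal sketch of the formatting described above, assuming the footnote text begins with its identifier followed by a single space (the function name `format_footnote` is hypothetical and not taken from the PR):

```python
def format_footnote(text: str) -> str:
    """Turn '<identifier> <description>' into a Markdown footnote definition.

    Assumption: the identifier is everything before the first space.
    """
    identifier, _, description = text.partition(" ")
    return f"[^{identifier}]: {description}"


print(format_footnote("5 https://github.com/tesseract-ocr/tesseract"))
# prints: [^5]: https://github.com/tesseract-ocr/tesseract
```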

Resolves docling-project/docling#3128

Tests Added:

  • test_table_with_footnotes_markdown()
  • test_picture_with_footnotes_markdown()
  • test_table_export_to_markdown_with_footnotes()

@github-actions

github-actions Bot commented Mar 26, 2026

Contributor

DCO Check Passed

Thanks @ShrillHarrier, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Mar 26, 2026

Contributor

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

Protection Waiting on
🔴 Require two reviewer for test updates 👀 reviews
🟢 Enforce conventional commit

🔴 Require two reviewer for test updates

This rule is failing. When test data is updated, we require two reviewers.

Waiting for

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
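
To see how this rule treats a title, the pattern can be tried locally. This is only a sketch: it assumes the rule behaves like an unanchored regular-expression search, and the helper name `is_conventional` is hypothetical.

```python
import re

# Pattern copied from the "Enforce conventional commit" rule above.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RE.search(title) is not None


print(is_conventional("feat(markdown): add footnote serialization support"))
print(is_conventional("Improved Footnote Serialization in MarkdownDocSerializer #3128"))
```

The first title (the current one) matches; the second (the original title of this PR) does not, because it lacks a type prefix.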

@dosubot

dosubot Bot commented Mar 26, 2026


Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the differences between vlm_pipeline_model_local and picture_description_local in Docling, and how do image descriptions, OCR, and table extraction work together? Also, how do the include_annotations and mark_annotations properties affect exported output?
View Suggested Changes
@@ -67,6 +67,9 @@
 - `compact_tables` (bool): Whether to use compact table format without column padding (default: `False`, Markdown only)
 - `traverse_pictures` (bool): Whether to traverse into picture items and serialize their text children (default: `False`)
 
+**Footnote Serialization in Markdown:**
+When tables or pictures have associated footnotes in the document, these footnotes are automatically serialized in the markdown output using standard markdown footnote syntax: `[^{Identifier}]: {Description}`. The identifier is extracted from the first part of the footnote text, and the remaining text becomes the footnote description. This formatting ensures that footnotes attached to Table and Picture items appear correctly in the exported markdown.
+
 **Handling OCR Text in Scanned/Image-Based PDFs:**
 When processing scanned or image-based PDFs with `force_full_page_ocr=True`, the layout model classifies full-page scans as `PictureItem` nodes. OCR text items are added as children of that picture node in the document tree.
 

@ShrillHarrier ShrillHarrier changed the title Improved Footnote Serialization in MarkdownDocSerializer #3128 feat(markdown-serializer): add footnote serialization support for markdown Mar 26, 2026
@ShrillHarrier ShrillHarrier changed the title feat(markdown-serializer): add footnote serialization support for markdown feat(markdown): add footnote serialization support Mar 26, 2026
@ShrillHarrier ShrillHarrier force-pushed the dev/md-footnote-serializer branch from b849797 to 27e8b28 Compare March 30, 2026 19:41
@ShrillHarrier

Author

Signing off with personal email.

@codecov

codecov Bot commented Apr 11, 2026


Codecov Report

❌ Patch coverage is 90.90909% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_core/transforms/serializer/markdown.py 90.90% 3 Missing ⚠️

@ceberam ceberam left a comment

Member

Thanks @ShrillHarrier for suggesting this PR. Please see my comments.
In general:

  • Adding new tests with a programmatic example is fine, but it is much more illustrative to show the impact of the new feature by serializing against a ground truth data file. This lets us check how the output Markdown file renders in applications like GitHub or VS Code. Please check how this is done in other test modules.
  • To keep the repository consistent, I suggest adding the tests to the module test/test_serialization.py, together with the other Markdown serialization tests, instead of creating a separate module.
  • Some DoclingDocument files (.json) do not serialize as expected. Please check some of them and ensure that the Markdown serialization generates the right footnote hook and text. For instance, test/data/doc/2408.09869v3_enriched.json has a footnote. If you regenerate the ground truth files (by running the tests with the environment variable DOCLING_GEN_TEST_DATA=1), I would expect the Markdown serialization to be updated with the new footnote serialization.
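
The ground-truth workflow mentioned in these points could look roughly like the sketch below. The helper `verify` and the file name are hypothetical and do not reproduce the repository's actual test utilities; only the environment variable name comes from the comment above.

```python
import os
import tempfile
from pathlib import Path

# Environment variable named in the review comment above.
GENERATE = os.environ.get("DOCLING_GEN_TEST_DATA") == "1"


def verify(actual: str, gt_path: Path, generate: bool = GENERATE) -> None:
    """Hypothetical helper: refresh the ground truth file or compare against it."""
    if generate:
        gt_path.write_text(actual, encoding="utf-8")
        return
    expected = gt_path.read_text(encoding="utf-8")
    assert actual == expected, f"serialized output differs from {gt_path}"


# Demonstration with a temporary file standing in for a ground truth .md file.
with tempfile.TemporaryDirectory() as tmp:
    gt_file = Path(tmp) / "footnotes.gt.md"
    markdown = "[^1]: see huggingface.co/ds4sd/docling-models/\n"
    verify(markdown, gt_file, generate=True)   # first run: write the ground truth
    verify(markdown, gt_file, generate=False)  # later runs: compare against it
```

Keeping the expected Markdown in a file, rather than in an inline string, is what makes it possible to open the output in a renderer and inspect it.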

Comment thread docling_core/transforms/serializer/markdown.py Outdated
results = []
for footnote in item.footnotes:
    if isinstance(ftn := footnote.resolve(self.doc), TextItem):
        parts = ftn.text.split(" ", 1)

Member

The footnote parsing logic assumes a specific format (an identifier followed by the footnote text), but this format is not clearly represented or documented. The format should be explicit, validated, and clearly documented. Keep in mind that for a Markdown footnote to work, identifiers can be numbers or words, but they can't contain spaces or tabs.
In addition, I don't think we keep the footnote references with the correct formatting (a caret and an identifier inside brackets, e.g., [^1]).

If you serialize a DoclingDocument from the test dataset, you'll see that the footnote "1 see huggingface.co/ds4sd/docling-models/" (doc item with reference #/texts/29) is not properly serialized according to the Markdown specification.

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.base import ImageRefMode

# Load a DoclingDocument from the test dataset and serialize it to Markdown.
with open("test/data/doc/2408.09869v3_enriched.json", encoding="utf-8") as handler:
    content = handler.read()
doc = DoclingDocument.model_validate_json(content)
print(doc.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER))
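
One way to make the expected format explicit and validated, as requested in this comment, is a small parser. This is a sketch under stated assumptions: the source text has the shape "<identifier> <description>", the names `FOOTNOTE_RE` and `parse_footnote` are hypothetical, and excluding brackets and carets from identifiers is an extra precaution not stated in the thread.

```python
import re
from typing import Optional, Tuple

# Assumed source format: "<identifier> <description>", for example
# "1 see huggingface.co/ds4sd/docling-models/".
# The identifier may not contain whitespace; brackets and carets are also
# rejected here (an assumption of this sketch).
FOOTNOTE_RE = re.compile(r"^(?P<id>[^\s\[\]^]+)[ \t]+(?P<text>\S.*)$")


def parse_footnote(text: str) -> Optional[Tuple[str, str]]:
    """Return (identifier, description), or None if the text does not fit."""
    match = FOOTNOTE_RE.match(text.strip())
    if match is None:
        return None
    return match.group("id"), match.group("text")


print(parse_footnote("1 see huggingface.co/ds4sd/docling-models/"))
print(parse_footnote("no-description"))
```

Returning None for text that does not fit lets the caller fall back to plain-text serialization instead of emitting a malformed footnote.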

Author

Hi, ah I see. Okay, I will refine the serialization and retest on the DoclingDocument files.

Comment thread test/test_markdown_footnotes.py Outdated
@ceberam

ceberam commented Apr 30, 2026

Member

@ShrillHarrier please let me know when the PR is ready for another review.
I would also encourage you to rebase the branch onto the latest main and force-push your commits. We have recently made some changes to the CI/CD checks, and it would be good to have them in your PR.

ShrillHarrier and others added 13 commits April 30, 2026 11:00
I, Matthew Panizza <shrillharrier1@gmail.com>, hereby add my Signed-off-by to this commit: 73a9a40
I, Matthew Panizza <shrillharrier1@gmail.com>, hereby add my Signed-off-by to this commit: 929c11f

Signed-off-by: Matthew Panizza <shrillharrier1@gmail.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
- Add special case handling for DocItemLabel.FOOTNOTE in _serialize_text_item
- Standalone footnotes now serialize as [^id]: text instead of plain text
- Fixes issue where footnotes not referenced by tables/pictures were not formatted correctly

Note: Ground truth files will be updated in a follow-up commit
…thub.com>

I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: dfca98f
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 030c0b4
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 97de619
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a534839
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 690c931
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 4cfa21d

Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
@ShrillHarrier ShrillHarrier force-pushed the dev/md-footnote-serializer branch from 3f71a9a to c73f3a5 Compare April 30, 2026 15:07
Matthew Panizza added 2 commits April 30, 2026 11:10
…thub.com>

I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 4541deb
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 41e67c9
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: e6cc4e3
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 1ef1e75
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 91af6cd
I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 5fb30a4

Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
@ShrillHarrier

ShrillHarrier commented Apr 30, 2026

Author

Hello @ceberam,

I have cleaned up the functionality and code in this PR. Since the review, I have completed the following:

  • Migrated the tests to test/test_serialization.py and made two of them compare against ground truth files
  • For the .json -> .md serialization test (test_md_footnotes_json), added a small ground truth file for page 2 of the original JSON specifications
  • Added validation to ensure identifiers are not blank and contain no space or tab (\t) characters
  • Included documentation comments that explicitly state the rules for parsing and identifying footnotes
  • Removed unused imports
  • Rebased onto the upstream repo and pushed

Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
@ceberam

ceberam commented May 19, 2026

Member

@ShrillHarrier getting back to this after an absence...could you please resolve the conflicts with main?

Matthew Panizza added 2 commits May 19, 2026 11:47
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
@ShrillHarrier

ShrillHarrier commented May 22, 2026

Author

Hello @ceberam, I have merged with main. I have also increased code coverage to ensure codecov/patch passes. Please let me know if additional changes are required. Thanks.

@ShrillHarrier ShrillHarrier requested a review from ceberam June 1, 2026 17:43
@ShrillHarrier

Author

Hello @ceberam, I have re-merged with main to keep the branch up to date with docling-project.

@ceberam ceberam left a comment

Member

Thanks again @ShrillHarrier for all the efforts on this PR.

I left some minor technical comments.

However, my biggest concern is the proper rendering of those footnotes. The footnote itself looks well serialized according to the Markdown specifications. However, this formatting (e.g., [^2]: Example) seems to require an anchor in the text. If there is no anchor, the footnote simply does not get displayed in some applications. Unfortunately, the DoclingDocument examples we have, which were parsed from PDFs, do not keep a proper link between the anchored text (the reference) and the footnote itself. The changes from this PR would actually mask it. For instance, in test/data/doc/2206.01062.yaml.paged.md, the Markdown serialization from this PR generates the text:

To ensure that future benchmarks in the document-layout analysis community can be easily compared, we have split up DocLayNet into pre-defined train-, test- and validation-sets. In this way, we can avoid spurious variations in the evaluation scores due to random splitting in train-, test- and validation-sets. We also ensured that less frequent labels are represented in train and test sets in equal proportions.

[^2]: e.g. AAPL from https://www.annualreports.com/

Table 1 shows the overall frequency and distribution of the labels among the different sets. Importantly, we ensure that subsets are only split on full-document boundaries. This avoids that pages of the same document are spread over train, test and validation set, which can give an undesired evaluation advantage to models and lead to overestimation of their prediction accuracy. We will show the impact of this decision in Section 5.

GitHub renders this section like this:

Image

While before the changes, we could see this (not ideal but at least visible):

Image

How to move forward

My idea would be to apply the new footnote formatting only if we see an anchor, so that we follow the Markdown syntax strictly and can be sure that Markdown editors will understand it.

The question is how the anchored text should be linked with the footnote. I would apply the same pattern as for captions on PictureItem objects.

  • Check if a footnote is contained in a FloatingItem (through its footnotes field)
  • Use that floating item to store the anchor and render it through the Markdown specs: a caret and an identifier inside brackets (e.g., [^2])
  • Render the footnote as you have already done: with a caret and the number inside brackets with a colon and text ([^2]: My footnote.).
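
The three steps above might be sketched as follows. The classes `FootnoteStub` and `FloatingItemStub` are hypothetical stand-ins, not the docling-core types, and placing the anchor directly after the caption is just one possible choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FootnoteStub:
    """Hypothetical stand-in for a footnote text item."""
    identifier: str
    text: str


@dataclass
class FloatingItemStub:
    """Hypothetical stand-in for a table or picture with a caption."""
    caption: str
    footnotes: List[FootnoteStub] = field(default_factory=list)


def render_floating_item(item: FloatingItemStub) -> str:
    """Emit the anchors next to the caption, then one definition per footnote."""
    anchors = "".join(f"[^{fn.identifier}]" for fn in item.footnotes)
    lines = [f"{item.caption}{anchors}"]
    for fn in item.footnotes:
        lines.append("")
        lines.append(f"[^{fn.identifier}]: {fn.text}")
    return "\n".join(lines)


def render_unanchored(fn: FootnoteStub) -> str:
    """No known anchor: keep plain text so the footnote stays visible."""
    return f"{fn.identifier} {fn.text}"


item = FloatingItemStub("Table 1: Label frequencies", [FootnoteStub("2", "My footnote.")])
print(render_floating_item(item))
```

Because the definition is only emitted together with its anchor, renderers that hide unreferenced footnotes would still show it.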

It could help to create a sample .json document that includes a reference (anchor) in a paragraph and the footnote with the right structure from above.

Later on, we could enhance the declarative backends in the docling repository to follow this formatting (e.g., JATS, USPTO, Markdown, ...) in a separate PR. It would also be helpful to have a round-trip example with footnotes, parsing from and exporting to Markdown (.md -> .json -> .md), to fully test this functionality.
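
A round-trip test could include a check that every footnote definition in the exported Markdown has a matching anchor. The sketch below uses simplified regular expressions (for instance, a reference immediately followed by a colon would be missed), and the function name `orphan_footnotes` is hypothetical.

```python
import re
from typing import Set

# Definition: "[^id]:" at the start of a line. Reference: "[^id]" not followed by ":".
DEF_RE = re.compile(r"^\[\^([^\]\s]+)\]:", re.MULTILINE)
REF_RE = re.compile(r"\[\^([^\]\s]+)\](?!:)")


def orphan_footnotes(markdown: str) -> Set[str]:
    """Return identifiers that are defined but never referenced."""
    defined = set(DEF_RE.findall(markdown))
    referenced = set(REF_RE.findall(markdown))
    return defined - referenced


anchored = "Some text.[^2]\n\n[^2]: e.g. AAPL from https://www.annualreports.com/\n"
unanchored = "Some text.\n\n[^2]: e.g. AAPL from https://www.annualreports.com/\n"
print(orphan_footnotes(anchored))
print(orphan_footnotes(unanchored))
```

An empty result for the exported document would indicate that no footnote risks being hidden by the renderer.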

Hope this helps and feel free to add any thoughts here.

Comment thread docling_core/transforms/serializer/markdown.py Outdated
Comment thread docling_core/transforms/serializer/markdown.py Outdated
Comment thread docling_core/transforms/serializer/markdown.py Outdated
Comment thread docling_core/transforms/serializer/markdown.py Outdated
Comment thread docling_core/transforms/serializer/markdown.py Outdated
Comment thread test/test_serialization.py Outdated
Matthew Panizza added 13 commits June 15, 2026 18:59
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
…thub.com>

I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 7e8f664

Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Improved Footnote Serialization in MarkdownDocSerializer

2 participants