fix(serializer): collapse multi-row column headers into one markdown header #602

Open
scottf007 wants to merge 2 commits into docling-project:main from scottf007:fix/markdown-multi-row-headers

@scottf007

Summary

MarkdownTableSerializer hardcoded `headers=rows[0]` and used `rows[1:]` as the body. When TableFormer correctly marks multiple leading rows as column_header=True (e.g. "Cash per Security" + "($)" on successive grid rows), only the first row rendered as the markdown header and the continuation leaked into the body as a spurious data row.

Mirrors _export_to_dataframe_with_options (document.py:2219-2245): count leading grid rows containing any column_header cell, concatenate per column for the header, use the rest as body. Spanning siblings render empty so a colspan=N header isn't concatenated N times.

Falls back to "first row is header" when no column_header cells are marked — preserves prior behaviour for that path.
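
The three rules above (count leading header rows, concatenate per column, render spanning siblings as empty) can be illustrated with a minimal sketch. This is not the patch itself: `Cell`, `is_span_sibling`, `split_header_and_body`, and `to_markdown` are hypothetical names invented for this example, the single-space separator is an assumption, and the body row values are made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in for a table grid cell; not docling's actual class.
    text: str
    column_header: bool = False
    # True for the 2nd..Nth grid slots covered by one colspan=N cell.
    is_span_sibling: bool = False


def split_header_and_body(grid):
    """Collapse leading column_header rows into one header; the rest is body."""

    def header_text(cell):
        # Spanning siblings render empty so a colspan=N header is not repeated.
        return "" if cell.is_span_sibling else cell.text.strip()

    # Count leading rows that contain at least one column_header cell.
    n_header = 0
    for row in grid:
        if any(c.column_header for c in row):
            n_header += 1
        else:
            break

    # Fallback: nothing is marked, so treat the first row as the header.
    if n_header == 0:
        n_header = 1

    header_rows = grid[:n_header]
    n_cols = len(grid[0]) if grid else 0
    # Join the non-empty header fragments of each column (separator assumed).
    header = [
        " ".join(t for t in (header_text(row[col]) for row in header_rows) if t)
        for col in range(n_cols)
    ]
    body = [[c.text.strip() for c in row] for row in grid[n_header:]]
    return header, body


def to_markdown(header, body):
    """Render a header and body as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# A column title that wraps onto two grid rows, both marked as header.
grid = [
    [Cell("Payment Date", True), Cell("Cash per Security", True)],
    [Cell("", True), Cell("($)", True)],
    [Cell("17 Apr 2025"), Cell("0.25")],  # illustrative values only
]
header, body = split_header_and_body(grid)
print(to_markdown(header, body))
```

With the old "first row is header" logic, the `($)` row would have been emitted as the first body row; here it is folded into the second column's header text instead.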

Empirical result

10-PDF corpus, byte-comparison of rendered markdown:

  • IOZ_2025_04_17.pdf: 2-row header collapses into 1. Body unchanged.
  • 9 other fixtures: byte-identical.

Related: docling-project/docling#2985.

…header

MarkdownTableSerializer hardcoded `headers=rows[0]` and `rows[1:]` as the
body. When TableFormer (correctly) marks multiple leading rows as
column_header=True — the case where a column title wraps onto two visual
lines like "Cash per Security" + "($)" — only the first row rendered as
the markdown header and the continuation leaked into the body as a
spurious "data" row.

Mirrors the logic that already exists in
`_export_to_dataframe_with_options` (document.py:2219-2245): count leading
grid rows containing any column_header cell, concatenate their cell text
per column to build the markdown header, and use the remaining grid rows
as the body. Spanning siblings render empty in both cases so a colspan=N
header is not concatenated N times.

Falls back to "first row is header" if no row has any column_header cell,
preserving prior behaviour for tables that arrive without header marking.

Empirical result on a 10-PDF dividend-statement corpus
(finance_nexus tests/fixtures/dividend_statement/):

* IOZ_Reinvestment_Plan_Advice_2025_04_17.pdf: the wrapped header
  collapses from two rows into one. Body unchanged.
* All 9 other fixtures: byte-identical markdown output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 5, 2026

Contributor

DCO Check Passed

Thanks @scottf007, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented May 5, 2026

Contributor

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@cau-git

cau-git commented Jun 17, 2026

Member

@scottf007 Thanks for this proposal, having headers correctly separated from table bodies would clearly be an improvement. However, this behaviour is exactly intended in the default serializer: "Spanning siblings render empty so a colspan=N header isn't concatenated N times." so we would like to keep it. Do you want to update the PR with this?

I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bc

Signed-off-by: scott <scott@fletchcorp.com>

scottf007 force-pushed the fix/markdown-multi-row-headers branch from 92c81c2 to f9f83b2 on June 19, 2026 22:11

@scottf007

Author

Thanks @cau-git — appreciate you taking a look.

Quick clarification so I update the right thing: the PR already preserves the "spanning siblings render empty so a colspan=N header isn't concatenated N times" behaviour — that line in the description is describing what the patch keeps, not something it changes. The only behavioural change is that multiple leading rows marked column_header=True now collapse into a single markdown header row instead of the continuation rows leaking into the table body (e.g. a column title that wraps onto two visual lines like "Cash per Security" + "($)"). On the 10-PDF corpus only that one fixture changed; the other 9 were byte-identical.

So I'm not sure which part you'd like me to keep vs. change — happy to adjust, just want to make sure I understand. Is the concern that the multi-row collapse itself shouldn't be the default, or something about how the concatenation is done?

Separately: I've fixed the DCO check (the previous remediation commit referenced the wrong SHA).

Some broader feedback while I'm here, in case it's useful: the cases that bit me most were multi-column tables with sparse/empty columns where cells lose their x-position alignment, content occasionally getting dropped from the parse entirely, and flaky header detection. Alignment guards (inter-line/word/column distance heuristics) as a sanity check over the neural output might help — I know that's messy across hundreds of table shapes. Either way I think docling's a great project and want to help where I can.
