fix(serializer): collapse multi-row column headers into one markdown header by scottf007 · Pull Request #602 · docling-project/docling-core

scottf007 · 2026-05-05T04:40:02Z

Summary

MarkdownTableSerializer hardcoded headers=rows[0] and used rows[1:] as the body. When TableFormer correctly marks multiple leading rows as column_header=True (e.g. "Cash per Security" + "($)" on successive grid rows), only the first row rendered as the markdown header and the continuation leaked into the body as a spurious data row.

Mirrors _export_to_dataframe_with_options (document.py:2219-2245): count leading grid rows containing any column_header cell, concatenate per column for the header, use the rest as body. Spanning siblings render empty so a colspan=N header isn't concatenated N times.

Falls back to "first row is header" when no column_header cells are marked — preserves prior behaviour for that path.
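For readers who want the logic spelled out, here is a minimal sketch of the header/body split described above. It is not the patch itself: the `Cell` dataclass, `split_header_body`, and `to_markdown` are hypothetical stand-ins for the real docling-core types, the grid is assumed to be rectangular with a spanning cell repeated at every grid position it covers, and joining header fragments with a single space is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical, simplified stand-in for a table grid cell."""
    text: str
    column_header: bool = False
    start_row: int = 0  # first grid row the cell occupies
    start_col: int = 0  # first grid column the cell occupies


def count_header_rows(grid):
    """Count leading grid rows that contain at least one column_header cell."""
    n = 0
    for row in grid:
        if not any(cell.column_header for cell in row):
            break
        n += 1
    return n


def render(cell, r, c):
    """Render text only at the cell's origin; spanning siblings render empty."""
    return cell.text if (cell.start_row, cell.start_col) == (r, c) else ""


def split_header_body(grid):
    """Return (header, body) as lists of strings."""
    if not grid:
        return [], []
    rendered = [[render(cell, r, c) for c, cell in enumerate(row)]
                for r, row in enumerate(grid)]
    n = count_header_rows(grid)
    if n == 0:
        # Fallback: no header marking, so the first row is the header.
        return rendered[0], rendered[1:]
    header = [
        " ".join(t for t in (rendered[r][c] for r in range(n)) if t)
        for c in range(len(grid[0]))
    ]
    return header, rendered[n:]


def to_markdown(header, body):
    """Render a plain pipe table from the split result."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Wrapped header: "Cash per Security" continues as "($)" on the next grid row.
grid = [
    [Cell("Date", True, 0, 0), Cell("Cash per Security", True, 0, 1)],
    [Cell("", True, 1, 0), Cell("($)", True, 1, 1)],
    [Cell("2025-04-17", False, 2, 0), Cell("0.25", False, 2, 1)],
]
header, body = split_header_body(grid)
print(to_markdown(header, body))
```

Rendering a spanning cell only at its origin position is what keeps a colspan=N header from being repeated N times in the concatenated header.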

Empirical result

10-PDF corpus, byte-comparison of rendered markdown:

IOZ_2025_04_17.pdf: 2-row header collapses into 1. Body unchanged.
9 other fixtures: byte-identical.
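A byte-comparison like the one reported above can be reproduced with a few lines of standard-library Python. This is only a sketch under assumptions: the directory layout (one rendered `.md` file per fixture, written to a "before" and an "after" directory) and the function name `compare_rendered` are hypothetical, not taken from the PR.

```python
from pathlib import Path


def compare_rendered(before_dir, after_dir):
    """Split fixture names into (changed, identical) by exact byte equality."""
    before_dir, after_dir = Path(before_dir), Path(after_dir)
    changed, identical = [], []
    for before in sorted(before_dir.glob("*.md")):
        after = after_dir / before.name
        if after.exists() and before.read_bytes() == after.read_bytes():
            identical.append(before.name)
        else:
            # A missing file counts as changed.
            changed.append(before.name)
    return changed, identical
```

Comparing bytes rather than parsed tables means that even whitespace-only differences in the serializer output are reported as changes.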

Related: docling-project/docling#2985.

…header

MarkdownTableSerializer hardcoded `headers=rows[0]` and `rows[1:]` as the body. When TableFormer (correctly) marks multiple leading rows as column_header=True (the case where a column title wraps onto two visual lines, like "Cash per Security" + "($)"), only the first row rendered as the markdown header and the continuation leaked into the body as a spurious "data" row.

Mirrors the logic that already exists in `_export_to_dataframe_with_options` (document.py:2219-2245): count leading grid rows containing any column_header cell, concatenate their cell text per column to build the markdown header, and use the remaining grid rows as the body. Spanning siblings render empty in both cases so a colspan=N header is not concatenated N times.

Falls back to "first row is header" if no row has any column_header cell, preserving prior behaviour for tables that arrive without header marking.

Empirical result on a 10-PDF dividend-statement corpus (finance_nexus tests/fixtures/dividend_statement/):

* IOZ_Reinvestment_Plan_Advice_2025_04_17.pdf: the wrapped header collapses from two rows into one. Body unchanged.
* All 9 other fixtures: byte-identical markdown output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-05T04:40:12Z

✅ DCO Check Passed

Thanks @scottf007, all your commits are properly signed off. 🎉

mergify · 2026-05-05T04:40:40Z

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

cau-git · 2026-06-17T11:46:49Z

@scottf007 Thanks for this proposal, having headers correctly separated from table bodies would clearly be an improvement. However, this behaviour is exactly intended in the default serializer: "Spanning siblings render empty so a colspan=N header isn't concatenated N times." so we would like to keep it. Do you want to update the PR with this?

scottf007 · 2026-06-19T22:25:42Z

Thanks @cau-git — appreciate you taking a look.

Quick clarification so I update the right thing: the PR already preserves the "spanning siblings render empty so a colspan=N header isn't concatenated N times" behaviour — that line in the description is describing what the patch keeps, not something it changes. The only behavioural change is that multiple leading rows marked column_header=True now collapse into a single markdown header row instead of the continuation rows leaking into the table body (e.g. a column title that wraps onto two visual lines like "Cash per Security" + "($)"). On the 10-PDF corpus only that one fixture changed; the other 9 were byte-identical.

So I'm not sure which part you'd like me to keep vs. change — happy to adjust, just want to make sure I understand. Is the concern that the multi-row collapse itself shouldn't be the default, or something about how the concatenation is done?

Separately: I've fixed the DCO check (the previous remediation commit referenced the wrong SHA).

Some broader feedback while I'm here, in case it's useful: the cases that bit me most were multi-column tables with sparse/empty columns where cells lose their x-position alignment, content occasionally getting dropped from the parse entirely, and flaky header detection. Alignment guards (inter-line/word/column distance heuristics) as a sanity check over the neural output might help — I know that's messy across hundreds of table shapes. Either way I think docling's a great project and want to help where I can.

DCO Remediation Commit for scott <scott@fletchcorp.com>

f9f83b2

I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bc Signed-off-by: scott <scott@fletchcorp.com>

scottf007 force-pushed the fix/markdown-multi-row-headers branch from 92c81c2 to f9f83b2 Compare June 19, 2026 22:11

scottf007 wants to merge 2 commits into docling-project:main from scottf007:fix/markdown-multi-row-headers
