fix(serializer): collapse multi-row column headers into one markdown header #602
scottf007 wants to merge 2 commits
…header

MarkdownTableSerializer hardcoded `headers=rows[0]` and `rows[1:]` as the body. When TableFormer (correctly) marks multiple leading rows as column_header=True (the case where a column title wraps onto two visual lines, like "Cash per Security" + "($)"), only the first row rendered as the markdown header and the continuation leaked into the body as a spurious "data" row.

Mirrors the logic that already exists in `_export_to_dataframe_with_options` (document.py:2219-2245): count leading grid rows containing any column_header cell, concatenate their cell text per column to build the markdown header, and use the remaining grid rows as the body. Spanning siblings render empty in both cases so a colspan=N header is not concatenated N times.

Falls back to "first row is header" if no row has any column_header cell, preserving prior behaviour for tables that arrive without header marking.

Empirical result on a 10-PDF dividend-statement corpus (finance_nexus tests/fixtures/dividend_statement/):
* IOZ_Reinvestment_Plan_Advice_2025_04_17.pdf: the wrapped header collapses from two rows into one. Body unchanged.
* All 9 other fixtures: byte-identical markdown output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
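The collapsing logic described in this commit message can be sketched roughly as follows. This is a minimal, standalone illustration and not docling's actual code: `Cell`, `is_span_origin`, `split_header_and_body` and `to_markdown` are hypothetical names, the grid is assumed to be rectangular, and joining wrapped header lines with a single space is an assumption about the separator.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for one slot of a table grid (not docling's type)."""
    text: str
    column_header: bool = False
    # False for grid slots that only repeat a spanning cell (its "siblings").
    is_span_origin: bool = True


def split_header_and_body(grid):
    """Collapse leading header rows into one header; return (header, body)."""
    if not grid:
        return [], []

    # Count leading rows that contain at least one column_header cell.
    n_header = 0
    for row in grid:
        if any(cell.column_header for cell in row):
            n_header += 1
        else:
            break

    # Fallback: no header marking at all, so treat the first row as the header.
    if n_header == 0:
        n_header = 1

    header = []
    for col in range(len(grid[0])):
        parts = []
        for row in grid[:n_header]:
            cell = row[col]
            # Spanning siblings contribute nothing, so a colspan=N title
            # is not repeated N times in the header.
            if cell.is_span_origin and cell.text.strip():
                parts.append(cell.text.strip())
        header.append(" ".join(parts))  # assumed separator: one space

    body = [
        [cell.text if cell.is_span_origin else "" for cell in row]
        for row in grid[n_header:]
    ]
    return header, body


def to_markdown(header, body):
    """Render a header row, a separator row, and body rows as a pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Example: a column title that wraps onto a second header row.
grid = [
    [Cell("Security", True), Cell("Cash per Security", True)],
    [Cell("", True), Cell("($)", True)],
    [Cell("IOZ"), Cell("0.25")],
]
header, body = split_header_and_body(grid)
print(to_markdown(header, body))
```

With the old "first row is header" behaviour, the "($)" row in this example would have been emitted as a body row; here it is folded into the second column's header instead.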
✅ DCO Check Passed
Thanks @scottf007, all your commits are properly signed off. 🎉
@scottf007 Thanks for this proposal. Having headers correctly separated from table bodies would clearly be an improvement. However, this behaviour is exactly intended in the default serializer: "Spanning siblings render empty so a colspan=N header isn't concatenated N times." so we would like to keep it. Do you want to update the PR with this?
I, scott <scott@fletchcorp.com>, hereby add my Signed-off-by to this commit: 93b77bc Signed-off-by: scott <scott@fletchcorp.com>
92c81c2 to f9f83b2
Thanks @cau-git, appreciate you taking a look. Quick clarification so I update the right thing: the PR already preserves the "spanning siblings render empty so a […]

So I'm not sure which part you'd like me to keep vs. change. Happy to adjust, I just want to make sure I understand. Is the concern that the multi-row collapse itself shouldn't be the default, or something about how the concatenation is done?

Separately: I've fixed the DCO check (the previous remediation commit referenced the wrong SHA).

Some broader feedback while I'm here, in case it's useful: the cases that bit me most were multi-column tables with sparse/empty columns where cells lose their x-position alignment, content occasionally getting dropped from the parse entirely, and flaky header detection. Alignment guards (inter-line/word/column distance heuristics) as a sanity check over the neural output might help, though I know that's messy across hundreds of table shapes. Either way I think docling's a great project and want to help where I can.
Summary
`MarkdownTableSerializer` hardcoded `headers=rows[0]` and `rows[1:]`. When TableFormer correctly marks multiple leading rows as `column_header=True` (e.g. "Cash per Security" + "($)" on successive grid rows), only the first rendered as the markdown header and the continuation leaked into the body as a spurious data row.

Mirrors `_export_to_dataframe_with_options` (document.py:2219-2245): count leading grid rows containing any `column_header` cell, concatenate per column for the header, use the rest as body. Spanning siblings render empty so a `colspan=N` header isn't concatenated N times.

Falls back to "first row is header" when no `column_header` cells are marked, which preserves prior behaviour for that path.

Empirical result
10-PDF corpus, byte-comparison of rendered markdown:
* IOZ_Reinvestment_Plan_Advice_2025_04_17.pdf: the wrapped header collapses from two rows into one. Body unchanged.
* All 9 other fixtures: byte-identical markdown output.
Related: docling-project/docling#2985.
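A byte-comparison like the one reported under "Empirical result" could be reproduced with a small script along these lines. This is only a sketch under assumptions: `compare_markdown_dirs` is a hypothetical helper, and it presumes the markdown rendered before and after the change has already been written to two directories with matching `.md` file names.

```python
from pathlib import Path


def compare_markdown_dirs(before_dir, after_dir):
    """Return the sorted names of .md files whose bytes differ between the dirs.

    A file present in before_dir but missing from after_dir counts as changed.
    """
    before = Path(before_dir)
    after = Path(after_dir)
    changed = []
    for old_file in sorted(before.glob("*.md")):
        new_file = after / old_file.name
        if not new_file.exists() or old_file.read_bytes() != new_file.read_bytes():
            changed.append(old_file.name)
    return changed
```

An empty return value means every fixture rendered byte-identically; any listed file name is a candidate for manual review of the header rows.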