perf: added simpler table cell representation for high volume parsing #657

Open
Michele-Zhu wants to merge 1 commit into docling-project:main from Michele-Zhu:dev/data_type
Conversation

@Michele-Zhu Michele-Zhu commented Jun 24, 2026

This PR follows issue 3328 and the solution proposed in PR 3692 of the docling project for high-volume XLSX parsing.

TODO:

  • Check schema/backward compatibility issues between dataclass and pydantic models

# TODO

Signed-off-by: Michele-Zhu <71699139+Michele-Zhu@users.noreply.github.com>
@github-actions

DCO Check Passed

Thanks @Michele-Zhu, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 24, 2026

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
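
As a rough illustration of what this rule accepts, the same pattern can be tried with Python's `re` module. This is only a sketch: the helper name `is_conventional` is made up here, and it assumes the rule behaves like an ordinary anchored regular-expression search.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


pr_title = "perf: added simpler table cell representation for high volume parsing"
print(is_conventional(pr_title))                    # type only: "perf:"
print(is_conventional("feat(xlsx)!: new backend"))  # type, scope, and "!" marker
print(is_conventional("Added simpler table cell"))  # no recognised type prefix
```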

@gyx09212214-prog gyx09212214-prog left a comment

I think this needs a serialization round-trip test before merging. AnyTableCell is part of persisted TableData, and the new FastTableCell is a dataclass while the existing alternatives are Pydantic models. Could we add a test that builds a TableData with a FastTableCell, dumps a DoclingDocument to JSON, validates it back, and verifies the cell type/fields survive?

That would catch both union selection and schema/backward compatibility issues. Without it, a faster parser path could produce documents that cannot be loaded by downstream consumers.

Michele-Zhu commented Jun 25, 2026

> I think this needs a serialization round-trip test before merging. AnyTableCell is part of persisted TableData, and the new FastTableCell is a dataclass while the existing alternatives are Pydantic models. Could we add a test that builds a TableData with a FastTableCell, dumps a DoclingDocument to JSON, validates it back, and verifies the cell type/fields survive?
>
> That would catch both union selection and schema/backward compatibility issues. Without it, a faster parser path could produce documents that cannot be loaded by downstream consumers.

Yeah, we should add a test that uses FastTableCell for a round trip.
P.S. There is already a serialization to markdown in the performance analysis in the docling-project PR.

One question comes to mind right now:

  • Would the serialization require conversion from FastTableCell to TableCell? (Presumably not, unless the docling pipeline has implicit conversion steps that I'm unaware of.)

I'll check when writing the test. In the meantime, do you have any idea?

Thanks for pointing this out. I checked the round-trip behavior locally, and the current implementation does not preserve FastTableCell as a FastTableCell after JSON serialization.

The reason is the current union order:

Union[RichTableCell, TableCell, FastTableCell] with left_to_right validation.

The JSON shape emitted by FastTableCell is also valid input for TableCell, so when the document is loaded back, Pydantic selects TableCell before it reaches FastTableCell. There is no explicit conversion step, but validation effectively normalizes the cell back to TableCell.
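
A minimal sketch of that behavior, using simplified stand-in models rather than the real docling-core classes. The field names (`text`, `row`, `col`, `ref`) and the stripped-down `TableData` are assumptions made for illustration only; the point is the union order and the `left_to_right` mode.

```python
from dataclasses import asdict, dataclass
from typing import Annotated, List, Union

from pydantic import BaseModel, Field


class TableCell(BaseModel):  # stand-in for the existing Pydantic cell
    text: str
    row: int = 0
    col: int = 0


class RichTableCell(TableCell):  # stand-in; the extra required field is assumed
    ref: str


@dataclass
class FastTableCell:  # stand-in for the lightweight dataclass cell
    text: str
    row: int = 0
    col: int = 0


AnyTableCell = Annotated[
    Union[RichTableCell, TableCell, FastTableCell],
    Field(union_mode="left_to_right"),
]


class TableData(BaseModel):  # stand-in container
    table_cells: List[AnyTableCell] = Field(default_factory=list)


original = TableData(table_cells=[FastTableCell(text="42", row=1, col=2)])
payload = original.model_dump_json()
loaded = TableData.model_validate_json(payload)

before = original.table_cells[0]
after = loaded.table_cells[0]
print(type(before).__name__, "->", type(after).__name__)
# The fields survive even though the type is normalized.
print(after.model_dump() == asdict(before))
```

In memory the cell stays a `FastTableCell`, because neither model accepts a dataclass instance as input. After the JSON round trip the plain object lacks `ref`, so `RichTableCell` is rejected and `TableCell` is the first member that validates.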

So I think we need to make the intended contract explicit:

  1. If FastTableCell is only an in-memory optimization for high-volume XLSX parsing, then persisted documents can intentionally load back as TableCell. In that case I will add a round-trip test that verifies the document loads successfully and that all cell fields are preserved after normalization to TableCell.

  2. If FastTableCell is meant to be a persisted public document type, then this PR needs more than adding it to the union. We would need a way for Pydantic to distinguish it from TableCell, for example a discriminator or otherwise non-overlapping validation, and the test should assert that the loaded cell is still a FastTableCell.
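
For the second contract, one possible shape is a tagged union. The sketch below is hypothetical: the `kind` field and its values are not part of this PR or of docling-core, and the models are simplified stand-ins.

```python
import json
from dataclasses import asdict, dataclass
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class TableCell(BaseModel):  # stand-in model
    kind: Literal["plain"] = "plain"  # hypothetical tag field
    text: str
    row: int = 0
    col: int = 0


@dataclass
class FastTableCell:  # stand-in dataclass
    text: str
    row: int = 0
    col: int = 0
    kind: Literal["fast"] = "fast"  # hypothetical tag field


AnyTableCell = Annotated[
    Union[TableCell, FastTableCell],
    Field(discriminator="kind"),
]
adapter = TypeAdapter(AnyTableCell)

# The tag in the JSON selects the union member, so the type survives.
payload = json.dumps(asdict(FastTableCell(text="42", row=1, col=2)))
loaded = adapter.validate_json(payload)
print(type(loaded).__name__)

# JSON written before the tag existed carries no "kind" key and is rejected.
try:
    adapter.validate_json('{"text": "42", "row": 1, "col": 2}')
    untagged_rejected = False
except ValidationError:
    untagged_rejected = True
print(untagged_rejected)
```

The last step shows the cost of this route: with a plain field discriminator, documents persisted without the tag no longer validate, so it would need a migration or a more permissive discriminator.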

Given the goal of this PR, I lean toward the first contract: use FastTableCell to avoid the per-cell Pydantic cost during parsing, then allow persisted JSON to load as the regular TableCell model. That keeps backward compatibility while still giving the high-volume parsing path the memory win.

@Michele-Zhu

Thank you for your work!
[For future readers: pydantic union]

I second the first option, which is the intended behavior of this PR. The reason I'm adding the dataclass here instead of leaving it in the MSExcelBackend is as follows: "In the case of high-volume workloads, there will always be a need for lighter representations".
