perf: added simpler table cell representation for high volume parsing#657
perf: added simpler table cell representation for high volume parsing#657Michele-Zhu wants to merge 1 commit into
Conversation
# TODO Signed-off-by: Michele-Zhu <71699139+Michele-Zhu@users.noreply.github.com>
|
✅ DCO Check Passed Thanks @Michele-Zhu, all your commits are properly signed off. 🎉 |
Merge Protections🟢 Merge protection satisfied — ready to merge. Show 1 satisfied protection🟢 Enforce conventional commitMake sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
gyx09212214-prog
left a comment
There was a problem hiding this comment.
I think this needs a serialization round-trip test before merging. AnyTableCell is part of persisted TableData, and the new FastTableCell is a dataclass while the existing alternatives are Pydantic models. Could we add a test that builds a TableData with a FastTableCell, dumps a DoclingDocument to JSON, validates it back, and verifies the cell type/fields survive?
That would catch both union selection and schema/backward compatibility issues. Without it, a faster parser path could produce documents that cannot be loaded by downstream consumers.
Yeah, we should a test that uses There are one questions that comes into my mind right now:
I'll check it when writing the test. In case, do you have any idea? |
|
Thanks for pointing this out. I checked the round-trip behavior locally, and the current implementation does not preserve The reason is the current union order:
The JSON shape emitted by So I think we need to make the intended contract explicit:
Given the goal of this PR, I lean toward the first contract: use |
|
Thank you for your work! I second the first option, which is the intended behavior of this PR. The reason I'm adding the dataclass here instead of leaving it in the MSExcelBackend is as follows: "In the case of high volume workloads, there will always a need for lighter representations". |
This PR follows issue number 3328 and the proposed solution at PR 3692 at the docling project for high-volume XLSX parsing.
TODO: