feat(xlsx): added lighter data representation for high volume parsing #3692
Open
Michele-Zhu wants to merge 3 commits into
Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>
… (issue#3328) Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>
Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>
✅ DCO Check Passed. Thanks @Michele-Zhu, all your commits are properly signed off. 🎉
Merge Protections: 🟢 Merge protection satisfied, ready to merge. 🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
**Issue resolved by this Pull Request:**
Resolves #3328
Problem
When parsing very large XLSX files (1M+ cells), users have noticed spikes in RAM usage and slow processing. This PR mainly addresses the high RAM usage by avoiding Pydantic data validation for each cell. It also provides a 5-15% speedup in the observed workloads with 100k and 300k rows (see footnote 1).
Solution
- Use a lighter data representation (`FastTableCell`) instead of Pydantic models (the data validation step is skipped)
- `python3 perfs/simplexlsx_profiling.py --cell "TableCell" -o "./perfs/data/profiles/profiling_table_cell"`
- `python3 perfs/simplexlsx_profiling.py --cell "FastTableCell" -o "./perfs/data/profiles/profiling_fast_table_cell"`
Extra
An additional option is to disable Pydantic validation and only validate the text field when creating the table_cell, since the rest of the data should all be valid.
(Suspected) The Pydantic BaseModel itself adds too much per-instance metadata when there is a large number of text cells, so this approach would likely not improve RAM usage significantly.
Root-cause Analysis
To understand the root cause of the problem, we ran an instrumentation profile with the memray library.
Consider the workload with 300k cells. The flamegraph shows that, at peak memory usage, 2.7 GB are consumed by the creation and validation of `TableCell`; in particular, 2.5 GB come from the upstream Pydantic validation engine. With `FastTableCell` the cost is only 200 MB.
flamegraph_300k_rows_table_cell.html
flamegraph_300k_rows_fast_table_cell.html
The peak growth of 22 GB reported in issue #3328 was not observed with the current version of docling (2.107.0) in a local environment.
Additional comments
- The class is currently named `FastTableCell`, which does not completely capture the semantics of the class. The author of this PR proposes alternatives such as `CompactTableCell` or `SimpleTableCell` to better convey that the validation part has been stripped down.
- … `table_cells` field in `TableData`. A better execution model, if feasible, is "bulk" `FastTableCell` validation with Pydantic.
Additional possible improvements
While the Pydantic overhead has been removed, the upstream parser `openpyxl` consumes 800 MB by opening the file in eager mode.
TODOS
Checklist:
Footnotes
1. Timing measurements were not performed with Monte Carlo sampling or experiment isolation.