feat(xlsx): added lighter data representation for high volume parsing#3692

Open
Michele-Zhu wants to merge 3 commits into
docling-project:mainfrom
Michele-Zhu:dev/issue#3328_high_ram_usage_xlsx_large_files

@Michele-Zhu Michele-Zhu commented Jun 24, 2026

Issue resolved by this Pull Request:
Resolves #3328

Problem

When parsing very large XLSX files (1M+ cells), users have noticed spikes in RAM usage and slow processing. This PR mainly reduces the RAM usage by avoiding Pydantic data validation for each cell. It also provides a 5-15% speedup in the observed workloads with 100k rows and 300k rows (see footnote 1).

Solution

  1. Perform root-cause analysis to understand the problem.
  2. Select a profiler for the root-cause analysis (in this case memray, an instrumentation profiler).
  3. Apply a minimal-change solution: use Python dataclasses (FastTableCell) instead of Pydantic models, so the data validation step is skipped.
  4. Add MsExcelBackendOptions to select which cell class to use (cell_cls) with minimal overhead.
  5. Provide a profiling script. The root-cause analysis can be repeated by running the following commands:
    • python3 perfs/simplexlsx_profiling.py --cell "TableCell" -o "./perfs/data/profiles/profiling_table_cell"
    • python3 perfs/simplexlsx_profiling.py --cell "FastTableCell" -o "./perfs/data/profiles/profiling_fast_table_cell"
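
As a rough illustration of points 3 and 4, the sketch below shows how a plain dataclass cell and an options object carrying a `cell_cls` field could fit together. It is not the code of this PR: the field names, the `BackendOptions` class, and the `build_cells` helper are assumptions made for illustration, and only the names FastTableCell and cell_cls come from the description above.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence, Type


@dataclass
class FastTableCell:
    """Hypothetical lightweight cell: plain fields, no validation on construction."""

    text: str
    start_row_offset_idx: int
    end_row_offset_idx: int
    start_col_offset_idx: int
    end_col_offset_idx: int
    row_span: int = 1
    col_span: int = 1


@dataclass
class BackendOptions:
    """Hypothetical stand-in for MsExcelBackendOptions; only cell_cls is modelled."""

    cell_cls: Type[Any] = FastTableCell


def build_cells(rows: Sequence[Sequence[Any]], options: BackendOptions) -> List[Any]:
    """Create one cell object per value, using the class selected in the options."""
    cell_cls = options.cell_cls  # resolve the class once, outside the hot loop
    cells = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            cells.append(
                cell_cls(
                    text=str(value),
                    start_row_offset_idx=r,
                    end_row_offset_idx=r + 1,
                    start_col_offset_idx=c,
                    end_col_offset_idx=c + 1,
                )
            )
    return cells


if __name__ == "__main__":
    cells = build_cells([["name", "qty"], ["bolt", 12]], BackendOptions())
    print(len(cells), cells[3].text)
```

Passing the class through the options keeps the construction loop unchanged, so a validating cell class with the same keyword arguments could be swapped in without touching the parser loop.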

Extra

An additional option is to keep the Pydantic model but disable its validation, validating only the text field when creating the table_cell, since the rest of the data should already be valid.
However, we suspect (this was not measured) that the Pydantic BaseModel itself adds too much metadata when there is a large number of text cells, so this approach would not significantly improve RAM usage.

Root-cause Analysis

To understand the root cause of the problem, we performed instrumentation profiling with the memray library.
Consider the workload with 300k cells. The flamegraph shows that, at peak memory usage, 2.7 GB are consumed by the creation and validation of TableCell; of these, 2.5 GB come from the upstream Pydantic validation engine. With FastTableCell the cost is only 200 MB.

flamegraph_300k_rows_table_cell.html
flamegraph_300k_rows_fast_table_cell.html
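
For readers who want a quick sanity check without installing memray, the following sketch measures the peak allocation of building many cell objects. It swaps in `tracemalloc` from the standard library, which is a different tool from the one used in this PR: it only tracks allocations made through Python's memory allocators and produces no flamegraph. The `PlainCell` class and the `peak_allocation_bytes` helper are hypothetical names.

```python
import tracemalloc
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PlainCell:
    """Hypothetical minimal cell used only to have something to allocate."""

    text: str
    row: int
    col: int


def peak_allocation_bytes(make_cell: Callable[[int], Any], n_cells: int) -> int:
    """Return the peak traced memory (bytes) while n_cells objects are alive."""
    tracemalloc.start()
    try:
        # Keep every cell referenced so the peak reflects the full table.
        cells = [make_cell(i) for i in range(n_cells)]
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    # 7 columns mirrors the shape mentioned in the linked issue title.
    peak = peak_allocation_bytes(
        lambda i: PlainCell(text=f"cell {i}", row=i // 7, col=i % 7),
        100_000,
    )
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
```

Running the same helper with a different factory (for example one that builds a validating model) gives a like-for-like comparison of the two cell classes, under the caveat that the numbers will not match memray's.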

The peak growth of 22 GB reported in issue #3328 has not been observed with the current version of docling (2.107.0) in a local environment.

Additional comments

  • Currently the class is named FastTableCell, which does not completely capture its semantics. The author of this PR proposes alternatives such as CompactTableCell or SimpleTableCell, to better convey that the validation part has been stripped out.
  • Integration tests: no integration test was performed. This change is minimal with respect to the current architecture of the system, and the "Pydantic tax" can be paid later, downstream, on access to the table_cells field in TableData. A better execution model, if feasible, would be "bulk" validation of FastTableCell instances with Pydantic.
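
The idea of paying the validation cost later, in bulk, could look roughly like the sketch below. It is a standard-library illustration only: `LightCell`, `validate_cell`, and `LazyTable` are hypothetical names, and the hand-written checks merely stand in for whatever Pydantic validation docling would actually run.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LightCell:
    """Hypothetical unvalidated cell produced by the parser."""

    text: str
    row: int
    col: int


def validate_cell(cell: LightCell) -> None:
    """Hypothetical checks standing in for model validation."""
    if not isinstance(cell.text, str):
        raise TypeError(f"text must be str, got {type(cell.text).__name__}")
    if cell.row < 0 or cell.col < 0:
        raise ValueError("row and col must be non-negative")


class LazyTable:
    """Stores cells unvalidated and validates them all on first access."""

    def __init__(self, cells: Iterable[LightCell]) -> None:
        self._cells: List[LightCell] = list(cells)
        self._validated = False

    @property
    def table_cells(self) -> List[LightCell]:
        if not self._validated:
            # One pass over all cells, done once, instead of per-cell at parse time.
            for cell in self._cells:
                validate_cell(cell)
            self._validated = True
        return self._cells


if __name__ == "__main__":
    table = LazyTable([LightCell("a", 0, 0), LightCell("b", 0, 1)])
    print(len(table.table_cells))
```

With this shape, documents whose cells are never accessed downstream never pay the validation cost, while invalid data still surfaces as an error at the first access.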

Additional possible improvements

While the Pydantic overhead has been removed, the upstream parser openpyxl consumes 800 MB by opening the file in eager mode.

TODOS

  1. Clean up comments
  2. Update docling core with lighter dataclass + annotation. At the following PR

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Footnotes

  1. Timing measurements were not performed with Monte Carlo sampling or experiment isolation.

Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>
… (issue#3328)

Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>
Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>
@github-actions

DCO Check Passed

Thanks @Michele-Zhu, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 24, 2026

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@Michele-Zhu Michele-Zhu changed the title Dev/issue#3328 high ram usage xlsx large files feat(xlsx): added lighter data representation for high volume parsing Jun 24, 2026
@Michele-Zhu Michele-Zhu marked this pull request as ready for review June 24, 2026 14:40
Development

Successfully merging this pull request may close these issues.

High RAM consumption when converting large XLSX files (300k rows, 7 columns)

1 participant