feat(xlsx): added lighter data representation for high volume parsing by Michele-Zhu · Pull Request #3692 · docling-project/docling

Michele-Zhu · 2026-06-24T13:33:17Z

**Issue resolved by this Pull Request:**
Resolves #3328

Problem

When parsing very large XLSX files (1M+ cells), users have noticed spikes in RAM usage and slow processing. This PR mainly addresses the high RAM usage by skipping Pydantic data validation for each cell. It also gives a 5-15% speedup on the observed workloads (100k rows and 300k rows) ¹.

Solution

Performed a root-cause analysis to understand the problem.
Selected a profiler for the root-cause analysis (in this case memray, an instrumentation profiler).
Minimal-change solution: Python dataclasses (FastTableCell) instead of Pydantic models, so the data validation step is skipped.
Added MsExcelBackendOptions to select which cell class to use (cell_cls) with minimal overhead.
Provided a profiling script; the root-cause analysis can be repeated by running the following commands:
- python3 perfs/simplexlsx_profiling.py --cell "TableCell" -o "./perfs/data/profiles/profiling_table_cell"
- python3 perfs/simplexlsx_profiling.py --cell "FastTableCell" -o "./perfs/data/profiles/profiling_fast_table_cell"
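The idea behind the last two solution items can be sketched with the standard library only. This is a minimal illustration, not the PR's code: `ValidatedCellSketch`, `FastCellSketch`, `BackendOptionsSketch`, `build_cells` and the `text`/`row`/`col` fields are all hypothetical names, and the hand-written `__post_init__` checks merely stand in for per-cell Pydantic validation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ValidatedCellSketch:
    """Stand-in for a validated model: every field is checked on construction."""

    text: str
    row: int
    col: int

    def __post_init__(self) -> None:
        # Hand-written checks standing in for per-cell Pydantic validation.
        if not isinstance(self.text, str):
            raise TypeError("text must be a str")
        for name in ("row", "col"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"{name} must be a non-negative int")


@dataclass
class FastCellSketch:
    """Plain dataclass: same fields, no validation at construction."""

    text: str
    row: int
    col: int


@dataclass
class BackendOptionsSketch:
    """Hypothetical options object; cell_cls picks the cell representation."""

    cell_cls: type = ValidatedCellSketch


def build_cells(rows: list[list[Any]], options: BackendOptionsSketch) -> list[Any]:
    cell_cls = options.cell_cls  # resolve the class once, outside the hot loop
    return [
        cell_cls(text=str(value), row=r, col=c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
    ]


rows = [["id", "name"], [1, "Ada"]]
fast = build_cells(rows, BackendOptionsSketch(cell_cls=FastCellSketch))
print(len(fast), fast[3].text)  # prints: 4 Ada
```

Passing the class through an options object keeps the parsing loop unchanged: the only per-cell difference is which constructor runs.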

Extra

An additional option is to disable Pydantic validation and only validate the text field when creating the table_cell, since the rest of the data should all be valid.
(Suspected) The Pydantic BaseModel itself adds too much per-instance metadata when there is a large number of text cells, so this approach would not significantly reduce RAM usage.

Root-cause Analysis

To understand the root cause of the problem, we performed instrumentation profiling with the memray library.
Consider the workload with 300k cells. The flame graph shows that at peak memory usage, 2.7 GB are consumed by the creation and validation of TableCell; in particular, 2.5 GB come from the upstream Pydantic validation engine. With FastTableCell the cost is only 200 MB.

flamegraph_300k_rows_table_cell.html
flamegraph_300k_rows_fast_table_cell.html

The peak growth of 22 GB reported in issue #3328 has not been observed with the current version of docling (2.107.0) in a local environment.

Additional comments

The class is currently named FastTableCell, which does not fully capture its semantics. The author of this PR proposes alternatives such as CompactTableCell or SimpleTableCell, which better convey that the validation part has been stripped out.
Integration tests: no integration test was performed. The change is minimal with respect to the current architecture, and the "Pydantic tax" can be paid later, downstream, when the table_cells field in TableData is accessed. A better execution model, if feasible, would be "bulk" validation of FastTableCell instances with Pydantic.
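The "pay later, in bulk" idea mentioned above could look roughly like the following. It is a sketch under assumptions, not the PR's or docling-core's implementation: `RawCell`, `validate_cells` and `LazyTable` are hypothetical, and the loop in `validate_cells` stands in for whatever bulk Pydantic call would actually be used.

```python
from dataclasses import dataclass


@dataclass
class RawCell:
    text: object
    row: int
    col: int


def validate_cells(cells: list[RawCell]) -> None:
    """Check a whole batch in one pass (stand-in for a bulk Pydantic call)."""
    for cell in cells:
        if not isinstance(cell.text, str):
            raise TypeError(f"cell ({cell.row}, {cell.col}): text must be a str")


class LazyTable:
    """Collects cells without checks and validates them on first access."""

    def __init__(self) -> None:
        self._raw: list[RawCell] = []
        self._validated = False

    def add(self, cell: RawCell) -> None:
        self._raw.append(cell)  # no validation on the hot path
        self._validated = False

    @property
    def table_cells(self) -> list[RawCell]:
        if not self._validated:
            validate_cells(self._raw)  # the cost is paid once, here
            self._validated = True
        return self._raw


table = LazyTable()
table.add(RawCell("a", 0, 0))
table.add(RawCell("b", 0, 1))
print(len(table.table_cells))  # prints: 2
```

The trade-off is that invalid data is reported at access time rather than at parse time, which is why the PR flags this as something to evaluate rather than a settled design.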

Additional possible improvements

While the Pydantic overhead has been removed, the upstream parser openpyxl consumes 800 MB by opening the file in eager mode.
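One direction that could be explored here is openpyxl's read-only mode, which streams rows instead of loading the whole workbook. The sketch below is not part of this PR and its memory effect has not been measured here; `iter_cell_values` is a hypothetical helper. Read-only worksheets expose a reduced API, so anything the backend needs beyond cell values (for example merged ranges) would have to be checked before adopting it.

```python
from typing import Iterator


def iter_cell_values(path: str) -> Iterator[tuple]:
    """Yield each row's values from every sheet using openpyxl's read-only mode."""
    from openpyxl import load_workbook  # third-party; imported lazily

    wb = load_workbook(path, read_only=True)
    try:
        for ws in wb.worksheets:
            # values_only=True yields plain tuples instead of cell objects
            yield from ws.iter_rows(values_only=True)
    finally:
        wb.close()  # read-only workbooks keep the file handle open until closed
```

A caller would consume it as `for row in iter_cell_values("large.xlsx"): ...`, where the file name is only a placeholder.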

TODOS

Clean up comments
Update docling core with the lighter dataclass + annotation, in the following PR

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

¹ Timing measurements were not performed with Monte Carlo sampling or experiment isolation.

Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>


github-actions · 2026-06-24T13:33:28Z

✅ DCO Check Passed

Thanks @Michele-Zhu, all your commits are properly signed off. 🎉

mergify · 2026-06-24T13:33:53Z

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
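The title rule above can be tried locally with Python's `re` module. This is a small sketch: `is_conventional` is a hypothetical helper, and using `re.search` to mimic the `~=` operator is an assumption about how the rule is evaluated. The two sample titles are this PR's current and previous titles.

```python
import re

# Same pattern as in the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


titles = [
    "feat(xlsx): added lighter data representation for high volume parsing",
    "Dev/issue#3328 high ram usage xlsx large files",
]
for title in titles:
    print(is_conventional(title), title)
```

The first title passes and the second does not, since it has no recognized type prefix followed by a colon.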

Michele-Zhu added 3 commits June 22, 2026 16:10

fix(tests): added testpaths to project.toml [tool.pytest.ini_options]

e42ec2f

Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>

feat(xlsx): added lighter data representation for high volume parsing…

829be75

… (issue#3328) Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>

chore: dev dependency (memray>=1.19.3)

cb8b8c7

Signed-off-by: Michele-Zhu <michele.zhu@polimi.it>

Michele-Zhu changed the title ~~Dev/issue#3328 high ram usage xlsx large files~~ feat(xlsx): added lighter data representation for high volume parsing Jun 24, 2026

Michele-Zhu mentioned this pull request Jun 24, 2026

perf: added simpler table cell representation for high volume parsing docling-project/docling-core#657

Open

Michele-Zhu marked this pull request as ready for review June 24, 2026 14:40

PeterStaar-IBM requested review from PeterStaar-IBM and ceberam June 25, 2026 03:59

Michele-Zhu wants to merge 3 commits into docling-project:main from Michele-Zhu:dev/issue#3328_high_ram_usage_xlsx_large_files
