Spreadsheet: handle xlsx/xlsm directly through openpyxl #5737

risicle · 2026-01-12T16:19:17Z

Instead of using openpyxl through the extremely awkward gloves of pyexcel (which, among other things, causes unnecessary and expensive materialization of non-existent cells, makes it hard to detect the actual extents of cells defined in the file and has bugs in its merged cell support), use openpyxl directly for xlsx and xlsm files.

The approach of this parsing procedure is to minimize the extents of the region we need to "eagerly" iterate through
because accesses to undefined cells (or row/column dimensions) will result in the allocation of an object in its place.
however we obviously want to make these checks as cheap as possible themselves otherwise it will defeat the point.
therefore we make heavy use of the (internal!) sheet._cells dictionary to discover what cells are actually defined
in the file.

We manage to do this with only two iterations through sheet._cells (which isn't too bad - a mere naive access to
a sheet.max_column/max_row property will result in one iteration on its own, and a call to iter_rows without
a max_row/max_col specified will access that property will do that behind the scenes)

The main differences in the output (see conversion examples) are that the output is now properly "rectangular" instead
of having ragged line endings if final values are missing for some rows. There is also more "correct" handling of merged cells whose first cell is hidden. The header width determination is also now working as intended (previously disabled for xlsx for performance reasons).

This also calls sleep(0) after every operation that could have taken a while for extreme files.

This has also allowed me to include the test from #5686 because it passes now, correctly ignoring the stray cell.

app/models/spreadsheet.py

the previous Literal arrangement will probably not work very well

the approach of this parsing procedure is to minimize the extents of the region we need to "eagerly" iterate through because accesses to undefined cells (or row/column dimensions) will result in the allocation of an object in its place. however we obviously want to make these checks as cheap as possible themselves otherwise it will defeat the point. therefore we make heavy use of the (internal!) sheet._cells dictionary to discover what cells are *actually* defined in the file. we manage to do this with only two iterations through sheet._cells (which isn't too bad - a mere naive access to a sheet.max_column/max_row property will result in one iteration on its own, and a call to iter_rows without a max_row/max_col specified will access that property will do that behind the scenes) the main differences in the output (see conversion examples) are that the output is now properly "rectangular" instead of having ragged line endings if final values are missing for some rows. there is also more "correct" handling of merged cells whose first cell is hidden. the header width determination is also now working as intended (previously disabled for xlsx for performance reasons. this also calls sleep(0) after every operation that could have taken a while for extreme files.

…ottom cell set named QQQ123456 because that's the cell that's set files all created with LibreOffice 25.8.3.2

quis

Tested locally with a few random .xlsx files and it seems to handle them ok

risicle force-pushed the ris-spreadsheet-openpyxl-direct branch 2 times, most recently from 2638562 to 8812736 Compare January 16, 2026 11:23

risicle changed the title ~~Spreadsheet: add direct-openpyxl-based _from_xlsx method~~ Spreadsheet: handle xlsx/xlsm directly through openpyxl Jan 16, 2026

risicle marked this pull request as ready for review January 16, 2026 11:34

risicle requested a review from quis January 19, 2026 11:28