Skip to content

Conversation

@risicle
Copy link
Member

@risicle risicle commented Jan 12, 2026

Instead of using openpyxl through the extremely awkward gloves of pyexcel (which, among other things, causes unnecessary and expensive materialization of non-existent cells, makes it hard to detect the actual extents of cells defined in the file and has bugs in its merged cell support), use openpyxl directly for xlsx and xlsm files.

The approach of this parsing procedure is to minimize the extents of the region we need to "eagerly" iterate through
because accesses to undefined cells (or row/column dimensions) will result in the allocation of an object in its place.
however we obviously want to make these checks as cheap as possible themselves otherwise it will defeat the point.
therefore we make heavy use of the (internal!) sheet._cells dictionary to discover what cells are actually defined
in the file.

We manage to do this with only two iterations through sheet._cells (which isn't too bad - a mere naive access to
a sheet.max_column/max_row property will result in one iteration on its own, and a call to iter_rows without
a max_row/max_col specified will access that property will do that behind the scenes)

The main differences in the output (see conversion examples) are that the output is now properly "rectangular" instead
of having ragged line endings if final values are missing for some rows. There is also more "correct" handling of merged cells whose first cell is hidden. The header width determination is also now working as intended (previously disabled for xlsx for performance reasons).

This also calls sleep(0) after every operation that could have taken a while for extreme files.

This has also allowed me to include the test from #5686 because it passes now, correctly ignoring the stray cell.

@risicle risicle force-pushed the ris-spreadsheet-openpyxl-direct branch 2 times, most recently from 2638562 to 8812736 Compare January 16, 2026 11:23
@risicle risicle changed the title Spreadsheet: add direct-openpyxl-based _from_xlsx method Spreadsheet: handle xlsx/xlsm directly through openpyxl Jan 16, 2026
@risicle risicle marked this pull request as ready for review January 16, 2026 11:34
@risicle risicle requested a review from quis January 19, 2026 11:28
the previous Literal arrangement will probably not work very well
the approach of this parsing procedure is to minimize the
extents of the region we need to "eagerly" iterate through
because accesses to undefined cells (or row/column dimensions)
will result in the allocation of an object in its place.
however we obviously want to make these checks as cheap as
possible themselves otherwise it will defeat the point.
therefore we make heavy use of the (internal!) sheet._cells
dictionary to discover what cells are *actually* defined
in the file.

we manage to do this with only two iterations through
sheet._cells (which isn't too bad - a mere naive access to
a sheet.max_column/max_row property will result in one
iteration on its own, and a call to iter_rows without
a max_row/max_col specified will access that property will
do that behind the scenes)

the main differences in the output (see conversion examples)
are that the output is now properly "rectangular" instead
of having ragged line endings if final values are missing
for some rows. there is also more "correct" handling of
merged cells whose first cell is hidden. the header width
determination is also now working as intended (previously
disabled for xlsx for performance reasons.

this also calls sleep(0) after every operation that could
have taken a while for extreme files.
…ottom cell set

named QQQ123456 because that's the cell that's set

files all created with LibreOffice 25.8.3.2
@risicle risicle force-pushed the ris-spreadsheet-openpyxl-direct branch from 8812736 to ceca36e Compare January 22, 2026 11:45
@risicle risicle requested a review from quis January 22, 2026 11:52
Copy link
Member

@quis quis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally with a few random .xlsx files and it seems to handle them ok

@risicle risicle merged commit d69d3c1 into main Jan 29, 2026
6 checks passed
@risicle risicle deleted the ris-spreadsheet-openpyxl-direct branch January 29, 2026 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants