Spreadsheet: handle xlsx/xlsm directly through openpyxl #5737
Merged
+415
−53
2638562 to
8812736
Compare
_from_xlsx method
quis
reviewed
Jan 19, 2026
the previous Literal arrangement will probably not work very well
The approach of this parsing procedure is to minimize the extents of the region we need to "eagerly" iterate through, because accesses to undefined cells (or row/column dimensions) result in the allocation of an object in their place. However, the checks themselves must be as cheap as possible, otherwise they defeat the point. We therefore make heavy use of the (internal!) sheet._cells dictionary to discover which cells are *actually* defined in the file. We manage this with only two iterations through sheet._cells, which isn't too bad: a naive access to a sheet.max_column/max_row property costs one iteration on its own, and a call to iter_rows without a max_row/max_col specified accesses those properties behind the scenes.

The main differences in the output (see conversion examples) are that the output is now properly "rectangular" instead of having ragged line endings if final values are missing for some rows. There is also more "correct" handling of merged cells whose first cell is hidden. The header width determination also now works as intended (it was previously disabled for xlsx for performance reasons).

This also calls sleep(0) after every operation that could have taken a while for extreme files.
…ottom cell set named QQQ123456 because that's the cell that's set files all created with LibreOffice 25.8.3.2
8812736 to
ceca36e
Compare
quis
approved these changes
Jan 22, 2026
Member
quis
left a comment
Tested locally with a few random .xlsx files and it seems to handle them ok
Instead of using `openpyxl` through the extremely awkward gloves of `pyexcel` (which, among other things, causes unnecessary and expensive materialization of non-existent cells, makes it hard to detect the actual extents of cells defined in the file and has bugs in its merged cell support), use `openpyxl` directly for `xlsx` and `xlsm` files.

The approach of this parsing procedure is to minimize the extents of the region we need to "eagerly" iterate through, because accesses to undefined cells (or row/column dimensions) result in the allocation of an object in their place. However, the checks themselves must be as cheap as possible, otherwise they defeat the point. We therefore make heavy use of the (internal!) `sheet._cells` dictionary to discover which cells are actually defined in the file.

We manage this with only two iterations through `sheet._cells`, which isn't too bad: a naive access to a `sheet.max_column`/`max_row` property costs one iteration on its own, and a call to `iter_rows` without a `max_row`/`max_col` specified accesses those properties behind the scenes.

The main differences in the output (see conversion examples) are that the output is now properly "rectangular" instead of having ragged line endings if final values are missing for some rows. There is also more "correct" handling of merged cells whose first cell is hidden. The header width determination also now works as intended (it was previously disabled for xlsx for performance reasons).

This also calls `sleep(0)` after every operation that could have taken a while for extreme files.

This has also allowed me to include the test from #5686 because it passes now, correctly ignoring the stray cell.
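To illustrate the extent-detection idea described above, here is a minimal sketch. It is not the code from this PR. It uses a plain dictionary keyed by `(row, column)` as a stand-in for `sheet._cells`, mapping straight to values rather than to cell objects, and the names `defined_extents` and `rectangular_rows` are hypothetical. Starting the output at row 1, column 1 and filling gaps with an empty string are also assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple

# Stand-in for the internal sheet._cells mapping: (row, column) -> value.
# Both indices are 1-based. In openpyxl the values would be cell objects.
CellKey = Tuple[int, int]


def defined_extents(cells: Dict[CellKey, Any]) -> Optional[Tuple[int, int]]:
    """Return (max_row, max_col) over cells that hold a value, in one pass.

    Returns None when no cell holds a value.
    """
    max_row = max_col = 0
    for (row, col), value in cells.items():
        if value is None:
            continue  # an entry can exist without carrying a value
        max_row = max(max_row, row)
        max_col = max(max_col, col)
    if max_row == 0:
        return None
    return max_row, max_col


def rectangular_rows(cells: Dict[CellKey, Any], fill: Any = "") -> List[List[Any]]:
    """Build rows of equal width without creating entries for missing cells."""
    extents = defined_extents(cells)
    if extents is None:
        return []
    max_row, max_col = extents
    rows = []
    for row in range(1, max_row + 1):
        line = []
        for col in range(1, max_col + 1):
            # dict.get only reads, so undefined cells are never materialized
            value = cells.get((row, col))
            line.append(fill if value is None else value)
        rows.append(line)
    return rows


if __name__ == "__main__":
    example = {
        (1, 1): "name",
        (1, 2): "email",
        (2, 1): "alice",
        (3, 2): "c@example.com",
        (5, 4): None,  # defined but empty, so it does not widen the output
    }
    for line in rectangular_rows(example):
        print(line)
```

The point of the sketch is that every lookup is a read-only dictionary access, so the scan never grows the mapping, and every row comes out the same width even when trailing values are missing.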