Improve the performance of table-extraction by judging whether to do … #4797
…"make_chars" or "make_edges" by checking strategy
Any updates? This is much needed.
Hello @JorjMcKie @julian-smith-artifex-com, would you kindly look at this PR and share your comments and suggestions?
I've found that the strategy check has to be made where both functions live, in "pymupdf.tables.py". At first I thought I could speed things up without changing the package by writing a new function in my own code, like the test code in my PR description, but that made it slower instead. When I change the function directly in "table.py", it becomes much faster. So please have a look at this PR, thanks! In the source code, we can see that
Sorry for the delay in looking into this, and thanks again for considering ways to improve things. The fact that CHARS are needed for extracting cell text is reason enough to always extract them. In reality, nobody cares about a table without intending to extract its text. In addition, I dislike adding another global variable (I regret having started doing so in the first place). Far more logical would be to check whether the list is still empty. Apart from all that, the CHARS array is also needed when identifying the table header. A similar thought applies to the edges: This means that people won't use "text" unless there are no edges, in which case trying to extract edges won't cost time anyway.
Thank you for your kind reply! In fact, I need to generate bookmarks from PDF contents, and in some cases it's hard to judge whether two lines of content should be merged, as they may have the same size and boldness. So I extract the bboxes of all table cells: if the two lines are in the same cell, they should be merged; otherwise they shouldn't. I don't need the text in this scenario. But I know that my request is rare. Anyway, thank you for reviewing this PR. I'll close it.
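The merge rule described here can be sketched roughly as follows. This is a minimal illustration, assuming cell and line bboxes are plain `(x0, y0, x1, y1)` tuples; the helper names (`contains`, `cell_index`, `should_merge`) and the tolerance `tol` are hypothetical and not taken from the PR, and how the cell bboxes are obtained is left out.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(cell: BBox, line: BBox, tol: float = 1.0) -> bool:
    """True if the line bbox lies inside the cell bbox, allowing a small tolerance."""
    cx0, cy0, cx1, cy1 = cell
    lx0, ly0, lx1, ly1 = line
    return (
        lx0 >= cx0 - tol
        and ly0 >= cy0 - tol
        and lx1 <= cx1 + tol
        and ly1 <= cy1 + tol
    )


def cell_index(cells: List[BBox], line: BBox, tol: float = 1.0) -> Optional[int]:
    """Index of the first cell containing the line, or None if no cell does."""
    for i, cell in enumerate(cells):
        if contains(cell, line, tol):
            return i
    return None


def should_merge(cells: List[BBox], line_a: BBox, line_b: BBox, tol: float = 1.0) -> bool:
    """Merge two text lines only when both fall inside the same table cell."""
    a = cell_index(cells, line_a, tol)
    b = cell_index(cells, line_b, tol)
    return a is not None and a == b
```

Lines that fall outside every cell are never merged by this rule, which matches the "otherwise they shouldn't" case.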
Thanks a lot for your understanding!
Motivation
Hello, I used viztracer to run some table-extraction tests and found that make_chars and make_edges are always run, whatever the strategy is, which wastes a lot of time. So I changed the code to run make_chars only if the strategies include "text", and to run make_edges only if the strategies include "lines" or "lines_strict".
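The gating idea can be illustrated with a small standalone sketch. It is not the code from this PR: `plan_extraction` and `run_extraction` are hypothetical names, the two callables stand in for make_chars and make_edges, and the parameter names `vertical_strategy` / `horizontal_strategy` are assumed to mirror the find_tables arguments.

```python
from typing import Callable, Dict, List

LINE_STRATEGIES = {"lines", "lines_strict"}


def plan_extraction(
    vertical_strategy: str = "lines",
    horizontal_strategy: str = "lines",
) -> Dict[str, bool]:
    """Decide which preparation steps the chosen strategies actually need."""
    strategies = {vertical_strategy, horizontal_strategy}
    return {
        "make_chars": "text" in strategies,
        "make_edges": bool(strategies & LINE_STRATEGIES),
    }


def run_extraction(
    plan: Dict[str, bool],
    make_chars: Callable[[], None],
    make_edges: Callable[[], None],
) -> List[str]:
    """Run only the steps the plan asks for and report which ones ran."""
    executed = []
    if plan["make_chars"]:
        make_chars()
        executed.append("make_chars")
    if plan["make_edges"]:
        make_edges()
        executed.append("make_edges")
    return executed
```

With the default "lines" strategy in both directions, only the edge step would run in this sketch; mixing "lines" and "text" runs both, so nothing is saved in that case.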
Below is my test code
From result.json (viewed with "vizviewer result.json"), we can see that performance improves considerably when the "lines" and "text" strategies are not used together.

"result.json" is too large to upload, so I am pasting a picture here. The old find_tables takes about 190 ms and the new find_tables takes about 101 ms, both using the default "lines" strategy.