fix: Improve PDF parser robustness and efficiency #138
Open
Dhruv-Sharma01 wants to merge 1 commit into
You messed up the formatting; use Black.
Author
Check it.
👍🏻
Closes #137
Changes Implemented
This PR introduces a more robust, two-stage parsing architecture. Below are the specific changes made to each file.
Replaced IdentifyHeaders class with AdvancedHeaderIdentifier: The new class uses a multi-heuristic scoring system (boldness, all-caps, relative font size) to detect section headers, making it more resilient to different resume styles.
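A minimal sketch of what a multi-heuristic header score could look like. This is not the PR's AdvancedHeaderIdentifier: the Span type, the equal weights, the 1.15 size ratio, and the threshold of 2.0 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    """Hypothetical stand-in for one line of extracted text."""
    text: str
    size: float  # font size in points
    bold: bool


def header_score(span, body_size):
    """Add one point per heuristic that fires (weights are assumed)."""
    score = 0.0
    if span.bold:
        score += 1.0
    letters = [c for c in span.text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        score += 1.0  # all-caps text
    if span.size >= 1.15 * body_size:
        score += 1.0  # noticeably larger than the body font
    return score


def find_headers(spans, threshold=2.0):
    """Return the text of spans whose score reaches the threshold."""
    # The median font size serves as the "body" size for the relative check.
    body_size = median(s.size for s in spans)
    return [s.text for s in spans if header_score(s, body_size) >= threshold]
```

Requiring two of the three signals means a bold but body-sized line (for example, a bolded date) is not promoted to a header on its own.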
Enforced Linear Reading Order: Added a sort for the text_rects list (text_rects.sort(key=lambda r: (r.y0, r.x0))) immediately after text blocks are identified. This ensures a strict top-to-bottom, left-to-right processing flow, fixing parsing errors on multi-column layouts.
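To show what that sort key does, here is a small self-contained example. The Rect namedtuple is a hypothetical stand-in for whatever rectangle type the parser uses; only the x0 and y0 attributes are taken from the PR description.

```python
from collections import namedtuple

# Hypothetical rectangle type exposing the attributes used by the sort key.
Rect = namedtuple("Rect", ["x0", "y0", "x1", "y1"])

text_rects = [
    Rect(300, 50, 550, 80),   # right block on the first row
    Rect(50, 120, 550, 200),  # full-width block lower on the page
    Rect(50, 50, 280, 80),    # left block on the first row
]

# Same key as in the PR: top edge first, then left edge.
text_rects.sort(key=lambda r: (r.y0, r.x0))

order = [(r.x0, r.y0) for r in text_rects]
```

Because the key is a tuple, x0 only breaks ties between blocks whose y0 values are exactly equal. Whether the parser needs to round y0 to tolerate small vertical offsets is not stated in this PR.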
Added _split_markdown_by_headers method: This new helper function provides a deterministic way to pre-process the Markdown text, splitting it into a dictionary of sections based on ## headers before any LLM calls are made.
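A rough sketch of how such a splitter might work, written as a standalone function. The regex, the decision to drop text before the first ## header, and the handling of repeated titles are assumptions, not the PR's actual code.

```python
import re

# Matches level-2 headers only ("## Title"); "### Title" does not match.
HEADER_RE = re.compile(r"^##\s+(.*\S)\s*$")


def split_markdown_by_headers(markdown):
    """Split Markdown into {header title: section body}."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            current = match.group(1)
            # Assumption: a repeated title replaces the earlier section.
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Lines before the first header are dropped in this sketch.
    return {title: "\n".join(body).strip() for title, body in sections.items()}
```

Since the split is pure string processing, the same input always produces the same sections, which makes it easy to unit test without involving the LLM.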
Refactored _extract_all_sections_separately method: The original logic made multiple LLM calls that each sent the full document. The new implementation first uses _split_markdown_by_headers to split the Markdown into sections, then sends the LLM only the text of the section being analyzed. Each call therefore carries a smaller input and is scoped to a single section.
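The per-section flow can be sketched roughly as follows. The call_llm parameter is a hypothetical callable standing in for the project's LLM client, and the prompt wording is invented for illustration; neither comes from the PR.

```python
def extract_sections_separately(sections, call_llm):
    """Send each non-empty section to the LLM on its own.

    sections: dict mapping header title to section text.
    call_llm: hypothetical callable taking a prompt string and returning
              the model's response.
    """
    results = {}
    for title, text in sections.items():
        if not text.strip():
            continue  # skip empty sections instead of spending a call on them
        # Only this section's text goes into the prompt, not the whole document.
        prompt = (
            f"Extract structured data from the '{title}' section of a resume:\n\n"
            f"{text}"
        )
        results[title] = call_llm(prompt)
    return results
```

Passing the client in as a callable keeps the function testable with a stub, so the splitting and dispatch logic can be checked without network access.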