feat: remove PDF pipeline, add query expansion and foundational papers #14
Merged
Remove PDF download/extraction logic across the entire codebase:
- Delete pdf.py module and pypdf dependency
- Remove --inject-pdfs and --stop-after-screening CLI flags
- Remove PauseForPDFsError and all PDF-related state tracking
- Simplify analysis to abstract-only screening and analysis
- Update prompts to remove PDF references

Add iterative query expansion:
- New query_expansion stage analyzes initial candidates
- Generates 1-2 additional search queries for underexplored directions
- Pipeline runs a second discovery+enrichment round automatically
- Configurable via enable_query_expansion and max_expansion_queries

Add foundational paper detection:
- Citation expansion now tracks in-set references
- Identifies papers most cited by other top-ranked papers
- Surfaces top 5 foundational papers in export report

Also update default model from gpt-4o-mini to gpt-5.4-mini.
Summary
This PR simplifies the pipeline by removing PDF-related logic and reinvests that complexity into stronger discovery features.
PDF Removal
Real-world testing showed that abstracts and metadata alone drive good results, while PDF downloads rarely succeeded (1-2 PDFs per run). Removing the PDF pipeline:
- Removes the --inject-pdfs and --stop-after-screening CLI flags
- Deletes pdf.py, the pypdf dependency, and all PDF state tracking

Iterative Query Expansion
After initial discovery, a new stage analyzes top candidate abstracts and asks the LLM to identify underexplored directions. It generates 1-2 additional search queries, and the pipeline automatically runs a second discovery+enrichment round.
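The two-round flow described above can be sketched roughly as follows. This is an illustrative sketch, not the code in stages/query_expansion.py: `Candidate`, `ExpansionConfig`, `run_discovery_with_expansion`, the `discover` and `propose_queries` callables, and the `top_n` cutoff are all hypothetical names, and the default of 2 for `max_expansion_queries` is an assumption based on the "1-2 queries" wording.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical minimal paper record used only for this sketch."""
    paper_id: str
    title: str
    abstract: str


@dataclass
class ExpansionConfig:
    enable_query_expansion: bool = True
    max_expansion_queries: int = 2  # assumed default; the PR only says "1-2"


def run_discovery_with_expansion(
    query: str,
    discover: Callable[[str], Sequence[Candidate]],
    propose_queries: Callable[[str, Sequence[Candidate]], Sequence[str]],
    config: ExpansionConfig,
    top_n: int = 10,
) -> List[Candidate]:
    """Run one discovery round, then an optional second round on expanded queries."""
    # Round 1: initial discovery, deduplicated by paper_id (insertion order kept).
    candidates = {c.paper_id: c for c in discover(query)}
    if not config.enable_query_expansion:
        return list(candidates.values())

    # Ask the LLM-backed callable for underexplored directions, based on the
    # abstracts of the first top_n candidates, and cap the number of queries.
    top = list(candidates.values())[:top_n]
    extra_queries = list(propose_queries(query, top))[: config.max_expansion_queries]

    # Round 2: rerun discovery for each extra query and merge new papers only.
    for extra in extra_queries:
        for cand in discover(extra):
            candidates.setdefault(cand.paper_id, cand)
    return list(candidates.values())
```

Passing `discover` and `propose_queries` in as callables keeps the sketch independent of any particular search API or LLM client, so it can be exercised with stubs.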
- Configurable via enable_query_expansion (default: true) and max_expansion_queries
- query_expansion.md
- stages/query_expansion.py

Foundational Paper Detection
The citation expansion stage now also counts references that are already in the candidate set. Papers cited by the most top-ranked papers are surfaced as "foundational" — the key works you should always cite.
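One way to picture the in-set reference counting is the sketch below. It is a simplified illustration, not the PR's implementation: `find_foundational` and its parameters are hypothetical names, and the default of 5 simply mirrors the "top 5" mentioned in the commit message.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def find_foundational(
    references: Dict[str, Iterable[str]],
    top_ranked: Sequence[str],
    candidate_ids: Set[str],
    count: int = 5,  # assumed to correspond to foundational_papers_count
) -> List[Tuple[str, int]]:
    """Return (paper_id, citing_paper_count) pairs, most cited first."""
    in_set_citations: Counter = Counter()
    for paper_id in top_ranked:
        # set() so a paper that lists the same reference twice counts once.
        for ref in set(references.get(paper_id, [])):
            # Only count references already in the candidate set, and
            # ignore self-citations.
            if ref in candidate_ids and ref != paper_id:
                in_set_citations[ref] += 1
    # Sort by descending count, then by id so ties are deterministic.
    ranked = sorted(in_set_citations.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:count]
```

Counting citing papers rather than raw citation occurrences keeps one paper with a duplicated reference list from inflating a score.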
- Configurable via enable_foundational_detection (default: true) and foundational_papers_count

Model Update
Default model upgraded from openai/gpt-4o-mini to openai/gpt-5.4-mini.

Validation
uv run nox passes: lint, typecheck, test (57/57)

Breaking Changes
- Removed CLI flags: --inject-pdfs, --stop-after-screening
- Removed config fields: pdf_first_pages, pdf_last_pages, pdf_extraction_mode, pdf_token_budget, abstract_fallback, inject_pdf_dir
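
For anyone upgrading, a small check like the one below could flag settings that no longer exist. It is only a sketch: `find_stale_settings` is a hypothetical helper that is not part of this PR, and it assumes the configuration is available as a plain dict and the command line as a list of arguments.

```python
from typing import Any, Dict, List

# Names removed by this PR, copied from the breaking-changes list.
REMOVED_FIELDS = frozenset({
    "pdf_first_pages",
    "pdf_last_pages",
    "pdf_extraction_mode",
    "pdf_token_budget",
    "abstract_fallback",
    "inject_pdf_dir",
})
REMOVED_FLAGS = ("--inject-pdfs", "--stop-after-screening")


def find_stale_settings(config: Dict[str, Any], argv: List[str]) -> List[str]:
    """Return removed config keys (sorted) followed by removed CLI arguments."""
    stale = sorted(key for key in config if key in REMOVED_FIELDS)
    # Match both "--flag" and "--flag=value" forms.
    stale += [arg for arg in argv if arg.split("=", 1)[0] in REMOVED_FLAGS]
    return stale
```

Running it against an existing config and invocation before upgrading gives a short list of entries to delete.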