feat: remove PDF pipeline, add query expansion and foundational papers #14

Merged: spignotti merged 2 commits into main from feat/remove-pdf-add-discovery-enhancements on May 10, 2026

@spignotti
Owner

Summary

This PR simplifies the pipeline by removing PDF-related logic and reinvests that complexity into stronger discovery features.

PDF Removal

Real-world testing showed that abstracts and metadata alone drive good results, while PDF downloads rarely succeeded (1-2 PDFs per run). Removing the PDF pipeline:

  • Eliminates --inject-pdfs and --stop-after-screening CLI flags
  • Removes pdf.py, pypdf dependency, and all PDF state tracking
  • Simplifies analysis to abstract-only screening and deep analysis
  • Reduces config surface area and failure modes

Iterative Query Expansion

After initial discovery, a new stage analyzes top candidate abstracts and asks the LLM to identify underexplored directions. It generates 1-2 additional search queries, and the pipeline automatically runs a second discovery+enrichment round.

  • Configurable via enable_query_expansion (default: true) and max_expansion_queries
  • New prompt: query_expansion.md
  • New stage: stages/query_expansion.py
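
A minimal sketch of how such an expansion step could work, not the actual contents of stages/query_expansion.py. The `Candidate` type, the `ask_llm` callable, the `top_n` parameter, and the prompt wording are all assumptions made for illustration; only `max_expansion_queries` comes from this PR.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    # Hypothetical shape of a discovered paper; the real pipeline may differ.
    title: str
    abstract: str
    score: float  # assumed relevance score from the first discovery round


def expand_queries(
    candidates: list[Candidate],
    original_queries: list[str],
    ask_llm: Callable[[str], str],  # assumed: takes a prompt, returns raw text
    max_expansion_queries: int = 2,
    top_n: int = 10,
) -> list[str]:
    """Return up to max_expansion_queries new, non-duplicate search queries."""
    top = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_n]
    abstracts = "\n\n".join(f"{c.title}\n{c.abstract}" for c in top)
    prompt = (
        "Existing queries:\n" + "\n".join(original_queries)
        + "\n\nTop candidate abstracts:\n" + abstracts
        + "\n\nList search queries for underexplored directions, one per line."
    )

    seen = {q.strip().lower() for q in original_queries}
    new_queries: list[str] = []
    for line in ask_llm(prompt).splitlines():
        # Drop list markers such as "- " or "1. " that LLMs often prepend.
        query = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        key = query.lower()
        if query and key not in seen:
            seen.add(key)
            new_queries.append(query)
        if len(new_queries) >= max_expansion_queries:
            break
    return new_queries
```

The returned queries would then feed the second discovery+enrichment round. Deduplicating against the original queries avoids spending that round on searches already run.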

Foundational Paper Detection

The citation expansion stage now also counts references that point to papers already in the candidate set. Papers cited by the largest number of top-ranked papers are surfaced as "foundational": the key works you should always cite.

  • Configurable via enable_foundational_detection (default: true) and foundational_papers_count
  • Added to export report as a dedicated section
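
The counting described above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the function name, the `references` mapping (paper id to the ids it cites), and the `top_k` cutoff are hypothetical; `foundational_papers_count` is the config setting named above.

```python
from collections import Counter


def find_foundational(
    ranked_ids: list[str],             # candidate ids, best-ranked first
    references: dict[str, list[str]],  # assumed: paper id -> ids it cites
    top_k: int = 20,                   # assumed cutoff for "top-ranked"
    foundational_papers_count: int = 5,
) -> list[tuple[str, int]]:
    """Return (paper_id, citing_count) pairs for the most-cited in-set papers."""
    candidate_set = set(ranked_ids)
    counts: Counter[str] = Counter()
    for paper_id in ranked_ids[:top_k]:
        # set() so a paper listing the same reference twice counts once.
        for ref in set(references.get(paper_id, [])):
            # Only count references that are themselves candidates.
            if ref in candidate_set and ref != paper_id:
                counts[ref] += 1
    # Sort by count descending, then by id for a deterministic order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:foundational_papers_count]
```

Counting distinct citing papers rather than raw reference occurrences keeps one paper with a duplicated reference list from inflating the result.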

Model Update

Default model upgraded from openai/gpt-4o-mini to openai/gpt-5.4-mini.

Validation

  • uv run nox passes: lint, typecheck, test (57/57)

Breaking Changes

  • CLI flags removed: --inject-pdfs, --stop-after-screening
  • Config settings removed: pdf_first_pages, pdf_last_pages, pdf_extraction_mode, pdf_token_budget, abstract_fallback, inject_pdf_dir
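
For anyone upgrading, a small check like the one below could flag stale settings in an existing config. It is a hypothetical helper, not part of this PR, and it assumes the config has already been loaded into a flat dict; the setting names are the ones listed above.

```python
# Settings removed by this PR, as listed under Breaking Changes.
REMOVED_SETTINGS = (
    "pdf_first_pages",
    "pdf_last_pages",
    "pdf_extraction_mode",
    "pdf_token_budget",
    "abstract_fallback",
    "inject_pdf_dir",
)


def find_removed_settings(config: dict) -> list[str]:
    """Return the removed setting names still present in a loaded config dict."""
    return [name for name in REMOVED_SETTINGS if name in config]
```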

spignotti and others added 2 commits May 10, 2026 19:33
Remove PDF download/extraction logic across the entire codebase:
- Delete pdf.py module and pypdf dependency
- Remove --inject-pdfs and --stop-after-screening CLI flags
- Remove PauseForPDFsError and all PDF-related state tracking
- Simplify analysis to abstract-only screening and analysis
- Update prompts to remove PDF references

Add iterative query expansion:
- New query_expansion stage analyzes initial candidates
- Generates 1-2 additional search queries for underexplored directions
- Pipeline runs a second discovery+enrichment round automatically
- Configurable via enable_query_expansion and max_expansion_queries

Add foundational paper detection:
- Citation expansion now tracks in-set references
- Identifies papers most cited by other top-ranked papers
- Surfaces top 5 foundational papers in export report

Also update default model from gpt-4o-mini to gpt-5.4-mini.
@spignotti spignotti merged commit 5446621 into main May 10, 2026
2 checks passed
@spignotti spignotti deleted the feat/remove-pdf-add-discovery-enhancements branch May 10, 2026 20:59