feat: remove PDF pipeline, add query expansion and foundational papers by spignotti · Pull Request #14 · spignotti/litresearch

spignotti · 2026-05-10T17:33:49Z

Summary

This PR simplifies the pipeline by removing PDF-related logic and reinvests that complexity into stronger discovery features.

PDF Removal

Real-world testing showed that abstracts and metadata alone drive good results, while PDF downloads rarely succeeded (1-2 PDFs per run). Removing the PDF pipeline:

Eliminates --inject-pdfs and --stop-after-screening CLI flags
Removes pdf.py, pypdf dependency, and all PDF state tracking
Simplifies analysis to abstract-only screening and deep analysis
Reduces config surface area and failure modes

Iterative Query Expansion

After initial discovery, a new stage analyzes top candidate abstracts and asks the LLM to identify underexplored directions. It generates 1-2 additional search queries, and the pipeline automatically runs a second discovery+enrichment round.

Configurable via enable_query_expansion (default: true) and max_expansion_queries
New prompt: query_expansion.md
New stage: stages/query_expansion.py
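
As a rough illustration of the expansion step described above, here is a minimal sketch. It is not the code in stages/query_expansion.py: the function name `expand_queries`, the `ask_llm` callable, the prompt wording, and the one-query-per-line response format are all assumptions made for the example. Only `max_expansion_queries` comes from the PR.

```python
import re
from typing import Callable, Sequence

# Strips a leading list marker such as "- ", "* ", "1. " or "2) ".
_LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def expand_queries(
    topic: str,
    abstracts: Sequence[str],
    existing_queries: Sequence[str],
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, raw text out
    max_expansion_queries: int = 2,
) -> list[str]:
    """Return up to max_expansion_queries queries not already searched."""
    # Hypothetical prompt; the real one lives in query_expansion.md.
    prompt = (
        f"Research topic: {topic}\n\n"
        "Abstracts of the top candidates so far:\n"
        + "\n".join(f"- {a}" for a in abstracts)
        + "\n\nList search queries for underexplored directions, one per line."
    )
    seen = {q.strip().lower() for q in existing_queries}
    new_queries: list[str] = []
    for line in ask_llm(prompt).splitlines():
        if len(new_queries) >= max_expansion_queries:
            break  # respect the configured cap
        query = _LIST_MARKER.sub("", line).strip()
        key = query.lower()
        if not query or key in seen:
            continue  # skip blank lines and queries already run
        seen.add(key)
        new_queries.append(query)
    return new_queries
```

The returned queries would then feed the second discovery+enrichment round. Passing the LLM call in as a plain callable keeps the sketch testable without network access.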

Foundational Paper Detection

The citation expansion stage now also counts references that are already in the candidate set. Papers cited by the largest number of top-ranked papers are surfaced as "foundational": the key works you should always cite.

Configurable via enable_foundational_detection (default: true) and foundational_papers_count
Added to export report as a dedicated section
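
The counting idea can be sketched as follows. This is an illustration under assumptions, not the stage's actual code: the function name `find_foundational`, the ID-keyed `references` mapping, and the tie-breaking rule are hypothetical. Only `foundational_papers_count` comes from the PR.

```python
from collections import Counter
from typing import Mapping, Sequence


def find_foundational(
    top_ranked_ids: Sequence[str],
    references: Mapping[str, Sequence[str]],  # paper ID -> IDs it cites
    candidate_ids: set[str],
    foundational_papers_count: int = 5,
) -> list[tuple[str, int]]:
    """Rank in-set papers by how many top-ranked papers cite them."""
    counts: Counter[str] = Counter()
    for paper_id in top_ranked_ids:
        # set() so a duplicated reference entry counts once per citing paper
        for ref_id in set(references.get(paper_id, ())):
            # Only references already in the candidate set are counted.
            if ref_id in candidate_ids and ref_id != paper_id:
                counts[ref_id] += 1
    # Highest count first; ties broken by ID so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:foundational_papers_count]
```

Counting distinct citing papers rather than raw reference entries keeps one paper with a duplicated bibliography entry from inflating the score.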

Model Update

Default model upgraded from openai/gpt-4o-mini to openai/gpt-5.4-mini.

Validation

uv run nox passes: lint, typecheck, test (57/57)

Breaking Changes

CLI flags removed: --inject-pdfs, --stop-after-screening
Config settings removed: pdf_first_pages, pdf_last_pages, pdf_extraction_mode, pdf_token_budget, abstract_fallback, inject_pdf_dir
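
When upgrading, it may help to check an existing setup for the removed names. The sketch below is a hypothetical helper, not part of litresearch: it assumes your settings are available as a plain dict and your CLI arguments as a list of strings. The setting and flag names are the ones listed above.

```python
from typing import Mapping, Sequence

# Names removed by this PR (taken from the lists above).
REMOVED_SETTINGS = (
    "pdf_first_pages",
    "pdf_last_pages",
    "pdf_extraction_mode",
    "pdf_token_budget",
    "abstract_fallback",
    "inject_pdf_dir",
)
REMOVED_FLAGS = ("--inject-pdfs", "--stop-after-screening")


def find_removed_settings(config: Mapping[str, object]) -> list[str]:
    """Return removed setting names still present in a config mapping."""
    return [key for key in REMOVED_SETTINGS if key in config]


def find_removed_flags(argv: Sequence[str]) -> list[str]:
    """Return removed CLI flags found in an argument list."""
    # Split on "=" so "--flag=value" is matched by its flag name.
    return [arg for arg in argv if arg.split("=", 1)[0] in REMOVED_FLAGS]
```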

Remove PDF download/extraction logic across the entire codebase:
- Delete pdf.py module and pypdf dependency
- Remove --inject-pdfs and --stop-after-screening CLI flags
- Remove PauseForPDFsError and all PDF-related state tracking
- Simplify analysis to abstract-only screening and analysis
- Update prompts to remove PDF references

Add iterative query expansion:
- New query_expansion stage analyzes initial candidates
- Generates 1-2 additional search queries for underexplored directions
- Pipeline runs a second discovery+enrichment round automatically
- Configurable via enable_query_expansion and max_expansion_queries

Add foundational paper detection:
- Citation expansion now tracks in-set references
- Identifies papers most cited by other top-ranked papers
- Surfaces top 5 foundational papers in export report

Also update default model from gpt-4o-mini to gpt-5.4-mini.

spignotti and others added 2 commits May 10, 2026 19:33

Merge branch 'main' into feat/remove-pdf-add-discovery-enhancements

c134ecc

spignotti merged commit 5446621 into main May 10, 2026
2 checks passed

spignotti deleted the feat/remove-pdf-add-discovery-enhancements branch May 10, 2026 20:59
