phira-ai/Embed-Papers

embed-papers

embed-papers crawls OpenReview submissions and runs semantic search over them with OpenAI embeddings.

This is a helper package for my agentic research workflow. It was originally forked from gyj155/SearchPaperByEmbedding.

embed-papers supports two workflows:

  • For agents: a stable CLI contract (JSON stdout) that is safe to automate and parse.
  • For humans: a Streamlit viewer for interactive search, exploration, and positioning your work within a conference’s paper space.

Installation

Base package

pip install embed_papers

Set your API key for embeddings:

export OPENAI_API_KEY="<your-key>"

Viewer (extra dependency)

pip install "embed_papers[viewer]"

For Agents

CLI contract

  • stdout always prints one JSON object
  • stderr is reserved for logs/progress
  • non-zero exit codes still emit JSON on stdout

Success envelope:

{
  "ok": true,
  "schema_version": "1",
  "command": "search",
  "data": {}
}

Error envelope:

{
  "ok": false,
  "schema_version": "1",
  "command": "search",
  "error": {
    "type": "InvalidPapersFileError",
    "message": "..."
  }
}
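
A minimal sketch of how an agent might consume this contract from Python. The names `CliError`, `parse_envelope`, and `run_cli` are hypothetical helpers, not part of the package. The sketch relies only on the envelope fields shown above and makes no assumption about the shape of `data`.

```python
import json
import subprocess


class CliError(RuntimeError):
    """Raised when the envelope reports ok=false (hypothetical helper)."""

    def __init__(self, error_type, message):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


def parse_envelope(stdout_text):
    """Return the `data` payload of a success envelope, or raise CliError."""
    envelope = json.loads(stdout_text)
    if envelope.get("ok"):
        return envelope.get("data", {})
    error = envelope.get("error", {})
    raise CliError(error.get("type", "UnknownError"), error.get("message", ""))


def run_cli(args):
    """Run `embed-papers <args>` and parse its stdout.

    stderr is not captured, so logs and progress pass through to the terminal.
    check=False because non-zero exits still emit JSON on stdout.
    """
    proc = subprocess.run(
        ["embed-papers", *args],
        stdout=subprocess.PIPE,
        text=True,
        check=False,
    )
    return parse_envelope(proc.stdout)
```

Branching on `ok` rather than on the exit code keeps the caller on one code path for both success and failure, which is what the contract above makes possible.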

CLI usage

Crawl

embed-papers crawl --venue-id "ICLR.cc/2026/Conference" --skip-if-exists

By default, crawl fails when zero papers are found (to catch wrong venue ids early).

Use --skip-if-exists to reuse an existing output file and skip calling OpenReview.

If --output-file is omitted, crawl defaults to:

  • ~/.cache/embed-papers/papers/<venue-id-slug>.json

Warm cache

export OPENAI_API_KEY="<your-key>"
embed-papers warm-cache \
  --papers-file iclr2026_papers.json \
  --venue-id "ICLR.cc/2026/Conference"

--papers-file is optional if --venue-id is provided. In that case, it defaults to ~/.cache/embed-papers/papers/<venue-id-slug>.json.

If --cache-dir is omitted, embeddings default to:

  • ~/.cache/embed-papers/embeddings

Search

embed-papers search \
  --papers-file iclr2026_papers.json \
  --venue-id "ICLR.cc/2026/Conference" \
  --query "foundation models for planning" \
  --top-k 20

--papers-file is optional if --venue-id is provided. In that case, it defaults to ~/.cache/embed-papers/papers/<venue-id-slug>.json.

search uses the same default embeddings cache dir (~/.cache/embed-papers/embeddings) unless --cache-dir is provided.
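
Because all three commands fall back to the same default paths when only --venue-id is given, an agent could chain them without passing any file paths. The sketch below only builds the argument lists, using the flags documented above. `build_commands` is a hypothetical helper, and running the commands is left to the caller.

```python
import shlex


def build_commands(venue_id, query, top_k=20):
    """Return argv lists for crawl, warm-cache, and search, in that order.

    Only --venue-id is passed for file locations, so each step relies on the
    default cache paths described in this README.
    """
    return [
        ["embed-papers", "crawl", "--venue-id", venue_id, "--skip-if-exists"],
        ["embed-papers", "warm-cache", "--venue-id", venue_id],
        [
            "embed-papers", "search",
            "--venue-id", venue_id,
            "--query", query,
            "--top-k", str(top_k),
        ],
    ]


# Print the commands in a copy-pasteable, shell-quoted form.
for argv in build_commands("ICLR.cc/2026/Conference", "foundation models for planning"):
    print(shlex.join(argv))
```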

For Humans

Make sure OPENAI_API_KEY is set in your shell, then run:

embed-papers host

This launches a local Streamlit UI in your browser for interactive use.

Viewer flow:

  • enter conference abbreviation + year (auto-builds venue id)
  • choose direct query or examples upload
  • set top-k and run search
  • auto-crawl papers if missing
  • auto-build embeddings cache if missing
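
The first step of the flow, turning an abbreviation and a year into a venue id, might look roughly like the sketch below. This is an illustration only: `VENUE_PREFIXES` and `build_venue_id` are hypothetical, the viewer's actual mapping is not documented here, and only the ICLR pattern appears in this README. The other entries are assumed to follow the same OpenReview naming pattern.

```python
# Hypothetical lookup table; only the ICLR entry is confirmed by this README.
VENUE_PREFIXES = {
    "ICLR": "ICLR.cc",
    "NEURIPS": "NeurIPS.cc",  # assumption: same pattern as ICLR
    "ICML": "ICML.cc",        # assumption: same pattern as ICLR
}


def build_venue_id(abbreviation, year):
    """Build a venue id such as 'ICLR.cc/2026/Conference'."""
    key = abbreviation.strip().upper()
    if key not in VENUE_PREFIXES:
        raise ValueError(f"unknown conference abbreviation: {abbreviation!r}")
    return f"{VENUE_PREFIXES[key]}/{int(year)}/Conference"


print(build_venue_id("iclr", 2026))
```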

Cache directories used by viewer:

  • ~/.cache/embed-papers/papers
  • ~/.cache/embed-papers/embeddings
  • ~/.cache/embed-papers/atlas

Python API

1) Crawl conference papers

from embed_papers import crawl_papers

_ = crawl_papers(
    venue_id="ICLR.cc/2026/Conference",
    output_file="iclr2026_papers.json",
)

2) Warm cache / search

from embed_papers import PaperSearcher

searcher = PaperSearcher(
    papers_file="iclr2026_papers.json",
    venue_id="ICLR.cc/2026/Conference",
    model_name="text-embedding-3-large",
)

searcher.ensure_embeddings()
results = searcher.search(query="robotics planning language model", top_k=100)
searcher.display(results, n=10, show_abstract=True, abstract_max_chars=500)
searcher.save(results, "results.json")
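
For intuition, embedding-based search generally ranks documents by how close their vectors are to the query vector. The sketch below shows that idea with cosine similarity on toy vectors. It is not the package's implementation: whether PaperSearcher uses cosine similarity is an assumption, and `cosine_similarity` and `rank_by_similarity` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, paper_vecs, top_k):
    """Return the top_k (title, score) pairs, highest similarity first."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in paper_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D vectors standing in for real embeddings.
papers = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
for title, score in rank_by_similarity([1.0, 0.0], papers, top_k=2):
    print(title, round(score, 3))
```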
