embed-papers

embed-papers crawls OpenReview submissions and runs semantic search with OpenAI embeddings.

This is a helper package for my agentic research workflow. It was originally forked from gyj155/SearchPaperByEmbedding.

embed-papers supports two workflows:

For agents: a stable CLI contract (JSON stdout) that is safe to automate and parse.
For humans: a Streamlit viewer for interactive search, exploration, and positioning your work within a conference’s paper space.

Installation

Base package

pip install embed_papers

Set your API key for embeddings:

export OPENAI_API_KEY="<your-key>"

Viewer (extra dependency)

pip install "embed_papers[viewer]"

For Agents

CLI contract

stdout always prints one JSON object
stderr is reserved for logs/progress
non-zero exit codes still emit JSON on stdout

Success envelope:

{
  "ok": true,
  "schema_version": "1",
  "command": "search",
  "data": {}
}

Error envelope:

{
  "ok": false,
  "schema_version": "1",
  "command": "search",
  "error": {
    "type": "InvalidPapersFileError",
    "message": "..."
  }
}
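
An agent consuming this contract might wrap the two envelopes as below. This is only a sketch based on the fields shown above: `CliError` and `parse_envelope` are hypothetical names, not part of embed-papers.

```python
import json


class CliError(RuntimeError):
    """Hypothetical exception raised when the envelope reports ok == false."""

    def __init__(self, error_type: str, message: str) -> None:
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


def parse_envelope(stdout_text: str) -> dict:
    """Return the `data` payload of a success envelope, or raise CliError."""
    envelope = json.loads(stdout_text)
    if envelope.get("ok"):
        return envelope.get("data", {})
    # Error envelope: surface the documented `type` and `message` fields.
    error = envelope.get("error", {})
    raise CliError(error.get("type", "UnknownError"), error.get("message", ""))
```

Because stderr is reserved for logs, only stdout should be passed to `parse_envelope`.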

CLI usage

Crawl

embed-papers crawl --venue-id "ICLR.cc/2026/Conference" --skip-if-exists

By default, crawl fails when zero papers are found (to catch wrong venue ids early).

Use --skip-if-exists to reuse an existing output file and skip calling OpenReview.

If --output-file is omitted, crawl defaults to:

~/.cache/embed-papers/papers/<venue-id-slug>.json

Warm cache

export OPENAI_API_KEY="<your-key>"
embed-papers warm-cache \
  --papers-file iclr2026_papers.json \
  --venue-id "ICLR.cc/2026/Conference"

--papers-file is optional if --venue-id is provided. In that case, it defaults to ~/.cache/embed-papers/papers/<venue-id-slug>.json.

If --cache-dir is omitted, embeddings default to:

~/.cache/embed-papers/embeddings

Search

embed-papers search \
  --papers-file iclr2026_papers.json \
  --venue-id "ICLR.cc/2026/Conference" \
  --query "foundation models for planning" \
  --top-k 20

--papers-file is optional if --venue-id is provided. In that case, it defaults to ~/.cache/embed-papers/papers/<venue-id-slug>.json.

search uses the same default embeddings cache dir (~/.cache/embed-papers/embeddings) unless --cache-dir is provided.
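
For automation, the search invocation above could be assembled and run from Python roughly as follows. This is a sketch that uses only the flags documented in this section. `build_search_command` and `run_cli` are hypothetical helper names, and the sketch assumes `embed-papers` is on PATH.

```python
import json
import subprocess
from typing import List, Optional


def build_search_command(
    query: str,
    venue_id: str,
    top_k: int = 20,
    papers_file: Optional[str] = None,
    cache_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `embed-papers search`."""
    cmd = [
        "embed-papers", "search",
        "--venue-id", venue_id,
        "--query", query,
        "--top-k", str(top_k),
    ]
    # Both flags are optional; when omitted, the CLI uses its default paths.
    if papers_file is not None:
        cmd += ["--papers-file", papers_file]
    if cache_dir is not None:
        cmd += ["--cache-dir", cache_dir]
    return cmd


def run_cli(cmd: List[str]) -> dict:
    """Run the CLI and decode the single JSON object printed on stdout."""
    # No check=True: per the CLI contract, failures still emit JSON on stdout.
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(completed.stdout)
```

A call such as `run_cli(build_search_command("foundation models for planning", "ICLR.cc/2026/Conference"))` would then return the envelope as a dict, with `ok` indicating success or failure.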

For Humans

Make sure OPENAI_API_KEY is set in your shell, then run:

embed-papers host

This launches a local Streamlit UI in your browser for interactive use.

Viewer flow:

enter a conference abbreviation and year (the venue id is built automatically)
choose between a direct query and an examples upload
set top-k and run the search
papers are crawled automatically if missing
the embeddings cache is built automatically if missing

Cache directories used by viewer:

~/.cache/embed-papers/papers
~/.cache/embed-papers/embeddings
~/.cache/embed-papers/atlas

Python API

1) Crawl conference papers

from embed_papers import crawl_papers

_ = crawl_papers(
    venue_id="ICLR.cc/2026/Conference",
    output_file="iclr2026_papers.json",
)

2) Warm cache / search

from embed_papers import PaperSearcher

searcher = PaperSearcher(
    papers_file="iclr2026_papers.json",
    venue_id="ICLR.cc/2026/Conference",
    model_name="text-embedding-3-large",
)

searcher.ensure_embeddings()
results = searcher.search(query="robotics planning language model", top_k=100)
searcher.display(results, n=10, show_abstract=True, abstract_max_chars=500)
searcher.save(results, "results.json")
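
A script can approximate the viewer's crawl-if-missing behaviour by checking for the papers file before crawling. The sketch below is an assumption about how one might wire this up, not the package's own logic: `ensure_papers_file` and `crawl_with_embed_papers` are hypothetical helpers, and only `crawl_papers` with its `venue_id` and `output_file` arguments comes from the API shown above.

```python
from pathlib import Path
from typing import Callable


def ensure_papers_file(papers_file: Path, crawl: Callable[[Path], object]) -> bool:
    """Call `crawl` only when `papers_file` is missing.

    Returns True if a crawl was triggered, False if the file already existed.
    """
    if papers_file.exists():
        return False
    papers_file.parent.mkdir(parents=True, exist_ok=True)
    crawl(papers_file)
    return True


def crawl_with_embed_papers(papers_file: Path) -> None:
    """Crawl ICLR 2026 via embed_papers into `papers_file`."""
    # Imported lazily so ensure_papers_file works without the package installed.
    from embed_papers import crawl_papers

    crawl_papers(
        venue_id="ICLR.cc/2026/Conference",
        output_file=str(papers_file),
    )
```

For example, `ensure_papers_file(Path("iclr2026_papers.json"), crawl_with_embed_papers)` would crawl on the first run and reuse the existing file afterwards.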
