34 changes: 31 additions & 3 deletions README.md
@@ -17,12 +17,12 @@
[badge-zenodo]: https://zenodo.org/badge/899554552.svg


🧬 CellAnnotator is an [scverse ecosystem package](https://scverse.org/packages/#ecosystem), designed to annotate cell types in scRNA-seq data based on marker genes using large language models (LLMs). It supports OpenAI, Google Gemini, and Anthropic Claude models out of the box, with more providers planned for the future.
🧬 CellAnnotator is an [scverse ecosystem package](https://scverse.org/packages/#ecosystem), designed to annotate cell types in scRNA-seq data based on marker genes using large language models (LLMs). It supports OpenAI, Google Gemini, Anthropic Claude, and OpenRouter models out of the box.


## ✨ Key Features

- 🤖 **LLM-agnostic backend**: Seamlessly use models from OpenAI, Anthropic (Claude), and Gemini (Google) — just set your provider and API key.
- 🤖 **LLM-agnostic backend**: Seamlessly use models from OpenAI, Anthropic (Claude), Gemini (Google), or OpenRouter — just set your provider and API key.
- 🧬 **Automatically annotate cells** including type, state, and confidence fields.
- 🔄 **Consistent annotations** across all samples in your study.
- 🧠 **Infuse prior knowledge** by providing information about your biological system.
@@ -60,6 +60,7 @@ After installation, head over to the LLM provider of your choice to generate an
- OpenAI: [API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)
- Google (Gemini): [API key](https://ai.google.dev/gemini-api/docs/api-key)
- Anthropic (Claude): [API key](https://docs.anthropic.com/en/docs/get-started)
- OpenRouter: [API key](https://openrouter.ai/settings/keys)


🔒 Keep this key private and don't share it with anyone. `CellAnnotator` will try to read the key from an environment variable: either expose it to the environment yourself, or store it in a `.env` file anywhere within the repository where you conduct your analysis and plan to run `CellAnnotator`. The package will then use [dotenv](https://pypi.org/project/python-dotenv/) to export the key from the `.env` file as an environment variable.
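
The `.env` flow described above can be pictured with a small standard-library sketch. This is neither the package's loading code nor python-dotenv's implementation: `load_env_file` is a hypothetical helper that only handles plain `KEY=VALUE` lines, and the key value is a placeholder.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Export KEY=VALUE lines from a file without overriding existing variables.

    Hypothetical, simplified stand-in for what a dotenv loader does.
    """
    loaded = {}
    for raw_line in path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded


# Demo with a throwaway file; the key below is a placeholder, not a real credential.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# provider keys\nOPENROUTER_API_KEY=placeholder-key\n")
    loaded = load_env_file(env_path)
    print(sorted(loaded))
```

Letting already-exported variables win is a design choice made here for illustration; in real use, rely on the package's own dotenv handling rather than a hand-rolled parser.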
@@ -78,6 +79,31 @@ cell_ann = CellAnnotator(

By default, this will store annotations in `adata.obs['cell_type_predicted']`. Head over to our 📚 [tutorials](https://cell-annotator.readthedocs.io/en/latest/notebooks/tutorials/index.html) to see more advanced use cases, and learn how to adapt this to your own data. You can run `CellAnnotator` for just a single sample of data, or across multiple samples. In the latter case, it will attempt to harmonize annotations across samples.

### Advanced provider options

`CellAnnotator` can also be used in single-sample mode by setting `sample_key=None`.

Example:

```python
from cell_annotator import CellAnnotator

cell_ann = CellAnnotator(
    adata=adata,
    species="human",
    tissue="pancreas",
    cluster_key="leiden_1",
    sample_key=None,  # single-sample mode
    provider="openrouter",
    model="openai/gpt-4o-mini",
    api_key="YOUR_OPENROUTER_API_KEY",
)

cell_ann.get_expected_cell_type_markers(n_markers=3)
cell_ann.get_cluster_markers()
cell_ann.annotate_clusters(key_added="cell_type_predicted")
```



## 💸 Costs and models
@@ -89,12 +115,14 @@ CellAnnotator is LLM-agnostic and works with multiple providers:

- **Anthropic Claude:** Claude models are supported. See the [Anthropic pricing page](https://docs.anthropic.com/claude/docs/pricing) for details.

- **OpenRouter:** OpenRouter routes requests to many model families (including OpenAI, Anthropic, and others) behind a single API key. Use `provider="openrouter"` and pass a model slug such as `openai/gpt-4o-mini` or `anthropic/claude-3.5-sonnet`.

You can select your provider and model by setting the appropriate parameters. More providers may be supported in the future as the LLM ecosystem evolves.
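
Because an OpenRouter slug names the upstream model family, it can help to split it before deciding which provider's pricing and privacy terms apply. The snippet below is a minimal sketch; `split_openrouter_slug` is a hypothetical helper, not part of `cell_annotator`, and it assumes the `<provider>/<model>` form shown above.

```python
def split_openrouter_slug(slug: str) -> tuple[str, str]:
    """Split '<provider>/<model>' into (upstream provider, model name).

    Hypothetical helper for illustration only.
    """
    upstream, sep, model = slug.partition("/")
    if not sep or not upstream or not model:
        raise ValueError(f"Expected '<provider>/<model>', got {slug!r}")
    return upstream, model


for slug in ("openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"):
    upstream, model = split_openrouter_slug(slug)
    print(f"{slug}: upstream={upstream}, model={model}")
```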



## 🔐 Data privacy
This package sends cluster marker genes, and the `species` and `tissue` you define, to the selected LLM provider (e.g., OpenAI, Google, or Anthropic). **No actual gene expression values are sent.**
This package sends cluster marker genes, and the `species` and `tissue` you define, to the selected LLM provider (e.g., OpenAI, Google, Anthropic, or OpenRouter routes). **No actual gene expression values are sent.**
Review comment (Member):

Since OpenRouter forwards requests to a chosen upstream provider under OpenRouter's own ToS, the effective attack surface for OpenRouter users is wider than for the other three providers. Could you strengthen this paragraph along the lines of:

> When using OpenRouter, requests are forwarded to the upstream provider implied by your model slug (e.g. `openai/...`, `anthropic/...`). Review both OpenRouter's privacy policy and the upstream provider's. Some OpenRouter model tiers may log prompts by default; users who need privacy guarantees should configure this via their OpenRouter account settings.


Please ensure your usage of this package aligns with your institution's guidelines on data privacy and the use of external AI models. Each provider has its own privacy policy and terms of service. Review these carefully before using CellAnnotator with sensitive or regulated data.

174 changes: 174 additions & 0 deletions docs/notebooks/tutorials/110_openrouter_sample_annotation.ipynb
@@ -0,0 +1,174 @@
{
Review comment (Member):

[blocking] Please rework the tutorial to match the style of the existing tutorials (see `100_heart_atlas.ipynb` and `200_human_bmmcs.ipynb` as references).

Three specific points:

  1. API key handling. The existing tutorials assume the key is loaded from a `.env` file via the dotenv flow the package already supports; they don't require the user to paste it inline. Please drop the `OPENROUTER_API_KEY = ""` / `api_key=OPENROUTER_API_KEY` pattern and let the key come in from the environment like the other tutorials do.
  2. Real downloadable data. Existing tutorials fetch real data via `sc.read(..., backup_url=...)` (e.g. the heart atlas from figshare) so readers can actually run the notebook end-to-end and see real annotations. Requiring users to supply their own `ADATA_PATH` makes this non-executable out of the box. Please reuse one of the datasets from the existing tutorials (e.g. the heart-atlas subsample) so readers see concrete OpenRouter-based annotations on real data.
  3. Narrative and structure. Match the preliminaries → annotation → evaluation structure of `100_heart_atlas.ipynb`. A good target is: a brief intro explaining what OpenRouter is and why someone might pick it, then the same end-to-end flow as the heart-atlas notebook, just with `provider="openrouter"` and an appropriate model slug.

The notebook already gets picked up by the `:glob:` in `tutorials/index.rst`, so once it's reworked it will slot in next to the others without further config changes.
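
For point 1, a preliminaries cell could check that the key is present in the environment instead of asking readers to paste it inline. The following is only a sketch of that idea: `require_api_key` is a hypothetical helper, and the environment variable name and setup URL are taken from the `_api_keys.py` change in this PR.

```python
import os
from typing import Mapping, Optional


def require_api_key(env_var: str, setup_url: str,
                    environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the key stored under env_var, or fail early with a setup hint.

    Hypothetical helper; `environ` defaults to the process environment.
    """
    env = os.environ if environ is None else environ
    key = env.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Create a key at {setup_url} and expose it "
            "through your environment or a .env file."
        )
    return key


# Demo against an in-memory mapping so no real credential is involved.
demo_env = {"OPENROUTER_API_KEY": "placeholder-key"}
key = require_api_key(
    "OPENROUTER_API_KEY", "https://openrouter.ai/settings/keys", environ=demo_env
)
print("key found:", bool(key))
```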

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenRouter sample annotation with Leiden clusters\n",
"\n",
"This tutorial shows how to annotate one or more samples with `CellAnnotator`\n",
"using an OpenRouter model and a user-provided Leiden key."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import scanpy as sc\n",
"from cell_annotator import CellAnnotator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configuration\n",
"\n",
"- `OPENROUTER_API_KEY`: your OpenRouter API key\n",
"- `OPENROUTER_MODEL`: model slug (e.g. `openai/gpt-4o-mini`)\n",
"- `LEIDEN_KEY`: cluster column in `adata.obs`\n",
"- `SAMPLE_KEY`: sample column in `adata.obs`, or `None` for a single sample\n",
"- `ADATA_PATH`: path to your `.h5ad` dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"OPENROUTER_API_KEY = \"\" # e.g. sk-or-v1-...\n",
"OPENROUTER_MODEL = \"openai/gpt-4o-mini\"\n",
"LEIDEN_KEY = \"leiden\"\n",
"ADATA_PATH = \"path/to/your_data.h5ad\"\n",
"\n",
"SPECIES = \"human\"\n",
"TISSUE = \"pancreas\"\n",
"STAGE = \"adult\"\n",
"SAMPLE_KEY = \"sample\" # set to None for single-sample datasets\n",
"\n",
"if not OPENROUTER_API_KEY:\n",
" raise ValueError(\"Set OPENROUTER_API_KEY before continuing.\")\n",
"if not OPENROUTER_MODEL:\n",
" raise ValueError(\"Set OPENROUTER_MODEL before continuing.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adata = sc.read_h5ad(ADATA_PATH)\n",
"\n",
"if LEIDEN_KEY not in adata.obs.columns:\n",
" raise KeyError(f\"Column '{LEIDEN_KEY}' was not found in adata.obs\")\n",
"if SAMPLE_KEY is not None and SAMPLE_KEY not in adata.obs.columns:\n",
" raise KeyError(f\"Column '{SAMPLE_KEY}' was not found in adata.obs\")\n",
"\n",
"print(adata)\n",
"print(\"Leiden key:\", LEIDEN_KEY)\n",
"print(\"Sample key:\", SAMPLE_KEY)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize annotator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_ann = CellAnnotator(\n",
" adata=adata,\n",
" species=SPECIES,\n",
" tissue=TISSUE,\n",
" stage=STAGE,\n",
" cluster_key=LEIDEN_KEY,\n",
" sample_key=SAMPLE_KEY,\n",
" provider=\"openrouter\",\n",
" model=OPENROUTER_MODEL,\n",
" api_key=OPENROUTER_API_KEY,\n",
")\n",
"\n",
"cell_ann"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run annotation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_ann.get_expected_cell_type_markers(n_markers=3)\n",
"cell_ann.get_cluster_markers()\n",
"cell_ann.annotate_clusters(key_added=\"cell_type_predicted\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspect and save results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"adata.obs[[LEIDEN_KEY, \"cell_type_predicted\"]].head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if \"X_umap\" in adata.obsm:\n",
" sc.pl.umap(adata, color=[LEIDEN_KEY, \"cell_type_predicted\"], wspace=0.35)\n",
"else:\n",
" print(\"No UMAP embedding found; skipping plot.\")\n",
"\n",
"output_path = ADATA_PATH.replace(\".h5ad\", \"_annotated.h5ad\")\n",
"adata.write(output_path)\n",
"print(f\"Saved annotated object to: {output_path}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
3 changes: 2 additions & 1 deletion src/cell_annotator/_constants.py
@@ -19,9 +19,10 @@ class PackageConstants:
"openai": "gpt-4o-mini",
"gemini": "gemini-2.5-flash-lite",
"anthropic": "claude-haiku-4-5",
"openrouter": "openai/gpt-4o-mini",
}
# Supported LLM providers
supported_providers: list[str] = ["openai", "gemini", "anthropic"]
supported_providers: list[str] = ["openai", "gemini", "anthropic", "openrouter"]
default_cluster_key: str = "leiden"
cell_type_key: str = "cell_type_harmonized"

9 changes: 9 additions & 0 deletions src/cell_annotator/model/_api_keys.py
@@ -32,6 +32,11 @@ class APIKeyManager:
"setup_url": "https://console.anthropic.com/settings/keys",
"description": "Anthropic Claude models",
},
"openrouter": {
"env_var": "OPENROUTER_API_KEY",
"setup_url": "https://openrouter.ai/settings/keys",
"description": "OpenRouter models (aggregated providers)",
},
}

def __init__(self, auto_load_env: bool = True):
@@ -186,6 +191,10 @@ def validate_model_access(self, model: str) -> tuple[bool, str | None]:
provider = "gemini"
elif any(claude_name in model_lower for claude_name in ["claude", "anthropic"]):
provider = "anthropic"
elif "/" in model and not model_lower.startswith("models/"):
Review comment (Member):

Nit / follow-up: this `"/" in model and not model_lower.startswith("models/")` rule is duplicated in `llm_interface.py:126`. Worth pulling into a small shared helper in a follow-up PR; not blocking here.
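
A shared helper of the kind suggested here might look like the sketch below. The function name `looks_like_openrouter_slug` is hypothetical; the rule itself is copied from the line under review.

```python
def looks_like_openrouter_slug(model: str) -> bool:
    """Return True for '<provider>/<model>' slugs such as 'openai/gpt-4o-mini'.

    Gemini-style IDs that start with 'models/' are excluded, mirroring the
    rule quoted in the review comment. Hypothetical helper name.
    """
    return "/" in model and not model.lower().startswith("models/")


for candidate in ("openai/gpt-4o-mini", "models/gemini-1.5-flash", "gpt-4o-mini"):
    print(candidate, looks_like_openrouter_slug(candidate))
```

One thing the excerpt suggests checking: the Claude name check appears before this rule in `validate_model_access`, so a slug such as `anthropic/claude-3.5-sonnet` would seem to match the Anthropic branch first. Whether that is intended cannot be told from the lines shown.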

# OpenRouter uses '<provider>/<model>' slugs (e.g. 'openai/gpt-4o-mini').
# The 'models/' guard avoids false-matching Gemini IDs like 'models/gemini-1.5-flash'.
provider = "openrouter"
Marius1311 marked this conversation as resolved.
elif any(openai_name in model_lower for openai_name in ["gpt", "o1", "davinci", "curie", "babbage", "ada"]):
provider = "openai"
else: