Merged
24 changes: 24 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
## Summary

Describe the change in 2-5 sentences.

## Behavior Or Invariants Changed

Call out any behavior changes and any repo invariants touched.
If none, say "None".

## Tests Run

List the commands you ran and anything you intentionally did not run.

## Reviewer Focus

Point reviewers to the highest-risk files, code paths, or assumptions.

## Context

Add any domain context, dataset assumptions, or implementation background that is not obvious from the diff.

## Open Questions Or Follow-Ups

List anything intentionally deferred, uncertain, or worth extra scrutiny.
207 changes: 4 additions & 203 deletions .github/copilot-instructions.md
@@ -1,206 +1,7 @@
# Copilot Instructions for CellMapper

## Important Notes
- Avoid drafting summary documents or endless markdown files. Just summarize in chat what you did, why, and any open questions.
- Don't update Jupyter notebooks - those are managed manually.
- When running terminal commands, use `uv run` to execute commands within the project's virtual environment (e.g., `uv run python script.py`).
- **Testing: ALWAYS use `hatch test`, NEVER `uv run pytest` or standalone pytest.** Hatch manages the test matrix (Python versions, dependencies) that CI uses. See "Testing Strategy" section for details.
- Rather than making assumptions, ask for clarification when uncertain.
- **GitHub workflows**: Use GitHub CLI (`gh`) when possible. For GitHub MCP server tools, ensure Docker Desktop is running first (`open -a "Docker Desktop"`).
Canonical repo guidance lives in:
- `AGENTS.md` — architecture, invariants, and validation commands
- `REVIEW_GUIDE.md` — PR review workflow, risk areas, and changed-path test lookup

## Project Overview

**CellMapper** is a k-NN-based tool for mapping cells across representations to transfer labels, embeddings, and expression values. It works for millions of cells, on CPU and GPU, across molecular modalities, between spatial and non-spatial data. The core idea is to separate the method (k-NN graph with kernels) from the application (mapping across arbitrary representations).

### Domain Context (Brief)
- **AnnData**: Standard single-cell data structure. Contains `.X`, `.obs`, `.var`, `.obsm` (embeddings), `.layers`.
- **k-NN mapping**: Compute k-nearest neighbors between query and reference datasets, apply graph kernel to create mapping matrix, use it to transfer labels/embeddings/expression.
- **Joint embeddings**: CellMapper expects pre-computed joint embeddings in `.obsm` from tools like scVI, scANVI, GimVI, ENVI, GLUE, or implements baseline methods (PCA, CCA).
- **Use cases**: Transfer labels from dissociated to spatial data, map embeddings between datasets, compute presence scores in atlases, identify spatial niches, evaluate mapping quality.

### Key Dependencies
- **Core**: anndata, scanpy, numpy, pandas, scipy, scikit-learn
- **k-NN backends**: pynndescent, sklearn, faiss (CPU/GPU), rapids (GPU)
- **Optional**: squidpy (for spatial), scvi-tools, harmony-pytorch (for tutorials)

## Architecture & Code Organization

### Module Structure (follows scverse conventions)
- Use `AnnData` objects as primary data structure
- Type annotations use modern syntax: `str | None` instead of `Optional[str]`
- Supports Python 3.11, 3.12, 3.13 (see `pyproject.toml`)
- Avoid local imports unless necessary for circular import resolution

### Core Components
1. **`src/cellmapper/model/cellmapper.py`**: Main `CellMapper` class with `map()` method
   - Inherits from `EvaluationMixin` and `EmbeddingMixin`
   - Handles both query-to-reference and self-mapping modes
   - Core methods: `map()`, `map_obs()`, `map_obsm()`, `map_layers()`
2. **`src/cellmapper/model/neighbors.py`**: k-NN graph computation with multiple backends
3. **`src/cellmapper/model/kernel.py`**: Graph kernels for creating mapping matrices
4. **`src/cellmapper/model/mapping_operator.py`**: Encapsulates mapping matrix with matrix powers for diffusion
5. **`src/cellmapper/model/evaluate.py`**: Metrics for evaluating label/expression transfer quality
6. **`src/cellmapper/model/embedding.py`**: Baseline joint embedding methods (PCA, CCA)
7. **`src/cellmapper/utils.py`**: Utilities (library size adjustment, imputed data creation)

## Development Workflow

### Environment Management (uv-based)
```bash
# Create/sync virtual environment
uv sync # install project with default dependencies
uv sync --extra test # include test dependencies
uv sync --extra doc # include documentation dependencies
uv sync --all-extras # include all optional dependencies

# Run commands in virtual environment
uv run python script.py # run any Python script
uv run pytest tests/ # run tests directly (alternative to hatch)

# Testing via hatch (recommended, runs test matrix, uses uv internally)
hatch test # test with highest Python version
hatch test --all # full matrix: Python 3.11 and 3.13 (stable deps), 3.13 (pre-release deps)

# Documentation
hatch run docs:build # build Sphinx docs
hatch run docs:open # open in browser
hatch run docs:clean # clean build artifacts

# Environment inspection
hatch env show # list environments
```

### Testing Strategy
- Test matrix defined in `[[tool.hatch.envs.hatch-test.matrix]]` in `pyproject.toml`
- Tests Python 3.11 & 3.13 with stable deps, 3.13 with pre-release deps
- CI extracts test config from pyproject.toml (`.github/workflows/test.yaml`)
- Tests live in `tests/`, fixtures in `tests/conftest.py`
- **Always run tests via `hatch test`**, NOT standalone pytest

### Code Quality Tools
- **Ruff**: Linting and formatting (120 char line length)
- **Biome**: JSON/JSONC formatting with trailing commas
- **Pre-commit**: Auto-runs ruff, biome. Install with `pre-commit install`
- Use `git pull --rebase` if pre-commit.ci commits to your branch

## Documentation Conventions

### Docstring Style (NumPy format via Napoleon)
```python
import pandas as pd


def map_obs(
    self,
    obs_keys: str | list[str],
    *,  # keyword-only marker
    prediction_postfix: str = "_predicted",
    confidence_postfix: str = "_confidence",
) -> pd.DataFrame:
    """Short one-line description.

    Extended description if needed.

    Parameters
    ----------
    obs_keys
        Keys in reference.obs to transfer to query.
    prediction_postfix
        Suffix for predicted column names.
    confidence_postfix
        Suffix for confidence score column names.

    Returns
    -------
    DataFrame with transferred labels and confidence scores.
    """
```

### Sphinx & Documentation
- API docs auto-generated from `docs/api.md` using `autosummary`
- Tutorials in `docs/notebooks/tutorials/` rendered via myst-nb (`.ipynb` only)
- Add external packages to `intersphinx_mapping` in `docs/conf.py`
- See `docs/contributing.md` for detailed documentation guidelines

## Key Configuration Files

### `pyproject.toml`
- **Build**: `hatchling` with `hatch-vcs` for git-based versioning
- **Dependencies**: Minimal runtime deps; optional extras for `[test]`, `[doc]`, `[tutorials]`
- **Ruff**: 120 char line length, NumPy docstring convention
- **Test matrix**: Python 3.11 & 3.13 (stable), 3.13 (pre-release)

### Version Management
- Version from git tags via `hatch-vcs`
- Release: Create GitHub release with tag `vX.X.X`
- Follows **Semantic Versioning**

## Project-Specific Patterns

### Basic Usage Pattern
```python
from cellmapper import CellMapper

# Assume query and reference have joint embedding in .obsm["X_joint"]
cmap = CellMapper(query, reference).map(
    use_rep="X_joint",
    obs_keys="celltype",  # transfer labels
    obsm_keys="X_umap",  # transfer UMAP
    layer_key="counts",  # transfer expression
)

# Self-mapping (for spatial contextualization, denoising)
cmap_self = CellMapper(query).map(
    use_rep="X_pca",
    layer_key="counts",
)
```

### k-NN Backends
- **pynndescent**: Fast approximate k-NN, CPU-only
- **sklearn**: Exact k-NN, CPU-only, slower for large datasets
- **faiss**: Exact/approximate k-NN, supports CPU and GPU (via faiss-gpu)
- **rapids**: GPU-accelerated k-NN using cuML

### Mapping Workflow
1. Compute k-NN graph between query and reference (or self)
2. Apply kernel to k-NN graph to create mapping matrix M
3. Transfer data: `query_data = M @ reference_data`
4. Optionally apply matrix powers `M^t` for diffusion
5. Evaluate transfer quality with metrics
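
Steps 1 to 3 of this workflow can be sketched with plain NumPy. This is only an illustration under stated assumptions, not CellMapper's implementation: the function name `knn_mapping_matrix`, the brute-force neighbor search, and the Gaussian kernel with a mean-distance bandwidth are all hypothetical choices made for the example.

```python
import numpy as np


def knn_mapping_matrix(query_emb, ref_emb, k=3):
    """Dense row-normalized mapping matrix from a brute-force k-NN search (illustrative)."""
    # Step 1: pairwise squared Euclidean distances, shape (n_query, n_ref).
    d2 = ((query_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(axis=-1)
    # Indices and distances of the k nearest reference cells per query cell.
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    nn_d2 = np.take_along_axis(d2, nn_idx, axis=1)

    # Step 2: Gaussian kernel on neighbor distances (an assumed kernel choice).
    sigma2 = nn_d2.mean() + 1e-12
    weights = np.exp(-nn_d2 / sigma2).astype(np.float32)

    mapping = np.zeros(d2.shape, dtype=np.float32)
    np.put_along_axis(mapping, nn_idx, weights, axis=1)
    # Normalize each row so its weights sum to 1.
    mapping /= mapping.sum(axis=1, keepdims=True)
    return mapping


rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(20, 5))        # reference cells in a joint embedding
query_emb = rng.normal(size=(8, 5))       # query cells in the same embedding
ref_labels = rng.integers(0, 3, size=20)  # three made-up cell types

M = knn_mapping_matrix(query_emb, ref_emb, k=3)

# Step 3: query_data = M @ reference_data, here with one-hot encoded labels.
one_hot = np.eye(3, dtype=np.float32)[ref_labels]
label_probs = M @ one_hot
predicted = label_probs.argmax(axis=1)
confidence = label_probs.max(axis=1)
```

Because each row of `M` sums to 1, every row of `label_probs` is a probability vector over labels, which is what makes the row maximum usable as a confidence score.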

### AnnData Conventions
- Check matrix format: `adata.X` may be sparse or dense
- Use `adata.layers[key]` for alternative representations (e.g., counts, log-normalized)
- Joint embeddings stored in `adata.obsm["X_<method>"]`
- Transferred data goes back into query's `.obs`, `.obsm`, `.layers`
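
The sparse-or-dense check can be wrapped in small helpers so downstream code does not branch on the storage format. The helper names `to_dense_float32` and `row_sums` are hypothetical and exist only for this sketch; they are not part of CellMapper or AnnData.

```python
import numpy as np
import scipy.sparse as sp


def to_dense_float32(matrix):
    """Return a dense float32 array for either a sparse or a dense input."""
    if sp.issparse(matrix):
        return matrix.toarray().astype(np.float32)
    return np.asarray(matrix, dtype=np.float32)


def row_sums(matrix):
    """Per-row sums as a flat 1-D array, regardless of storage format."""
    # A sparse .sum(axis=1) can return a 2-D result, so flatten explicitly.
    return np.asarray(matrix.sum(axis=1)).ravel()
```

Densifying is only reasonable for small matrices; for large `adata.X` it is usually better to keep the sparse format and use operations such as `row_sums` that work on both.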

### Testing with AnnData
```python
# From conftest.py - example fixture pattern
import anndata as ad
import numpy as np
import pandas as pd
import pytest
import scanpy as sc


@pytest.fixture
def adata_spatial():
    """Small spatial AnnData object with spatial coordinates."""
    adata = ad.AnnData(
        X=np.random.randn(100, 50).astype(np.float32),
        obs=pd.DataFrame({"celltype": ["A", "B"] * 50}),
        obsm={"spatial": np.random.rand(100, 2)},
    )
    sc.pp.pca(adata)
    return adata
```

## Common Gotchas

1. **Hatch for testing**: Always use `hatch test`, never standalone `pytest`. CI matches hatch test matrix.
2. **Joint embeddings required**: Most use cases require pre-computed joint embedding in `.obsm`. Don't assume PCA is sufficient for complex mappings.
3. **Sparse matrices**: Check `scipy.sparse.issparse(adata.X)` before operations. Mapping matrices are typically dense.
4. **Self-mapping mode**: If `reference` is `None` or same as `query`, automatically enters self-mapping mode.
5. **k-NN backends**: faiss requires `faiss-cpu` or `faiss-gpu`, rapids requires CUDA environment. Handle gracefully with fallbacks.
6. **Pre-commit conflicts**: Use `git pull --rebase` to integrate pre-commit.ci fixes.
7. **Line length**: Ruff set to 120 chars, but keep docstrings readable (~80 chars per line).
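
Gotcha 4 is easy to misread. Assuming "same as `query`" means object identity, as the invariants in `AGENTS.md` state, a copy of the query is not treated as self-mapping. The function `is_self_mapping` below is a hypothetical stand-in for the check, and plain dictionaries stand in for AnnData objects.

```python
def is_self_mapping(query, reference):
    """True when no reference is given or the reference is the very same object."""
    return reference is None or reference is query


query = {"cells": [1, 2, 3]}  # stand-in for an AnnData object
query_copy = dict(query)      # equal content, but a different object

print(is_self_mapping(query, None))        # no reference given
print(is_self_mapping(query, query))       # same object
print(is_self_mapping(query, query_copy))  # equal but distinct object
```

Under this reading, passing `query.copy()` as the reference would run the query-to-reference code path even though the data is identical.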

## Related Resources

- **Contributing guide**: `docs/contributing.md`
- **Tutorials**: `docs/notebooks/tutorials/`
- **scanpy docs**: https://scanpy.readthedocs.io/
- **faiss docs**: https://github.com/facebookresearch/faiss
- **squidpy docs**: https://squidpy.readthedocs.io/ (for spatial analysis)
If this file conflicts with `AGENTS.md` or `REVIEW_GUIDE.md`, those guides win.
71 changes: 71 additions & 0 deletions AGENTS.md
@@ -0,0 +1,71 @@
# AGENTS.md — CellMapper

CellMapper is a Python package for k-NN-based mapping of cells across
representations to transfer labels, embeddings, and expression values. It works
for millions of cells, on CPU and GPU, across molecular modalities, and between
spatial and non-spatial data. The core idea is to separate the method (k-NN
graph + kernel → mapping matrix) from the application (transfer across arbitrary
representations).
Key frameworks: AnnData/Scanpy, numpy/scipy, scikit-learn, pynndescent, faiss,
RAPIDS cuML.

## Trust Order

When sources disagree:
1. PR description and changed code
2. This file (`AGENTS.md`)
3. `REVIEW_GUIDE.md`
4. Tests and fixtures
5. Public docs in `docs/` and `README.md`

Every fact should have one owner. This file owns invariants and the reference
table below — everything else is delegated.

## Where To Find What

| Topic | Source of truth |
|-------|----------------|
| User-facing overview, use cases, quickstart | `README.md` |
| Public API reference | `docs/api.md` and autosummary under `docs/generated/` |
| Tutorials (query→reference, spatial mapping, spatial smoothing, data denoising) | `docs/notebooks/tutorials/` |
| Contributor setup, environments, docs build | `docs/contributing.md` |
| Release notes | `docs/changelog.md` |
| PR review workflow and risk areas | `REVIEW_GUIDE.md` |
| Test fixtures | `tests/conftest.py`, `tests/data/` |
| Kernel taxonomy and tunable thresholds (sklearn warning cutoff, spectral threshold) | `src/cellmapper/constants.py` |
| Optional-dependency gating | `src/cellmapper/check.py` |
| Method-level behavior (parameters, return shapes) | docstrings in `src/cellmapper/model/` |

## Critical Invariants

- **Self-mapping mode** activates when `reference is None` **or** `reference is query` (object identity). See `CellMapper.__init__` in `src/cellmapper/model/cellmapper.py`.
- **Reference is read-only.** `.map()` never mutates the reference AnnData. `query` is mutated in place for `map_obs` / `map_obsm`. Expression transfer produces a separate `query_imputed` AnnData object, not a view.
- **Output key naming** follows `{key}{prediction_postfix}` and `{key}{confidence_postfix}` in `query.obs` / `query.obsm`. Postfixes are user-controllable on the per-method entrypoints (`map_obs`, `map_obsm`); `.map()` also exposes `prediction_postfix`.
- **`.map()` auto-chains** `compute_neighbors` → `compute_mapping_matrix` → `map_obs/obsm/layers` based on missing state. Callers that use these methods directly must respect that ordering.
- **Mapping matrix is row-stochastic and float32.** Sparse inputs are stored as `scipy.sparse.csr_matrix`; dense inputs stay dense. Zero-neighbor rows are left as-is. See `MappingOperator._validate_and_normalize_mapping_matrix`.
- **Matrix powers `t > 1` are self-mapping-only** (`MappingOperator._validate_power` raises otherwise).
- **Optional k-NN backends fail fast.** `check.check_deps()` is called at backend construction with clear install hints — no silent fallback. Supported backends: `sklearn`, `pynndescent`, `faiss-cpu`, `faiss-gpu`, `rapids`.
- **Kernel taxonomy lives in `constants.py`** (`JACCARD_BASED_KERNELS`, `CONNECTIVITY_BASED_KERNELS`, `SELF_MAPPING_ONLY_KERNELS`). Kernels in `SELF_MAPPING_ONLY_KERNELS` require a square neighbor matrix.
- **`Neighbors` strips self-edges** from storage for square matrices; `n_neighbors` counts non-self neighbors.
- **`query_imputed` is always assembled via `utils.create_imputed_anndata`**. Setter accepts `AnnData | ndarray | csr_matrix | DataFrame | None`; result has `obs`/`obsm` from query and `var`/`varm` from reference.
- **Public API surface = `__init__.py` `__all__`**: `CellMapper`, `Kernel`, `Neighbors`, `logger`. `EvaluationMixin` and `EmbeddingMixin` are internal. Do not re-export helpers from the top-level package.
- **Tests mirror source layout.** `src/cellmapper/X.py` → `tests/test_X.py`; `src/cellmapper/model/X.py` → `tests/model/test_X.py`.
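
The mapping-matrix invariant above (row-stochastic, float32, CSR for sparse inputs, zero-neighbor rows left as-is) can be illustrated with a short sketch. This is not the code in `MappingOperator._validate_and_normalize_mapping_matrix`; the function name `row_normalize` and its structure are assumptions made for the example.

```python
import numpy as np
import scipy.sparse as sp


def row_normalize(matrix):
    """Row-normalize to float32; rows whose sum is zero are left unchanged."""
    if sp.issparse(matrix):
        csr = sp.csr_matrix(matrix, dtype=np.float32)
        sums = np.asarray(csr.sum(axis=1)).ravel()
        # Scale factor 1 for empty rows, 1 / row_sum everywhere else.
        scale = np.ones_like(sums)
        nonzero = sums != 0
        scale[nonzero] = 1.0 / sums[nonzero]
        return sp.csr_matrix(sp.diags(scale) @ csr, dtype=np.float32)
    dense = np.asarray(matrix, dtype=np.float32)
    sums = dense.sum(axis=1, keepdims=True)
    # Divide only where the row sum is non-zero; other rows keep their values.
    return np.divide(dense, sums, out=dense.copy(), where=sums != 0)
```

A reviewer can use a check of this shape (row sums equal to 1 except for empty rows, dtype `float32`, format preserved) as a quick sanity test when a PR touches normalization.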

## Development Commands

Tests run on Python 3.11 and 3.14 (see the `hatch-test` matrix in `pyproject.toml`).

```bash
hatch test # run tests (highest Python)
hatch test --all # full matrix
hatch run docs:build # build Sphinx docs
pre-commit run --all-files # lint and format
```

Focused runs (with `uv`):

```bash
uv run pytest tests/model
uv run pytest tests/model/test_mapping_operator.py
uv run pytest tests/test_utils.py
```
5 changes: 5 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,5 @@
# CellMapper Agent Entry Point

@AGENTS.md

Use `REVIEW_GUIDE.md` for automated PR review workflow, risk areas, and changed-path test lookup.
74 changes: 74 additions & 0 deletions REVIEW_GUIDE.md
@@ -0,0 +1,74 @@
# CellMapper Review Guide

Agent-neutral PR review playbook. Written for **review agents running on GitHub** — use the imperative voice.

**Scope: review only.** Produce comments and suggestions. Do **not** push commits, modify files, or apply fixes. Flag issues and suggest diffs in comments; leave the edits to the author.

Architecture, invariants, and commands live in `AGENTS.md`. Do not restate them here — link.

## Workflow

1. Read the PR body.
2. Check CI (`gh pr checks <num>`, `gh run view <run-id> --log-failed`) and investigate failures before commenting.
3. Map changed paths to tests (see below) and check whether the change touches an invariant from `AGENTS.md`.
4. Prioritize behavioral regressions, numerical correctness, and public-contract changes over style.

## High-Risk Areas

Pointers only — see `AGENTS.md` for the actual invariants.

- **Mapping-matrix construction** (`model/mapping_operator.py`): normalization, dtype, sparsity handling. Silent shifts possible without test failures.
- **Matrix powers / diffusion** (`model/mapping_operator.py`, `_validate_power`, `_apply_iterative`, `_apply_spectral`): self-mapping gate, iterative-vs-spectral behavior.
- **Self-mapping detection** (`model/cellmapper.py::CellMapper.__init__`): identity check drives all downstream mode-dependent logic.
- **Kernel taxonomy** (`constants.py`, `model/kernel.py`, `model/neighbors.py`): new kernels must land in the right set in `constants.py`.
- **k-NN backend gating** (`model/_knn_backend.py`, `check.py`): optional deps must route through `check.check_deps()`.
- **AnnData output contract** (`model/cellmapper.py::map_obs / map_obsm / map_layers`): key naming, what gets written where, `query_imputed` construction via `utils.create_imputed_anndata`.
- **Public API surface** (`src/cellmapper/__init__.py`): new re-exports commit the project to an API.
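
For the k-NN backend gating item, the fail-fast behavior reviewers should look for can be sketched as follows. This is a hypothetical illustration, not the signature or body of `check.check_deps()`: the name `require_backend`, the module names, and the install hints in the table are assumptions.

```python
from importlib.util import find_spec

# Hypothetical backend -> (import name, install hint) table; not CellMapper's actual one.
BACKEND_REQUIREMENTS = {
    "sklearn": ("sklearn", "pip install scikit-learn"),
    "pynndescent": ("pynndescent", "pip install pynndescent"),
    "faiss-cpu": ("faiss", "pip install faiss-cpu"),
    "faiss-gpu": ("faiss", "pip install faiss-gpu"),
    "rapids": ("cuml", "see the RAPIDS installation guide"),
}


def require_backend(backend):
    """Raise immediately if the backend's module is missing; never fall back."""
    if backend not in BACKEND_REQUIREMENTS:
        raise ValueError(
            f"Unknown backend {backend!r}; expected one of {sorted(BACKEND_REQUIREMENTS)}"
        )
    module, hint = BACKEND_REQUIREMENTS[backend]
    if find_spec(module) is None:
        raise ImportError(f"Backend {backend!r} needs the {module!r} module ({hint}).")
```

The point to verify in review is that a missing optional dependency raises at backend construction with an install hint, instead of silently switching to another backend.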

## Changed-Path Test Lookup

Tests mirror the source tree.

| Changed path | Primary tests |
|--------------|---------------|
| `src/cellmapper/model/cellmapper.py` | `tests/model/test_query_to_reference_mapping.py`, `tests/model/test_self_mapping.py` |
| `src/cellmapper/model/kernel.py` | `tests/model/test_kernel.py` |
| `src/cellmapper/model/mapping_operator.py` | `tests/model/test_mapping_operator.py` |
| `src/cellmapper/model/neighbors.py` | `tests/model/test_neighbors.py` |
| `src/cellmapper/model/embedding.py` | `tests/model/test_embedding.py` |
| `src/cellmapper/model/evaluate.py` | `tests/model/test_evaluate.py` |
| `src/cellmapper/model/_knn_backend.py` | `tests/model/test_neighbors.py`, `tests/model/test_kernel.py` |
| `src/cellmapper/check.py` | `tests/test_check.py` |
| `src/cellmapper/utils.py` | `tests/test_utils.py` |
| End-to-end behavioral change | also `tests/test_basic.py` |
| Fixture changes | `tests/conftest.py`, `tests/data/` |
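
The lookup can also be expressed as a small function: the two multi-file rows of the table are handled as explicit overrides, and every other path falls back to the mirror rule from `AGENTS.md`. The names `OVERRIDES` and `primary_tests` are hypothetical, and the fallback only proposes a candidate path; it does not check that the test file exists.

```python
from pathlib import PurePosixPath

# Exceptions copied from the table above; everything else follows the mirror rule.
OVERRIDES = {
    "src/cellmapper/model/cellmapper.py": [
        "tests/model/test_query_to_reference_mapping.py",
        "tests/model/test_self_mapping.py",
    ],
    "src/cellmapper/model/_knn_backend.py": [
        "tests/model/test_neighbors.py",
        "tests/model/test_kernel.py",
    ],
}


def primary_tests(changed_path):
    """Return candidate primary test files for a changed source path."""
    if changed_path in OVERRIDES:
        return OVERRIDES[changed_path]
    path = PurePosixPath(changed_path)
    root = PurePosixPath("src/cellmapper")
    if path.suffix != ".py" or root not in path.parents:
        return []  # not a Python source file under src/cellmapper
    relative = path.relative_to(root)
    mirrored = PurePosixPath("tests") / relative.with_name(f"test_{relative.name}")
    return [str(mirrored)]


for changed in ["src/cellmapper/utils.py", "src/cellmapper/model/kernel.py", "README.md"]:
    print(changed, "->", primary_tests(changed))
```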

## Testing

- **New code** should be covered. Reuse fixtures from `tests/conftest.py`; prefer `pytest.mark.parametrize`; favor few meaningful tests over many redundant ones.
- **Failing CI** is not to be waved through. Distinguish critical regressions from flakes; escalate critical ones.
- **Modified tests** — scrutinize *how*. Relaxed tolerances, removed assertions, deleted cases, or loosened matrices are red flags. Require explicit justification in the PR body.

## Documentation Impact

Behavior or API changes often touch docs in multiple places. Point at the **owning file** (see the `AGENTS.md` "Where To Find What" table) — don't duplicate content in the review.

- Public symbol / API changes → `docs/api.md`, autosummary, `README.md` quickstart, source docstrings.
- Contributor workflow or env changes → `docs/contributing.md`.
- Tutorials under `docs/notebooks/tutorials/` → flag stale imports, outputs, or API usage.
- Invariants / commands → `AGENTS.md`.
- Review workflow / risk areas / test lookup → this file.
- `CLAUDE.md` and `.github/copilot-instructions.md` should stay thin pointers: flag any PR that re-adds content to either file.

## Checklist

- Invariants in `AGENTS.md` preserved?
- CI green (or failures investigated)?
- Test coverage adequate and not silently weakened?
- Public contracts (AnnData output, mapping matrix format, public API surface) unchanged — or explicitly called out in the PR body?
- Affected human- and agent-facing docs updated?
- PR scope tight, no unrelated bundling?

## PR Metadata

This repo uses `.github/PULL_REQUEST_TEMPLATE.md`. Treat its sections as the preferred summary surface.