2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -16,7 +16,7 @@ repos:
hooks:
- id: pyproject-fmt
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.13.2
rev: v0.14.14
hooks:
- id: ruff-check
types_or: [python, pyi, jupyter]
28 changes: 16 additions & 12 deletions README.md
@@ -26,11 +26,12 @@

# annbatch

> [!CAUTION]
> This package does not have a stable API.
> However, we do not anticipate the on-disk format to change in a fully incompatible manner.
> Small changes to how we store the shuffled data may occur but you should always be able to load your data somehow i.e., they will never be fully breaking.
> We will always provide lower-level APIs that should make this guarantee possible.
```{warning}
This package does not have a stable API.
However, we do not anticipate the on-disk format to change in a fully incompatible manner.
Small changes to how we store the shuffled data may occur but you should always be able to load your data somehow i.e., they will never be fully breaking.
We will always provide lower-level APIs that should make this guarantee possible.
```

[![Tests][badge-tests]][tests]
[![Documentation][badge-docs]][documentation]
@@ -61,8 +62,9 @@ pip install annbatch

We provide extras for `torch`, `cupy-cuda12`, `cupy-cuda13`, and [zarrs-python][].
`cupy` provides accelerated handling of the data via `preload_to_gpu` once it has been read off disk and does not need to be used in conjunction with `torch`.
> [!IMPORTANT]
> [zarrs-python][] gives the necessary performance boost for the sharded data produced by our preprocessing functions to be useful when loading data off a local filesystem.
```{important}
[zarrs-python][] gives the necessary performance boost for the sharded data produced by our preprocessing functions to be useful when loading data off a local filesystem.
```

## Detailed tutorial

@@ -97,8 +99,9 @@ collection.add_adatas(

Data loading:

> [!IMPORTANT]
> Without custom loading via `Loader.load_adata` *all* obs columns will be loaded and yielded potentially degrading performance.
```{important}
Without custom loading via `Loader.load_adata` *all* obs columns will be loaded and yielded potentially degrading performance.
```

```python
from pathlib import Path
@@ -136,9 +139,10 @@ for batch in ds:
data, obs = batch["X"], batch["obs"]
```

> [!IMPORTANT]
> For usage of our loader inside of `torch`, please see [this note](https://annbatch.readthedocs.io/en/latest/#user-configurable-sampling-strategy) for more info.
> At the minimum, be aware that deadlocking will occur on linux unless you pass `multiprocessing_context="spawn"` to the `torch.utils.data.DataLoader` class.
```{important}
For usage of our loader inside of `torch`, please see [this note](https://annbatch.readthedocs.io/en/latest/#user-configurable-sampling-strategy) for more info.
At the minimum, be aware that deadlocking will occur on linux unless you pass `multiprocessing_context="spawn"` to the `torch.utils.data.DataLoader` class.
```
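
A minimal sketch of the `DataLoader` setting mentioned in the note above (illustrative only, not part of this change). `ToyBatches` and `make_loader` are hypothetical names; `ToyBatches` merely stands in for an annbatch loader, and passing `batch_size=None` assumes the wrapped dataset already yields whole batches.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class ToyBatches(IterableDataset):
    """Hypothetical stand-in for an annbatch loader: yields ready-made batches."""

    def __iter__(self):
        # Two batches of four elements each.
        for start in range(0, 8, 4):
            yield torch.arange(start, start + 4)


def make_loader(dataset):
    # batch_size=None turns off automatic batching (assumption: the dataset
    # already yields batches).
    # multiprocessing_context="spawn" is the setting the note asks for on Linux.
    return DataLoader(
        dataset,
        batch_size=None,
        num_workers=1,
        multiprocessing_context="spawn",
    )


loader = make_loader(ToyBatches())
```

Constructing the `DataLoader` does not start any workers; they start on iteration. With the `spawn` start method the main module is re-imported in each worker, so in a script the iteration should sit under an `if __name__ == "__main__":` guard.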

<!--FOOTER-->

2 changes: 1 addition & 1 deletion docs/index.md
@@ -115,8 +115,8 @@ However, once you have too much data to fit into memory, for whatever reason, th

api.md
zarr-configuration.md
notebooks/index
changelog.md
contributing.md
references.md
notebooks/index
```
30 changes: 8 additions & 22 deletions docs/notebooks/example.ipynb
@@ -19,35 +19,26 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"# !pip install annbatch[zarrs, torch]"
"# !pip install annbatch[zarrs,torch]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"zsh:1: command not found: wget\n",
"zsh:1: command not found: wget\n"
]
}
],
"outputs": [],
"source": [
"# Download two example datasets from CELLxGENE\n",
"!wget https://datasets.cellxgene.cziscience.com/866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\n",
@@ -133,7 +124,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {
"tags": [
"hide-output"
@@ -171,7 +162,9 @@
"import anndata as ad\n",
"from annbatch import DatasetCollection\n",
"\n",
"# let's write out only shared colunms - otherwise DatasetCollection will warn about all the columns we are missing for good reason - mismatched columns can lead to unexpected data and missing values.\n",
"# write out only shared colunms\n",
"# otherwise DatasetCollection will warn about all the columns we are missing for good reason\n",
"# mismatched columns can lead to unexpected data and missing values.\n",
"shared_columns = ad.experimental.read_lazy(\"866d7d5e-436b-4dbd-b7c1-7696487d452e.h5ad\").obs.columns.intersection(\n",
" ad.experimental.read_lazy(\"f81463b8-4986-4904-a0ea-20ff02cbb317.h5ad\").obs.columns\n",
")\n",
@@ -370,13 +363,6 @@
" load_adata=read_lazy_x_and_obs_only,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
2 changes: 1 addition & 1 deletion docs/notebooks/index.md
@@ -1,4 +1,4 @@
# Notebooks
# Tutorials

```{toctree}
:hidden: false