diff --git a/.gitignore b/.gitignore index c961afc..b2a5a88 100644 --- a/.gitignore +++ b/.gitignore @@ -28,3 +28,5 @@ slides/lightning-talk/index_files/ docs/notebooks/data/ dev-docs/plans/ + +issue-drafts/ diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index d700a7f..318d808 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -55,7 +55,7 @@ src/lazycogs/ 7. Calls `compute_output_grid()` to get the output affine transform and dimensions (width, height). No eager coordinate arrays are produced. 8. Creates a single `MultiBandStacBackendArray` (a dataclass) with shape `(band, time, y, x)` holding all the parameters needed to materialise any chunk later, then wraps it in one `xarray.core.indexing.LazilyIndexedArray`. This avoids `xr.concat` (used internally by `ds.to_array()`), which would eagerly load `LazilyIndexedArray`-backed objects. 9. Uses `rasterix.RasterIndex` for spatial indexing, but materialises the x/y coordinate variables eagerly as numpy arrays so chunked scalar spatial selections compute reliably. -10. Constructs the `xr.DataArray` directly from the 4-D variable. If `chunks` is provided, calls `.chunk(chunks)` to convert to a dask-backed array; otherwise the `LazilyIndexedArray` remains in play so narrow slices (e.g. a single pixel) translate to minimal I/O. When output nodata is known, the returned array sets `da.attrs["_FillValue"]` and `da.encoding["_FillValue"]` for downstream serialization. When unknown, no `_FillValue` metadata is attached. +10. Constructs the `xr.DataArray` directly from the 4-D variable. If `chunks` is provided, calls `.chunk(chunks)` to convert to a dask-backed array; otherwise the `LazilyIndexedArray` remains in play so narrow slices (e.g. a single pixel) translate to minimal I/O. Spatial metadata is serialized for both GeoZarr-style consumers and GDAL/rioxarray consumers: `spatial:transform` stays in affine coefficient order, while `spatial_ref.attrs["GeoTransform"]` uses GDAL geotransform order. When output nodata is known, the returned array sets `da.attrs["_FillValue"]` without duplicating it in `da.encoding`, which keeps rioxarray export paths compatible with xarray CF encoding. When unknown, no `_FillValue` metadata is attached. 11. Keeps lazy runtime state on the backing array rather than in `da.attrs`. This lets xarray operations such as `sortby()` and deep copies clone metadata safely without trying to pickle live objects like `DuckdbClient`. ## Explain: dry-run read estimator diff --git a/README.md b/README.md index 4ffc11c..83ae8ed 100644 --- a/README.md +++ b/README.md @@ -118,11 +118,11 @@ asks you to pass `dtype=` explicitly. When you omit `nodata=`: - if sampled bands all agree on one scalar nodata sentinel, the returned - `DataArray` sets `attrs["_FillValue"]` and `encoding["_FillValue"]`, and - masked mosaic output materializes with that same sentinel instead of zero + `DataArray` sets `attrs["_FillValue"]`, and masked mosaic output materializes + with that same sentinel instead of zero - if sampled bands disagree, `open()` raises `ValueError` and asks you to pass `nodata=` explicitly -- if sampled bands have no nodata sentinel, no `_FillValue` encoding is +- if sampled bands have no nodata sentinel, no `_FillValue` metadata is attached and `0` remains only an implementation fill value for uncovered regions - if later chunk reads encounter a conflicting source nodata value, compute @@ -131,6 +131,12 @@ When you omit `nodata=`: Explicit `dtype=` and `nodata=` stay authoritative even when source assets are heterogeneous. +`lazycogs.open()` also attaches CF/rioxarray-compatible spatial metadata. The +GeoZarr-style `spatial:transform` attribute stays in affine coefficient order, +while `spatial_ref.attrs["GeoTransform"]` is written in GDAL geotransform order +so sliced 2D images and 3D band stacks can be read by rioxarray without repairing +the transform metadata. + Float-only mosaic methods such as `MeanMethod`, `MedianMethod`, and `StdevMethod` auto-promote inferred integer outputs to `float32`. If you pass an explicit integer `dtype=` with one of those methods, `open()` raises and diff --git a/docs/guides/dtype-nodata.md b/docs/guides/dtype-nodata.md index d28d9da..2ed6759 100644 --- a/docs/guides/dtype-nodata.md +++ b/docs/guides/dtype-nodata.md @@ -40,10 +40,12 @@ When you omit `nodata=`: - if sampled bands all have `nodata=None`, lazycogs leaves output nodata unknown - if sampled bands disagree, `open()` raises and asks you to pass `nodata=` explicitly -When nodata is known, the returned `DataArray` advertises it in both: +When nodata is known, the returned `DataArray` advertises it with: - `da.attrs["_FillValue"]` -- `da.encoding["_FillValue"]` + +lazycogs intentionally does not duplicate `_FillValue` into `da.encoding` because +that collides with xarray's CF encoding step during rioxarray exports. ## What output nodata means @@ -102,7 +104,7 @@ If a later asset conflicts with the inferred output contract, compute raises ins ```python da.dtype -da.encoding.get("_FillValue") +da.attrs.get("_FillValue") ``` Those two values tell you most of what lazycogs has promised about the returned array. diff --git a/docs/notebooks/rioxarray.ipynb b/docs/notebooks/rioxarray.ipynb new file mode 100644 index 0000000..2b31e62 --- /dev/null +++ b/docs/notebooks/rioxarray.ipynb @@ -0,0 +1,1525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9f730340-3855-4312-bd39-0f62fdb7b3ff", + "metadata": {}, + "source": [ + "# rioxarray interoperability\n", + "\n", + "lazycogs generates DataArrays that are interoperable with rioxarray. This notebook demonstrates how you can use rioxarray's `to_raster` function to export a 3D lazycogs array `(band, y, x)` to a COG." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a498a820-ad12-43d2-8346-22934802f526", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import rioxarray # noqa: F401 - registers the .rio accessor\n", + "import rustac\n", + "from pyproj import Transformer\n", + "\n", + "import lazycogs\n", + "\n", + "dst_crs = \"epsg:5070\"\n", + "dst_bbox = (250_000, 2_600_000, 350_000, 2_700_000)\n", + "\n", + "# transform to epsg:4326 for STAC search\n", + "transformer = Transformer.from_crs(dst_crs, \"epsg:4326\", always_xy=True)\n", + "bbox_4326 = transformer.transform_bounds(*dst_bbox)\n", + "\n", + "PARQUET = \"data/midwest_summer_2025.parquet\"\n", + "\n", + "if not Path(PARQUET).exists():\n", + " Path(PARQUET).parent.mkdir(exist_ok=True)\n", + " await rustac.search_to(\n", + " PARQUET,\n", + " href=\"https://earth-search.aws.element84.com/v1\",\n", + " collections=[\"sentinel-2-c1-l2a\"],\n", + " datetime=\"2025-06-01/2025-09-30\",\n", + " bbox=bbox_4326,\n", + " limit=100,\n", + " )\n", + "\n", + "print(f\"Using {PARQUET}\")\n", + "\n", + "store = lazycogs.store_for(PARQUET, skip_signature=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "5a78fdb0-e346-4f6e-a771-81b7dca89fdb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<xarray.DataArray (band: 3, time: 54, y: 1000, x: 1000)> Size: 324MB\n",
+ "[162000000 values with dtype=int16]\n",
+ "Coordinates:\n",
+ " * band (band) <U5 60B 'red' 'green' 'blue'\n",
+ " * time (time) datetime64[s] 432B 2025-06-02 2025-06-04 ... 2025-09-27\n",
+ " * y (y) float64 8kB 2.7e+06 2.7e+06 2.7e+06 ... 2.6e+06 2.6e+06\n",
+ " * x (x) float64 8kB 2.5e+05 2.502e+05 ... 3.498e+05 3.5e+05\n",
+ " spatial_ref int64 8B 0\n",
+ "Indexes:\n",
+ " ┌ x RasterIndex (crs=EPSG:5070)\n",
+ " └ y\n",
+ "Attributes:\n",
+ " grid_mapping: spatial_ref\n",
+ " zarr_conventions: [{'schema_url': 'https://raw.githubusercontent.c...\n",
+ " spatial:dimensions: ['y', 'x']\n",
+ " spatial:bbox: (250000, 2600000, 350000, 2700000)\n",
+ " spatial:transform_type: affine\n",
+ " spatial:transform: [100.0, 0.0, 250000.0, 0.0, -100.0, 2700000.0]\n",
+ " spatial:shape: [1000, 1000]\n",
+ " spatial:registration: pixel\n",
+ " proj:code: EPSG:5070\n",
+ " _FillValue: 0.0<xarray.DataArray (band: 3, y: 1000, x: 1000)> Size: 6MB\n",
+ "[3000000 values with dtype=int16]\n",
+ "Coordinates:\n",
+ " * band (band) <U5 60B 'red' 'green' 'blue'\n",
+ " * y (y) float64 8kB 2.7e+06 2.7e+06 2.7e+06 ... 2.6e+06 2.6e+06\n",
+ " * x (x) float64 8kB 2.5e+05 2.502e+05 ... 3.498e+05 3.5e+05\n",
+ " time datetime64[s] 8B 2025-06-04\n",
+ " spatial_ref int64 8B 0\n",
+ "Indexes:\n",
+ " ┌ x RasterIndex (crs=EPSG:5070)\n",
+ " └ y\n",
+ "Attributes:\n",
+ " grid_mapping: spatial_ref\n",
+ " zarr_conventions: [{'schema_url': 'https://raw.githubusercontent.c...\n",
+ " spatial:dimensions: ['y', 'x']\n",
+ " spatial:bbox: (250000, 2600000, 350000, 2700000)\n",
+ " spatial:transform_type: affine\n",
+ " spatial:transform: [100.0, 0.0, 250000.0, 0.0, -100.0, 2700000.0]\n",
+ " spatial:shape: [1000, 1000]\n",
+ " spatial:registration: pixel\n",
+ " proj:code: EPSG:5070\n",
+ " _FillValue: 0.0"
+ ],
+ "text/plain": [
+ "